osrs-hiscores

A quantitative analysis of the Old School Runescape hiscores.

This repository contributes the following:

Code for web scraping the OSRS hiscores, along with the resulting dataset.
Code for a machine learning pipeline which clusters the player population by account similarity.
An interactive web application for visualizing player results.

The dataset consists of the following files:

player-stats.csv: Skill levels in all 23 skills for the top 2 million OSRS accounts.
cluster-centroids.csv: Central values for the clusters that emerge from partitioning the player dataset into groups based on account similarity. Each centroid is a vector of values between 1 and 99 in "OSRS skill" space.
player-clusters.csv: Cluster IDs per player for three separate clustering runs, grouping similar accounts by looking at (i) all skills, (ii) combat skills only and (iii) non-combat skills only.
player-stats-raw.csv: Rank, level, xp, clues, minigame and boss stats for the top 2 million OSRS players. This file is the raw output from the scraping process (1.7 GB).

These files are not checked in to the repo due to file size constraints. They can be downloaded separately from Google Drive: https://bit.ly/osrs-hiscores-dataset

Player stats were collected from the official OSRS hiscores over a 24-hour period on July 21, 2022.

Project organization

├── LICENSE
├── Makefile         <- Top-level Makefile for building and running project.
├── README.md        <- The top-level README for developers using this project.
│
├── app              <- Application code and assets.
├── bin              <- Utility executables.
│
├── data
│   ├── final        <- The final, canonical data set.
│   ├── interim      <- Intermediate data that has been transformed.
│   └── raw          <- The original, immutable data dump.
│
├── ref              <- Reference files used in data processing.
├── scripts          <- Scripts for the stages of the data processing pipeline.
│
├── src
│   ├── analysis     <- Data science and analytics.
│   └── scrape       <- Scraping hiscores data.
│
├── test             <- Unit tests.
│
├── Procfile         <- Entry point for deployment as a Heroku application.
├── requirements.txt <- Dependencies file for reproducing the project environment.
├── runapp.py        <- Main script for Dash application.
└── setup.py         <- Setup file for installing this project through pip.

Usage

At a high level, this repository implements a data science pipeline:

scrape OSRS hiscores data
         ↓
cluster players by stats
         ↓
project clusters to 3D
         ↓
build application data

along with a Dash application for visualizing the results.

The stages of the data pipeline are driven by a Makefile with top-level make targets for each processing stage:

make init: set up project environment and install dependencies.
make scrape: scrape data from the official OSRS hiscores and transform into a cleaned dataset.
make cluster: cluster players into groups of similar accounts according to their stats. Uses k-means as the clustering algorithm, implemented by the faiss library.
make postprocess: project the cluster centroids from high-dimensional space to 3D for visualization purposes (UMAP is the algorithm used for dimensionality reduction). Compute quartiles for each cluster based on the player population it contains.
make build-app: build application data and database using all previous analytic results. This target will launch a MongoDB instance inside a Docker container at the URL localhost:27017 (by default).

Steps 2 and 3 (make scrape and make cluster) can, and should, be skipped by running make download-dataset, which fetches the scraped data and clustering results from an S3 bucket. This requires an AWS account with credentials located in the ~/.aws directory.

To launch the application, run make run-app and visit the URL localhost:8050 in a web browser.

The final application can be built and run in one shot via make app, which uses downloaded data rather than scraping and clustering the data from scratch. The target make all is what was used to build the final results for this repo. If scraping data, note that high usage of the hiscores API may result in your IP being blocked. Please be sparing and respectful of Jagex's server resources in your usage of this code.

Run make help to see more top-level targets.

Configuration

The application is configured through the following environment variables:

OSRS_APPDATA_URI: path to application data .pkl file (S3 or local)
OSRS_MONGO_URI: URL at which MongoDB instance is running
OSRS_MONGO_COLL: store/retrieve player data from collection with this name

There are also environment variables defining filenames at each stage of the data pipeline.

Defaults for all environment variables are defined in .env.default and imported whenever a make target is run. If a file called .env exists, any settings there will override those in .env.default.
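The override rule described above can be sketched in Python. This is a minimal illustration, not the project's own loading code (the source only says the files are imported when a make target runs): the helper names `parse_env` and `load_settings` are hypothetical, and both files are assumed to contain plain KEY=VALUE lines.

```python
from pathlib import Path


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def load_settings(default_path=".env.default", override_path=".env"):
    """Read the defaults first, then let values from the override file win."""
    settings = {}
    for path in (default_path, override_path):
        p = Path(path)
        if p.exists():  # .env is optional; a missing file is simply skipped
            settings.update(parse_env(p.read_text()))
    return settings
```

Calling `load_settings()` from the repository root would return the defaults with any `.env` entries layered on top, for example a locally changed `OSRS_MONGO_URI`.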

Dependencies

Python 3.9 or greater
Docker
AWS account with credentials installed in the ~/.aws directory

Methods

Data were scraped for the top 2 million players on the OSRS hiscores. The data consist of xp, rank, and level in each OSRS skill and overall, along with rank and score stats for clue scrolls, minigames and bosses.
Account data were deduplicated, sorted and subsampled to keep skill level columns only. After deduplication, 1,999,625 records remained. Each record is a length-23 vector giving an account's levels in the 23 OSRS skills.
Accounts were segmented into 2000 clusters based on similarity of skills for three different sets of feature columns, or dataset 'splits':
- all: all 23 OSRS skills
- cb: the 7 combat skills
- noncb: the 16 non-combat skills
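The three splits amount to choosing different feature columns from the same player record. The sketch below is illustrative only: the lowercase skill names used as keys are an assumption (the source does not give the CSV column names), and the assignment of skills to each split is inferred from the 7 and 16 counts stated above. `select_features` is a hypothetical helper.

```python
COMBAT_SKILLS = [
    "attack", "defence", "strength", "hitpoints", "ranged", "prayer", "magic",
]
NONCOMBAT_SKILLS = [
    "cooking", "woodcutting", "fletching", "fishing", "firemaking", "crafting",
    "smithing", "mining", "herblore", "agility", "thieving", "slayer",
    "farming", "runecraft", "hunter", "construction",
]

# Feature columns for each dataset split (assumed naming).
SPLITS = {
    "all": COMBAT_SKILLS + NONCOMBAT_SKILLS,
    "cb": COMBAT_SKILLS,
    "noncb": NONCOMBAT_SKILLS,
}


def select_features(player, split):
    """Return the player's levels for the skills in the given split.

    `player` is a dict mapping skill name to level.
    """
    return [player[skill] for skill in SPLITS[split]]
```

For a full 23-skill record, `select_features(player, "cb")` yields a length-7 vector and `select_features(player, "noncb")` a length-16 vector.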
For each split of the dataset, clustering produced a set of 2000 cluster centroids (with dimensionality 23, 7 or 16) and a cluster ID associated with each player. Clustering was performed with a standard implementation of k-means using the L2 distance.
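The clustering step can be sketched in plain Python as Lloyd's algorithm with squared L2 distance. This is an illustration only: the project uses the faiss implementation of k-means (see Usage), and the function names, the iteration count and the random initialization below are assumptions made for the sketch.

```python
import random


def squared_l2(a, b):
    """Squared Euclidean (L2) distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: squared_l2(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, n_iter=20, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid updates."""
    rng = random.Random(seed)
    # Initialize the centroids from k randomly chosen points.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment so the labels match the returned centroids.
    return centroids, assign(points, centroids)
```

On the real dataset this would correspond to k=2000 on the 23-, 7- or 16-dimensional level vectors. A pure-Python loop would be far too slow at that scale, so the sketch is only meant to show what the centroids and cluster IDs represent.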
Cluster centroids were projected from their ambient dimensionality to 3D space using UMAP. UMAP parameters n_neighbors=10 and min_dist=0.25 were used for splits all and noncb; n_neighbors=20 and min_dist=0.25 were used for split cb.
Quartiles (the 0th, 25th, 50th, 75th and 100th percentiles) in each skill were computed by aggregating the accounts belonging to each cluster.
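A per-cluster quartile computation for a single skill might look like the sketch below. It is not the project's code: `cluster_quartiles` is a hypothetical helper, and linear interpolation between data points (the "inclusive" method of `statistics.quantiles`) is an assumed convention that the source does not specify.

```python
from collections import defaultdict
from statistics import quantiles


def cluster_quartiles(levels, cluster_ids):
    """Return {cluster_id: [min, q25, q50, q75, max]} for one skill.

    `levels[i]` is player i's level in the skill and `cluster_ids[i]`
    is the cluster that player was assigned to.
    """
    groups = defaultdict(list)
    for level, cid in zip(levels, cluster_ids):
        groups[cid].append(level)

    result = {}
    for cid, values in groups.items():
        if len(values) == 1:
            # A single-member cluster has the same value at every percentile.
            result[cid] = [values[0]] * 5
        else:
            q25, q50, q75 = quantiles(values, n=4, method="inclusive")
            result[cid] = [min(values), q25, q50, q75, max(values)]
    return result
```

Repeating this for every skill column gives one five-number summary per skill per cluster.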
The clustering results were assembled into a serialized data file. Player stats were written to a database to provide quick result lookups. The final application makes use of these two resources.

Project ideas

Here are some ideas for data science projects.

Run the same analysis on the OSRS Ironman hiscores.
Create a method for identifying bot clusters within the dataset.
See how well you can predict one unknown skill given all other skills for an account. Is it easier for some skills than others, and can this be explained in terms of the game meta?
Perform hierarchical clustering to identify super-clusters or search for fine-grained structure within clusters. Annotating these clusters would be a step toward a true taxonomy of OSRS accounts.
Create a reverse lookup tool which, given a username, finds other accounts with similar stats.
