robot-detection

One of my clients would like to detect robot traffic on an e-commerce website.

Problem Definition

For any online company, web traffic is a vital source of information. The logs our servers generate provide the raw material to answer a wide array of questions about the business: think of analysing the logs to optimise sites and pages based on user behaviour, or to detect and mitigate imminent threats. Depending on the analysis, we are interested in either traffic generated by real people or bot traffic: non-human traffic (NHT). The aim is to identify the type of traffic coming onto our servers.

Data Overview

  • Robots account for ~70% of traffic
  • Traffic from California, USA accounts for ~70% of traffic
  • Most traffic (>80%) is anonymized
  • Data are only available for hours 20-22 of a single day, and we don't know when each session ends, so time features (or seasonality) are unlikely to play an important role here
  • Since more than 95% of traffic comes from Browsers and Robots, we can simplify the multi-class classification problem into a binary classification of Robots vs. non-Robots. Any multi-class problem can be decomposed into multiple binary classifications anyway, via a one-vs-rest strategy
  • The Robot/non-Robot classes are imbalanced (~0.6/1), so this is an imbalanced binary classification problem
  • There appear to be 3 consistent features: country, region, and visitor recognition type. In fact, country and region can be combined into a single feature, e.g. place (see the sketch below)
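
A minimal sketch of that combination, assuming hypothetical country/region column names from the log schema:

```python
# Hypothetical sketch: merge country and region into one "place" feature.
# The column names are assumptions about the log schema.
import pandas as pd

df = pd.DataFrame({
    "country": ["US", "US", "NL"],
    "region": ["CA", "NY", "NH"],
})

# e.g. "US_CA", matching the US_CA example used later in this README
df["place"] = df["country"] + "_" + df["region"]
print(df)
```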

Approach

Figure: approach flowchart

Simple Feature Engineering

First, we visualize the frequency of the top keywords extracted from the URLs.

Figure: keyword frequency

  • We note some distinction in product keywords between Robot and non-Robot traffic: Robot traffic focuses on general, book, or language topics, while non-Robot traffic centers around ajax or sports (a counting sketch follows below)
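
A minimal sketch of the keyword counting, with hypothetical URLs and a naive tokenizer (real logs would need their own parsing rules):

```python
# Hypothetical sketch: count keyword frequency in URL paths.
# The URL structure and tokenization rules below are assumptions.
import re
from collections import Counter

urls = [
    "https://shop.example.com/books/language-learning",
    "https://shop.example.com/sports/ajax-shirt",
]

tokens = []
for url in urls:
    path = re.sub(r"^https?://[^/]+", "", url)       # drop scheme and host
    tokens += [t for t in re.split(r"[/\-_?=&.]+", path.lower()) if t]

print(Counter(tokens).most_common(10))
```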

We can approximately extract a product-category feature from the URL by topic modeling with Latent Dirichlet Allocation (LDA), dividing the URLs into 10 product topics. The assumption is that bots/robots are only interested in certain specific products.
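
A minimal sketch of this LDA step with scikit-learn, assuming the URLs have already been tokenized into keyword strings (the corpus below is hypothetical):

```python
# Sketch: derive a product_topic feature from URL keywords via LDA.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical tokenized URLs, one string per traffic record
corpus = [
    "books language general",
    "sports ajax shirt",
    "books history",
    "language course general",
    "sports shoes",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(corpus)                # document-term matrix

lda = LatentDirichletAllocation(n_components=10, random_state=42)
doc_topics = lda.fit_transform(dtm)                   # per-URL topic mixture

# Use the dominant topic as the new product_topic feature
product_topic = doc_topics.argmax(axis=1)
print(product_topic)
```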

Figure: topic modeling

  • Each bubble represents a topic. The area of each bubble is proportional to the number of words in the dictionary that belong to that topic. The bubbles are plotted using a multidimensional scaling algorithm (dimensionality reduction) based on the words they comprise, so topics that are closer together have more words in common.
  • Blue bars represent the overall frequency of each term across the entire corpus. Saliency is a metric of topic identity: higher saliency values indicate that a word is more useful for identifying a specific topic.
  • Red bars estimate the frequency of the term within the selected topic
  • Relevance highlights terms that are not only frequent within a specific topic but also distinctive to it relative to their general frequency in the corpus (the standard definitions are given below)
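
For reference, saliency (Chuang et al., 2012) and relevance (Sievert & Shirley, 2014) are commonly defined as:

```math
\text{saliency}(w) = p(w) \sum_{t} p(t \mid w) \log \frac{p(t \mid w)}{p(t)},
\qquad
\text{relevance}(w \mid t, \lambda) = \lambda \log p(w \mid t) + (1 - \lambda) \log \frac{p(w \mid t)}{p(w)}
```

where λ trades off the in-topic probability of a term against its lift over the corpus-wide frequency.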

Now, with the new feature (product_topic), we can check Cramér's V-based association and entropy analysis (a sketch of the Cramér's V computation follows the list below):

Figure: feature association

  • Obviously, the location features are the most predictive here for distinguishing Robots from non-Robots
  • product_topic has the highest entropy (least predictive), though this also suggests there is room for improvement in the text processing and topic modeling
  • place and the robot label are clearly highly associated; for example, if traffic comes from US_CA, it is highly likely to be Robot traffic
  • There is no strong evidence of an association indicating that Robots only target specific products
  • Since scikit-learn's random forest does not handle categorical strings, we encode categories like visitor_recognition_type or place by their traffic frequency
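
A minimal sketch of the Cramér's V computation, using hypothetical column values:

```python
# Sketch: Cramér's V between two categorical variables, from a chi-squared test.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V between two categorical series (0 = independent, 1 = perfect)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, c = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, c) - 1))))

# Hypothetical example: association between place and the robot label
place = pd.Series(["US_CA", "US_CA", "US_NY", "DE_BE"])
robot = pd.Series([1, 1, 0, 0])
print(cramers_v(place, robot))
```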

Modeling

For simplicity, the data were split into train/test sets before modeling, assuming all columns/features remain the same in the test set. However, note some corner cases that can happen in reality (a frequency-encoding sketch that tolerates them follows this list):

  • New countries/regions appear in the test set that were unavailable in the training set
  • New product categories appear in the test set
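
A sketch of the frequency encoding mentioned above that tolerates these corner cases by mapping unseen categories to a frequency of 0 (column values are hypothetical):

```python
# Sketch: frequency-encode a categorical column, learned on train only,
# with a fallback of 0.0 for categories first seen in the test set.
import pandas as pd

def frequency_encode(train: pd.Series, test: pd.Series):
    freq = train.value_counts(normalize=True)         # learned on train only
    return train.map(freq), test.map(freq).fillna(0.0)

train_place = pd.Series(["US_CA", "US_CA", "US_NY"])
test_place = pd.Series(["US_CA", "DE_BE"])            # DE_BE is unseen
print(frequency_encode(train_place, test_place)[1])   # -> [0.667, 0.0]
```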

We'll compare two models: KNN vs. Random Forest.
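
A minimal comparison sketch on synthetic data (the real pipeline would use the frequency-encoded features from above):

```python
# Sketch: fit both classifiers and compare a headline metric on held-out data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded traffic features, with class imbalance
X, y = make_classification(n_samples=2000, weights=[0.4, 0.6], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

for name, model in [
    ("KNN (k=5)", KNeighborsClassifier(n_neighbors=5)),
    ("Random Forest", RandomForestClassifier(random_state=42)),
]:
    proba = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(f"{name}: ROC AUC = {roc_auc_score(y_te, proba):.3f}")
```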

KNN Classifier

Figure: KNN KPIs

  • KNN (k = 5) produces discrete probability values, as it essentially returns the fraction of the k nearest neighbors in each class.
  • In most cases the predicted probabilities are driven by the class labels of the k nearest neighbors, and there is often a clear majority class among them.
  • Although the areas under the precision-recall and ROC curves and the KS statistic are high, the KS score of 0.96 is the maximum distance between the two classes' prediction distributions, reached at a threshold of 0.6 (around 60% of the sample along the x-axis). This separation holds for only up to ~60% of the data and decreases immediately afterwards (a KS sketch follows this list).
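
A self-contained sketch of the KS statistic on predicted scores (the score distributions here are synthetic stand-ins):

```python
# Sketch: KS statistic = maximum distance between the cumulative score
# distributions of the two true classes.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
scores_robot = rng.beta(5, 2, size=1000)   # hypothetical P(robot) for robots
scores_human = rng.beta(2, 5, size=1000)   # hypothetical P(robot) for humans

ks = ks_2samp(scores_robot, scores_human)
print(f"KS statistic = {ks.statistic:.3f}")
```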

Figure: KNN lift, gain, and calibration

  • The model (both Class 0 and Class 1) significantly outperforms the baseline lift.
  • Class 1 shows better performance than Class 0, reaching maximum gain faster.
  • In calibration, the model deviates from the perfectly calibrated line, showing some overconfidence in its predictions (a calibration-curve sketch follows this list).
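
A minimal sketch of the calibration check with scikit-learn, using synthetic, deliberately miscalibrated scores:

```python
# Sketch: compare mean predicted probability against observed positive
# fraction per bin; deviations from the diagonal indicate miscalibration.
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.uniform(size=1000)
# Toy labels whose true positive rate is lower than the predicted score
y_true = (rng.uniform(size=1000) < y_prob**2).astype(int)

frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
for m, f in zip(mean_pred, frac_pos):
    print(f"predicted {m:.2f} -> observed {f:.2f}")
```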

Random Forest Classifier

Figure: Random Forest KPIs

  • With a lower weighted log loss, the random forest makes more confident and accurate predictions, with a wider range of predicted probabilities
  • The maximum separation distance between the two classes is 0.96, reached at ~30% of the sample and covering ~70% of the data; it also performs better on the bottom 40% of the data compared with kNN. Reaching the maximum KS statistic earlier than kNN suggests it separates the classes more efficiently.

Figure: Random Forest lift, gain, and calibration

  • Similar to KNN, the model (both Class 0 and Class 1) significantly outperforms the baseline.
  • Class 1 shows better performance than Class 0, reaching maximum gain faster.
  • In calibration, the model deviates from perfect calibration, showing some overconfidence in its predictions.

In general, Random Forest shows slightly better overall performance, particularly in terms of calibration and prediction confidence. It appears to handle the class imbalance more effectively, making it the preferable choice for this specific classification task.

Figure: Random Forest feature importance

  • As expected, the most important predictive feature is the traffic's location (country_region). We can form a suspicion of Robot vs. non-Robot traffic largely from its location alone
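
A minimal sketch of extracting these impurity-based importances (synthetic data; the feature names echo the ones used in this README):

```python
# Sketch: read impurity-based feature importances off a fitted forest.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X, y)

feature_names = ["country_region", "visitor_recognition_type", "product_topic"]
print(pd.Series(model.feature_importances_, index=feature_names)
        .sort_values(ascending=False))
```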

Summary and Conclusions


  • The Random Forest model captures the class differences over a larger portion of the dataset: by aggregating predictions from multiple trees, it can capture broader, more generalized patterns. One could argue that Random Forest is better at distinguishing between the predicted and true distributions across a broader range, while kNN is sensitive to outliers/noise and local patterns
  • kNN likely suffers considerably when the class imbalance is more severe
  • One-hot encoding makes the data sparse, which can hurt distance-based algorithms like kNN (the curse of dimensionality)
  • There is clearly room for improvement in the product topic modeling; alternatively, we could maintain a proper category field in the ad database