This repository implements a gender classification algorithm based on website traffic using ML and DL techniques. The project aims to estimate a user's gender (m/f) from web traffic data comprising the user_id, path_id, and timestamp of website calls. The algorithm is developed on a dataset of approximately 2.5M website calls from over 13.5k distinct users.
Figure 1: Data visualizations of gender-specific call behaviour in the TrainVal dataset over time of day (left) and website path (right).
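The time-of-day view in Figure 1 can be approximated with a small aggregation over the raw calls. The sketch below is illustrative only: it assumes timestamps are Unix seconds in UTC (the actual format in the csv-files may differ), and `gender_of`, `hour_of_day`, and `calls_per_hour_by_gender` are hypothetical names, not part of this repository.

```python
from collections import Counter
from datetime import datetime, timezone


def hour_of_day(timestamp):
    """Return the hour (0-23) of a call. Assumes Unix seconds in UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).hour


def calls_per_hour_by_gender(calls, gender_of):
    """Count calls per (gender, hour) pair.

    calls: iterable of (user_id, path_id, timestamp) tuples.
    gender_of: hypothetical dict mapping user_id to "m" or "f".
    """
    counts = Counter()
    for user_id, _path_id, timestamp in calls:
        counts[(gender_of[user_id], hour_of_day(timestamp))] += 1
    return counts


# Toy data, not taken from the real dataset.
calls = [(1, 10, 0), (1, 11, 3600), (2, 10, 3700), (2, 12, 86399)]
gender_of = {1: "f", 2: "m"}
hourly = calls_per_hour_by_gender(calls, gender_of)
```

The same pattern works for the path view by keying on `path_id` instead of the hour.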
Environments:
- Python 3.8
- PyTorch 2.0.1
Install the package:
`pip install -r requirements.txt`
- Download train and test data here
- Move csv-files to `data/`
- Run visualization.ipynb to create a split of train and validation data
- Run `main.py`: `python main.py`
- Modify the model and training parameters via command line flags
- Possible parameter flags are provided by running `python main.py --help`
This section presents the results of the classifier models on the validation set. Despite achieving the highest accuracy on the training data (89%), the Random Forest Classifier has the lowest validation accuracy at 83%. The SGD classifier and the neural network both reach a validation accuracy of 86%. These figures, however, apply when each website call is classified individually (see Figure 3, left). When user voting is applied on top of the per-call predictions, the validation accuracy of every classifier rises to 100% (see Figure 3, right).
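One plausible reading of user voting is a majority vote over all per-call predictions that belong to the same user, with the winning label then assigned to every call of that user. The sketch below illustrates that idea under this assumption; the function names `vote_per_user` and `broadcast_to_calls` are hypothetical, and the repository's actual voting rule may differ (for example, it might average class probabilities instead).

```python
from collections import Counter, defaultdict


def vote_per_user(user_ids, call_predictions):
    """Majority vote over the per-call predictions of each user.

    Ties go to the label that was seen first for that user.
    """
    votes = defaultdict(Counter)
    for user_id, prediction in zip(user_ids, call_predictions):
        votes[user_id][prediction] += 1
    return {uid: counter.most_common(1)[0][0] for uid, counter in votes.items()}


def broadcast_to_calls(user_ids, user_labels):
    """Assign each call the voted label of its user."""
    return [user_labels[uid] for uid in user_ids]


# Toy example: user 1 has one misclassified call that the vote corrects.
user_ids = [1, 1, 1, 2, 2]
call_predictions = ["f", "m", "f", "m", "m"]
user_labels = vote_per_user(user_ids, call_predictions)
voted_predictions = broadcast_to_calls(user_ids, user_labels)
```

With roughly 2.5M calls spread over about 13.5k users, each user contributes on the order of 185 calls on average, so a per-call accuracy of 86% leaves a wide margin for the majority to be correct, provided the per-call errors are not strongly correlated within a user.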
Figure 2: Loss (left) and accuracy (right) of neural network training.
| Model | Train Accuracy | Val Accuracy | Val Accuracy (w/ user voting) |
|---|---|---|---|
| Random Forest Classifier | 0.89 | 0.83 | 1.00 |
| SGD Classifier | 0.86 | 0.86 | 1.00 |
| Neural Network | 0.86 | 0.86 | 1.00 |
Figure 3: Precision-recall curve of the SGD classifier without user voting (left) and with user voting (right).