Geolocation Predictor

This is the code for predicting the geolocation of tweets, trained on token frequencies, using Decision Tree and Naïve Bayes classifiers.

Implementation

Feature selection

In util/preprocessing/merge.py,

  • feature_filter drops single-character features such as [a, b, ..., n]
  • merge intuitively merges similar features such as [aha, ahah, ..., ahahahaha] and [taco, tacos]; see the sketch after this list
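
The real feature_filter and merge live in util/preprocessing/merge.py; the sketch below only illustrates the two ideas. The prefix-based similarity test and the example inputs are assumptions made for illustration, not the repository's actual logic.

# Minimal sketch of the feature-selection ideas above (illustrative only).

def feature_filter(features):
    # Drop single-character features such as "a", "b", ..., "n".
    return [f for f in features if len(f) > 1]

def merge(features):
    # Group features that share a stem-like prefix,
    # e.g. ["aha", "ahah", ..., "ahahahaha"] or ["taco", "tacos"].
    merged = {}
    for feature in sorted(features, key=len):
        for stem in merged:
            if feature.startswith(stem):  # crude similarity test (assumption)
                merged[stem].append(feature)
                break
        else:
            merged[feature] = [feature]
    return merged

print(feature_filter(["a", "n", "aha", "taco"]))             # ['aha', 'taco']
print(merge(["aha", "ahah", "ahahahaha", "taco", "tacos"]))
# {'aha': ['aha', 'ahah', 'ahahahaha'], 'taco': ['taco', 'tacos']}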

Classifier Combination

In preprocess/merge.py,

Instance manipulation

In util/train.py,

  • complement_nb uses bagging to generate multiple training datasets.
  • complement_nb also uses 42-fold cross-validation to generate multiple training datasets; see the sketch below.
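
A minimal sketch of those two kinds of instance manipulation with scikit-learn, assuming a token-frequency matrix X and location labels y. The placeholder data, the number of bagged estimators and the random seed are assumptions; the actual complement_nb in util/train.py may differ.

# Illustrative sketch only.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import ComplementNB

# Placeholder token counts and labels, for illustration only.
X = np.random.randint(0, 5, size=(210, 50))
y = np.array(["California", "NewYork", "Georgia"] * 70)

# Bagging: resample the training data to build several ComplementNB models
# and combine their votes.
bagged_nb = BaggingClassifier(ComplementNB(), n_estimators=10, random_state=42)
bagged_nb.fit(X, y)

# 42-fold cross-validation: split the data into 42 train/validation partitions
# and score a ComplementNB model on each of them.
scores = cross_val_score(ComplementNB(), X, y, cv=42)
print(scores.mean())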

Algorithm manipulation

In util/train.py,

  • complement_nb also uses GridSearchCV to generate multiple classifiers and selects the best one based on accuracy; see the sketch below.
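
A minimal sketch of this step with GridSearchCV, again on placeholder data; the searched parameters (alpha, norm) and the 5-fold inner CV are assumptions for illustration, not necessarily the grid used in util/train.py.

# Illustrative sketch only.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import ComplementNB

# Placeholder token counts and labels, for illustration only.
X = np.random.randint(0, 5, size=(210, 50))
y = np.array(["California", "NewYork", "Georgia"] * 70)

# Fit one ComplementNB per parameter combination and keep the most accurate one.
param_grid = {"alpha": [0.1, 0.5, 1.0], "norm": [True, False]}
search = GridSearchCV(ComplementNB(), param_grid, scoring="accuracy", cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
best_nb = search.best_estimator_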

Dataset

Requirements

  • Python 3+
pip install -r requirements.txt  

Usage

Note: the code removes old models and results on every run. MAKE SURE you have saved any models you want to keep.

Train

python run.py -t datasets/train-best200.csv datasets/dev-best200.csv  

The output looks like this:

INFO:root:[*] Merging datasets/train-best200.csv   
 42%|████████         | 1006/2396 [00:05<00:20, 92.03 users/s]  
...  
...  
[*] Saved models/0.8126_2019-10-02_20:02  
[*] Accuracy: 0.8125955095803455  
             precision    recall   f_score
California   0.618944  0.835128  0.710966
NewYork      0.899371  0.854647  0.876439
Georgia      0.788070  0.622080  0.695305
weighted     0.827448  0.812596  0.814974

Predict

python run.py -p models/ datasets/dev-best200.csv   

The output looks like this:

...  
INFO:root:[*] Saved results/final_results.csv  
INFO:root:[*] Time costs in seconds:  
          Time_cost
Predict      11.98s

Score

python run.py -s results/final_results.csv  datasets/dev-best200.csv  

The output looks like this:

[*] Accuracy: 0.8224697308099213  
             precision    recall   f_score
California   0.653035  0.852199  0.739441
NewYork      0.747993  0.647940  0.694381
Georgia      0.909456  0.858296  0.883136
weighted     0.833854  0.822470  0.824577
INFO:root:[*] Time costs in seconds:  
        Time_cost
Score       1.48s
  

Train & Predict & Score

python run.py \
 -t datasets/train-best200.csv datasets/dev-best200.csv \
 -p models/ datasets/dev-best200.csv \
 -s results/final_results.csv datasets/dev-best200.csv

Help

python run.py -h  

Used libraries

  • sklearn for Complement Naive Bayes, feature selectors and other learning tools.
  • pandas and numpy for handling data.
  • tqdm for showing loop progress.
  • joblib for dumping/loading objects to/from disk.
  • nltk for capturing word types for the purpose of feature filtering.

License

See LICENSE file.
