# BoostARoota

A Fast XGBoost Feature Selection Algorithm (plus other sklearn tree-based classifiers)

## Why Create Another Algorithm?

Automated approaches like Boruta showed early promise, as they were able to provide superior performance with Random Forests, but they have some deficiencies, including slow computation time, especially with high-dimensional data. Run time aside, Boruta does perform well with Random Forests, but it performs poorly with other algorithms such as boosting or neural networks. Similar deficiencies occur with regularization in LASSO, elastic net, or ridge regressions: they perform well on linear models but poorly with other modern algorithms.

I am proposing and demonstrating a feature selection algorithm (called BoostARoota) in a similar spirit to Boruta utilizing XGBoost as the base model rather than a Random Forest. The algorithm runs in a fraction of the time it takes Boruta and has superior performance on a variety of datasets. While the spirit is similar to Boruta, BoostARoota takes a slightly different approach for the removal of attributes that executes much faster.

## Installation

The easiest way is to use pip:

```
$ pip install boostaroota
```

## Usage

This module is built for use in a similar manner to sklearn, with fit(), transform(), etc. The package requires X to be one-hot encoded (OHE), so the pandas function pd.get_dummies(X) may be helpful, as it determines which variables are categorical and converts them into dummy variables. The package relies on pandas under the hood, so data must be passed in as a pandas DataFrame.

Assuming you have X and Y split, you can run the following:

```python
from boostaroota import BoostARoota
import pandas as pd

# OHE the variables - BoostARoota may break if this is not done
x = pd.get_dummies(x)

# Specify the evaluation metric: use whichever you like as long as it is recognized by XGBoost
#   EXCEPTION: multi-class currently only supports "mlogloss", so that must be passed in as eval_metric
br = BoostARoota(metric='logloss')

# Fit the model for the subset of variables
br.fit(x, y)

# Look at the important variables - will return a pandas series
br.keep_vars_

# Then modify the dataframe to only include the important variables
br.transform(x)
```

It's really that simple! Of course, as we build more functionality there may be a few more steps. Keep in mind that, because you are one-hot encoding, if you have a numeric variable that is imported by python as a character, pd.get_dummies() will convert it into many columns. This can cause your DataFrame to explode in size, giving unexpected results and high run times.
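To illustrate that pitfall outside of BoostARoota, here is a minimal sketch (the column names are hypothetical) showing how a numeric column read in as strings blows up under pd.get_dummies(), and how converting it first avoids that:

```python
import pandas as pd

# Hypothetical frame: "age" was read in as strings, so pandas treats it as categorical
x = pd.DataFrame({
    "age": ["23", "31", "23", "44"],          # numeric values stored as characters
    "color": ["red", "blue", "red", "green"]  # genuinely categorical
})

# One-hot encoding the raw frame creates one dummy column per distinct "age" string
print(pd.get_dummies(x).shape)  # (4, 6): 3 "age" dummies + 3 "color" dummies

# Convert character columns that are really numeric before one-hot encoding
x["age"] = pd.to_numeric(x["age"])
print(pd.get_dummies(x).shape)  # (4, 4): "age" stays a single numeric column
```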

### New as of 1/22/2018: you can insert any sklearn tree-based learner into BoostARoota

Please be aware that this hasn't been fully tested to determine which parameters (cutoff, iterations, etc.) are optimal. Currently, that will require some trial and error on the user's part.

For example, to use another classifier, you initialize that object and then pass it into the BoostARoota object like so:

```python
from sklearn.ensemble import ExtraTreesClassifier
clf = ExtraTreesClassifier()

br = BoostARoota(clf=clf)
new_train = br.fit_transform(x, y)
```

You can also view a complete demo here.
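For a rough idea of how the pieces fit together in a typical workflow, here is a hedged, minimal sketch; the file name and the "target" column are hypothetical, and the data are assumed to need one-hot encoding:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from boostaroota import BoostARoota

# Hypothetical dataset: "target" is the label, everything else is a feature
df = pd.read_csv("my_data.csv")
y = df["target"]
x = pd.get_dummies(df.drop(columns="target"))  # one-hot encode the features

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)

# Select features on the training data, then apply the same selection to the test data
br = BoostARoota(metric="logloss")
x_train_sel = br.fit_transform(x_train, y_train)
x_test_sel = br.transform(x_test)

print(br.keep_vars_)  # the variables BoostARoota decided to keep
```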

## Usage - Choosing Parameters

The default parameters are chosen to work well across a wide range of input dataframes, but there are cases where other values will work better (see the worked example after the list below).

  • clf [default=None] - optional, recommended to leave empty
    • Will default to xgboost if left empty
    • For use with any tree based learner from sklearn.
      • The default parameters are not optimal and will require user experimentation.
  • cutoff [default=4] - float (cutoff > 0)
    • Adjustment to removal cutoff from the feature importances
      • Larger values will be more conservative - if the value is set too high, only a small number of features may end up being removed.
      • Smaller values will be more aggressive; any value above zero is allowed (it can be a float).
  • iters [default=10] - int (iters > 0)
    • The number of iterations to average for the feature importances
      • While it will run with iters=1, you don't want to set it that low, as there is quite a bit of random variation in the importances.
      • Smaller values run faster because XGBoost is run fewer times.
      • Run time scales linearly: iters=4 takes 2x the time of iters=2 and 4x the time of iters=1.
  • max_rounds [default=100] - int (max_rounds > 0)
    • The number of times the core BoostARoota algorithm will run. Each round eliminates more and more features
      • Default is set high enough that it really shouldn't be reached under normal circumstances
      • You would want to set this value low if you felt that variables were being removed too aggressively.
  • delta [default=0.1] - float (0 < delta <= 1)
    • Stopping criteria for whether another round is started
      • Regardless of this value, will not progress past max_rounds
      • A value of 0.1 means that at least 10% of the features must be removed in order to move onto the next round
      • Setting higher values makes it harder to move on to follow-on rounds (e.g. setting it to 1 guarantees only one round).
      • Setting delta too low may result in eliminating too many features; the number of rounds is still capped by max_rounds.
  • silent [default=False] - boolean
    • Set to True if you don't want the BoostARoota output printed. Any errors or warnings will still be shown.
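To make the knobs above concrete, here is a hedged sketch of a non-default configuration; the specific values are arbitrary illustrations, not recommendations:

```python
from sklearn.ensemble import ExtraTreesClassifier
from boostaroota import BoostARoota

# Arbitrary, illustrative settings - expect some trial and error on real data
br = BoostARoota(
    clf=ExtraTreesClassifier(n_estimators=200),  # any sklearn tree-based learner
    cutoff=2,       # lower cutoff -> more aggressive feature removal
    iters=20,       # more iterations -> more stable importance averages, longer run time
    max_rounds=5,   # cap on the number of elimination rounds
    delta=0.2,      # require at least 20% of features removed to start another round
    silent=True,    # suppress BoostARoota's progress output
)
new_train = br.fit_transform(x, y)
```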

## How it works