diff --git a/docs/api/feature_elimination.md b/docs/api/feature_elimination.md
index cdad230..b3a88a3 100644
--- a/docs/api/feature_elimination.md
+++ b/docs/api/feature_elimination.md
@@ -2,7 +2,6 @@
 This module focuses on feature elimination and it contains two classes:
 
-- [ShapRFECV][probatus.feature_elimination.feature_elimination.ShapRFECV]: Perform Backwards Recursive Feature Elimination, using SHAP feature importance. It supports binary classification models and hyperparameter optimization at every feature elimination step.
-- [EarlyStoppingShapRFECV][probatus.feature_elimination.feature_elimination.EarlyStoppingShapRFECV]: adds support to early stopping of the model fitting process. It can be an alternative regularization technique to hyperparameter optimization of the number of base trees in gradient boosted tree models. Particularly useful when dealing with large datasets.
+- [ShapRFECV][probatus.feature_elimination.feature_elimination.ShapRFECV]: Perform Backwards Recursive Feature Elimination, using SHAP feature importance. It supports binary classification and regression models, as well as hyperparameter optimization at every feature elimination step. For LightGBM, XGBoost and CatBoost it also supports early stopping of the model fitting process, which can serve as an alternative regularization technique to hyperparameter optimization of the number of base trees in gradient boosted tree models and is particularly useful when dealing with large datasets.
 
 ::: probatus.feature_elimination.feature_elimination
diff --git a/docs/tutorials/nb_shap_feature_elimination.ipynb b/docs/tutorials/nb_shap_feature_elimination.ipynb
index 3d99472..73d9bdd 100644
--- a/docs/tutorials/nb_shap_feature_elimination.ipynb
+++ b/docs/tutorials/nb_shap_feature_elimination.ipynb
@@ -32,7 +32,7 @@
    "\n",
    "- Removing lowest [SHAP](https://shap.readthedocs.io/en/latest/) importance feature does not always translate to choosing the feature with the lowest impact on a model's performance. Shap importance illustrates how strongly a given feature affects the output of the model, while disregarding correctness of this prediction.\n",
    "- Currently, the functionality only supports tree-based & linear binary classifiers, in the future the scope might be extended.\n",
-    "- For large datasets, performing hyperparameter optimization can be very computationally expensive. For gradient boosted tree models, one alternative is to use early stopping of the training step. For this, see [EarlyStoppingShapRFECV](#EarlyStoppingShapRFECV)\n",
+    "- For large datasets, performing hyperparameter optimization can be very computationally expensive. For gradient boosted tree models, one alternative is to use early stopping of the training step. For this, use the `early_stopping_rounds` and `eval_metric` parameters.\n",
    "\n",
    "## Setup the dataset\n",
    "\n",
@@ -11232,13 +11232,13 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## EarlyStoppingShapRFECV\n",
+    "## Early stopping with ShapRFECV\n",
    "\n",
    "[Early stopping](https://en.wikipedia.org/wiki/Early_stopping) is a type of regularization, common in [gradient boosted trees](https://en.wikipedia.org/wiki/Gradient_boosting#Gradient_tree_boosting). Supported packages are: [LightGBM](https://lightgbm.readthedocs.io/en/latest/index.html), [XGBoost](https://xgboost.readthedocs.io/en/latest/index.html) and [CatBoost](https://catboost.ai/en/docs/). It consists of measuring how well the model performs after each base learner is added to the ensemble tree, using a relevant scoring metric. If this metric does not improve after a certain number of training steps, the training can be stopped before the maximum number of base learners is reached. \n",
    "\n",
    "Early stopping is thus a way of mitigating overfitting in a relatively cheaply, without having to find the ideal regularization hyperparameters. It is particularly useful for handling large datasets, since it reduces the number of training steps which can decrease the modelling time.\n",
    "\n",
-    "`EarlyStoppingShapRFECV` is a child of `ShapRFECV` with limited support for early stopping and the example below shows how to use it with LightGBM."
+    "Early stopping requires the `early_stopping_rounds` and `eval_metric` parameters of the `ShapRFECV` class, and at the moment it only supports the three aforementioned libraries. The example below shows how to use it with LightGBM."
   ]
  },
 {
@@ -192329,12 +192329,12 @@
   ],
   "source": [
    "%%timeit -n 10\n",
-    "from probatus.feature_elimination import EarlyStoppingShapRFECV\n",
+    "from probatus.feature_elimination import ShapRFECV\n",
    "\n",
    "model = lightgbm.LGBMClassifier(n_estimators=200, max_depth=3)\n",
    "\n",
    "# Run feature elimination\n",
-    "shap_elimination = EarlyStoppingShapRFECV(\n",
+    "shap_elimination = ShapRFECV(\n",
    "    model=search, step=0.2, cv=10, scoring=\"roc_auc\", eval_metric=\"auc\", early_stopping_rounds=5, n_jobs=3\n",
    ")\n",
    "report = shap_elimination.fit_compute(X, y)"
   ]
  },
@@ -192370,7 +192370,7 @@
   "source": [
    "As it is hinted in the example above, with large datasets and simple base learners, early stopping can be a much faster alternative to hyperparameter optimization of the ideal number of trees.\n",
    "\n",
-    "Note that although `EarlyStoppingShapRFECV` supports hyperparameter search models as input, early stopping is used only during the Shapley value estimation step, and not during hyperparameter search. For this reason, _if you are not using early stopping, you should use the parent class, `ShapRFECV`, instead of `EarlyStoppingShapRFECV`_."
+    "Note that although `ShapRFECV` with early stopping supports hyperparameter search models as input, early stopping is used only during the Shapley value estimation step, and not during hyperparameter search."
   ]
  }
 ],
diff --git a/probatus/feature_elimination/early_stopping_feature_elimination.py b/probatus/feature_elimination/early_stopping_feature_elimination.py
index fa1afe1..b76645e 100644
--- a/probatus/feature_elimination/early_stopping_feature_elimination.py
+++ b/probatus/feature_elimination/early_stopping_feature_elimination.py
@@ -1,8 +1,4 @@
 import warnings
-
-from probatus.utils import (
-    shap_calc,
-)
 
 from probatus.feature_elimination import ShapRFECV
 
@@ -167,6 +163,15 @@ def __init__(
             and [LightGBM](https://lightgbm.readthedocs.io/en/latest/Parameters.html#metric-parameters).
             Note that `eval_metric` is an argument of the model's fit method and it is different from `scoring`.
         """  # noqa
+        # TODO: This deprecation warning will be removed when it's decided that this class can be deleted.
+        warnings.warn(
+            "The separate EarlyStoppingShapRFECV class is going to be deprecated"
+            " in a later version of Probatus, since it is now part of the"
+            " ShapRFECV class.
Please adjust your imported class name from" + " 'EarlyStoppingShapRFECV' to 'ShapRFECV'.", + DeprecationWarning, + ) + super().__init__( model, step=step, @@ -176,361 +181,6 @@ def __init__( n_jobs=n_jobs, verbose=verbose, random_state=random_state, + early_stopping_rounds=early_stopping_rounds, + eval_metric=eval_metric, ) - - if self.search_model and self.verbose > 0: - warnings.warn( - "Early stopping will be used only during Shapley value" - " estimation step, and not for hyperparameter" - " optimization." - ) - - if not isinstance(early_stopping_rounds, int) or early_stopping_rounds <= 0: - raise ValueError( - f"The current value of early_stopping_rounds =" - f" {early_stopping_rounds} is not allowed." - f" It needs to be a positive integer." - ) - - self.early_stopping_rounds = early_stopping_rounds - self.eval_metric = eval_metric - - def _get_fit_params_lightGBM( - self, X_train, y_train, X_val, y_val, sample_weight=None, train_index=None, val_index=None - ): - """Get the fit parameters for for a LightGBM Model. - - Args: - - X_train (pd.DataFrame): - Train Dataset used in CV. - - y_train (pd.Series): - Train labels for X. - - X_val (pd.DataFrame): - Validation Dataset used in CV. - - y_val (pd.Series): - Validation labels for X. - - sample_weight (pd.Series, np.ndarray, list, optional): - array-like of shape (n_samples,) - only use if the model you're using supports - sample weighting (check the corresponding scikit-learn documentation). - Array of weights that are assigned to individual samples. - Note that they're only used for fitting of the model, not during evaluation of metrics. - If not provided, then each sample is given unit weight. - - train_index (np.array): - Positions of train folds samples. - - val_index (np.array): - Positions of validation fold samples. - - Raises: - ValueError: if the model is not supported. - - Returns: - dict: fit parameters - """ - from lightgbm import early_stopping, log_evaluation - - fit_params = { - "X": X_train, - "y": y_train, - "eval_set": [(X_val, y_val)], - "eval_metric": self.eval_metric, - "callbacks": [ - early_stopping(self.early_stopping_rounds, first_metric_only=True), - log_evaluation(1 if self.verbose >= 2 else 0), - ], - } - - if sample_weight is not None: - fit_params["sample_weight"] = sample_weight.iloc[train_index] - fit_params["eval_sample_weight"] = [sample_weight.iloc[val_index]] - - return fit_params - - def _get_fit_params_XGBoost( - self, X_train, y_train, X_val, y_val, sample_weight=None, train_index=None, val_index=None - ): - """Get the fit parameters for for a XGBoost Model. - - Args: - - X_train (pd.DataFrame): - Train Dataset used in CV. - - y_train (pd.Series): - Train labels for X. - - X_val (pd.DataFrame): - Validation Dataset used in CV. - - y_val (pd.Series): - Validation labels for X. - - sample_weight (pd.Series, np.ndarray, list, optional): - array-like of shape (n_samples,) - only use if the model you're using supports - sample weighting (check the corresponding scikit-learn documentation). - Array of weights that are assigned to individual samples. - Note that they're only used for fitting of the model, not during evaluation of metrics. - If not provided, then each sample is given unit weight. - - train_index (np.array): - Positions of train folds samples. - - val_index (np.array): - Positions of validation fold samples. - - Raises: - ValueError: if the model is not supported. 
- - Returns: - dict: fit parameters - """ - fit_params = { - "X": X_train, - "y": y_train, - "eval_set": [(X_val, y_val)], - } - if sample_weight is not None: - fit_params["sample_weight"] = sample_weight.iloc[train_index] - fit_params["eval_sample_weight"] = [sample_weight.iloc[val_index]] - - return fit_params - - def _get_fit_params_CatBoost( - self, X_train, y_train, X_val, y_val, sample_weight=None, train_index=None, val_index=None - ): - """Get the fit parameters for for a CatBoost Model. - - Args: - - X_train (pd.DataFrame): - Train Dataset used in CV. - - y_train (pd.Series): - Train labels for X. - - X_val (pd.DataFrame): - Validation Dataset used in CV. - - y_val (pd.Series): - Validation labels for X. - - sample_weight (pd.Series, np.ndarray, list, optional): - array-like of shape (n_samples,) - only use if the model you're using supports - sample weighting (check the corresponding scikit-learn documentation). - Array of weights that are assigned to individual samples. - Note that they're only used for fitting of the model, not during evaluation of metrics. - If not provided, then each sample is given unit weight. - - train_index (np.array): - Positions of train folds samples. - - val_index (np.array): - Positions of validation fold samples. - - Raises: - ValueError: if the model is not supported. - - Returns: - dict: fit parameters - """ - from catboost import Pool - - cat_features = [col for col in X_train.select_dtypes(include=["category"]).columns] - fit_params = { - "X": Pool(X_train, y_train, cat_features=cat_features), - "eval_set": Pool(X_val, y_val, cat_features=cat_features), - # Evaluation metric should be passed during initialization - } - if sample_weight is not None: - fit_params["X"].set_weight(sample_weight.iloc[train_index]) - fit_params["eval_set"].set_weight(sample_weight.iloc[val_index]) - - return fit_params - - def _get_fit_params( - self, model, X_train, y_train, X_val, y_val, sample_weight=None, train_index=None, val_index=None - ): - """Get the fit parameters for the specified classifier or regressor. - - Args: - model (classifier or regressor): - Model to be fitted on the train folds. - - X_train (pd.DataFrame): - Train Dataset used in CV. - - y_train (pd.Series): - Train labels for X. - - X_val (pd.DataFrame): - Validation Dataset used in CV. - - y_val (pd.Series): - Validation labels for X. - - sample_weight (pd.Series, np.ndarray, list, optional): - array-like of shape (n_samples,) - only use if the model you're using supports - sample weighting (check the corresponding scikit-learn documentation). - Array of weights that are assigned to individual samples. - Note that they're only used for fitting of the model, not during evaluation of metrics. - If not provided, then each sample is given unit weight. - - train_index (np.array): - Positions of train folds samples. - - val_index (np.array): - Positions of validation fold samples. - - Raises: - ValueError: if the model is not supported. - - Returns: - dict: fit parameters - """ - # The lightgbm and xgboost imports are temporarily placed here, until the tests on - # macOS have been fixed. 
- - try: - from lightgbm import LGBMModel - - if isinstance(model, LGBMModel): - return self._get_fit_params_lightGBM( - X_train=X_train, - y_train=y_train, - X_val=X_val, - y_val=y_val, - sample_weight=sample_weight, - train_index=train_index, - val_index=val_index, - ) - except ImportError: - pass - - try: - from xgboost.sklearn import XGBModel - - if isinstance(model, XGBModel): - return self._get_fit_params_XGBoost( - X_train=X_train, - y_train=y_train, - X_val=X_val, - y_val=y_val, - sample_weight=sample_weight, - train_index=train_index, - val_index=val_index, - ) - except ImportError: - pass - - try: - from catboost import CatBoost - - if isinstance(model, CatBoost): - return self._get_fit_params_CatBoost( - X_train=X_train, - y_train=y_train, - X_val=X_val, - y_val=y_val, - sample_weight=sample_weight, - train_index=train_index, - val_index=val_index, - ) - except ImportError: - pass - - raise ValueError("Model type not supported") - - def _get_feature_shap_values_per_fold( - self, - X, - y, - model, - train_index, - val_index, - sample_weight=None, - **shap_kwargs, - ): - """ - This function calculates the shap values on validation set, and Train and Val score. - - Args: - X (pd.DataFrame): - Dataset used in CV. - - y (pd.Series): - Labels for X. - - sample_weight (pd.Series, np.ndarray, list, optional): - array-like of shape (n_samples,) - only use if the model you're using supports - sample weighting (check the corresponding scikit-learn documentation). - Array of weights that are assigned to individual samples. - Note that they're only used for fitting of the model, not during evaluation of metrics. - If not provided, then each sample is given unit weight. - - model: - Classifier or regressor to be fitted on the train folds. - - train_index (np.array): - Positions of train folds samples. - - val_index (np.array): - Positions of validation fold samples. - - **shap_kwargs: - keyword arguments passed to - [shap.Explainer](https://shap.readthedocs.io/en/latest/generated/shap.Explainer.html#shap.Explainer). - It also enables `approximate` and `check_additivity` parameters, passed while calculating SHAP values. - The `approximate=True` causes less accurate, but faster SHAP values calculation, while - `check_additivity=False` disables the additivity check inside SHAP. - Returns: - (np.array, float, float): - Tuple with the results: Shap Values on validation fold, train score, validation score. - """ - X_train, X_val = X.iloc[train_index, :], X.iloc[val_index, :] - y_train, y_val = y.iloc[train_index], y.iloc[val_index] - - fit_params = self._get_fit_params( - model=model, - X_train=X_train, - y_train=y_train, - X_val=X_val, - y_val=y_val, - sample_weight=sample_weight, - train_index=train_index, - val_index=val_index, - ) - - # Due to deprecation issues (compatibility with Sklearn) set some params - # like below, instead of through fit(). 
- try: - from xgboost.sklearn import XGBModel - - if isinstance(model, XGBModel): - model.set_params(eval_metric=self.eval_metric, early_stopping_rounds=self.early_stopping_rounds) - except ImportError: - pass - - try: - from catboost import CatBoost - - if isinstance(model, CatBoost): - model.set_params(early_stopping_rounds=self.early_stopping_rounds) - except ImportError: - pass - - # Train the model - model = model.fit(**fit_params) - - # Score the model - score_train = self.scorer.score(model, X_train, y_train) - score_val = self.scorer.score(model, X_val, y_val) - - # Compute SHAP values - shap_values = shap_calc(model, X_val, verbose=self.verbose, random_state=self.random_state, **shap_kwargs) - return shap_values, score_train, score_val diff --git a/probatus/feature_elimination/feature_elimination.py b/probatus/feature_elimination/feature_elimination.py index 8994262..b5e303b 100644 --- a/probatus/feature_elimination/feature_elimination.py +++ b/probatus/feature_elimination/feature_elimination.py @@ -111,6 +111,8 @@ def __init__( n_jobs=-1, verbose=0, random_state=None, + early_stopping_rounds=None, + eval_metric=None, ): """ This method initializes the class. @@ -163,6 +165,19 @@ def __init__( Random state set at each round of feature elimination. If it is None, the results will not be reproducible and in random search at each iteration a different hyperparameters might be tested. For reproducible results set it to an integer. + + early_stopping_rounds (int, optional): + Number of rounds with constant performance after which the model fitting stops. This is passed to the + fit method of the model for Shapley values estimation, but not for hyperparameter search. Only + supported by some models, such as XGBoost, LightGBM and CatBoost. Only recommended when dealing with large sets of data. + + eval_metric (str, optional): + Metric for scoring fitting rounds and activating early stopping. This is passed to the + fit method of the model for Shapley values estimation, but not for hyperparameter search. Only + supported by some models, such as [XGBoost](https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters) + and [LightGBM](https://lightgbm.readthedocs.io/en/latest/Parameters.html#metric-parameters). + Note that `eval_metric` is an argument of the model's fit method and it is different from `scoring`. + Only recommended when dealing with large sets of data. """ # noqa self.model = model self.search_model = isinstance(model, BaseSearchCV) @@ -173,8 +188,49 @@ def __init__( self.n_jobs = n_jobs self.verbose = verbose self.random_state = random_state + + # Enable early stopping behavior + if early_stopping_rounds: + if not eval_metric: + warnings.warn( + "Running early stopping, requires both 'early_stopping_rounds' and 'eval_metric' as" + " parameters to be provided and supports only 'XGBoost', 'LGBM' and 'CatBoost'." + ) + + if not isinstance(early_stopping_rounds, int) or early_stopping_rounds <= 0: + raise ValueError(f"early_stopping_rounds must be a positive integer; got {early_stopping_rounds}.") + + if not self._check_if_model_is_compatible_with_early_stopping(model): + raise ValueError("Only 'XGBoost', 'LGBM' and 'CatBoost' supported for early stopping.") + + self.early_stopping_rounds = early_stopping_rounds + self.eval_metric = eval_metric + self.report_df = pd.DataFrame() + def _check_if_model_is_compatible_with_early_stopping(self, model): + """ + Check if the model or the estimator of the cv is compatible with early stopping. 
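+
+        Args:
+            model (model or BaseSearchCV):
+                Model to check. For hyperparameter search objects, the underlying estimator is checked.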
+ + Returns: + (bool): + bool if true or false based on compatibility. + """ + libraries = [("lightgbm", "LGBMModel"), ("xgboost.sklearn", "XGBModel"), ("catboost", "CatBoost")] + + if isinstance(model, BaseSearchCV): + model = model.estimator + + for lib, class_name in libraries: + try: + module = __import__(lib, fromlist=[class_name]) + if isinstance(model, getattr(module, class_name)): + return True + except ImportError: + pass + + return False + def compute(self): """ Checks if fit() method has been run. @@ -401,19 +457,35 @@ def fit( else: current_model = clone(self.model) - # Perform CV to estimate feature importance with SHAP - results_per_fold = Parallel(n_jobs=self.n_jobs)( - delayed(self._get_feature_shap_values_per_fold)( - X=current_X, - y=self.y, - model=current_model, - train_index=train_index, - val_index=val_index, - sample_weight=sample_weight, - **shap_kwargs, + # Early stopping enabled (or not) + if not (self.early_stopping_rounds and self.eval_metric): + # Perform CV to estimate feature importance with SHAP + results_per_fold = Parallel(n_jobs=self.n_jobs)( + delayed(self._get_feature_shap_values_per_fold)( + X=current_X, + y=self.y, + model=current_model, + train_index=train_index, + val_index=val_index, + sample_weight=sample_weight, + **shap_kwargs, + ) + for train_index, val_index in self.cv.split(current_X, self.y, groups) + ) + else: + # Perform CV to estimate feature importance with SHAP + results_per_fold = Parallel(n_jobs=self.n_jobs)( + delayed(self._get_feature_shap_values_per_fold_early_stopping)( + X=current_X, + y=self.y, + model=current_model, + train_index=train_index, + val_index=val_index, + sample_weight=sample_weight, + **shap_kwargs, + ) + for train_index, val_index in self.cv.split(current_X, self.y, groups) ) - for train_index, val_index in self.cv.split(current_X, self.y, groups) - ) if self.y.nunique() == 2 or is_regressor(current_model): shap_values = np.concatenate([current_result[0] for current_result in results_per_fold], axis=0) @@ -942,3 +1014,338 @@ def _get_feature_ranking(self): ranking = [features_eliminated_dict[col] for col in self.column_names] return ranking + + def _get_fit_params_lightGBM( + self, X_train, y_train, X_val, y_val, sample_weight=None, train_index=None, val_index=None + ): + """Get the fit parameters for for a LightGBM Model. + + Args: + + X_train (pd.DataFrame): + Train Dataset used in CV. + + y_train (pd.Series): + Train labels for X. + + X_val (pd.DataFrame): + Validation Dataset used in CV. + + y_val (pd.Series): + Validation labels for X. + + sample_weight (pd.Series, np.ndarray, list, optional): + array-like of shape (n_samples,) - only use if the model you're using supports + sample weighting (check the corresponding scikit-learn documentation). + Array of weights that are assigned to individual samples. + Note that they're only used for fitting of the model, not during evaluation of metrics. + If not provided, then each sample is given unit weight. + + train_index (np.array): + Positions of train folds samples. + + val_index (np.array): + Positions of validation fold samples. + + Raises: + ValueError: if the model is not supported. 
+ + Returns: + dict: fit parameters + """ + from lightgbm import early_stopping, log_evaluation + + fit_params = { + "X": X_train, + "y": y_train, + "eval_set": [(X_val, y_val)], + "eval_metric": self.eval_metric, + "callbacks": [ + early_stopping(self.early_stopping_rounds, first_metric_only=True), + log_evaluation(1 if self.verbose >= 2 else 0), + ], + } + + if sample_weight is not None: + fit_params["sample_weight"] = sample_weight.iloc[train_index] + fit_params["eval_sample_weight"] = [sample_weight.iloc[val_index]] + + return fit_params + + def _get_fit_params_XGBoost( + self, X_train, y_train, X_val, y_val, sample_weight=None, train_index=None, val_index=None + ): + """Get the fit parameters for for a XGBoost Model. + + Args: + + X_train (pd.DataFrame): + Train Dataset used in CV. + + y_train (pd.Series): + Train labels for X. + + X_val (pd.DataFrame): + Validation Dataset used in CV. + + y_val (pd.Series): + Validation labels for X. + + sample_weight (pd.Series, np.ndarray, list, optional): + array-like of shape (n_samples,) - only use if the model you're using supports + sample weighting (check the corresponding scikit-learn documentation). + Array of weights that are assigned to individual samples. + Note that they're only used for fitting of the model, not during evaluation of metrics. + If not provided, then each sample is given unit weight. + + train_index (np.array): + Positions of train folds samples. + + val_index (np.array): + Positions of validation fold samples. + + Raises: + ValueError: if the model is not supported. + + Returns: + dict: fit parameters + """ + fit_params = { + "X": X_train, + "y": y_train, + "eval_set": [(X_val, y_val)], + } + if sample_weight is not None: + fit_params["sample_weight"] = sample_weight.iloc[train_index] + fit_params["eval_sample_weight"] = [sample_weight.iloc[val_index]] + + return fit_params + + def _get_fit_params_CatBoost( + self, X_train, y_train, X_val, y_val, sample_weight=None, train_index=None, val_index=None + ): + """Get the fit parameters for for a CatBoost Model. + + Args: + + X_train (pd.DataFrame): + Train Dataset used in CV. + + y_train (pd.Series): + Train labels for X. + + X_val (pd.DataFrame): + Validation Dataset used in CV. + + y_val (pd.Series): + Validation labels for X. + + sample_weight (pd.Series, np.ndarray, list, optional): + array-like of shape (n_samples,) - only use if the model you're using supports + sample weighting (check the corresponding scikit-learn documentation). + Array of weights that are assigned to individual samples. + Note that they're only used for fitting of the model, not during evaluation of metrics. + If not provided, then each sample is given unit weight. + + train_index (np.array): + Positions of train folds samples. + + val_index (np.array): + Positions of validation fold samples. + + Raises: + ValueError: if the model is not supported. 
+ + Returns: + dict: fit parameters + """ + from catboost import Pool + + cat_features = [col for col in X_train.select_dtypes(include=["category"]).columns] + fit_params = { + "X": Pool(X_train, y_train, cat_features=cat_features), + "eval_set": Pool(X_val, y_val, cat_features=cat_features), + # Evaluation metric should be passed during initialization + } + if sample_weight is not None: + fit_params["X"].set_weight(sample_weight.iloc[train_index]) + fit_params["eval_set"].set_weight(sample_weight.iloc[val_index]) + + return fit_params + + def _get_fit_params( + self, model, X_train, y_train, X_val, y_val, sample_weight=None, train_index=None, val_index=None + ): + """Get the fit parameters for the specified classifier or regressor. + + Args: + model (classifier or regressor): + Model to be fitted on the train folds. + + X_train (pd.DataFrame): + Train Dataset used in CV. + + y_train (pd.Series): + Train labels for X. + + X_val (pd.DataFrame): + Validation Dataset used in CV. + + y_val (pd.Series): + Validation labels for X. + + sample_weight (pd.Series, np.ndarray, list, optional): + array-like of shape (n_samples,) - only use if the model you're using supports + sample weighting (check the corresponding scikit-learn documentation). + Array of weights that are assigned to individual samples. + Note that they're only used for fitting of the model, not during evaluation of metrics. + If not provided, then each sample is given unit weight. + + train_index (np.array): + Positions of train folds samples. + + val_index (np.array): + Positions of validation fold samples. + + Raises: + ValueError: if the model is not supported. + + Returns: + dict: fit parameters + """ + try: + from lightgbm import LGBMModel + + if isinstance(model, LGBMModel): + return self._get_fit_params_lightGBM( + X_train=X_train, + y_train=y_train, + X_val=X_val, + y_val=y_val, + sample_weight=sample_weight, + train_index=train_index, + val_index=val_index, + ) + except ImportError: + pass + + try: + from xgboost.sklearn import XGBModel + + if isinstance(model, XGBModel): + return self._get_fit_params_XGBoost( + X_train=X_train, + y_train=y_train, + X_val=X_val, + y_val=y_val, + sample_weight=sample_weight, + train_index=train_index, + val_index=val_index, + ) + except ImportError: + pass + + try: + from catboost import CatBoost + + if isinstance(model, CatBoost): + return self._get_fit_params_CatBoost( + X_train=X_train, + y_train=y_train, + X_val=X_val, + y_val=y_val, + sample_weight=sample_weight, + train_index=train_index, + val_index=val_index, + ) + except ImportError: + pass + + raise ValueError("Model type not supported") + + def _get_feature_shap_values_per_fold_early_stopping( + self, + X, + y, + model, + train_index, + val_index, + sample_weight=None, + **shap_kwargs, + ): + """ + This function calculates the shap values on validation set, and Train and Val score. + + Args: + X (pd.DataFrame): + Dataset used in CV. + + y (pd.Series): + Labels for X. + + sample_weight (pd.Series, np.ndarray, list, optional): + array-like of shape (n_samples,) - only use if the model you're using supports + sample weighting (check the corresponding scikit-learn documentation). + Array of weights that are assigned to individual samples. + Note that they're only used for fitting of the model, not during evaluation of metrics. + If not provided, then each sample is given unit weight. + + model: + Classifier or regressor to be fitted on the train folds. + + train_index (np.array): + Positions of train folds samples. 
+ + val_index (np.array): + Positions of validation fold samples. + + **shap_kwargs: + keyword arguments passed to + [shap.Explainer](https://shap.readthedocs.io/en/latest/generated/shap.Explainer.html#shap.Explainer). + It also enables `approximate` and `check_additivity` parameters, passed while calculating SHAP values. + The `approximate=True` causes less accurate, but faster SHAP values calculation, while + `check_additivity=False` disables the additivity check inside SHAP. + Returns: + (np.array, float, float): + Tuple with the results: Shap Values on validation fold, train score, validation score. + """ + X_train, X_val = X.iloc[train_index, :], X.iloc[val_index, :] + y_train, y_val = y.iloc[train_index], y.iloc[val_index] + + fit_params = self._get_fit_params( + model=model, + X_train=X_train, + y_train=y_train, + X_val=X_val, + y_val=y_val, + sample_weight=sample_weight, + train_index=train_index, + val_index=val_index, + ) + + try: + from xgboost.sklearn import XGBModel + + if isinstance(model, XGBModel): + model.set_params(eval_metric=self.eval_metric, early_stopping_rounds=self.early_stopping_rounds) + except ImportError: + pass + + try: + from catboost import CatBoost + + if isinstance(model, CatBoost): + model.set_params(early_stopping_rounds=self.early_stopping_rounds) + except ImportError: + pass + + # Train the model + model = model.fit(**fit_params) + + # Score the model + score_train = self.scorer.score(model, X_train, y_train) + score_val = self.scorer.score(model, X_val, y_val) + + # Compute SHAP values + shap_values = shap_calc(model, X_val, verbose=self.verbose, random_state=self.random_state, **shap_kwargs) + return shap_values, score_train, score_val diff --git a/pyproject.toml b/pyproject.toml index dd767c4..9ad37c2 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta" [project] name = "probatus" -version = "3.1.1" +version = "3.1.2" requires-python= ">=3.9" description = "Validation of regression & classifiers and data used to develop them" readme = { file = "README.md", content-type = "text/markdown" } @@ -29,9 +29,7 @@ dependencies = [ "scikit-learn>=0.22.2", "pandas>=1.0.0", "matplotlib>=3.1.1", - "scipy>=1.4.0", "joblib>=0.13.2", - "tqdm>=4.41.0", "shap>=0.43.0", "numpy>=1.23.2,<2.0.0", "numba>=0.57.0", @@ -49,15 +47,12 @@ Changelog = "https://github.com/ing-bank/probatus/blob/main/CHANGELOG.md" dev = [ "black>=19.10b0", - "pre-commit>=2.5.0", "mypy>=0.770", "pytest>=6.0.0", "pytest-cov>=2.10.0", "pyflakes", - "seaborn>=0.9.0", "joblib>=0.13.2", "jupyter>=1.0.0", - "tabulate>=0.8.7", "nbconvert>=6.0.7", "pre-commit>=2.7.1", "isort>=5.12.0", @@ -66,7 +61,6 @@ dev = [ "lightgbm>=3.3.0", "catboost>=1.2", "xgboost>=1.5.0", - "scipy>=1.4.0", ] docs = [ "mkdocs>=1.5.3", diff --git a/tests/feature_elimination/test_feature_elimination.py b/tests/feature_elimination/test_feature_elimination.py index 3c93af9..f356110 100644 --- a/tests/feature_elimination/test_feature_elimination.py +++ b/tests/feature_elimination/test_feature_elimination.py @@ -10,7 +10,7 @@ from sklearn.svm import SVC from xgboost import XGBClassifier, XGBRegressor -from probatus.feature_elimination import EarlyStoppingShapRFECV, ShapRFECV +from probatus.feature_elimination import ShapRFECV, EarlyStoppingShapRFECV from probatus.utils import preprocess_labels @@ -377,7 +377,7 @@ def test_shap_rfe_early_stopping_XGBoost(XGBoost_classifier, complex_data, rando X, y = complex_data X["f1_categorical"] = X["f1_categorical"].astype(float) - shap_elimination 
= EarlyStoppingShapRFECV( + shap_elimination = ShapRFECV( XGBoost_classifier, random_state=random_state, step=1, @@ -398,7 +398,7 @@ def test_shap_rfe_early_stopping_XGBoost(XGBoost_classifier, complex_data, rando def test_shap_rfe_early_stopping_CatBoost(complex_data_with_categorical, catboost_classifier, random_state): X, y = complex_data_with_categorical - shap_elimination = EarlyStoppingShapRFECV( + shap_elimination = ShapRFECV( catboost_classifier, random_state=random_state, step=1, @@ -422,7 +422,7 @@ def test_shap_rfe_randomized_search_early_stopping_lightGBM(complex_data, random "max_depth": [3, 4, 5], } search = RandomizedSearchCV(model, param_grid, cv=2, n_iter=2, random_state=random_state) - shap_elimination = EarlyStoppingShapRFECV( + shap_elimination = ShapRFECV( search, step=1, cv=10, @@ -446,14 +446,12 @@ def test_get_feature_shap_values_per_fold_early_stopping_lightGBM(complex_data, X, y = complex_data y = preprocess_labels(y, y_name="y", index=X.index) - shap_elimination = EarlyStoppingShapRFECV( - model, early_stopping_rounds=5, scoring="roc_auc", random_state=random_state - ) + shap_elimination = ShapRFECV(model, early_stopping_rounds=5, scoring="roc_auc", random_state=random_state) ( shap_values, train_score, test_score, - ) = shap_elimination._get_feature_shap_values_per_fold( + ) = shap_elimination._get_feature_shap_values_per_fold_early_stopping( X, y, model, @@ -471,14 +469,14 @@ def test_get_feature_shap_values_per_fold_early_stopping_CatBoost( X, y = complex_data_with_categorical y = preprocess_labels(y, y_name="y", index=X.index) - shap_elimination = EarlyStoppingShapRFECV( + shap_elimination = ShapRFECV( catboost_classifier, early_stopping_rounds=5, scoring="roc_auc", random_state=random_state ) ( shap_values, train_score, test_score, - ) = shap_elimination._get_feature_shap_values_per_fold( + ) = shap_elimination._get_feature_shap_values_per_fold_early_stopping( X, y, catboost_classifier, @@ -494,14 +492,14 @@ def test_get_feature_shap_values_per_fold_early_stopping_XGBoost(XGBoost_classif X, y = complex_data y = preprocess_labels(y, y_name="y", index=X.index) - shap_elimination = EarlyStoppingShapRFECV( + shap_elimination = ShapRFECV( XGBoost_classifier, early_stopping_rounds=5, scoring="roc_auc", random_state=random_state ) ( shap_values, train_score, test_score, - ) = shap_elimination._get_feature_shap_values_per_fold( + ) = shap_elimination._get_feature_shap_values_per_fold_early_stopping( X, y, XGBoost_classifier, @@ -516,7 +514,7 @@ def test_get_feature_shap_values_per_fold_early_stopping_XGBoost(XGBoost_classif def test_EarlyStoppingShapRFECV_no_categorical(complex_data, random_state): model = LGBMClassifier(n_estimators=50, max_depth=3, num_leaves=3, random_state=random_state) - shap_elimination = EarlyStoppingShapRFECV( + shap_elimination = ShapRFECV( model=model, step=0.33, cv=5, @@ -557,7 +555,7 @@ def test_LightGBM_stratified_kfold(random_state): for _ in range(n_iter): skf = StratifiedKFold(n_folds, shuffle=True, random_state=random_state) - shap_elimination = EarlyStoppingShapRFECV( + shap_elimination = ShapRFECV( model=model, step=1 / (n_iter + 1), cv=skf,
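For quick reference, below is a minimal usage sketch of the merged API introduced by this diff: early stopping is now configured directly on `ShapRFECV` through `early_stopping_rounds` and `eval_metric` (LightGBM, XGBoost and CatBoost only). The synthetic dataset, model settings and CV choices are illustrative assumptions, not part of the change.

```python
import lightgbm
import pandas as pd
from sklearn.datasets import make_classification

from probatus.feature_elimination import ShapRFECV

# Illustrative synthetic binary-classification data (placeholder, not from the diff).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=42)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(20)])
y = pd.Series(y)

model = lightgbm.LGBMClassifier(n_estimators=200, max_depth=3)

# Passing both early_stopping_rounds and eval_metric enables early stopping during
# the SHAP estimation folds; it is not applied inside any hyperparameter search.
shap_elimination = ShapRFECV(
    model=model,
    step=0.2,
    cv=5,
    scoring="roc_auc",
    eval_metric="auc",
    early_stopping_rounds=5,
    n_jobs=3,
)
report = shap_elimination.fit_compute(X, y)
print(report.head())
```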