import pandas as pd
import numpy as np
df = pd.read_csv("./data.csv")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6819 entries, 0 to 6818
Data columns (total 96 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Bankrupt? 6819 non-null int64
1 ROA(C) before interest and depreciation before interest 6819 non-null float64
2 ROA(A) before interest and % after tax 6819 non-null float64
3 ROA(B) before interest and depreciation after tax 6819 non-null float64
4 Operating Gross Margin 6819 non-null float64
5 Realized Sales Gross Margin 6819 non-null float64
6 Operating Profit Rate 6819 non-null float64
7 Pre-tax net Interest Rate 6819 non-null float64
8 After-tax net Interest Rate 6819 non-null float64
9 Non-industry income and expenditure/revenue 6819 non-null float64
10 Continuous interest rate (after tax) 6819 non-null float64
11 Operating Expense Rate 6819 non-null float64
12 Research and development expense rate 6819 non-null float64
13 Cash flow rate 6819 non-null float64
14 Interest-bearing debt interest rate 6819 non-null float64
15 Tax rate (A) 6819 non-null float64
16 Net Value Per Share (B) 6819 non-null float64
17 Net Value Per Share (A) 6819 non-null float64
18 Net Value Per Share (C) 6819 non-null float64
19 Persistent EPS in the Last Four Seasons 6819 non-null float64
20 Cash Flow Per Share 6819 non-null float64
21 Revenue Per Share (Yuan ¥) 6819 non-null float64
22 Operating Profit Per Share (Yuan ¥) 6819 non-null float64
23 Per Share Net profit before tax (Yuan ¥) 6819 non-null float64
24 Realized Sales Gross Profit Growth Rate 6819 non-null float64
25 Operating Profit Growth Rate 6819 non-null float64
26 After-tax Net Profit Growth Rate 6819 non-null float64
27 Regular Net Profit Growth Rate 6819 non-null float64
28 Continuous Net Profit Growth Rate 6819 non-null float64
29 Total Asset Growth Rate 6819 non-null float64
30 Net Value Growth Rate 6819 non-null float64
31 Total Asset Return Growth Rate Ratio 6819 non-null float64
32 Cash Reinvestment % 6819 non-null float64
33 Current Ratio 6819 non-null float64
34 Quick Ratio 6819 non-null float64
35 Interest Expense Ratio 6819 non-null float64
36 Total debt/Total net worth 6819 non-null float64
37 Debt ratio % 6819 non-null float64
38 Net worth/Assets 6819 non-null float64
39 Long-term fund suitability ratio (A) 6819 non-null float64
40 Borrowing dependency 6819 non-null float64
41 Contingent liabilities/Net worth 6819 non-null float64
42 Operating profit/Paid-in capital 6819 non-null float64
43 Net profit before tax/Paid-in capital 6819 non-null float64
44 Inventory and accounts receivable/Net value 6819 non-null float64
45 Total Asset Turnover 6819 non-null float64
46 Accounts Receivable Turnover 6819 non-null float64
47 Average Collection Days 6819 non-null float64
48 Inventory Turnover Rate (times) 6819 non-null float64
49 Fixed Assets Turnover Frequency 6819 non-null float64
50 Net Worth Turnover Rate (times) 6819 non-null float64
51 Revenue per person 6819 non-null float64
52 Operating profit per person 6819 non-null float64
53 Allocation rate per person 6819 non-null float64
54 Working Capital to Total Assets 6819 non-null float64
55 Quick Assets/Total Assets 6819 non-null float64
56 Current Assets/Total Assets 6819 non-null float64
57 Cash/Total Assets 6819 non-null float64
58 Quick Assets/Current Liability 6819 non-null float64
59 Cash/Current Liability 6819 non-null float64
60 Current Liability to Assets 6819 non-null float64
61 Operating Funds to Liability 6819 non-null float64
62 Inventory/Working Capital 6819 non-null float64
63 Inventory/Current Liability 6819 non-null float64
64 Current Liabilities/Liability 6819 non-null float64
65 Working Capital/Equity 6819 non-null float64
66 Current Liabilities/Equity 6819 non-null float64
67 Long-term Liability to Current Assets 6819 non-null float64
68 Retained Earnings to Total Assets 6819 non-null float64
69 Total income/Total expense 6819 non-null float64
70 Total expense/Assets 6819 non-null float64
71 Current Asset Turnover Rate 6819 non-null float64
72 Quick Asset Turnover Rate 6819 non-null float64
73 Working capitcal Turnover Rate 6819 non-null float64
74 Cash Turnover Rate 6819 non-null float64
75 Cash Flow to Sales 6819 non-null float64
76 Fixed Assets to Assets 6819 non-null float64
77 Current Liability to Liability 6819 non-null float64
78 Current Liability to Equity 6819 non-null float64
79 Equity to Long-term Liability 6819 non-null float64
80 Cash Flow to Total Assets 6819 non-null float64
81 Cash Flow to Liability 6819 non-null float64
82 CFO to Assets 6819 non-null float64
83 Cash Flow to Equity 6819 non-null float64
84 Current Liability to Current Assets 6819 non-null float64
85 Liability-Assets Flag 6819 non-null int64
86 Net Income to Total Assets 6819 non-null float64
87 Total assets to GNP price 6819 non-null float64
88 No-credit Interval 6819 non-null float64
89 Gross Profit to Sales 6819 non-null float64
90 Net Income to Stockholder's Equity 6819 non-null float64
91 Liability to Equity 6819 non-null float64
92 Degree of Financial Leverage (DFL) 6819 non-null float64
93 Interest Coverage Ratio (Interest expense to EBIT) 6819 non-null float64
94 Net Income Flag 6819 non-null int64
95 Equity to Liability 6819 non-null float64
dtypes: float64(93), int64(3)
memory usage: 5.0 MB
df["Bankrupt?"].value_counts(normalize=True).plot(kind='bar')
<AxesSubplot:>
df["Bankruptdesc"] = df["Bankrupt?"].map({
0 : "Not Bankrupt",
1 : "Bankrupt"
})
df["Bankruptdesc"].value_counts(normalize=True).plot(kind='bar')
<AxesSubplot:>
from sklearn.utils import resample
# Example: Under-sampling the majority class
minority_class = df[df['Bankrupt?'] == 1]
majority_class = df[df['Bankrupt?'] == 0]
majority_class_downsampled = resample(majority_class, replace=False, n_samples=len(minority_class), random_state=42)
balanced_df = pd.concat([majority_class_downsampled, minority_class])
balanced_df.drop("Bankruptdesc", axis=1, inplace=True)
balanced_df
 | Bankrupt? | ROA(C) before interest and depreciation before interest | ROA(A) before interest and % after tax | ROA(B) before interest and depreciation after tax | Operating Gross Margin | Realized Sales Gross Margin | Operating Profit Rate | Pre-tax net Interest Rate | After-tax net Interest Rate | Non-industry income and expenditure/revenue | ... | Net Income to Total Assets | Total assets to GNP price | No-credit Interval | Gross Profit to Sales | Net Income to Stockholder's Equity | Liability to Equity | Degree of Financial Leverage (DFL) | Interest Coverage Ratio (Interest expense to EBIT) | Net Income Flag | Equity to Liability
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
2236 | 0 | 0.471945 | 0.540667 | 0.523636 | 0.607518 | 0.607518 | 0.999034 | 0.797471 | 0.809381 | 0.303531 | ... | 0.801098 | 0.002294 | 0.623034 | 0.607520 | 0.840529 | 0.281733 | 0.026791 | 0.565159 | 1 | 0.023960 |
5538 | 0 | 0.507093 | 0.554187 | 0.558702 | 0.611972 | 0.611986 | 0.999132 | 0.797542 | 0.809430 | 0.303451 | ... | 0.808595 | 0.002987 | 0.625407 | 0.611967 | 0.840884 | 0.277676 | 0.026885 | 0.565571 | 1 | 0.042552 |
4593 | 0 | 0.503924 | 0.550425 | 0.556936 | 0.605788 | 0.605788 | 0.999048 | 0.797459 | 0.809370 | 0.303481 | ... | 0.806939 | 0.001287 | 0.624203 | 0.605785 | 0.840704 | 0.276773 | 0.026811 | 0.565249 | 1 | 0.056514 |
6315 | 0 | 0.451275 | 0.498528 | 0.503346 | 0.598438 | 0.598438 | 0.998972 | 0.797280 | 0.809214 | 0.303326 | ... | 0.771691 | 0.008026 | 0.623244 | 0.598436 | 0.837267 | 0.284545 | 0.026559 | 0.563737 | 1 | 0.020046 |
4205 | 0 | 0.533418 | 0.613607 | 0.596445 | 0.613939 | 0.612743 | 0.999142 | 0.797527 | 0.809452 | 0.303401 | ... | 0.837382 | 0.000646 | 0.623888 | 0.613937 | 0.843529 | 0.280201 | 0.026804 | 0.565218 | 1 | 0.027769 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6591 | 1 | 0.418515 | 0.433984 | 0.461427 | 0.612750 | 0.612750 | 0.998864 | 0.796902 | 0.808857 | 0.302892 | ... | 0.725750 | 0.000487 | 0.623730 | 0.612747 | 0.828067 | 0.292648 | 0.026666 | 0.564481 | 1 | 0.015620 |
6640 | 1 | 0.196802 | 0.211023 | 0.221425 | 0.598056 | 0.598056 | 0.998933 | 0.796144 | 0.808149 | 0.301423 | ... | 0.519388 | 0.017588 | 0.623465 | 0.598051 | 0.856906 | 0.259280 | 0.026769 | 0.565052 | 1 | 0.003946 |
6641 | 1 | 0.337640 | 0.254307 | 0.378446 | 0.590842 | 0.590842 | 0.998869 | 0.796943 | 0.808897 | 0.302953 | ... | 0.557733 | 0.000847 | 0.623302 | 0.590838 | 0.726888 | 0.336515 | 0.026777 | 0.565092 | 1 | 0.011797 |
6642 | 1 | 0.340028 | 0.344636 | 0.380213 | 0.581466 | 0.581466 | 0.998372 | 0.796292 | 0.808283 | 0.302857 | ... | 0.641804 | 0.000376 | 0.623497 | 0.581461 | 0.765967 | 0.337315 | 0.026722 | 0.564807 | 1 | 0.011777 |
6728 | 1 | 0.492176 | 0.544320 | 0.533326 | 0.618105 | 0.618105 | 0.999083 | 0.797456 | 0.809338 | 0.303401 | ... | 0.800780 | 0.000517 | 0.623737 | 0.618104 | 0.840533 | 0.282763 | 0.027033 | 0.566098 | 1 | 0.022209 |
440 rows × 96 columns
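As a quick sanity check, the down-sampled frame should now contain an equal number of rows per class; a minimal sketch:
# Verify the 50/50 split after under-sampling (220 rows per class here)
print(balanced_df["Bankrupt?"].value_counts())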
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Separate features from target
X = balanced_df.iloc[:,1:]
y = balanced_df.iloc[:,0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
X_train, X_test, y_train, y_test = X_train.values, X_test.values, y_train.values, y_test.values
# Fit the scaler on the training data only (not the whole dataset)
std_scaler = StandardScaler().fit(X_train)
def preprocessor(X):
    # std_scaler is looked up globally each time the transformer is applied
    D = np.copy(X)
    D = std_scaler.transform(D)
    return D
preprocess_transformer = FunctionTransformer(preprocessor)
preprocess_transformer
FunctionTransformer(func=<function preprocessor at 0x00000220B49CAE58>)
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
p1 = Pipeline([('scaler', preprocess_transformer),
('Logistic Regression', LogisticRegression())])
p1
Pipeline(steps=[('scaler',
FunctionTransformer(func=<function preprocessor at 0x00000220B49CAE58>)),
('Logistic Regression', LogisticRegression())])
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, accuracy_score
def fit_and_print(p, X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test):
    # Fit the pipeline on the training data (defaults bind the split above)
    p.fit(X_train, y_train)
    # Predict on the held-out test set
    test_prediction = p.predict(X_test)
    # Report test metrics (sklearn convention: y_true first, then y_pred)
    print(f"Accuracy Score: {accuracy_score(y_test, test_prediction)*100:.2f}")
    print(f"Precision Score: {precision_score(y_test, test_prediction)*100:.2f}")
    print(f"Recall Score: {recall_score(y_test, test_prediction)*100:.2f}")
    print(f"ROC AUC Score: {roc_auc_score(y_test, test_prediction)*100:.2f}")
    print("\nConfusion Matrix:\n", confusion_matrix(y_test, test_prediction))
fit_and_print(p1)
Accuracy Score: 78.18
Precision Score: 75.00
Recall Score: 80.77
ROC AUC Score: 78.32
Confusion Matrix:
[[44 14]
[10 42]]
X_test.shape
(110, 95)
from sklearn.utils import resample
# Example: Over-sampling the minority class
minority_class = df[df['Bankrupt?'] == 1]
majority_class = df[df['Bankrupt?'] == 0]
minority_class_upsampled = resample(minority_class, replace=True, n_samples=len(majority_class), random_state=42)
balanced_df = pd.concat([majority_class, minority_class_upsampled])
from imblearn.over_sampling import RandomOverSampler
# Random over-sampling duplicates minority rows until the classes are balanced
over_sampler = RandomOverSampler()
y = df.iloc[:,0]
X = df.iloc[:,1:]
X_imb_res, y_imb_res = over_sampler.fit_resample(X, y)
X_imb_res.drop('Bankruptdesc', axis=1, inplace=True)
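The same balance check applies after over-sampling; a minimal sketch:
# Both classes should now match the majority count (6,599 rows each)
print(y_imb_res.value_counts())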
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(X_imb_res, y_imb_res, test_size=0.25)
X_train, X_test, y_train, y_test = X_train.values, X_test.values, y_train.values, y_test.values
# Fit the scaler on the new training split only
std_scaler = StandardScaler().fit(X_train)
def preprocessor(X):
    D = np.copy(X)
    D = std_scaler.transform(D)
    return D
preprocess_transformer = FunctionTransformer(preprocessor)
preprocess_transformer
FunctionTransformer(func=<function preprocessor at 0x0000027695A16438>)
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
p1 = Pipeline([('scaler', preprocess_transformer),
('Logistic Regression', LogisticRegression())])
p1
Pipeline(steps=[('scaler',
FunctionTransformer(func=<function preprocessor at 0x0000027695A16438>)),
('Logistic Regression', LogisticRegression())])
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, accuracy_score
def fit_and_print(p, X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test):
    # Re-defined so the default arguments bind to the new train/test split
    p.fit(X_train, y_train)
    # Predict on the held-out test set
    test_prediction = p.predict(X_test)
    # Report test metrics (sklearn convention: y_true first, then y_pred)
    print(f"Accuracy Score: {accuracy_score(y_test, test_prediction)*100:.2f}")
    print(f"Precision Score: {precision_score(y_test, test_prediction)*100:.2f}")
    print(f"Recall Score: {recall_score(y_test, test_prediction)*100:.2f}")
    print(f"ROC AUC Score: {roc_auc_score(y_test, test_prediction)*100:.2f}")
    print("\nConfusion Matrix:\n", confusion_matrix(y_test, test_prediction))
fit_and_print(p1)
Accuracy Score: 87.21
Precision Score: 87.61
Recall Score: 87.40
ROC AUC Score: 87.21
Confusion Matrix:
[[1400 209]
[ 213 1478]]
C:\Users\HP\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\linear_model\_logistic.py:764: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
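The ConvergenceWarning above can usually be resolved by giving lbfgs more iterations; a minimal sketch (max_iter=1000 is an illustrative value, not tuned):
# Rebuild the pipeline with a higher iteration cap for the solver
p1 = Pipeline([('scaler', preprocess_transformer),
               ('Logistic Regression', LogisticRegression(max_iter=1000))])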
balanced_df.drop("Bankruptdesc", axis=1, inplace=True)
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Separate features from target
X = balanced_df.iloc[:,1:]
y = balanced_df.iloc[:,0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
X_train, X_test, y_train, y_test = X_train.values, X_test.values, y_train.values, y_test.values
# Re-fit the scaler on this split; the pipeline's FunctionTransformer picks
# this up because preprocessor reads std_scaler from the global scope
std_scaler = StandardScaler().fit(X_train)
def preprocessor(X):
    D = np.copy(X)
    D = std_scaler.transform(D)
    return D
p2 = Pipeline([('scaler', preprocess_transformer),
('Logistic Regression', LogisticRegression())])
p2
Pipeline(steps=[('scaler',
FunctionTransformer(func=<function preprocessor at 0x0000027695A16438>)),
('Logistic Regression', LogisticRegression())])
balanced_df
 | Bankrupt? | ROA(C) before interest and depreciation before interest | ROA(A) before interest and % after tax | ROA(B) before interest and depreciation after tax | Operating Gross Margin | Realized Sales Gross Margin | Operating Profit Rate | Pre-tax net Interest Rate | After-tax net Interest Rate | Non-industry income and expenditure/revenue | ... | Net Income to Total Assets | Total assets to GNP price | No-credit Interval | Gross Profit to Sales | Net Income to Stockholder's Equity | Liability to Equity | Degree of Financial Leverage (DFL) | Interest Coverage Ratio (Interest expense to EBIT) | Net Income Flag | Equity to Liability
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
6 | 0 | 0.390923 | 0.445704 | 0.436158 | 0.619950 | 0.619950 | 0.998993 | 0.797012 | 0.808960 | 0.302814 | ... | 0.736619 | 0.018372 | 0.623655 | 0.619949 | 0.829980 | 0.292504 | 0.026622 | 0.564200 | 1 | 0.015663 |
7 | 0 | 0.508361 | 0.570922 | 0.559077 | 0.601738 | 0.601717 | 0.999009 | 0.797449 | 0.809362 | 0.303545 | ... | 0.815350 | 0.010005 | 0.623843 | 0.601739 | 0.841459 | 0.278607 | 0.027031 | 0.566089 | 1 | 0.034889 |
8 | 0 | 0.488519 | 0.545137 | 0.543284 | 0.603612 | 0.603612 | 0.998961 | 0.797414 | 0.809338 | 0.303584 | ... | 0.803647 | 0.000824 | 0.623977 | 0.603613 | 0.840487 | 0.276423 | 0.026891 | 0.565592 | 1 | 0.065826 |
9 | 0 | 0.495686 | 0.550916 | 0.542963 | 0.599209 | 0.599209 | 0.999001 | 0.797404 | 0.809320 | 0.303483 | ... | 0.804195 | 0.005798 | 0.623865 | 0.599205 | 0.840688 | 0.279388 | 0.027243 | 0.566668 | 1 | 0.030801 |
10 | 0 | 0.482475 | 0.567543 | 0.538198 | 0.614026 | 0.614026 | 0.998978 | 0.797535 | 0.809460 | 0.303759 | ... | 0.814111 | 0.076972 | 0.623687 | 0.614021 | 0.841337 | 0.278356 | 0.026971 | 0.565892 | 1 | 0.036572 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3595 | 1 | 0.102325 | 0.121511 | 0.112212 | 0.585876 | 0.585876 | 0.998786 | 0.796917 | 0.808863 | 0.303082 | ... | 0.481836 | 0.000692 | 0.623637 | 0.585875 | 0.798728 | 0.287020 | 0.026771 | 0.565059 | 1 | 0.018077 |
5 | 1 | 0.388680 | 0.415177 | 0.419134 | 0.590171 | 0.590251 | 0.998758 | 0.796903 | 0.808771 | 0.303116 | ... | 0.710420 | 0.005278 | 0.622605 | 0.590172 | 0.829939 | 0.285087 | 0.026675 | 0.564538 | 1 | 0.019534 |
2908 | 1 | 0.446497 | 0.495475 | 0.493763 | 0.606473 | 0.606423 | 0.998907 | 0.797282 | 0.809207 | 0.303468 | ... | 0.768935 | 0.000564 | 0.623540 | 0.606475 | 0.836306 | 0.287988 | 0.026529 | 0.563500 | 1 | 0.017506 |
2001 | 1 | 0.438795 | 0.090166 | 0.464586 | 0.540776 | 0.540776 | 0.997789 | 0.790787 | 0.802967 | 0.294457 | ... | 0.411809 | 0.011098 | 0.625487 | 0.540775 | 0.996912 | 0.209222 | 0.026779 | 0.565098 | 1 | 0.008753 |
105 | 1 | 0.504363 | 0.562255 | 0.555330 | 0.604859 | 0.604859 | 0.999060 | 0.797438 | 0.809347 | 0.303419 | ... | 0.805041 | 0.001932 | 0.623060 | 0.604856 | 0.841229 | 0.286906 | 0.027821 | 0.567646 | 1 | 0.018150 |
13198 rows × 96 columns
fit_and_print(p2)
Accuracy Score: 87.58
Precision Score: 87.88
Recall Score: 87.88
ROC AUC Score: 87.57
Confusion Matrix:
[[1404 205]
[ 205 1486]]
C:\Users\HP\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\linear_model\_logistic.py:764: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
X_test.shape
(3300, 95)
from sklearn.ensemble import RandomForestClassifier
p22 = Pipeline([('scaler', preprocess_transformer),
('RFC', RandomForestClassifier())])
fit_and_print(p22)
Accuracy Score: 99.52
Precision Score: 99.06
Recall Score: 100.00
ROC AUC Score: 99.50
Confusion Matrix:
[[1593 16]
[ 0 1691]]
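Scores this close to perfect should be read with caution: the data were over-sampled before the train/test split, so duplicated minority rows can appear in both sets and inflate the test metrics. A leakage-free sketch would resample only the training fold (variable names here are illustrative):
# Split first, then over-sample the training portion only
X_raw, y_raw = df.drop(columns=["Bankrupt?", "Bankruptdesc"]), df["Bankrupt?"]
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, test_size=0.25, random_state=42)
X_tr_res, y_tr_res = RandomOverSampler(random_state=42).fit_resample(X_tr, y_tr)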
from sklearn.ensemble import RandomForestClassifier
p2 = Pipeline([('scaler', preprocess_transformer),
('RFC', RandomForestClassifier(max_depth=4, n_estimators=30))])
fit_and_print(p2)
Accuracy Score: 90.24
Precision Score: 89.91
Recall Score: 91.19
ROC AUC Score: 90.22
Confusion Matrix:
[[1436 173]
[ 149 1542]]
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
# X and y still hold the balanced features and target from the cells above
# Create the RandomForestClassifier and StandardScaler objects
rnd_clf = RandomForestClassifier()
sc = StandardScaler()
# Record the feature names before scaling turns X into a plain ndarray;
# using X's own columns avoids an off-by-one against df, whose first
# column is the "Bankrupt?" target
feature_names = X.columns.to_list()
# Scale the input features
X = sc.fit_transform(X)
# Train the RandomForestClassifier
rnd_clf.fit(X, y)
# Print feature importances in descending order
importances = rnd_clf.feature_importances_
sorted_indices = sorted(range(len(importances)), key=lambda k: importances[k], reverse=True)
for index in sorted_indices:
    print(f"{feature_names[index]}: {importances[index]}")
Continuous interest rate (after tax): 0.05581583133371924
Persistent EPS in the Last Four Seasons: 0.05127552474984275
Debt ratio %: 0.04510590011508451
Borrowing dependency: 0.040205099670874656
Net profit before tax/Paid-in capital: 0.0384035785531207
Total income/Total expense: 0.03804893278667903
Retained Earnings to Total Assets: 0.037189874259123276
Net worth/Assets: 0.02814477129176818
Equity to Liability: 0.027159162804339124
Per Share Net profit before tax (Yuan ¥): 0.026960313747819837
Net Income to Stockholder's Equity: 0.026863400995622202
Degree of Financial Leverage (DFL): 0.02632049467460024
Non-industry income and expenditure/revenue: 0.02592208994755924
Net Income to Total Assets: 0.02592101077174149
Interest Expense Ratio: 0.023257273906572683
Interest Coverage Ratio (Interest expense to EBIT): 0.022292443344409607
ROA(C) before interest and depreciation before interest: 0.022214192498964204
Liability to Equity: 0.02134846021818064
After-tax net Interest Rate: 0.018894298248464496
Current Liabilities/Equity: 0.015643743106852622
Net Value Per Share (A): 0.014712632416698672
Current Liability to Equity: 0.013734634722013831
Cash/Total Assets: 0.01310294774317883
ROA(B) before interest and depreciation after tax: 0.013026389526470528
Inventory/Working Capital: 0.012531769139480757
Net Value Per Share (B): 0.01181761887173788
Operating Profit Rate: 0.01157667594570073
Pre-tax net Interest Rate: 0.011566053174528398
Current Liability to Current Assets: 0.01086834411911465
Current Liability to Assets: 0.01040591060776479
ROA(A) before interest and % after tax: 0.010015778794580215
Total Asset Growth Rate: 0.00996676521002102
Working Capital to Total Assets: 0.00981219257267893
Operating profit per person: 0.009803281684947658
Net Value Per Share (C): 0.009787590398113331
Operating profit/Paid-in capital: 0.008899426653385133
Working capitcal Turnover Rate: 0.008407858030713575
Total Asset Turnover: 0.007195059341598852
Operating Gross Margin: 0.007168925576983346
Operating Funds to Liability: 0.006849976556819282
Realized Sales Gross Margin: 0.006788063272947664
Equity to Long-term Liability: 0.006729061670668956
Working Capital/Equity: 0.006469173424157348
Cash Turnover Rate: 0.006059759738521245
Total expense/Assets: 0.006003947112408653
No-credit Interval: 0.005725420896322807
Cash flow rate: 0.005588325652771956
Current Liability to Liability: 0.005540592889583615
Quick Assets/Total Assets: 0.005507209730990111
Cash Flow to Total Assets: 0.005480093967858423
Cash Flow to Sales: 0.005239145083133424
Regular Net Profit Growth Rate: 0.00516779238034476
Cash Flow to Liability: 0.005100973114940835
CFO to Assets: 0.004986466786373029
Long-term fund suitability ratio (A): 0.004966563027644056
Inventory and accounts receivable/Net value: 0.004952164909168239
Realized Sales Gross Profit Growth Rate: 0.0047903135410478845
Net Worth Turnover Rate (times): 0.00477929232969892
Continuous Net Profit Growth Rate: 0.004713533765965679
After-tax Net Profit Growth Rate: 0.004561468790103572
Operating Profit Growth Rate: 0.0045495232481785565
Current Assets/Total Assets: 0.004515883867112918
Current Liabilities/Liability: 0.004479512312303007
Tax rate (A): 0.004397359956932059
Cash Flow Per Share: 0.00435642694762858
Cash Flow to Equity: 0.004345100762480923
Total Asset Return Growth Rate Ratio: 0.0043411882382223075
Cash Reinvestment %: 0.0041107588835234
Operating Profit Per Share (Yuan ¥): 0.003885813915934481
Quick Asset Turnover Rate: 0.0038043468306167482
Gross Profit to Sales: 0.003542723595875594
Research and development expense rate: 0.00317019917833392
Contingent liabilities/Net worth: 0.0031413045419915746
Inventory Turnover Rate (times): 0.0025001855513587406
Fixed Assets Turnover Frequency: 0.0023565274735233817
Operating Expense Rate: 0.002004277695889444
Current Asset Turnover Rate: 0.0009536195584182139
Inventory/Current Liability: 0.0008973610747245117
Cash/Current Liability: 0.0003419753382132506
Accounts Receivable Turnover: 0.00029035859699845317
Interest-bearing debt interest rate: 0.00018624042956620734
Total debt/Total net worth: 0.00017523469075273607
Long-term Liability to Current Assets: 0.00016069894141924193
Total assets to GNP price: 4.0149466725760034e-05
Liability-Assets Flag: 1.7093508828299946e-05
Quick Ratio: 1.690491027944913e-05
Allocation rate per person: 1.3493418750803063e-05
Quick Assets/Current Liability: 1.1559627304396454e-05
Revenue per person: 5.752951860284022e-06
Revenue Per Share (Yuan ¥): 2.8342897323806707e-06
Net Value Growth Rate: 0.0
Current Ratio: 0.0
Average Collection Days: 0.0
Fixed Assets to Assets: 0.0
Net Income Flag: 0.0
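Given how concentrated the importance mass is, one could retrain on a reduced feature set; a minimal sketch (the top-20 cutoff is arbitrary, not tuned):
# Select the 20 highest-importance features by name
top_features = [feature_names[i] for i in sorted_indices[:20]]
X_reduced = balanced_df[top_features]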
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC(kernel='linear', random_state=0)
dec_clf = DecisionTreeClassifier()
voting_clf = VotingClassifier(
estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf), ('dec', dec_clf)],
voting='hard')
p3 = Pipeline([('scaler', preprocess_transformer),
('VCL', voting_clf)])
fit_and_print(p3)
C:\Users\HP\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\linear_model\_logistic.py:764: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
Accuracy Score: 93.55
Precision Score: 97.19
Recall Score: 90.01
ROC AUC Score: 93.64
Confusion Matrix:
[[1565 44]
[ 169 1522]]
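Hard voting tallies predicted labels; a soft-voting variant averages class probabilities instead, which requires every estimator to expose predict_proba (hence probability=True on the SVC; max_iter is raised to avoid the earlier convergence warning). A minimal sketch:
# Soft voting averages predict_proba outputs across estimators
soft_clf = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('rf', RandomForestClassifier()),
                ('svc', SVC(kernel='linear', probability=True, random_state=0))],
    voting='soft')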
from sklearn.ensemble import BaggingClassifier
bag_clf = BaggingClassifier(
    SVC(kernel='linear', random_state=0), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1, oob_score=True)
p4 = Pipeline([('scaler', preprocess_transformer),
               ('BagCL', bag_clf)])
fit_and_print(p4)
Accuracy Score: 87.24
Precision Score: 86.79
Recall Score: 88.59
ROC AUC Score: 87.21
Confusion Matrix:
[[1381 228]
[ 193 1498]]
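Because the bagging classifier was built with oob_score=True, an out-of-bag accuracy estimate is available on the fitted step without touching the test set; a minimal sketch (assuming p4 was fitted as above):
# Out-of-bag estimate from the fitted bagging step
print(p4.named_steps['BagCL'].oob_score_)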
from sklearn.ensemble import AdaBoostClassifier
ada_clf = AdaBoostClassifier(
DecisionTreeClassifier(max_depth=1), n_estimators=200,
algorithm="SAMME.R", learning_rate=0.5)
p5 = Pipeline([('scaler', preprocess_transformer),
('AdaCL', ada_clf)])
fit_and_print(p5)
Accuracy Score: 95.45
Precision Score: 94.10
Recall Score: 97.22
ROC AUC Score: 95.41
Confusion Matrix:
[[1506 103]
[ 47 1644]]
from sklearn.ensemble import GradientBoostingClassifier
p6 = Pipeline([('scaler', preprocess_transformer),
('GBC', GradientBoostingClassifier())])
fit_and_print(p6)
Accuracy Score: 95.88
Precision Score: 93.41
Recall Score: 98.94
ROC AUC Score: 95.80
Confusion Matrix:
[[1491 118]
[ 18 1673]]
from xgboost import XGBClassifier
p7 = Pipeline([('scaler', preprocess_transformer),
('XGBC', XGBClassifier())])
fit_and_print(p7)
Accuracy Score: 99.33
Precision Score: 98.72
Recall Score: 100.00
ROC AUC Score: 99.32
Confusion Matrix:
[[1587 22]
[ 0 1691]]
# Standardize the features
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Fit a random forest classifier on the scaled features
rnd_clf = RandomForestClassifier()
rnd_clf.fit(X_train, y_train)
y_pred = rnd_clf.predict(X_test)
y_pred_proba = rnd_clf.predict_proba(X_test)
# Create a DataFrame for y_test and y_pred
data_dict = {"Actual": y_test, "Prediction": y_pred}
results_df = pd.DataFrame(data_dict)
results_df
 | Actual | Prediction
---|---|---
0 | 0 | 0 |
1 | 1 | 1 |
2 | 0 | 0 |
3 | 1 | 1 |
4 | 0 | 0 |
... | ... | ... |
3295 | 1 | 1 |
3296 | 1 | 1 |
3297 | 0 | 0 |
3298 | 1 | 1 |
3299 | 1 | 1 |
3300 rows × 2 columns
Confusion Matrix:
- A confusion matrix provides a tabular summary of the performance of a classification algorithm.
- It compares the predicted values against the actual values, breaking them down into true positives, true negatives, false positives, and false negatives.
- You can use libraries like scikit-learn to compute and visualize the confusion matrix.
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
ROC Curve:
- The ROC curve is a graphical representation of the trade-off between true positive rate (sensitivity) and false positive rate (1 − specificity).
- It is useful for assessing the performance of a classification model at various threshold settings.
- Scikit-learn provides functions to compute and plot ROC curves.
from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba[:, 1])
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label=f'AUC = {roc_auc:.2f}')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()
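The same fpr/tpr arrays can also be used to pick an operating threshold, for instance the one maximizing Youden's J statistic (TPR − FPR); a minimal sketch:
# Index of the threshold that maximizes TPR - FPR
best = np.argmax(tpr - fpr)
print(f"Best threshold: {thresholds[best]:.3f} (TPR={tpr[best]:.2f}, FPR={fpr[best]:.2f})")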
Precision-Recall Curve:
- The precision-recall curve is another way to assess the performance of a binary classification model, particularly when dealing with imbalanced datasets.
- It plots precision against recall for different thresholds.
- Scikit-learn provides functions to compute and plot precision-recall curves.
from sklearn.metrics import precision_recall_curve, average_precision_score
# Use the positive-class probabilities (column 1), as for the ROC curve
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba[:, 1])
avg_precision = average_precision_score(y_test, y_pred_proba[:, 1])
plt.plot(recall, precision, label=f'Avg. Precision = {avg_precision:.2f}')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc='upper right')
plt.show()
In the context of detecting rare cases of burglary in a supermarket, it's generally more important to have a high recall than a high precision. Here's why:
- Recall (Sensitivity or True Positive Rate): Recall measures the ability of a model to capture all the relevant instances of the positive class (burglary in this case) among all actual positive instances. High recall means the model is good at identifying most of the actual cases of burglary.
  - Importance: In the context of detecting rare events like burglary, missing a true positive (failing to detect an actual case) can have severe consequences. High recall minimizes false negatives, the instances where the model fails to identify an actual case of burglary.
- Precision (Positive Predictive Value): Precision measures the accuracy of the model when it predicts the positive class. It is the ratio of true positives to the total number of predicted positives. High precision means that when the model predicts a positive case, it is likely to be correct.
  - Importance: While precision is important, it may be acceptable to have some false positives (incorrectly predicting burglary) as long as recall is high. False positives might lead to inconvenience or additional investigation, but they are generally less critical than missing actual cases (false negatives) when detecting rare events.
In summary, for the specific scenario of detecting rare cases of burglary in a supermarket, prioritize high recall to ensure the model identifies as many actual cases as possible, even at the cost of a certain level of false positives. In practice this trade-off is often controlled through the decision threshold, as sketched below.
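One concrete way to trade precision for recall is to lower the decision threshold on the predicted probabilities instead of using the default 0.5; a minimal sketch on the bankruptcy model above (the 0.3 cutoff is illustrative, not tuned):
# Flag a case as positive already at a 30% predicted probability
y_pred_high_recall = (y_pred_proba[:, 1] >= 0.3).astype(int)
print(recall_score(y_test, y_pred_high_recall), precision_score(y_test, y_pred_high_recall))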
Precision becomes important in the context of the supermarket example when the cost or consequences associated with false positives (incorrectly predicting burglary) are high. Here are some scenarios where precision becomes more crucial:
- Resource Allocation: If the supermarket security or management has limited resources for investigating or responding to potential burglary incidents, high precision is essential. High precision ensures that when the model predicts a burglary, it is more likely to be a true positive, reducing the resources wasted on false alarms.
- Customer Experience: False alarms or unnecessary security interventions can inconvenience customers and impact their shopping experience. If precision is high, the likelihood of causing unnecessary disruptions to regular customers is reduced, leading to a better overall customer experience.
- Legal Implications: False accusations of burglary can have legal consequences. High precision is crucial to minimize the risk of accusing innocent individuals or taking actions based on false positives that may lead to legal issues.
- Cost of Investigation: Investigating potential burglary incidents, even those that turn out to be false alarms, incurs costs. High precision reduces the number of false positives, lowering the overall cost of unnecessary investigations and interventions.
In summary, precision is particularly important when the cost, inconvenience, or potential negative consequences associated with false positives are high. Balancing precision and recall is a trade-off, and the choice depends on the specific priorities and constraints of the supermarket's security objectives and operational considerations.