0. Purpose of the Project¶

The Corporate Credit Rating evaluates the likelihood of a company repaying its debt, i.e., its default probability. This notebook is to compare several models predicting corporate credit ratings. This ratings determines the cost of capital, and aids in investment decision-making. Also, it helps financial institutions (banks, insurance companies, etc.) manage financial risks and contributes to market stability.

This rating is done by credit rating agencies such as S&P, Moody's and Fitch. They represent credit risk using similar rating scales, i.e., AAA, AA, A, BBB. If they downgrade in rating, bond price decreases and interest rate increases. In constrast, if they upgrade in rating, bond price increases and interest rate decreases. Their impact on financial market is so significant that more funds flow to companies with higher credit ratings and various macroeconomic indicators are also influenced.

The purpose of the project is to compare several models for corporate credit rating and suggest the best model for it. The data is originated from Kaggle competition Corporate Credit Rating. There are 2029 credit ratings from major agencies like S&P, which are evaluated between 2010 and 2016. The subject of the rating is the companies listed on Nasdaq and NYSE, and there is no missing value.

1. Install and Import Packages¶

In this data analysis, we need several packages for data imbalance and other algorithms such as xgboost and lightgbm. After then, we import packages from sklearn, imblearn,xgboost, lightgbm, and other basic packages such as pandas and matplotlib.pyplot.

In [ ]:

pip install imblearn

Requirement already satisfied: imblearn in c:\users\user\anaconda3\lib\site-packages (0.0)Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: imbalanced-learn in c:\users\user\anaconda3\lib\site-packages (from imblearn) (0.12.4)
Requirement already satisfied: scikit-learn>=1.0.2 in c:\users\user\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (1.0.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\user\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (2.2.0)
Requirement already satisfied: joblib>=1.1.1 in c:\users\user\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (1.4.2)
Requirement already satisfied: scipy>=1.5.0 in c:\users\user\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (1.7.3)
Requirement already satisfied: numpy>=1.17.3 in c:\users\user\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (1.22.0)

In [ ]:

pip install xgboost

Requirement already satisfied: xgboost in c:\users\user\anaconda3\lib\site-packages (2.1.3)
Requirement already satisfied: scipy in c:\users\user\anaconda3\lib\site-packages (from xgboost) (1.7.3)
Requirement already satisfied: numpy in c:\users\user\anaconda3\lib\site-packages (from xgboost) (1.22.0)
Note: you may need to restart the kernel to use updated packages.

In [ ]:

pip install lightgbm

Requirement already satisfied: lightgbm in c:\users\user\anaconda3\lib\site-packages (4.5.0)
Requirement already satisfied: scipy in c:\users\user\anaconda3\lib\site-packages (from lightgbm) (1.7.3)
Requirement already satisfied: numpy>=1.17.0 in c:\users\user\anaconda3\lib\site-packages (from lightgbm) (1.22.0)
Note: you may need to restart the kernel to use updated packages.

In [ ]:

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from imblearn.over_sampling import ADASYN

from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
import lightgbm as lgb

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

2. EDA & Data Preprocess¶

The data corporate_rating.csv has 31 columns, including 6 categorical and 25 numerical data.

The numerical data can be categorized into 4 section. First one is Liquidity, the ability of assets to be converted into cash. This consists of such variables as currentRatio, quickRatio, cashRatio, daysOfSalesOutstanding... Second one is Profitability Indicator, including grossProfitMargin, operatingProfitMargin, pretaxProfitMargin... The third one is Debt indicator, such as debtRatio, debtEquityRatio ... The last section is Cash Flow, which consists of operatingCashFlowPerShare, freeCashFlowPerShare, cashPerShare... There are some other columns not in those sections, such as assetTurnover (asset turnover ratio).

The rest of 6 categorical data are comprised of Rating, Name, Symbol, Rating Agency Name, Date, and Sector.

In [ ]:

cor_rate_data = pd.read_csv('corporate_rating.csv')
cor_rate_data.head()

Out[ ]:

	Rating	Name	Symbol	Rating Agency Name	Date	Sector	currentRatio	quickRatio	cashRatio	daysOfSalesOutstanding	...	effectiveTaxRate	freeCashFlowOperatingCashFlowRatio	freeCashFlowPerShare	cashPerShare	companyEquityMultiplier	ebitPerRevenue	enterpriseValueMultiple	operatingCashFlowPerShare	operatingCashFlowSalesRatio	payablesTurnover
0	A	Whirlpool Corporation	WHR	Egan-Jones Ratings Company	11/27/2015	Consumer Durables	0.945894	0.426395	0.099690	44.203245	...	0.202716	0.437551	6.810673	9.809403	4.008012	0.049351	7.057088	15.565438	0.058638	3.906655
1	BBB	Whirlpool Corporation	WHR	Egan-Jones Ratings Company	2/13/2014	Consumer Durables	1.033559	0.498234	0.203120	38.991156	...	0.074155	0.541997	8.625473	17.402270	3.156783	0.048857	6.460618	15.914250	0.067239	4.002846
2	BBB	Whirlpool Corporation	WHR	Fitch Ratings	3/6/2015	Consumer Durables	0.963703	0.451505	0.122099	50.841385	...	0.214529	0.513185	9.693487	13.103448	4.094575	0.044334	10.491970	18.888889	0.074426	3.483510
3	BBB	Whirlpool Corporation	WHR	Fitch Ratings	6/15/2012	Consumer Durables	1.019851	0.510402	0.176116	41.161738	...	1.816667	-0.147170	-1.015625	14.440104	3.630950	-0.012858	4.080741	6.901042	0.028394	4.581150
4	BBB	Whirlpool Corporation	WHR	Standard & Poor's Ratings Services	10/24/2016	Consumer Durables	0.957844	0.495432	0.141608	47.761126	...	0.166966	0.451372	7.135348	14.257556	4.012780	0.053770	8.293505	15.808147	0.058065	3.857790

5 rows × 31 columns

(1) Rating Agency¶

There are 5 rating agencies, including S&P, Egan-Jones, Moody's, Fitch, and DBRS.

In [ ]:

cor_rate_data['Rating Agency Name'].value_counts().to_frame()

Out[ ]:

	Rating Agency Name
Standard & Poor's Ratings Services	744
Egan-Jones Ratings Company	603
Moody's Investors Service	579
Fitch Ratings	100
DBRS	3

The following mapping code is to simplify the name of the rating agencies so that its barplot becomes tidy.

In [ ]:

agency_mapping = {"Standard & Poor's Ratings Services" : "S&P",
                  "Egan-Jones Ratings Company" : "Egan-Jones",
                  "Moody's Investors Service" : "Moody's",
                  "Fitch Ratings" : "Fitch",
                  "DBRS" : "DBRS"}

cor_rate_data['Rating Agency Name'] = cor_rate_data['Rating Agency Name'].map(agency_mapping)
cor_rate_data['Rating Agency Name'].value_counts().to_frame()

Out[ ]:

	Rating Agency Name
S&P	744
Egan-Jones	603
Moody's	579
Fitch	100
DBRS	3

The evaluations of the companies are mainly done by 3 agencies; S&P, Egan-Jones, and Moody's.

In [ ]:

plt.figure(figsize = (10,6))
cor_rate_data['Rating Agency Name'].value_counts().plot(kind = 'bar',
                                                        color=plt.cm.tab20.colors[:len(cor_rate_data['Rating Agency Name'].value_counts())])
plt.title('Frequency of Rating Agencies', fontsize = 14)
plt.xlabel('Name of Agencies')
plt.ylabel('Counts')
plt.show()

Instead of deleting the agency column, I determined to check how the agencies affect the rating of the companies. However, simply encoding the agencies based on the number of agencies (0 to 4) may not fit the conditions of some models such as LDA and QDA, because they require the explanatory variables to follow normal distribution. So, I generated dummy variables (one-hot encoding) for rating agencies.

In [ ]:

cor_rate_data = pd.get_dummies(cor_rate_data,
                               columns = ['Rating Agency Name'],
                               prefix = 'Agency',
                               drop_first=True,
                               dtype = float)
cor_rate_data.head()

Out[ ]:

	Rating	Name	Symbol	Date	Sector	currentRatio	quickRatio	cashRatio	daysOfSalesOutstanding	netProfitMargin	...	companyEquityMultiplier	ebitPerRevenue	enterpriseValueMultiple	operatingCashFlowPerShare	operatingCashFlowSalesRatio	payablesTurnover	Agency_Egan-Jones	Agency_Fitch	Agency_S&P
0	A	Whirlpool Corporation	WHR	11/27/2015	Consumer Durables	0.945894	0.426395	0.099690	44.203245	0.037480	...	4.008012	0.049351	7.057088	15.565438	0.058638	3.906655	1.0	0.0	0.0
1	BBB	Whirlpool Corporation	WHR	2/13/2014	Consumer Durables	1.033559	0.498234	0.203120	38.991156	0.044062	...	3.156783	0.048857	6.460618	15.914250	0.067239	4.002846	1.0	0.0	0.0
2	BBB	Whirlpool Corporation	WHR	3/6/2015	Consumer Durables	0.963703	0.451505	0.122099	50.841385	0.032709	...	4.094575	0.044334	10.491970	18.888889	0.074426	3.483510	0.0	1.0	0.0
3	BBB	Whirlpool Corporation	WHR	6/15/2012	Consumer Durables	1.019851	0.510402	0.176116	41.161738	0.020894	...	3.630950	-0.012858	4.080741	6.901042	0.028394	4.581150	0.0	1.0	0.0
4	BBB	Whirlpool Corporation	WHR	10/24/2016	Consumer Durables	0.957844	0.495432	0.141608	47.761126	0.042861	...	4.012780	0.053770	8.293505	15.808147	0.058065	3.857790	0.0	0.0	1.0

5 rows × 34 columns

(2) Sectors and other categorical variables¶

Following code and barplot show the number of sectors, including energy, health care, finance etc.

In [ ]:

cor_rate_data['Sector'].value_counts().to_frame()

Out[ ]:

	Sector
Energy	294
Basic Industries	260
Consumer Services	250
Technology	234
Capital Goods	233
Public Utilities	211
Health Care	171
Consumer Non-Durables	132
Consumer Durables	74
Transportation	63
Miscellaneous	57
Finance	50

In [ ]:

plt.figure(figsize = (12,6))
cor_rate_data['Sector'].value_counts().plot(kind = 'bar', color = 'orange')
plt.title('Distribution Of Sectors')
plt.ylabel('Counts')
plt.show()

Including 'Sector' column, I deleted rest of the categorical variables except for 'Rating'.

In [ ]:

cor_rate_data = cor_rate_data.drop(columns = ['Sector','Name','Date', 'Symbol'])
cor_rate_data.head()

Out[ ]:

	Rating	currentRatio	quickRatio	cashRatio	daysOfSalesOutstanding	netProfitMargin	pretaxProfitMargin	grossProfitMargin	operatingProfitMargin	returnOnAssets	...	companyEquityMultiplier	ebitPerRevenue	enterpriseValueMultiple	operatingCashFlowPerShare	operatingCashFlowSalesRatio	payablesTurnover	Agency_Egan-Jones	Agency_Fitch	Agency_S&P
0	A	0.945894	0.426395	0.099690	44.203245	0.037480	0.049351	0.176631	0.061510	0.041189	...	4.008012	0.049351	7.057088	15.565438	0.058638	3.906655	1.0	0.0	0.0
1	BBB	1.033559	0.498234	0.203120	38.991156	0.044062	0.048857	0.175715	0.066546	0.053204	...	3.156783	0.048857	6.460618	15.914250	0.067239	4.002846	1.0	0.0	0.0
2	BBB	0.963703	0.451505	0.122099	50.841385	0.032709	0.044334	0.170843	0.059783	0.032497	...	4.094575	0.044334	10.491970	18.888889	0.074426	3.483510	0.0	1.0	0.0
3	BBB	1.019851	0.510402	0.176116	41.161738	0.020894	-0.012858	0.138059	0.042430	0.025690	...	3.630950	-0.012858	4.080741	6.901042	0.028394	4.581150	0.0	1.0	0.0
4	BBB	0.957844	0.495432	0.141608	47.761126	0.042861	0.053770	0.177720	0.065354	0.046363	...	4.012780	0.053770	8.293505	15.808147	0.058065	3.857790	0.0	0.0	1.0

5 rows × 30 columns

(3) Ratings & Oversampling through ADASYN¶

The following table shows the number of ratings. While some ratings including BBB,BB,A have sufficiently many numbers, others such as AAA, CC, C, and D have significantly low number: 7,5,2, and 1.

In [ ]:

rating_order = ['AAA', 'AA', 'A', 'BBB', 'BB', 'B', 'CCC', 'CC', 'C' , 'D' ]
cor_rate_data['Rating'].value_counts().reindex(rating_order).to_frame().T

Out[ ]:

	AAA	AA	A	BBB	BB	B	CCC	CC	C	D
Rating	7	89	398	671	490	302	64	5	2	1

So, my purpose is to oversample the data of extremely low number of ratings. Before oversampling, I substituted the characters into numbers; AAA to 0, AA to 1 and so on. The oversampling method do not work well when there is only 1 data in a category, like D. So, I assigned same number to D as C. The number of companies that are classified as BBB was the largest; 671 companies are rated as BBB. So I tried to adjust all the ratings to have 671 data respectively.

In [ ]:

rating_mapping = {'AAA': 0, 'AA': 1, 'A': 2, 'BBB' : 3, 'BB' : 4, 'B' : 5, 'CCC' : 6, 'CC' : 7, 'C' : 8, 'D' : 8}
cor_rate_data['Rating'] = cor_rate_data['Rating'].map(rating_mapping).fillna(method='ffill')
cor_rate_data.head()

Out[ ]:

	Rating	currentRatio	quickRatio	cashRatio	daysOfSalesOutstanding	netProfitMargin	pretaxProfitMargin	grossProfitMargin	operatingProfitMargin	returnOnAssets	...	companyEquityMultiplier	ebitPerRevenue	enterpriseValueMultiple	operatingCashFlowPerShare	operatingCashFlowSalesRatio	payablesTurnover	Agency_Egan-Jones	Agency_Fitch	Agency_S&P
0	2	0.945894	0.426395	0.099690	44.203245	0.037480	0.049351	0.176631	0.061510	0.041189	...	4.008012	0.049351	7.057088	15.565438	0.058638	3.906655	1.0	0.0	0.0
1	3	1.033559	0.498234	0.203120	38.991156	0.044062	0.048857	0.175715	0.066546	0.053204	...	3.156783	0.048857	6.460618	15.914250	0.067239	4.002846	1.0	0.0	0.0
2	3	0.963703	0.451505	0.122099	50.841385	0.032709	0.044334	0.170843	0.059783	0.032497	...	4.094575	0.044334	10.491970	18.888889	0.074426	3.483510	0.0	1.0	0.0
3	3	1.019851	0.510402	0.176116	41.161738	0.020894	-0.012858	0.138059	0.042430	0.025690	...	3.630950	-0.012858	4.080741	6.901042	0.028394	4.581150	0.0	1.0	0.0
4	3	0.957844	0.495432	0.141608	47.761126	0.042861	0.053770	0.177720	0.065354	0.046363	...	4.012780	0.053770	8.293505	15.808147	0.058065	3.857790	0.0	0.0	1.0

5 rows × 30 columns

Let's delve deeper into ADASYN. This method requires the number of nearest neighbors k (K-nearest neighbors). In this analysis, I set the number k = 2. In ADASYN, the majority class refers to the class with largest amount of data among all other classes, while minority class means the rest of the classes. For the $i$-th data point $x_i$ in the minority class, calculate the ratio: $$r_i = \frac{\text{The number of majority-class neighbors of }x_i}{k}\in[0,1]$$ Then, normalize $r_i$: $g_i = \frac{r_i}{\sum_{j\in\text{minority group}} r_j}$, and determine the number of synthetic samples to generate, $G_i$: $$G_i = g_i*G$$ where $G =$ (Number of majority class samples - Number of minority class samples). For each $x_i$, randomly select one of its $k$ neighbors $x_*$: $$x_{new} = x_i + \lambda(x_* - x_i)$$ where $\lambda \sim \text{Unif}(0,1)$ (interpolation).

In [ ]:

X = cor_rate_data.drop(columns=['Rating'])
y = cor_rate_data['Rating']
sampling_strategy = {label: 671 for label in set(y)}

adasyn = ADASYN(sampling_strategy = sampling_strategy, n_neighbors=2, random_state=2023311161)
X_resampled, y_resampled = adasyn.fit_resample(X, y)

y_resampled.value_counts().to_frame()

Out[ ]:

	Rating
5	721
4	690
6	672
8	672
3	671
0	671
7	670
2	666
1	654

The following code and barplot show the total number of ratings.

In [ ]:

rating_remapping = {value: key for key, value in rating_mapping.items()} # switch key and value of rating_mapping dictionary
rating_remapping[8] = 'C'

rating_order2 = [x for x in rating_order if x != 'D']
y_resampled_rating = y_resampled.map(rating_remapping)
y_resampled_rating.value_counts().reindex(rating_order2).to_frame().T

Out[ ]:

	AAA	AA	A	BBB	BB	B	CCC	CC	C
Rating	671	654	666	671	690	721	672	670	672

In [ ]:

plt.figure(figsize = (12,6))
y_resampled_rating.value_counts().reindex(rating_order2).plot(kind = 'bar')
plt.ylabel('counts')
plt.title('The Number of Ratings')
plt.show()

Now, combine the resampled rating data and resampled numerical (explanatory) data.

In [ ]:

cor_rate_data_adasyn = pd.concat([y_resampled, X_resampled], axis = 1)
cor_rate_data_adasyn.head()

Out[ ]:

	Rating	currentRatio	quickRatio	cashRatio	daysOfSalesOutstanding	netProfitMargin	pretaxProfitMargin	grossProfitMargin	operatingProfitMargin	returnOnAssets	...	companyEquityMultiplier	ebitPerRevenue	enterpriseValueMultiple	operatingCashFlowPerShare	operatingCashFlowSalesRatio	payablesTurnover	Agency_Egan-Jones	Agency_Fitch	Agency_S&P
0	2	0.945894	0.426395	0.099690	44.203245	0.037480	0.049351	0.176631	0.061510	0.041189	...	4.008012	0.049351	7.057088	15.565438	0.058638	3.906655	1.0	0.0	0.0
1	3	1.033559	0.498234	0.203120	38.991156	0.044062	0.048857	0.175715	0.066546	0.053204	...	3.156783	0.048857	6.460618	15.914250	0.067239	4.002846	1.0	0.0	0.0
2	3	0.963703	0.451505	0.122099	50.841385	0.032709	0.044334	0.170843	0.059783	0.032497	...	4.094575	0.044334	10.491970	18.888889	0.074426	3.483510	0.0	1.0	0.0
3	3	1.019851	0.510402	0.176116	41.161738	0.020894	-0.012858	0.138059	0.042430	0.025690	...	3.630950	-0.012858	4.080741	6.901042	0.028394	4.581150	0.0	1.0	0.0
4	3	0.957844	0.495432	0.141608	47.761126	0.042861	0.053770	0.177720	0.065354	0.046363	...	4.012780	0.053770	8.293505	15.808147	0.058065	3.857790	0.0	0.0	1.0

5 rows × 30 columns

(4) Split data into Train and Test dataset¶

The ratio of train and test data is 80 : 20.

In [ ]:

data_train, data_test = train_test_split(cor_rate_data_adasyn, test_size=0.2, random_state = 2023311161)
X_train, y_train = data_train.iloc[:,2:30], data_train.iloc[:,0]
X_test, y_test = data_test.iloc[:,2:30], data_test.iloc[:,0]
X_test.head()

Out[ ]:

	quickRatio	cashRatio	daysOfSalesOutstanding	netProfitMargin	pretaxProfitMargin	grossProfitMargin	operatingProfitMargin	returnOnAssets	returnOnCapitalEmployed	returnOnEquity	...	companyEquityMultiplier	ebitPerRevenue	enterpriseValueMultiple	operatingCashFlowPerShare	operatingCashFlowSalesRatio	payablesTurnover	Agency_Egan-Jones	Agency_Fitch	Agency_Moody's	Agency_S&P
2371	2.278170	0.111674	71.634958	0.144557	0.061013	0.640281	0.203167	0.072946	0.045269	0.169617	...	2.305414	0.061013	17.261475	3.695267	0.327851	4.956799	0.000000	0.000000	0.0	1.000000
534	1.038975	0.380729	63.727017	0.059743	0.092966	0.192734	0.103564	0.046122	0.090953	0.109802	...	2.380685	0.092966	8.102552	3.403397	0.079016	6.936468	0.000000	0.000000	1.0	0.000000
1071	1.229724	0.380645	46.646221	0.068476	0.103081	0.199890	0.118506	0.053858	0.096087	0.149840	...	2.782151	0.103081	6.363313	3.485643	0.127798	10.028686	0.000000	0.000000	1.0	0.000000
5606	0.394453	0.026197	53.924592	0.196575	0.147173	0.998726	-0.193032	0.050472	0.043942	-0.404301	...	-11.054333	0.147173	20.605994	3.182294	0.083168	3.631803	0.014864	0.000000	0.0	0.985136
4943	0.549608	0.308271	23.011114	-1.083307	-1.194222	0.509708	-1.100582	-0.579432	-0.725569	1.409955	...	-2.537786	-1.194222	-4.132979	5.529796	0.146863	3.738619	0.000000	0.534308	0.0	0.465692

5 rows × 28 columns

3. Machine Learning Models¶

This section compares several machine learning models to compare how well each model predicts the rating of the companies. We'll provide two measures for the prediction performance; accuracy and weighted $f_1$ score.

In [ ]:

average = 'weighted'

(1) Ordered Logistic Regression¶

In [ ]:

# OrderedModel in statsmodels is too poor.

LR_model = LogisticRegression(random_state=2023311161 ,
                              multi_class='multinomial',
                              solver='newton-cg',
                              max_iter=5000)

LR_model = LR_model.fit(X_train, y_train)
y_pred_LR = LR_model.predict(X_test)
acc_olr = accuracy_score(y_test, y_pred_LR)
print("Ordered Logistic Regression Accuracy :",acc_olr) # 0.5935

f1_olr = f1_score(y_test, y_pred_LR, average = average)
print("Ordered Logistic Regression f1 score :", f1_olr) # 0.5709

Ordered Logistic Regression Accuracy : 0.5935960591133005
Ordered Logistic Regression f1 score : 0.5709746447753064

C:\Users\user\anaconda3\lib\site-packages\sklearn\utils\optimize.py:210: ConvergenceWarning: newton-cg failed to converge. Increase the number of iterations.
  warnings.warn(

(2) LDA¶

In [ ]:

LDA_model = LinearDiscriminantAnalysis()
LDA_model.fit(X_train,y_train)
y_pred_LDA = LDA_model.predict(X_test)
acc_LDA = accuracy_score(y_test, y_pred_LDA)
print("LDA Accuracy :",acc_LDA)

f1_LDA = f1_score(y_test, y_pred_LDA, average = average)
print("LDA f1 score :", f1_LDA)

LDA Accuracy : 0.48932676518883417
LDA f1 score : 0.4515765543613118

(3) QDA¶

In [ ]:

QDA_model = QuadraticDiscriminantAnalysis()
QDA_model.fit(X_train,y_train)
y_pred_QDA = QDA_model.predict(X_test)
acc_QDA = accuracy_score(y_test, y_pred_QDA)
print("QDA Accuracy :",acc_QDA)

f1_QDA = f1_score(y_test, y_pred_QDA, average = average)
print("QDA f1 score :", f1_QDA)

QDA Accuracy : 0.535303776683087
QDA f1 score : 0.4714921035655885

C:\Users\user\anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:878: UserWarning: Variables are collinear
  warnings.warn("Variables are collinear")

(4) KNN¶

In [ ]:

KNN_model = KNeighborsClassifier(n_neighbors = 3)
KNN_model.fit(X_train,y_train)
y_pred_KNN = KNN_model.predict(X_test)
acc_KNN = accuracy_score(y_test, y_pred_KNN)
print("KNN Accuracy :",acc_KNN)

f1_KNN = f1_score(y_test, y_pred_KNN, average = average)
print("KNN f1 score :", f1_KNN)

KNN Accuracy : 0.7996715927750411
KNN f1 score : 0.7858740734584811

(5) SVM¶

In [ ]:

SVC_model = svm.SVC(gamma= 'auto')
SVC_model.fit(X_train, y_train)
y_pred_SVM = SVC_model.predict(X_test)
acc_SVM = accuracy_score(y_test, y_pred_SVM)
print("SVM Accuracy :",acc_SVM)

f1_SVM = f1_score(y_test, y_pred_SVM, average = average)
print("SVM f1 score :", f1_SVM)

SVM Accuracy : 0.7955665024630542
SVM f1 score : 0.8121719550099245

(6) Random Forest¶

In [ ]:

RF_model = RandomForestClassifier(random_state=2023311161)
RF_model.fit(X_train,y_train)
y_pred_RF = RF_model.predict(X_test)
acc_RF = accuracy_score(y_test, y_pred_RF)
print("Random Forest Accuracy :",acc_RF)

f1_RF = f1_score(y_test, y_pred_RF, average = average)
print("Random Forest f1 score :", f1_RF)

Random Forest Accuracy : 0.8538587848932676
Random Forest f1 score : 0.8512928695960994

(7) Gradient Boosting¶

In [ ]:

GBT_model = GradientBoostingClassifier(random_state=2023311161)
GBT_model.fit(X_train, y_train)
y_pred_GBT = GBT_model.predict(X_test)
acc_GBT = accuracy_score(y_test, y_pred_GBT)
print("GBT Accuracy :",acc_GBT)

f1_GBT = f1_score(y_test, y_pred_GBT, average = average)
print("GBT f1 score :", f1_GBT)

GBT Accuracy : 0.7857142857142857
GBT f1 score : 0.7805326021015632

(8) XGBoost¶

In [ ]:

xgb_model = XGBClassifier(objective='multi:softmax', num_class=len(y_train.unique()), random_state=2023311161)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
acc_xgb = accuracy_score(y_test, y_pred_xgb)
print("XGBoost accuracy :", acc_xgb)

f1_xgb = f1_score(y_test, y_pred_xgb, average = average)
print("XGBoost f1 score :", f1_xgb)

XGBoost accuracy : 0.8702791461412152
XGBoost f1 score : 0.8690958617082526

(9) LightGBM¶

In [ ]:

lgb_model = lgb.LGBMClassifier(random_state=2023311161)
lgb_model.fit(X_train, y_train)
y_pred_lightGBM = lgb_model.predict(X_test)
acc_lightGBM = accuracy_score(y_test, y_pred_lightGBM)
print("LightGBM accuracy :", acc_lightGBM)

f1_lightGBM = f1_score(y_test, y_pred_lightGBM, average = average)
print("LightGBM f1 score :", f1_lightGBM)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001712 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 7066
[LightGBM] [Info] Number of data points in the train set: 4869, number of used features: 28
[LightGBM] [Info] Start training from score -2.221548
[LightGBM] [Info] Start training from score -2.219655
[LightGBM] [Info] Start training from score -2.225343
[LightGBM] [Info] Start training from score -2.195378
[LightGBM] [Info] Start training from score -2.202785
[LightGBM] [Info] Start training from score -2.131070
[LightGBM] [Info] Start training from score -2.214000
[LightGBM] [Info] Start training from score -2.188025
[LightGBM] [Info] Start training from score -2.180726
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
LightGBM accuracy : 0.8735632183908046
LightGBM f1 score : 0.8737808912916817

4. Results¶

(1) Accuracy¶

The following barplot compares the accuracies of each model in ascending order. The LightGBM, XGBoost, and random forest show the highest accuracies among all of the models; around 87%. However, ordered logistic, LDA, and QDA have relatively low accuracies; around 50 ~ 60%.

In [ ]:

models = ['Ordered Logistic', 'LDA','QDA', 'KNN','SVC','Random Forest','Gradient Boosting','XGBoost','LightGBM']
accuracy = [acc_olr, acc_LDA, acc_QDA, acc_KNN, acc_SVM, acc_RF, acc_GBT, acc_xgb, acc_lightGBM]
df_acc = pd.DataFrame({'model' : models, 'accuracy' : accuracy})
df_acc = df_acc.sort_values('accuracy', ascending = True)

plt.figure(figsize = (12,6))
ax = sns.barplot(data=df_acc, x='model', y='accuracy',  palette='viridis')

# Add accuracy values on top of each bar
for p in ax.patches:
    ax.annotate(f'{p.get_height():.3f}',  # Format with two decimals
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='bottom', fontsize=10)
plt.ylabel('Accuracy')
plt.title('The accuracies for distinct models')
plt.show()

(2) weighted average $F_1$ score¶

This section suggests different measure, the weighted average $F_1$ score. Suppose $C$ is the total number of classes, $n_k$ is the number of samples in class $k$, $w_k := n_k/\sum_{j=1}^C n_j$ is the weight of class $k$. The precision and recall, another type of measure for performance, are defined as follows: $$P_k = \frac{\text{True Positive}_k}{\text{True Positive}_k+\text{False Positive}_k}$$ $$R_k = \frac{\text{True Positive}_k}{\text{True Positive}_k+\text{False Negative}_k}$$ Then, the class-wise $F_1$ score is $F_{1,(k)} = 2\frac{P_k\cdot R_k}{P_k + R_k}$, and thus the weighted average $F_1$ score is $$F_{1,weighted} := \sum_{k=1}^C w_k\cdot F_{1,(k)}$$

The following barplot compares the weighted $F_1$ scores of each model in ascending order. The LightGBM, XGBoost, and random forest show the highest weighted $F_1$ scores among all of the models; around 87%. However, ordered logistic, LDA, and QDA have relatively low weighted $F_1$ scores; around 45 ~ 57%.

In [ ]:

f1_scores = [f1_olr, f1_LDA, f1_QDA, f1_KNN, f1_SVM, f1_RF, f1_GBT, f1_xgb, f1_lightGBM]
df_f1 = pd.DataFrame({'model' : models, 'f1_score' : f1_scores})
df_f1 = df_f1.sort_values('f1_score', ascending = True)

plt.figure(figsize = (12,6))
ax = sns.barplot(data=df_f1, x='model', y='f1_score', color='skyblue')

for p in ax.patches:
    ax.annotate(f'{p.get_height():.3f}',  # Format with two decimals
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='bottom', fontsize=10)
plt.ylabel('f1 score')
plt.title('The f1 scores for distinct models')
plt.show()

(3) Feature Importance¶

This barplot shows how important each variable is.

In [ ]:

df_LGB_importance = pd.DataFrame({'feature' : X_train.columns,
                                  'importance' : lgb_model.feature_importances_})
df_LGB_importance = df_LGB_importance.sort_values(by = 'importance', ascending = True)

plt.figure(figsize = (16,10))
bars = plt.barh(df_LGB_importance['feature'], df_LGB_importance['importance'], color = 'green')
for bar, value in zip(bars, df_LGB_importance['importance']):
    plt.text(bar.get_width() + 0.01, bar.get_y() + bar.get_height() / 2,
             f'{value:.0f}', va='center', fontsize=9, color='black')
plt.xlabel('Feature Importance')
plt.title('Feature Importance - LightGBM')
plt.show()

The following table shows the top 5 most important variables.

In [ ]:

df_LGB_importance = df_LGB_importance.sort_values(by = 'importance', ascending = False)
df_LGB_importance['feature'].iloc[:5].to_frame().T

Out[ ]:

	2	17	0	23	20
feature	daysOfSalesOutstanding	cashPerShare	quickRatio	payablesTurnover	enterpriseValueMultiple

5. Other Comments¶

Among all the models, the LDA,QDA, and ordered logistic regression models show relatively low performace. This section discusses on some probable reasons of it.

(1) Outliers¶

Most of the columns of numerical variables have extreme values. For example, the minimum and the values between Q1 and Q3 of the column currentRatio are -0.93 and 1.07 ~ 2.16, but its maximum is 1725.5. These outlier values are likely to distort the models' performace.

Especially, the LDA as well as QDA assumes the explanatory variable to follow normal distribution.

In [ ]:

cor_rate_data.loc[:,"currentRatio":"payablesTurnover"].describe()

Out[ ]:

	currentRatio	quickRatio	cashRatio	daysOfSalesOutstanding	netProfitMargin	pretaxProfitMargin	grossProfitMargin	operatingProfitMargin	returnOnAssets	returnOnCapitalEmployed	...	effectiveTaxRate	freeCashFlowOperatingCashFlowRatio	freeCashFlowPerShare	cashPerShare	companyEquityMultiplier	ebitPerRevenue	enterpriseValueMultiple	operatingCashFlowPerShare	operatingCashFlowSalesRatio	payablesTurnover
count	2029.000000	2029.000000	2029.000000	2029.000000	2029.000000	2029.000000	2029.000000	2029.000000	2029.000000	2029.000000	...	2029.000000	2029.000000	2.029000e+03	2.029000e+03	2029.000000	2029.000000	2029.000000	2.029000e+03	2029.000000	2029.000000
mean	3.529607	2.653986	0.667364	333.795606	0.278447	0.431483	0.497968	0.587322	-37.517928	-73.974193	...	0.397572	0.409550	5.094719e+03	4.227549e+03	3.323579	0.437454	48.287985	6.515123e+03	1.447653	38.002718
std	44.052361	32.944817	3.583943	4447.839583	6.064134	8.984982	0.525307	11.224622	1166.172220	2350.275719	...	10.595075	3.796488	1.469156e+05	1.224000e+05	87.529866	8.984299	529.118961	1.775290e+05	19.483294	758.923588
min	-0.932005	-1.893266	-0.192736	-811.845623	-101.845815	-124.343612	-14.800817	-124.343612	-40213.178290	-87162.162160	...	-100.611015	-120.916010	-4.912742e+03	-1.915035e+01	-2555.419643	-124.343612	-3749.921337	-1.195049e+04	-4.461837	-76.662850
25%	1.071930	0.602825	0.130630	22.905093	0.021006	0.025649	0.233127	0.044610	0.019176	0.028112	...	0.146854	0.271478	4.119924e-01	1.566038e+00	2.046822	0.028057	6.238066	2.356735e+00	0.073886	2.205912
50%	1.493338	0.985679	0.297493	42.374120	0.064753	0.084965	0.414774	0.107895	0.045608	0.074421	...	0.300539	0.644529	2.131742e+00	3.686513e+00	2.652456	0.087322	9.274398	4.352584e+00	0.133050	5.759722
75%	2.166891	1.453820	0.624906	59.323563	0.114807	0.144763	0.849693	0.176181	0.077468	0.135036	...	0.370653	0.836949	4.230253e+00	8.086152e+00	3.658331	0.149355	12.911759	7.319759e+00	0.240894	9.480892
max	1725.505005	1139.541703	125.917417	115961.637400	198.517873	309.694856	2.702533	410.182214	0.487826	2.439504	...	429.926282	34.594086	5.753380e+06	4.786803e+06	2562.871795	309.694856	11153.607090	6.439270e+06	688.526591	20314.880400

8 rows × 25 columns

(2) Multicollinearity¶

If there is multicollinearity between covariances, this would lead to poor performance of forecasting. The following table shows the correlations of all variables.

In [ ]:

cor_rate_data_correlation = cor_rate_data.loc[:,"currentRatio":"payablesTurnover"].corr()
cor_rate_data_correlation

Out[ ]:

	currentRatio	quickRatio	cashRatio	daysOfSalesOutstanding	netProfitMargin	pretaxProfitMargin	grossProfitMargin	operatingProfitMargin	returnOnAssets	returnOnCapitalEmployed	...	effectiveTaxRate	freeCashFlowOperatingCashFlowRatio	freeCashFlowPerShare	cashPerShare	companyEquityMultiplier	ebitPerRevenue	enterpriseValueMultiple	operatingCashFlowPerShare	operatingCashFlowSalesRatio	payablesTurnover
currentRatio	1.000000	0.104329	0.736042	0.000020	0.003561	-0.002453	0.067326	-0.001224	0.001623	0.001583	...	-0.026461	0.005863	-0.001307	-0.001266	0.000118	-0.002489	-0.002013	-0.001453	-0.001439	-0.002160
quickRatio	0.104329	1.000000	0.125848	0.736630	-0.001931	-0.002208	-0.028343	-0.001820	0.001539	0.001508	...	-0.003940	0.001713	-0.001427	-0.000929	-0.000365	-0.002241	-0.002843	-0.001587	-0.001453	-0.001759
cashRatio	0.736042	0.125848	1.000000	0.006616	-0.007936	-0.006837	-0.050355	-0.000723	0.003326	0.003280	...	-0.024406	-0.013719	-0.001778	-0.001652	-0.001883	-0.006883	-0.007945	-0.002293	0.004459	-0.006915
daysOfSalesOutstanding	0.000020	0.736630	0.006616	1.000000	0.262299	0.278222	-0.068181	0.291002	0.002416	0.002364	...	-0.000926	0.003267	-0.002155	-0.000834	-0.000242	0.278204	-0.019808	-0.002283	0.406126	-0.003026
netProfitMargin	0.003561	-0.001931	-0.007936	0.262299	1.000000	0.991241	-0.099540	0.971483	0.001577	0.001535	...	0.001500	-0.004830	-0.000708	-0.000716	-0.001185	0.991185	-0.003665	-0.000779	0.785592	-0.001681
pretaxProfitMargin	-0.002453	-0.002208	-0.006837	0.278222	0.991241	1.000000	-0.135662	0.992001	0.001637	0.001598	...	-0.000097	-0.005518	-0.000976	-0.000977	-0.001148	0.999975	-0.003867	-0.001080	0.831778	-0.001760
grossProfitMargin	0.067326	-0.028343	-0.050355	-0.068181	-0.099540	-0.135662	1.000000	-0.121829	-0.030745	-0.030128	...	0.018387	0.016868	0.012775	0.012781	0.010026	-0.135715	-0.008007	0.011131	-0.105564	0.034244
operatingProfitMargin	-0.001224	-0.001820	-0.000723	0.291002	0.971483	0.992001	-0.121829	1.000000	0.001675	0.001634	...	0.000173	-0.005949	-0.001205	-0.001205	-0.000962	0.992018	-0.003699	-0.001305	0.871686	-0.002042
returnOnAssets	0.001623	0.001539	0.003326	0.002416	0.001577	0.001637	-0.030745	0.001675	1.000000	0.995426	...	0.001214	0.029072	-0.001283	0.001115	0.002629	0.001659	0.002742	0.002275	0.002385	0.001616
returnOnCapitalEmployed	0.001583	0.001508	0.003280	0.002364	0.001535	0.001598	-0.030128	0.001634	0.995426	1.000000	...	0.001183	0.029901	-0.001162	0.001090	0.002559	0.001619	0.002883	0.002145	0.002334	0.001581
returnOnEquity	-0.001644	-0.001562	-0.003402	-0.002444	-0.001554	-0.001623	0.031101	-0.001669	-0.995371	-0.981650	...	-0.001216	-0.027981	0.001368	-0.001147	-0.002498	-0.001644	-0.002975	-0.002419	-0.002412	-0.001628
assetTurnover	-0.001951	-0.001854	-0.003991	-0.002887	-0.001831	-0.001909	0.036758	-0.001974	-0.822513	-0.802676	...	-0.001444	-0.022836	0.000736	-0.001329	-0.003321	-0.001934	-0.003633	-0.002390	-0.002834	-0.001926
fixedAssetTurnover	-0.001944	-0.001845	-0.003940	-0.002884	-0.001830	-0.001907	0.036731	-0.001973	-0.810064	-0.788302	...	-0.001442	-0.022273	0.000733	-0.001328	-0.003324	-0.001933	-0.003634	-0.002395	-0.002834	-0.001925
debtEquityRatio	0.000093	-0.000366	-0.001877	-0.000231	-0.001183	-0.001149	0.010051	-0.000964	0.002262	0.002200	...	0.003429	0.000507	-0.012039	-0.009003	0.999995	-0.001148	0.010820	-0.015896	-0.000781	-0.000212
debtRatio	0.005856	0.002901	-0.008638	0.001022	-0.022625	-0.024115	0.018127	-0.024509	-0.051972	-0.051081	...	-0.001024	-0.058294	-0.011503	-0.013107	0.007100	-0.024150	0.051639	-0.004820	-0.021119	-0.004237
effectiveTaxRate	-0.026461	-0.003940	-0.024406	-0.000926	0.001500	-0.000097	0.018387	0.000173	0.001214	0.001183	...	1.000000	0.005496	-0.001528	-0.001562	0.003449	-0.000029	-0.004558	-0.001771	-0.000724	-0.000206
freeCashFlowOperatingCashFlowRatio	0.005863	0.001713	-0.013719	0.003267	-0.004830	-0.005518	0.016868	-0.005949	0.029072	0.029901	...	0.005496	1.000000	0.003497	0.003662	0.000510	-0.005449	-0.002982	0.003602	-0.003585	-0.003625
freeCashFlowPerShare	-0.001307	-0.001427	-0.001778	-0.002155	-0.000708	-0.000976	0.012775	-0.001205	-0.001283	-0.001162	...	-0.001528	0.003497	1.000000	0.997277	-0.012076	-0.000999	-0.003156	0.992371	-0.002066	-0.001476
cashPerShare	-0.001266	-0.000929	-0.001652	-0.000834	-0.000716	-0.000977	0.012781	-0.001205	0.001115	0.001090	...	-0.001562	0.003662	0.997277	1.000000	-0.009027	-0.001000	-0.003172	0.986459	-0.001514	-0.001281
companyEquityMultiplier	0.000118	-0.000365	-0.001883	-0.000242	-0.001185	-0.001148	0.010026	-0.000962	0.002629	0.002559	...	0.003449	0.000510	-0.012076	-0.009027	1.000000	-0.001146	0.010804	-0.015945	-0.000804	-0.000207
ebitPerRevenue	-0.002489	-0.002241	-0.006883	0.278204	0.991185	0.999975	-0.135715	0.992018	0.001659	0.001619	...	-0.000029	-0.005449	-0.000999	-0.001000	-0.001146	1.000000	-0.003925	-0.001104	0.831789	-0.001790
enterpriseValueMultiple	-0.002013	-0.002843	-0.007945	-0.019808	-0.003665	-0.003867	-0.008007	-0.003699	0.002742	0.002883	...	-0.004558	-0.002982	-0.003156	-0.003172	0.010804	-0.003925	1.000000	-0.003337	-0.005003	-0.001183
operatingCashFlowPerShare	-0.001453	-0.001587	-0.002293	-0.002283	-0.000779	-0.001080	0.011131	-0.001305	0.002275	0.002145	...	-0.001771	0.003602	0.992371	0.986459	-0.015945	-0.001104	-0.003337	1.000000	-0.002220	-0.000302
operatingCashFlowSalesRatio	-0.001439	-0.001453	0.004459	0.406126	0.785592	0.831778	-0.105564	0.871686	0.002385	0.002334	...	-0.000724	-0.003585	-0.002066	-0.001514	-0.000804	0.831789	-0.005003	-0.002220	1.000000	-0.003337
payablesTurnover	-0.002160	-0.001759	-0.006915	-0.003026	-0.001681	-0.001760	0.034244	-0.002042	0.001616	0.001581	...	-0.000206	-0.003625	-0.001476	-0.001281	-0.000207	-0.001790	-0.001183	-0.000302	-0.003337	1.000000

25 rows × 25 columns

The following image plot visualizes the correlation between each variables. This shows that there are some variables that have strong correlation (whether positive or negative) with other ones. For example, netProfitMargin column has strong correlation with pretaxProfitMargin (0.991241), operatingProfitMargin (0.971483), ebitPerRevenue (0.991185), and operatingCashFlowSalesRatio (0.785592).

In [ ]:

plt.figure(figsize=(16, 12))
plt.imshow(cor_rate_data_correlation, cmap='coolwarm', interpolation='none', aspect='auto')
plt.colorbar(label='Correlation Coefficient')
plt.xticks(range(len(cor_rate_data_correlation.index)), cor_rate_data_correlation.index, rotation=90)
plt.yticks(range(len(cor_rate_data_correlation.index)), cor_rate_data_correlation.index)
plt.title("Correlation Matrix", fontsize=30)
plt.tight_layout()
plt.show()

This multicollinearity has severe impact on QDA, because QDA relies on the covariance matrices of the classes. Also, the multicollinearity can cause the ordered logistic model to fail to converge, leading to unreliable estimates.

6. Conclusion¶

The models based on gradient boosting, including LightGBM and XGBoost, as well as Random Forest, have the best performance of forecasting the ratings of corporate credit, without dealing with outliers and multicollinearity. If such methods as LDA and Ordered Logistic Regression are necessary, handling the outliers (i.e., treating them as missing values) and deleting columns with high correlation would lead to higher performance than before.