0. Purpose of the Project¶
The Corporate Credit Rating evaluates the likelihood of a company repaying its debt, i.e., its
default probability. This notebook is to compare several models predicting corporate credit ratings. This ratings determines the cost of capital, and aids in investment decision-making. Also, it helps financial institutions (banks, insurance companies, etc.) manage financial risks and contributes to market stability.
This rating is done by credit rating agencies such as S&P, Moody's and Fitch. They represent credit risk using similar rating scales, i.e., AAA, AA, A, BBB. If they downgrade in rating, bond price decreases and interest rate increases. In constrast, if they upgrade in rating, bond price increases and interest rate decreases. Their impact on financial market is so significant that more funds flow to companies with higher credit ratings and various macroeconomic indicators are also influenced.
The purpose of the project is to compare several models for corporate credit rating and suggest the best model for it. The data is originated from Kaggle competition Corporate Credit Rating. There are 2029 credit ratings from major agencies like S&P, which are evaluated between 2010 and 2016. The subject of the rating is the companies listed on Nasdaq and NYSE, and there is no missing value.
1. Install and Import Packages¶
In this data analysis, we need several packages for data imbalance and other algorithms such as xgboost and lightgbm. After then, we import packages from sklearn, imblearn,xgboost, lightgbm, and other basic packages such as pandas and matplotlib.pyplot.
pip install imblearn
Requirement already satisfied: imblearn in c:\users\user\anaconda3\lib\site-packages (0.0)Note: you may need to restart the kernel to use updated packages. Requirement already satisfied: imbalanced-learn in c:\users\user\anaconda3\lib\site-packages (from imblearn) (0.12.4) Requirement already satisfied: scikit-learn>=1.0.2 in c:\users\user\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (1.0.2) Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\user\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (2.2.0) Requirement already satisfied: joblib>=1.1.1 in c:\users\user\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (1.4.2) Requirement already satisfied: scipy>=1.5.0 in c:\users\user\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (1.7.3) Requirement already satisfied: numpy>=1.17.3 in c:\users\user\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (1.22.0)
pip install xgboost
Requirement already satisfied: xgboost in c:\users\user\anaconda3\lib\site-packages (2.1.3) Requirement already satisfied: scipy in c:\users\user\anaconda3\lib\site-packages (from xgboost) (1.7.3) Requirement already satisfied: numpy in c:\users\user\anaconda3\lib\site-packages (from xgboost) (1.22.0) Note: you may need to restart the kernel to use updated packages.
pip install lightgbm
Requirement already satisfied: lightgbm in c:\users\user\anaconda3\lib\site-packages (4.5.0) Requirement already satisfied: scipy in c:\users\user\anaconda3\lib\site-packages (from lightgbm) (1.7.3) Requirement already satisfied: numpy>=1.17.0 in c:\users\user\anaconda3\lib\site-packages (from lightgbm) (1.22.0) Note: you may need to restart the kernel to use updated packages.
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from imblearn.over_sampling import ADASYN
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
import lightgbm as lgb
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
2. EDA & Data Preprocess¶
The data corporate_rating.csv has 31 columns, including 6 categorical and 25 numerical data.
The numerical data can be categorized into 4 section. First one is Liquidity, the ability of assets to be converted into cash. This consists of such variables as currentRatio, quickRatio, cashRatio, daysOfSalesOutstanding... Second one is Profitability Indicator, including grossProfitMargin, operatingProfitMargin, pretaxProfitMargin... The third one is Debt indicator, such as debtRatio, debtEquityRatio ... The last section is Cash Flow, which consists of operatingCashFlowPerShare, freeCashFlowPerShare, cashPerShare... There are some other columns not in those sections, such as assetTurnover (asset turnover ratio).
The rest of 6 categorical data are comprised of Rating, Name, Symbol, Rating Agency Name, Date, and Sector.
cor_rate_data = pd.read_csv('corporate_rating.csv')
cor_rate_data.head()
| Rating | Name | Symbol | Rating Agency Name | Date | Sector | currentRatio | quickRatio | cashRatio | daysOfSalesOutstanding | ... | effectiveTaxRate | freeCashFlowOperatingCashFlowRatio | freeCashFlowPerShare | cashPerShare | companyEquityMultiplier | ebitPerRevenue | enterpriseValueMultiple | operatingCashFlowPerShare | operatingCashFlowSalesRatio | payablesTurnover | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A | Whirlpool Corporation | WHR | Egan-Jones Ratings Company | 11/27/2015 | Consumer Durables | 0.945894 | 0.426395 | 0.099690 | 44.203245 | ... | 0.202716 | 0.437551 | 6.810673 | 9.809403 | 4.008012 | 0.049351 | 7.057088 | 15.565438 | 0.058638 | 3.906655 |
| 1 | BBB | Whirlpool Corporation | WHR | Egan-Jones Ratings Company | 2/13/2014 | Consumer Durables | 1.033559 | 0.498234 | 0.203120 | 38.991156 | ... | 0.074155 | 0.541997 | 8.625473 | 17.402270 | 3.156783 | 0.048857 | 6.460618 | 15.914250 | 0.067239 | 4.002846 |
| 2 | BBB | Whirlpool Corporation | WHR | Fitch Ratings | 3/6/2015 | Consumer Durables | 0.963703 | 0.451505 | 0.122099 | 50.841385 | ... | 0.214529 | 0.513185 | 9.693487 | 13.103448 | 4.094575 | 0.044334 | 10.491970 | 18.888889 | 0.074426 | 3.483510 |
| 3 | BBB | Whirlpool Corporation | WHR | Fitch Ratings | 6/15/2012 | Consumer Durables | 1.019851 | 0.510402 | 0.176116 | 41.161738 | ... | 1.816667 | -0.147170 | -1.015625 | 14.440104 | 3.630950 | -0.012858 | 4.080741 | 6.901042 | 0.028394 | 4.581150 |
| 4 | BBB | Whirlpool Corporation | WHR | Standard & Poor's Ratings Services | 10/24/2016 | Consumer Durables | 0.957844 | 0.495432 | 0.141608 | 47.761126 | ... | 0.166966 | 0.451372 | 7.135348 | 14.257556 | 4.012780 | 0.053770 | 8.293505 | 15.808147 | 0.058065 | 3.857790 |
5 rows × 31 columns
(1) Rating Agency¶
There are 5 rating agencies, including S&P, Egan-Jones, Moody's, Fitch, and DBRS.
cor_rate_data['Rating Agency Name'].value_counts().to_frame()
| Rating Agency Name | |
|---|---|
| Standard & Poor's Ratings Services | 744 |
| Egan-Jones Ratings Company | 603 |
| Moody's Investors Service | 579 |
| Fitch Ratings | 100 |
| DBRS | 3 |
The following mapping code is to simplify the name of the rating agencies so that its barplot becomes tidy.
agency_mapping = {"Standard & Poor's Ratings Services" : "S&P",
"Egan-Jones Ratings Company" : "Egan-Jones",
"Moody's Investors Service" : "Moody's",
"Fitch Ratings" : "Fitch",
"DBRS" : "DBRS"}
cor_rate_data['Rating Agency Name'] = cor_rate_data['Rating Agency Name'].map(agency_mapping)
cor_rate_data['Rating Agency Name'].value_counts().to_frame()
| Rating Agency Name | |
|---|---|
| S&P | 744 |
| Egan-Jones | 603 |
| Moody's | 579 |
| Fitch | 100 |
| DBRS | 3 |
The evaluations of the companies are mainly done by 3 agencies; S&P, Egan-Jones, and Moody's.
plt.figure(figsize = (10,6))
cor_rate_data['Rating Agency Name'].value_counts().plot(kind = 'bar',
color=plt.cm.tab20.colors[:len(cor_rate_data['Rating Agency Name'].value_counts())])
plt.title('Frequency of Rating Agencies', fontsize = 14)
plt.xlabel('Name of Agencies')
plt.ylabel('Counts')
plt.show()
Instead of deleting the agency column, I determined to check how the agencies affect the rating of the companies. However, simply encoding the agencies based on the number of agencies (0 to 4) may not fit the conditions of some models such as LDA and QDA, because they require the explanatory variables to follow normal distribution. So, I generated dummy variables (one-hot encoding) for rating agencies.
cor_rate_data = pd.get_dummies(cor_rate_data,
columns = ['Rating Agency Name'],
prefix = 'Agency',
drop_first=True,
dtype = float)
cor_rate_data.head()
| Rating | Name | Symbol | Date | Sector | currentRatio | quickRatio | cashRatio | daysOfSalesOutstanding | netProfitMargin | ... | companyEquityMultiplier | ebitPerRevenue | enterpriseValueMultiple | operatingCashFlowPerShare | operatingCashFlowSalesRatio | payablesTurnover | Agency_Egan-Jones | Agency_Fitch | Agency_Moody's | Agency_S&P | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A | Whirlpool Corporation | WHR | 11/27/2015 | Consumer Durables | 0.945894 | 0.426395 | 0.099690 | 44.203245 | 0.037480 | ... | 4.008012 | 0.049351 | 7.057088 | 15.565438 | 0.058638 | 3.906655 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | BBB | Whirlpool Corporation | WHR | 2/13/2014 | Consumer Durables | 1.033559 | 0.498234 | 0.203120 | 38.991156 | 0.044062 | ... | 3.156783 | 0.048857 | 6.460618 | 15.914250 | 0.067239 | 4.002846 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | BBB | Whirlpool Corporation | WHR | 3/6/2015 | Consumer Durables | 0.963703 | 0.451505 | 0.122099 | 50.841385 | 0.032709 | ... | 4.094575 | 0.044334 | 10.491970 | 18.888889 | 0.074426 | 3.483510 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | BBB | Whirlpool Corporation | WHR | 6/15/2012 | Consumer Durables | 1.019851 | 0.510402 | 0.176116 | 41.161738 | 0.020894 | ... | 3.630950 | -0.012858 | 4.080741 | 6.901042 | 0.028394 | 4.581150 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | BBB | Whirlpool Corporation | WHR | 10/24/2016 | Consumer Durables | 0.957844 | 0.495432 | 0.141608 | 47.761126 | 0.042861 | ... | 4.012780 | 0.053770 | 8.293505 | 15.808147 | 0.058065 | 3.857790 | 0.0 | 0.0 | 0.0 | 1.0 |
5 rows × 34 columns
(2) Sectors and other categorical variables¶
Following code and barplot show the number of sectors, including energy, health care, finance etc.
cor_rate_data['Sector'].value_counts().to_frame()
| Sector | |
|---|---|
| Energy | 294 |
| Basic Industries | 260 |
| Consumer Services | 250 |
| Technology | 234 |
| Capital Goods | 233 |
| Public Utilities | 211 |
| Health Care | 171 |
| Consumer Non-Durables | 132 |
| Consumer Durables | 74 |
| Transportation | 63 |
| Miscellaneous | 57 |
| Finance | 50 |
plt.figure(figsize = (12,6))
cor_rate_data['Sector'].value_counts().plot(kind = 'bar', color = 'orange')
plt.title('Distribution Of Sectors')
plt.ylabel('Counts')
plt.show()
Including 'Sector' column, I deleted rest of the categorical variables except for 'Rating'.
cor_rate_data = cor_rate_data.drop(columns = ['Sector','Name','Date', 'Symbol'])
cor_rate_data.head()
| Rating | currentRatio | quickRatio | cashRatio | daysOfSalesOutstanding | netProfitMargin | pretaxProfitMargin | grossProfitMargin | operatingProfitMargin | returnOnAssets | ... | companyEquityMultiplier | ebitPerRevenue | enterpriseValueMultiple | operatingCashFlowPerShare | operatingCashFlowSalesRatio | payablesTurnover | Agency_Egan-Jones | Agency_Fitch | Agency_Moody's | Agency_S&P | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A | 0.945894 | 0.426395 | 0.099690 | 44.203245 | 0.037480 | 0.049351 | 0.176631 | 0.061510 | 0.041189 | ... | 4.008012 | 0.049351 | 7.057088 | 15.565438 | 0.058638 | 3.906655 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | BBB | 1.033559 | 0.498234 | 0.203120 | 38.991156 | 0.044062 | 0.048857 | 0.175715 | 0.066546 | 0.053204 | ... | 3.156783 | 0.048857 | 6.460618 | 15.914250 | 0.067239 | 4.002846 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | BBB | 0.963703 | 0.451505 | 0.122099 | 50.841385 | 0.032709 | 0.044334 | 0.170843 | 0.059783 | 0.032497 | ... | 4.094575 | 0.044334 | 10.491970 | 18.888889 | 0.074426 | 3.483510 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | BBB | 1.019851 | 0.510402 | 0.176116 | 41.161738 | 0.020894 | -0.012858 | 0.138059 | 0.042430 | 0.025690 | ... | 3.630950 | -0.012858 | 4.080741 | 6.901042 | 0.028394 | 4.581150 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | BBB | 0.957844 | 0.495432 | 0.141608 | 47.761126 | 0.042861 | 0.053770 | 0.177720 | 0.065354 | 0.046363 | ... | 4.012780 | 0.053770 | 8.293505 | 15.808147 | 0.058065 | 3.857790 | 0.0 | 0.0 | 0.0 | 1.0 |
5 rows × 30 columns
(3) Ratings & Oversampling through ADASYN¶
The following table shows the number of ratings. While some ratings including BBB,BB,A have sufficiently many numbers, others such as AAA, CC, C, and D have significantly low number: 7,5,2, and 1.
rating_order = ['AAA', 'AA', 'A', 'BBB', 'BB', 'B', 'CCC', 'CC', 'C' , 'D' ]
cor_rate_data['Rating'].value_counts().reindex(rating_order).to_frame().T
| AAA | AA | A | BBB | BB | B | CCC | CC | C | D | |
|---|---|---|---|---|---|---|---|---|---|---|
| Rating | 7 | 89 | 398 | 671 | 490 | 302 | 64 | 5 | 2 | 1 |
So, my purpose is to oversample the data of extremely low number of ratings. Before oversampling, I substituted the characters into numbers; AAA to 0, AA to 1 and so on. The oversampling method do not work well when there is only 1 data in a category, like D. So, I assigned same number to D as C. The number of companies that are classified as BBB was the largest; 671 companies are rated as BBB. So I tried to adjust all the ratings to have 671 data respectively.
rating_mapping = {'AAA': 0, 'AA': 1, 'A': 2, 'BBB' : 3, 'BB' : 4, 'B' : 5, 'CCC' : 6, 'CC' : 7, 'C' : 8, 'D' : 8}
cor_rate_data['Rating'] = cor_rate_data['Rating'].map(rating_mapping).fillna(method='ffill')
cor_rate_data.head()
| Rating | currentRatio | quickRatio | cashRatio | daysOfSalesOutstanding | netProfitMargin | pretaxProfitMargin | grossProfitMargin | operatingProfitMargin | returnOnAssets | ... | companyEquityMultiplier | ebitPerRevenue | enterpriseValueMultiple | operatingCashFlowPerShare | operatingCashFlowSalesRatio | payablesTurnover | Agency_Egan-Jones | Agency_Fitch | Agency_Moody's | Agency_S&P | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0.945894 | 0.426395 | 0.099690 | 44.203245 | 0.037480 | 0.049351 | 0.176631 | 0.061510 | 0.041189 | ... | 4.008012 | 0.049351 | 7.057088 | 15.565438 | 0.058638 | 3.906655 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 3 | 1.033559 | 0.498234 | 0.203120 | 38.991156 | 0.044062 | 0.048857 | 0.175715 | 0.066546 | 0.053204 | ... | 3.156783 | 0.048857 | 6.460618 | 15.914250 | 0.067239 | 4.002846 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 3 | 0.963703 | 0.451505 | 0.122099 | 50.841385 | 0.032709 | 0.044334 | 0.170843 | 0.059783 | 0.032497 | ... | 4.094575 | 0.044334 | 10.491970 | 18.888889 | 0.074426 | 3.483510 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 3 | 1.019851 | 0.510402 | 0.176116 | 41.161738 | 0.020894 | -0.012858 | 0.138059 | 0.042430 | 0.025690 | ... | 3.630950 | -0.012858 | 4.080741 | 6.901042 | 0.028394 | 4.581150 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 3 | 0.957844 | 0.495432 | 0.141608 | 47.761126 | 0.042861 | 0.053770 | 0.177720 | 0.065354 | 0.046363 | ... | 4.012780 | 0.053770 | 8.293505 | 15.808147 | 0.058065 | 3.857790 | 0.0 | 0.0 | 0.0 | 1.0 |
5 rows × 30 columns
Let's delve deeper into ADASYN. This method requires the number of nearest neighbors k (K-nearest neighbors). In this analysis, I set the number k = 2. In ADASYN, the majority class refers to the class with largest amount of data among all other classes, while minority class means the rest of the classes. For the $i$-th data point $x_i$ in the minority class, calculate the ratio:
$$r_i = \frac{\text{The number of majority-class neighbors of }x_i}{k}\in[0,1]$$
Then, normalize $r_i$: $g_i = \frac{r_i}{\sum_{j\in\text{minority group}} r_j}$, and determine the number of synthetic samples to generate, $G_i$:
$$G_i = g_i*G$$
where $G =$ (Number of majority class samples - Number of minority class samples). For each $x_i$, randomly select one of its $k$ neighbors $x_*$:
$$x_{new} = x_i + \lambda(x_* - x_i)$$
where $\lambda \sim \text{Unif}(0,1)$ (interpolation).
X = cor_rate_data.drop(columns=['Rating'])
y = cor_rate_data['Rating']
sampling_strategy = {label: 671 for label in set(y)}
adasyn = ADASYN(sampling_strategy = sampling_strategy, n_neighbors=2, random_state=2023311161)
X_resampled, y_resampled = adasyn.fit_resample(X, y)
y_resampled.value_counts().to_frame()
| Rating | |
|---|---|
| 5 | 721 |
| 4 | 690 |
| 6 | 672 |
| 8 | 672 |
| 3 | 671 |
| 0 | 671 |
| 7 | 670 |
| 2 | 666 |
| 1 | 654 |
The following code and barplot show the total number of ratings.
rating_remapping = {value: key for key, value in rating_mapping.items()} # switch key and value of rating_mapping dictionary
rating_remapping[8] = 'C'
rating_order2 = [x for x in rating_order if x != 'D']
y_resampled_rating = y_resampled.map(rating_remapping)
y_resampled_rating.value_counts().reindex(rating_order2).to_frame().T
| AAA | AA | A | BBB | BB | B | CCC | CC | C | |
|---|---|---|---|---|---|---|---|---|---|
| Rating | 671 | 654 | 666 | 671 | 690 | 721 | 672 | 670 | 672 |
plt.figure(figsize = (12,6))
y_resampled_rating.value_counts().reindex(rating_order2).plot(kind = 'bar')
plt.ylabel('counts')
plt.title('The Number of Ratings')
plt.show()
Now, combine the resampled rating data and resampled numerical (explanatory) data.
cor_rate_data_adasyn = pd.concat([y_resampled, X_resampled], axis = 1)
cor_rate_data_adasyn.head()
| Rating | currentRatio | quickRatio | cashRatio | daysOfSalesOutstanding | netProfitMargin | pretaxProfitMargin | grossProfitMargin | operatingProfitMargin | returnOnAssets | ... | companyEquityMultiplier | ebitPerRevenue | enterpriseValueMultiple | operatingCashFlowPerShare | operatingCashFlowSalesRatio | payablesTurnover | Agency_Egan-Jones | Agency_Fitch | Agency_Moody's | Agency_S&P | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0.945894 | 0.426395 | 0.099690 | 44.203245 | 0.037480 | 0.049351 | 0.176631 | 0.061510 | 0.041189 | ... | 4.008012 | 0.049351 | 7.057088 | 15.565438 | 0.058638 | 3.906655 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 3 | 1.033559 | 0.498234 | 0.203120 | 38.991156 | 0.044062 | 0.048857 | 0.175715 | 0.066546 | 0.053204 | ... | 3.156783 | 0.048857 | 6.460618 | 15.914250 | 0.067239 | 4.002846 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 3 | 0.963703 | 0.451505 | 0.122099 | 50.841385 | 0.032709 | 0.044334 | 0.170843 | 0.059783 | 0.032497 | ... | 4.094575 | 0.044334 | 10.491970 | 18.888889 | 0.074426 | 3.483510 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 3 | 1.019851 | 0.510402 | 0.176116 | 41.161738 | 0.020894 | -0.012858 | 0.138059 | 0.042430 | 0.025690 | ... | 3.630950 | -0.012858 | 4.080741 | 6.901042 | 0.028394 | 4.581150 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 3 | 0.957844 | 0.495432 | 0.141608 | 47.761126 | 0.042861 | 0.053770 | 0.177720 | 0.065354 | 0.046363 | ... | 4.012780 | 0.053770 | 8.293505 | 15.808147 | 0.058065 | 3.857790 | 0.0 | 0.0 | 0.0 | 1.0 |
5 rows × 30 columns
(4) Split data into Train and Test dataset¶
The ratio of train and test data is 80 : 20.
data_train, data_test = train_test_split(cor_rate_data_adasyn, test_size=0.2, random_state = 2023311161)
X_train, y_train = data_train.iloc[:,2:30], data_train.iloc[:,0]
X_test, y_test = data_test.iloc[:,2:30], data_test.iloc[:,0]
X_test.head()
| quickRatio | cashRatio | daysOfSalesOutstanding | netProfitMargin | pretaxProfitMargin | grossProfitMargin | operatingProfitMargin | returnOnAssets | returnOnCapitalEmployed | returnOnEquity | ... | companyEquityMultiplier | ebitPerRevenue | enterpriseValueMultiple | operatingCashFlowPerShare | operatingCashFlowSalesRatio | payablesTurnover | Agency_Egan-Jones | Agency_Fitch | Agency_Moody's | Agency_S&P | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2371 | 2.278170 | 0.111674 | 71.634958 | 0.144557 | 0.061013 | 0.640281 | 0.203167 | 0.072946 | 0.045269 | 0.169617 | ... | 2.305414 | 0.061013 | 17.261475 | 3.695267 | 0.327851 | 4.956799 | 0.000000 | 0.000000 | 0.0 | 1.000000 |
| 534 | 1.038975 | 0.380729 | 63.727017 | 0.059743 | 0.092966 | 0.192734 | 0.103564 | 0.046122 | 0.090953 | 0.109802 | ... | 2.380685 | 0.092966 | 8.102552 | 3.403397 | 0.079016 | 6.936468 | 0.000000 | 0.000000 | 1.0 | 0.000000 |
| 1071 | 1.229724 | 0.380645 | 46.646221 | 0.068476 | 0.103081 | 0.199890 | 0.118506 | 0.053858 | 0.096087 | 0.149840 | ... | 2.782151 | 0.103081 | 6.363313 | 3.485643 | 0.127798 | 10.028686 | 0.000000 | 0.000000 | 1.0 | 0.000000 |
| 5606 | 0.394453 | 0.026197 | 53.924592 | 0.196575 | 0.147173 | 0.998726 | -0.193032 | 0.050472 | 0.043942 | -0.404301 | ... | -11.054333 | 0.147173 | 20.605994 | 3.182294 | 0.083168 | 3.631803 | 0.014864 | 0.000000 | 0.0 | 0.985136 |
| 4943 | 0.549608 | 0.308271 | 23.011114 | -1.083307 | -1.194222 | 0.509708 | -1.100582 | -0.579432 | -0.725569 | 1.409955 | ... | -2.537786 | -1.194222 | -4.132979 | 5.529796 | 0.146863 | 3.738619 | 0.000000 | 0.534308 | 0.0 | 0.465692 |
5 rows × 28 columns
3. Machine Learning Models¶
This section compares several machine learning models to compare how well each model predicts the rating of the companies. We'll provide two measures for the prediction performance; accuracy and weighted $f_1$ score.
average = 'weighted'
(1) Ordered Logistic Regression¶
# OrderedModel in statsmodels is too poor.
LR_model = LogisticRegression(random_state=2023311161 ,
multi_class='multinomial',
solver='newton-cg',
max_iter=5000)
LR_model = LR_model.fit(X_train, y_train)
y_pred_LR = LR_model.predict(X_test)
acc_olr = accuracy_score(y_test, y_pred_LR)
print("Ordered Logistic Regression Accuracy :",acc_olr) # 0.5935
f1_olr = f1_score(y_test, y_pred_LR, average = average)
print("Ordered Logistic Regression f1 score :", f1_olr) # 0.5709
Ordered Logistic Regression Accuracy : 0.5935960591133005 Ordered Logistic Regression f1 score : 0.5709746447753064
C:\Users\user\anaconda3\lib\site-packages\sklearn\utils\optimize.py:210: ConvergenceWarning: newton-cg failed to converge. Increase the number of iterations. warnings.warn(
(2) LDA¶
LDA_model = LinearDiscriminantAnalysis()
LDA_model.fit(X_train,y_train)
y_pred_LDA = LDA_model.predict(X_test)
acc_LDA = accuracy_score(y_test, y_pred_LDA)
print("LDA Accuracy :",acc_LDA)
f1_LDA = f1_score(y_test, y_pred_LDA, average = average)
print("LDA f1 score :", f1_LDA)
LDA Accuracy : 0.48932676518883417 LDA f1 score : 0.4515765543613118
(3) QDA¶
QDA_model = QuadraticDiscriminantAnalysis()
QDA_model.fit(X_train,y_train)
y_pred_QDA = QDA_model.predict(X_test)
acc_QDA = accuracy_score(y_test, y_pred_QDA)
print("QDA Accuracy :",acc_QDA)
f1_QDA = f1_score(y_test, y_pred_QDA, average = average)
print("QDA f1 score :", f1_QDA)
QDA Accuracy : 0.535303776683087 QDA f1 score : 0.4714921035655885
C:\Users\user\anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:878: UserWarning: Variables are collinear
warnings.warn("Variables are collinear")
(4) KNN¶
KNN_model = KNeighborsClassifier(n_neighbors = 3)
KNN_model.fit(X_train,y_train)
y_pred_KNN = KNN_model.predict(X_test)
acc_KNN = accuracy_score(y_test, y_pred_KNN)
print("KNN Accuracy :",acc_KNN)
f1_KNN = f1_score(y_test, y_pred_KNN, average = average)
print("KNN f1 score :", f1_KNN)
KNN Accuracy : 0.7996715927750411 KNN f1 score : 0.7858740734584811
(5) SVM¶
SVC_model = svm.SVC(gamma= 'auto')
SVC_model.fit(X_train, y_train)
y_pred_SVM = SVC_model.predict(X_test)
acc_SVM = accuracy_score(y_test, y_pred_SVM)
print("SVM Accuracy :",acc_SVM)
f1_SVM = f1_score(y_test, y_pred_SVM, average = average)
print("SVM f1 score :", f1_SVM)
SVM Accuracy : 0.7955665024630542 SVM f1 score : 0.8121719550099245
(6) Random Forest¶
RF_model = RandomForestClassifier(random_state=2023311161)
RF_model.fit(X_train,y_train)
y_pred_RF = RF_model.predict(X_test)
acc_RF = accuracy_score(y_test, y_pred_RF)
print("Random Forest Accuracy :",acc_RF)
f1_RF = f1_score(y_test, y_pred_RF, average = average)
print("Random Forest f1 score :", f1_RF)
Random Forest Accuracy : 0.8538587848932676 Random Forest f1 score : 0.8512928695960994
(7) Gradient Boosting¶
GBT_model = GradientBoostingClassifier(random_state=2023311161)
GBT_model.fit(X_train, y_train)
y_pred_GBT = GBT_model.predict(X_test)
acc_GBT = accuracy_score(y_test, y_pred_GBT)
print("GBT Accuracy :",acc_GBT)
f1_GBT = f1_score(y_test, y_pred_GBT, average = average)
print("GBT f1 score :", f1_GBT)
GBT Accuracy : 0.7857142857142857 GBT f1 score : 0.7805326021015632
(8) XGBoost¶
xgb_model = XGBClassifier(objective='multi:softmax', num_class=len(y_train.unique()), random_state=2023311161)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
acc_xgb = accuracy_score(y_test, y_pred_xgb)
print("XGBoost accuracy :", acc_xgb)
f1_xgb = f1_score(y_test, y_pred_xgb, average = average)
print("XGBoost f1 score :", f1_xgb)
XGBoost accuracy : 0.8702791461412152 XGBoost f1 score : 0.8690958617082526
(9) LightGBM¶
lgb_model = lgb.LGBMClassifier(random_state=2023311161)
lgb_model.fit(X_train, y_train)
y_pred_lightGBM = lgb_model.predict(X_test)
acc_lightGBM = accuracy_score(y_test, y_pred_lightGBM)
print("LightGBM accuracy :", acc_lightGBM)
f1_lightGBM = f1_score(y_test, y_pred_lightGBM, average = average)
print("LightGBM f1 score :", f1_lightGBM)
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001712 seconds. You can set `force_col_wise=true` to remove the overhead. [LightGBM] [Info] Total Bins 7066 [LightGBM] [Info] Number of data points in the train set: 4869, number of used features: 28 [LightGBM] [Info] Start training from score -2.221548 [LightGBM] [Info] Start training from score -2.219655 [LightGBM] [Info] Start training from score -2.225343 [LightGBM] [Info] Start training from score -2.195378 [LightGBM] [Info] Start training from score -2.202785 [LightGBM] [Info] Start training from score -2.131070 [LightGBM] [Info] Start training from score -2.214000 [LightGBM] [Info] Start training from score -2.188025 [LightGBM] [Info] Start training from score -2.180726 [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf LightGBM accuracy : 0.8735632183908046 LightGBM f1 score : 0.8737808912916817
4. Results¶
(1) Accuracy¶
The following barplot compares the accuracies of each model in ascending order. The LightGBM, XGBoost, and random forest show the highest accuracies among all of the models; around 87%. However, ordered logistic, LDA, and QDA have relatively low accuracies; around 50 ~ 60%.
models = ['Ordered Logistic', 'LDA','QDA', 'KNN','SVC','Random Forest','Gradient Boosting','XGBoost','LightGBM']
accuracy = [acc_olr, acc_LDA, acc_QDA, acc_KNN, acc_SVM, acc_RF, acc_GBT, acc_xgb, acc_lightGBM]
df_acc = pd.DataFrame({'model' : models, 'accuracy' : accuracy})
df_acc = df_acc.sort_values('accuracy', ascending = True)
plt.figure(figsize = (12,6))
ax = sns.barplot(data=df_acc, x='model', y='accuracy', palette='viridis')
# Add accuracy values on top of each bar
for p in ax.patches:
ax.annotate(f'{p.get_height():.3f}', # Format with two decimals
(p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='bottom', fontsize=10)
plt.ylabel('Accuracy')
plt.title('The accuracies for distinct models')
plt.show()
(2) weighted average $F_1$ score¶
This section suggests different measure, the weighted average $F_1$ score. Suppose $C$ is the total number of classes, $n_k$ is the number of samples in class $k$, $w_k := n_k/\sum_{j=1}^C n_j$ is the weight of class $k$. The precision and recall, another type of measure for performance, are defined as follows:
$$P_k = \frac{\text{True Positive}_k}{\text{True Positive}_k+\text{False Positive}_k}$$
$$R_k = \frac{\text{True Positive}_k}{\text{True Positive}_k+\text{False Negative}_k}$$
Then, the class-wise $F_1$ score is $F_{1,(k)} = 2\frac{P_k\cdot R_k}{P_k + R_k}$, and thus the weighted average $F_1$ score is
$$F_{1,weighted} := \sum_{k=1}^C w_k\cdot F_{1,(k)}$$
The following barplot compares the weighted $F_1$ scores of each model in ascending order. The LightGBM, XGBoost, and random forest show the highest weighted $F_1$ scores among all of the models; around 87%. However, ordered logistic, LDA, and QDA have relatively low weighted $F_1$ scores; around 45 ~ 57%.
f1_scores = [f1_olr, f1_LDA, f1_QDA, f1_KNN, f1_SVM, f1_RF, f1_GBT, f1_xgb, f1_lightGBM]
df_f1 = pd.DataFrame({'model' : models, 'f1_score' : f1_scores})
df_f1 = df_f1.sort_values('f1_score', ascending = True)
plt.figure(figsize = (12,6))
ax = sns.barplot(data=df_f1, x='model', y='f1_score', color='skyblue')
for p in ax.patches:
ax.annotate(f'{p.get_height():.3f}', # Format with two decimals
(p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='bottom', fontsize=10)
plt.ylabel('f1 score')
plt.title('The f1 scores for distinct models')
plt.show()
(3) Feature Importance¶
This barplot shows how important each variable is.
df_LGB_importance = pd.DataFrame({'feature' : X_train.columns,
'importance' : lgb_model.feature_importances_})
df_LGB_importance = df_LGB_importance.sort_values(by = 'importance', ascending = True)
plt.figure(figsize = (16,10))
bars = plt.barh(df_LGB_importance['feature'], df_LGB_importance['importance'], color = 'green')
for bar, value in zip(bars, df_LGB_importance['importance']):
plt.text(bar.get_width() + 0.01, bar.get_y() + bar.get_height() / 2,
f'{value:.0f}', va='center', fontsize=9, color='black')
plt.xlabel('Feature Importance')
plt.title('Feature Importance - LightGBM')
plt.show()
The following table shows the top 5 most important variables.
df_LGB_importance = df_LGB_importance.sort_values(by = 'importance', ascending = False)
df_LGB_importance['feature'].iloc[:5].to_frame().T
| 2 | 17 | 0 | 23 | 20 | |
|---|---|---|---|---|---|
| feature | daysOfSalesOutstanding | cashPerShare | quickRatio | payablesTurnover | enterpriseValueMultiple |
5. Other Comments¶
Among all the models, the LDA,QDA, and ordered logistic regression models show relatively low performace. This section discusses on some probable reasons of it.
(1) Outliers¶
Most of the columns of numerical variables have extreme values. For example, the minimum and the values between Q1 and Q3 of the column currentRatio are -0.93 and 1.07 ~ 2.16, but its maximum is 1725.5. These outlier values are likely to distort the models' performace.
Especially, the LDA as well as QDA assumes the explanatory variable to follow normal distribution.
cor_rate_data.loc[:,"currentRatio":"payablesTurnover"].describe()
| currentRatio | quickRatio | cashRatio | daysOfSalesOutstanding | netProfitMargin | pretaxProfitMargin | grossProfitMargin | operatingProfitMargin | returnOnAssets | returnOnCapitalEmployed | ... | effectiveTaxRate | freeCashFlowOperatingCashFlowRatio | freeCashFlowPerShare | cashPerShare | companyEquityMultiplier | ebitPerRevenue | enterpriseValueMultiple | operatingCashFlowPerShare | operatingCashFlowSalesRatio | payablesTurnover | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2029.000000 | 2029.000000 | 2029.000000 | 2029.000000 | 2029.000000 | 2029.000000 | 2029.000000 | 2029.000000 | 2029.000000 | 2029.000000 | ... | 2029.000000 | 2029.000000 | 2.029000e+03 | 2.029000e+03 | 2029.000000 | 2029.000000 | 2029.000000 | 2.029000e+03 | 2029.000000 | 2029.000000 |
| mean | 3.529607 | 2.653986 | 0.667364 | 333.795606 | 0.278447 | 0.431483 | 0.497968 | 0.587322 | -37.517928 | -73.974193 | ... | 0.397572 | 0.409550 | 5.094719e+03 | 4.227549e+03 | 3.323579 | 0.437454 | 48.287985 | 6.515123e+03 | 1.447653 | 38.002718 |
| std | 44.052361 | 32.944817 | 3.583943 | 4447.839583 | 6.064134 | 8.984982 | 0.525307 | 11.224622 | 1166.172220 | 2350.275719 | ... | 10.595075 | 3.796488 | 1.469156e+05 | 1.224000e+05 | 87.529866 | 8.984299 | 529.118961 | 1.775290e+05 | 19.483294 | 758.923588 |
| min | -0.932005 | -1.893266 | -0.192736 | -811.845623 | -101.845815 | -124.343612 | -14.800817 | -124.343612 | -40213.178290 | -87162.162160 | ... | -100.611015 | -120.916010 | -4.912742e+03 | -1.915035e+01 | -2555.419643 | -124.343612 | -3749.921337 | -1.195049e+04 | -4.461837 | -76.662850 |
| 25% | 1.071930 | 0.602825 | 0.130630 | 22.905093 | 0.021006 | 0.025649 | 0.233127 | 0.044610 | 0.019176 | 0.028112 | ... | 0.146854 | 0.271478 | 4.119924e-01 | 1.566038e+00 | 2.046822 | 0.028057 | 6.238066 | 2.356735e+00 | 0.073886 | 2.205912 |
| 50% | 1.493338 | 0.985679 | 0.297493 | 42.374120 | 0.064753 | 0.084965 | 0.414774 | 0.107895 | 0.045608 | 0.074421 | ... | 0.300539 | 0.644529 | 2.131742e+00 | 3.686513e+00 | 2.652456 | 0.087322 | 9.274398 | 4.352584e+00 | 0.133050 | 5.759722 |
| 75% | 2.166891 | 1.453820 | 0.624906 | 59.323563 | 0.114807 | 0.144763 | 0.849693 | 0.176181 | 0.077468 | 0.135036 | ... | 0.370653 | 0.836949 | 4.230253e+00 | 8.086152e+00 | 3.658331 | 0.149355 | 12.911759 | 7.319759e+00 | 0.240894 | 9.480892 |
| max | 1725.505005 | 1139.541703 | 125.917417 | 115961.637400 | 198.517873 | 309.694856 | 2.702533 | 410.182214 | 0.487826 | 2.439504 | ... | 429.926282 | 34.594086 | 5.753380e+06 | 4.786803e+06 | 2562.871795 | 309.694856 | 11153.607090 | 6.439270e+06 | 688.526591 | 20314.880400 |
8 rows × 25 columns
(2) Multicollinearity¶
If there is multicollinearity between covariances, this would lead to poor performance of forecasting. The following table shows the correlations of all variables.
cor_rate_data_correlation = cor_rate_data.loc[:,"currentRatio":"payablesTurnover"].corr()
cor_rate_data_correlation
| currentRatio | quickRatio | cashRatio | daysOfSalesOutstanding | netProfitMargin | pretaxProfitMargin | grossProfitMargin | operatingProfitMargin | returnOnAssets | returnOnCapitalEmployed | ... | effectiveTaxRate | freeCashFlowOperatingCashFlowRatio | freeCashFlowPerShare | cashPerShare | companyEquityMultiplier | ebitPerRevenue | enterpriseValueMultiple | operatingCashFlowPerShare | operatingCashFlowSalesRatio | payablesTurnover | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| currentRatio | 1.000000 | 0.104329 | 0.736042 | 0.000020 | 0.003561 | -0.002453 | 0.067326 | -0.001224 | 0.001623 | 0.001583 | ... | -0.026461 | 0.005863 | -0.001307 | -0.001266 | 0.000118 | -0.002489 | -0.002013 | -0.001453 | -0.001439 | -0.002160 |
| quickRatio | 0.104329 | 1.000000 | 0.125848 | 0.736630 | -0.001931 | -0.002208 | -0.028343 | -0.001820 | 0.001539 | 0.001508 | ... | -0.003940 | 0.001713 | -0.001427 | -0.000929 | -0.000365 | -0.002241 | -0.002843 | -0.001587 | -0.001453 | -0.001759 |
| cashRatio | 0.736042 | 0.125848 | 1.000000 | 0.006616 | -0.007936 | -0.006837 | -0.050355 | -0.000723 | 0.003326 | 0.003280 | ... | -0.024406 | -0.013719 | -0.001778 | -0.001652 | -0.001883 | -0.006883 | -0.007945 | -0.002293 | 0.004459 | -0.006915 |
| daysOfSalesOutstanding | 0.000020 | 0.736630 | 0.006616 | 1.000000 | 0.262299 | 0.278222 | -0.068181 | 0.291002 | 0.002416 | 0.002364 | ... | -0.000926 | 0.003267 | -0.002155 | -0.000834 | -0.000242 | 0.278204 | -0.019808 | -0.002283 | 0.406126 | -0.003026 |
| netProfitMargin | 0.003561 | -0.001931 | -0.007936 | 0.262299 | 1.000000 | 0.991241 | -0.099540 | 0.971483 | 0.001577 | 0.001535 | ... | 0.001500 | -0.004830 | -0.000708 | -0.000716 | -0.001185 | 0.991185 | -0.003665 | -0.000779 | 0.785592 | -0.001681 |
| pretaxProfitMargin | -0.002453 | -0.002208 | -0.006837 | 0.278222 | 0.991241 | 1.000000 | -0.135662 | 0.992001 | 0.001637 | 0.001598 | ... | -0.000097 | -0.005518 | -0.000976 | -0.000977 | -0.001148 | 0.999975 | -0.003867 | -0.001080 | 0.831778 | -0.001760 |
| grossProfitMargin | 0.067326 | -0.028343 | -0.050355 | -0.068181 | -0.099540 | -0.135662 | 1.000000 | -0.121829 | -0.030745 | -0.030128 | ... | 0.018387 | 0.016868 | 0.012775 | 0.012781 | 0.010026 | -0.135715 | -0.008007 | 0.011131 | -0.105564 | 0.034244 |
| operatingProfitMargin | -0.001224 | -0.001820 | -0.000723 | 0.291002 | 0.971483 | 0.992001 | -0.121829 | 1.000000 | 0.001675 | 0.001634 | ... | 0.000173 | -0.005949 | -0.001205 | -0.001205 | -0.000962 | 0.992018 | -0.003699 | -0.001305 | 0.871686 | -0.002042 |
| returnOnAssets | 0.001623 | 0.001539 | 0.003326 | 0.002416 | 0.001577 | 0.001637 | -0.030745 | 0.001675 | 1.000000 | 0.995426 | ... | 0.001214 | 0.029072 | -0.001283 | 0.001115 | 0.002629 | 0.001659 | 0.002742 | 0.002275 | 0.002385 | 0.001616 |
| returnOnCapitalEmployed | 0.001583 | 0.001508 | 0.003280 | 0.002364 | 0.001535 | 0.001598 | -0.030128 | 0.001634 | 0.995426 | 1.000000 | ... | 0.001183 | 0.029901 | -0.001162 | 0.001090 | 0.002559 | 0.001619 | 0.002883 | 0.002145 | 0.002334 | 0.001581 |
| returnOnEquity | -0.001644 | -0.001562 | -0.003402 | -0.002444 | -0.001554 | -0.001623 | 0.031101 | -0.001669 | -0.995371 | -0.981650 | ... | -0.001216 | -0.027981 | 0.001368 | -0.001147 | -0.002498 | -0.001644 | -0.002975 | -0.002419 | -0.002412 | -0.001628 |
| assetTurnover | -0.001951 | -0.001854 | -0.003991 | -0.002887 | -0.001831 | -0.001909 | 0.036758 | -0.001974 | -0.822513 | -0.802676 | ... | -0.001444 | -0.022836 | 0.000736 | -0.001329 | -0.003321 | -0.001934 | -0.003633 | -0.002390 | -0.002834 | -0.001926 |
| fixedAssetTurnover | -0.001944 | -0.001845 | -0.003940 | -0.002884 | -0.001830 | -0.001907 | 0.036731 | -0.001973 | -0.810064 | -0.788302 | ... | -0.001442 | -0.022273 | 0.000733 | -0.001328 | -0.003324 | -0.001933 | -0.003634 | -0.002395 | -0.002834 | -0.001925 |
| debtEquityRatio | 0.000093 | -0.000366 | -0.001877 | -0.000231 | -0.001183 | -0.001149 | 0.010051 | -0.000964 | 0.002262 | 0.002200 | ... | 0.003429 | 0.000507 | -0.012039 | -0.009003 | 0.999995 | -0.001148 | 0.010820 | -0.015896 | -0.000781 | -0.000212 |
| debtRatio | 0.005856 | 0.002901 | -0.008638 | 0.001022 | -0.022625 | -0.024115 | 0.018127 | -0.024509 | -0.051972 | -0.051081 | ... | -0.001024 | -0.058294 | -0.011503 | -0.013107 | 0.007100 | -0.024150 | 0.051639 | -0.004820 | -0.021119 | -0.004237 |
| effectiveTaxRate | -0.026461 | -0.003940 | -0.024406 | -0.000926 | 0.001500 | -0.000097 | 0.018387 | 0.000173 | 0.001214 | 0.001183 | ... | 1.000000 | 0.005496 | -0.001528 | -0.001562 | 0.003449 | -0.000029 | -0.004558 | -0.001771 | -0.000724 | -0.000206 |
| freeCashFlowOperatingCashFlowRatio | 0.005863 | 0.001713 | -0.013719 | 0.003267 | -0.004830 | -0.005518 | 0.016868 | -0.005949 | 0.029072 | 0.029901 | ... | 0.005496 | 1.000000 | 0.003497 | 0.003662 | 0.000510 | -0.005449 | -0.002982 | 0.003602 | -0.003585 | -0.003625 |
| freeCashFlowPerShare | -0.001307 | -0.001427 | -0.001778 | -0.002155 | -0.000708 | -0.000976 | 0.012775 | -0.001205 | -0.001283 | -0.001162 | ... | -0.001528 | 0.003497 | 1.000000 | 0.997277 | -0.012076 | -0.000999 | -0.003156 | 0.992371 | -0.002066 | -0.001476 |
| cashPerShare | -0.001266 | -0.000929 | -0.001652 | -0.000834 | -0.000716 | -0.000977 | 0.012781 | -0.001205 | 0.001115 | 0.001090 | ... | -0.001562 | 0.003662 | 0.997277 | 1.000000 | -0.009027 | -0.001000 | -0.003172 | 0.986459 | -0.001514 | -0.001281 |
| companyEquityMultiplier | 0.000118 | -0.000365 | -0.001883 | -0.000242 | -0.001185 | -0.001148 | 0.010026 | -0.000962 | 0.002629 | 0.002559 | ... | 0.003449 | 0.000510 | -0.012076 | -0.009027 | 1.000000 | -0.001146 | 0.010804 | -0.015945 | -0.000804 | -0.000207 |
| ebitPerRevenue | -0.002489 | -0.002241 | -0.006883 | 0.278204 | 0.991185 | 0.999975 | -0.135715 | 0.992018 | 0.001659 | 0.001619 | ... | -0.000029 | -0.005449 | -0.000999 | -0.001000 | -0.001146 | 1.000000 | -0.003925 | -0.001104 | 0.831789 | -0.001790 |
| enterpriseValueMultiple | -0.002013 | -0.002843 | -0.007945 | -0.019808 | -0.003665 | -0.003867 | -0.008007 | -0.003699 | 0.002742 | 0.002883 | ... | -0.004558 | -0.002982 | -0.003156 | -0.003172 | 0.010804 | -0.003925 | 1.000000 | -0.003337 | -0.005003 | -0.001183 |
| operatingCashFlowPerShare | -0.001453 | -0.001587 | -0.002293 | -0.002283 | -0.000779 | -0.001080 | 0.011131 | -0.001305 | 0.002275 | 0.002145 | ... | -0.001771 | 0.003602 | 0.992371 | 0.986459 | -0.015945 | -0.001104 | -0.003337 | 1.000000 | -0.002220 | -0.000302 |
| operatingCashFlowSalesRatio | -0.001439 | -0.001453 | 0.004459 | 0.406126 | 0.785592 | 0.831778 | -0.105564 | 0.871686 | 0.002385 | 0.002334 | ... | -0.000724 | -0.003585 | -0.002066 | -0.001514 | -0.000804 | 0.831789 | -0.005003 | -0.002220 | 1.000000 | -0.003337 |
| payablesTurnover | -0.002160 | -0.001759 | -0.006915 | -0.003026 | -0.001681 | -0.001760 | 0.034244 | -0.002042 | 0.001616 | 0.001581 | ... | -0.000206 | -0.003625 | -0.001476 | -0.001281 | -0.000207 | -0.001790 | -0.001183 | -0.000302 | -0.003337 | 1.000000 |
25 rows × 25 columns
The following image plot visualizes the correlation between each variables. This shows that there are some variables that have strong correlation (whether positive or negative) with other ones. For example, netProfitMargin column has strong correlation with pretaxProfitMargin (0.991241), operatingProfitMargin (0.971483), ebitPerRevenue (0.991185), and operatingCashFlowSalesRatio (0.785592).
plt.figure(figsize=(16, 12))
plt.imshow(cor_rate_data_correlation, cmap='coolwarm', interpolation='none', aspect='auto')
plt.colorbar(label='Correlation Coefficient')
plt.xticks(range(len(cor_rate_data_correlation.index)), cor_rate_data_correlation.index, rotation=90)
plt.yticks(range(len(cor_rate_data_correlation.index)), cor_rate_data_correlation.index)
plt.title("Correlation Matrix", fontsize=30)
plt.tight_layout()
plt.show()
This multicollinearity has severe impact on QDA, because QDA relies on the covariance matrices of the classes. Also, the multicollinearity can cause the ordered logistic model to fail to converge, leading to unreliable estimates.
6. Conclusion¶
The models based on gradient boosting, including LightGBM and XGBoost, as well as Random Forest, have the best performance of forecasting the ratings of corporate credit, without dealing with outliers and multicollinearity. If such methods as LDA and Ordered Logistic Regression are necessary, handling the outliers (i.e., treating them as missing values) and deleting columns with high correlation would lead to higher performance than before.