    0. Purpose of the Project¶

    A corporate credit rating evaluates the likelihood that a company will repay its debt, i.e., its default probability. This notebook compares several models for predicting corporate credit ratings. The rating determines a company's cost of capital and aids investment decision-making. It also helps financial institutions (banks, insurance companies, etc.) manage financial risk and contributes to market stability.

    This rating is issued by credit rating agencies such as S&P, Moody's, and Fitch, which represent credit risk using similar rating scales: AAA, AA, A, BBB, and so on. When an agency downgrades a rating, the bond price decreases and its interest rate increases; in contrast, an upgrade raises the bond price and lowers the interest rate. The agencies' impact on financial markets is significant: more funds flow to companies with higher credit ratings, and various macroeconomic indicators are influenced as well.

    The purpose of this project is to compare several models for corporate credit rating and to suggest the best one. The data comes from the Corporate Credit Rating dataset on Kaggle: 2,029 credit ratings issued by major agencies such as S&P between 2010 and 2016. The rated companies are listed on Nasdaq and NYSE, and there are no missing values.

    1. Install and Import Packages¶

    This analysis requires several extra packages: imblearn for handling class imbalance, and xgboost and lightgbm for boosting algorithms. After installing them, we import the required modules from sklearn, imblearn, xgboost, and lightgbm, along with basic packages such as pandas and matplotlib.pyplot.

    In [ ]:
    pip install imblearn
    
    Requirement already satisfied: imblearn in c:\users\user\anaconda3\lib\site-packages (0.0)
    Note: you may need to restart the kernel to use updated packages.
    Requirement already satisfied: imbalanced-learn in c:\users\user\anaconda3\lib\site-packages (from imblearn) (0.12.4)
    Requirement already satisfied: scikit-learn>=1.0.2 in c:\users\user\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (1.0.2)
    Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\user\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (2.2.0)
    Requirement already satisfied: joblib>=1.1.1 in c:\users\user\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (1.4.2)
    Requirement already satisfied: scipy>=1.5.0 in c:\users\user\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (1.7.3)
    Requirement already satisfied: numpy>=1.17.3 in c:\users\user\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (1.22.0)
    
    
    In [ ]:
    pip install xgboost
    
    Requirement already satisfied: xgboost in c:\users\user\anaconda3\lib\site-packages (2.1.3)
    Requirement already satisfied: scipy in c:\users\user\anaconda3\lib\site-packages (from xgboost) (1.7.3)
    Requirement already satisfied: numpy in c:\users\user\anaconda3\lib\site-packages (from xgboost) (1.22.0)
    Note: you may need to restart the kernel to use updated packages.
    
    In [ ]:
    pip install lightgbm
    
    Requirement already satisfied: lightgbm in c:\users\user\anaconda3\lib\site-packages (4.5.0)
    Requirement already satisfied: scipy in c:\users\user\anaconda3\lib\site-packages (from lightgbm) (1.7.3)
    Requirement already satisfied: numpy>=1.17.0 in c:\users\user\anaconda3\lib\site-packages (from lightgbm) (1.22.0)
    Note: you may need to restart the kernel to use updated packages.
    
    In [ ]:
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from sklearn.metrics import f1_score
    from imblearn.over_sampling import ADASYN
    
    from sklearn.linear_model import LogisticRegression
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn import svm
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import GradientBoostingClassifier
    from xgboost import XGBClassifier
    import lightgbm as lgb
    
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    

    2. EDA & Data Preprocess¶

    The data corporate_rating.csv has 31 columns: 6 categorical and 25 numerical.

    The numerical columns can be grouped into 4 sections. The first is Liquidity, the ability of assets to be converted into cash, with variables such as currentRatio, quickRatio, cashRatio, daysOfSalesOutstanding... The second is Profitability Indicators, including grossProfitMargin, operatingProfitMargin, pretaxProfitMargin... The third is Debt Indicators, such as debtRatio, debtEquityRatio... The last is Cash Flow, which includes operatingCashFlowPerShare, freeCashFlowPerShare, cashPerShare... A few columns fall outside these sections, such as assetTurnover (asset turnover ratio).

    The 6 categorical columns are Rating, Name, Symbol, Rating Agency Name, Date, and Sector.

    In [ ]:
    cor_rate_data = pd.read_csv('corporate_rating.csv')
    cor_rate_data.head()
    
    Out[ ]:
    Rating Name Symbol Rating Agency Name Date Sector currentRatio quickRatio cashRatio daysOfSalesOutstanding ... effectiveTaxRate freeCashFlowOperatingCashFlowRatio freeCashFlowPerShare cashPerShare companyEquityMultiplier ebitPerRevenue enterpriseValueMultiple operatingCashFlowPerShare operatingCashFlowSalesRatio payablesTurnover
    0 A Whirlpool Corporation WHR Egan-Jones Ratings Company 11/27/2015 Consumer Durables 0.945894 0.426395 0.099690 44.203245 ... 0.202716 0.437551 6.810673 9.809403 4.008012 0.049351 7.057088 15.565438 0.058638 3.906655
    1 BBB Whirlpool Corporation WHR Egan-Jones Ratings Company 2/13/2014 Consumer Durables 1.033559 0.498234 0.203120 38.991156 ... 0.074155 0.541997 8.625473 17.402270 3.156783 0.048857 6.460618 15.914250 0.067239 4.002846
    2 BBB Whirlpool Corporation WHR Fitch Ratings 3/6/2015 Consumer Durables 0.963703 0.451505 0.122099 50.841385 ... 0.214529 0.513185 9.693487 13.103448 4.094575 0.044334 10.491970 18.888889 0.074426 3.483510
    3 BBB Whirlpool Corporation WHR Fitch Ratings 6/15/2012 Consumer Durables 1.019851 0.510402 0.176116 41.161738 ... 1.816667 -0.147170 -1.015625 14.440104 3.630950 -0.012858 4.080741 6.901042 0.028394 4.581150
    4 BBB Whirlpool Corporation WHR Standard & Poor's Ratings Services 10/24/2016 Consumer Durables 0.957844 0.495432 0.141608 47.761126 ... 0.166966 0.451372 7.135348 14.257556 4.012780 0.053770 8.293505 15.808147 0.058065 3.857790

    5 rows × 31 columns

    (1) Rating Agency¶

    There are 5 rating agencies: S&P, Egan-Jones, Moody's, Fitch, and DBRS.

    In [ ]:
    cor_rate_data['Rating Agency Name'].value_counts().to_frame()
    
    Out[ ]:
    Rating Agency Name
    Standard & Poor's Ratings Services 744
    Egan-Jones Ratings Company 603
    Moody's Investors Service 579
    Fitch Ratings 100
    DBRS 3

    The following mapping shortens the rating agency names so that the barplot below stays tidy.

    In [ ]:
    agency_mapping = {"Standard & Poor's Ratings Services" : "S&P",
                      "Egan-Jones Ratings Company" : "Egan-Jones",
                      "Moody's Investors Service" : "Moody's",
                      "Fitch Ratings" : "Fitch",
                      "DBRS" : "DBRS"}
    
    cor_rate_data['Rating Agency Name'] = cor_rate_data['Rating Agency Name'].map(agency_mapping)
    cor_rate_data['Rating Agency Name'].value_counts().to_frame()
    
    Out[ ]:
    Rating Agency Name
    S&P 744
    Egan-Jones 603
    Moody's 579
    Fitch 100
    DBRS 3

    The evaluations of the companies are mainly done by 3 agencies: S&P, Egan-Jones, and Moody's.

    In [ ]:
    plt.figure(figsize = (10,6))
    cor_rate_data['Rating Agency Name'].value_counts().plot(kind = 'bar',
                                                            color=plt.cm.tab20.colors[:len(cor_rate_data['Rating Agency Name'].value_counts())])
    plt.title('Frequency of Rating Agencies', fontsize = 14)
    plt.xlabel('Name of Agencies')
    plt.ylabel('Counts')
    plt.show()
    

    Instead of deleting the agency column, I decided to check how the agencies affect the companies' ratings. However, simply label-encoding the agencies as integers (0 to 4) would impose an artificial ordering and may not fit the assumptions of models such as LDA and QDA, which assume the explanatory variables are (approximately) normally distributed within each class. So I generated dummy variables (one-hot encoding) for the rating agencies.

    In [ ]:
    cor_rate_data = pd.get_dummies(cor_rate_data,
                                   columns = ['Rating Agency Name'],
                                   prefix = 'Agency',
                                   drop_first=True,
                                   dtype = float)
    cor_rate_data.head()
    
    Out[ ]:
    Rating Name Symbol Date Sector currentRatio quickRatio cashRatio daysOfSalesOutstanding netProfitMargin ... companyEquityMultiplier ebitPerRevenue enterpriseValueMultiple operatingCashFlowPerShare operatingCashFlowSalesRatio payablesTurnover Agency_Egan-Jones Agency_Fitch Agency_Moody's Agency_S&P
    0 A Whirlpool Corporation WHR 11/27/2015 Consumer Durables 0.945894 0.426395 0.099690 44.203245 0.037480 ... 4.008012 0.049351 7.057088 15.565438 0.058638 3.906655 1.0 0.0 0.0 0.0
    1 BBB Whirlpool Corporation WHR 2/13/2014 Consumer Durables 1.033559 0.498234 0.203120 38.991156 0.044062 ... 3.156783 0.048857 6.460618 15.914250 0.067239 4.002846 1.0 0.0 0.0 0.0
    2 BBB Whirlpool Corporation WHR 3/6/2015 Consumer Durables 0.963703 0.451505 0.122099 50.841385 0.032709 ... 4.094575 0.044334 10.491970 18.888889 0.074426 3.483510 0.0 1.0 0.0 0.0
    3 BBB Whirlpool Corporation WHR 6/15/2012 Consumer Durables 1.019851 0.510402 0.176116 41.161738 0.020894 ... 3.630950 -0.012858 4.080741 6.901042 0.028394 4.581150 0.0 1.0 0.0 0.0
    4 BBB Whirlpool Corporation WHR 10/24/2016 Consumer Durables 0.957844 0.495432 0.141608 47.761126 0.042861 ... 4.012780 0.053770 8.293505 15.808147 0.058065 3.857790 0.0 0.0 0.0 1.0

    5 rows × 34 columns

    (2) Sectors and other categorical variables¶

    The following code and barplot show the number of companies in each sector, including energy, health care, finance, etc.

    In [ ]:
    cor_rate_data['Sector'].value_counts().to_frame()
    
    Out[ ]:
    Sector
    Energy 294
    Basic Industries 260
    Consumer Services 250
    Technology 234
    Capital Goods 233
    Public Utilities 211
    Health Care 171
    Consumer Non-Durables 132
    Consumer Durables 74
    Transportation 63
    Miscellaneous 57
    Finance 50
    In [ ]:
    plt.figure(figsize = (12,6))
    cor_rate_data['Sector'].value_counts().plot(kind = 'bar', color = 'orange')
    plt.title('Distribution Of Sectors')
    plt.ylabel('Counts')
    plt.show()
    

    I then dropped the remaining categorical variables, including the 'Sector' column, keeping only 'Rating'.

    In [ ]:
    cor_rate_data = cor_rate_data.drop(columns = ['Sector','Name','Date', 'Symbol'])
    cor_rate_data.head()
    
    Out[ ]:
    Rating currentRatio quickRatio cashRatio daysOfSalesOutstanding netProfitMargin pretaxProfitMargin grossProfitMargin operatingProfitMargin returnOnAssets ... companyEquityMultiplier ebitPerRevenue enterpriseValueMultiple operatingCashFlowPerShare operatingCashFlowSalesRatio payablesTurnover Agency_Egan-Jones Agency_Fitch Agency_Moody's Agency_S&P
    0 A 0.945894 0.426395 0.099690 44.203245 0.037480 0.049351 0.176631 0.061510 0.041189 ... 4.008012 0.049351 7.057088 15.565438 0.058638 3.906655 1.0 0.0 0.0 0.0
    1 BBB 1.033559 0.498234 0.203120 38.991156 0.044062 0.048857 0.175715 0.066546 0.053204 ... 3.156783 0.048857 6.460618 15.914250 0.067239 4.002846 1.0 0.0 0.0 0.0
    2 BBB 0.963703 0.451505 0.122099 50.841385 0.032709 0.044334 0.170843 0.059783 0.032497 ... 4.094575 0.044334 10.491970 18.888889 0.074426 3.483510 0.0 1.0 0.0 0.0
    3 BBB 1.019851 0.510402 0.176116 41.161738 0.020894 -0.012858 0.138059 0.042430 0.025690 ... 3.630950 -0.012858 4.080741 6.901042 0.028394 4.581150 0.0 1.0 0.0 0.0
    4 BBB 0.957844 0.495432 0.141608 47.761126 0.042861 0.053770 0.177720 0.065354 0.046363 ... 4.012780 0.053770 8.293505 15.808147 0.058065 3.857790 0.0 0.0 0.0 1.0

    5 rows × 30 columns

    (3) Ratings & Oversampling through ADASYN¶

    The following table shows the number of observations per rating. While some ratings, such as BBB, BB, and A, have plenty of observations, others, such as AAA, CC, C, and D, have very few: 7, 5, 2, and 1, respectively.

    In [ ]:
    rating_order = ['AAA', 'AA', 'A', 'BBB', 'BB', 'B', 'CCC', 'CC', 'C' , 'D' ]
    cor_rate_data['Rating'].value_counts().reindex(rating_order).to_frame().T
    
    Out[ ]:
    AAA AA A BBB BB B CCC CC C D
    Rating 7 89 398 671 490 302 64 5 2 1

    So, the goal is to oversample the ratings with extremely few observations. Before oversampling, I mapped the letter grades to integers: AAA to 0, AA to 1, and so on. Oversampling does not work when a class has only 1 observation, like D, so I assigned D the same number as C. Since BBB is the largest class, with 671 companies, I adjusted every rating class to have 671 observations.

    In [ ]:
    rating_mapping = {'AAA': 0, 'AA': 1, 'A': 2, 'BBB' : 3, 'BB' : 4, 'B' : 5, 'CCC' : 6, 'CC' : 7, 'C' : 8, 'D' : 8}
    cor_rate_data['Rating'] = cor_rate_data['Rating'].map(rating_mapping).fillna(method='ffill')
    cor_rate_data.head()
    
    Out[ ]:
    Rating currentRatio quickRatio cashRatio daysOfSalesOutstanding netProfitMargin pretaxProfitMargin grossProfitMargin operatingProfitMargin returnOnAssets ... companyEquityMultiplier ebitPerRevenue enterpriseValueMultiple operatingCashFlowPerShare operatingCashFlowSalesRatio payablesTurnover Agency_Egan-Jones Agency_Fitch Agency_Moody's Agency_S&P
    0 2 0.945894 0.426395 0.099690 44.203245 0.037480 0.049351 0.176631 0.061510 0.041189 ... 4.008012 0.049351 7.057088 15.565438 0.058638 3.906655 1.0 0.0 0.0 0.0
    1 3 1.033559 0.498234 0.203120 38.991156 0.044062 0.048857 0.175715 0.066546 0.053204 ... 3.156783 0.048857 6.460618 15.914250 0.067239 4.002846 1.0 0.0 0.0 0.0
    2 3 0.963703 0.451505 0.122099 50.841385 0.032709 0.044334 0.170843 0.059783 0.032497 ... 4.094575 0.044334 10.491970 18.888889 0.074426 3.483510 0.0 1.0 0.0 0.0
    3 3 1.019851 0.510402 0.176116 41.161738 0.020894 -0.012858 0.138059 0.042430 0.025690 ... 3.630950 -0.012858 4.080741 6.901042 0.028394 4.581150 0.0 1.0 0.0 0.0
    4 3 0.957844 0.495432 0.141608 47.761126 0.042861 0.053770 0.177720 0.065354 0.046363 ... 4.012780 0.053770 8.293505 15.808147 0.058065 3.857790 0.0 0.0 0.0 1.0

    5 rows × 30 columns

    Let's delve deeper into ADASYN. The method requires the number of nearest neighbors $k$ (as in K-nearest neighbors); in this analysis I set $k = 2$. In ADASYN, the majority class is the class with the largest number of samples, and all other classes are minority classes. For the $i$-th point $x_i$ in a minority class, compute the ratio: $$r_i = \frac{\text{The number of majority-class neighbors of }x_i}{k}\in[0,1]$$ Then normalize $r_i$: $g_i = \frac{r_i}{\sum_{j\in\text{minority class}} r_j}$, and determine the number of synthetic samples to generate from $x_i$: $$G_i = g_i\cdot G$$ where $G =$ (number of majority-class samples) $-$ (number of minority-class samples). Finally, for each $x_i$, randomly select one of its $k$ nearest neighbors $x_*$ and interpolate: $$x_{new} = x_i + \lambda(x_* - x_i), \quad \lambda \sim \text{Unif}(0,1).$$
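    The steps above can be sketched in NumPy as follows. This is a toy illustration, not imblearn's implementation; I assume, as is common for ADASYN, that the interpolation partner $x_*$ is drawn from the minority class, and the toy function name `adasyn_sketch` is my own.

```python
import numpy as np

rng = np.random.default_rng(0)

def adasyn_sketch(X_min, X_maj, k=2, rng=rng):
    """Generate synthetic minority samples following the ADASYN recipe above."""
    X_all = np.vstack([X_min, X_maj])
    is_maj = np.array([False] * len(X_min) + [True] * len(X_maj))
    G = len(X_maj) - len(X_min)             # total number of samples to generate

    # r_i: fraction of x_i's k nearest neighbors that belong to the majority class
    r = np.empty(len(X_min))
    for i in range(len(X_min)):
        d = np.linalg.norm(X_all - X_min[i], axis=1)
        d[i] = np.inf                       # exclude x_i itself
        r[i] = is_maj[np.argsort(d)[:k]].mean()

    g = r / r.sum()                         # normalized densities g_i
    G_i = np.round(g * G).astype(int)       # G_i = g_i * G, rounded to integers

    # generate G_i points by interpolating x_i with a random minority neighbor
    synthetic = []
    for i, n_new in enumerate(G_i):
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf
        neighbors = np.argsort(d)[:k]       # k nearest minority-class neighbors
        for _ in range(n_new):
            x_star = X_min[rng.choice(neighbors)]
            lam = rng.uniform()             # lambda ~ Unif(0, 1)
            synthetic.append(X_min[i] + lam * (x_star - X_min[i]))
    return np.array(synthetic)
```

    Note that imblearn's ADASYN additionally handles multi-class targets and various edge cases; this sketch only illustrates the single-minority-class mechanics.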

    In [ ]:
    X = cor_rate_data.drop(columns=['Rating'])
    y = cor_rate_data['Rating']
    sampling_strategy = {label: 671 for label in set(y)}
    
    adasyn = ADASYN(sampling_strategy = sampling_strategy, n_neighbors=2, random_state=2023311161)
    X_resampled, y_resampled = adasyn.fit_resample(X, y)
    
    y_resampled.value_counts().to_frame()
    
    Out[ ]:
    Rating
    5 721
    4 690
    6 672
    8 672
    3 671
    0 671
    7 670
    2 666
    1 654

    The following code and barplot show the class counts after resampling. Note that ADASYN matches the requested 671 per class only approximately.

    In [ ]:
    rating_remapping = {value: key for key, value in rating_mapping.items()} # switch key and value of rating_mapping dictionary
    rating_remapping[8] = 'C'
    
    rating_order2 = [x for x in rating_order if x != 'D']
    y_resampled_rating = y_resampled.map(rating_remapping)
    y_resampled_rating.value_counts().reindex(rating_order2).to_frame().T
    
    Out[ ]:
    AAA AA A BBB BB B CCC CC C
    Rating 671 654 666 671 690 721 672 670 672
    In [ ]:
    plt.figure(figsize = (12,6))
    y_resampled_rating.value_counts().reindex(rating_order2).plot(kind = 'bar')
    plt.ylabel('counts')
    plt.title('The Number of Ratings')
    plt.show()
    

    Now, combine the resampled rating data and resampled numerical (explanatory) data.

    In [ ]:
    cor_rate_data_adasyn = pd.concat([y_resampled, X_resampled], axis = 1)
    cor_rate_data_adasyn.head()
    
    Out[ ]:
    Rating currentRatio quickRatio cashRatio daysOfSalesOutstanding netProfitMargin pretaxProfitMargin grossProfitMargin operatingProfitMargin returnOnAssets ... companyEquityMultiplier ebitPerRevenue enterpriseValueMultiple operatingCashFlowPerShare operatingCashFlowSalesRatio payablesTurnover Agency_Egan-Jones Agency_Fitch Agency_Moody's Agency_S&P
    0 2 0.945894 0.426395 0.099690 44.203245 0.037480 0.049351 0.176631 0.061510 0.041189 ... 4.008012 0.049351 7.057088 15.565438 0.058638 3.906655 1.0 0.0 0.0 0.0
    1 3 1.033559 0.498234 0.203120 38.991156 0.044062 0.048857 0.175715 0.066546 0.053204 ... 3.156783 0.048857 6.460618 15.914250 0.067239 4.002846 1.0 0.0 0.0 0.0
    2 3 0.963703 0.451505 0.122099 50.841385 0.032709 0.044334 0.170843 0.059783 0.032497 ... 4.094575 0.044334 10.491970 18.888889 0.074426 3.483510 0.0 1.0 0.0 0.0
    3 3 1.019851 0.510402 0.176116 41.161738 0.020894 -0.012858 0.138059 0.042430 0.025690 ... 3.630950 -0.012858 4.080741 6.901042 0.028394 4.581150 0.0 1.0 0.0 0.0
    4 3 0.957844 0.495432 0.141608 47.761126 0.042861 0.053770 0.177720 0.065354 0.046363 ... 4.012780 0.053770 8.293505 15.808147 0.058065 3.857790 0.0 0.0 0.0 1.0

    5 rows × 30 columns

    (4) Split data into Train and Test dataset¶

    The ratio of train and test data is 80 : 20.

    In [ ]:
    data_train, data_test = train_test_split(cor_rate_data_adasyn, test_size=0.2, random_state = 2023311161)
    X_train, y_train = data_train.iloc[:,2:30], data_train.iloc[:,0]   # note: 2:30 skips column 1 ('currentRatio')
    X_test, y_test = data_test.iloc[:,2:30], data_test.iloc[:,0]
    X_test.head()
    
    Out[ ]:
    quickRatio cashRatio daysOfSalesOutstanding netProfitMargin pretaxProfitMargin grossProfitMargin operatingProfitMargin returnOnAssets returnOnCapitalEmployed returnOnEquity ... companyEquityMultiplier ebitPerRevenue enterpriseValueMultiple operatingCashFlowPerShare operatingCashFlowSalesRatio payablesTurnover Agency_Egan-Jones Agency_Fitch Agency_Moody's Agency_S&P
    2371 2.278170 0.111674 71.634958 0.144557 0.061013 0.640281 0.203167 0.072946 0.045269 0.169617 ... 2.305414 0.061013 17.261475 3.695267 0.327851 4.956799 0.000000 0.000000 0.0 1.000000
    534 1.038975 0.380729 63.727017 0.059743 0.092966 0.192734 0.103564 0.046122 0.090953 0.109802 ... 2.380685 0.092966 8.102552 3.403397 0.079016 6.936468 0.000000 0.000000 1.0 0.000000
    1071 1.229724 0.380645 46.646221 0.068476 0.103081 0.199890 0.118506 0.053858 0.096087 0.149840 ... 2.782151 0.103081 6.363313 3.485643 0.127798 10.028686 0.000000 0.000000 1.0 0.000000
    5606 0.394453 0.026197 53.924592 0.196575 0.147173 0.998726 -0.193032 0.050472 0.043942 -0.404301 ... -11.054333 0.147173 20.605994 3.182294 0.083168 3.631803 0.014864 0.000000 0.0 0.985136
    4943 0.549608 0.308271 23.011114 -1.083307 -1.194222 0.509708 -1.100582 -0.579432 -0.725569 1.409955 ... -2.537786 -1.194222 -4.132979 5.529796 0.146863 3.738619 0.000000 0.534308 0.0 0.465692

    5 rows × 28 columns

    3. Machine Learning Models¶

    This section compares several machine learning models on how well each predicts the companies' ratings. We use two performance measures: accuracy and the weighted $F_1$ score.

    In [ ]:
    average = 'weighted'
    

    (1) Ordered Logistic Regression¶

    In [ ]:
    # statsmodels' OrderedModel performed poorly here, so multinomial logistic regression is used instead.
    
    LR_model = LogisticRegression(random_state=2023311161 ,
                                  multi_class='multinomial',
                                  solver='newton-cg',
                                  max_iter=5000)
    
    LR_model = LR_model.fit(X_train, y_train)
    y_pred_LR = LR_model.predict(X_test)
    acc_olr = accuracy_score(y_test, y_pred_LR)
    print("Ordered Logistic Regression Accuracy :",acc_olr) # 0.5935
    
    f1_olr = f1_score(y_test, y_pred_LR, average = average)
    print("Ordered Logistic Regression f1 score :", f1_olr) # 0.5709
    
    Ordered Logistic Regression Accuracy : 0.5935960591133005
    Ordered Logistic Regression f1 score : 0.5709746447753064
    
    C:\Users\user\anaconda3\lib\site-packages\sklearn\utils\optimize.py:210: ConvergenceWarning: newton-cg failed to converge. Increase the number of iterations.
      warnings.warn(
    

    (2) LDA¶

    In [ ]:
    LDA_model = LinearDiscriminantAnalysis()
    LDA_model.fit(X_train,y_train)
    y_pred_LDA = LDA_model.predict(X_test)
    acc_LDA = accuracy_score(y_test, y_pred_LDA)
    print("LDA Accuracy :",acc_LDA)
    
    f1_LDA = f1_score(y_test, y_pred_LDA, average = average)
    print("LDA f1 score :", f1_LDA)
    
    LDA Accuracy : 0.48932676518883417
    LDA f1 score : 0.4515765543613118
    

    (3) QDA¶

    In [ ]:
    QDA_model = QuadraticDiscriminantAnalysis()
    QDA_model.fit(X_train,y_train)
    y_pred_QDA = QDA_model.predict(X_test)
    acc_QDA = accuracy_score(y_test, y_pred_QDA)
    print("QDA Accuracy :",acc_QDA)
    
    f1_QDA = f1_score(y_test, y_pred_QDA, average = average)
    print("QDA f1 score :", f1_QDA)
    
    QDA Accuracy : 0.535303776683087
    QDA f1 score : 0.4714921035655885
    
    C:\Users\user\anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:878: UserWarning: Variables are collinear
      warnings.warn("Variables are collinear")
    

    (4) KNN¶

    In [ ]:
    KNN_model = KNeighborsClassifier(n_neighbors = 3)
    KNN_model.fit(X_train,y_train)
    y_pred_KNN = KNN_model.predict(X_test)
    acc_KNN = accuracy_score(y_test, y_pred_KNN)
    print("KNN Accuracy :",acc_KNN)
    
    f1_KNN = f1_score(y_test, y_pred_KNN, average = average)
    print("KNN f1 score :", f1_KNN)
    
    KNN Accuracy : 0.7996715927750411
    KNN f1 score : 0.7858740734584811
    

    (5) SVM¶

    In [ ]:
    SVC_model = svm.SVC(gamma= 'auto')
    SVC_model.fit(X_train, y_train)
    y_pred_SVM = SVC_model.predict(X_test)
    acc_SVM = accuracy_score(y_test, y_pred_SVM)
    print("SVM Accuracy :",acc_SVM)
    
    f1_SVM = f1_score(y_test, y_pred_SVM, average = average)
    print("SVM f1 score :", f1_SVM)
    
    SVM Accuracy : 0.7955665024630542
    SVM f1 score : 0.8121719550099245
    

    (6) Random Forest¶

    In [ ]:
    RF_model = RandomForestClassifier(random_state=2023311161)
    RF_model.fit(X_train,y_train)
    y_pred_RF = RF_model.predict(X_test)
    acc_RF = accuracy_score(y_test, y_pred_RF)
    print("Random Forest Accuracy :",acc_RF)
    
    f1_RF = f1_score(y_test, y_pred_RF, average = average)
    print("Random Forest f1 score :", f1_RF)
    
    Random Forest Accuracy : 0.8538587848932676
    Random Forest f1 score : 0.8512928695960994
    

    (7) Gradient Boosting¶

    In [ ]:
    GBT_model = GradientBoostingClassifier(random_state=2023311161)
    GBT_model.fit(X_train, y_train)
    y_pred_GBT = GBT_model.predict(X_test)
    acc_GBT = accuracy_score(y_test, y_pred_GBT)
    print("GBT Accuracy :",acc_GBT)
    
    f1_GBT = f1_score(y_test, y_pred_GBT, average = average)
    print("GBT f1 score :", f1_GBT)
    
    GBT Accuracy : 0.7857142857142857
    GBT f1 score : 0.7805326021015632
    

    (8) XGBoost¶

    In [ ]:
    xgb_model = XGBClassifier(objective='multi:softmax', num_class=len(y_train.unique()), random_state=2023311161)
    xgb_model.fit(X_train, y_train)
    y_pred_xgb = xgb_model.predict(X_test)
    acc_xgb = accuracy_score(y_test, y_pred_xgb)
    print("XGBoost accuracy :", acc_xgb)
    
    f1_xgb = f1_score(y_test, y_pred_xgb, average = average)
    print("XGBoost f1 score :", f1_xgb)
    
    XGBoost accuracy : 0.8702791461412152
    XGBoost f1 score : 0.8690958617082526
    

    (9) LightGBM¶

    In [ ]:
    lgb_model = lgb.LGBMClassifier(random_state=2023311161)
    lgb_model.fit(X_train, y_train)
    y_pred_lightGBM = lgb_model.predict(X_test)
    acc_lightGBM = accuracy_score(y_test, y_pred_lightGBM)
    print("LightGBM accuracy :", acc_lightGBM)
    
    f1_lightGBM = f1_score(y_test, y_pred_lightGBM, average = average)
    print("LightGBM f1 score :", f1_lightGBM)
    
    [LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001712 seconds.
    You can set `force_col_wise=true` to remove the overhead.
    [LightGBM] [Info] Total Bins 7066
    [LightGBM] [Info] Number of data points in the train set: 4869, number of used features: 28
    [LightGBM] [Info] Start training from score -2.221548
    [LightGBM] [Info] Start training from score -2.219655
    [LightGBM] [Info] Start training from score -2.225343
    [LightGBM] [Info] Start training from score -2.195378
    [LightGBM] [Info] Start training from score -2.202785
    [LightGBM] [Info] Start training from score -2.131070
    [LightGBM] [Info] Start training from score -2.214000
    [LightGBM] [Info] Start training from score -2.188025
    [LightGBM] [Info] Start training from score -2.180726
    [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
    LightGBM accuracy : 0.8735632183908046
    LightGBM f1 score : 0.8737808912916817
    

    4. Results¶

    (1) Accuracy¶

    The following barplot compares the models' accuracies in ascending order. LightGBM, XGBoost, and random forest show the highest accuracies, around 85 ~ 87%, while ordered logistic regression, LDA, and QDA are relatively low, around 50 ~ 60%.

    In [ ]:
    models = ['Ordered Logistic', 'LDA','QDA', 'KNN','SVC','Random Forest','Gradient Boosting','XGBoost','LightGBM']
    accuracy = [acc_olr, acc_LDA, acc_QDA, acc_KNN, acc_SVM, acc_RF, acc_GBT, acc_xgb, acc_lightGBM]
    df_acc = pd.DataFrame({'model' : models, 'accuracy' : accuracy})
    df_acc = df_acc.sort_values('accuracy', ascending = True)
    
    plt.figure(figsize = (12,6))
    ax = sns.barplot(data=df_acc, x='model', y='accuracy',  palette='viridis')
    
    # Add accuracy values on top of each bar
    for p in ax.patches:
        ax.annotate(f'{p.get_height():.3f}',  # format with three decimals
                    (p.get_x() + p.get_width() / 2., p.get_height()),
                    ha='center', va='bottom', fontsize=10)
    plt.ylabel('Accuracy')
    plt.title('The accuracies for distinct models')
    plt.show()
    

    (2) Weighted Average $F_1$ Score¶

    This section presents a different measure, the weighted average $F_1$ score. Let $C$ be the total number of classes, $n_k$ the number of samples in class $k$, and $w_k := n_k/\sum_{j=1}^C n_j$ the weight of class $k$. Precision and recall, the building blocks of this measure, are defined per class as follows: $$P_k = \frac{\text{True Positive}_k}{\text{True Positive}_k+\text{False Positive}_k}$$ $$R_k = \frac{\text{True Positive}_k}{\text{True Positive}_k+\text{False Negative}_k}$$ Then the class-wise $F_1$ score is $F_{1,(k)} = 2\frac{P_k\cdot R_k}{P_k + R_k}$, and the weighted average $F_1$ score is $$F_{1,weighted} := \sum_{k=1}^C w_k\cdot F_{1,(k)}$$
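    As a sanity check, the definitions above can be re-derived on a small toy label vector (hypothetical labels, not the project data) and compared against sklearn's f1_score:

```python
import numpy as np
from sklearn.metrics import f1_score

# toy true/predicted labels for C = 3 classes
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 0, 1])

f1_weighted = 0.0
for k in np.unique(y_true):
    tp = np.sum((y_pred == k) & (y_true == k))   # true positives for class k
    fp = np.sum((y_pred == k) & (y_true != k))   # false positives
    fn = np.sum((y_pred != k) & (y_true == k))   # false negatives
    P = tp / (tp + fp) if tp + fp > 0 else 0.0   # precision P_k
    R = tp / (tp + fn) if tp + fn > 0 else 0.0   # recall R_k
    F1_k = 2 * P * R / (P + R) if P + R > 0 else 0.0
    w_k = np.mean(y_true == k)                   # class weight w_k = n_k / n
    f1_weighted += w_k * F1_k

print(f1_weighted)
print(f1_score(y_true, y_pred, average='weighted'))  # the two values agree
```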

    The following barplot compares the models' weighted $F_1$ scores in ascending order. LightGBM, XGBoost, and random forest again score highest, around 85 ~ 87%, while ordered logistic regression, LDA, and QDA are relatively low, around 45 ~ 57%.

    In [ ]:
    f1_scores = [f1_olr, f1_LDA, f1_QDA, f1_KNN, f1_SVM, f1_RF, f1_GBT, f1_xgb, f1_lightGBM]
    df_f1 = pd.DataFrame({'model' : models, 'f1_score' : f1_scores})
    df_f1 = df_f1.sort_values('f1_score', ascending = True)
    
    plt.figure(figsize = (12,6))
    ax = sns.barplot(data=df_f1, x='model', y='f1_score', color='skyblue')
    
    for p in ax.patches:
        ax.annotate(f'{p.get_height():.3f}',  # label each bar with its F1 score (three decimals)
                    (p.get_x() + p.get_width() / 2., p.get_height()),
                    ha='center', va='bottom', fontsize=10)
    plt.ylabel('f1 score')
    plt.title('The f1 scores for distinct models')
    plt.show()
    

    (3) Feature Importance¶

    This barplot shows the importance of each variable in the fitted LightGBM model (by default, the number of tree splits that use the feature).

    In [ ]:
    df_LGB_importance = pd.DataFrame({'feature' : X_train.columns,
                                      'importance' : lgb_model.feature_importances_})
    df_LGB_importance = df_LGB_importance.sort_values(by = 'importance', ascending = True)
    
    plt.figure(figsize = (16,10))
    bars = plt.barh(df_LGB_importance['feature'], df_LGB_importance['importance'], color = 'green')
    for bar, value in zip(bars, df_LGB_importance['importance']):
        plt.text(bar.get_width() + 0.01, bar.get_y() + bar.get_height() / 2,
                 f'{value:.0f}', va='center', fontsize=9, color='black')
    plt.xlabel('Feature Importance')
    plt.title('Feature Importance - LightGBM')
    plt.show()
    

    The following table shows the top 5 most important variables.

    In [ ]:
    df_LGB_importance = df_LGB_importance.sort_values(by = 'importance', ascending = False)
    df_LGB_importance['feature'].iloc[:5].to_frame().T
    
    Out[ ]:
    2 17 0 23 20
    feature daysOfSalesOutstanding cashPerShare quickRatio payablesTurnover enterpriseValueMultiple

    5. Other Comments¶

    Among all the models, LDA, QDA, and ordered logistic regression show relatively low performance. This section discusses some probable reasons for this.

    (1) Outliers¶

    Most of the numerical columns contain extreme values. For example, for the column currentRatio the minimum is -0.93 and the interquartile range (Q1 ~ Q3) is 1.07 ~ 2.16, yet the maximum is 1725.5. Such outliers are likely to distort the models' performance.

    In particular, LDA and QDA both assume that the explanatory variables follow a (multivariate) normal distribution, an assumption that such extreme values clearly violate.
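    One quick way to flag such outliers is Tukey's IQR rule; the sketch below applies it to a small synthetic column mimicking currentRatio (the values are illustrative, not from the project data).

    ```python
    import pandas as pd

    # Synthetic column mimicking currentRatio: moderate values plus one extreme outlier
    s = pd.Series([1.1, 1.5, 2.0, 1.3, 0.9, 1.8, 2.2, 1725.5])

    # Tukey's rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
    print(outliers.tolist())  # [1725.5]
    ```

    Flagged values could then be set to NaN and imputed, as the conclusion suggests.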

    In [ ]:
    cor_rate_data.loc[:,"currentRatio":"payablesTurnover"].describe()
    
    Out[ ]:
    currentRatio quickRatio cashRatio daysOfSalesOutstanding netProfitMargin pretaxProfitMargin grossProfitMargin operatingProfitMargin returnOnAssets returnOnCapitalEmployed ... effectiveTaxRate freeCashFlowOperatingCashFlowRatio freeCashFlowPerShare cashPerShare companyEquityMultiplier ebitPerRevenue enterpriseValueMultiple operatingCashFlowPerShare operatingCashFlowSalesRatio payablesTurnover
    count 2029.000000 2029.000000 2029.000000 2029.000000 2029.000000 2029.000000 2029.000000 2029.000000 2029.000000 2029.000000 ... 2029.000000 2029.000000 2.029000e+03 2.029000e+03 2029.000000 2029.000000 2029.000000 2.029000e+03 2029.000000 2029.000000
    mean 3.529607 2.653986 0.667364 333.795606 0.278447 0.431483 0.497968 0.587322 -37.517928 -73.974193 ... 0.397572 0.409550 5.094719e+03 4.227549e+03 3.323579 0.437454 48.287985 6.515123e+03 1.447653 38.002718
    std 44.052361 32.944817 3.583943 4447.839583 6.064134 8.984982 0.525307 11.224622 1166.172220 2350.275719 ... 10.595075 3.796488 1.469156e+05 1.224000e+05 87.529866 8.984299 529.118961 1.775290e+05 19.483294 758.923588
    min -0.932005 -1.893266 -0.192736 -811.845623 -101.845815 -124.343612 -14.800817 -124.343612 -40213.178290 -87162.162160 ... -100.611015 -120.916010 -4.912742e+03 -1.915035e+01 -2555.419643 -124.343612 -3749.921337 -1.195049e+04 -4.461837 -76.662850
    25% 1.071930 0.602825 0.130630 22.905093 0.021006 0.025649 0.233127 0.044610 0.019176 0.028112 ... 0.146854 0.271478 4.119924e-01 1.566038e+00 2.046822 0.028057 6.238066 2.356735e+00 0.073886 2.205912
    50% 1.493338 0.985679 0.297493 42.374120 0.064753 0.084965 0.414774 0.107895 0.045608 0.074421 ... 0.300539 0.644529 2.131742e+00 3.686513e+00 2.652456 0.087322 9.274398 4.352584e+00 0.133050 5.759722
    75% 2.166891 1.453820 0.624906 59.323563 0.114807 0.144763 0.849693 0.176181 0.077468 0.135036 ... 0.370653 0.836949 4.230253e+00 8.086152e+00 3.658331 0.149355 12.911759 7.319759e+00 0.240894 9.480892
    max 1725.505005 1139.541703 125.917417 115961.637400 198.517873 309.694856 2.702533 410.182214 0.487826 2.439504 ... 429.926282 34.594086 5.753380e+06 4.786803e+06 2562.871795 309.694856 11153.607090 6.439270e+06 688.526591 20314.880400

    8 rows × 25 columns

    (2) Multicollinearity¶

    Multicollinearity between the covariates can also lead to poor forecasting performance. The following table shows the correlations of all variables.

    In [ ]:
    cor_rate_data_correlation = cor_rate_data.loc[:,"currentRatio":"payablesTurnover"].corr()
    cor_rate_data_correlation
    
    Out[ ]:
    currentRatio quickRatio cashRatio daysOfSalesOutstanding netProfitMargin pretaxProfitMargin grossProfitMargin operatingProfitMargin returnOnAssets returnOnCapitalEmployed ... effectiveTaxRate freeCashFlowOperatingCashFlowRatio freeCashFlowPerShare cashPerShare companyEquityMultiplier ebitPerRevenue enterpriseValueMultiple operatingCashFlowPerShare operatingCashFlowSalesRatio payablesTurnover
    currentRatio 1.000000 0.104329 0.736042 0.000020 0.003561 -0.002453 0.067326 -0.001224 0.001623 0.001583 ... -0.026461 0.005863 -0.001307 -0.001266 0.000118 -0.002489 -0.002013 -0.001453 -0.001439 -0.002160
    quickRatio 0.104329 1.000000 0.125848 0.736630 -0.001931 -0.002208 -0.028343 -0.001820 0.001539 0.001508 ... -0.003940 0.001713 -0.001427 -0.000929 -0.000365 -0.002241 -0.002843 -0.001587 -0.001453 -0.001759
    cashRatio 0.736042 0.125848 1.000000 0.006616 -0.007936 -0.006837 -0.050355 -0.000723 0.003326 0.003280 ... -0.024406 -0.013719 -0.001778 -0.001652 -0.001883 -0.006883 -0.007945 -0.002293 0.004459 -0.006915
    daysOfSalesOutstanding 0.000020 0.736630 0.006616 1.000000 0.262299 0.278222 -0.068181 0.291002 0.002416 0.002364 ... -0.000926 0.003267 -0.002155 -0.000834 -0.000242 0.278204 -0.019808 -0.002283 0.406126 -0.003026
    netProfitMargin 0.003561 -0.001931 -0.007936 0.262299 1.000000 0.991241 -0.099540 0.971483 0.001577 0.001535 ... 0.001500 -0.004830 -0.000708 -0.000716 -0.001185 0.991185 -0.003665 -0.000779 0.785592 -0.001681
    pretaxProfitMargin -0.002453 -0.002208 -0.006837 0.278222 0.991241 1.000000 -0.135662 0.992001 0.001637 0.001598 ... -0.000097 -0.005518 -0.000976 -0.000977 -0.001148 0.999975 -0.003867 -0.001080 0.831778 -0.001760
    grossProfitMargin 0.067326 -0.028343 -0.050355 -0.068181 -0.099540 -0.135662 1.000000 -0.121829 -0.030745 -0.030128 ... 0.018387 0.016868 0.012775 0.012781 0.010026 -0.135715 -0.008007 0.011131 -0.105564 0.034244
    operatingProfitMargin -0.001224 -0.001820 -0.000723 0.291002 0.971483 0.992001 -0.121829 1.000000 0.001675 0.001634 ... 0.000173 -0.005949 -0.001205 -0.001205 -0.000962 0.992018 -0.003699 -0.001305 0.871686 -0.002042
    returnOnAssets 0.001623 0.001539 0.003326 0.002416 0.001577 0.001637 -0.030745 0.001675 1.000000 0.995426 ... 0.001214 0.029072 -0.001283 0.001115 0.002629 0.001659 0.002742 0.002275 0.002385 0.001616
    returnOnCapitalEmployed 0.001583 0.001508 0.003280 0.002364 0.001535 0.001598 -0.030128 0.001634 0.995426 1.000000 ... 0.001183 0.029901 -0.001162 0.001090 0.002559 0.001619 0.002883 0.002145 0.002334 0.001581
    returnOnEquity -0.001644 -0.001562 -0.003402 -0.002444 -0.001554 -0.001623 0.031101 -0.001669 -0.995371 -0.981650 ... -0.001216 -0.027981 0.001368 -0.001147 -0.002498 -0.001644 -0.002975 -0.002419 -0.002412 -0.001628
    assetTurnover -0.001951 -0.001854 -0.003991 -0.002887 -0.001831 -0.001909 0.036758 -0.001974 -0.822513 -0.802676 ... -0.001444 -0.022836 0.000736 -0.001329 -0.003321 -0.001934 -0.003633 -0.002390 -0.002834 -0.001926
    fixedAssetTurnover -0.001944 -0.001845 -0.003940 -0.002884 -0.001830 -0.001907 0.036731 -0.001973 -0.810064 -0.788302 ... -0.001442 -0.022273 0.000733 -0.001328 -0.003324 -0.001933 -0.003634 -0.002395 -0.002834 -0.001925
    debtEquityRatio 0.000093 -0.000366 -0.001877 -0.000231 -0.001183 -0.001149 0.010051 -0.000964 0.002262 0.002200 ... 0.003429 0.000507 -0.012039 -0.009003 0.999995 -0.001148 0.010820 -0.015896 -0.000781 -0.000212
    debtRatio 0.005856 0.002901 -0.008638 0.001022 -0.022625 -0.024115 0.018127 -0.024509 -0.051972 -0.051081 ... -0.001024 -0.058294 -0.011503 -0.013107 0.007100 -0.024150 0.051639 -0.004820 -0.021119 -0.004237
    effectiveTaxRate -0.026461 -0.003940 -0.024406 -0.000926 0.001500 -0.000097 0.018387 0.000173 0.001214 0.001183 ... 1.000000 0.005496 -0.001528 -0.001562 0.003449 -0.000029 -0.004558 -0.001771 -0.000724 -0.000206
    freeCashFlowOperatingCashFlowRatio 0.005863 0.001713 -0.013719 0.003267 -0.004830 -0.005518 0.016868 -0.005949 0.029072 0.029901 ... 0.005496 1.000000 0.003497 0.003662 0.000510 -0.005449 -0.002982 0.003602 -0.003585 -0.003625
    freeCashFlowPerShare -0.001307 -0.001427 -0.001778 -0.002155 -0.000708 -0.000976 0.012775 -0.001205 -0.001283 -0.001162 ... -0.001528 0.003497 1.000000 0.997277 -0.012076 -0.000999 -0.003156 0.992371 -0.002066 -0.001476
    cashPerShare -0.001266 -0.000929 -0.001652 -0.000834 -0.000716 -0.000977 0.012781 -0.001205 0.001115 0.001090 ... -0.001562 0.003662 0.997277 1.000000 -0.009027 -0.001000 -0.003172 0.986459 -0.001514 -0.001281
    companyEquityMultiplier 0.000118 -0.000365 -0.001883 -0.000242 -0.001185 -0.001148 0.010026 -0.000962 0.002629 0.002559 ... 0.003449 0.000510 -0.012076 -0.009027 1.000000 -0.001146 0.010804 -0.015945 -0.000804 -0.000207
    ebitPerRevenue -0.002489 -0.002241 -0.006883 0.278204 0.991185 0.999975 -0.135715 0.992018 0.001659 0.001619 ... -0.000029 -0.005449 -0.000999 -0.001000 -0.001146 1.000000 -0.003925 -0.001104 0.831789 -0.001790
    enterpriseValueMultiple -0.002013 -0.002843 -0.007945 -0.019808 -0.003665 -0.003867 -0.008007 -0.003699 0.002742 0.002883 ... -0.004558 -0.002982 -0.003156 -0.003172 0.010804 -0.003925 1.000000 -0.003337 -0.005003 -0.001183
    operatingCashFlowPerShare -0.001453 -0.001587 -0.002293 -0.002283 -0.000779 -0.001080 0.011131 -0.001305 0.002275 0.002145 ... -0.001771 0.003602 0.992371 0.986459 -0.015945 -0.001104 -0.003337 1.000000 -0.002220 -0.000302
    operatingCashFlowSalesRatio -0.001439 -0.001453 0.004459 0.406126 0.785592 0.831778 -0.105564 0.871686 0.002385 0.002334 ... -0.000724 -0.003585 -0.002066 -0.001514 -0.000804 0.831789 -0.005003 -0.002220 1.000000 -0.003337
    payablesTurnover -0.002160 -0.001759 -0.006915 -0.003026 -0.001681 -0.001760 0.034244 -0.002042 0.001616 0.001581 ... -0.000206 -0.003625 -0.001476 -0.001281 -0.000207 -0.001790 -0.001183 -0.000302 -0.003337 1.000000

    25 rows × 25 columns

    The following image plot visualizes the correlations between the variables. It shows that some variables have strong correlations (positive or negative) with others. For example, netProfitMargin is strongly correlated with pretaxProfitMargin (0.991241), operatingProfitMargin (0.971483), ebitPerRevenue (0.991185), and operatingCashFlowSalesRatio (0.785592).

    In [ ]:
    plt.figure(figsize=(16, 12))
    plt.imshow(cor_rate_data_correlation, cmap='coolwarm', interpolation='none', aspect='auto')
    plt.colorbar(label='Correlation Coefficient')
    plt.xticks(range(len(cor_rate_data_correlation.index)), cor_rate_data_correlation.index, rotation=90)
    plt.yticks(range(len(cor_rate_data_correlation.index)), cor_rate_data_correlation.index)
    plt.title("Correlation Matrix", fontsize=30)
    plt.tight_layout()
    plt.show()
    

    This multicollinearity has a severe impact on QDA because QDA relies on inverting the class-wise covariance matrices, which become near-singular when covariates are highly correlated. Multicollinearity can also cause the ordered logistic model to fail to converge, leading to unreliable estimates.
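    The numerical effect can be illustrated with the condition number of a covariance matrix. The sketch below uses synthetic features whose correlation mimics the netProfitMargin / pretaxProfitMargin pair; all names and values are illustrative, not the project data.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n = 500

    # Two nearly collinear features, plus one independent feature for contrast
    x1 = rng.normal(size=n)
    x2 = x1 + 0.01 * rng.normal(size=n)   # correlation with x1 ≈ 0.9999
    x3 = rng.normal(size=n)

    cov = np.cov(np.column_stack([x1, x2, x3]), rowvar=False)

    # QDA must invert a covariance matrix per class; a huge condition
    # number means that inversion is numerically unstable
    print(np.linalg.cond(cov))
    ```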

    6. Conclusion¶

    The models based on gradient boosting, including LightGBM and XGBoost, as well as random forest, give the best forecasting performance for corporate credit ratings, even without any handling of outliers or multicollinearity. If methods such as LDA or ordered logistic regression are required, handling the outliers (e.g., treating them as missing values) and dropping columns with high correlation should improve their performance.
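    The column-deletion step mentioned above can be sketched as follows, on a toy frame with hypothetical column names (not the project data): one member of each pair of highly correlated columns is dropped.

    ```python
    import numpy as np
    import pandas as pd

    # Toy frame where column 'b' is nearly a copy of 'a'
    rng = np.random.default_rng(1)
    a = rng.normal(size=200)
    df = pd.DataFrame({'a': a,
                       'b': a + 0.01 * rng.normal(size=200),
                       'c': rng.normal(size=200)})

    # Keep only the upper triangle so each pair is considered once,
    # then drop the second member of any pair with |corr| > 0.95
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
    df_reduced = df.drop(columns=to_drop)
    print(to_drop)  # ['b']
    ```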