Bayesian Statistics Project 1¶
1. Objective¶
The main purpose of this data analysis is to build a model that predicts whether each customer will terminate their contract, using the Bayesian paradigm. The dataset comes from the Kaggle competition Telco customer churn and describes a fictional telecommunications company that provided home phone and internet services to customers in California.
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
2. Data¶
The data consists of 33 columns, including where customers live, their gender, monthly charges and so on.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
telco = pd.read_excel('/content/drive/MyDrive/Bayesian/Telco_customer_churn_data.xlsx')
telco.head()
| CustomerID | Count | Country | State | City | Zip Code | Lat Long | Latitude | Longitude | Gender | ... | Contract | Paperless Billing | Payment Method | Monthly Charges | Total Charges | Churn Label | Churn Value | Churn Score | CLTV | Churn Reason | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3668-QPYBK | 1 | United States | California | Los Angeles | 90003 | 33.964131, -118.272783 | 33.964131 | -118.272783 | Male | ... | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes | 1 | 86 | 3239 | Competitor made better offer |
| 1 | 9237-HQITU | 1 | United States | California | Los Angeles | 90005 | 34.059281, -118.30742 | 34.059281 | -118.307420 | Female | ... | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes | 1 | 67 | 2701 | Moved |
| 2 | 9305-CDSKC | 1 | United States | California | Los Angeles | 90006 | 34.048013, -118.293953 | 34.048013 | -118.293953 | Female | ... | Month-to-month | Yes | Electronic check | 99.65 | 820.5 | Yes | 1 | 86 | 5372 | Moved |
| 3 | 7892-POOKP | 1 | United States | California | Los Angeles | 90010 | 34.062125, -118.315709 | 34.062125 | -118.315709 | Female | ... | Month-to-month | Yes | Electronic check | 104.80 | 3046.05 | Yes | 1 | 84 | 5003 | Moved |
| 4 | 0280-XJGEX | 1 | United States | California | Los Angeles | 90015 | 34.039224, -118.266293 | 34.039224 | -118.266293 | Male | ... | Month-to-month | Yes | Bank transfer (automatic) | 103.70 | 5036.3 | Yes | 1 | 89 | 5340 | Competitor had better devices |
5 rows × 33 columns
Most of the columns are categorical variables.
telco.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 7043 entries, 0 to 7042 Data columns (total 33 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CustomerID 7043 non-null object 1 Count 7043 non-null int64 2 Country 7043 non-null object 3 State 7043 non-null object 4 City 7043 non-null object 5 Zip Code 7043 non-null int64 6 Lat Long 7043 non-null object 7 Latitude 7043 non-null float64 8 Longitude 7043 non-null float64 9 Gender 7043 non-null object 10 Senior Citizen 7043 non-null object 11 Partner 7043 non-null object 12 Dependents 7043 non-null object 13 Tenure Months 7043 non-null int64 14 Phone Service 7043 non-null object 15 Multiple Lines 7043 non-null object 16 Internet Service 7043 non-null object 17 Online Security 7043 non-null object 18 Online Backup 7043 non-null object 19 Device Protection 7043 non-null object 20 Tech Support 7043 non-null object 21 Streaming TV 7043 non-null object 22 Streaming Movies 7043 non-null object 23 Contract 7043 non-null object 24 Paperless Billing 7043 non-null object 25 Payment Method 7043 non-null object 26 Monthly Charges 7043 non-null float64 27 Total Charges 7043 non-null object 28 Churn Label 7043 non-null object 29 Churn Value 7043 non-null int64 30 Churn Score 7043 non-null int64 31 CLTV 7043 non-null int64 32 Churn Reason 1869 non-null object dtypes: float64(3), int64(6), object(24) memory usage: 1.8+ MB
Key Variables
- Churn Value: customer churn status (1: Churned, 0: Retained)
- Churn Reason: reason for customer churn
- Customer Information Variables: Gender, Senior Citizen, Dependents, etc.
- Contract & Payment Variables: Contract, Payment Method, Monthly Charges, etc.
- Service Subscription Variables: Internet Service, Online Security, Device Protection, etc.
Although the output above shows no null values, some entries in the 'Total Charges' column are just blank spaces. All of the corresponding customers have a tenure of zero months, meaning they are brand-new Telco customers. Fortunately, they do have monthly charge data, so we replace the blank entries with the monthly charge values.
telco.loc[telco['Tenure Months'] == 0, 'Total Charges'] = telco['Monthly Charges']
telco[telco['Tenure Months'] == 0]
| CustomerID | Count | Country | State | City | Zip Code | Lat Long | Latitude | Longitude | Gender | ... | Contract | Paperless Billing | Payment Method | Monthly Charges | Total Charges | Churn Label | Churn Value | Churn Score | CLTV | Churn Reason | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2234 | 4472-LVYGI | 1 | United States | California | San Bernardino | 92408 | 34.084909, -117.258107 | 34.084909 | -117.258107 | Female | ... | Two year | Yes | Bank transfer (automatic) | 52.55 | 52.55 | No | 0 | 36 | 2578 | NaN |
| 2438 | 3115-CZMZD | 1 | United States | California | Independence | 93526 | 36.869584, -118.189241 | 36.869584 | -118.189241 | Male | ... | Two year | No | Mailed check | 20.25 | 20.25 | No | 0 | 68 | 5504 | NaN |
| 2568 | 5709-LVOEQ | 1 | United States | California | San Mateo | 94401 | 37.590421, -122.306467 | 37.590421 | -122.306467 | Female | ... | Two year | No | Mailed check | 80.85 | 80.85 | No | 0 | 45 | 2048 | NaN |
| 2667 | 4367-NUYAO | 1 | United States | California | Cupertino | 95014 | 37.306612, -122.080621 | 37.306612 | -122.080621 | Male | ... | Two year | No | Mailed check | 25.75 | 25.75 | No | 0 | 48 | 4950 | NaN |
| 2856 | 1371-DWPAZ | 1 | United States | California | Redcrest | 95569 | 40.363446, -123.835041 | 40.363446 | -123.835041 | Female | ... | Two year | No | Credit card (automatic) | 56.05 | 56.05 | No | 0 | 30 | 4740 | NaN |
| 4331 | 7644-OMVMY | 1 | United States | California | Los Angeles | 90029 | 34.089953, -118.294824 | 34.089953 | -118.294824 | Male | ... | Two year | No | Mailed check | 19.85 | 19.85 | No | 0 | 53 | 2019 | NaN |
| 4687 | 3213-VVOLG | 1 | United States | California | Sun City | 92585 | 33.739412, -117.173334 | 33.739412 | -117.173334 | Male | ... | Two year | No | Mailed check | 25.35 | 25.35 | No | 0 | 49 | 2299 | NaN |
| 5104 | 2520-SGTTA | 1 | United States | California | Ben Lomond | 95005 | 37.078873, -122.090386 | 37.078873 | -122.090386 | Female | ... | Two year | No | Mailed check | 20.00 | 20.0 | No | 0 | 27 | 3763 | NaN |
| 5719 | 2923-ARZLG | 1 | United States | California | La Verne | 91750 | 34.144703, -117.770299 | 34.144703 | -117.770299 | Male | ... | One year | Yes | Mailed check | 19.70 | 19.7 | No | 0 | 69 | 4890 | NaN |
| 6772 | 4075-WKNIU | 1 | United States | California | Bell | 90201 | 33.970343, -118.171368 | 33.970343 | -118.171368 | Female | ... | Two year | No | Mailed check | 73.35 | 73.35 | No | 0 | 44 | 2342 | NaN |
| 6840 | 2775-SEFEE | 1 | United States | California | Wilmington | 90744 | 33.782068, -118.262263 | 33.782068 | -118.262263 | Male | ... | Two year | Yes | Bank transfer (automatic) | 61.90 | 61.9 | No | 0 | 65 | 5188 | NaN |
11 rows × 33 columns
It also turns out that the 'Total Charges' column is stored as text (object dtype), which is not intuitive for a monetary amount, so I converted it to a numeric type.
telco['Total Charges'] = pd.to_numeric(telco['Total Charges'],errors='coerce', downcast='float')
telco.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 7043 entries, 0 to 7042 Data columns (total 33 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CustomerID 7043 non-null object 1 Count 7043 non-null int64 2 Country 7043 non-null object 3 State 7043 non-null object 4 City 7043 non-null object 5 Zip Code 7043 non-null int64 6 Lat Long 7043 non-null object 7 Latitude 7043 non-null float64 8 Longitude 7043 non-null float64 9 Gender 7043 non-null object 10 Senior Citizen 7043 non-null object 11 Partner 7043 non-null object 12 Dependents 7043 non-null object 13 Tenure Months 7043 non-null int64 14 Phone Service 7043 non-null object 15 Multiple Lines 7043 non-null object 16 Internet Service 7043 non-null object 17 Online Security 7043 non-null object 18 Online Backup 7043 non-null object 19 Device Protection 7043 non-null object 20 Tech Support 7043 non-null object 21 Streaming TV 7043 non-null object 22 Streaming Movies 7043 non-null object 23 Contract 7043 non-null object 24 Paperless Billing 7043 non-null object 25 Payment Method 7043 non-null object 26 Monthly Charges 7043 non-null float64 27 Total Charges 7043 non-null float32 28 Churn Label 7043 non-null object 29 Churn Value 7043 non-null int64 30 Churn Score 7043 non-null int64 31 CLTV 7043 non-null int64 32 Churn Reason 1869 non-null object dtypes: float32(1), float64(3), int64(6), object(23) memory usage: 1.7+ MB
3. EDA¶
(1) City¶
This section shows how the churn value differs depending on each city. There are more than 1000 cities in this data.
telco_by_city = telco.groupby('City')['Churn Value'].agg(['mean', 'count']).reset_index()
telco_by_city.columns = ['City', 'Average Churn Value', 'Number of Customers']
telco_by_city = telco_by_city.pivot_table(index=None, columns='City', values=['Average Churn Value', 'Number of Customers'])
telco_by_city
| City | Acampo | Acton | Adelanto | Adin | Agoura Hills | Aguanga | Ahwahnee | Alameda | Alamo | Albany | ... | Yermo | Yorba Linda | Yorkville | Yosemite National Park | Yountville | Yreka | Yuba City | Yucaipa | Yucca Valley | Zenia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Average Churn Value | 0.75 | 0.0 | 0.2 | 0.5 | 0.4 | 0.25 | 0.25 | 0.0 | 0.0 | 0.5 | ... | 0.25 | 0.375 | 0.25 | 0.5 | 0.0 | 0.0 | 0.25 | 0.25 | 0.0 | 0.25 |
| Number of Customers | 4.00 | 4.0 | 5.0 | 4.0 | 5.0 | 4.00 | 4.00 | 8.0 | 4.0 | 4.0 | ... | 4.00 | 8.000 | 4.00 | 4.0 | 4.0 | 4.0 | 8.00 | 4.00 | 5.0 | 4.00 |
2 rows × 1129 columns
If we filter the data so that only cities with over 20 customers remain, we are left with just 27 cities.
telco_by_city.loc[:,telco_by_city.loc['Number of Customers'] > 20]
| City | Anaheim | Bakersfield | Berkeley | Burbank | Chula Vista | Fresno | Glendale | Inglewood | Irvine | Long Beach | ... | San Diego | San Francisco | San Jose | Santa Ana | Santa Barbara | Santa Monica | Santa Rosa | Stockton | Torrance | Whittier |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Average Churn Value | 0.25 | 0.075 | 0.28125 | 0.24 | 0.2 | 0.25 | 0.325 | 0.2 | 0.178571 | 0.25 | ... | 0.333333 | 0.298077 | 0.258929 | 0.25 | 0.357143 | 0.2 | 0.458333 | 0.272727 | 0.32 | 0.166667 |
| Number of Customers | 28.00 | 40.000 | 32.00000 | 25.00 | 25.0 | 64.00 | 40.000 | 25.0 | 28.000000 | 60.00 | ... | 150.000000 | 104.000000 | 112.000000 | 24.00 | 28.000000 | 25.0 | 24.000000 | 44.000000 | 25.00 | 30.000000 |
2 rows × 27 columns
We also filter to see which cities have an average churn value over 0.7.
telco_by_city_high_churn = telco_by_city.loc[:, telco_by_city.loc['Average Churn Value'] > 0.7]
telco_by_city_high_churn
| City | Acampo | Alpaugh | Amador City | Avenal | Biola | Bodfish | Boulder Creek | Bridgeville | Byron | Comptche | ... | Smith River | South Dos Palos | South Lake Tahoe | Summerland | Templeton | Tipton | Truckee | Twain | Wheatland | Wrightwood |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Average Churn Value | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 | 1.0 | 0.75 | 0.75 | 0.75 | ... | 1.0 | 0.75 | 1.0 | 0.75 | 0.75 | 1.0 | 1.0 | 1.0 | 0.75 | 1.0 |
| Number of Customers | 4.00 | 4.00 | 4.00 | 4.00 | 4.00 | 4.00 | 4.0 | 4.00 | 4.00 | 4.00 | ... | 4.0 | 4.00 | 4.0 | 4.00 | 4.00 | 4.0 | 4.0 | 4.0 | 4.00 | 4.0 |
2 rows × 56 columns
The cities with a relatively high churn rate all have very few customers.
print(telco_by_city_high_churn.iloc[1, :].max(), telco_by_city_high_churn.iloc[1, :].min())
5.0 4.0
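These extreme rates rest on only 4 or 5 customers each. Since this is a Bayesian project, one way to temper such small-sample rates is Beta-Binomial shrinkage toward the overall churn rate; a minimal sketch, where the prior strength of 10 pseudo-customers is an assumption chosen for illustration:

```python
# Beta-Binomial shrinkage sketch: with a Beta(a, b) prior on a city's churn rate,
# observing k churns out of n customers gives posterior mean (a + k) / (a + b + n).
overall_rate = 1869 / 7043          # overall churn rate in the dataset (~0.265)
prior_strength = 10                 # assumed prior weight: 10 "pseudo-customers"
a = overall_rate * prior_strength
b = prior_strength - a

k, n = 3, 4                         # a city with 3 churns out of 4 customers (raw rate 0.75)
posterior_mean = (a + k) / (a + b + n)
print(round(posterior_mean, 3))    # shrunk well below the raw 0.75
```

With only 4 observations, the posterior mean sits much closer to the overall rate than the raw 0.75, which is exactly the behavior we want for these tiny cities.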
(2) Gender¶
The number of male customers is similar to that of female customers.
telco_gender = telco.groupby('Gender')['CustomerID'].count().reset_index()
telco_gender
| Gender | CustomerID | |
|---|---|---|
| 0 | Female | 3488 |
| 1 | Male | 3555 |
fig, ax = plt.subplots()
telco_gender = telco.groupby('Gender')['CustomerID'].count().reset_index()
labels = 'Female','Male'
ax.pie(telco_gender['CustomerID'], labels=labels, autopct='%1.1f%%',startangle=90, colors=['pink','royalblue'])
plt.show()
The following figure shows that the churn rate is almost identical across genders.
by_gender = telco.groupby('Churn Label')['Gender'].value_counts().to_frame().rename(columns={'Gender': 'Freq'}).reset_index().sort_values('Churn Label')
group_size=telco['Churn Label'].value_counts()
group_names=telco['Churn Label'].value_counts().index
subgroup_size=by_gender['Freq']
subgroup_names=by_gender['Gender']
a, b =[plt.cm.Blues, plt.cm.Reds]
fig, ax = plt.subplots()
fig.suptitle('Gender')
ax.axis('equal')
mypie, _ = ax.pie(group_size, radius=1.3, labels=group_names, colors=[a(0.6), b(0.6)])
plt.setp( mypie, width=0.3, edgecolor='white')
mypie2, _ = ax.pie(subgroup_size, radius=1.3-0.3, labels=subgroup_names, labeldistance=0.7, colors=[a(0.5), a(0.4), b(0.5), b(0.4)])
plt.setp( mypie2, width=0.4, edgecolor='white')
plt.margins(0,0)
plt.show()
(3) Senior¶
Now, we want to check whether churn is affected by the age of the customers. For non-senior customers, the number who churned is about one third of the number who did not. For senior customers, however, the churned group is over two thirds the size of the retained group.
telco_senior = telco.groupby(['Senior Citizen','Churn Label'])['CustomerID'].count().reset_index()
telco_senior
| Senior Citizen | Churn Label | CustomerID | |
|---|---|---|---|
| 0 | No | No | 4508 |
| 1 | No | Yes | 1393 |
| 2 | Yes | No | 666 |
| 3 | Yes | Yes | 476 |
g = sns.catplot(
telco_senior, kind="bar",
x="Churn Label", y="CustomerID", col="Senior Citizen",
height=4, aspect=1, palette = ['limegreen','darkviolet']
)
g.set(ylabel='Counts')
plt.show()
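The ratios above can also be read off directly as rates with `pd.crosstab(..., normalize='index')`; a minimal sketch on toy data mirroring the two columns (on the real frame, `telco` would take the place of `toy`):

```python
import pandas as pd

# Toy stand-in for the real telco frame, with the same two columns
toy = pd.DataFrame({
    'Senior Citizen': ['No'] * 4 + ['Yes'] * 3,
    'Churn Label':    ['No', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes'],
})

# normalize='index' converts counts to per-row proportions, i.e. churn rates per group
rates = pd.crosstab(toy['Senior Citizen'], toy['Churn Label'], normalize='index')
print(rates)
```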
(4) Partner¶
Similarly, churn also varies with the presence of a partner. The churn rate of customers without a partner is about 33%, while that of customers with a partner is about 19.7%.
telco_partner = telco.groupby(['Partner','Churn Label'])['CustomerID'].count().reset_index()
telco_partner
| Partner | Churn Label | CustomerID | |
|---|---|---|---|
| 0 | No | No | 2441 |
| 1 | No | Yes | 1200 |
| 2 | Yes | No | 2733 |
| 3 | Yes | Yes | 669 |
telco_partner_no = telco_partner.loc[telco_partner['Partner'] == 'No', :]
telco_partner_yes = telco_partner.loc[telco_partner['Partner'] == 'Yes', :]
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
labels = 'No','Yes'
# no-partner group: churn No vs Yes
axes[0].pie(telco_partner_no['CustomerID'], labels=labels, autopct='%1.1f%%', startangle=90, colors=['lightcoral','deepskyblue'])
axes[0].set_title('No Partner')
# with-partner group: churn No vs Yes
axes[1].pie(telco_partner_yes['CustomerID'], labels=labels, autopct='%1.1f%%', startangle=90, colors=['lightcoral','deepskyblue'])
axes[1].set_title('With Partner')
plt.show()
(5) Dependents ( = Children)¶
Similarly, churn also varies with the presence of dependents (children). The churn rate of customers without dependents is about 33%, while that of customers with dependents is only about 6.5%.
telco_dep = telco.groupby(['Dependents','Churn Label'])['CustomerID'].count().reset_index()
telco_dep
| Dependents | Churn Label | CustomerID | |
|---|---|---|---|
| 0 | No | No | 3653 |
| 1 | No | Yes | 1763 |
| 2 | Yes | No | 1521 |
| 3 | Yes | Yes | 106 |
g = sns.catplot(
    telco_dep, kind="bar",
    x="Churn Label", y="CustomerID", col="Dependents",
    height=4, aspect=1,
)
g.set(ylabel='Counts')
plt.show()
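The churn rates implied by the counts in the table above can be checked directly:

```python
# Counts from the Dependents x Churn Label table above
no_dep_churn_rate   = 1763 / (3653 + 1763)   # churn rate, no dependents
with_dep_churn_rate = 106 / (1521 + 106)     # churn rate, with dependents
print(round(no_dep_churn_rate, 3), round(with_dep_churn_rate, 3))
```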
(6) Tenure Month¶
This section compares churn with respect to tenure (the number of months each customer has been signed up for the service). The following histogram shows that churned customers are concentrated at low tenure, while the histogram of those who remained is U-shaped.
telco_tenure = telco[['Tenure Months','Churn Label']]
telco_tenure_no = telco_tenure.loc[telco_tenure['Churn Label'] == 'No',:]['Tenure Months'].to_numpy()
telco_tenure_yes = telco_tenure.loc[telco_tenure['Churn Label'] == 'Yes',:]['Tenure Months'].to_numpy()
plt.hist(telco_tenure_no, density = False, histtype='barstacked', rwidth=0.8, color = 'green',alpha = 0.5, label = 'No churn')
plt.hist(telco_tenure_yes, density = False, histtype='barstacked', rwidth=0.8, color = 'lightcoral',alpha = 0.5, label = 'Yes churn')
plt.title('The Length of Tenure in Accordance to Churn')
plt.legend()
plt.show()
(7) Services¶
1) Correlation btw Churn Values and All Services¶
There are several columns associated with services.
telco_services = telco[['Churn Value','Phone Service','Multiple Lines','Internet Service','Online Security',
'Online Backup','Device Protection','Tech Support','Streaming TV',
'Streaming Movies']]
telco_services.head()
| Churn Value | Phone Service | Multiple Lines | Internet Service | Online Security | Online Backup | Device Protection | Tech Support | Streaming TV | Streaming Movies | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Yes | No | DSL | Yes | Yes | No | No | No | No |
| 1 | 1 | Yes | No | Fiber optic | No | No | No | No | No | No |
| 2 | 1 | Yes | Yes | Fiber optic | No | No | Yes | No | Yes | Yes |
| 3 | 1 | Yes | Yes | Fiber optic | No | No | Yes | Yes | Yes | Yes |
| 4 | 1 | Yes | Yes | Fiber optic | No | Yes | Yes | No | Yes | Yes |
To compute correlations, we transform the categorical values above into binary ones. Values such as 'No internet service' and 'No phone service' are mapped to 0. 'DSL' and 'Fiber optic' are both types of internet service, so they are mapped to 1.
telco_services = telco_services.replace(to_replace = 'Yes', value = 1)
telco_services = telco_services.replace(to_replace = ['No', 'No internet service','No phone service'], value = 0)
telco_services = telco_services.replace(to_replace = ['DSL', 'Fiber optic'], value = 1)
telco_services
| Churn Value | Phone Service | Multiple Lines | Internet Service | Online Security | Online Backup | Device Protection | Tech Support | Streaming TV | Streaming Movies | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 |
| 3 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 |
| 4 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7038 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7039 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
| 7040 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 |
| 7041 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 7042 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
7043 rows × 10 columns
The resulting correlation matrix is as follows.
telco_services_corr = telco_services.corr()
telco_services_corr
| Churn Value | Phone Service | Multiple Lines | Internet Service | Online Security | Online Backup | Device Protection | Tech Support | Streaming TV | Streaming Movies | |
|---|---|---|---|---|---|---|---|---|---|---|
| Churn Value | 1.000000 | 0.011942 | 0.040102 | 0.227890 | -0.171226 | -0.082255 | -0.066160 | -0.164674 | 0.063228 | 0.061382 |
| Phone Service | 0.011942 | 1.000000 | 0.279690 | -0.172209 | -0.092893 | -0.052312 | -0.071227 | -0.096340 | -0.022574 | -0.032959 |
| Multiple Lines | 0.040102 | 0.279690 | 1.000000 | 0.210564 | 0.098108 | 0.202237 | 0.201137 | 0.100571 | 0.257152 | 0.258751 |
| Internet Service | 0.227890 | -0.172209 | 0.210564 | 1.000000 | 0.333403 | 0.381593 | 0.380754 | 0.336298 | 0.415552 | 0.418675 |
| Online Security | -0.171226 | -0.092893 | 0.098108 | 0.333403 | 1.000000 | 0.283832 | 0.275438 | 0.354931 | 0.176207 | 0.187398 |
| Online Backup | -0.082255 | -0.052312 | 0.202237 | 0.381593 | 0.283832 | 1.000000 | 0.303546 | 0.294233 | 0.282106 | 0.274501 |
| Device Protection | -0.066160 | -0.071227 | 0.201137 | 0.380754 | 0.275438 | 0.303546 | 1.000000 | 0.333313 | 0.390874 | 0.402111 |
| Tech Support | -0.164674 | -0.096340 | 0.100571 | 0.336298 | 0.354931 | 0.294233 | 0.333313 | 1.000000 | 0.278070 | 0.279358 |
| Streaming TV | 0.063228 | -0.022574 | 0.257152 | 0.415552 | 0.176207 | 0.282106 | 0.390874 | 0.278070 | 1.000000 | 0.533094 |
| Streaming Movies | 0.061382 | -0.032959 | 0.258751 | 0.418675 | 0.187398 | 0.274501 | 0.402111 | 0.279358 | 0.533094 | 1.000000 |
We may also draw the corresponding heatmap. Streaming TV and Streaming Movies are the most strongly correlated pair of services (about 0.53), while Phone Service is only weakly correlated with the rest (even slightly negatively with Internet Service).
sns.heatmap(telco_services_corr, annot=False, cmap='coolwarm')
plt.show()
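To rank the services by their linear association with churn, we can sort the first column of the matrix; on the real frame this is `telco_services_corr['Churn Value'].drop('Churn Value').sort_values()`. Reusing the values printed above:

```python
import pandas as pd

# Correlations with 'Churn Value', copied from the matrix above
churn_corr = pd.Series({
    'Phone Service': 0.011942, 'Multiple Lines': 0.040102,
    'Internet Service': 0.227890, 'Online Security': -0.171226,
    'Online Backup': -0.082255, 'Device Protection': -0.066160,
    'Tech Support': -0.164674, 'Streaming TV': 0.063228,
    'Streaming Movies': 0.061382,
})
print(churn_corr.sort_values())
# Internet Service has the strongest positive association with churn;
# Online Security and Tech Support have the strongest negative ones.
```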
2) Internet Service¶
Customers fall into 3 categories: those with no internet service, those with DSL, and those with fiber optic.
fig, ax = plt.subplots()
telco_internetservice = telco.groupby('Internet Service')['CustomerID'].count().reset_index()
labels = 'DSL','Fiber optic','No'
ax.pie(telco_internetservice['CustomerID'], labels=labels, autopct='%1.1f%%',startangle=90, colors=['lightgreen','skyblue', 'orange'])
ax.set_title('Internet Service')
plt.show()
The purpose here is to figure out whether churn differs depending on the type of internet service.
telco_intserv = telco.groupby(['Internet Service','Churn Label'])['CustomerID'].count().reset_index()
telco_intserv
| Internet Service | Churn Label | CustomerID | |
|---|---|---|---|
| 0 | DSL | No | 1962 |
| 1 | DSL | Yes | 459 |
| 2 | Fiber optic | No | 1799 |
| 3 | Fiber optic | Yes | 1297 |
| 4 | No | No | 1413 |
| 5 | No | Yes | 113 |
telco_intserv_no = telco_intserv.loc[telco_intserv['Churn Label'] == 'No',:]
telco_intserv_yes = telco_intserv.loc[telco_intserv['Churn Label'] == 'Yes',:]
telco_intserv_yes
| Internet Service | Churn Label | CustomerID | |
|---|---|---|---|
| 1 | DSL | Yes | 459 |
| 3 | Fiber optic | Yes | 1297 |
| 5 | No | Yes | 113 |
The following pie plots show that the proportion of fiber-optic customers is much higher among those who churned. This suggests there may be some issue with the fiber-optic service.
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
labels = 'DSL','Fiber Optic','No'
colors=['lightgreen','skyblue', 'orange']
# churn no
axes[0].pie(telco_intserv_no['CustomerID'], labels=labels, autopct='%1.1f%%', startangle=90, colors=colors)
axes[0].set_title('not churned')
# yes
axes[1].pie(telco_intserv_yes['CustomerID'], labels=labels, autopct='%1.1f%%', startangle=90, colors=colors)
axes[1].set_title('churned')
plt.show()
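The churn rate within each internet-service category, computed from the counts in the table above, makes the fiber-optic problem explicit:

```python
# Counts from the Internet Service x Churn Label table above
dsl_rate   = 459 / (1962 + 459)     # churn rate among DSL customers
fiber_rate = 1297 / (1799 + 1297)   # churn rate among fiber-optic customers
none_rate  = 113 / (1413 + 113)     # churn rate among customers with no internet service
print(round(dsl_rate, 3), round(fiber_rate, 3), round(none_rate, 3))
# Fiber-optic churns at roughly double the DSL rate and several times the no-internet rate.
```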
3) Tech Support¶
This column indicates whether customers with internet service also subscribe to tech support.
telco_techsupport = telco.groupby('Tech Support')['CustomerID'].count().reset_index()
telco_techsupport
| Tech Support | CustomerID | |
|---|---|---|
| 0 | No | 3473 |
| 1 | No internet service | 1526 |
| 2 | Yes | 2044 |
fig, ax = plt.subplots()
labels = 'No','No internet service', 'Yes'
colors = ['violet','darkorange','royalblue']
ax.pie(telco_techsupport['CustomerID'], labels=labels, autopct='%1.1f%%',startangle=90, colors=colors)
ax.set_title('Tech Support')
plt.show()
The following pie plots show that the share of customers without tech support is much higher among churned customers than among retained ones. This suggests that tech support is highly helpful for retaining customers.
telco_techsupport = telco.groupby(['Tech Support','Churn Label'])['CustomerID'].count().reset_index()
telco_tech_no = telco_techsupport.loc[telco_techsupport['Churn Label'] == 'No',:]
telco_tech_yes = telco_techsupport.loc[telco_techsupport['Churn Label'] == 'Yes',:]
telco_tech_no
| Tech Support | Churn Label | CustomerID | |
|---|---|---|---|
| 0 | No | No | 2027 |
| 2 | No internet service | No | 1413 |
| 4 | Yes | No | 1734 |
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
labels = 'No','No internet service', 'Yes'
colors = ['violet','darkorange','royalblue']
# churn no
axes[0].pie(telco_tech_no['CustomerID'], labels=labels, autopct='%1.1f%%', startangle=90, colors=colors)
axes[0].set_title('not churned')
# yes
axes[1].pie(telco_tech_yes['CustomerID'], labels=labels, autopct='%1.1f%%', startangle=90, colors=colors)
axes[1].set_title('churned')
plt.show()
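Since only the retained-customer table is printed above, the churned counts can be recovered by subtracting it from the overall tech-support counts, giving per-group churn rates:

```python
# Overall counts (Tech Support section) minus retained counts (table above) = churned counts
no_support_churn   = 3473 - 2027    # churned among customers without tech support
with_support_churn = 2044 - 1734    # churned among customers with tech support
print(round(no_support_churn / 3473, 3), round(with_support_churn / 2044, 3))
# Customers without tech support churn at well over twice the rate of those with it.
```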
(8) Contract¶
According to this dataset, there are 3 types of contract: month-to-month, one year, and two year. The objective of this section is to find the effect of contract type on customer churn.
telco_contract = telco.groupby(['Contract','Churn Label'])['CustomerID'].count().reset_index()
telco_contract_no = telco_contract.loc[telco_contract['Churn Label'] == 'No',:]
telco_contract_yes = telco_contract.loc[telco_contract['Churn Label'] == 'Yes',:]
telco_contract_yes
| Contract | Churn Label | CustomerID | |
|---|---|---|---|
| 1 | Month-to-month | Yes | 1655 |
| 3 | One year | Yes | 166 |
| 5 | Two year | Yes | 48 |
According to the following pie charts, customers on month-to-month contracts account for over 88% of churned customers, but only about 42% of retained customers. This suggests that customers who sign up month-to-month may already have early termination in mind while using the Telco service.
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
labels = 'Month-to-Month','One year', 'Two year'
colors=['royalblue','tomato','forestgreen']
# churn no
axes[0].pie(telco_contract_no['CustomerID'], labels=labels, autopct='%1.1f%%', startangle=90, colors=colors)
axes[0].set_title('No churn')
# yes
axes[1].pie(telco_contract_yes['CustomerID'], labels=labels, autopct='%1.1f%%', startangle=90, colors=colors)
axes[1].set_title('yes churn')
plt.show()
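The 88% figure quoted above follows directly from the churned-customer counts per contract type:

```python
# Churned-customer counts by contract type, from the table above
churned = {'Month-to-month': 1655, 'One year': 166, 'Two year': 48}
share_m2m = churned['Month-to-month'] / sum(churned.values())
print(f"{share_m2m:.1%}")   # share of month-to-month among all churned customers
```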
(9) Paperless Billing¶
This section examines whether paperless billing affects customers' decisions to end their contract with Telco. According to the bar plots below, the churn rate among customers with paperless billing is higher than among those without it.
telco_paper = telco.groupby(['Paperless Billing', 'Churn Label'])['CustomerID'].count().reset_index()
telco_paper
| Paperless Billing | Churn Label | CustomerID | |
|---|---|---|---|
| 0 | No | No | 2403 |
| 1 | No | Yes | 469 |
| 2 | Yes | No | 2771 |
| 3 | Yes | Yes | 1400 |
g = sns.catplot(
telco_paper, kind="bar",
x="Churn Label", y="CustomerID", col="Paperless Billing",
height=4, aspect=1, palette = ["green", "lightcoral"])
g.set(ylabel = 'Counts')
plt.show()
(10) Payment Method¶
The dataset shows that there are 4 payment methods: automatic bank transfer, automatic credit card, electronic check, and mailed check.
telco_payment = telco.groupby(['Payment Method','Churn Label'])['CustomerID'].count().reset_index()
telco_payment_no = telco_payment.loc[telco_payment['Churn Label'] == 'No',:]
telco_payment_yes = telco_payment.loc[telco_payment['Churn Label'] == 'Yes',:]
telco_payment_yes
| Payment Method | Churn Label | CustomerID | |
|---|---|---|---|
| 1 | Bank transfer (automatic) | Yes | 258 |
| 3 | Credit card (automatic) | Yes | 232 |
| 5 | Electronic check | Yes | 1071 |
| 7 | Mailed check | Yes | 308 |
Among retained customers, the four payment methods are almost equally common. Among churned customers, however, electronic check clearly stands out.
counts_no = telco_payment_no['CustomerID'].to_list()
counts_no
[1286, 1290, 1294, 1304]
fig, axs = plt.subplots(1,2, figsize = (12,5))
methods = ['Bank', 'Credit card','Elec','Mail']
counts_no = telco_payment_no['CustomerID'].to_list()
counts_yes = telco_payment_yes['CustomerID'].to_list()
bar_labels = methods
bar_colors = ['tab:green', 'tab:blue', 'tab:red', 'tab:orange']
# no
axs[0].bar(methods, counts_no, label=bar_labels, color=bar_colors)
axs[0].set_ylabel('# of payment users')
axs[0].set_title('no churn')
axs[0].set_ylim((0,1800))
axs[0].legend()
# yes
axs[1].bar(methods, counts_yes, label=bar_labels, color=bar_colors)
axs[1].set_ylabel('# of payment users')
axs[1].set_title('yes churn')
axs[1].set_ylim((0,1800))
axs[1].legend()
plt.show()
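Combining the retained counts (`counts_no`) with the churned counts from the table above gives per-method churn rates; both lists follow the alphabetical `groupby` order (Bank transfer, Credit card, Electronic check, Mailed check):

```python
# Retained and churned counts per payment method, in the groupby's alphabetical order
retained = [1286, 1290, 1294, 1304]
churned  = [258, 232, 1071, 308]
rates = [c / (c + r) for c, r in zip(churned, retained)]
print([round(x, 3) for x in rates])
# Electronic check churns at roughly 45%, far above the other three methods.
```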
(11) Monthly Charges¶
The 'Monthly Charges' column shows how much each customer pays per month. The following histogram shows that retained customers tend to pay lower monthly charges, while churned customers had been paying higher ones.
telco_monthly = telco[['Monthly Charges','Churn Label']]
telco_monthly_no = telco_monthly.loc[telco_monthly['Churn Label'] == 'No',:]['Monthly Charges'].to_numpy()
telco_monthly_yes = telco_monthly.loc[telco_monthly['Churn Label'] == 'Yes',:]['Monthly Charges'].to_numpy()
telco_monthly_yes
array([ 53.85, 70.7 , 99.65, ..., 75.75, 102.95, 74.4 ])
plt.hist(telco_monthly_no, density = False,bins = 50, histtype='barstacked', rwidth=0.8, color = 'green',alpha = 0.5, label = 'No churn')
plt.hist(telco_monthly_yes, density = False,bins = 50, histtype='barstacked', rwidth=0.8, color = 'lightcoral',alpha = 0.5, label = 'Yes churn')
plt.title('Monthly Charges wrt Churn')
plt.xlabel('Monthly Charges')
plt.ylabel('counts')
plt.legend()
plt.show()
(12) Total Charges¶
While churn clearly varied with monthly charges, the following histogram suggests there is no comparably clear difference with respect to total charges.
telco['Total Charges']
0        108.150002
1        151.649994
2        820.500000
3       3046.050049
4       5036.299805
            ...
7038    1419.400024
7039    1990.500000
7040    7362.899902
7041     346.450012
7042    6844.500000
Name: Total Charges, Length: 7043, dtype: float32
telco_total = telco[['Total Charges','Churn Label']]
sns.histplot(data=telco_total, x='Total Charges', hue='Churn Label', element="bars",
stat="count", common_norm=False, palette={"Yes": "blue", "No": "violet"})
plt.title('Histogram of Total Charges')
plt.show()
(13) Temporary conclusion¶
- City has almost no impact on customer churn.
- Gender has almost no impact on customer churn.
- Senior customers are more likely to leave.
- Customers with no partner or dependents are more likely to churn.
- Customers on shorter contracts are more likely to leave.
- Most correlated with churn (per the matrix above): Internet Service (+0.23), Online Security (-0.17), Tech Support (-0.16).
- Barely correlated with churn: Phone Service, Multiple Lines, Online Backup, Device Protection.
- Fiber-optic customers account for the majority of churn.
- Over 88% of churned customers were on month-to-month (short-period) contracts.
- Most churned customers paid by electronic check.
- Churned customers tend to have higher monthly charges.
To sum up, the Senior Citizen, Partner, Dependents, Tenure Months, Internet Service, Online Security, Tech Support, Contract, Payment Method, and Monthly Charges columns each affect the churn rate to some degree; these are the most important variables.
4. Other Methods¶
So far, we have analyzed which variables have the most significant impact on customer churn through a qualitative measure, EDA. This section relies on more quantitative measures to select important variables and compares the results with the EDA above.
(1) Data Preprocessing¶
Ridge and Lasso both require the data to be numerical, so I transformed the categorical columns into numeric ones as follows.
X = telco[['Latitude', 'Longitude', 'Gender', 'Senior Citizen','Partner', 'Dependents', 'Tenure Months', 'Phone Service',
           'Multiple Lines', 'Internet Service', 'Online Security','Online Backup', 'Device Protection', 'Tech Support', 'Streaming TV',
           'Streaming Movies', 'Contract', 'Paperless Billing', 'Payment Method','Monthly Charges', 'Total Charges']].copy()
X['Gender'] = X['Gender'].replace(to_replace = ['Male', 'Female'], value = [1,2])
X['Senior Citizen'] = X['Senior Citizen'].replace(to_replace = ['Yes','No'], value = [1,0])
X['Partner'] = X['Partner'].replace(to_replace = ['Yes','No'], value = [1,0])
X['Dependents'] = X['Dependents'].replace(to_replace = ['Yes','No'], value = [1,0])
X['Phone Service'] = X['Phone Service'].replace(to_replace = ['Yes','No'], value = [1,0])
X['Multiple Lines'] = X['Multiple Lines'].replace(to_replace = ['Yes','No','No phone service'], value = [1,0,0])
X['Internet Service'] = X['Internet Service'].replace(to_replace = ['DSL','Fiber optic','No'], value = [1,2,0])
X['Online Security'] = X['Online Security'].replace(to_replace = ['Yes','No','No internet service'], value = [1,0,0])
X['Online Backup'] = X['Online Backup'].replace(to_replace = ['Yes','No','No internet service'], value = [1,0,0])
X['Device Protection'] = X['Device Protection'].replace(to_replace = ['Yes','No','No internet service'], value = [1,0,0])
X['Tech Support'] = X['Tech Support'].replace(to_replace = ['Yes','No','No internet service'], value = [1,0,0])
X['Streaming TV'] = X['Streaming TV'].replace(to_replace = ['Yes','No','No internet service'], value = [1,0,0])
X['Streaming Movies'] = X['Streaming Movies'].replace(to_replace = ['Yes','No','No internet service'], value = [1,0,0])
X['Contract'] = X['Contract'].replace(to_replace = ['Month-to-month', 'One year','Two year'], value = [0,1,2])
X['Paperless Billing'] = X['Paperless Billing'].replace(to_replace = ['Yes','No'], value = [1,0])
X['Payment Method'] = X['Payment Method'].replace(to_replace = X['Payment Method'].unique(), value = [1,2,3,4])
X.head(10)
| Latitude | Longitude | Gender | Senior Citizen | Partner | Dependents | Tenure Months | Phone Service | Multiple Lines | Internet Service | ... | Online Backup | Device Protection | Tech Support | Streaming TV | Streaming Movies | Contract | Paperless Billing | Payment Method | Monthly Charges | Total Charges | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 33.964131 | -118.272783 | 1 | 0 | 0 | 0 | 2 | 1 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 53.85 | 108.150002 |
| 1 | 34.059281 | -118.307420 | 2 | 0 | 0 | 1 | 2 | 1 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 70.70 | 151.649994 |
| 2 | 34.048013 | -118.293953 | 2 | 0 | 0 | 1 | 8 | 1 | 1 | 2 | ... | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 2 | 99.65 | 820.500000 |
| 3 | 34.062125 | -118.315709 | 2 | 0 | 1 | 1 | 28 | 1 | 1 | 2 | ... | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 2 | 104.80 | 3046.050049 |
| 4 | 34.039224 | -118.266293 | 1 | 0 | 0 | 1 | 49 | 1 | 1 | 2 | ... | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 3 | 103.70 | 5036.299805 |
| 5 | 34.066367 | -118.309868 | 2 | 0 | 1 | 0 | 10 | 1 | 0 | 1 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 55.20 | 528.349976 |
| 6 | 34.023810 | -118.156582 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 2 | 39.65 | 39.650002 |
| 7 | 34.066303 | -118.435479 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 20.15 | 20.150000 |
| 8 | 34.099869 | -118.326843 | 1 | 0 | 1 | 1 | 47 | 1 | 1 | 2 | ... | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 2 | 99.35 | 4749.149902 |
| 9 | 34.089953 | -118.294824 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 30.20 | 30.200001 |
10 rows × 21 columns
The output confirms that all columns have been converted to numerical values.
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Latitude           7043 non-null   float64
 1   Longitude          7043 non-null   float64
 2   Gender             7043 non-null   int64
 3   Senior Citizen     7043 non-null   int64
 4   Partner            7043 non-null   int64
 5   Dependents         7043 non-null   int64
 6   Tenure Months      7043 non-null   int64
 7   Phone Service      7043 non-null   int64
 8   Multiple Lines     7043 non-null   int64
 9   Internet Service   7043 non-null   int64
 10  Online Security    7043 non-null   int64
 11  Online Backup      7043 non-null   int64
 12  Device Protection  7043 non-null   int64
 13  Tech Support       7043 non-null   int64
 14  Streaming TV       7043 non-null   int64
 15  Streaming Movies   7043 non-null   int64
 16  Contract           7043 non-null   int64
 17  Paperless Billing  7043 non-null   int64
 18  Payment Method     7043 non-null   int64
 19  Monthly Charges    7043 non-null   float64
 20  Total Charges      7043 non-null   float32
dtypes: float32(1), float64(3), int64(17)
memory usage: 1.1 MB
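As an aside, the same kind of encoding can be written without triggering SettingWithCopyWarning by mapping each column through an explicit dictionary. The snippet below is a sketch on a toy frame with hypothetical values, not the notebook's original code.

```python
import pandas as pd

# Toy frame standing in for a few of the Telco columns (hypothetical values)
df = pd.DataFrame({
    'Partner': ['Yes', 'No', 'Yes'],
    'Multiple Lines': ['Yes', 'No phone service', 'No'],
    'Contract': ['Month-to-month', 'Two year', 'One year'],
})

# One dictionary per column: explicit and order-independent mappings
mappings = {
    'Partner': {'Yes': 1, 'No': 0},
    'Multiple Lines': {'Yes': 1, 'No': 0, 'No phone service': 0},
    'Contract': {'Month-to-month': 0, 'One year': 1, 'Two year': 2},
}

df = df.copy()  # work on a real copy, so assignments hit the frame itself
for col, mapping in mappings.items():
    df[col] = df[col].map(mapping)

print(df)
```

Spelling out each mapping also avoids the ordering dependence of `replace(to_replace=unique(), ...)`, where the codes depend on which category happens to appear first.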
We use StandardScaler() from "sklearn" to standardize all the data. After splitting the churn value from the rest of the data, we standardize the predictors.
scaler = StandardScaler()
y = telco['Churn Value']
X_scaled = scaler.fit_transform(X)
(2) LASSO¶
Using Lasso, we can see how each logistic regression coefficient drops to zero as the penalty increases. The variables that survive longer under a growing penalty are the most important ones. The Lasso result shows that Internet Service, Tenure Months, Dependents, Online Security, Contract, Paperless Billing, and Streaming Movies are the most important variables.
alphas = np.logspace(-2, 4, 200)
coefs = []
for alpha in alphas:
model = LogisticRegression(penalty='l1', C=1/alpha, solver='liblinear') # l1 : LASSO
model.fit(X_scaled, y)
coefs.append(model.coef_[0])
coefs = np.array(coefs)
plt.figure(figsize=(24, 8))
plt.subplot(1, 2, 1)
for i, variable_name in enumerate(X.columns[:11]):
plt.plot(alphas, coefs[:, i], label=variable_name)
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Coefficient Value')
plt.title('Logistic Regression Coefficient Paths with LASSO Penalty (from Latitude to Online Security)')
plt.legend(loc = 'lower right')
plt.subplot(1, 2, 2)
for i, variable_name in enumerate(X.columns[11:]):
plt.plot(alphas, coefs[:, i+11], label=variable_name)
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Coefficient Value')
plt.title('Logistic Regression Coefficient Paths with LASSO Penalty (from Online Backup to Total Charges)')
plt.legend(loc = 'lower right')
plt.show()
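The sparsity behaviour that the path plot visualizes can also be checked numerically: at a strong enough L1 penalty, only the coefficients of genuinely informative features stay nonzero. The following is a minimal sketch on synthetic data (not the Telco data), where only the first two of five features drive the outcome.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))
# Only the first two features actually influence the outcome
logit = 2.0 * X[:, 0] - 1.5 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Small C = strong penalty (C = 1/alpha), as in the path plots above
weak = LogisticRegression(penalty='l1', C=10.0, solver='liblinear',
                          random_state=0).fit(X, y)
strong = LogisticRegression(penalty='l1', C=0.02, solver='liblinear',
                            random_state=0).fit(X, y)

print('nonzero coefficients (weak penalty)  :', np.count_nonzero(weak.coef_))
print('nonzero coefficients (strong penalty):', np.count_nonzero(strong.coef_))
```

Under the strong penalty the noise features are driven exactly to zero while the informative ones survive, which is the behaviour we read off the coefficient paths.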
(3) Ridge¶
Using Ridge, we can see how each logistic regression coefficient shrinks toward zero. The variables whose coefficients decrease slowly as the penalty increases are the most important ones. The Ridge result shows that Internet Service, Tenure Months, Dependents, Online Security, Total Charges, Monthly Charges, Contract, and Tech Support are the most important variables.
alphas = np.logspace(-2, 6, 200)
coefs = []
for alpha in alphas:
model = LogisticRegression(penalty='l2', C=1/alpha, solver='liblinear') # l2 : Ridge
model.fit(X_scaled, y)
coefs.append(model.coef_[0])
coefs = np.array(coefs)
plt.figure(figsize=(24, 8))
plt.subplot(1, 2, 1)
for i, variable_name in enumerate(X.columns[:11]):
plt.plot(alphas, coefs[:, i], label=variable_name)
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Coefficient Value')
plt.title('Logistic Regression Coefficient Paths with Ridge Penalty (from Latitude to Online Security)')
plt.legend()
plt.subplot(1, 2, 2)
for i, variable_name in enumerate(X.columns[11:]):
plt.plot(alphas, coefs[:, i+11], label=variable_name)
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Coefficient Value')
plt.title('Logistic Regression Coefficient Paths with Ridge Penalty (from Online Backup to Total Charges)')
plt.legend() # loc = 'lower right'
plt.show()
(4) Conclusion¶
We decided to select the following 8 variables as the most influential: Dependents, Contract, Tenure Months, Paperless Billing, Internet Service, Tech Support, Monthly Charges, and Total Charges.
4. Analysis Method
Since the variable of interest—whether a customer churns—is binary, logistic regression was used for the analysis. It was assumed that the dependent variable, customer churn status, follows an independent Bernoulli distribution with the churn probability \(μ_i\) as its parameter. Using the logit link function, the relationship between each customer’s churn probability and the explanatory variables was modeled.
\[\begin{aligned} y_i & \overset{iid}{\sim} \text{Bernoulli}(\mu_i), \\ \mu_i&= \frac{\exp\{x_i^{\top}\beta\}}{1 + \exp\{x_i^{\top}\beta\}}. \end{aligned}\]
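The logit link above maps any linear predictor \(x_i^{\top}\beta\) to a valid probability in \((0, 1)\), and taking the log-odds recovers the linear predictor exactly. A quick numerical check with made-up values of \(x_i^{\top}\beta\):

```python
import numpy as np

def inv_logit(eta):
    """mu = exp(eta) / (1 + exp(eta)), the inverse logit (sigmoid)."""
    return np.exp(eta) / (1 + np.exp(eta))

# Hypothetical linear predictors x_i' beta
eta = np.array([-3.0, 0.0, 3.0])
mu = inv_logit(eta)

print(mu)                      # probabilities strictly between 0 and 1
print(np.log(mu / (1 - mu)))   # the logit recovers eta
```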
For parameter estimation, the Bayesian approach was employed, treating each parameter as a random variable and estimating its distribution. A non-informative prior distribution was chosen for this purpose. This decision was made because there is no definite prior information, such as the parameters following a normal distribution or being strictly positive. Therefore, it was deemed natural to let the data alone determine the parameter values (let the data speak for themselves).
\[p(\beta) \propto 1.\]
The original data was split into a training set and a test set. The training set was used for model fitting, while the model’s performance was evaluated using the test set. For estimating the parameters to infer the churn probability in the Bayesian model, the posterior mean was used. A customer was predicted to have churned if the estimated churn probability from each model exceeded 0.5.
Additionally, to examine the extent of any differences from the results of the frequentist logistic regression model, the predicted customer churn statuses from both the Bayesian model and the frequentist model were compared.
5. Bayesian Data Analysis
From this section on, we implement the Bayesian data analysis in earnest using RStan.
(1) Data Preprocess in R
Since RStan will be used in the analysis, we import the data again in R.
rm(list = ls())
library(readxl)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(rstan)
## Loading required package: StanHeaders
##
## rstan version 2.32.3 (Stan version 2.26.1)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
## change `threads_per_chain` option:
## rstan_options(threads_per_chain = 1)
## Do not specify '-march=native' in 'LOCAL_CPPFLAGS' or a Makevars file
telco <- read_excel("Telco_customer_churn_data.xlsx")
The ‘Total Charges’ column has some missing values, as mentioned earlier in the EDA section, so we replace them with the Monthly Charges values.
#Find missing values and replace them
colSums(is.na(telco))
## CustomerID Count Country State
## 0 0 0 0
## City Zip Code Lat Long Latitude
## 0 0 0 0
## Longitude Gender Senior Citizen Partner
## 0 0 0 0
## Dependents Tenure Months Phone Service Multiple Lines
## 0 0 0 0
## Internet Service Online Security Online Backup Device Protection
## 0 0 0 0
## Tech Support Streaming TV Streaming Movies Contract
## 0 0 0 0
## Paperless Billing Payment Method Monthly Charges Total Charges
## 0 0 0 11
## Churn Label Churn Value Churn Score CLTV
## 0 0 0 0
## Churn Reason
## 5174
telco %>%
mutate(`Total Charges` = ifelse(is.na(`Total Charges`), `Monthly Charges`,
`Total Charges`)) -> telco
colSums(is.na(telco))
## CustomerID Count Country State
## 0 0 0 0
## City Zip Code Lat Long Latitude
## 0 0 0 0
## Longitude Gender Senior Citizen Partner
## 0 0 0 0
## Dependents Tenure Months Phone Service Multiple Lines
## 0 0 0 0
## Internet Service Online Security Online Backup Device Protection
## 0 0 0 0
## Tech Support Streaming TV Streaming Movies Contract
## 0 0 0 0
## Paperless Billing Payment Method Monthly Charges Total Charges
## 0 0 0 0
## Churn Label Churn Value Churn Score CLTV
## 0 0 0 0
## Churn Reason
## 5174
Drawing on the EDA as well as LASSO and Ridge, we select the response Churn Value together with the 8 chosen predictors: Dependents, Contract, Tenure Months, Paperless Billing, Internet Service, Tech Support, Monthly Charges, and Total Charges. We also convert the categorical columns into numerical ones.
# Select variable to use
telco %>%
select(`Churn Value`, Dependents, Contract, `Tenure Months`,
`Paperless Billing`, `Internet Service`, `Tech Support`,
`Monthly Charges`, `Total Charges`) %>%
mutate(Dependents = ifelse(Dependents == "Yes", 1, 0),
Contract = ifelse(Contract == "Month-to-month", 1, 0),
`Paperless Billing` = ifelse(`Paperless Billing` == "Yes", 1, 0),
`Internet Service` = ifelse(`Internet Service` == "Fiber optic", 1, 0),
`Tech Support` = ifelse(`Tech Support` == "Yes", 1,0)) -> telco
head(telco)
## # A tibble: 6 × 9
## `Churn Value` Dependents Contract `Tenure Months` `Paperless Billing`
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 1 2 1
## 2 1 1 1 2 1
## 3 1 1 1 8 1
## 4 1 1 1 28 1
## 5 1 1 1 49 1
## 6 1 0 1 10 0
## # ℹ 4 more variables: `Internet Service` <dbl>, `Tech Support` <dbl>,
## # `Monthly Charges` <dbl>, `Total Charges` <dbl>
The values in the Tenure Months, Monthly Charges, and Total Charges columns are generally large, so we standardize them using the scale function.
to_scale <- c( "Tenure Months", "Monthly Charges", "Total Charges")
telco[,to_scale] <- scale(telco[,to_scale])
Now, we split the data into a training set and a test set. We fit the model in RStan on the training set and evaluate the performance of the Bayesian logistic regression on the test set.
y <- telco$`Churn Value`
X <- telco[,!(colnames(telco) %in% c('Churn Value', 'group'))]
set.seed(7043)
row_indices <- sample(nrow(X))
train_size <- round(0.7 * nrow(X))
X_train <- X[row_indices[1:train_size], ]
y_train <- y[row_indices[1:train_size]]
X_test <- X[row_indices[(train_size + 1):nrow(X)], ]
y_test <- y[row_indices[(train_size + 1):nrow(X)]]
(2) Model Fitting
The following code defines the Stan model telco.glm used to fit the Bayesian logistic regression. Since we chose a non-informative (flat) prior, there are no prior statements for alpha and beta in the model block.
telco.glm = "
data {
int<lower=0> N; // number of subjects
int<lower=0> K; // number of predictors
matrix[N, K] X; // predictor matrix
int<lower=0,upper=1> y[N]; // binary outcome
}
parameters {
real alpha; // intercept
vector[K] beta; // coefficients for predictors
}
model {
y ~ bernoulli_logit(alpha + X * beta);
}
"
We fit the model as follows.
sm = stan_model(model_code=telco.glm)
fit = sampling(
object = sm,
data = list(y=y_train, N=nrow(X_train), K=ncol(X_train), X=X_train), # observed data
chains = 4, # number of multiple chains
warmup = 1000, # number of burn-in iterations
iter = 4000, # number of iterations per chain
cores = 4 # number of cores to use (e.g., 1 per chain)
)
The fitting result is as follows. The posterior mean of each coefficient is shown in the mean column. The rows beta[1] through beta[8] correspond to Dependents, Contract, Tenure Months, Paperless Billing, Internet Service, Tech Support, Monthly Charges, and Total Charges, respectively.
print(fit)
## Inference for Stan model: anon_model.
## 4 chains, each with iter=4000; warmup=1000; thin=1;
## post-warmup draws per chain=3000, total post-warmup draws=12000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5%
## alpha -2.04 0.00 0.13 -2.30 -2.13 -2.04 -1.95 -1.78
## beta[1] -1.58 0.00 0.14 -1.85 -1.67 -1.58 -1.48 -1.31
## beta[2] 0.91 0.00 0.12 0.69 0.83 0.91 0.99 1.14
## beta[3] -1.15 0.00 0.16 -1.46 -1.26 -1.15 -1.04 -0.85
## beta[4] 0.49 0.00 0.09 0.32 0.43 0.49 0.55 0.66
## beta[5] 0.30 0.00 0.15 0.00 0.20 0.30 0.40 0.59
## beta[6] -0.56 0.00 0.11 -0.77 -0.63 -0.56 -0.48 -0.35
## beta[7] 0.51 0.00 0.10 0.32 0.45 0.51 0.58 0.71
## beta[8] 0.34 0.00 0.17 0.02 0.23 0.34 0.46 0.67
## lp__ -2081.10 0.03 2.13 -2086.09 -2082.29 -2080.79 -2079.56 -2077.94
## n_eff Rhat
## alpha 7287 1
## beta[1] 12090 1
## beta[2] 10456 1
## beta[3] 8451 1
## beta[4] 11853 1
## beta[5] 8413 1
## beta[6] 10161 1
## beta[7] 7502 1
## beta[8] 8135 1
## lp__ 5326 1
##
## Samples were drawn using NUTS(diag_e) at Sat Mar 22 14:15:59 2025.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
The effect of each variable on the churn probability depends on the sign of its posterior mean. Since the posterior means of Dependents, Tenure Months, and Tech Support have negative values, it can be interpreted that customers with dependents, longer subscription periods, and those subscribed to the technical support service are less likely to churn. In particular, having dependents significantly lowers the churn probability.
On the other hand, the variables Contract, Paperless Billing, Internet Service, Monthly Charges, and Total Charges have positive posterior mean values. This indicates that customers who have month-to-month contracts, receive paperless billing, use fiber optic internet service, and have higher monthly and total charges are more likely to churn. Especially, customers on month-to-month contracts show a notably higher probability of churn.
The effect size of each variable can be assessed by taking the exponential function of its posterior mean. For example, in the case of the Contract variable, the odds ratio of churn between customers with month-to-month contracts and those without is exp(0.91) = 2.484. For continuous variables, interpretation should consider standardization. For instance, regarding Tenure Months, since the standard deviation in the training set is 24.559, the odds ratio of churn for an increase of approximately 24.5 months in tenure is exp(-1.15) = 0.3166.
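The odds-ratio arithmetic above can be reproduced directly by exponentiating the posterior means copied from the fit table:

```python
import numpy as np

# Posterior means taken from the Stan fit summary above
post_mean = {
    'Contract (month-to-month)': 0.91,
    'Tenure Months (per 1 SD of tenure)': -1.15,
}

for name, beta in post_mean.items():
    # exp(beta) is the multiplicative change in the odds of churn
    print(f"{name}: odds ratio = {np.exp(beta):.4f}")
```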
plot(fit,pars=c("beta"))
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)
The plot above shows each posterior mean (black dot) and its credible intervals (red lines). Each black dot represents the estimated effect (on the log-odds scale) of the corresponding predictor, typically the posterior mean or median of that coefficient. The horizontal red lines show the uncertainty around each estimate, for example, a 95% credible interval. If an interval does not cross zero, the model strongly supports a positive or negative effect, depending on which side of zero it lies.
Overall, these results tell us which predictors have a credible (statistically meaningful) influence on churn and in which direction. Since none of the credible intervals contains zero, all 8 variables appear to have an impact on customer churn.
The following code draws the trace plot for each coefficient. The plot shows that all of the chains mix well and every coefficient has converged.
traceplot(fit,pars=c("beta"))
(3) Test & Result
Now, we check the performance of the model on the test dataset. The code randomly extracts one posterior draw per test observation. It then computes the linear predictor by multiplying this set of parameters with the corresponding test observation (after adding an intercept column), transforms it to a probability using the logistic function, and finally classifies based on a 0.5 threshold.
set.seed(2113)
param = extract(fit) # output
n_test <- nrow(X_test)
params_mt <- cbind(param$alpha, param$beta)
idx <- sample(1:nrow(params_mt), n_test)
params_test <- params_mt[idx,]
Xbeta_test <- params_test*cbind(1,X_test)
The resulting accuracy is about 80.8%.
pred_prob_ba <- plogis(rowSums(Xbeta_test))
y_pred_ba <- ifelse(pred_prob_ba >= 0.5, 1, 0)
acc_ba <- mean(y_pred_ba == y_test)
cat("Accuracy of Bayesian Logistic Regression :", acc_ba, "\n")
## Accuracy of Bayesian Logistic Regression : 0.8083294
6. Comparison with Frequentist Method
The posterior means from the Bayesian model show no significant difference from the frequentist coefficient estimates. Comparing the churn prediction accuracy of the two models on the test set, both achieve almost the same accuracy of about 80.9%. Therefore, the Bayesian method produces results equivalent to the frequentist one while retaining the advantages of the Bayesian approach.
freq.logistic <- glm(y_train ~ ., data = X_train, family = binomial)
summary(freq.logistic)
##
## Call:
## glm(formula = y_train ~ ., family = binomial, data = X_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.03303 0.13175 -15.430 < 2e-16 ***
## Dependents -1.56869 0.13489 -11.630 < 2e-16 ***
## Contract 0.91006 0.11744 7.749 9.24e-15 ***
## `Tenure Months` -1.14428 0.15857 -7.216 5.35e-13 ***
## `Paperless Billing` 0.48643 0.08765 5.550 2.86e-08 ***
## `Internet Service` 0.29750 0.15170 1.961 0.0499 *
## `Tech Support` -0.55338 0.10727 -5.159 2.49e-07 ***
## `Monthly Charges` 0.51314 0.10035 5.113 3.17e-07 ***
## `Total Charges` 0.33899 0.16618 2.040 0.0414 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5736.8 on 4929 degrees of freedom
## Residual deviance: 4153.2 on 4921 degrees of freedom
## AIC: 4171.2
##
## Number of Fisher Scoring iterations: 6
freq.logistic$coefficients
## (Intercept) Dependents Contract `Tenure Months`
## -2.0330256 -1.5686940 0.9100605 -1.1442809
## `Paperless Billing` `Internet Service` `Tech Support` `Monthly Charges`
## 0.4864317 0.2975031 -0.5533824 0.5131411
## `Total Charges`
## 0.3389851
pred_prob_f <- predict(freq.logistic, newdata = X_test, type = "response")
y_pred_f <- ifelse(pred_prob_f >= 0.5, 1, 0)
acc_f <- mean(y_pred_f == y_test)
cat("Accuracy of Frequentist Logistic Regression :", acc_f, "\n")
## Accuracy of Frequentist Logistic Regression : 0.8088027