4. Analysis Method

Since the variable of interest (whether a customer churns) is binary, logistic regression was used for the analysis. The dependent variable, customer churn status, was assumed to follow independent Bernoulli distributions with churn probability \(μ_i\) as the parameter. Using the logit link function, each customer's churn probability was modeled as a function of the explanatory variables.

\[\begin{aligned} y_i & \overset{ind}{\sim} \text{Bernoulli}(\mu_i), \\ \mu_i&= \frac{\exp\{x_i^{\top}\beta\}}{1 + \exp\{x_i^{\top}\beta\}}. \end{aligned}\]

For parameter estimation, the Bayesian approach was employed: each parameter is treated as a random variable and its posterior distribution is estimated. A non-informative prior distribution was chosen because there is no definite prior information, such as the parameters following a normal distribution or being strictly positive. It therefore seemed natural to let the data alone determine the parameter values (let the data speak for themselves).

\[p(\beta) \propto 1.\]

The original data was split into a training set and a test set. The training set was used for model fitting, while the model’s performance was evaluated using the test set. For estimating the parameters to infer the churn probability in the Bayesian model, the posterior mean was used. A customer was predicted to have churned if the estimated churn probability from each model exceeded 0.5.

Additionally, to examine the extent of any differences from the results of the frequentist logistic regression model, the predicted customer churn statuses from both the Bayesian model and the frequentist model were compared.

5. Bayesian Data Analysis

From this section onward, we carry out the Bayesian data analysis using RStan.

(1) Data Preprocessing in R

Since RStan is used for the analysis, we first load the data in R.

rm(list = ls())

library(readxl)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(rstan)
## Loading required package: StanHeaders
## 
## rstan version 2.32.3 (Stan version 2.26.1)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
## change `threads_per_chain` option:
## rstan_options(threads_per_chain = 1)
## Do not specify '-march=native' in 'LOCAL_CPPFLAGS' or a Makevars file
telco <- read_excel("Telco_customer_churn_data.xlsx")

The ‘Total Charges’ column has some missing values, as noted in the EDA section. We replace them with the corresponding ‘Monthly Charges’ values.

#Find missing values and replace them
colSums(is.na(telco))
##        CustomerID             Count           Country             State 
##                 0                 0                 0                 0 
##              City          Zip Code          Lat Long          Latitude 
##                 0                 0                 0                 0 
##         Longitude            Gender    Senior Citizen           Partner 
##                 0                 0                 0                 0 
##        Dependents     Tenure Months     Phone Service    Multiple Lines 
##                 0                 0                 0                 0 
##  Internet Service   Online Security     Online Backup Device Protection 
##                 0                 0                 0                 0 
##      Tech Support      Streaming TV  Streaming Movies          Contract 
##                 0                 0                 0                 0 
## Paperless Billing    Payment Method   Monthly Charges     Total Charges 
##                 0                 0                 0                11 
##       Churn Label       Churn Value       Churn Score              CLTV 
##                 0                 0                 0                 0 
##      Churn Reason 
##              5174
telco %>% 
  mutate(`Total Charges` = ifelse(is.na(`Total Charges`), `Monthly Charges`,
                                   `Total Charges`)) -> telco
colSums(is.na(telco))
##        CustomerID             Count           Country             State 
##                 0                 0                 0                 0 
##              City          Zip Code          Lat Long          Latitude 
##                 0                 0                 0                 0 
##         Longitude            Gender    Senior Citizen           Partner 
##                 0                 0                 0                 0 
##        Dependents     Tenure Months     Phone Service    Multiple Lines 
##                 0                 0                 0                 0 
##  Internet Service   Online Security     Online Backup Device Protection 
##                 0                 0                 0                 0 
##      Tech Support      Streaming TV  Streaming Movies          Contract 
##                 0                 0                 0                 0 
## Paperless Billing    Payment Method   Monthly Charges     Total Charges 
##                 0                 0                 0                 0 
##       Churn Label       Churn Value       Churn Score              CLTV 
##                 0                 0                 0                 0 
##      Churn Reason 
##              5174

Based on the EDA as well as LASSO and ridge regression, we selected nine columns: the response Churn Value and the eight predictors Dependents, Contract, Tenure Months, Paperless Billing, Internet Service, Tech Support, Monthly Charges, and Total Charges. We also convert the categorical variables into numeric indicators.

# Select variable to use

telco %>%
  select(`Churn Value`, Dependents, Contract, `Tenure Months`,
         `Paperless Billing`, `Internet Service`, `Tech Support`,
         `Monthly Charges`, `Total Charges`) %>%
  mutate(Dependents = ifelse(Dependents == "Yes", 1, 0),
         Contract = ifelse(Contract == "Month-to-month", 1, 0),
         `Paperless Billing` = ifelse(`Paperless Billing` == "Yes", 1, 0),
         `Internet Service` = ifelse(`Internet Service` == "Fiber optic", 1, 0),
         `Tech Support` = ifelse(`Tech Support` == "Yes", 1,0)) -> telco

head(telco)
## # A tibble: 6 × 9
##   `Churn Value` Dependents Contract `Tenure Months` `Paperless Billing`
##           <dbl>      <dbl>    <dbl>           <dbl>               <dbl>
## 1             1          0        1               2                   1
## 2             1          1        1               2                   1
## 3             1          1        1               8                   1
## 4             1          1        1              28                   1
## 5             1          1        1              49                   1
## 6             1          0        1              10                   0
## # ℹ 4 more variables: `Internet Service` <dbl>, `Tech Support` <dbl>,
## #   `Monthly Charges` <dbl>, `Total Charges` <dbl>

The columns “Tenure Months”, “Monthly Charges”, and “Total Charges” are on much larger scales than the indicator variables, so we standardize them using the scale function.

to_scale <- c( "Tenure Months", "Monthly Charges", "Total Charges")
telco[,to_scale] <- scale(telco[,to_scale])
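As a quick sanity check (on toy data, not the telco columns), scale subtracts each column's mean and divides by its standard deviation:

```r
# scale() standardizes a column: (x - mean(x)) / sd(x)
x <- c(2, 4, 6, 8)
s <- as.numeric(scale(x))         # drop the matrix attributes
all.equal(s, (x - mean(x)) / sd(x))  # TRUE

# The standardized values have mean 0 and standard deviation 1
c(mean(s), sd(s))
```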

Now we split the data into a training set and a test set. The training set is used to fit the model in RStan, and the test set is used to evaluate the performance of the Bayesian logistic regression.

y <- telco$`Churn Value`
X <- telco[,!(colnames(telco) %in% c('Churn Value', 'group'))]

set.seed(7043)
row_indices <- sample(nrow(X))
train_size <- round(0.7 * nrow(X))

X_train <- X[row_indices[1:train_size], ]
y_train <- y[row_indices[1:train_size]]

X_test <- X[row_indices[(train_size + 1):nrow(X)], ]
y_test <- y[row_indices[(train_size + 1):nrow(X)]]

(2) Model Fitting

The following Stan program, telco.glm, specifies the Bayesian logistic regression model. Since we chose a non-informative prior, the model block contains no prior statements for alpha and beta; Stan then implicitly assigns them flat (improper uniform) priors.

telco.glm = "
data {
  int<lower=0> N;             // number of subjects
  int<lower=0> K;             // number of predictors
  matrix[N, K] X;             // predictor matrix
  int<lower=0,upper=1> y[N];  // binary outcome
}
parameters {
  real alpha;           // intercept
  vector[K] beta;       // coefficients for predictors
}
model {
  y ~ bernoulli_logit(alpha + X * beta);  
}
"
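For reference, if one preferred weakly informative priors to the flat default, only the model block would change. A sketch (the prior scales here are illustrative choices, not taken from this analysis):

```
model {
  alpha ~ normal(0, 5);    // weakly informative prior on the intercept
  beta ~ normal(0, 2.5);   // weakly informative priors on the coefficients
  y ~ bernoulli_logit(alpha + X * beta);
}
```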

We fit the model as follows.

sm = stan_model(model_code=telco.glm)
fit = sampling(
  object = sm,
  data = list(y=y_train, N=nrow(X_train), K=ncol(X_train), X=X_train), # observed data
  chains = 4,    # number of multiple chains
  warmup = 1000, # number of burn-in iterations
  iter   = 4000, # number of iterations per chain
  cores = 4      # number of cores to use (e.g., 1 per chain)
)

The fitting result is as follows. The posterior mean of each coefficient is shown in the mean column. The rows beta[1] through beta[8] correspond to Dependents, Contract, Tenure Months, Paperless Billing, Internet Service, Tech Support, Monthly Charges, and Total Charges, respectively.

print(fit)
## Inference for Stan model: anon_model.
## 4 chains, each with iter=4000; warmup=1000; thin=1; 
## post-warmup draws per chain=3000, total post-warmup draws=12000.
## 
##             mean se_mean   sd     2.5%      25%      50%      75%    97.5%
## alpha      -2.04    0.00 0.13    -2.30    -2.13    -2.04    -1.95    -1.78
## beta[1]    -1.58    0.00 0.14    -1.85    -1.67    -1.58    -1.48    -1.31
## beta[2]     0.91    0.00 0.12     0.69     0.83     0.91     0.99     1.14
## beta[3]    -1.15    0.00 0.16    -1.46    -1.26    -1.15    -1.04    -0.85
## beta[4]     0.49    0.00 0.09     0.32     0.43     0.49     0.55     0.66
## beta[5]     0.30    0.00 0.15     0.00     0.20     0.30     0.40     0.59
## beta[6]    -0.56    0.00 0.11    -0.77    -0.63    -0.56    -0.48    -0.35
## beta[7]     0.51    0.00 0.10     0.32     0.45     0.51     0.58     0.71
## beta[8]     0.34    0.00 0.17     0.02     0.23     0.34     0.46     0.67
## lp__    -2081.10    0.03 2.13 -2086.09 -2082.29 -2080.79 -2079.56 -2077.94
##         n_eff Rhat
## alpha    7287    1
## beta[1] 12090    1
## beta[2] 10456    1
## beta[3]  8451    1
## beta[4] 11853    1
## beta[5]  8413    1
## beta[6] 10161    1
## beta[7]  7502    1
## beta[8]  8135    1
## lp__     5326    1
## 
## Samples were drawn using NUTS(diag_e) at Sat Mar 22 14:15:59 2025.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

The effect of each variable on the churn probability depends on the sign of its posterior mean. Since the posterior means of Dependents, Tenure Months, and Tech Support have negative values, it can be interpreted that customers with dependents, longer subscription periods, and those subscribed to the technical support service are less likely to churn. In particular, having dependents significantly lowers the churn probability.

On the other hand, the variables Contract, Paperless Billing, Internet Service, Monthly Charges, and Total Charges have positive posterior mean values. This indicates that customers who have month-to-month contracts, receive paperless billing, use fiber optic internet service, and have higher monthly and total charges are more likely to churn. Especially, customers on month-to-month contracts show a notably higher probability of churn.

The effect size of each variable can be assessed by taking the exponential function of its posterior mean. For example, in the case of the Contract variable, the odds ratio of churn between customers with month-to-month contracts and those without is exp(0.91) = 2.484. For continuous variables, interpretation should consider standardization. For instance, regarding Tenure Months, since the standard deviation in the training set is 24.559, the odds ratio of churn for an increase of approximately 24.5 months in tenure is exp(-1.15) = 0.3166.
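The odds-ratio arithmetic above can be reproduced directly in R, plugging in the posterior means reported in the fit summary:

```r
# Odds ratios implied by the posterior means above
exp(0.91)    # Contract (month-to-month vs. other): about 2.484
exp(-1.15)   # Tenure Months (+1 SD, about 24.5 months): about 0.317
```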

plot(fit,pars=c("beta"))
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)

The plot above shows the posterior mean of each coefficient (black dot) and its credible intervals (red lines). Each black dot is the estimated effect of the corresponding predictor on the log-odds scale. The red lines show the uncertainty around each estimate: the thick segment is the 80% credible interval and the thin one the 95% interval, as indicated by ci_level and outer_level in the output. If an interval does not cross zero, the model strongly supports a positive or negative effect, depending on which side of zero it lies.

Overall, these results tell us which predictors have a credible influence on churn and in which direction. None of the 95% credible intervals contains zero (those for Internet Service and Total Charges only barely exclude it), suggesting that all eight predictors are associated with customer churn.

The following code produces a trace plot for each coefficient. The plots show that all four chains mix well and the coefficients have converged.

traceplot(fit,pars=c("beta"))

(3) Test & Result

Now we check the performance of the model on the test dataset. The code randomly extracts one posterior draw per test observation. It then computes the linear predictor by multiplying that parameter draw with the corresponding test row (after adding an intercept column), transforms it into a probability using the logistic function, and classifies based on a 0.5 threshold.

set.seed(2113) 

param = extract(fit) # output

n_test <- nrow(X_test)
params_mt <- cbind(param$alpha, param$beta)
idx <- sample(1:nrow(params_mt), n_test)
params_test <- params_mt[idx,]
Xbeta_test <- params_test*cbind(1,X_test) 

The resulting accuracy is about 80.8%.

pred_prob_ba  <- plogis(rowSums(Xbeta_test))
y_pred_ba <- ifelse(pred_prob_ba >= 0.5, 1, 0)
acc_ba <- mean(y_pred_ba == y_test)
cat("Accuracy of Bayesian Logistic Regression :", acc_ba, "\n")
## Accuracy of Bayesian Logistic Regression : 0.8083294
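A common alternative to using one random parameter draw per observation is to average the churn probability over all posterior draws (the posterior predictive mean). A minimal self-contained sketch, with simulated draws standing in for param$alpha and param$beta; with the real fit one would substitute cbind(param$alpha, param$beta) and the intercept-augmented X_test:

```r
# Posterior-predictive averaging, sketched with simulated draws (K = 2 predictors)
set.seed(1)
draws <- cbind(rnorm(1000, -2, 0.1),          # stand-in for param$alpha
               matrix(rnorm(1000 * 2), 1000)) # stand-ins for param$beta
X_new <- rbind(c(0, 1), c(1, 0))              # two hypothetical customers
eta   <- cbind(1, X_new) %*% t(draws)         # N x draws matrix of linear predictors
p_hat <- rowMeans(plogis(eta))                # average probability over all draws
y_hat <- ifelse(p_hat >= 0.5, 1, 0)           # classify at the 0.5 threshold
```

Averaging over all draws uses the full posterior rather than a single sample, so the resulting probabilities are less noisy.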

6. Comparison with Frequentist Method

The posterior means from the Bayesian fit show no meaningful difference from the maximum likelihood coefficient estimates of the frequentist logistic regression. Comparing the churn prediction accuracy of each model on the test set, both models achieve almost the same accuracy of about 80.9%. Therefore, the Bayesian method produces results equivalent to the frequentist one while retaining the advantages of the Bayesian approach.

freq.logistic <- glm(y_train ~ ., data = X_train, family = binomial)
summary(freq.logistic)
## 
## Call:
## glm(formula = y_train ~ ., family = binomial, data = X_train)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -2.03303    0.13175 -15.430  < 2e-16 ***
## Dependents          -1.56869    0.13489 -11.630  < 2e-16 ***
## Contract             0.91006    0.11744   7.749 9.24e-15 ***
## `Tenure Months`     -1.14428    0.15857  -7.216 5.35e-13 ***
## `Paperless Billing`  0.48643    0.08765   5.550 2.86e-08 ***
## `Internet Service`   0.29750    0.15170   1.961   0.0499 *  
## `Tech Support`      -0.55338    0.10727  -5.159 2.49e-07 ***
## `Monthly Charges`    0.51314    0.10035   5.113 3.17e-07 ***
## `Total Charges`      0.33899    0.16618   2.040   0.0414 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5736.8  on 4929  degrees of freedom
## Residual deviance: 4153.2  on 4921  degrees of freedom
## AIC: 4171.2
## 
## Number of Fisher Scoring iterations: 6
freq.logistic$coefficients
##         (Intercept)          Dependents            Contract     `Tenure Months` 
##          -2.0330256          -1.5686940           0.9100605          -1.1442809 
## `Paperless Billing`  `Internet Service`      `Tech Support`   `Monthly Charges` 
##           0.4864317           0.2975031          -0.5533824           0.5131411 
##     `Total Charges` 
##           0.3389851
pred_prob_f <- predict(freq.logistic, newdata = X_test, type = "response")
y_pred_f <- ifelse(pred_prob_f >= 0.5, 1, 0)
acc_f <- mean(y_pred_f == y_test)

cat("Accuracy of Frequentist Logistic Regression :", acc_f, "\n")
## Accuracy of Frequentist Logistic Regression : 0.8088027
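The closeness of the two sets of estimates can also be checked numerically from the numbers reported above:

```r
# Posterior means (Bayesian fit) vs. glm() MLEs, copied from the outputs above
bayes_est <- c(-2.04, -1.58, 0.91, -1.15, 0.49, 0.30, -0.56, 0.51, 0.34)
freq_est  <- c(-2.0330, -1.5687, 0.9101, -1.1443, 0.4864,
               0.2975, -0.5534, 0.5131, 0.3390)
max(abs(bayes_est - freq_est))  # largest gap is about 0.011
```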