4. Analysis Method

Since the variable of interest (whether a customer churns) is binary, logistic regression was used for the analysis. The dependent variable, customer churn status, was assumed to follow independent Bernoulli distributions with churn probability \(μ_i\) as the parameter. Using the logit link function, each customer's churn probability was modeled as a function of the explanatory variables.

\[\begin{aligned} y_i & \overset{ind}{\sim} \text{Bernoulli}(\mu_i), \\ \mu_i&= \frac{\exp\{x_i^{\top}\beta\}}{1 + \exp\{x_i^{\top}\beta\}}. \end{aligned}\]

For parameter estimation, the Bayesian approach was employed: each parameter is treated as a random variable and its posterior distribution is estimated. A non-informative prior distribution was chosen because there is no definite prior information, such as the parameters following a normal distribution or being strictly positive. It therefore seemed natural to let the data alone determine the parameter values (let the data speak for themselves).

\[p(\beta) \propto 1.\]

The original data was split into a training set and a test set. The training set was used for model fitting, while the model’s performance was evaluated using the test set. For estimating the parameters to infer the churn probability in the Bayesian model, the posterior mean was used. A customer was predicted to have churned if the estimated churn probability from each model exceeded 0.5.

Additionally, to examine the extent of any differences from the results of the frequentist logistic regression model, the predicted customer churn statuses from both the Bayesian model and the frequentist model were compared.

5. Bayesian Data Analysis

From this section onward, we carry out the Bayesian data analysis using RStan.

(1) Data Preprocessing in R

Since RStan is used for the analysis, we first load the data in R.

rm(list = ls())

library(readxl)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(rstan)
## Loading required package: StanHeaders
## 
## rstan version 2.32.3 (Stan version 2.26.1)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
## change `threads_per_chain` option:
## rstan_options(threads_per_chain = 1)
## Do not specify '-march=native' in 'LOCAL_CPPFLAGS' or a Makevars file
telco <- read_excel("Telco_customer_churn_data.xlsx")

The ‘Total Charges’ column has some missing values, as noted in the EDA section. We replace them with the corresponding ‘Monthly Charges’ values.

#Find missing values and replace them
colSums(is.na(telco))
##        CustomerID             Count           Country             State 
##                 0                 0                 0                 0 
##              City          Zip Code          Lat Long          Latitude 
##                 0                 0                 0                 0 
##         Longitude            Gender    Senior Citizen           Partner 
##                 0                 0                 0                 0 
##        Dependents     Tenure Months     Phone Service    Multiple Lines 
##                 0                 0                 0                 0 
##  Internet Service   Online Security     Online Backup Device Protection 
##                 0                 0                 0                 0 
##      Tech Support      Streaming TV  Streaming Movies          Contract 
##                 0                 0                 0                 0 
## Paperless Billing    Payment Method   Monthly Charges     Total Charges 
##                 0                 0                 0                11 
##       Churn Label       Churn Value       Churn Score              CLTV 
##                 0                 0                 0                 0 
##      Churn Reason 
##              5174
telco %>% 
  mutate(`Total Charges` = ifelse(is.na(`Total Charges`), `Monthly Charges`,
                                   `Total Charges`)) -> telco
colSums(is.na(telco))
##        CustomerID             Count           Country             State 
##                 0                 0                 0                 0 
##              City          Zip Code          Lat Long          Latitude 
##                 0                 0                 0                 0 
##         Longitude            Gender    Senior Citizen           Partner 
##                 0                 0                 0                 0 
##        Dependents     Tenure Months     Phone Service    Multiple Lines 
##                 0                 0                 0                 0 
##  Internet Service   Online Security     Online Backup Device Protection 
##                 0                 0                 0                 0 
##      Tech Support      Streaming TV  Streaming Movies          Contract 
##                 0                 0                 0                 0 
## Paperless Billing    Payment Method   Monthly Charges     Total Charges 
##                 0                 0                 0                 0 
##       Churn Label       Churn Value       Churn Score              CLTV 
##                 0                 0                 0                 0 
##      Churn Reason 
##              5174

Based on the EDA as well as LASSO and ridge regression, we selected nine columns: the response Churn Value and the eight predictors Dependents, Contract, Tenure Months, Paperless Billing, Internet Service, Tech Support, Monthly Charges, and Total Charges. We also convert the categorical variables into numeric indicators.

# Select variable to use

telco %>%
  select(`Churn Value`, Dependents, Contract, `Tenure Months`,
         `Paperless Billing`, `Internet Service`, `Tech Support`,
         `Monthly Charges`, `Total Charges`) %>%
  mutate(Dependents = ifelse(Dependents == "Yes", 1, 0),
         Contract = ifelse(Contract == "Month-to-month", 1, 0),
         `Paperless Billing` = ifelse(`Paperless Billing` == "Yes", 1, 0),
         `Internet Service` = ifelse(`Internet Service` == "Fiber optic", 1, 0),
         `Tech Support` = ifelse(`Tech Support` == "Yes", 1,0)) -> telco

head(telco)
## # A tibble: 6 × 9
##   `Churn Value` Dependents Contract `Tenure Months` `Paperless Billing`
##           <dbl>      <dbl>    <dbl>           <dbl>               <dbl>
## 1             1          0        1               2                   1
## 2             1          1        1               2                   1
## 3             1          1        1               8                   1
## 4             1          1        1              28                   1
## 5             1          1        1              49                   1
## 6             1          0        1              10                   0
## # ℹ 4 more variables: `Internet Service` <dbl>, `Tech Support` <dbl>,
## #   `Monthly Charges` <dbl>, `Total Charges` <dbl>

The columns “Tenure Months”, “Monthly Charges”, and “Total Charges” are on much larger scales than the indicator variables, so we standardize them using the scale function.

to_scale <- c( "Tenure Months", "Monthly Charges", "Total Charges")
telco[,to_scale] <- scale(telco[,to_scale])
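As a quick sanity check (on toy data, not the telco columns), scale subtracts each column's mean and divides by its standard deviation:

```r
# scale() standardizes a column: (x - mean(x)) / sd(x)
x <- c(2, 4, 6, 8)
s <- as.numeric(scale(x))         # drop the matrix attributes
all.equal(s, (x - mean(x)) / sd(x))  # TRUE

# The standardized values have mean 0 and standard deviation 1
c(mean(s), sd(s))
```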

Now we split the data into a training set and a test set. The training set is used to fit the model in RStan, and the test set is used to evaluate the performance of the Bayesian logistic regression.

y <- telco$`Churn Value`
X <- telco[,!(colnames(telco) %in% c('Churn Value', 'group'))]

set.seed(7043)
row_indices <- sample(nrow(X))
train_size <- round(0.7 * nrow(X))

X_train <- X[row_indices[1:train_size], ]
y_train <- y[row_indices[1:train_size]]

X_test <- X[row_indices[(train_size + 1):nrow(X)], ]
y_test <- y[row_indices[(train_size + 1):nrow(X)]]

(2) Model Fitting

The following Stan program, telco.glm, specifies the Bayesian logistic regression model. Since we chose a non-informative prior, the model block contains no prior statements for alpha and beta; Stan then implicitly assigns them flat (improper uniform) priors.

telco.glm = "
data {
  int<lower=0> N;             // number of subjects
  int<lower=0> K;             // number of predictors
  matrix[N, K] X;             // predictor matrix
  int<lower=0,upper=1> y[N];  // binary outcome
}
parameters {
  real alpha;           // intercept
  vector[K] beta;       // coefficients for predictors
}
model {
  y ~ bernoulli_logit(alpha + X * beta);  
}
"
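For reference, if one preferred weakly informative priors to the flat default, only the model block would change. A sketch (the prior scales here are illustrative choices, not taken from this analysis):

```
model {
  alpha ~ normal(0, 5);    // weakly informative prior on the intercept
  beta ~ normal(0, 2.5);   // weakly informative priors on the coefficients
  y ~ bernoulli_logit(alpha + X * beta);
}
```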

We fit the model as follows.

sm = stan_model(model_code=telco.glm)
fit = sampling(
  object = sm,
  data = list(y=y_train, N=nrow(X_train), K=ncol(X_train), X=X_train), # observed data
  chains = 4,    # number of multiple chains
  warmup = 1000, # number of burn-in iterations
  iter   = 4000, # number of iterations per chain
  cores = 4      # number of cores to use (e.g., 1 per chain)
)

The fitting result is as follows. The posterior mean of each coefficient is shown in the mean column. The rows beta[1] through beta[8] correspond to Dependents, Contract, Tenure Months, Paperless Billing, Internet Service, Tech Support, Monthly Charges, and Total Charges, respectively.

print(fit)
## Inference for Stan model: anon_model.
## 4 chains, each with iter=4000; warmup=1000; thin=1; 
## post-warmup draws per chain=3000, total post-warmup draws=12000.
## 
##             mean se_mean   sd     2.5%      25%      50%      75%    97.5%
## alpha      -2.04    0.00 0.13    -2.30    -2.13    -2.04    -1.95    -1.78
## beta[1]    -1.58    0.00 0.14    -1.85    -1.67    -1.58    -1.48    -1.31
## beta[2]     0.91    0.00 0.12     0.69     0.83     0.91     0.99     1.14
## beta[3]    -1.15    0.00 0.16    -1.46    -1.26    -1.15    -1.04    -0.85
## beta[4]     0.49    0.00 0.09     0.32     0.43     0.49     0.55     0.66
## beta[5]     0.30    0.00 0.15     0.00     0.20     0.30     0.40     0.59
## beta[6]    -0.56    0.00 0.11    -0.77    -0.63    -0.56    -0.48    -0.35
## beta[7]     0.51    0.00 0.10     0.32     0.45     0.51     0.58     0.71
## beta[8]     0.34    0.00 0.17     0.02     0.23     0.34     0.46     0.67
## lp__    -2081.10    0.03 2.13 -2086.09 -2082.29 -2080.79 -2079.56 -2077.94
##         n_eff Rhat
## alpha    7287    1
## beta[1] 12090    1
## beta[2] 10456    1
## beta[3]  8451    1
## beta[4] 11853    1
## beta[5]  8413    1
## beta[6] 10161    1
## beta[7]  7502    1
## beta[8]  8135    1
## lp__     5326    1
## 
## Samples were drawn using NUTS(diag_e) at Sat Mar 22 14:15:59 2025.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

The effect of each variable on the churn probability depends on the sign of its posterior mean. Since the posterior means of Dependents, Tenure Months, and Tech Support have negative values, it can be interpreted that customers with dependents, longer subscription periods, and those subscribed to the technical support service are less likely to churn. In particular, having dependents significantly lowers the churn probability.

On the other hand, the variables Contract, Paperless Billing, Internet Service, Monthly Charges, and Total Charges have positive posterior mean values. This indicates that customers who have month-to-month contracts, receive paperless billing, use fiber optic internet service, and have higher monthly and total charges are more likely to churn. Especially, customers on month-to-month contracts show a notably higher probability of churn.

The effect size of each variable can be assessed by taking the exponential function of its posterior mean. For example, in the case of the Contract variable, the odds ratio of churn between customers with month-to-month contracts and those without is exp(0.91) = 2.484. For continuous variables, interpretation should consider standardization. For instance, regarding Tenure Months, since the standard deviation in the training set is 24.559, the odds ratio of churn for an increase of approximately 24.5 months in tenure is exp(-1.15) = 0.3166.
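The odds-ratio arithmetic above can be reproduced directly in R, plugging in the posterior means reported in the fit summary:

```r
# Odds ratios implied by the posterior means above
exp(0.91)    # Contract (month-to-month vs. other): about 2.484
exp(-1.15)   # Tenure Months (+1 SD, about 24.5 months): about 0.317
```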

plot(fit,pars=c("beta"))
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)

The plot above shows the posterior mean of each coefficient (black dot) and its credible intervals (red lines). Each black dot is the estimated effect of the corresponding predictor on the log-odds scale. The red lines show the uncertainty around each estimate: the thick segment is the 80% credible interval and the thin one the 95% interval, as indicated by ci_level and outer_level in the output. If an interval does not cross zero, the model strongly supports a positive or negative effect, depending on which side of zero it lies.

Overall, these results tell us which predictors have a credible influence on churn and in which direction. None of the 95% credible intervals contains zero (those for Internet Service and Total Charges only barely exclude it), suggesting that all eight predictors are associated with customer churn.

The following code produces a trace plot for each coefficient. The plots show that all four chains mix well and the coefficients have converged.

traceplot(fit,pars=c("beta"))

(3) Test & Result

Now we check the performance of the model on the test dataset. The code randomly extracts one posterior draw per test observation. It then computes the linear predictor by multiplying that parameter draw with the corresponding test row (after adding an intercept column), transforms it into a probability using the logistic function, and classifies based on a 0.5 threshold.

set.seed(2113) 

param = extract(fit) # output

n_test <- nrow(X_test)
params_mt <- cbind(param$alpha, param$beta)
idx <- sample(1:nrow(params_mt), n_test)
params_test <- params_mt[idx,]
Xbeta_test <- params_test*cbind(1,X_test) 

The resulting accuracy is about 80.8%.

pred_prob_ba  <- plogis(rowSums(Xbeta_test))
y_pred_ba <- ifelse(pred_prob_ba >= 0.5, 1, 0)
acc_ba <- mean(y_pred_ba == y_test)
cat("Accuracy of Bayesian Logistic Regression :", acc_ba, "\n")
## Accuracy of Bayesian Logistic Regression : 0.8083294
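A common alternative to using one random parameter draw per observation is to average the churn probability over all posterior draws (the posterior predictive mean). A minimal self-contained sketch, with simulated draws standing in for param$alpha and param$beta; with the real fit one would substitute cbind(param$alpha, param$beta) and the intercept-augmented X_test:

```r
# Posterior-predictive averaging, sketched with simulated draws (K = 2 predictors)
set.seed(1)
draws <- cbind(rnorm(1000, -2, 0.1),          # stand-in for param$alpha
               matrix(rnorm(1000 * 2), 1000)) # stand-ins for param$beta
X_new <- rbind(c(0, 1), c(1, 0))              # two hypothetical customers
eta   <- cbind(1, X_new) %*% t(draws)         # N x draws matrix of linear predictors
p_hat <- rowMeans(plogis(eta))                # average probability over all draws
y_hat <- ifelse(p_hat >= 0.5, 1, 0)           # classify at the 0.5 threshold
```

Averaging over all draws uses the full posterior rather than a single sample, so the resulting probabilities are less noisy.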

6. Comparison with Frequentist Method

The posterior means from the Bayesian fit show no meaningful difference from the maximum likelihood coefficient estimates of the frequentist logistic regression. Comparing the churn prediction accuracy of each model on the test set, both models achieve almost the same accuracy of about 80.9%. Therefore, the Bayesian method produces results equivalent to the frequentist one while retaining the advantages of the Bayesian approach.

freq.logistic <- glm(y_train ~ ., data = X_train, family = binomial)
summary(freq.logistic)
## 
## Call:
## glm(formula = y_train ~ ., family = binomial, data = X_train)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -2.03303    0.13175 -15.430  < 2e-16 ***
## Dependents          -1.56869    0.13489 -11.630  < 2e-16 ***
## Contract             0.91006    0.11744   7.749 9.24e-15 ***
## `Tenure Months`     -1.14428    0.15857  -7.216 5.35e-13 ***
## `Paperless Billing`  0.48643    0.08765   5.550 2.86e-08 ***
## `Internet Service`   0.29750    0.15170   1.961   0.0499 *  
## `Tech Support`      -0.55338    0.10727  -5.159 2.49e-07 ***
## `Monthly Charges`    0.51314    0.10035   5.113 3.17e-07 ***
## `Total Charges`      0.33899    0.16618   2.040   0.0414 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5736.8  on 4929  degrees of freedom
## Residual deviance: 4153.2  on 4921  degrees of freedom
## AIC: 4171.2
## 
## Number of Fisher Scoring iterations: 6
freq.logistic$coefficients
##         (Intercept)          Dependents            Contract     `Tenure Months` 
##          -2.0330256          -1.5686940           0.9100605          -1.1442809 
## `Paperless Billing`  `Internet Service`      `Tech Support`   `Monthly Charges` 
##           0.4864317           0.2975031          -0.5533824           0.5131411 
##     `Total Charges` 
##           0.3389851
pred_prob_f <- predict(freq.logistic, newdata = X_test, type = "response")
y_pred_f <- ifelse(pred_prob_f >= 0.5, 1, 0)
acc_f <- mean(y_pred_f == y_test)

cat("Accuracy of Frequentist Logistic Regression :", acc_f, "\n")
## Accuracy of Frequentist Logistic Regression : 0.8088027
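The closeness of the two sets of estimates can also be checked numerically from the numbers reported above:

```r
# Posterior means (Bayesian fit) vs. glm() MLEs, copied from the outputs above
bayes_est <- c(-2.04, -1.58, 0.91, -1.15, 0.49, 0.30, -0.56, 0.51, 0.34)
freq_est  <- c(-2.0330, -1.5687, 0.9101, -1.1443, 0.4864,
               0.2975, -0.5534, 0.5131, 0.3390)
max(abs(bayes_est - freq_est))  # largest gap is about 0.011
```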