0. Purpose of the Project¶

This analysis uses real-world data provided by a digital advertising agency, who shared an anonymized version of the dataset they are actively analyzing, with the goal of proposing more effective marketing strategies based on actual customer behavior. The dataset (available at this link) contains detailed event logs from an online shopping mall, including visit timestamps, advertising media, device type, and purchase-related actions. Using this data, we construct user-level features from raw logs, compare conversion performance across media and devices, segment customers via K-means clustering, and model cart-stage conversion with logistic regression, ultimately deriving actionable insights for retargeting, onboarding, and loyalty-focused campaigns.

1. Intro¶

In this project, we work with an online shopping mall visit dataset that records each customer’s interactions with the site. The data eMFORCE_GA_yonsei.csv include information such as purchase counts, purchase amounts, visit timestamps, advertising media, and behavioral event types (event_name). Using these variables, we aim to analyze which factors influence customers’ purchasing activities and to classify customers into groups based on their purchase patterns.

In [ ]:

import pandas as pd
df = pd.read_csv("eMFORCE_GA_yonsei.csv", encoding='euc-kr')
df.head(10)

Out[ ]:

	event_date	user_pseudo_id	event_timestamp	event_name	purchase_value	device	media	type
0	20241206	1.000002e+09	2024-12-06 12:27:01.140054+00:00	first_visit	NaN	desktop	criteo	광고
1	20241216	1.000004e+09	2024-12-15 15:48:17.412486+00:00	first_visit	NaN	desktop	referral	비광고
2	20241012	1.000006e+09	2024-10-12 08:12:44.957241+00:00	first_visit	NaN	desktop	organic	비광고
3	20241221	1.000009e+09	2024-12-20 22:00:06.601354+00:00	first_visit	NaN	desktop	naver_bs	광고
4	20240710	1.000012e+09	2024-07-10 13:59:21.352197+00:00	first_visit	NaN	desktop	not set	비광고
5	20240817	1.000024e+09	2024-08-16 17:24:35.989460+00:00	first_visit	NaN	desktop	RTB	광고
6	20241115	1.000028e+09	2024-11-15 04:54:52.075057+00:00	first_visit	NaN	desktop	direct	비광고
7	20241124	1.000028e+09	2024-11-24 06:23:23.606165+00:00	session_start	NaN	desktop	referral	비광고
8	20241125	1.000028e+09	2024-11-25 06:13:43.886572+00:00	session_start	NaN	desktop	referral	비광고
9	20240723	1.000034e+09	2024-07-23 08:31:43.145137+00:00	first_visit	NaN	desktop	naver_bs	광고

In [ ]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25597319 entries, 0 to 25597318
Data columns (total 8 columns):
 #   Column           Dtype  
---  ------           -----  
 0   event_date       int64  
 1   user_pseudo_id   float64
 2   event_timestamp  object 
 3   event_name       object 
 4   purchase_value   float64
 5   device           object 
 6   media            object 
 7   type             object 
dtypes: float64(2), int64(1), object(5)
memory usage: 1.5+ GB

In [ ]:

df['type'].value_counts()

Out[ ]:

type
광고     18782137
비광고     6815182
Name: count, dtype: int64

In [ ]:

df['media'].value_counts()

Out[ ]:

media
criteo                       5391896
advertiser_apppush           4980346
direct                       3655698
kakao_alimtok                2443547
not set                      2048257
RTB                          1734349
advertiser_sms                932248
organic                       636282
kakao_dm                      594582
TOSS                          523974
referral                      474945
ADN                           468558
naver_bs                      365534
advertiser_affiliate          352999
advertiser_homeshopping       290720
naver_search                  268378
advertiser_ars                113569
instagram                      89736
facebook                       79828
google_DA                      45418
kakao_share                    41636
payco                          34540
google_SA                       9827
advertiser_tv                   7388
advertiser_braze                4328
appier                          3801
kakao_SA                        3634
advertiser_naver                 728
advertiser_wondershopping        573
Name: count, dtype: int64

In [ ]:

df['device'].value_counts()

Out[ ]:

device
mobile     24807072
desktop      790247
Name: count, dtype: int64

In [ ]:

df['event_name'].value_counts()

Out[ ]:

event_name
session_start     15710066
first_visit        3803531
begin_checkout     3720879
purchase            987648
add_to_cart         846926
login               491079
회원가입완료               37190
Name: count, dtype: int64

In [ ]:

df.describe()

Out[ ]:

	event_date	user_pseudo_id	purchase_value
count	2.559732e+07	2.559732e+07	9.876480e+05
mean	2.024098e+07	1.070312e+09	5.317989e+04
std	1.719961e+02	6.212570e+08	5.670244e+04
min	2.024070e+07	1.817237e+01	0.000000e+00
25%	2.024082e+07	5.302507e+08	2.990000e+04
50%	2.024100e+07	1.072703e+09	4.111000e+04
75%	2.024112e+07	1.609010e+09	5.997000e+04
max	2.024123e+07	2.147484e+09	6.278400e+06

2. Data Preprocessing¶

(1) row : id & col : event names, media, device¶

We transforms the raw log-level access records into a user-level dataset. First, we create a crosstab so that each row corresponds to a user and each column represents an event_name, summarizing how often each event occurs for that user. Then, for each user, we extract their most frequently used media channel and device. These variables, together with a later-defined purchase conversion indicator (converted), will form a structured table where each user is described by their main media, device, and event-based behavior.

In [ ]:

# One-hot encoding
pivot = pd.crosstab(df['user_pseudo_id'], df['event_name'])
# each users most frequently used media
media = df.groupby('user_pseudo_id')['media'].agg(lambda x: x.value_counts().idxmax())
# each users most frequently used device
device = df.groupby('user_pseudo_id')['device'].agg(lambda x: x.value_counts().idxmax())

This code merges the event-based features with each user’s main media and device to create a single user-level table. It then defines a binary conversion variable, converted, which is set to 1 if the user has at least one purchase event and 0 otherwise, and displays the first five rows of the resulting dataset.

In [ ]:

user_df = pivot.merge(media, left_index=True, right_index=True)
user_df = user_df.merge(device, left_index=True, right_index=True)
user_df['converted'] = (user_df['purchase'] > 0).astype(int)
user_df.head(5)

Out[ ]:

	add_to_cart	begin_checkout	first_visit	login	purchase	session_start	회원가입완료	media	device	converted
user_pseudo_id
18.172369	0	21	1	1	17	93	0	advertiser_apppush	mobile	1
23.172043	0	0	1	0	0	0	0	not set	mobile	0
1298.162548	0	0	0	0	0	2	0	direct	mobile	0
1678.173362	0	0	1	0	0	0	0	kakao_dm	mobile	0
2574.172079	0	0	1	0	0	3	0	criteo	mobile	0

(2) time¶

Now, we calculates how long each user stayed on the site in total by defining and aggregating sessions. First, it converts event_timestamp to a proper datetime format and sorts all records by user and time. Then, for each user, it computes the time difference between consecutive events and starts a new session whenever the gap exceeds 24 hours (or when there is no previous event). Using these session IDs, it measures each session’s duration as the difference between the first and last timestamp. Finally, it sums the session durations for each user to obtain total_stay_time_sec, and displays the first 10 users’ results.

In [ ]:

# convert event_timestamp to datetime format
df['event_timestamp'] = pd.to_datetime(df['event_timestamp'], format='mixed')

# sort data by user
df_sample_sorted = df.sort_values(by=['user_pseudo_id', 'event_timestamp'])

# define sessions: if the gap is more than 24 hours, start a new session
df_sample_sorted['prev_time'] = df_sample_sorted.groupby('user_pseudo_id')['event_timestamp'].shift()
df_sample_sorted['time_diff'] = (df_sample_sorted['event_timestamp'] - df_sample_sorted['prev_time']).dt.total_seconds()

# session ID: mark a new session when the gap is at least 1 day (86,400 seconds)
df_sample_sorted['new_session'] = (df_sample_sorted['time_diff'] > 86400) | (df_sample_sorted['time_diff'].isna())
df_sample_sorted['session_id'] = df_sample_sorted.groupby('user_pseudo_id')['new_session'].cumsum()

# session duration: last timestamp - first timestamp
session_time = df_sample_sorted.groupby(['user_pseudo_id', 'session_id'])['event_timestamp'].agg(['min', 'max'])
session_time['session_duration'] = (session_time['max'] - session_time['min']).dt.total_seconds()

# compute total stay time per user
total_session_time = session_time.groupby('user_pseudo_id')['session_duration'].sum().reset_index()
total_session_time.columns = ['user_pseudo_id', 'total_stay_time_sec']

total_session_time.head(10)

Out[ ]:

	user_pseudo_id	total_stay_time_sec
0	18.172369	2.066662e+06
1	23.172043	0.000000e+00
2	1298.162548	0.000000e+00
3	1678.173362	0.000000e+00
4	2574.172079	1.302182e+04
5	2784.172733	0.000000e+00
6	3482.172334	0.000000e+00
7	3614.173000	0.000000e+00
8	3877.172713	0.000000e+00
9	4281.172137	3.981211e+03

This code sets user_pseudo_id as the index of the total session time table and then left-joins it to user_df so that each user’s total stay time (total_stay_time_sec) is added as a new column to the user-level dataset.

In [ ]:

total_session_time.set_index('user_pseudo_id', inplace=True)
user_df = user_df.merge(total_session_time, left_index=True, right_index=True, how='left')

user_df.head()

Out[ ]:

	add_to_cart	begin_checkout	first_visit	login	purchase	session_start	회원가입완료	media	device	converted	total_stay_time_sec
user_pseudo_id
18.172369	0	21	1	1	17	93	0	advertiser_apppush	mobile	1	2.066662e+06
23.172043	0	0	1	0	0	0	0	not set	mobile	0	0.000000e+00
1298.162548	0	0	0	0	0	2	0	direct	mobile	0	0.000000e+00
1678.173362	0	0	1	0	0	0	0	kakao_dm	mobile	0	0.000000e+00
2574.172079	0	0	1	0	0	3	0	criteo	mobile	0	1.302182e+04

3. Conversion rate per media¶

In this section, we examine how conversion performance differs across advertising media. By grouping users by media and calculating the mean of the binary converted variable, we obtain the conversion rate for each channel and sort them in descending order to see which media sources are most effective at driving purchases. We broadly divided them into four groups and analyzed the characteristics of each group.

In [ ]:

conv_rate_media = user_df.groupby('media')['converted'].mean().sort_values(ascending=False)
conv_rate_media

Out[ ]:

media
advertiser_apppush           0.577218
advertiser_braze             0.359855
direct                       0.231584
advertiser_naver             0.201439
advertiser_homeshopping      0.198375
RTB                          0.175623
advertiser_sms               0.149249
advertiser_tv                0.145849
advertiser_affiliate         0.140076
advertiser_ars               0.117240
naver_bs                     0.116568
kakao_alimtok                0.116206
payco                        0.086591
not set                      0.084513
naver_search                 0.080671
advertiser_wondershopping    0.069106
kakao_dm                     0.061338
referral                     0.053047
organic                      0.047328
TOSS                         0.044170
kakao_share                  0.040191
google_SA                    0.039424
kakao_SA                     0.033316
criteo                       0.014931
facebook                     0.001398
instagram                    0.001129
google_DA                    0.000149
ADN                          0.000016
appier                       0.000000
Name: converted, dtype: float64

(1) High conversion rate (> 30%)¶

advertiser_apppush: 57.7%, advertiser_braze: 35.9%
App Push: notification messages sent to app users
Braze: a CRM marketing platform that automates and personalizes app push, email, and SMS campaigns
CRM (Customer Relationship Management)–based retargeting channels show outstanding conversion performance.

(2) Moderate conversion rate¶

direct: 23.1%, advertiser_naver: 20.0%, advertiser_homeshopping: 19.8%
Because these users already know the brand and visit directly, they have a higher likelihood of purchasing.
Other advertisers (e.g., advertiser_sms, advertiser_tv, etc.) also show conversion rates of around 14%.

(3) Low conversion rate¶

Finance/payment-related channels (payco: 8.6%, TOSS: 4.4%)
Search-related channels (naver_search: 8.0%, google_SA: 3.9%)
Kakao channels (kakao_dm: 6.3%, kakao_share: 4.0%, kakao_SA: 3.3%)
Advertising strategies in these areas need to be re-evaluated.

(4) Extremely low conversion rate (≪ 0.1%)¶

google_DA (Google Display Ads), ADN, appier (AI-based targeting ads), Facebook, Instagram
These channels are mainly visual and click-oriented, so users tend to browse, show interest, or click lightly for fun rather than proceed to purchase.

4. Device¶

In this section, we investigate whether conversion behavior differs by device type. We first construct a contingency table that cross-tabulates device and the binary converted variable, which summarizes how many users on each device did or did not convert. The conversion rates are approximately 11.28% for mobile users (460,497 / (3,621,810 + 460,497)) and 2.28% for desktop users (9,091 / (390,444 + 9,091)). We next examine whether this substantial difference in conversion rates between mobile and desktop is statistically significant. This table will then be used to statistically test the association between device and conversion.

In [ ]:

from scipy.stats import chi2_contingency

contingency_table = pd.crosstab(user_df['device'], user_df['converted'])

contingency_table

Out[ ]:

converted	0	1
device
desktop	390444	9091
mobile	3621810	460497

(1) chi squared test¶

To formally test whether conversion is independent of device type, we perform a chi-square test of independence with the null hypothesis that conversion rate does not differ by device (H₀: device and conversion are independent) and the alternative that it does (H₁: not independent). The resulting chi-square statistic (≈ 31,459.94) and p-value (≈ 0.0) lead us to reject the null hypothesis, indicating that conversion behavior differs significantly between mobile and desktop users.

In [ ]:

chi2, p, dof, expected = chi2_contingency(contingency_table)

print(f"Chi-square statistic: {chi2}")
print(f"p-value: {p}")

Chi-square statistic: 31459.938266787663
p-value: 0.0

(2) t - test¶

To further compare conversion performance by device, we conduct Welch’s t-test on the binary converted variable for mobile and desktop users. Under the null hypothesis that the two groups have the same mean conversion rate and the alternative that they differ, the test yields a very large t-statistic (≈ 318.03) and a p-value close to 0, so we reject the null and conclude that the average conversion rate for mobile users is significantly higher than that for desktop users.

In [ ]:

from scipy.stats import ttest_ind

mobile = user_df[user_df['device'] == 'mobile']['converted']
desktop = user_df[user_df['device'] == 'desktop']['converted']

t_stat, p_val = ttest_ind(mobile, desktop, equal_var=False)  # Welch's t-test

print(f"t-statistic: {t_stat}")
print(f"p-value: {p_val}")

t-statistic: 318.0330553859352
p-value: 0.0

(3) conclusion¶

Both the chi-square test and Welch’s t-test consistently show that conversion behavior differs significantly by device, with mobile users exhibiting a much higher conversion rate than desktop users. This gap is likely related to the strong performance of mobile-focused channels such as App Push, which drive especially high conversion among app users.

5. K-means¶

In this section, we analyze how customers move through the shopping journey—from visiting and browsing the site to making a purchase—by applying K-means clustering to group users with similar behavioral patterns. To focus the clustering purely on event-based activity features, we remove media and device (categorical attributes used later for interpretation) as well as converted, which is an outcome variable rather than an input feature for unsupervised clustering.

In [ ]:

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

In [ ]:

X = user_df.drop(columns=['media', 'device','converted'])
X.head()

Out[ ]:

	add_to_cart	begin_checkout	first_visit	login	purchase	session_start	회원가입완료	total_stay_time_sec
user_pseudo_id
18.172369	0	21	1	1	17	93	0	2.066662e+06
23.172043	0	0	1	0	0	0	0	0.000000e+00
1298.162548	0	0	0	0	0	2	0	0.000000e+00
1678.173362	0	0	1	0	0	0	0	0.000000e+00
2574.172079	0	0	1	0	0	3	0	1.302182e+04

We standardize all features using StandardScaler so that each variable has mean 0 and unit variance, ensuring that no single feature dominates the K-means clustering due to its scale.

In [ ]:

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

We determine the number of clusters k using the elbow method:

We examine how inertia (within-cluster sum of squares) decreases as k increases.
The optimal k is where the rate of decrease starts to level off.
After k = 6, the reduction in inertia becomes relatively small, suggesting diminishing returns.
With k = 3, clusters mainly separate users into high-conversion, low-conversion, and outlier groups, which provides limited behavioral insight.

In [ ]:

# Elbow Method
inertia = []
K_range = range(1, 11)
for k in K_range:
    model = KMeans(n_clusters=k, random_state=42)
    model.fit(X_scaled)
    inertia.append(model.inertia_)


plt.plot(K_range, inertia, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.title('Elbow Method For Optimal k')
plt.grid(True)
plt.show()

We fit a K-means model with k = 6 clusters on the standardized features and assign each user to one of the six clusters. Then, we compute the mean of all numeric variables by cluster to summarize and compare the characteristic behavior of each customer group.

In [ ]:

k = 6

kmeans = KMeans(n_clusters=k, random_state=42)
clusters = kmeans.fit_predict(X_scaled)


user_df['cluster'] = clusters
cluster_summary = user_df.groupby('cluster').mean(numeric_only=True)
cluster_summary

Out[ ]:

	add_to_cart	begin_checkout	first_visit	login	purchase	session_start	회원가입완료	converted	total_stay_time_sec
cluster
0	0.042468	0.253800	1.004776	0.080826	0.092383	1.452441	0.000000	0.069311	2.496098e+04
1	49.525810	156.056903	0.217767	8.898679	22.225450	406.663625	0.007203	0.990876	1.048004e+07
2	0.000000	15612.666667	6.333333	16072.666667	0.000000	20.000000	0.000000	0.000000	1.530436e+07
3	8.969625	40.070478	0.353236	0.875223	7.354378	191.998610	0.007792	0.843955	5.090688e+06
4	0.429219	1.776358	0.000345	0.116009	0.558161	6.752564	0.000000	0.254752	7.507292e+04
5	0.308470	1.930986	0.928781	0.207768	0.701767	3.990078	1.165559	0.503670	6.956890e+04

Cluster 0 : long-term dormant users¶

Behavioral indicators (excluding outliers) are generally the lowest among all clusters.
Both login rate (0.08) and conversion rate (0.069) are low, and total stay time is also short.
Interpretation: customers who showed only minimal interest without purchasing → candidates for retargeting campaigns.

Cluster 1 : VIP customers (big spenders)¶

Very high activity levels: add_to_cart ≈ 49.5, purchase ≈ 22.2, and extremely long stay_time (over 10 million seconds).
Number of sessions (≈ 406) and logins (≈ 8.9) are very high, indicating highly engaged users.
Conversion rate converted ≈ 0.99, meaning almost all users in this cluster make a purchase.
Interpretation: core revenue-driving customers → strong candidates for loyalty programs and premium benefits.

Cluster 2 : Outliers or abnormal behavior¶

Extremely high activity counts, e.g., login ≈ 16,072 and begin_checkout ≈ 15,612.
Despite the large number of actions, converted = 0 and purchase = 0, showing a highly imbalanced pattern.
Interpretation: likely bot traffic or logging errors → should be considered as outliers and potentially removed from analysis.

Cluster 3 : Loyal regular buyers¶

Solid activity levels: add_to_cart ≈ 8.96, purchase ≈ 7.35, with around 191 sessions and relatively long total stay_time.
High conversion rate converted ≈ 0.84, meaning most users in this cluster end up purchasing.
Interpretation: valuable customers who visit frequently and make repeat purchases → key targets for retention strategies.

Cluster 4 : Low-loyalty general customers¶

Relatively low activity levels: add_to_cart ≈ 0.42, purchase ≈ 1.77, about 6.75 sessions, and short total stay_time.
Conversion rate converted ≈ 0.25, indicating that purchases are relatively infrequent.
Interpretation: at-risk or churning users → candidates for remarketing campaigns and discount coupons.

Cluster 5 : First-time visitors¶

Highest values for “회원가입완료” (signup completed) and “first_visit”, with a moderate conversion rate (converted ≈ 0.5).
Short total stay_time → users who are likely visiting the site for the first time.
Interpretation: customers who need marketing tailored to first-time visitors (onboarding guides, welcome offers, etc.).

6. Cart¶

In this section, we focus on users who have reached the cart stage and examine which factors drive them to complete a purchase. We restrict the data to customers with at least one add_to_cart event and build a logistic regression model with converted as the target and various behavioral features (and device information) as predictors. This allows us to quantify how different actions in the cart stage affect the probability of final conversion and to identify leverage points for improving checkout completion.

In [ ]:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

In [ ]:

user_df_cart = user_df[user_df['add_to_cart'] > 0]
#user_df_cart.loc[:, 'converted'] = (user_df_cart['purchase'] > 0).astype(int)
user_df_cart.head()

Out[ ]:

	add_to_cart	begin_checkout	first_visit	login	purchase	session_start	회원가입완료	media	device	converted	total_stay_time_sec	cluster
user_pseudo_id
25456.172429	1	0	1	1	0	0	0	organic	desktop	0	1.723212e+02	0
36909.170936	1	0	0	0	0	1	0	advertiser_tv	mobile	0	1.542716e+02	4
55536.169964	4	12	1	0	3	124	0	RTB	mobile	1	3.675918e+06	3
59429.168086	5	8	0	0	0	20	0	not set	mobile	0	1.031516e+05	4
78975.162420	1	9	0	1	0	21	0	direct	mobile	0	2.080192e+05	4

In [ ]:

user_df_cart['device'] = user_df_cart['device'].map({'mobile' : 0, 'desktop' : 1})
user_df_cart.head()

C:\Users\user\AppData\Local\Temp\ipykernel_5984\3132898683.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  user_df_cart['device'] = user_df_cart['device'].map({'mobile' : 0, 'desktop' : 1})

Out[ ]:

	add_to_cart	begin_checkout	first_visit	login	purchase	session_start	회원가입완료	media	device	converted	total_stay_time_sec	cluster
user_pseudo_id
25456.172429	1	0	1	1	0	0	0	organic	1	0	1.723212e+02	0
36909.170936	1	0	0	0	0	1	0	advertiser_tv	0	0	1.542716e+02	4
55536.169964	4	12	1	0	3	124	0	RTB	0	1	3.675918e+06	3
59429.168086	5	8	0	0	0	20	0	not set	0	0	1.031516e+05	4
78975.162420	1	9	0	1	0	21	0	direct	0	0	2.080192e+05	4

For the cart-stage analysis, we construct the feature matrix X by removing purchase, media, converted, and cluster. The variable converted is our target and must not be included as an input, while purchase is effectively an outcome-like variable that would leak information about conversion. We also exclude media and the derived cluster label so that the logistic regression focuses on the direct effects of users’ behavioral features in the cart stage, without confounding from channel attributes or previously computed segments.

In [ ]:

X = user_df_cart.drop(columns=['purchase','media', 'converted', 'cluster'])
y = user_df_cart['converted']  # target

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

Out[ ]:

LogisticRegression(max_iter=1000)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

In [ ]:

coeff_df = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_[0]
})
print(coeff_df.sort_values(by='Coefficient', ascending=False))

               Feature   Coefficient
1       begin_checkout  4.603798e-01
5               회원가입완료  1.976794e-01
3                login  3.332584e-02
4        session_start  8.550497e-03
7  total_stay_time_sec -3.571540e-07
0          add_to_cart -7.881150e-03
2          first_visit -2.358669e-02
6               device -2.115448e-01

From the logistic regression coefficients, we can draw the following conclusions about which factors most strongly influence conversion at the cart stage:

begin_checkout (β ≈ 0.46)
- This is the strongest positive predictor: users who proceed to the checkout step are much more likely to complete a purchase.
회원가입완료 – signup completed (β ≈ 0.20)
- Completing membership registration is also strongly associated with higher conversion, suggesting that registered members are more likely to finish the purchase than non-members.
login and session_start (small positive coefficients)
- More logins and session starts slightly increase the probability of conversion, indicating that repeated engagement with the site helps, but their effect is weaker than checkout or signup.
first_visit (β < 0)
- First-time visitors are less likely to convert than returning users, which aligns with the idea that users often need multiple visits before purchasing.
device (β ≈ -0.21, mobile = 0, desktop = 1)
- The negative coefficient means that desktop users have a lower conversion probability than mobile users, even among those who added items to the cart, consistent with the earlier device-level analysis.
total_stay_time_sec and add_to_cart (coefficients close to 0)
- Their effects are very small in magnitude in this model, suggesting that spending more time or adding slightly more items to the cart does not, by itself, strongly change the conversion probability, once other behaviors (like checkout and signup) are controlled for.

Overall, the model highlights that moving forward in the funnel (begin_checkout), completing signup, and using mobile are the key drivers of conversion at the cart stage, whereas first-time visits and desktop usage are associated with lower purchase completion.

7. Conclusion¶

In this analysis, we used detailed event-level logs from an online shopping mall to understand how customers move from visit to purchase and how different channels and behaviors affect conversion.

Overall, the analysis shows that personalized CRM channels, mobile usage, and deeper funnel engagement (checkout and signup) are critical drivers of conversion, while user behavior patterns allow us to identify VIPs, loyal buyers, at-risk customers, and newcomers. These insights can be used to design targeted marketing and retention strategies that allocate resources more efficiently and improve overall sales performance.