0. Purpose of the Project¶
This analysis uses real-world data provided by a digital advertising agency, who shared an anonymized version of the dataset they are actively analyzing, with the goal of proposing more effective marketing strategies based on actual customer behavior. The dataset (available at this link) contains detailed event logs from an online shopping mall, including visit timestamps, advertising media, device type, and purchase-related actions. Using this data, we construct user-level features from raw logs, compare conversion performance across media and devices, segment customers via K-means clustering, and model cart-stage conversion with logistic regression, ultimately deriving actionable insights for retargeting, onboarding, and loyalty-focused campaigns.
1. Intro¶
In this project, we work with an online shopping mall visit dataset that records each customer’s interactions with the site. The data eMFORCE_GA_yonsei.csv include information such as purchase counts, purchase amounts, visit timestamps, advertising media, and behavioral event types (event_name). Using these variables, we aim to analyze which factors influence customers’ purchasing activities and to classify customers into groups based on their purchase patterns.
import pandas as pd
df = pd.read_csv("eMFORCE_GA_yonsei.csv", encoding='euc-kr')
df.head(10)
| event_date | user_pseudo_id | event_timestamp | event_name | purchase_value | device | media | type | |
|---|---|---|---|---|---|---|---|---|
| 0 | 20241206 | 1.000002e+09 | 2024-12-06 12:27:01.140054+00:00 | first_visit | NaN | desktop | criteo | 광고 |
| 1 | 20241216 | 1.000004e+09 | 2024-12-15 15:48:17.412486+00:00 | first_visit | NaN | desktop | referral | 비광고 |
| 2 | 20241012 | 1.000006e+09 | 2024-10-12 08:12:44.957241+00:00 | first_visit | NaN | desktop | organic | 비광고 |
| 3 | 20241221 | 1.000009e+09 | 2024-12-20 22:00:06.601354+00:00 | first_visit | NaN | desktop | naver_bs | 광고 |
| 4 | 20240710 | 1.000012e+09 | 2024-07-10 13:59:21.352197+00:00 | first_visit | NaN | desktop | not set | 비광고 |
| 5 | 20240817 | 1.000024e+09 | 2024-08-16 17:24:35.989460+00:00 | first_visit | NaN | desktop | RTB | 광고 |
| 6 | 20241115 | 1.000028e+09 | 2024-11-15 04:54:52.075057+00:00 | first_visit | NaN | desktop | direct | 비광고 |
| 7 | 20241124 | 1.000028e+09 | 2024-11-24 06:23:23.606165+00:00 | session_start | NaN | desktop | referral | 비광고 |
| 8 | 20241125 | 1.000028e+09 | 2024-11-25 06:13:43.886572+00:00 | session_start | NaN | desktop | referral | 비광고 |
| 9 | 20240723 | 1.000034e+09 | 2024-07-23 08:31:43.145137+00:00 | first_visit | NaN | desktop | naver_bs | 광고 |
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 25597319 entries, 0 to 25597318 Data columns (total 8 columns): # Column Dtype --- ------ ----- 0 event_date int64 1 user_pseudo_id float64 2 event_timestamp object 3 event_name object 4 purchase_value float64 5 device object 6 media object 7 type object dtypes: float64(2), int64(1), object(5) memory usage: 1.5+ GB
df['type'].value_counts()
type 광고 18782137 비광고 6815182 Name: count, dtype: int64
df['media'].value_counts()
media criteo 5391896 advertiser_apppush 4980346 direct 3655698 kakao_alimtok 2443547 not set 2048257 RTB 1734349 advertiser_sms 932248 organic 636282 kakao_dm 594582 TOSS 523974 referral 474945 ADN 468558 naver_bs 365534 advertiser_affiliate 352999 advertiser_homeshopping 290720 naver_search 268378 advertiser_ars 113569 instagram 89736 facebook 79828 google_DA 45418 kakao_share 41636 payco 34540 google_SA 9827 advertiser_tv 7388 advertiser_braze 4328 appier 3801 kakao_SA 3634 advertiser_naver 728 advertiser_wondershopping 573 Name: count, dtype: int64
df['device'].value_counts()
device mobile 24807072 desktop 790247 Name: count, dtype: int64
df['event_name'].value_counts()
event_name session_start 15710066 first_visit 3803531 begin_checkout 3720879 purchase 987648 add_to_cart 846926 login 491079 회원가입완료 37190 Name: count, dtype: int64
df.describe()
| event_date | user_pseudo_id | purchase_value | |
|---|---|---|---|
| count | 2.559732e+07 | 2.559732e+07 | 9.876480e+05 |
| mean | 2.024098e+07 | 1.070312e+09 | 5.317989e+04 |
| std | 1.719961e+02 | 6.212570e+08 | 5.670244e+04 |
| min | 2.024070e+07 | 1.817237e+01 | 0.000000e+00 |
| 25% | 2.024082e+07 | 5.302507e+08 | 2.990000e+04 |
| 50% | 2.024100e+07 | 1.072703e+09 | 4.111000e+04 |
| 75% | 2.024112e+07 | 1.609010e+09 | 5.997000e+04 |
| max | 2.024123e+07 | 2.147484e+09 | 6.278400e+06 |
2. Data Preprocessing¶
(1) row : id & col : event names, media, device¶
We transforms the raw log-level access records into a user-level dataset. First, we create a crosstab so that each row corresponds to a user and each column represents an event_name, summarizing how often each event occurs for that user. Then, for each user, we extract their most frequently used media channel and device. These variables, together with a later-defined purchase conversion indicator (converted), will form a structured table where each user is described by their main media, device, and event-based behavior.
# One-hot encoding
pivot = pd.crosstab(df['user_pseudo_id'], df['event_name'])
# each users most frequently used media
media = df.groupby('user_pseudo_id')['media'].agg(lambda x: x.value_counts().idxmax())
# each users most frequently used device
device = df.groupby('user_pseudo_id')['device'].agg(lambda x: x.value_counts().idxmax())
This code merges the event-based features with each user’s main media and device to create a single user-level table. It then defines a binary conversion variable, converted, which is set to 1 if the user has at least one purchase event and 0 otherwise, and displays the first five rows of the resulting dataset.
user_df = pivot.merge(media, left_index=True, right_index=True)
user_df = user_df.merge(device, left_index=True, right_index=True)
user_df['converted'] = (user_df['purchase'] > 0).astype(int)
user_df.head(5)
| add_to_cart | begin_checkout | first_visit | login | purchase | session_start | 회원가입완료 | media | device | converted | |
|---|---|---|---|---|---|---|---|---|---|---|
| user_pseudo_id | ||||||||||
| 18.172369 | 0 | 21 | 1 | 1 | 17 | 93 | 0 | advertiser_apppush | mobile | 1 |
| 23.172043 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | not set | mobile | 0 |
| 1298.162548 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | direct | mobile | 0 |
| 1678.173362 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | kakao_dm | mobile | 0 |
| 2574.172079 | 0 | 0 | 1 | 0 | 0 | 3 | 0 | criteo | mobile | 0 |
(2) time¶
Now, we calculates how long each user stayed on the site in total by defining and aggregating sessions. First, it converts event_timestamp to a proper datetime format and sorts all records by user and time. Then, for each user, it computes the time difference between consecutive events and starts a new session whenever the gap exceeds 24 hours (or when there is no previous event). Using these session IDs, it measures each session’s duration as the difference between the first and last timestamp. Finally, it sums the session durations for each user to obtain total_stay_time_sec, and displays the first 10 users’ results.
# convert event_timestamp to datetime format
df['event_timestamp'] = pd.to_datetime(df['event_timestamp'], format='mixed')
# sort data by user
df_sample_sorted = df.sort_values(by=['user_pseudo_id', 'event_timestamp'])
# define sessions: if the gap is more than 24 hours, start a new session
df_sample_sorted['prev_time'] = df_sample_sorted.groupby('user_pseudo_id')['event_timestamp'].shift()
df_sample_sorted['time_diff'] = (df_sample_sorted['event_timestamp'] - df_sample_sorted['prev_time']).dt.total_seconds()
# session ID: mark a new session when the gap is at least 1 day (86,400 seconds)
df_sample_sorted['new_session'] = (df_sample_sorted['time_diff'] > 86400) | (df_sample_sorted['time_diff'].isna())
df_sample_sorted['session_id'] = df_sample_sorted.groupby('user_pseudo_id')['new_session'].cumsum()
# session duration: last timestamp - first timestamp
session_time = df_sample_sorted.groupby(['user_pseudo_id', 'session_id'])['event_timestamp'].agg(['min', 'max'])
session_time['session_duration'] = (session_time['max'] - session_time['min']).dt.total_seconds()
# compute total stay time per user
total_session_time = session_time.groupby('user_pseudo_id')['session_duration'].sum().reset_index()
total_session_time.columns = ['user_pseudo_id', 'total_stay_time_sec']
total_session_time.head(10)
| user_pseudo_id | total_stay_time_sec | |
|---|---|---|
| 0 | 18.172369 | 2.066662e+06 |
| 1 | 23.172043 | 0.000000e+00 |
| 2 | 1298.162548 | 0.000000e+00 |
| 3 | 1678.173362 | 0.000000e+00 |
| 4 | 2574.172079 | 1.302182e+04 |
| 5 | 2784.172733 | 0.000000e+00 |
| 6 | 3482.172334 | 0.000000e+00 |
| 7 | 3614.173000 | 0.000000e+00 |
| 8 | 3877.172713 | 0.000000e+00 |
| 9 | 4281.172137 | 3.981211e+03 |
This code sets user_pseudo_id as the index of the total session time table and then left-joins it to user_df so that each user’s total stay time (total_stay_time_sec) is added as a new column to the user-level dataset.
total_session_time.set_index('user_pseudo_id', inplace=True)
user_df = user_df.merge(total_session_time, left_index=True, right_index=True, how='left')
user_df.head()
| add_to_cart | begin_checkout | first_visit | login | purchase | session_start | 회원가입완료 | media | device | converted | total_stay_time_sec | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| user_pseudo_id | |||||||||||
| 18.172369 | 0 | 21 | 1 | 1 | 17 | 93 | 0 | advertiser_apppush | mobile | 1 | 2.066662e+06 |
| 23.172043 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | not set | mobile | 0 | 0.000000e+00 |
| 1298.162548 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | direct | mobile | 0 | 0.000000e+00 |
| 1678.173362 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | kakao_dm | mobile | 0 | 0.000000e+00 |
| 2574.172079 | 0 | 0 | 1 | 0 | 0 | 3 | 0 | criteo | mobile | 0 | 1.302182e+04 |
3. Conversion rate per media¶
In this section, we examine how conversion performance differs across advertising media. By grouping users by media and calculating the mean of the binary converted variable, we obtain the conversion rate for each channel and sort them in descending order to see which media sources are most effective at driving purchases. We broadly divided them into four groups and analyzed the characteristics of each group.
conv_rate_media = user_df.groupby('media')['converted'].mean().sort_values(ascending=False)
conv_rate_media
media advertiser_apppush 0.577218 advertiser_braze 0.359855 direct 0.231584 advertiser_naver 0.201439 advertiser_homeshopping 0.198375 RTB 0.175623 advertiser_sms 0.149249 advertiser_tv 0.145849 advertiser_affiliate 0.140076 advertiser_ars 0.117240 naver_bs 0.116568 kakao_alimtok 0.116206 payco 0.086591 not set 0.084513 naver_search 0.080671 advertiser_wondershopping 0.069106 kakao_dm 0.061338 referral 0.053047 organic 0.047328 TOSS 0.044170 kakao_share 0.040191 google_SA 0.039424 kakao_SA 0.033316 criteo 0.014931 facebook 0.001398 instagram 0.001129 google_DA 0.000149 ADN 0.000016 appier 0.000000 Name: converted, dtype: float64
(1) High conversion rate (> 30%)¶
- advertiser_apppush: 57.7%, advertiser_braze: 35.9%
- App Push: notification messages sent to app users
- Braze: a CRM marketing platform that automates and personalizes app push, email, and SMS campaigns
- CRM (Customer Relationship Management)–based retargeting channels show outstanding conversion performance.
(2) Moderate conversion rate¶
- direct: 23.1%, advertiser_naver: 20.0%, advertiser_homeshopping: 19.8%
- Because these users already know the brand and visit directly, they have a higher likelihood of purchasing.
- Other advertisers (e.g., advertiser_sms, advertiser_tv, etc.) also show conversion rates of around 14%.
(3) Low conversion rate¶
- Finance/payment-related channels (payco: 8.6%, TOSS: 4.4%)
- Search-related channels (naver_search: 8.0%, google_SA: 3.9%)
- Kakao channels (kakao_dm: 6.3%, kakao_share: 4.0%, kakao_SA: 3.3%)
- Advertising strategies in these areas need to be re-evaluated.
(4) Extremely low conversion rate (≪ 0.1%)¶
- google_DA (Google Display Ads), ADN, appier (AI-based targeting ads), Facebook, Instagram
- These channels are mainly visual and click-oriented, so users tend to browse, show interest, or click lightly for fun rather than proceed to purchase.
4. Device¶
In this section, we investigate whether conversion behavior differs by device type. We first construct a contingency table that cross-tabulates device and the binary converted variable, which summarizes how many users on each device did or did not convert. The conversion rates are approximately 11.28% for mobile users (460,497 / (3,621,810 + 460,497)) and 2.28% for desktop users (9,091 / (390,444 + 9,091)). We next examine whether this substantial difference in conversion rates between mobile and desktop is statistically significant. This table will then be used to statistically test the association between device and conversion.
from scipy.stats import chi2_contingency
contingency_table = pd.crosstab(user_df['device'], user_df['converted'])
contingency_table
| converted | 0 | 1 |
|---|---|---|
| device | ||
| desktop | 390444 | 9091 |
| mobile | 3621810 | 460497 |
(1) chi squared test¶
To formally test whether conversion is independent of device type, we perform a chi-square test of independence with the null hypothesis that conversion rate does not differ by device (H₀: device and conversion are independent) and the alternative that it does (H₁: not independent). The resulting chi-square statistic (≈ 31,459.94) and p-value (≈ 0.0) lead us to reject the null hypothesis, indicating that conversion behavior differs significantly between mobile and desktop users.
chi2, p, dof, expected = chi2_contingency(contingency_table)
print(f"Chi-square statistic: {chi2}")
print(f"p-value: {p}")
Chi-square statistic: 31459.938266787663 p-value: 0.0
(2) t - test¶
To further compare conversion performance by device, we conduct Welch’s t-test on the binary converted variable for mobile and desktop users. Under the null hypothesis that the two groups have the same mean conversion rate and the alternative that they differ, the test yields a very large t-statistic (≈ 318.03) and a p-value close to 0, so we reject the null and conclude that the average conversion rate for mobile users is significantly higher than that for desktop users.
from scipy.stats import ttest_ind
mobile = user_df[user_df['device'] == 'mobile']['converted']
desktop = user_df[user_df['device'] == 'desktop']['converted']
t_stat, p_val = ttest_ind(mobile, desktop, equal_var=False) # Welch's t-test
print(f"t-statistic: {t_stat}")
print(f"p-value: {p_val}")
t-statistic: 318.0330553859352 p-value: 0.0
(3) conclusion¶
Both the chi-square test and Welch’s t-test consistently show that conversion behavior differs significantly by device, with mobile users exhibiting a much higher conversion rate than desktop users. This gap is likely related to the strong performance of mobile-focused channels such as App Push, which drive especially high conversion among app users.
5. K-means¶
In this section, we analyze how customers move through the shopping journey—from visiting and browsing the site to making a purchase—by applying K-means clustering to group users with similar behavioral patterns. To focus the clustering purely on event-based activity features, we remove media and device (categorical attributes used later for interpretation) as well as converted, which is an outcome variable rather than an input feature for unsupervised clustering.
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
X = user_df.drop(columns=['media', 'device','converted'])
X.head()
| add_to_cart | begin_checkout | first_visit | login | purchase | session_start | 회원가입완료 | total_stay_time_sec | |
|---|---|---|---|---|---|---|---|---|
| user_pseudo_id | ||||||||
| 18.172369 | 0 | 21 | 1 | 1 | 17 | 93 | 0 | 2.066662e+06 |
| 23.172043 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0.000000e+00 |
| 1298.162548 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0.000000e+00 |
| 1678.173362 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0.000000e+00 |
| 2574.172079 | 0 | 0 | 1 | 0 | 0 | 3 | 0 | 1.302182e+04 |
We standardize all features using StandardScaler so that each variable has mean 0 and unit variance, ensuring that no single feature dominates the K-means clustering due to its scale.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
We determine the number of clusters k using the elbow method:
- We examine how inertia (within-cluster sum of squares) decreases as k increases.
- The optimal k is where the rate of decrease starts to level off.
- After k = 6, the reduction in inertia becomes relatively small, suggesting diminishing returns.
- With k = 3, clusters mainly separate users into high-conversion, low-conversion, and outlier groups, which provides limited behavioral insight.
# Elbow Method
inertia = []
K_range = range(1, 11)
for k in K_range:
model = KMeans(n_clusters=k, random_state=42)
model.fit(X_scaled)
inertia.append(model.inertia_)
plt.plot(K_range, inertia, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.title('Elbow Method For Optimal k')
plt.grid(True)
plt.show()
We fit a K-means model with k = 6 clusters on the standardized features and assign each user to one of the six clusters. Then, we compute the mean of all numeric variables by cluster to summarize and compare the characteristic behavior of each customer group.
k = 6
kmeans = KMeans(n_clusters=k, random_state=42)
clusters = kmeans.fit_predict(X_scaled)
user_df['cluster'] = clusters
cluster_summary = user_df.groupby('cluster').mean(numeric_only=True)
cluster_summary
| add_to_cart | begin_checkout | first_visit | login | purchase | session_start | 회원가입완료 | converted | total_stay_time_sec | |
|---|---|---|---|---|---|---|---|---|---|
| cluster | |||||||||
| 0 | 0.042468 | 0.253800 | 1.004776 | 0.080826 | 0.092383 | 1.452441 | 0.000000 | 0.069311 | 2.496098e+04 |
| 1 | 49.525810 | 156.056903 | 0.217767 | 8.898679 | 22.225450 | 406.663625 | 0.007203 | 0.990876 | 1.048004e+07 |
| 2 | 0.000000 | 15612.666667 | 6.333333 | 16072.666667 | 0.000000 | 20.000000 | 0.000000 | 0.000000 | 1.530436e+07 |
| 3 | 8.969625 | 40.070478 | 0.353236 | 0.875223 | 7.354378 | 191.998610 | 0.007792 | 0.843955 | 5.090688e+06 |
| 4 | 0.429219 | 1.776358 | 0.000345 | 0.116009 | 0.558161 | 6.752564 | 0.000000 | 0.254752 | 7.507292e+04 |
| 5 | 0.308470 | 1.930986 | 0.928781 | 0.207768 | 0.701767 | 3.990078 | 1.165559 | 0.503670 | 6.956890e+04 |
Cluster 0 : long-term dormant users¶
- Behavioral indicators (excluding outliers) are generally the lowest among all clusters.
- Both login rate (0.08) and conversion rate (0.069) are low, and total stay time is also short.
- Interpretation: customers who showed only minimal interest without purchasing → candidates for retargeting campaigns.
Cluster 1 : VIP customers (big spenders)¶
- Very high activity levels: add_to_cart ≈ 49.5, purchase ≈ 22.2, and extremely long stay_time (over 10 million seconds).
- Number of sessions (≈ 406) and logins (≈ 8.9) are very high, indicating highly engaged users.
- Conversion rate
converted ≈ 0.99, meaning almost all users in this cluster make a purchase. - Interpretation: core revenue-driving customers → strong candidates for loyalty programs and premium benefits.
Cluster 2 : Outliers or abnormal behavior¶
- Extremely high activity counts, e.g., login ≈ 16,072 and begin_checkout ≈ 15,612.
- Despite the large number of actions,
converted = 0andpurchase = 0, showing a highly imbalanced pattern. - Interpretation: likely bot traffic or logging errors → should be considered as outliers and potentially removed from analysis.
Cluster 3 : Loyal regular buyers¶
- Solid activity levels: add_to_cart ≈ 8.96, purchase ≈ 7.35, with around 191 sessions and relatively long total stay_time.
- High conversion rate
converted ≈ 0.84, meaning most users in this cluster end up purchasing. - Interpretation: valuable customers who visit frequently and make repeat purchases → key targets for retention strategies.
Cluster 4 : Low-loyalty general customers¶
- Relatively low activity levels: add_to_cart ≈ 0.42, purchase ≈ 1.77, about 6.75 sessions, and short total stay_time.
- Conversion rate
converted ≈ 0.25, indicating that purchases are relatively infrequent. - Interpretation: at-risk or churning users → candidates for remarketing campaigns and discount coupons.
Cluster 5 : First-time visitors¶
- Highest values for “회원가입완료” (signup completed) and “first_visit”, with a moderate conversion rate (
converted ≈ 0.5). - Short total stay_time → users who are likely visiting the site for the first time.
- Interpretation: customers who need marketing tailored to first-time visitors (onboarding guides, welcome offers, etc.).
6. Cart¶
In this section, we focus on users who have reached the cart stage and examine which factors drive them to complete a purchase. We restrict the data to customers with at least one add_to_cart event and build a logistic regression model with converted as the target and various behavioral features (and device information) as predictors. This allows us to quantify how different actions in the cart stage affect the probability of final conversion and to identify leverage points for improving checkout completion.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
user_df_cart = user_df[user_df['add_to_cart'] > 0]
#user_df_cart.loc[:, 'converted'] = (user_df_cart['purchase'] > 0).astype(int)
user_df_cart.head()
| add_to_cart | begin_checkout | first_visit | login | purchase | session_start | 회원가입완료 | media | device | converted | total_stay_time_sec | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| user_pseudo_id | ||||||||||||
| 25456.172429 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | organic | desktop | 0 | 1.723212e+02 | 0 |
| 36909.170936 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | advertiser_tv | mobile | 0 | 1.542716e+02 | 4 |
| 55536.169964 | 4 | 12 | 1 | 0 | 3 | 124 | 0 | RTB | mobile | 1 | 3.675918e+06 | 3 |
| 59429.168086 | 5 | 8 | 0 | 0 | 0 | 20 | 0 | not set | mobile | 0 | 1.031516e+05 | 4 |
| 78975.162420 | 1 | 9 | 0 | 1 | 0 | 21 | 0 | direct | mobile | 0 | 2.080192e+05 | 4 |
user_df_cart['device'] = user_df_cart['device'].map({'mobile' : 0, 'desktop' : 1})
user_df_cart.head()
C:\Users\user\AppData\Local\Temp\ipykernel_5984\3132898683.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
user_df_cart['device'] = user_df_cart['device'].map({'mobile' : 0, 'desktop' : 1})
| add_to_cart | begin_checkout | first_visit | login | purchase | session_start | 회원가입완료 | media | device | converted | total_stay_time_sec | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| user_pseudo_id | ||||||||||||
| 25456.172429 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | organic | 1 | 0 | 1.723212e+02 | 0 |
| 36909.170936 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | advertiser_tv | 0 | 0 | 1.542716e+02 | 4 |
| 55536.169964 | 4 | 12 | 1 | 0 | 3 | 124 | 0 | RTB | 0 | 1 | 3.675918e+06 | 3 |
| 59429.168086 | 5 | 8 | 0 | 0 | 0 | 20 | 0 | not set | 0 | 0 | 1.031516e+05 | 4 |
| 78975.162420 | 1 | 9 | 0 | 1 | 0 | 21 | 0 | direct | 0 | 0 | 2.080192e+05 | 4 |
For the cart-stage analysis, we construct the feature matrix X by removing purchase, media, converted, and cluster. The variable converted is our target and must not be included as an input, while purchase is effectively an outcome-like variable that would leak information about conversion. We also exclude media and the derived cluster label so that the logistic regression focuses on the direct effects of users’ behavioral features in the cart stage, without confounding from channel attributes or previously computed segments.
X = user_df_cart.drop(columns=['purchase','media', 'converted', 'cluster'])
y = user_df_cart['converted'] # target
model = LogisticRegression(max_iter=1000)
model.fit(X, y)
LogisticRegression(max_iter=1000)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression(max_iter=1000)
coeff_df = pd.DataFrame({
'Feature': X.columns,
'Coefficient': model.coef_[0]
})
print(coeff_df.sort_values(by='Coefficient', ascending=False))
Feature Coefficient 1 begin_checkout 4.603798e-01 5 회원가입완료 1.976794e-01 3 login 3.332584e-02 4 session_start 8.550497e-03 7 total_stay_time_sec -3.571540e-07 0 add_to_cart -7.881150e-03 2 first_visit -2.358669e-02 6 device -2.115448e-01
From the logistic regression coefficients, we can draw the following conclusions about which factors most strongly influence conversion at the cart stage:
begin_checkout(β ≈ 0.46)- This is the strongest positive predictor: users who proceed to the checkout step are much more likely to complete a purchase.
회원가입완료– signup completed (β ≈ 0.20)- Completing membership registration is also strongly associated with higher conversion, suggesting that registered members are more likely to finish the purchase than non-members.
loginandsession_start(small positive coefficients)- More logins and session starts slightly increase the probability of conversion, indicating that repeated engagement with the site helps, but their effect is weaker than checkout or signup.
first_visit(β < 0)- First-time visitors are less likely to convert than returning users, which aligns with the idea that users often need multiple visits before purchasing.
device(β ≈ -0.21, mobile = 0, desktop = 1)- The negative coefficient means that desktop users have a lower conversion probability than mobile users, even among those who added items to the cart, consistent with the earlier device-level analysis.
total_stay_time_secandadd_to_cart(coefficients close to 0)- Their effects are very small in magnitude in this model, suggesting that spending more time or adding slightly more items to the cart does not, by itself, strongly change the conversion probability, once other behaviors (like checkout and signup) are controlled for.
Overall, the model highlights that moving forward in the funnel (begin_checkout), completing signup, and using mobile are the key drivers of conversion at the cart stage, whereas first-time visits and desktop usage are associated with lower purchase completion.
7. Conclusion¶
In this analysis, we used detailed event-level logs from an online shopping mall to understand how customers move from visit to purchase and how different channels and behaviors affect conversion.
Overall, the analysis shows that personalized CRM channels, mobile usage, and deeper funnel engagement (checkout and signup) are critical drivers of conversion, while user behavior patterns allow us to identify VIPs, loyal buyers, at-risk customers, and newcomers. These insights can be used to design targeted marketing and retention strategies that allocate resources more efficiently and improve overall sales performance.