• Skip to primary navigation
  • Skip to content
  • Skip to footer
  • Home
  • Research
  • Transcript
  • CV
    • Multivariate Analysis

      • MVA HW1
      • MVA HW2
      • MVA HW3
      • MVA HW4
      • MVA Final Project
    • Measure Theory

      • Ch1 : Ex.01 ~ Ex.18
      • Ch1 : Ex.19 ~ Ex.38
      • Ch2 : Ex.01 ~ Ex.11
      • Ch2 : Ex.12 ~ Ex.24
      • Ch3 : Ex.01 ~ Ex.11
      • Ch3 : Ex.12 ~ Ex.20
      • Ch3 : Ex.21 ~ Ex.32
      • Ch3 : Extra Questions
      • Ch4 : Ex.01 ~ Ex.09
      • Ch4 : Ex.10 ~ Ex.20
      • Ch4 : Ex.21 ~ Ex.28
      • Ch4 : Ex.29 ~ Ex.35
    • Elements of Statistical Learning

      • ESL CH3
      • ESL CH4
      • ESL CH5
      • ESL CH7
      • ESL CH12
    • Bayesian Statistics

      • Bayes HW1
      • Bayes HW2
      • Bayes HW3
      • Bayes HW4
      • Bayes HW5
      • Bayes HW6
      • Bayes Project 1
      • Bayes Project 2
    • Statistical Computing

      • SC HW1
      • SC HW2
      • SC HW3
      • SC HW4
      • SC HW5
      • SC HW6
      • SC HW7
      • SC HW8
    • Monte Carlo Methods

      • MCMC HW1
      • MCMC HW2
      • MCMC HW3
      • MCMC HW4
      • MCMC HW5
      • MCMC HW6
    • Industrial Academic Cooperation Big Data Analysis

      • Introduction
      • Finance Data Analysis
      • Marketing Data Analysis
    • Data Science Institute

      • Introduction
      • Consultation 1
      • Consultation 2
      • Consultation 3
      • Consultation 4
      • Consultation 5

    0. Purpose of the Project¶

    This analysis uses real-world data provided by a digital advertising agency, who shared an anonymized version of the dataset they are actively analyzing, with the goal of proposing more effective marketing strategies based on actual customer behavior. The dataset (available at this link) contains detailed event logs from an online shopping mall, including visit timestamps, advertising media, device type, and purchase-related actions. Using this data, we construct user-level features from raw logs, compare conversion performance across media and devices, segment customers via K-means clustering, and model cart-stage conversion with logistic regression, ultimately deriving actionable insights for retargeting, onboarding, and loyalty-focused campaigns.

    1. Intro¶

    In this project, we work with an online shopping mall visit dataset that records each customer’s interactions with the site. The data eMFORCE_GA_yonsei.csv include information such as purchase counts, purchase amounts, visit timestamps, advertising media, and behavioral event types (event_name). Using these variables, we aim to analyze which factors influence customers’ purchasing activities and to classify customers into groups based on their purchase patterns.

    In [ ]:
    import pandas as pd
    df = pd.read_csv("eMFORCE_GA_yonsei.csv", encoding='euc-kr')
    df.head(10)
    
    Out[ ]:
    event_date user_pseudo_id event_timestamp event_name purchase_value device media type
    0 20241206 1.000002e+09 2024-12-06 12:27:01.140054+00:00 first_visit NaN desktop criteo 광고
    1 20241216 1.000004e+09 2024-12-15 15:48:17.412486+00:00 first_visit NaN desktop referral 비광고
    2 20241012 1.000006e+09 2024-10-12 08:12:44.957241+00:00 first_visit NaN desktop organic 비광고
    3 20241221 1.000009e+09 2024-12-20 22:00:06.601354+00:00 first_visit NaN desktop naver_bs 광고
    4 20240710 1.000012e+09 2024-07-10 13:59:21.352197+00:00 first_visit NaN desktop not set 비광고
    5 20240817 1.000024e+09 2024-08-16 17:24:35.989460+00:00 first_visit NaN desktop RTB 광고
    6 20241115 1.000028e+09 2024-11-15 04:54:52.075057+00:00 first_visit NaN desktop direct 비광고
    7 20241124 1.000028e+09 2024-11-24 06:23:23.606165+00:00 session_start NaN desktop referral 비광고
    8 20241125 1.000028e+09 2024-11-25 06:13:43.886572+00:00 session_start NaN desktop referral 비광고
    9 20240723 1.000034e+09 2024-07-23 08:31:43.145137+00:00 first_visit NaN desktop naver_bs 광고
    In [ ]:
    df.info()
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 25597319 entries, 0 to 25597318
    Data columns (total 8 columns):
     #   Column           Dtype  
    ---  ------           -----  
     0   event_date       int64  
     1   user_pseudo_id   float64
     2   event_timestamp  object 
     3   event_name       object 
     4   purchase_value   float64
     5   device           object 
     6   media            object 
     7   type             object 
    dtypes: float64(2), int64(1), object(5)
    memory usage: 1.5+ GB
    
    In [ ]:
    df['type'].value_counts()
    
    Out[ ]:
    type
    광고     18782137
    비광고     6815182
    Name: count, dtype: int64
    In [ ]:
    df['media'].value_counts()
    
    Out[ ]:
    media
    criteo                       5391896
    advertiser_apppush           4980346
    direct                       3655698
    kakao_alimtok                2443547
    not set                      2048257
    RTB                          1734349
    advertiser_sms                932248
    organic                       636282
    kakao_dm                      594582
    TOSS                          523974
    referral                      474945
    ADN                           468558
    naver_bs                      365534
    advertiser_affiliate          352999
    advertiser_homeshopping       290720
    naver_search                  268378
    advertiser_ars                113569
    instagram                      89736
    facebook                       79828
    google_DA                      45418
    kakao_share                    41636
    payco                          34540
    google_SA                       9827
    advertiser_tv                   7388
    advertiser_braze                4328
    appier                          3801
    kakao_SA                        3634
    advertiser_naver                 728
    advertiser_wondershopping        573
    Name: count, dtype: int64
    In [ ]:
    df['device'].value_counts()
    
    Out[ ]:
    device
    mobile     24807072
    desktop      790247
    Name: count, dtype: int64
    In [ ]:
    df['event_name'].value_counts()
    
    Out[ ]:
    event_name
    session_start     15710066
    first_visit        3803531
    begin_checkout     3720879
    purchase            987648
    add_to_cart         846926
    login               491079
    회원가입완료               37190
    Name: count, dtype: int64
    In [ ]:
    df.describe()
    
    Out[ ]:
    event_date user_pseudo_id purchase_value
    count 2.559732e+07 2.559732e+07 9.876480e+05
    mean 2.024098e+07 1.070312e+09 5.317989e+04
    std 1.719961e+02 6.212570e+08 5.670244e+04
    min 2.024070e+07 1.817237e+01 0.000000e+00
    25% 2.024082e+07 5.302507e+08 2.990000e+04
    50% 2.024100e+07 1.072703e+09 4.111000e+04
    75% 2.024112e+07 1.609010e+09 5.997000e+04
    max 2.024123e+07 2.147484e+09 6.278400e+06

    2. Data Preprocessing¶

    (1) row : id & col : event names, media, device¶

    We transforms the raw log-level access records into a user-level dataset. First, we create a crosstab so that each row corresponds to a user and each column represents an event_name, summarizing how often each event occurs for that user. Then, for each user, we extract their most frequently used media channel and device. These variables, together with a later-defined purchase conversion indicator (converted), will form a structured table where each user is described by their main media, device, and event-based behavior.

    In [ ]:
    # One-hot encoding
    pivot = pd.crosstab(df['user_pseudo_id'], df['event_name'])
    # each users most frequently used media
    media = df.groupby('user_pseudo_id')['media'].agg(lambda x: x.value_counts().idxmax())
    # each users most frequently used device
    device = df.groupby('user_pseudo_id')['device'].agg(lambda x: x.value_counts().idxmax())
    

    This code merges the event-based features with each user’s main media and device to create a single user-level table. It then defines a binary conversion variable, converted, which is set to 1 if the user has at least one purchase event and 0 otherwise, and displays the first five rows of the resulting dataset.

    In [ ]:
    user_df = pivot.merge(media, left_index=True, right_index=True)
    user_df = user_df.merge(device, left_index=True, right_index=True)
    user_df['converted'] = (user_df['purchase'] > 0).astype(int)
    user_df.head(5)
    
    Out[ ]:
    add_to_cart begin_checkout first_visit login purchase session_start 회원가입완료 media device converted
    user_pseudo_id
    18.172369 0 21 1 1 17 93 0 advertiser_apppush mobile 1
    23.172043 0 0 1 0 0 0 0 not set mobile 0
    1298.162548 0 0 0 0 0 2 0 direct mobile 0
    1678.173362 0 0 1 0 0 0 0 kakao_dm mobile 0
    2574.172079 0 0 1 0 0 3 0 criteo mobile 0

    (2) time¶

    Now, we calculates how long each user stayed on the site in total by defining and aggregating sessions. First, it converts event_timestamp to a proper datetime format and sorts all records by user and time. Then, for each user, it computes the time difference between consecutive events and starts a new session whenever the gap exceeds 24 hours (or when there is no previous event). Using these session IDs, it measures each session’s duration as the difference between the first and last timestamp. Finally, it sums the session durations for each user to obtain total_stay_time_sec, and displays the first 10 users’ results.

    In [ ]:
    # convert event_timestamp to datetime format
    df['event_timestamp'] = pd.to_datetime(df['event_timestamp'], format='mixed')
    
    # sort data by user
    df_sample_sorted = df.sort_values(by=['user_pseudo_id', 'event_timestamp'])
    
    # define sessions: if the gap is more than 24 hours, start a new session
    df_sample_sorted['prev_time'] = df_sample_sorted.groupby('user_pseudo_id')['event_timestamp'].shift()
    df_sample_sorted['time_diff'] = (df_sample_sorted['event_timestamp'] - df_sample_sorted['prev_time']).dt.total_seconds()
    
    # session ID: mark a new session when the gap is at least 1 day (86,400 seconds)
    df_sample_sorted['new_session'] = (df_sample_sorted['time_diff'] > 86400) | (df_sample_sorted['time_diff'].isna())
    df_sample_sorted['session_id'] = df_sample_sorted.groupby('user_pseudo_id')['new_session'].cumsum()
    
    # session duration: last timestamp - first timestamp
    session_time = df_sample_sorted.groupby(['user_pseudo_id', 'session_id'])['event_timestamp'].agg(['min', 'max'])
    session_time['session_duration'] = (session_time['max'] - session_time['min']).dt.total_seconds()
    
    # compute total stay time per user
    total_session_time = session_time.groupby('user_pseudo_id')['session_duration'].sum().reset_index()
    total_session_time.columns = ['user_pseudo_id', 'total_stay_time_sec']
    
    total_session_time.head(10)
    
    Out[ ]:
    user_pseudo_id total_stay_time_sec
    0 18.172369 2.066662e+06
    1 23.172043 0.000000e+00
    2 1298.162548 0.000000e+00
    3 1678.173362 0.000000e+00
    4 2574.172079 1.302182e+04
    5 2784.172733 0.000000e+00
    6 3482.172334 0.000000e+00
    7 3614.173000 0.000000e+00
    8 3877.172713 0.000000e+00
    9 4281.172137 3.981211e+03

    This code sets user_pseudo_id as the index of the total session time table and then left-joins it to user_df so that each user’s total stay time (total_stay_time_sec) is added as a new column to the user-level dataset.

    In [ ]:
    total_session_time.set_index('user_pseudo_id', inplace=True)
    user_df = user_df.merge(total_session_time, left_index=True, right_index=True, how='left')
    
    user_df.head()
    
    Out[ ]:
    add_to_cart begin_checkout first_visit login purchase session_start 회원가입완료 media device converted total_stay_time_sec
    user_pseudo_id
    18.172369 0 21 1 1 17 93 0 advertiser_apppush mobile 1 2.066662e+06
    23.172043 0 0 1 0 0 0 0 not set mobile 0 0.000000e+00
    1298.162548 0 0 0 0 0 2 0 direct mobile 0 0.000000e+00
    1678.173362 0 0 1 0 0 0 0 kakao_dm mobile 0 0.000000e+00
    2574.172079 0 0 1 0 0 3 0 criteo mobile 0 1.302182e+04

    3. Conversion rate per media¶

    In this section, we examine how conversion performance differs across advertising media. By grouping users by media and calculating the mean of the binary converted variable, we obtain the conversion rate for each channel and sort them in descending order to see which media sources are most effective at driving purchases. We broadly divided them into four groups and analyzed the characteristics of each group.

    In [ ]:
    conv_rate_media = user_df.groupby('media')['converted'].mean().sort_values(ascending=False)
    conv_rate_media
    
    Out[ ]:
    media
    advertiser_apppush           0.577218
    advertiser_braze             0.359855
    direct                       0.231584
    advertiser_naver             0.201439
    advertiser_homeshopping      0.198375
    RTB                          0.175623
    advertiser_sms               0.149249
    advertiser_tv                0.145849
    advertiser_affiliate         0.140076
    advertiser_ars               0.117240
    naver_bs                     0.116568
    kakao_alimtok                0.116206
    payco                        0.086591
    not set                      0.084513
    naver_search                 0.080671
    advertiser_wondershopping    0.069106
    kakao_dm                     0.061338
    referral                     0.053047
    organic                      0.047328
    TOSS                         0.044170
    kakao_share                  0.040191
    google_SA                    0.039424
    kakao_SA                     0.033316
    criteo                       0.014931
    facebook                     0.001398
    instagram                    0.001129
    google_DA                    0.000149
    ADN                          0.000016
    appier                       0.000000
    Name: converted, dtype: float64

    (1) High conversion rate (> 30%)¶

    • advertiser_apppush: 57.7%, advertiser_braze: 35.9%
    • App Push: notification messages sent to app users
    • Braze: a CRM marketing platform that automates and personalizes app push, email, and SMS campaigns
    • CRM (Customer Relationship Management)–based retargeting channels show outstanding conversion performance.

    (2) Moderate conversion rate¶

    • direct: 23.1%, advertiser_naver: 20.0%, advertiser_homeshopping: 19.8%
    • Because these users already know the brand and visit directly, they have a higher likelihood of purchasing.
    • Other advertisers (e.g., advertiser_sms, advertiser_tv, etc.) also show conversion rates of around 14%.

    (3) Low conversion rate¶

    • Finance/payment-related channels (payco: 8.6%, TOSS: 4.4%)
    • Search-related channels (naver_search: 8.0%, google_SA: 3.9%)
    • Kakao channels (kakao_dm: 6.3%, kakao_share: 4.0%, kakao_SA: 3.3%)
    • Advertising strategies in these areas need to be re-evaluated.

    (4) Extremely low conversion rate (≪ 0.1%)¶

    • google_DA (Google Display Ads), ADN, appier (AI-based targeting ads), Facebook, Instagram
    • These channels are mainly visual and click-oriented, so users tend to browse, show interest, or click lightly for fun rather than proceed to purchase.

    4. Device¶

    In this section, we investigate whether conversion behavior differs by device type. We first construct a contingency table that cross-tabulates device and the binary converted variable, which summarizes how many users on each device did or did not convert. The conversion rates are approximately 11.28% for mobile users (460,497 / (3,621,810 + 460,497)) and 2.28% for desktop users (9,091 / (390,444 + 9,091)). We next examine whether this substantial difference in conversion rates between mobile and desktop is statistically significant. This table will then be used to statistically test the association between device and conversion.

    In [ ]:
    from scipy.stats import chi2_contingency
    
    contingency_table = pd.crosstab(user_df['device'], user_df['converted'])
    
    contingency_table
    
    Out[ ]:
    converted 0 1
    device
    desktop 390444 9091
    mobile 3621810 460497

    (1) chi squared test¶

    To formally test whether conversion is independent of device type, we perform a chi-square test of independence with the null hypothesis that conversion rate does not differ by device (H₀: device and conversion are independent) and the alternative that it does (H₁: not independent). The resulting chi-square statistic (≈ 31,459.94) and p-value (≈ 0.0) lead us to reject the null hypothesis, indicating that conversion behavior differs significantly between mobile and desktop users.

    In [ ]:
    chi2, p, dof, expected = chi2_contingency(contingency_table)
    
    print(f"Chi-square statistic: {chi2}")
    print(f"p-value: {p}")
    
    Chi-square statistic: 31459.938266787663
    p-value: 0.0
    

    (2) t - test¶

    To further compare conversion performance by device, we conduct Welch’s t-test on the binary converted variable for mobile and desktop users. Under the null hypothesis that the two groups have the same mean conversion rate and the alternative that they differ, the test yields a very large t-statistic (≈ 318.03) and a p-value close to 0, so we reject the null and conclude that the average conversion rate for mobile users is significantly higher than that for desktop users.

    In [ ]:
    from scipy.stats import ttest_ind
    
    mobile = user_df[user_df['device'] == 'mobile']['converted']
    desktop = user_df[user_df['device'] == 'desktop']['converted']
    
    t_stat, p_val = ttest_ind(mobile, desktop, equal_var=False)  # Welch's t-test
    
    print(f"t-statistic: {t_stat}")
    print(f"p-value: {p_val}")
    
    t-statistic: 318.0330553859352
    p-value: 0.0
    

    (3) conclusion¶

    Both the chi-square test and Welch’s t-test consistently show that conversion behavior differs significantly by device, with mobile users exhibiting a much higher conversion rate than desktop users. This gap is likely related to the strong performance of mobile-focused channels such as App Push, which drive especially high conversion among app users.

    5. K-means¶

    In this section, we analyze how customers move through the shopping journey—from visiting and browsing the site to making a purchase—by applying K-means clustering to group users with similar behavioral patterns. To focus the clustering purely on event-based activity features, we remove media and device (categorical attributes used later for interpretation) as well as converted, which is an outcome variable rather than an input feature for unsupervised clustering.

    In [ ]:
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    import matplotlib.pyplot as plt
    
    In [ ]:
    X = user_df.drop(columns=['media', 'device','converted'])
    X.head()
    
    Out[ ]:
    add_to_cart begin_checkout first_visit login purchase session_start 회원가입완료 total_stay_time_sec
    user_pseudo_id
    18.172369 0 21 1 1 17 93 0 2.066662e+06
    23.172043 0 0 1 0 0 0 0 0.000000e+00
    1298.162548 0 0 0 0 0 2 0 0.000000e+00
    1678.173362 0 0 1 0 0 0 0 0.000000e+00
    2574.172079 0 0 1 0 0 3 0 1.302182e+04

    We standardize all features using StandardScaler so that each variable has mean 0 and unit variance, ensuring that no single feature dominates the K-means clustering due to its scale.

    In [ ]:
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    

    We determine the number of clusters k using the elbow method:

    • We examine how inertia (within-cluster sum of squares) decreases as k increases.
    • The optimal k is where the rate of decrease starts to level off.
    • After k = 6, the reduction in inertia becomes relatively small, suggesting diminishing returns.
    • With k = 3, clusters mainly separate users into high-conversion, low-conversion, and outlier groups, which provides limited behavioral insight.
    In [ ]:
    # Elbow Method
    inertia = []
    K_range = range(1, 11)
    for k in K_range:
        model = KMeans(n_clusters=k, random_state=42)
        model.fit(X_scaled)
        inertia.append(model.inertia_)
    
    
    plt.plot(K_range, inertia, marker='o')
    plt.xlabel('Number of clusters (k)')
    plt.ylabel('Inertia (within-cluster sum of squares)')
    plt.title('Elbow Method For Optimal k')
    plt.grid(True)
    plt.show()
    

    We fit a K-means model with k = 6 clusters on the standardized features and assign each user to one of the six clusters. Then, we compute the mean of all numeric variables by cluster to summarize and compare the characteristic behavior of each customer group.

    In [ ]:
    k = 6
    
    kmeans = KMeans(n_clusters=k, random_state=42)
    clusters = kmeans.fit_predict(X_scaled)
    
    
    user_df['cluster'] = clusters
    cluster_summary = user_df.groupby('cluster').mean(numeric_only=True)
    cluster_summary
    
    Out[ ]:
    add_to_cart begin_checkout first_visit login purchase session_start 회원가입완료 converted total_stay_time_sec
    cluster
    0 0.042468 0.253800 1.004776 0.080826 0.092383 1.452441 0.000000 0.069311 2.496098e+04
    1 49.525810 156.056903 0.217767 8.898679 22.225450 406.663625 0.007203 0.990876 1.048004e+07
    2 0.000000 15612.666667 6.333333 16072.666667 0.000000 20.000000 0.000000 0.000000 1.530436e+07
    3 8.969625 40.070478 0.353236 0.875223 7.354378 191.998610 0.007792 0.843955 5.090688e+06
    4 0.429219 1.776358 0.000345 0.116009 0.558161 6.752564 0.000000 0.254752 7.507292e+04
    5 0.308470 1.930986 0.928781 0.207768 0.701767 3.990078 1.165559 0.503670 6.956890e+04

    Cluster 0 : long-term dormant users¶

    • Behavioral indicators (excluding outliers) are generally the lowest among all clusters.
    • Both login rate (0.08) and conversion rate (0.069) are low, and total stay time is also short.
    • Interpretation: customers who showed only minimal interest without purchasing → candidates for retargeting campaigns.

    Cluster 1 : VIP customers (big spenders)¶

    • Very high activity levels: add_to_cart ≈ 49.5, purchase ≈ 22.2, and extremely long stay_time (over 10 million seconds).
    • Number of sessions (≈ 406) and logins (≈ 8.9) are very high, indicating highly engaged users.
    • Conversion rate converted ≈ 0.99, meaning almost all users in this cluster make a purchase.
    • Interpretation: core revenue-driving customers → strong candidates for loyalty programs and premium benefits.

    Cluster 2 : Outliers or abnormal behavior¶

    • Extremely high activity counts, e.g., login ≈ 16,072 and begin_checkout ≈ 15,612.
    • Despite the large number of actions, converted = 0 and purchase = 0, showing a highly imbalanced pattern.
    • Interpretation: likely bot traffic or logging errors → should be considered as outliers and potentially removed from analysis.

    Cluster 3 : Loyal regular buyers¶

    • Solid activity levels: add_to_cart ≈ 8.96, purchase ≈ 7.35, with around 191 sessions and relatively long total stay_time.
    • High conversion rate converted ≈ 0.84, meaning most users in this cluster end up purchasing.
    • Interpretation: valuable customers who visit frequently and make repeat purchases → key targets for retention strategies.

    Cluster 4 : Low-loyalty general customers¶

    • Relatively low activity levels: add_to_cart ≈ 0.42, purchase ≈ 1.77, about 6.75 sessions, and short total stay_time.
    • Conversion rate converted ≈ 0.25, indicating that purchases are relatively infrequent.
    • Interpretation: at-risk or churning users → candidates for remarketing campaigns and discount coupons.

    Cluster 5 : First-time visitors¶

    • Highest values for “회원가입완료” (signup completed) and “first_visit”, with a moderate conversion rate (converted ≈ 0.5).
    • Short total stay_time → users who are likely visiting the site for the first time.
    • Interpretation: customers who need marketing tailored to first-time visitors (onboarding guides, welcome offers, etc.).

    6. Cart¶

    In this section, we focus on users who have reached the cart stage and examine which factors drive them to complete a purchase. We restrict the data to customers with at least one add_to_cart event and build a logistic regression model with converted as the target and various behavioral features (and device information) as predictors. This allows us to quantify how different actions in the cart stage affect the probability of final conversion and to identify leverage points for improving checkout completion.

    In [ ]:
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
    
    In [ ]:
    user_df_cart = user_df[user_df['add_to_cart'] > 0]
    #user_df_cart.loc[:, 'converted'] = (user_df_cart['purchase'] > 0).astype(int)
    user_df_cart.head()
    
    Out[ ]:
    add_to_cart begin_checkout first_visit login purchase session_start 회원가입완료 media device converted total_stay_time_sec cluster
    user_pseudo_id
    25456.172429 1 0 1 1 0 0 0 organic desktop 0 1.723212e+02 0
    36909.170936 1 0 0 0 0 1 0 advertiser_tv mobile 0 1.542716e+02 4
    55536.169964 4 12 1 0 3 124 0 RTB mobile 1 3.675918e+06 3
    59429.168086 5 8 0 0 0 20 0 not set mobile 0 1.031516e+05 4
    78975.162420 1 9 0 1 0 21 0 direct mobile 0 2.080192e+05 4
    In [ ]:
    user_df_cart['device'] = user_df_cart['device'].map({'mobile' : 0, 'desktop' : 1})
    user_df_cart.head()
    
    C:\Users\user\AppData\Local\Temp\ipykernel_5984\3132898683.py:1: SettingWithCopyWarning: 
    A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_indexer,col_indexer] = value instead
    
    See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
      user_df_cart['device'] = user_df_cart['device'].map({'mobile' : 0, 'desktop' : 1})
    
    Out[ ]:
    add_to_cart begin_checkout first_visit login purchase session_start 회원가입완료 media device converted total_stay_time_sec cluster
    user_pseudo_id
    25456.172429 1 0 1 1 0 0 0 organic 1 0 1.723212e+02 0
    36909.170936 1 0 0 0 0 1 0 advertiser_tv 0 0 1.542716e+02 4
    55536.169964 4 12 1 0 3 124 0 RTB 0 1 3.675918e+06 3
    59429.168086 5 8 0 0 0 20 0 not set 0 0 1.031516e+05 4
    78975.162420 1 9 0 1 0 21 0 direct 0 0 2.080192e+05 4

    For the cart-stage analysis, we construct the feature matrix X by removing purchase, media, converted, and cluster. The variable converted is our target and must not be included as an input, while purchase is effectively an outcome-like variable that would leak information about conversion. We also exclude media and the derived cluster label so that the logistic regression focuses on the direct effects of users’ behavioral features in the cart stage, without confounding from channel attributes or previously computed segments.

    In [ ]:
    X = user_df_cart.drop(columns=['purchase','media', 'converted', 'cluster'])
    y = user_df_cart['converted']  # target
    
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)
    
    Out[ ]:
    LogisticRegression(max_iter=1000)
    In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
    On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
    LogisticRegression(max_iter=1000)
    In [ ]:
    coeff_df = pd.DataFrame({
        'Feature': X.columns,
        'Coefficient': model.coef_[0]
    })
    print(coeff_df.sort_values(by='Coefficient', ascending=False))
    
                   Feature   Coefficient
    1       begin_checkout  4.603798e-01
    5               회원가입완료  1.976794e-01
    3                login  3.332584e-02
    4        session_start  8.550497e-03
    7  total_stay_time_sec -3.571540e-07
    0          add_to_cart -7.881150e-03
    2          first_visit -2.358669e-02
    6               device -2.115448e-01
    

    From the logistic regression coefficients, we can draw the following conclusions about which factors most strongly influence conversion at the cart stage:

    • begin_checkout (β ≈ 0.46)

      • This is the strongest positive predictor: users who proceed to the checkout step are much more likely to complete a purchase.
    • 회원가입완료 – signup completed (β ≈ 0.20)

      • Completing membership registration is also strongly associated with higher conversion, suggesting that registered members are more likely to finish the purchase than non-members.
    • login and session_start (small positive coefficients)

      • More logins and session starts slightly increase the probability of conversion, indicating that repeated engagement with the site helps, but their effect is weaker than checkout or signup.
    • first_visit (β < 0)

      • First-time visitors are less likely to convert than returning users, which aligns with the idea that users often need multiple visits before purchasing.
    • device (β ≈ -0.21, mobile = 0, desktop = 1)

      • The negative coefficient means that desktop users have a lower conversion probability than mobile users, even among those who added items to the cart, consistent with the earlier device-level analysis.
    • total_stay_time_sec and add_to_cart (coefficients close to 0)

      • Their effects are very small in magnitude in this model, suggesting that spending more time or adding slightly more items to the cart does not, by itself, strongly change the conversion probability, once other behaviors (like checkout and signup) are controlled for.

    Overall, the model highlights that moving forward in the funnel (begin_checkout), completing signup, and using mobile are the key drivers of conversion at the cart stage, whereas first-time visits and desktop usage are associated with lower purchase completion.

    7. Conclusion¶

    In this analysis, we used detailed event-level logs from an online shopping mall to understand how customers move from visit to purchase and how different channels and behaviors affect conversion.

    Overall, the analysis shows that personalized CRM channels, mobile usage, and deeper funnel engagement (checkout and signup) are critical drivers of conversion, while user behavior patterns allow us to identify VIPs, loyal buyers, at-risk customers, and newcomers. These insights can be used to design targeted marketing and retention strategies that allocate resources more efficiently and improve overall sales performance.