Consultation 2: Repeated-Measures ANOVA and Multilevel Mediation for a Four-Type Persuasion Study

I. Client’s Inquiry

1. Context

The client seeks statistical consulting to (a) choose appropriate analysis methods for their study and (b) potentially outsource analyses if they are too difficult to perform independently.

2. Key Questions

Q1. Appropriateness of H1 & H2 Analyses

  • Design: within-subjects (repeated measures); all participants experienced four persuasion types (A, B, C, D).
  • Sample size: n = 286.
  • Assumptions: normality violated, sphericity violated.
  • Question: Is it acceptable to use One-Way Repeated Measures ANOVA with Greenhouse–Geisser correction under these violations?

Q2. Possibility of Mediation Analysis for H3

  • Independent variable structure: 4 equally weighted types; there is no single “reference” type, so standard dummy-coding with one baseline is not conceptually ideal.
  • Design issue: All participants saw all four types (repeated measures); classic mediation frameworks often assume between-subjects manipulations.
  • Current fallback: Simple regressions testing whether perceived autonomy significantly predicts behavioral intention, interaction satisfaction, and reuse intention.
  • Question: Despite the within-subjects design and four-level factor, is there a valid mediation approach for H3?

Q3. Appropriate Analysis for Manipulation-Check Items

  • Example (Type A): show that emotional appeal / gain framing scores are high, whereas rational appeal / loss framing scores are low.
  • Question: What statistical tests should be used to demonstrate that the manipulations worked as intended?

Q4. Balanced Latin Square & Whether an Independent-Groups Design Is Preferable

3. Proposed/Considered Analysis Techniques

  1. Repeated-Measures ANOVA (primary for H1/H2).
  2. Mediation analysis / Simple regression (for H3, as feasible).

4. Hypotheses

  • H1: Do behavioral intention, interaction satisfaction, and reuse intention differ by persuasion type?
  • H2: Does perceived autonomy differ by persuasion type?
  • H3 (Version 1 – Mediation): Perceived autonomy mediates the relationship between persuasion type and each outcome (behavioral intention, interaction satisfaction, reuse intention).
  • H3 (Version 2 – Simple Regression): Within each persuasion type, perceived autonomy significantly predicts behavioral intention, interaction satisfaction, and reuse intention.

5. Data Description

  • N = 286 survey respondents.
  • Scales: 5-point Likert, multiple-choice questionnaire items.
  • Mediator: Perceived autonomy.
  • Outcomes (DVs): Behavioral intention, interaction satisfaction, reuse intention.
  • Independent variable (IV): Four persuasion types (operationalized via image stimuli).

II. Answer to Q1

1. Model

Conclusion: A one-way repeated-measures (RM) ANOVA is the appropriate analysis.

  • Design: Within-subjects. All participants evaluated four persuasion types (A, B, C, D).
  • Dependent variables:
    • For H2: Perceived autonomy
    • For H1: Behavioral intention, interaction satisfaction, reuse intention
  • The same 286 participants experienced all four conditions and provided repeated responses for the same outcomes under each condition. Because observations are not independent across conditions within participants, a repeated-measures approach is required.

Model: \[ Y_{ij} = \mu + \alpha_j + s_i + \epsilon_{ij} \] where \(Y_{ij}\) is the outcome for participant \(i\) in condition \(j\), \(\mu\) is the grand mean, \(\alpha_j\) is the fixed effect of persuasion type \(j\), \(s_i\) is the (random) subject effect, and \(\epsilon_{ij}\) is the residual.

cf) Note on Two-Way Repeated-Measures ANOVA

  • Technically feasible, but here the client did not separate appeal (emotional vs. rational) and framing (gain vs. loss) as two orthogonal factors.
  • Instead, these were combined into a single factor (“persuasion type”).
  • Since the goal is to compare differences among the four types, one-way RM ANOVA better matches the research objective.
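To make the computation concrete, here is a minimal numpy sketch of the one-way RM ANOVA decomposition and the Greenhouse–Geisser epsilon, on simulated data in the study's shape (286 participants × 4 conditions). The data, effect sizes, and variable names are illustrative, not the client's; in practice a package such as pingouin (`rm_anova(correction=True)`) or R's afex would report the same quantities.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 286, 4                                   # participants, conditions (A-D)
subject = rng.normal(0, 1, size=(n, 1))         # random subject effect s_i
effect = np.array([0.0, 0.2, 0.4, 0.1])         # hypothetical condition effects
Y = subject + effect + rng.normal(0, 1, size=(n, k))   # wide n x k data

grand = Y.mean()
cond_means = Y.mean(axis=0)                     # per-condition means
subj_means = Y.mean(axis=1)                     # per-subject means

# Sum-of-squares decomposition for the one-way RM ANOVA
ss_cond = n * np.sum((cond_means - grand) ** 2)
ss_subj = k * np.sum((subj_means - grand) ** 2)
ss_total = np.sum((Y - grand) ** 2)
ss_error = ss_total - ss_cond - ss_subj

df_cond = k - 1
df_error = (n - 1) * (k - 1)
F = (ss_cond / df_cond) / (ss_error / df_error)

# Greenhouse-Geisser epsilon from the double-centered condition covariance
S = np.cov(Y, rowvar=False)                     # k x k covariance matrix
C = np.eye(k) - np.ones((k, k)) / k
Sc = C @ S @ C
eps = np.trace(Sc) ** 2 / ((k - 1) * np.sum(Sc * Sc))
# GG-corrected test uses df_cond * eps and df_error * eps
```

Epsilon is bounded between 1/(k−1) and 1; values near 1 indicate approximate sphericity, and smaller values shrink the degrees of freedom accordingly.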

2. Normality

(1) Kolmogorov–Smirnov (K–S) Normality Test

Definition. Tests whether the sample empirical CDF (ECDF) differs significantly from the CDF of a normal distribution.

  • For sample \(X_1, X_2, \ldots, X_n\), the ECDF is
    \[ F_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}(X_i \le x). \]
  • Test statistic:
    \[ D = \sup_x \lvert F_n(x) - F(x) \rvert, \] the maximum absolute gap between the ECDF and the target normal CDF.
  • Hypotheses:
    \(H_0\): the sample follows a normal distribution.
    \(H_1\): the sample does not follow a normal distribution.
  • Decision rule: if \(p\)-value \(< 0.05\), reject \(H_0\) (i.e., normality is violated).

Procedure

  1. Compute the ECDF at each observation.
  2. Compute the normal CDF (with appropriate mean/SD).
  3. Compute \(D\).
  4. Compare to the critical value (or obtain a \(p\)-value) at significance level \(\alpha\) and test \(H_0\).
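A minimal sketch with `scipy.stats.kstest`, using simulated scores (the data are illustrative). One caveat worth flagging to the client: plugging the sample's own mean/SD into the reference CDF, as below, makes the standard K–S p-value conservative; the Lilliefors variant (e.g., `lilliefors` in statsmodels) corrects for this.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=0.8, size=286)    # simulated condition scores

# K-S test against a normal CDF with the sample's own mean and SD.
# Estimating mu/sigma from the same data makes this p-value conservative.
D, p = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))
```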

(2) Shapiro–Wilk Normality Test

For sample \(x_1, x_2, \ldots, x_n\) (with order statistics \(x_{(1)} \le \cdots \le x_{(n)}\)), the statistic is \[ W = \frac{\left(\sum_{i=1}^n a_i\, x_{(i)}\right)^2}{\sum_{i=1}^n (x_i - \bar{x})^2}, \] where \(a_i\) are weights derived under normality.

Interpretation

  • Numerator: squared linear combination aligning observed order statistics with those expected under normality.
  • Denominator: sample variance (total variation).
  • \(W \approx 1\) indicates strong concordance with normality; smaller \(W\) indicates departures from normality.

Hypotheses

  • \(H_0\): the sample follows a normal distribution.
  • \(H_1\): the sample does not follow a normal distribution.
    Reject \(H_0\) when \(p\)-value \(< 0.05\).
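In scipy, the test is a single call; a short sketch on simulated data (illustrative, not the client's):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(loc=3.5, scale=0.7, size=286)    # simulated condition scores

W, p = stats.shapiro(x)
# W close to 1 is consistent with normality; reject H0 when p < 0.05
```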

(3) Appropriateness of Normality Testing in This Study

1) On averaging item scores within each condition

Averaging two items per condition can attenuate non-normality. Therefore, it is preferable to test normality on the original item-level scores rather than on the averaged composites.

2) Accounting for the repeated-measures structure

Because the design is within-subjects, measurements are correlated across conditions (A/B/C/D). Testing normality separately on each condition’s raw scores ignores this correlation and can bias interpretation.

A more appropriate approach is to examine the residuals from the repeated-measures model: \[ Y_{ij} = \mu + \alpha_j + s_i + \epsilon_{ij}, \] where \(\alpha_j\) is the fixed effect of condition \(j\), \(s_i\) is the (random) subject effect, and \(\epsilon_{ij} \sim \mathcal{N}(0,\sigma^2)\).

  • Pool all residuals across A, B, C, D and assess their normality (e.g., Shapiro–Wilk or K–S on the residuals).
  • Why not per-condition tests? The subject-specific effect \(s_i\) influences all conditions; evaluating a single condition in isolation incompletely accounts for this shared effect.

Visual diagnostics are recommended alongside tests (e.g., QQ-plots of residuals).
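The pooled-residual check can be sketched directly: for a complete 286 × 4 layout, the residuals of the model above are \(Y_{ij} - \bar{Y}_{i\cdot} - \bar{Y}_{\cdot j} + \bar{Y}_{\cdot\cdot}\) (the two-way additive decomposition). The data below are simulated and the names illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, k = 286, 4
# Simulated within-subjects data: subject effect + condition effect + noise
Y = (rng.normal(0, 1, (n, 1))
     + np.array([0.0, 0.2, 0.4, 0.1])
     + rng.normal(0, 1, (n, k)))

# Residuals of Y_ij = mu + alpha_j + s_i + e_ij (additive decomposition)
resid = Y - Y.mean(axis=1, keepdims=True) - Y.mean(axis=0) + Y.mean()

W, p = stats.shapiro(resid.ravel())             # one test on pooled residuals
# QQ-plot companion: stats.probplot(resid.ravel(), dist="norm")
```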

3. Other Models (Beyond Simple One-Way ANOVA)

(1) One-Way ANOVA (Between-Subjects)

  • Appropriate when different participant groups each experience only one condition.
  • Assumes independent observations across groups.
  • Not applicable here because all participants experienced all four conditions (A/B/C/D).

(2) Repeated-Measures MANOVA

  • Participants respond to multiple dependent variables across multiple within-subject conditions (e.g., A/B/C/D).
  • Analyzes several outcomes jointly (e.g., behavioral intention, interaction satisfaction, reuse intention).
  • Useful when you want to test the common effect of condition across correlated outcomes simultaneously.
    • Example: If those three outcomes are correlated and you want to know how they jointly change across A/B/C/D, RM-MANOVA is appropriate.

(3) Most Suitable Approach for This Study

| Focus | Analyze correlations among DVs? | Recommended analysis |
|---|---|---|
| Compare each DV separately across A/B/C/D | No (no explicit need stated) | Separate one-way RM ANOVA per DV (current approach is appropriate) |

4. Sphericity

  • Greenhouse–Geisser correction is a standard and appropriate remedy when sphericity is violated.
  • No issue using it in this context.

III. Answer to Q2

1. Mediation Analysis

Mediation analysis tests whether the effect of an independent variable X on a dependent variable Y occurs through a mediator M.

Structure:

X (persuasion type) → M (perceived autonomy) → Y (behavioral outcome)
↘────────────────────────────────────────→ Y (direct path)

Key effects:

  • a path: X → M
  • b path: M → Y (controlling for X)
  • c’ path: X → Y (direct effect)
  • Indirect effect: a · b

2. Why Simple Regression Is Insufficient

The client proposes to:

  • Run simple regressions within each persuasion type (A, B, C, D) to test if perceived autonomy → behavioral outcomes (behavioral intention, satisfaction, reuse intention).

But this only examines the “b path” in isolation.
It does not consider:

  • The impact of persuasion type X on perceived autonomy (a path)
  • Nor the full mediation structure
  • Nor the repeated-measures (within-subject) nature of the data

Furthermore, regression assumes independence of observations, which is violated here: the same participants rated all 4 persuasion types.

3. Correct Method: Repeated Measures Mediation (Multilevel Mediation)

Because the data involve:

  • All participants rating all 4 persuasion types (within-subjects factor)
  • Responses nested within participants

→ The correct approach is Multilevel Mediation Analysis (also called Repeated Measures Mediation).

4. Model Structure

Level 1 (within participants):

\[\begin{aligned} M_{ij} &= \alpha_{0i} + a\,X_{ij} + r_{1ij} \\ Y_{ij} &= \beta_{0i} + c' \, X_{ij} + b \, M_{ij} + r_{2ij} \end{aligned}\]

  • \(i\): participant
  • \(j\): condition (A, B, C, D)

Level 2 (between participants):

\[ \alpha_{0i} \sim \mathcal{N}(\gamma_{0}, \tau_{\alpha}^{2}), \quad \beta_{0i} \sim \mathcal{N}(\delta_{0}, \tau_{\beta}^{2}) \]

→ This models both the fixed effects (mediation paths) and random intercepts per participant.

5. Conclusion (Key Message for Client)

  • Yes, H3 mediation analysis is possible despite the repeated-measures design.
  • But it must be done using a Multilevel Mediation model that accounts for the within-subject structure of the data.
  • The simple regression approach may offer partial insight but cannot test for mediation and violates statistical assumptions (independence of residuals).
  • We strongly recommend using tools such as lme4 and mediation (or brms for Bayesian estimation) to perform proper repeated-measures mediation.
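As a rough illustration of why the within-subject structure matters, here is a numpy sketch on simulated data that estimates the a and b paths after person-mean centering (the "within" or fixed-effects estimator, a simplification of the random-intercept model in Section 4: it removes each participant's intercept rather than modeling it). All data, effect values, and names are hypothetical; for the real analysis, the lme4 + mediation (or brms) route above is the recommendation.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 286, 4
# Hypothetical data: condition -> autonomy (a paths), autonomy -> outcome (b path)
a_true = np.array([0.0, 0.5, 0.8, 0.3])
M = rng.normal(0, 1, (n, 1)) + a_true + rng.normal(0, 1, (n, k))
Y = rng.normal(0, 1, (n, 1)) + 0.6 * M + rng.normal(0, 1, (n, k))

# Long format (person-major), with person-mean centering of every variable
Xd = np.tile(np.eye(k)[:, 1:], (n, 1))          # dummies for B, C, D vs. A
Mc = (M - M.mean(axis=1, keepdims=True)).ravel()
Yc = (Y - Y.mean(axis=1, keepdims=True)).ravel()
Xc = Xd - Xd.reshape(n, k, k - 1).mean(axis=1).repeat(k, axis=0)

# a paths: condition -> mediator, within person
a_hat, *_ = np.linalg.lstsq(Xc, Mc, rcond=None)
# b and c' paths: outcome on condition + mediator, within person
bc_hat, *_ = np.linalg.lstsq(np.column_stack([Xc, Mc]), Yc, rcond=None)
b_hat = bc_hat[-1]
indirect = a_hat * b_hat                        # indirect effects (B, C, D vs. A)
```

A confidence interval for each a·b product would come from bootstrapping participants or a Monte Carlo method; the mediation package does this automatically for lme4 fits.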

IV. Answer to Q3

1. What is a Manipulation Check?

A manipulation check verifies that the experimental manipulation (the intended effect of the independent variable) was perceived by participants as designed.
Example: If you showed participants an “emotional appeal” chatbot vs. a “rational appeal” chatbot, you might ask, “How emotional was the chatbot you just saw?” (1 = not at all, 5 = very much).

2. Recommended Tests

  • Within a type (e.g., Type A): a paired t-test comparing the two manipulation-check scores (emotional vs. rational appeal; gain vs. loss framing) from the same participants.
  • Across types: a one-way RM ANOVA testing whether each manipulation-check score differs across A/B/C/D as intended.

3. Non-Normality: Robust Alternatives

  • Paired t-test → Wilcoxon signed-rank test
  • RM ANOVA → Friedman test

However, with \(n=286\), modest deviations from normality are typically not consequential due to large-sample robustness (CLT), especially for paired mean differences and RM ANOVA residuals—provided there are no extreme outliers or severe skewness. Complement tests with QQ-plots and residual diagnostics.
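A sketch of all three tests with scipy, on simulated 5-point scores for a hypothetical Type A check (the numbers and variable names are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 286
# Simulated Type-A manipulation-check items (1-5 Likert)
emotional = np.clip(np.round(rng.normal(4.2, 0.6, n)), 1, 5)
rational = np.clip(np.round(rng.normal(2.1, 0.7, n)), 1, 5)

t, p_t = stats.ttest_rel(emotional, rational)   # paired t-test
w, p_w = stats.wilcoxon(emotional - rational)   # robust paired alternative

# Friedman test: nonparametric RM ANOVA analogue across the four types
scores = rng.normal([4.2, 2.3, 3.0, 2.8], 0.7, size=(n, 4))
chi2, p_f = stats.friedmanchisquare(*scores.T)
```

Significant p-values with means in the intended direction (emotional high, rational low for Type A) are the evidence that the manipulation worked.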


V. Answer to Q4

1. Appropriateness of a Balanced Latin Square

(1) Example

A typical Balanced Latin Square for four conditions:

| Order group | 1st | 2nd | 3rd | 4th |
|---|---|---|---|---|
| Group 1 | A | B | D | C |
| Group 2 | B | C | A | D |
| Group 3 | C | D | B | A |
| Group 4 | D | A | C | B |

Each of A, B, C, D appears exactly once in each serial position.
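The square above follows the standard construction (first row 1, 2, n, 3, n−1, …; each subsequent row shifts every entry up by one, mod n). A short sketch that generates and verifies it; the function name is ours:

```python
import numpy as np

def balanced_latin_square(k):
    # First row: 0, 1, k-1, 2, k-2, ...; later rows add 1 (mod k).
    # For even k, each condition appears once per serial position and
    # immediately precedes every other condition exactly once.
    row = [0]
    asc, desc = 1, k - 1
    for i in range(k - 1):
        if i % 2 == 0:
            row.append(asc)
            asc += 1
        else:
            row.append(desc)
            desc -= 1
    return (np.array(row) + np.arange(k)[:, None]) % k

sq = balanced_latin_square(4)
labels = np.array(list("ABCD"))[sq]   # rows = order groups, cols = positions

# Every condition appears exactly once in every serial position
assert all(len(set(col)) == 4 for col in labels.T)
```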

(2) Is it suitable? (re: “minimizing reviewer concerns about order effects”)

  • In within-subjects designs, a Balanced Latin Square is a gold-standard method for order control—more systematic than simple randomization.
  • In repeated-measures experiments, responses can depend on presentation order (e.g., novelty/primacy for the first stimulus; fatigue for the last). The balanced scheme distributes such effects evenly across conditions.

(3) Must group sizes be identical?

  • Not strictly. Minor imbalances (±1–2 participants) are not problematic.
  • Large imbalances (e.g., 50 vs 15) can distort results; avoid pronounced disparities.

(4) Are there “better” alternatives?

  • For repeated-measures order control, the Balanced Latin Square is already authoritative and appropriate.
  • Reporting that a Latin Square was used typically conveys high design rigor to reviewers.

Sample size sufficiency

  • Minimum needed ≈ (# of orders) × (minimum per order).
  • Here: 4 orders and N = 286, i.e., about 70 per order, giving ample power and validity.

Optional model-based check for order effects

\[ Y_{ijk} = \mu + \alpha_j + s_i + \gamma_k + \epsilon_{ijk} \]

  • \(\alpha_j\): fixed effect of condition \(j\) (A/B/C/D)
  • \(s_i\): random subject effect
  • \(\gamma_k\): fixed effect of order group \(k=1,\dots,4\)

If \(\gamma_k\) is not significant, there is no evidence of order effects, supporting the conclusion that the Latin Square successfully controlled order.

In short, the Latin Square is a validated method for order control.
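Since order group is constant within a participant, a quick model-free version of this check is a between-subjects one-way ANOVA of each participant's mean response across the four order groups; the mixed model above (fit with, e.g., lme4) is the more complete version. A sketch on simulated data with no true order effect; names and numbers are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, k = 286, 4
group = np.arange(n) % 4            # order groups of 72/72/71/71 (minor imbalance is fine)
Y = rng.normal(0, 1, (n, 1)) + rng.normal(0, 1, (n, k))   # no order effect built in

subj_means = Y.mean(axis=1)         # one summary value per participant
F, p = stats.f_oneway(*(subj_means[group == g] for g in range(k)))
# A non-significant p is consistent with negligible order effects
```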

2. Is an Independent-Groups (Between-Subjects) Design More Appropriate?

(1) Summary

  1. Is the current repeated-measures design appropriate? Yes: very appropriate and aligned with the study goals.
  2. Would between-subjects be better? No: it would likely reduce accuracy (more noise from individual differences).

(2) Side-by-side comparison

| Aspect | Repeated-Measures (current) | Independent-Groups (alternative) |
|---|---|---|
| Statistical power | Higher (shared variance removed) | Lower (more error from individual differences) |
| Individual differences | Controlled (same person rates all conditions) | Uncontrolled |
| Order effects | ⚠️ Potential; mitigate with Balanced Latin Square | Absent by design |
| Fit to research aim (compare persuasion types) | Excellent | ❌ Risk that group composition (traits) confounds condition effects |

(3) Answers to common concerns

“Isn’t it more valid to assign each condition to a different random group?”
No. Individual differences can obscure or distort true condition effects.
Example: If Group A is more sensitive to emotional content and Group D is less so, the outcomes may reflect participant traits, not chatbot persuasion.

“What about fatigue or order effects in repeated measures?”
Valid concern, but Balanced Latin Square minimizes order effects. Some fatigue may remain, but it’s a typical and acceptable tradeoff for cleaner comparisons. Many psychological experiments use repeated-measures for this reason.

“Do we need to recollect data?”
Absolutely not. Your current design is ideal for the research questions.