Consultation 2: Repeated-Measures ANOVA and Multilevel Mediation for a Four-Type Persuasion Study
I. Client’s Inquiry
1. Context
The client seeks statistical consulting to (a) choose appropriate analysis methods for their study and (b) potentially outsource analyses if they are too difficult to perform independently.
2. Key Questions
Q1. Appropriateness of H1 & H2 Analyses
- Design: within-subjects (repeated measures); all participants experienced four persuasion types (A, B, C, D).
- Sample size: n = 286.
- Assumptions: normality violated, sphericity violated.
- Question: Is it acceptable to use One-Way Repeated Measures ANOVA with Greenhouse–Geisser correction under these violations?
Q2. Possibility of Mediation Analysis for H3
- Independent variable structure: 4 equally weighted types; there is no single “reference” type, so standard dummy-coding with one baseline is not conceptually ideal.
- Design issue: All participants saw all four types (repeated measures); classic mediation frameworks often assume between-subjects manipulations.
- Current fallback: Simple regressions testing whether perceived autonomy significantly predicts behavioral intention, interaction satisfaction, and reuse intention.
- Question: Despite the within-subjects design and four-level factor, is there a valid mediation approach for H3?
Q3. Appropriate Analysis for Manipulation-Check Items
- Example (Type A): show that emotional appeal / gain framing scores are high, whereas rational appeal / loss framing scores are low.
- Question: What statistical tests should be used to demonstrate that the manipulations worked as intended?
Q4. Balanced Latin Square & Whether an Independent-Groups Design Is Preferable
3. Proposed/Considered Analysis Techniques
- Repeated-Measures ANOVA (primary for H1/H2).
- Mediation analysis / Simple regression (for H3, as feasible).
4. Hypotheses
- H1: Do behavioral intention, interaction satisfaction, and reuse intention differ by persuasion type?
- H2: Does perceived autonomy differ by persuasion type?
- H3 (Version 1 – Mediation): Perceived autonomy mediates the relationship between persuasion type and each outcome (behavioral intention, interaction satisfaction, reuse intention).
- H3 (Version 2 – Simple Regression): Within each persuasion type, perceived autonomy significantly predicts behavioral intention, interaction satisfaction, and reuse intention.
5. Data Description
- N = 286 survey respondents.
- Scales: 5-point Likert, multiple-choice questionnaire items.
- Mediator: Perceived autonomy.
- Outcomes (DVs): Behavioral intention, interaction satisfaction, reuse intention.
- Independent variable (IV): Four persuasion types (operationalized via image stimuli).
II. Answer of Q1
1. Model
Conclusion: A one-way repeated-measures (RM) ANOVA is the appropriate analysis.
- Design: Within-subjects. All participants evaluated four persuasion types (A, B, C, D).
- Dependent variables:
  - For H1: Behavioral intention, interaction satisfaction, reuse intention
  - For H2: Perceived autonomy
- The same 286 participants experienced all four conditions and provided repeated responses for the same outcomes under each condition. Because observations are not independent across conditions within participants, a repeated-measures approach is required.
Model: \[ Y_{ij} = \mu + \alpha_j + s_i + \epsilon_{ij} \] where \(Y_{ij}\) is the outcome for participant \(i\) in condition \(j\), \(\mu\) is the grand mean, \(\alpha_j\) is the fixed effect of persuasion type \(j\), \(s_i\) is the (random) subject effect, and \(\epsilon_{ij}\) is the residual.
cf) Note on Two-Way Repeated-Measures ANOVA
- Technically feasible, but here the client did not separate appeal (emotional vs. rational) and framing (gain vs. loss) as two orthogonal factors.
- Instead, these were combined into a single four-level factor (“persuasion type”).
- Since the goal is to compare differences among the four types, one-way RM ANOVA better matches the research objective.
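To make the variance partition concrete, here is a minimal pure-Python sketch of the one-way RM ANOVA F statistic implied by the model above. The scores are hypothetical; in practice the client would run this in SPSS or R, which also supply the \(p\)-value and the Greenhouse–Geisser adjustment.

```python
# Illustrative one-way repeated-measures ANOVA on hypothetical data.
# Rows = participants, columns = conditions (A, B, C, D).
data = [
    [3.0, 4.0, 2.0, 5.0],
    [2.0, 3.5, 2.5, 4.0],
    [4.0, 4.5, 3.0, 5.0],
    [3.5, 3.0, 2.0, 4.5],
    [2.5, 4.0, 1.5, 4.0],
]
n = len(data)      # participants
k = len(data[0])   # conditions

grand = sum(sum(row) for row in data) / (n * k)
cond_means = [sum(row[j] for row in data) / n for j in range(k)]
subj_means = [sum(row) / k for row in data]

# Partition total variation into condition, subject, and residual parts
ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
ss_total = sum((data[i][j] - grand) ** 2 for i in range(n) for j in range(k))
ss_error = ss_total - ss_cond - ss_subj

df_cond = k - 1
df_error = (n - 1) * (k - 1)
F = (ss_cond / df_cond) / (ss_error / df_error)
print(round(F, 2))
```

Under a Greenhouse–Geisser correction, both `df_cond` and `df_error` would be multiplied by the estimated sphericity index before computing the \(p\)-value.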
2. Normality
(1) Kolmogorov–Smirnov (K–S) Normality Test
Definition. Tests whether the sample empirical CDF (ECDF) differs significantly from the CDF of a normal distribution.
- For sample \(X_1, X_2, \ldots, X_n\), the ECDF is
\[ F_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}(X_i \le x). \]
- Test statistic:
\[ D = \sup_x \lvert F_n(x) - F(x) \rvert, \]
the maximum absolute gap between the ECDF and the target normal CDF.
- Hypotheses:
  \(H_0\): the sample follows a normal distribution.
  \(H_1\): the sample does not follow a normal distribution.
- Decision rule: if the \(p\)-value \(< 0.05\), reject \(H_0\) (i.e., normality is violated).
Procedure:
1. Compute the ECDF at each observation.
2. Compute the normal CDF (with appropriate mean/SD).
3. Compute \(D\).
4. Compare \(D\) to the critical value (or obtain a \(p\)-value) at significance level \(\alpha\) and test \(H_0\).
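The procedure above can be sketched in a few lines of pure Python (the sample is hypothetical):

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample):
    """K-S statistic D against a normal fitted to the sample."""
    xs = sorted(sample)
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / (n - 1))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = normal_cdf(x, mu, sigma)
        # The ECDF jumps from (i-1)/n to i/n at x; check both gaps
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

sample = [2.1, 2.4, 2.5, 2.7, 3.0, 3.1, 3.3, 3.8, 4.0, 4.6]
print(ks_statistic(sample))
```

One caveat: when the normal's mean and SD are estimated from the same sample (as here), the classical K–S critical values are too lenient; statistical software typically applies the Lilliefors correction in that case.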
(2) Shapiro–Wilk Normality Test
For sample \(x_1, x_2, \ldots, x_n\) (with order statistics \(x_{(1)} \le \cdots \le x_{(n)}\)), the statistic is \[ W = \frac{\left(\sum_{i=1}^n a_i\, x_{(i)}\right)^2}{\sum_{i=1}^n (x_i - \bar{x})^2}, \] where \(a_i\) are weights derived under normality.
Interpretation
- Numerator: squared linear combination aligning observed order statistics with those expected under normality.
- Denominator: sample variance (total variation).
- \(W \approx 1\) indicates strong concordance with normality; smaller \(W\) indicates departures from normality.
Hypotheses
- \(H_0\): the sample follows a normal distribution.
- \(H_1\): the sample does not follow a normal distribution.
Reject \(H_0\) when \(p\)-value \(< 0.05\).
(3) Appropriateness of Normality Testing in This Study
1) On averaging item scores within each condition
Averaging two items per condition can attenuate non-normality. Therefore, it is preferable to test normality on the original item-level scores rather than on the averaged composites.
2) Accounting for the repeated-measures structure
Because the design is within-subjects, measurements are correlated across conditions (A/B/C/D). Testing normality separately on each condition’s raw scores ignores this correlation and can bias interpretation.
A more appropriate approach is to examine the residuals from the repeated-measures model: \[ Y_{ij} = \mu + \alpha_j + s_i + \epsilon_{ij}, \] where \(\alpha_j\) is the fixed effect of condition \(j\), \(s_i\) is the (random) subject effect, and \(\epsilon_{ij} \sim \mathcal{N}(0,\sigma^2)\).
- Pool all residuals across A, B, C, D and assess their normality (e.g., Shapiro–Wilk or K–S on the residuals).
- Why not per-condition tests? The subject-specific effect \(s_i\) influences all conditions; evaluating a single condition in isolation incompletely accounts for this shared effect.
Visual diagnostics are recommended alongside tests (e.g., QQ-plots of residuals).
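The residual-pooling approach can be sketched as follows, using condition and subject means as the effect estimates in the additive model above (the data are hypothetical; a real analysis would extract residuals from the fitted model in R or SPSS and run Shapiro–Wilk plus a QQ-plot on them):

```python
# Residuals from the additive RM model Y_ij = mu + alpha_j + s_i + e_ij,
# estimating effects by condition and subject means (hypothetical data).
data = [
    [3.0, 4.0, 2.0, 5.0],
    [2.0, 3.5, 2.5, 4.0],
    [4.0, 4.5, 3.0, 5.0],
    [3.5, 3.0, 2.0, 4.5],
]
n, k = len(data), len(data[0])
grand = sum(sum(r) for r in data) / (n * k)
cond_means = [sum(r[j] for r in data) / n for j in range(k)]
subj_means = [sum(r) / k for r in data]

# Fitted value = grand mean + condition effect + subject effect
residuals = [
    data[i][j] - (cond_means[j] + subj_means[i] - grand)
    for i in range(n) for j in range(k)
]
# Pool ALL residuals across A/B/C/D and assess these for normality
print(residuals)
```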
3. Other Models (Beyond Simple One-Way ANOVA)
(1) One-Way ANOVA (Between-Subjects)
- Appropriate when different participant groups each experience only one condition.
- Assumes independent observations across groups.
- Not applicable here because all participants experienced all four conditions (A/B/C/D).
(2) Repeated-Measures MANOVA
- Participants respond to multiple dependent variables across multiple within-subject conditions (e.g., A/B/C/D).
- Analyzes several outcomes jointly (e.g., behavioral intention, interaction satisfaction, reuse intention).
- Useful when you want to test the common effect of condition across correlated outcomes simultaneously.
- Example: If those three outcomes are correlated and you want to know how they jointly change across A/B/C/D, RM-MANOVA is appropriate.
(3) Most Suitable Approach for This Study
| Focus | Do we analyze correlations among DVs? | Recommended analysis |
|---|---|---|
| Compare each DV separately across A/B/C/D | No (no explicit need stated) | Separate one-way RM ANOVA per DV (current approach is appropriate) |
| Test how the correlated DVs jointly change across A/B/C/D | Yes | RM-MANOVA |
4. Sphericity
- Greenhouse–Geisser correction is a standard and appropriate remedy when sphericity is violated: it estimates a sphericity index \(\hat{\varepsilon}\) and multiplies both the numerator and denominator degrees of freedom of the F test by it, yielding a more conservative test.
- No issue using it in this context.
III. Answer of Q2
1. Mediation Analysis
Mediation analysis tests whether the effect of an independent variable X on a dependent variable Y occurs through a mediator M.
Structure:
X (persuasion type) → M (perceived autonomy) → Y (behavioral outcome)
↘────────────────────────────────────────→ Y (direct path)
Key effects:
- a path: X → M
- b path: M → Y (controlling for X)
- c′ path: X → Y (direct effect)
- Indirect effect: a · b
2. Why Simple Regression Is Insufficient
The client proposes to:
- Run simple regressions within each persuasion type (A, B, C, D) to test if perceived autonomy → behavioral outcomes (behavioral intention, satisfaction, reuse intention).
But this only examines the “b path” in isolation.
It does not consider:
- The impact of persuasion type X on perceived autonomy (the a path)
- The full mediation structure (indirect effect a · b)
- The repeated-measures (within-subject) nature of the data
Furthermore, regression assumes independence of observations, which is violated here: the same participants rated all 4 persuasion types.
3. Correct Method: Repeated Measures Mediation (Multilevel Mediation)
Because the data involve:
- All participants rating all 4 persuasion types (within-subjects factor)
- Responses nested within participants
→ The correct approach is Multilevel Mediation Analysis (also called Repeated Measures Mediation).
4. Model Structure
Level 1 (within participants):
\[\begin{aligned} M_{ij} &= \alpha_{0i} + a\,X_{ij} + r_{1ij} \\ Y_{ij} &= \beta_{0i} + c' \, X_{ij} + b \, M_{ij} + r_{2ij} \end{aligned}\]
- \(i\): participant
- \(j\): condition (A, B, C, D)
- Note: because persuasion type has four levels, \(X_{ij}\) enters as a set of contrast (or dummy) codes rather than a single predictor, yielding one \(a\), \(c'\), and indirect effect per contrast.
Level 2 (between participants):
\[ \alpha_{0i} \sim \mathcal{N}(\gamma_{0}, \tau_{\alpha}^{2}), \quad \beta_{0i} \sim \mathcal{N}(\delta_{0}, \tau_{\beta}^{2}) \]
→ This models both the fixed effects (mediation paths) and random intercepts per participant.
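To illustrate the within-person logic (not the full multilevel estimator), the sketch below person-mean-centers X, M, and Y, which removes the random intercepts, and then estimates the a, b, and c′ paths by pooled OLS. Everything here is hypothetical: the simulated data, the single contrast code standing in for the four-level factor, and the effect sizes. A real analysis should use `lme4`/`mediation` or `brms`, as recommended below.

```python
import random

random.seed(1)

def ols(y, X):
    """OLS coefficients via normal equations (no intercept needed,
    since all variables are person-mean centered). Tiny sketch only."""
    p = len(X[0])
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(p)]
    for c in range(p):  # Gauss-Jordan elimination
        piv = xtx[c][c]
        for b in range(c, p):
            xtx[c][b] /= piv
        xty[c] /= piv
        for r in range(p):
            if r != c:
                f = xtx[r][c]
                for b in range(c, p):
                    xtx[r][b] -= f * xtx[c][b]
                xty[r] -= f * xty[c]
    return xty

# Simulate 100 participants x 4 conditions with true a=0.8, b=0.6, c'=0.3
x_levels = [-1.5, -0.5, 0.5, 1.5]   # one illustrative contrast code
xs, ms, ys = [], [], []
for _ in range(100):
    u_m, u_y = random.gauss(0, 1), random.gauss(0, 1)  # person intercepts
    xi, mi, yi = [], [], []
    for x in x_levels:
        m = 0.8 * x + u_m + random.gauss(0, 0.3)
        y = 0.6 * m + 0.3 * x + u_y + random.gauss(0, 0.3)
        xi.append(x); mi.append(m); yi.append(y)
    for seq, out in ((xi, xs), (mi, ms), (yi, ys)):  # center within person
        mean = sum(seq) / len(seq)
        out.extend(v - mean for v in seq)

a = ols(ms, [[x] for x in xs])[0]                       # a path: X -> M
cprime, b = ols(ys, [[x, m] for x, m in zip(xs, ms)])   # c' and b paths
indirect = a * b
print(round(a, 2), round(b, 2), round(indirect, 2))
```

Person-mean centering is a convenient way to see why the within-person paths are identifiable despite the repeated measures; the multilevel model additionally gives correct standard errors and can allow the paths themselves to vary across participants.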
5. Conclusion (Key Message for Client)
- Yes, H3 mediation analysis is possible despite the repeated-measures design.
- But it must be done using a Multilevel Mediation model that accounts for the within-subject structure of the data.
- The simple regression approach may offer partial insight but cannot test for mediation and violates statistical assumptions (independence of residuals).
- We strongly recommend using tools such as lme4 and mediation (or brms for Bayesian estimation) in R to perform proper repeated-measures mediation.
IV. Answer of Q3
1. What is a Manipulation Check?
A manipulation check verifies that the experimental manipulation (the intended effect of the independent variable) was perceived by participants as designed.
Example: If you showed participants an “emotional appeal” chatbot vs. a “rational appeal” chatbot, you might ask, “How emotional was the chatbot you just saw?” (1 = not at all, 5 = very much).
2. Recommended Methods
(1) Paired (Dependent) t-tests
For each type (A, B, C, D), compare:
- Emotional vs. Rational
- Gain vs. Loss
If, for a given type A, you find Emotional > Rational and Gain > Loss, then the manipulation for A is successful.
- First one-tailed test (Emotional vs. Rational):
  \(H_0:\ \mu_{\text{Emo}} - \mu_{\text{Rat}} = 0\)
  \(H_1:\ \mu_{\text{Emo}} - \mu_{\text{Rat}} > 0\)
- Second one-tailed test (Gain vs. Loss):
  \(H_0:\ \mu_{\text{Gain}} - \mu_{\text{Loss}} = 0\)
  \(H_1:\ \mu_{\text{Gain}} - \mu_{\text{Loss}} > 0\)
Because the design predicts direction (Emotional > Rational; Gain > Loss), one-tailed tests are appropriate.
- Multiplicity: there are 8 paired t-tests in total (2 contrasts × 4 types). Control the familywise error rate with Bonferroni: use \(\alpha^* = 0.05/8 = 0.00625\) for each test (or report adjusted \(p\)-values).
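A minimal sketch of one such paired comparison, on hypothetical Type-A manipulation-check scores (real analyses would also compute the one-tailed \(p\)-value and compare it against the Bonferroni-adjusted \(\alpha^*\)):

```python
import math

def paired_t(x, y):
    """Paired t statistic for the one-tailed H1: mean(x - y) > 0."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean = sum(d) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in d) / (n - 1))
    return mean / (sd / math.sqrt(n))

# Hypothetical scores from the same 8 participants for Type A
emotional = [4.2, 4.5, 3.8, 4.0, 4.7, 4.1, 3.9, 4.4]
rational  = [2.1, 2.8, 2.5, 3.0, 2.2, 2.6, 2.9, 2.4]
t = paired_t(emotional, rational)
print(round(t, 2))
# Judge significance against alpha* = 0.05 / 8 = 0.00625
```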
(2) One-Way Repeated-Measures ANOVA (within a type)
Each participant provides Likert responses for all four manipulation-check facets (Emotional, Rational, Gain, Loss) → a within-subjects factor with 4 levels is valid.
For a given type (e.g., A):
- \(H_0\): \(\mu_{\text{Emo}} = \mu_{\text{Rat}} = \mu_{\text{Gain}} = \mu_{\text{Loss}}\)
- \(H_1\): At least one mean differs.
If the ANOVA is significant (\(p < \alpha\)), run pairwise comparisons (paired t-tests) as post hoc to identify which pairs differ.
- Number of pairs: \(\binom{4}{2} = 6\).
- Apply a multiple-comparison correction (e.g., Bonferroni or Holm).
- Evidence of Emotional > Rational and Gain > Loss indicates successful manipulation for that type.
Most software (SPSS, R, etc.) can directly perform RM ANOVA and post hoc paired comparisons with appropriate adjustments.
3. Non-Normality: Robust Alternatives
- Paired t-test → Wilcoxon signed-rank test
- RM ANOVA → Friedman test
However, with \(n=286\), modest deviations from normality are typically not consequential due to large-sample robustness (CLT), especially for paired mean differences and RM ANOVA residuals—provided there are no extreme outliers or severe skewness. Complement tests with QQ-plots and residual diagnostics.
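For reference, the Friedman statistic is simple to compute by hand: rank each participant's scores across the \(k\) conditions, then compare rank sums. A minimal sketch on hypothetical ratings (assuming no within-participant ties, which keeps the formula simple; software handles ties with a correction):

```python
def friedman_statistic(data):
    """Friedman chi-square statistic (no ties assumed).
    data: list of per-participant score lists across k conditions."""
    n, k = len(data), len(data[0])
    rank_sums = [0.0] * k
    for row in data:
        # Rank this participant's scores (1 = smallest)
        order = sorted(range(k), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    # Q = 12 / (n k (k+1)) * sum(R_j^2) - 3 n (k+1), ~ chi-square(k-1)
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) \
        - 3 * n * (k + 1)

# Hypothetical ratings: 5 participants x 4 facets (Emo, Rat, Gain, Loss)
scores = [
    [5, 2, 4, 1],
    [4, 1, 5, 2],
    [5, 3, 4, 2],
    [4, 2, 5, 3],
    [5, 1, 4, 2],
]
print(friedman_statistic(scores))  # compare against chi-square, k-1 df
```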
V. Answer of Q4
1. Appropriateness of a Balanced Latin Square
(1) Example
A typical Balanced Latin Square for four conditions:
| Order group | 1st | 2nd | 3rd | 4th |
|---|---|---|---|---|
| Group 1 | A | B | D | C |
| Group 2 | B | C | A | D |
| Group 3 | C | D | B | A |
| Group 4 | D | A | C | B |
Each of A, B, C, D appears exactly once in each serial position.
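For an even number of conditions, such a square can be generated mechanically with the standard zig-zag construction (first row 1, 2, n, 3, n-1, ...; each later row shifts by one). The sketch below produces one valid balanced square; the table above is another equally valid one:

```python
def balanced_latin_square(n):
    """Balanced Latin square for an even number of conditions n.
    First row: 0, 1, n-1, 2, n-2, ...; later rows add 1 (mod n)."""
    first, lo, hi = [], 0, n - 1
    for pos in range(n):
        if pos % 2 == 0:
            first.append(lo)
            lo += 1
        else:
            first.append(hi)
            hi -= 1
    return [[(c + r) % n for c in first] for r in range(n)]

labels = "ABCD"
square = balanced_latin_square(4)
for row in square:
    print(" ".join(labels[c] for c in row))
```

In the result, every condition appears once per serial position, and every ordered pair of adjacent conditions occurs exactly once, which is what controls immediate carryover effects.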
(2) Is it suitable? (re: “minimizing reviewer concerns about order effects”)
- In within-subjects designs, a Balanced Latin Square is a gold-standard method for order control—more systematic than simple randomization.
- In repeated-measures experiments, responses can depend on presentation order (e.g., novelty/primacy for the first stimulus; fatigue for the last). The balanced scheme distributes such effects evenly across conditions.
(3) Must group sizes be identical?
- Not strictly. Minor imbalances (±1–2 participants) are not problematic.
- Large imbalances (e.g., 50 vs 15) can distort results; avoid pronounced disparities.
(4) Are there “better” alternatives?
- For repeated-measures order control, the Balanced Latin Square is already authoritative and appropriate.
- Reporting that a Latin Square was used typically conveys high design rigor to reviewers.
Sample size sufficiency:
- Minimum needed ≈ (# of orders) × (minimum per order).
- Here: 4 order groups and N = 286, i.e., roughly 71–72 per order → ample power and validity.
Optional model-based check for order effects
\[ Y_{ijk} = \mu + \alpha_j + s_i + \gamma_k + \epsilon_{ijk} \]
- \(\alpha_j\): fixed effect of condition \(j\) (A/B/C/D)
- \(s_i\): random subject effect
- \(\gamma_k\): fixed effect of order group \(k=1,\dots,4\)
If the order-group effect \(\gamma_k\) is not significant, there is no evidence of order effects, which is consistent with the Latin Square having controlled order successfully.
In short, the Latin Square is a validated method for order control.
2. Is an Independent-Groups (Between-Subjects) Design More Appropriate?
(1) Summary
- Is the current repeated-measures design appropriate? ✅ Yes—very appropriate and aligned with the study goals.
- Would between-subjects be better? ❌ No—it would likely reduce accuracy (more noise from individual differences).
(2) Side-by-side comparison
| Aspect | Repeated-Measures (current) | Independent-Groups (alternative) |
|---|---|---|
| Statistical power | ✅ Higher (shared variance removed) | ❌ Lower (more error from individual differences) |
| Individual differences | ✅ Controlled (same person rates all conditions) | ❌ Uncontrolled |
| Order effects | ⚠️ Potential → mitigate with Balanced Latin Square | ✅ Absent by design |
| Fit to research aim (compare persuasion types) | ✅ Excellent | ❌ Risk that group composition (traits) confounds condition effects |
(3) Answers to common concerns
“Isn’t it more valid to assign each condition to a different random group?”
No. Individual differences can obscure or distort true condition effects. Example: If Group A is more sensitive to emotional content and Group D is less so, the outcomes may reflect participant traits, not chatbot persuasion.
“What about fatigue or order effects in repeated measures?”
A valid concern, but a Balanced Latin Square minimizes order effects. Some fatigue may remain, but that is a typical and acceptable tradeoff for cleaner comparisons. Many psychological experiments use repeated measures for this reason.
“Do we need to recollect data?”
Absolutely not. Your current design is ideal for the research questions.