Consultation 3: GLMM/GEE & Kappa for ACS Preference Data

I. Client’s Inquiry 1

Fifty patients with aphasia were asked to select their preferred activities using the ACS activity cards.
The ACS consists of 67 cards in total, including 33 instrumental activities, 18 leisure activities, and 16 social activities, each represented by a printed photograph.

First, the researcher presented the 33 instrumental activity cards to each patient one by one and asked whether the patient would like to perform the activity.

  • If the patient did not want to do the activity, it was coded as 0.
  • If the patient wanted to do the activity, it was coded as 1.
  • For example, if acs_instrumental_1 = 1 in the Excel coding file, it indicates that the patient prefers instrumental activity #1 (visiting a hospital).

After all 33 instrumental activities were classified, only the activities selected as preferred were gathered, and the patient was asked to choose their Top 1 preferred activity among them. The activity number was recorded in acs_instrumental_top1. For example, if acs_instrumental_top1 = 5, it indicates that instrumental activity #5 (preparing a meal) was selected as the top preference.

After completing the instrumental activity classification, the same procedure was applied to:

  • Leisure activities (acs_leisure_34–acs_leisure_51, acs_leisure_top1)
  • Social activities (acs_social_52–acs_social_67, acs_social_top1)

To evaluate intra-rater reliability of the ACS, a second assessment was conducted for 30 out of the 50 patients, 2 weeks to 4 months after the first assessment.

Demographic variables

  1. sex: 1 = male, 2 = female

  2. age: numeric

  3. education:

    • 1 = no schooling
    • 2 = elementary school
    • 3 = middle school
    • 4 = high school
    • 5 = college
  4. job:

    • 1 = unemployed
    • 2 = production
    • 3 = service
    • 4 = sales
    • 5 = office
    • 6 = professional
    • 7 = managerial
  5. hand: 1 = right-handed, 2 = left-handed

What statistical methods are appropriate for this analysis, and how should the variables be coded?

Answer of Inquiry 1

1. Recommended statistical approach

  • Each patient provides repeated binary (0/1) responses across many activities, so observations within a patient are correlated. A logistic GLMM with a random intercept for patient, or a marginal logistic model fitted by GEE, accounts for this within-subject correlation.

2. Why restructuring to long format is needed

  • To fit a GLMM or GEE properly, the data should be converted from wide format (33 columns of outcomes, one per activity) to long format, where each row represents one (subject, activity) observation.

Long-format example (key idea)

  • In long format:

    • activity_id = 1 corresponds to acs_instrumental_1
    • response is the value of that variable (0 or 1)

Example rows:

subject_id | activity_id | response                    | sex | age | education | job | hand
-----------|-------------|-----------------------------|-----|-----|-----------|-----|-----
1          | 1           | value of acs_instrumental_1 | 1/2 | …   | …         | …   | 1/2
1          | 2           | value of acs_instrumental_2 | 1/2 | …   | …         | …   | 1/2
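As a concrete sketch, the wide-to-long restructuring can be done with pandas `melt`. The column names follow the inquiry's naming scheme, but the data values below are made up for illustration only:

```python
import pandas as pd

# Hypothetical wide-format data: one row per patient,
# one 0/1 column per instrumental activity (only 3 shown here).
wide = pd.DataFrame({
    "subject_id": [1, 2],
    "sex": [1, 2],
    "age": [63, 71],
    "acs_instrumental_1": [1, 0],
    "acs_instrumental_2": [0, 1],
    "acs_instrumental_3": [1, 1],
})

# Reshape so each row is one (subject, activity) observation.
long = wide.melt(
    id_vars=["subject_id", "sex", "age"],
    value_vars=[f"acs_instrumental_{i}" for i in range(1, 4)],
    var_name="activity", value_name="response",
)
# Extract the numeric activity_id from the column name.
long["activity_id"] = long["activity"].str.extract(r"(\d+)$", expand=False).astype(int)
long = long.sort_values(["subject_id", "activity_id"]).reset_index(drop=True)
print(long[["subject_id", "activity_id", "response"]])
```

The same reshape is available in SPSS via the Restructure (VARSTOCASES) procedure.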

3. Variable coding advice (demographics)

  • The current coding (e.g., sex as 1/2, education as 1–5, job as 1–7, hand as 1/2) is generally fine as long as:

    • In SPSS, these are explicitly treated as categorical variables (Nominal/Ordinal as appropriate, and specified as “categorical predictors” in the model).
  • For easier interpretation (optional but recommended), binary variables can be recoded to 0/1:

    • sex: male = 0, female = 1
    • hand: right = 0, left = 1

    This is not strictly required, but the coefficients become more intuitive to interpret.

I. Client’s Inquiry 2

The objective of this study is to examine the intra-rater reliability of the ACS assessment. The dataset includes:

  • Preference selections for activities
    (acs_instrumental_1–33, acs_leisure_34–51, acs_social_52–67)
  • The activity number selected as Top 1 preference in each domain

for both the first and second assessments.

I would like to ask:

  1. Which statistical methods are appropriate for assessing the reliability of
    • the selected preferred activities, and
    • the Top 1 preferred activity?
  2. Whether the current variable coding scheme is appropriate for conducting reliability analysis.

For example:

  • acs_instrumental_1 indicates whether Instrumental Activity #1 was selected in the first assessment.
  • 2acs_instrumental_1 indicates whether the same activity was selected in the second assessment.

Answer of Inquiry 2

0. Intraclass Correlation Coefficient (ICC) — Reliability Assessment

The Intraclass Correlation Coefficient (ICC) is a statistic that quantifies the degree of agreement (or consistency) between measurements of the same targets made at different times or by different raters. ICC is typically used for continuous measurements and evaluates how much of the total variance is attributable to differences between subjects rather than measurement error.

  • Model

    • Two-way random: Both subjects and raters (or time points) are treated as random effects (i.e., raters/timepoints are considered a random sample from a larger population).
    • Two-way mixed: Subjects are random but raters/time points are fixed (e.g., specifically Time1 and Time2).
  • Type

    • Consistency: Focuses on whether subjects keep the same relative ranking across measurements (ignores systematic differences in mean levels between raters/timepoints).
    • Absolute agreement: Focuses on exact agreement of measurement values (penalizes systematic differences in means).

A conceptual form of the ICC is:

\[ \mathrm{ICC} \;=\; \frac{\sigma^2_{\text{subject}}} {\sigma^2_{\text{subject}} + \sigma^2_{\text{error}} + (\text{if applicable } \sigma^2_{\text{rater}})} \]

where \(\sigma^2_{\text{subject}}\) is the between-subject variance, \(\sigma^2_{\text{error}}\) is the residual/error variance, and \(\sigma^2_{\text{rater}}\) is the variance due to raters/timepoints when included.
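To make the variance decomposition concrete, here is a minimal sketch computing ICC(2,1) (two-way random effects, absolute agreement, single measurement, in the Shrout & Fleiss notation) from a small ratings matrix. The scores are made up, and as discussed below this statistic is illustrated here only for continuous data, not for the ACS selections:

```python
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: (n_subjects, k_raters) matrix of continuous scores.
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-subject means
    col_means = ratings.mean(axis=0)   # per-rater/timepoint means

    # Mean squares from the two-way ANOVA decomposition.
    ms_rows = k * ((row_means - grand) ** 2).sum() / (n - 1)
    ms_cols = n * ((col_means - grand) ** 2).sum() / (k - 1)
    sse = ((ratings - row_means[:, None] - col_means[None, :] + grand) ** 2).sum()
    ms_err = sse / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Two time points that agree exactly -> ICC = 1.
scores = np.array([[80.0, 80.0], [65.0, 65.0], [90.0, 90.0]])
print(icc_2_1(scores))  # 1.0
```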

Common rule-of-thumb ranges (e.g., Cicchetti's benchmarks) are:

  • \(\mathrm{ICC} \ge 0.75\): Excellent
  • \(0.60 \le \mathrm{ICC} < 0.75\): Good
  • \(0.40 \le \mathrm{ICC} < 0.60\): Fair
  • \(\mathrm{ICC} < 0.40\): Poor

1. Appropriateness of ICC for the Current Data

Not appropriate.

For nominal data such as binary preference selection (0/1) and categorical activity numbers, the Intraclass Correlation Coefficient (ICC) is not an appropriate reliability measure.

(1) Why Is ICC Inappropriate?

  1. ICC is designed for continuous (numerical) data. ICC is used when the magnitude of values and the distances between them are meaningful. Examples:

    • Blood pressure values
    • Test scores
    • Speed, distance
    • Mean scores of questionnaire items

    → ICC is meaningful only when the numerical value itself carries quantitative meaning.

  2. The current data are nominal (categorical). Example:

    • acs_instrumental_top1 = 5 → “Preparing meals”
    • acs_instrumental_top1 = 8 → “Doing laundry”

    The numerical difference between 5 and 8 has no quantitative meaning.
    Treating these category labels as numerical values is a logical error. ICC relies on variance and covariance between numerical values, so applying ICC to activity codes produces meaningless results.

(2) What Should Be Used Instead?

Data type                                    | Appropriate reliability measure
---------------------------------------------|--------------------------------
Nominal (0/1 selection, Top 1 activity code) | ✅ Cohen’s Kappa
Ordinal (e.g., ranked satisfaction levels)   | 🔸 Weighted Kappa
Continuous (scores, measurements)            | ✅ ICC

(3) When Is ICC Appropriate?

If an assessment produces continuous scores that are repeatedly measured in the same way, ICC is appropriate.

Example: Treatment effect rated on a 0–100 scale

However, activity codes and binary selections are categorical, so ICC should not be used here.

2. Cohen’s Kappa

Cohen’s Kappa measures how consistently two assessments agree beyond chance. In simple terms, it quantifies how well the results of the first and second assessments match.

Example:

  • A participant selects “Visiting a hospital” in the first assessment
  • The same activity is selected again in the second assessment
    → This is counted as agreement.

Kappa summarizes such agreements across all activities. Cohen’s Kappa (\(\kappa\)) is defined as \[ \kappa = \frac{P_o - P_e}{1 - P_e}, \] where:

  • \(P_o\) is the observed proportion of agreement between the first and second assessments.
  • \(P_e\) is the expected proportion of agreement by chance, calculated from the marginal distributions of the two assessments.

The observed agreement \(P_o\) is given by \[ P_o = \frac{\sum_{i=1}^{K} n_{ii}}{N}, \] where:

  • \(K\) is the number of categories,
  • \(n_{ii}\) is the number of cases classified into category \(i\) in both assessments,
  • \(N\) is the total number of observations.

The expected agreement \(P_e\) is calculated as \[ P_e = \sum_{i=1}^{K} \left( \frac{n_{i+}}{N} \cdot \frac{n_{+i}}{N} \right), \] where:

  • \(n_{i+}\) is the total number of cases classified into category \(i\) in the first assessment,
  • \(n_{+i}\) is the total number of cases classified into category \(i\) in the second assessment.
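The two formulas can be implemented directly from a K × K agreement table. A minimal sketch with numpy; the 2 × 2 counts below are hypothetical:

```python
import numpy as np

def cohen_kappa(table: np.ndarray) -> float:
    """Cohen's kappa from a K x K agreement table.

    table[i, j] = number of cases in category i at the first
    assessment and category j at the second assessment.
    """
    n = table.sum()
    p_o = np.trace(table) / n                                    # observed agreement
    p_e = (table.sum(axis=1) * table.sum(axis=0)).sum() / n**2   # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 2x2 table: rows = first assessment, cols = second.
#            second: want  not-want
table = np.array([[30, 5],    # first: want
                  [10, 15]])  # first: not-want
print(cohen_kappa(table))  # ~ 0.47: P_o = 0.75, P_e = 19/36
```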

How Is Kappa Different from Simple Agreement Rate?

  • Simple agreement rate:
    Number of matching responses ÷ total responses

  • Problem:
    Some agreement may occur by chance.

Example: Guessing heads or tails yields about 50% agreement by chance.

Kappa adjusts for this by subtracting chance agreement, providing a more accurate reliability estimate.

Interpretation of Kappa Values

Kappa value | Interpretation
------------|--------------------------
1.00        | Perfect agreement
0.81–0.99   | Almost perfect agreement
0.61–0.80   | Substantial agreement
0.41–0.60   | Moderate agreement
0.21–0.40   | Fair agreement
0.00–0.20   | Slight agreement
< 0.00      | Worse than chance

Example:

Suppose four participants responded as follows:

Participant | First       | Second
------------|-------------|------------
A           | Want        | Want
B           | Do not want | Want
C           | Want        | Want
D           | Do not want | Do not want
Simple agreement rate = 3 / 4 = 75%

Kappa evaluates this agreement after accounting for chance: here \(P_e = (2/4)(3/4) + (2/4)(1/4) = 0.50\), so \(\kappa = (0.75 - 0.50)/(1 - 0.50) = 0.50\).
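The four-participant example can be verified with scikit-learn's `cohen_kappa_score`, coding Want = 1 and Do not want = 0:

```python
from sklearn.metrics import cohen_kappa_score

# Participants A-D; Want = 1, Do not want = 0.
first  = [1, 0, 1, 0]   # A, B, C, D at the first assessment
second = [1, 1, 1, 0]   # A, B, C, D at the second assessment

kappa = cohen_kappa_score(first, second)
print(kappa)  # 0.5: simple agreement is 75%, chance agreement is 50%
```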

3. Agreement of Top 1 Preferred Activities by Domain

The goal is to examine whether the Top 1 preferred activity selected in each domain is consistent between the first and second assessments. For each domain, Cohen’s Kappa can be calculated to assess intra-rater reliability between the two time points.

Example Variables

Instrumental Activities

  • acs_instrumental_top1 (first assessment)
  • 2acs_instrumental_top1 (second assessment)

Leisure Activities

  • acs_leisure_top1
  • 2acs_leisure_top1

Social Activities

  • acs_social_top1
  • 2acs_social_top1

Each pair represents the Top 1 selection from the first and second assessments.
A separate Kappa value is computed for each domain.

Although the Top 1 variables are coded numerically, they represent categorical labels, not ordered or quantitative values.

Examples:

  • A value of 5 does not mean “greater” than 7
  • It simply refers to “Activity #5” versus “Activity #7”
  • Therefore, during analysis (e.g., in SPSS), these variables must be treated as Nominal.
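The three per-domain Kappa values can be computed in one loop. A sketch using the variable names from the inquiry (first-assessment column vs. its `2`-prefixed second-assessment counterpart); the activity codes below are made up, and are compared only for equality, never as quantities:

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Hypothetical Top 1 activity codes for 5 patients at two assessments.
df = pd.DataFrame({
    "acs_instrumental_top1":  [5, 8, 5, 12, 3],
    "2acs_instrumental_top1": [5, 8, 7, 12, 3],
    "acs_leisure_top1":       [34, 40, 40, 51, 34],
    "2acs_leisure_top1":      [34, 40, 40, 51, 34],
    "acs_social_top1":        [52, 60, 55, 52, 67],
    "2acs_social_top1":       [60, 60, 55, 52, 67],
})

# One Kappa per domain: first vs. second assessment.
for domain in ["instrumental", "leisure", "social"]:
    kappa = cohen_kappa_score(df[f"acs_{domain}_top1"],
                              df[f"2acs_{domain}_top1"])
    print(f"{domain}: kappa = {kappa:.3f}")
```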

4. Conclusion

Comparing the Top 1 preferred activities using Cohen’s Kappa for each domain is an appropriate way to assess intra-rater reliability. Three Kappa values will be obtained (instrumental, leisure, social), and higher values (commonly ≥ 0.6) indicate more stable and reliable assessments.