Consultation 5: Complex Survey Methods for Age-Group Comparisons in KNHANES

I. Client’s Inquiry

Study Objective

  • Thesis title: Nutritional Recommendations for Muscle Health in the Ageing Population of South Korea: Exploring Dietary Intake and Patterns, Activity Levels, and Socio-cultural Influences on Sarcopenia Risk.
  • Hypothesis: Intake of specific nutrients/foods in older adults positively affects arm muscle mass (Lean Body Mass) and handgrip strength (HGS), which may in turn improve quality of life (QoL), particularly in relation to physical activity.

Current Issue

  • I used:
    • Complex samples chi-square tests, and
    • Post hoc logistic regression with Bonferroni adjustment to compare categorical variables across age groups.
  • I am unsure whether this approach is appropriate and would like professional guidance on whether the methods are correct (and, if not, what alternatives are recommended).

Data Description

  • Data source: Korea National Health and Nutrition Examination Survey (KNHANES), 2015–2023.
  • Exclusion: 2021–2022 were excluded due to missing HGS measurements.
  • Variables included: socioeconomic status (SES), anthropometric/body composition (InBody) data, nutrient intake, QoL, and HGS.
  • Sample size: 30,354 participants
    • Men: 13,149
    • Women: 17,205
  • Age groups (5 categories):
    • Young Adults: 19–29 (n = 3,398)
    • Middle-aged Adults: 30–49 (n = 9,422)
    • Older Adults: 50–64 (n = 8,174)
    • Elderly: 65–74 (n = 6,132)
    • Older Elderly: ≥ 75 (n = 3,228)
  • Key derived/classified measures:
    • Physical activity levels and HGS classified using:
      • quintiles, and
      • Asian Working Group for Sarcopenia (AWGS) criteria (low vs high strength)
    • Estimated energy requirements calculated
    • Macronutrient intake computed as:
      • absolute intake, and
      • percentage contribution to total caloric intake

II. Background

1. Categorical variables

When the client says:

“to get the difference between age groups for different categorical variables,”

the items in the Data Description that most directly connect to “categorical variables” (especially for comparing distributions across the 5 age groups) are the following statements:

  • “age was categorized into five groups …”
  • “Physical activity levels and HGS were classified using both quintiles and the AWGS criteria …”
  • “The dataset includes information on socioeconomic status (SES) … quality of life (QoL) …”

From these, the most natural interpretation is that “different categorical variables” refers primarily to the variables that were explicitly classified into categories, and secondarily to other variables that are commonly categorical in KNHANES.

1) Physical activity (PA) level — explicitly “classified”

Likely categorical forms include:

  • Quintiles (5 categories) of physical activity level, and/or
  • A binary grouping (e.g., low vs high) if the quintiles were collapsed

This is a very typical setup for using a chi-square test to check whether the distribution of PA categories differs by age group.

2) Handgrip strength (HGS) — explicitly categorized

Likely categorical forms include:

  • AWGS criteria: low vs high strength (binary), and/or
  • Quintiles (5 categories) if HGS was also divided into quintiles

Again, comparing low/high proportions (or quintile distributions) across age groups via chi-square is common.

3) Socioeconomic status (SES) variables — a categorical “bundle”

The client only wrote “SES,” so specific variable names are not given. However, in KNHANES, SES-related variables are often categorical, such as:

  • income quantile group
  • education level
  • occupation class
  • marital status

It is very common to test whether these SES category distributions differ across age groups.

4) Quality of life (QoL) — likely categorical depending on how it was used

QoL can be continuous (e.g., an index score), but survey-based QoL measures (e.g., EQ-5D-type items) often include categorical responses (e.g., no problems / some problems / severe problems). Since the client explicitly referred to “categorical variables,” it is plausible that QoL was used in a categorical form (individual items or categorized scores).

Summary

In the client’s wording, “different categorical variables” most plausibly refers to:

  1. Physical activity classification variables (quintiles or low/high), and
  2. HGS classification variables (AWGS low/high or quintiles),

as the primary targets, and additionally:

  1. SES variables (commonly categorical in KNHANES), and
  2. QoL variables if represented as categorical items or categorized scores.

2. AWGS criteria (Asian Working Group for Sarcopenia)

AWGS (Asian Working Group for Sarcopenia) is a consensus guideline used to classify/diagnose sarcopenia in Asian populations (especially older adults).
Its core idea is to take continuous measurements (e.g., handgrip strength, gait speed, muscle mass) and apply clinical cut-offs to create categories such as low vs normal.

Typical AWGS 2019 cut-offs (examples)

  • Low muscle strength (Handgrip strength, HGS)
    • Men: < 28 kg
    • Women: < 18 kg
  • Low physical performance
    • 6-m gait speed < 1.0 m/s, or
    • SPPB ≤ 9, or
    • 5-times chair stand ≥ 12 s
  • Low muscle mass also has DXA/BIA-based cut-offs (used for full diagnosis).

The sentence:

“Physical activity levels and HGS were classified using both quintiles and the AWGS criteria …”

is most naturally read as:

The same variables were categorized in two different ways.

(1) Quintiles (Q1–Q5): distribution-based categorization

  • Split the sample into five equal-sized groups (20% each) based on the variable’s distribution.
  • Example: HGS quintiles, or physical activity (e.g., MET-min/week) quintiles.

Pros: easy “relative low vs high” comparisons within the sample
Cons: weaker direct clinical meaning (quintile cut-points are relative to the sample, not “at-risk” thresholds)

(2) AWGS criteria: guideline cut-off categorization

  • Use clinical cut-offs to create categories like low vs normal.
  • Example: HGS low/high using AWGS thresholds (men < 28 kg, women < 18 kg).

Pros: clinically interpretable (“below risk threshold or not”)
Cons: only possible if the needed measures exist (or if proxies are used)
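The two categorization schemes can be sketched side by side. The data below are simulated for illustration only (the sample size, distribution, and seed are all hypothetical); only the AWGS 2019 handgrip cut-offs (men < 28 kg, women < 18 kg) come from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical handgrip-strength values (kg) and sex, for illustration only
hgs = rng.normal(30, 8, size=1000)
sex = rng.choice(["M", "F"], size=1000)

# (1) Quintiles: distribution-based categories Q1..Q5 (relative low vs high)
edges = np.quantile(hgs, [0.2, 0.4, 0.6, 0.8])
quintile = np.digitize(hgs, edges) + 1          # values 1..5

# (2) AWGS 2019 cut-offs: clinical low-strength flag (sex-specific threshold)
cut = np.where(sex == "M", 28.0, 18.0)
low_strength = hgs < cut                        # True = below risk threshold
```

Note how the quintile labels depend entirely on the sample's own distribution, while the AWGS flag carries the same clinical meaning in any sample.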

3. Complex Samples Chi-square Test

A complex samples chi-square test is a design-based test of association (independence) between two categorical variables when the data come from a complex survey design (e.g., stratification, clustering/PSUs, and sampling weights), not from simple random sampling (SRS).

Core ideas (why it differs from the usual Pearson chi-square)

  • In complex surveys, observations are not i.i.d. due to weights and within-cluster correlation.
    → Using the usual Pearson chi-square test can produce incorrect p-values.

  • The solution is:

    • build a weighted contingency table, but

    • compute inference accounting for the survey design, typically via a Rao–Scott adjusted chi-square (often with an F transformation).

1) Starting point: weighted contingency table and Pearson chi-square

Let:

  • \(A \in \{1,\dots,R\}\), \(B \in \{1,\dots,C\}\) be categorical variables,

  • \(w_i\) be the survey weight for individual \(i\).

Weighted cell counts

\[ \widehat{N}_{rc} =\sum_{i=1}^{n} w_i \, \mathbf{1}(A_i=r, B_i=c). \]

Marginal totals: \[ \widehat{N}_{r+}=\sum_{c=1}^{C}\widehat{N}_{rc},\quad \widehat{N}_{+c}=\sum_{r=1}^{R}\widehat{N}_{rc},\quad \widehat{N}_{++}=\sum_{r=1}^{R}\sum_{c=1}^{C}\widehat{N}_{rc}. \]

Under the independence null hypothesis \(H_0: A \perp B\), the expected weighted counts are:

\[ \widehat{E}_{rc} =\frac{\widehat{N}_{r+}\widehat{N}_{+c}}{\widehat{N}_{++}}. \]

The (weighted) Pearson chi-square statistic is:

\[ X_P^2 =\sum_{r=1}^{R}\sum_{c=1}^{C} \frac{\left(\widehat{N}_{rc}-\widehat{E}_{rc}\right)^2}{\widehat{E}_{rc}}. \]
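The weighted table and \(X_P^2\) defined above can be computed directly. The data and weights below are simulated placeholders; this sketch produces only the point statistic, not valid design-based inference (that requires survey software).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
# Hypothetical data: 5 age groups (A), a binary variable (B), survey weights w
A = rng.integers(1, 6, size=n)
B = (rng.random(n) < 0.3 + 0.05 * A).astype(int) + 1   # categories 1 or 2
w = rng.uniform(0.5, 3.0, size=n)

R, C = 5, 2
# Weighted contingency table N_hat[r, c] = sum of weights in cell (r, c)
N = np.zeros((R, C))
for r in range(1, R + 1):
    for c in range(1, C + 1):
        N[r - 1, c - 1] = w[(A == r) & (B == c)].sum()

# Expected counts under independence, and the weighted Pearson statistic
E = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()
X_P2 = ((N - E) ** 2 / E).sum()
```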

Why \(X_P^2\) is not enough in complex surveys

In complex samples, \(X_P^2\) does not follow a standard chi-square distribution well because:

  • clustering induces correlation (reducing the effective sample size),

  • unequal weights increase variance.

2) Key adjustment: Rao–Scott correction (design effect)

To get valid inference, we estimate the covariance of estimated proportions (or cell proportions) reflecting stratification/clustering/weights, e.g. using:

  • Taylor linearization, or

  • replicate weights (jackknife, BRR, bootstrap).

Rao–Scott adjustment modifies the Pearson statistic to reflect the design effect (variance inflation due to the survey design). A common first-order intuition is:

\[ X_{RS}^2 =\frac{X_P^2}{\widehat{c}}, \] where \(\widehat{c}\) acts like an “average design effect / inflation factor,” shrinking \(X_P^2\) because the true variance is larger under complex sampling.

Many software packages (e.g., SPSS Complex Samples, R survey) often convert the adjusted statistic into an F statistic for p-values:

  • Numerator df: \((R-1)(C-1)\)

  • Denominator df: approximately “#PSUs − #strata” (design-based degrees of freedom; exact details depend on implementation)
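A numerical sketch of the first-order adjustment and the F conversion follows. The statistic, design effect, and PSU/stratum counts below are made-up placeholders; in practice \(\widehat{c}\) and the design df come from the survey software, not hand-picked values.

```python
import scipy.stats as st

# Toy inputs: a weighted Pearson statistic and an assumed average design
# effect c_hat > 1 (variance inflation) -- both hypothetical
X_P2 = 25.0
c_hat = 1.8
R, C = 5, 2

X_RS2 = X_P2 / c_hat                 # first-order Rao-Scott adjustment
ndf = (R - 1) * (C - 1)              # numerator df = (R-1)(C-1)
ddf = 600 - 20                       # e.g., #PSUs - #strata (hypothetical)
F = X_RS2 / ndf                      # F-transformed statistic
p = st.f.sf(F, ndf, ddf)             # design-based p-value
```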

3) One-line summary

A complex samples chi-square test is an independence test on a weighted contingency table, where the p-value is computed using a design-based adjustment (typically Rao–Scott adjusted chi-square or its F-equivalent) to account for stratification, clustering, and weights.

4. Bonferroni test in logistic regression

In this context, “post hoc logistic regression Bonferroni test” does not mean that logistic regression has a special standalone test called “the Bonferroni test.”
It means:

  • the researcher ran multiple post hoc comparisons using logistic regression, and
  • they adjusted the resulting p-values (or significance level / confidence intervals) using the Bonferroni correction to control inflation of Type I error due to multiple testing.

1) Why it is needed: the multiple-comparisons problem

Suppose there are 5 age groups. If we want to find which groups differ, we typically run several comparisons:

  • Reference-group comparisons: choose one reference group and compare it to the other 4
    • number of tests: \(m=4\)
  • All pairwise comparisons: compare every pair of groups
    • number of tests: \(m=\binom{5}{2}=10\)

If we test each comparison at level \(\alpha\) without adjustment, the probability of getting at least one false positive (family-wise error rate, FWER) increases as \(m\) grows.
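For the five age groups in this study, the two comparison counts work out as:

```python
from math import comb

groups = 5
m_reference = groups - 1         # one reference group vs each of the others
m_pairwise = comb(groups, 2)     # every unordered pair of groups
```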

2) Bonferroni correction: the key idea

Bonferroni controls FWER at \(\alpha\) by making each individual test more stringent.

(a) Adjust the significance level

Use: \[ \alpha^* = \frac{\alpha}{m}. \]

(b) Equivalent p-value adjustment

If the original p-value for comparison \(j\) is \(p_j\), the Bonferroni-adjusted p-value is: \[ p_j^{(\mathrm{Bonf})} = \min\{m\,p_j,\,1\}. \]

Declare significance if: \[ p_j^{(\mathrm{Bonf})} \le \alpha. \]
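The adjustment rule above is a one-liner in code; the helper below applies it to a set of arbitrary example p-values.

```python
def bonferroni(pvals, alpha=0.05):
    """Bonferroni-adjusted p-values and per-test decisions at FWER alpha."""
    m = len(pvals)
    adjusted = [min(m * p, 1.0) for p in pvals]     # p_j -> min(m * p_j, 1)
    significant = [p <= alpha for p in adjusted]
    return adjusted, significant

# Example: m = 4 comparisons with arbitrary raw p-values
adjusted, significant = bonferroni([0.004, 0.02, 0.2, 0.9])
```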

3) What is being tested multiple times in logistic regression?

In logistic regression, post hoc comparisons are typically tests of:

  • regression coefficients, or
  • linear contrasts (differences between coefficients), which correspond to group comparisons.

Example: let \(Y_i\) indicate whether a person has low HGS
(e.g., \(Y_i=1\) for “low”, \(Y_i=0\) for “normal”).
With age group \(G_i \in \{1,\dots,5\}\) and group 1 as reference:

\[ \mathrm{logit}\{\Pr(Y_i=1)\} =\beta_0 + \beta_2\,\mathbf{1}(G_i=2)+\cdots+\beta_5\,\mathbf{1}(G_i=5). \]

Then each \(\beta_k\) represents the log-odds difference between group \(k\) and the reference group 1, and the odds ratio is: \[ \mathrm{OR}_{k:1} = \exp(\beta_k). \]

Typical post hoc logistic + Bonferroni workflows are:

  • Reference vs others: test \(H_{0k}:\beta_k=0\) for \(k=2,\dots,5\)
    • \(m=4\) tests
  • All pairwise comparisons: test contrasts such as \[ H_{0(a,b)}:\beta_a - \beta_b = 0 \] for all pairs \((a,b)\)
    • \(m=10\) tests

Bonferroni adjustment is then applied to these multiple p-values (or contrasts).

4) Bonferroni-adjusted confidence intervals (optional but common)

Bonferroni can also adjust confidence intervals.

Instead of using \(z_{1-\alpha/2}\) for each interval, to achieve simultaneous coverage across \(m\) comparisons we use: \[ \widehat{\beta}_j \pm z_{1-\alpha/(2m)}\,\mathrm{SE}(\widehat{\beta}_j). \]

For odds ratios, exponentiate the endpoints: \[ \exp\Bigl(\widehat{\beta}_j \pm z_{1-\alpha/(2m)}\,\mathrm{SE}(\widehat{\beta}_j)\Bigr). \]
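Numerically, with a hypothetical coefficient and standard error for one of \(m=10\) pairwise comparisons:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical estimate and SE for one comparison, for illustration only
beta_hat, se, alpha, m = 0.45, 0.12, 0.05, 10

z_unadj = norm.ppf(1 - alpha / 2)           # ~1.96, per-interval coverage
z_bonf = norm.ppf(1 - alpha / (2 * m))      # larger critical value -> wider CI
ci_or = np.exp([beta_hat - z_bonf * se,     # simultaneous CI on the OR scale
                beta_hat + z_bonf * se])
```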

One-sentence summary

“Post hoc logistic regression Bonferroni test” means the researcher performed multiple age-group comparisons using logistic regression (via coefficients or contrasts) and controlled the inflated false-positive risk by applying the Bonferroni correction (either \(\alpha/m\) or \(m p_j\) adjustments).


III. Answer

1) Bottom line first: their procedure can be valid, conditionally

What the client did:

  1. Run a complex-samples chi-square test (Rao–Scott family) to test the overall association between age group (5 levels) and a categorical outcome \(Y\).

  2. If significant, run post hoc comparisons using logistic-regression-based methods (to see which age groups differ).

  3. Apply Bonferroni (or Holm, etc.) to adjust for multiple comparisons.

Conceptually, this is a reasonable workflow.

However, for it to be truly correct, both of the following must hold:

  • The post hoc logistic regression must also account for the complex survey design (weights, strata, clusters/PSUs).
    Otherwise, step (1) is design-based but step (2) ignores the design → inconsistent SEs / p-values.

  • The regression model must match the number of categories / ordinality of \(Y\).

    • If \(Y\) is binary → logistic regression
    • If \(Y\) has \(3+\) categories → multinomial or ordinal logistic regression (as appropriate)

2) Step 1: What does the complex-samples chi-square test actually test?

Let age group be \(G \in \{1,\dots,5\}\) and a categorical variable be \(Y \in \{1,\dots,C\}\).

Null hypothesis (independence):

\[ H_0:\Pr(Y=c \mid G=g)=\Pr(Y=c)\quad \forall g,c. \]

With survey weights \(w_i\), form the weighted contingency table:

\[ \widehat{N}_{gc}=\sum_{i=1}^{n} w_i\,\mathbf{1}(G_i=g,\;Y_i=c). \]

Compute expected counts under independence:

\[ \widehat{E}_{gc}=\frac{\widehat{N}_{g+}\widehat{N}_{+c}}{\widehat{N}_{++}}. \]

Instead of treating the usual Pearson statistic as chi-square, complex-sample software uses a Rao–Scott adjusted chi-square (or an F-transformation) so that:

  • the variance / standard errors reflect stratification + clustering + weighting, and
  • the reference distribution / degrees of freedom are design-based.

3) Step 2 (key): What is the “proper” way to do post hoc logistic regression + Bonferroni?

In practice, the cleanest and easiest-to-defend approach is often to use one survey-weighted regression framework that handles:

  • a global test (overall age-group effect), and then
  • post hoc contrasts (pairwise comparisons),

all within the same design-based model (rather than “chi-square first, then regression”).

3.1 If \(Y\) is binary (e.g., low HGS vs normal): survey-weighted logistic regression

Model:

\[ \mathrm{logit}\{\Pr(Y_i=1 \mid G_i)\} =\beta_0+\sum_{g=2}^{5}\beta_g\,\mathbf{1}(G_i=g). \]

If group 1 is the reference, then \(\exp(\beta_g)\) is the odds ratio for “group \(g\) vs group 1”.

(A) Global test (overall age effect)

\[ H_0:\beta_2=\beta_3=\beta_4=\beta_5=0. \]

A typical design-based Wald form is:

\[ W=\widehat{\beta}^{\top}\Bigl(\widehat{\mathrm{Var}}(\widehat{\beta})\Bigr)^{-1}\widehat{\beta}, \]

where \(\widehat{\mathrm{Var}}(\widehat{\beta})\) is the survey-design-based (sandwich) variance.
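The quadratic form is straightforward to evaluate once \(\widehat{\beta}\) and its covariance are in hand. The numbers below are stand-ins (a real analysis would plug in the survey-design-based covariance and, in many packages, refer \(W\) to an F distribution with design-based df rather than the chi-square reference used here):

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical estimates for beta_2..beta_5 and a stand-in covariance matrix
beta = np.array([0.30, 0.55, 0.90, 1.20])
V = np.diag([0.012, 0.011, 0.013, 0.015])   # placeholder for a sandwich estimate

W = beta @ np.linalg.inv(V) @ beta          # Wald statistic for H0: beta = 0
p = chi2.sf(W, df=len(beta))                # chi-square reference, 4 df
```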

(B) Post hoc comparisons (pairwise contrasts)

To compare age groups \(a\) and \(b\), test a contrast such as:

\[ H_0:\beta_a-\beta_b=0. \]

You obtain p-values \(p_j\) for \(j=1,\dots,m\) comparisons, then adjust for multiplicity.

Bonferroni-adjusted p-values:

\[ p_j^{(\mathrm{Bonf})}=\min\{m\,p_j,\,1\}. \]

Equivalently, adjust the per-test significance level:

\[ \alpha^{*}=\frac{\alpha}{m}. \]
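A single pairwise contrast can be tested directly from the fitted coefficients and their covariance, since \(\mathrm{Var}(\widehat\beta_a-\widehat\beta_b)=V_{aa}+V_{bb}-2V_{ab}\). The numbers below are illustrative; the \(V\) entries would come from the survey-design-based covariance matrix.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical coefficients and covariance entries for groups a and b
b_a, b_b = 0.90, 0.30
V_aa, V_bb, V_ab = 0.013, 0.012, 0.004

se = np.sqrt(V_aa + V_bb - 2 * V_ab)    # SE of the contrast beta_a - beta_b
z = (b_a - b_b) / se                    # Wald z for H0: beta_a - beta_b = 0
p = 2 * norm.sf(abs(z))                 # two-sided p-value
p_bonf = min(10 * p, 1.0)               # Bonferroni over m = 10 pairs
```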

3.2 If \(Y\) has \(3+\) categories: multinomial or ordinal survey regression

Running separate binary logistic models for each category is often messy and can complicate error control.

  • If categories are nominal (no order): multinomial logistic (with baseline category \(c=1\))

\[ \log\frac{\Pr(Y_i=c \mid G_i)}{\Pr(Y_i=1 \mid G_i)} =\alpha_c+\sum_{g=2}^{5}\beta_{cg}\,\mathbf{1}(G_i=g),\quad c=2,\dots,C. \]

  • If categories are ordered: consider an ordinal (cumulative logit) model.

Global tests are still “all age effects are zero” (a multivariate Wald test), and post hoc comparisons are still contrast tests of the form \(L\beta=0\), followed by Bonferroni/Holm adjustments.


4) Conclusion

  • “Your overall workflow—testing the age-group association first using a complex-samples chi-square test and then conducting post hoc comparisons—is conceptually acceptable.”

  • “However, the post hoc logistic regression must also reflect the complex survey design (weights, strata, clusters). If step 1 is design-based but step 2 ignores the design, standard errors and p-values may not be consistent.”

  • “Also, logistic regression is appropriate only when the outcome is binary. If the outcome has three or more categories, multinomial or ordinal logistic regression is typically the standard approach.”

  • “Bonferroni adjustment is a valid (but conservative) way to control family-wise error. If the number of comparisons is large, Holm or FDR-based options can be considered depending on the study goal.”