Frequently, analysts wish to produce estimates for certain demographic subgroups of interest, such as a particular age range or sex. (Such subgroups may also be referenced in the survey literature and software documentation as "subpopulations," "domains," or "subdomains.") The calculations to generate point estimates for statistics such as means, percentages, and totals require only the observations that are within your subgroup of interest. However, to properly estimate the variance of these statistics with Taylor series linearization, your statistical software requires all observations with a non-zero value for your weight variable, as well as an indicator for which records are in your subpopulation of interest. For example, to estimate mean body mass index (BMI) and its standard error for men aged 20 and over, the entire dataset of examined individuals who have an exam weight, including females and those younger than 20 years, must be read into the statistical software procedure.
You may be wondering why the variance estimation requires information on records that are not in your subpopulation of interest. The reason is that the subpopulation sample size within each PSU is a random variable. Conceptually, if samples were drawn repeatedly using the original complex survey design, the number of sampled persons in your subpopulation of interest within each PSU would vary from sample to sample. The variance estimation takes this sample-to-sample variability in the subpopulation sample size into account when calculating the variability of the estimated statistic (e.g., the standard error of the mean BMI for men aged 20 and over). If you were to subset your dataset and keep only the records that are in your subpopulation, the variance estimation formula would effectively treat the subpopulation sample size as fixed. This can underestimate the variability of the estimated mean BMI. For more information, refer to the references listed below or a textbook on survey statistics.
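To make this concrete, the following is a minimal Python sketch, not NHANES code. It applies a simplified with-replacement Taylor linearization formula for a weighted mean to a made-up sample with two strata of three PSUs each. The function name, record layout, weights, and BMI values are all assumptions chosen for illustration. The variance is computed once with every record retained plus a 0/1 subpopulation flag, and once after dropping the records outside the subpopulation.

```python
from collections import defaultdict


def linearized_var_of_mean(records):
    """Weighted mean and its with-replacement linearized variance.

    Each record is (stratum, psu, weight, y, in_subpop). Records with
    in_subpop == 0 add zero to the linearized PSU totals, but their
    PSUs still count toward the number of PSUs in the stratum.
    """
    wsum = sum(w * d for _, _, w, _, d in records)
    mean = sum(w * d * y for _, _, w, y, d in records) / wsum

    # PSU totals of the linearized variable z = w * d * (y - mean) / wsum
    psu_totals = defaultdict(lambda: defaultdict(float))
    for h, j, w, y, d in records:
        psu_totals[h][j] += w * d * (y - mean) / wsum

    var = 0.0
    for h, psus in psu_totals.items():
        n_h = len(psus)
        if n_h < 2:
            raise ValueError(f"stratum {h} has fewer than two PSUs")
        zbar = sum(psus.values()) / n_h
        var += n_h / (n_h - 1) * sum((z - zbar) ** 2 for z in psus.values())
    return mean, var


# Made-up data: (stratum, psu, weight, bmi, in_subpop).
# PSU 3 of stratum 1 contains no subpopulation members.
records = [
    (1, 1, 100, 30, 1), (1, 1, 120, 31, 1), (1, 1, 110, 22, 0),
    (1, 2, 90, 29, 1), (1, 2, 100, 24, 0),
    (1, 3, 105, 29, 0), (1, 3, 95, 23, 0),
    (2, 1, 130, 26, 1), (2, 1, 80, 26, 0),
    (2, 2, 110, 25, 1), (2, 2, 100, 27, 1),
    (2, 3, 120, 24, 1), (2, 3, 90, 21, 0),
]

# Proper approach: keep every record and rely on the flag.
mean_all, var_all = linearized_var_of_mean(records)

# Improper approach: drop records outside the subpopulation first.
subset = [r for r in records if r[4] == 1]
mean_sub, var_sub = linearized_var_of_mean(subset)

print(f"full data: mean={mean_all:.3f}  SE={var_all ** 0.5:.3f}")
print(f"subset:    mean={mean_sub:.3f}  SE={var_sub ** 0.5:.3f}")
```

Both calls return the same point estimate. In this made-up sample the subset call returns the smaller standard error, because the third PSU of stratum 1 has no subpopulation members and drops out of the subsetted data, so the formula sees two PSUs in that stratum instead of three. Other data can produce discrepancies of a different size.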
WARNING
As a general rule when working with complex survey data such as NHANES, you should never drop records from your analysis dataset before executing your analysis procedures. Instead, use the special statements provided in your software's analysis procedure to perform subgroup analyses.
It is important that you do not create a smaller subset of your data based on any non-weight-related groups of interest (e.g., demographic, laboratory, or examination variables) before executing your analysis procedures. For example, you should not create a subset of your data in the SAS data step before executing a SUDAAN procedure. Instead, it is highly recommended that you specify your subgroup of interest using the subpopx statement in the SUDAAN procedure itself. It may be helpful to create a binary indicator variable to define your population of interest in your SAS data step, which you can then use in the subpopx statement.
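The indicator-variable idea is independent of any particular package. As a rough illustration in Python (not SAS or SUDAAN syntax), the sketch below adds a 0/1 flag for men aged 20 and over to every record without dropping any rows. The field names, the helper name, and the values are hypothetical and are not NHANES variable names.

```python
# Hypothetical records; field names are illustrative only.
people = [
    {"sex": "M", "age": 34, "exam_weight": 100.0, "bmi": 30.0},
    {"sex": "F", "age": 41, "exam_weight": 110.0, "bmi": 22.0},
    {"sex": "M", "age": 12, "exam_weight": 95.0, "bmi": 19.0},
    {"sex": "M", "age": 67, "exam_weight": 120.0, "bmi": 26.0},
]


def flag_subpopulation(rows):
    """Return copies of the rows with a 0/1 in_subpop flag; no row is dropped."""
    return [
        dict(r, in_subpop=int(r["sex"] == "M" and r["age"] >= 20))
        for r in rows
    ]


flagged = flag_subpopulation(people)

# The point estimate uses only flagged records, but the full dataset
# (with the flag) is what would be handed to the survey procedure.
numerator = sum(r["exam_weight"] * r["bmi"] * r["in_subpop"] for r in flagged)
denominator = sum(r["exam_weight"] * r["in_subpop"] for r in flagged)
subpop_mean = numerator / denominator

print(len(flagged), "records kept;", sum(r["in_subpop"] for r in flagged), "in subpopulation")
print(f"weighted mean BMI in subpopulation: {subpop_mean:.2f}")
```

The flag plays the role of the binary indicator described above: the dataset keeps its full length, and the subgroup is identified by a variable rather than by deleting rows.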
Each software package that can analyze complex survey data has commands to produce subgroup estimates while properly accounting for the survey design. The following table presents a summary of proper and improper approaches to subgroup analysis in several software packages. See the section below, "Degrees of Freedom for Subgroup Analysis in NHANES," for additional considerations in subgroup analysis.
Summary of proper and improper approaches to subgroup analysis in selected statistical software:
| Software | Improper approaches | Proper approaches |
| --- | --- | --- |
| SUDAAN | Subsetting your dataset in SAS before executing SUDAAN procedures | SUBPOPN or SUBPOPX statement; TABLES statement to produce cross-tabulations in PROC DESCRIPT |
| SAS Survey | Subsetting your dataset before executing SAS Survey procedures; using a where or if data set option on your input dataset in a SAS Survey procedure; using a BY statement to produce a separate estimate for each level of the "by" variable | DOMAIN statement (procedures other than SURVEYFREQ); TABLES statement (PROC SURVEYFREQ) |
| Stata | Dropping observations from your dataset; using if or in options to subset your data during an estimation command | subpop option in the svy prefix command; over() option in the svy:mean, svy:proportion, svy:ratio, and svy:total commands (to request estimates at multiple levels of a categorical variable) |
| R ("survey" package) | Subsetting your data frame before defining your survey design object | Subsetting your survey design object to preserve the original survey design information; svyby function |
References
West BT, Berglund P, Heeringa SG. "A closer examination of subpopulation analysis of complex-sample survey data." Stata Journal. 2008;8(4):520-531.
Graubard BI, Korn EL. "Survey inference for subpopulations." Am J Epidemiol. 1996;144(1):102-106.