 # Module 4: Variance Estimation

This module introduces the basic concepts of variance (sampling error) estimation for NHANES data. You will learn how the complex survey design of NHANES and clustering of the data affect variance estimation, which methods are appropriate to use when calculating variance for NHANES data, how to properly calculate the variance for subgroups of interest, and how to specify the sampling design parameters in common statistical software packages.

In general, using statistical weights that reflect the probability of selection and propensity of response for sampled individuals will affect parameter estimates, while incorporating the attributes of the complex sample design (i.e. differential weighting, clustering and stratification) will affect variance estimates (estimated standard errors and thereby test statistics and confidence intervals).

## IMPORTANT NOTE

For NHANES datasets, the use of sampling weights and sample design variables is necessary to obtain unbiased estimates and accurate standard errors and confidence intervals.

## NHANES survey design affects variance estimates

As stated in the module on sampling in NHANES, NHANES has a complex, multistage, probability cluster design. Typically, individuals within a cluster (e.g., county, school, city, census block) are more similar to one another than to those in other clusters, and this homogeneity of individuals within a given cluster is measured by the intracluster correlation. When designing a survey with a complex sample, you ideally want to decrease the amount of correlation between sample persons within clusters. To achieve this, you want to sample fewer people within each cluster but sample more clusters. However, NHANES can only sample 30 Primary Sampling Units (PSUs) within a 2-year survey cycle because of operational limitations such as the cost of moving the survey's mobile examination centers (MECs) and the geographic distances between PSUs. The sample size in each PSU is roughly equal, and the design is intended to yield about 5,000 examined persons per year.
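The trade-off between cluster size and number of clusters can be illustrated with Kish's widely used approximation, DEFF ≈ 1 + (m - 1)ρ, where m is the number of sample persons per cluster and ρ is the intracluster correlation. This approximation is not part of the tutorial itself and assumes equal-sized clusters and equal weights. The function name, the value of ρ, and the cluster sizes below are hypothetical, chosen only for illustration.

```python
def approx_deff(cluster_size: float, icc: float) -> float:
    """Kish's approximate design effect for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1.0 + (cluster_size - 1.0) * icc


# Hypothetical values: the same total sample of 9,000 persons, allocated two ways.
ICC = 0.02  # assumed intracluster correlation, for illustration only

few_large = approx_deff(cluster_size=300, icc=ICC)   # 30 clusters of 300 persons
many_small = approx_deff(cluster_size=100, icc=ICC)  # 90 clusters of 100 persons

print(f"30 clusters x 300 persons: approximate DEFF = {few_large:.2f}")
print(f"90 clusters x 100 persons: approximate DEFF = {many_small:.2f}")
```

Under these assumed values, spreading the same number of persons over more, smaller clusters gives a noticeably smaller approximate design effect, which is the motivation for sampling more clusters when operational constraints allow.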

For a complex sample survey such as NHANES, variance estimates computed using standard statistical software packages that assume simple random sampling are generally too low (i.e., significance levels are overstated) and biased because they do not account for the differential weighting and the correlation among sample persons within a cluster. There is a loss of precision and a reduction in the effective sample size because individuals are chosen within clusters instead of being sampled randomly throughout the population.

## WARNING

Standard statistical software packages that assume simple random sampling calculate variance estimates that are generally too low and biased because they do not account for differential weighting and the correlation among sample persons within a cluster.

The impact of the complex sample design upon variance estimates is measured by the design effect (DEFF). It is defined as the ratio of the variance of a statistic which accounts for the complex sample design to the variance of the same statistic based on a hypothetical simple random sample of the same size.

If the DEFF is 1, the variance for the estimate under the cluster sampling is the same as the variance under simple random sampling. The DEFFs for NHANES are typically greater than 1.

When the DEFF is greater than 1, the effective sample size is less than the number of sample persons but greater than the number of clusters. The effective sample size is calculated by dividing the sample size in a subgroup by the DEFF. The design effect is an attribute of a statistic calculated on a particular variable, rather than of the overall NHANES survey cycle. The DEFF can be very different for different variables due to differences in variation by geography, in household intraclass correlation, and in demographic heterogeneity. Design effects for a variable can also differ across demographic subgroups (e.g., race and Hispanic origin or age groups).
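The two definitions above (DEFF as a ratio of variances, and effective sample size as the sample size divided by DEFF) can be written out as a short sketch. The function names and all numeric inputs are hypothetical and are not NHANES results.

```python
def design_effect(var_complex: float, var_srs: float) -> float:
    """Ratio of the design-based variance to the simple-random-sample variance."""
    return var_complex / var_srs


def effective_sample_size(n: int, deff: float) -> float:
    """Sample size divided by the design effect."""
    return n / deff


# Hypothetical standard errors for the same statistic in the same subgroup.
se_complex = 0.30  # accounts for weighting, clustering, and stratification
se_srs = 0.20      # assumes simple random sampling
n_subgroup = 2000  # hypothetical number of sample persons in the subgroup

deff = design_effect(se_complex ** 2, se_srs ** 2)
n_eff = effective_sample_size(n_subgroup, deff)

print(f"DEFF = {deff:.2f}, effective sample size = {n_eff:.0f}")
```

With these made-up inputs the DEFF is 2.25, so 2,000 sample persons carry roughly the information of 889 persons drawn by simple random sampling.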

## IMPORTANT NOTE

Statistical software that accounts for the sampling design effect must be used to calculate an asymptotically unbiased estimate of the variance and should be used for all statistical tests and the construction of confidence limits. These procedures require information on the first stage of the sample design (identification of the PSU and stratum) for each sample person.

Reference

Park I, Lee H. "Design effects for the weighted mean and total estimators under complex survey sampling." Survey Methodology. 2004;30:183-193.

## Brief description of variance estimation procedures used with NHANES data

Variance of estimates (sampling errors) should be calculated for all survey estimates to aid in determining statistical reliability. For complex sample surveys, exact mathematical formulas for variance estimates are usually not available. Variance approximation procedures are required to provide reasonable, approximately unbiased, and design-consistent estimates of sampling error. Two variance approximation procedures which account for the complex sample design and compute design effects are replication methods and Taylor Series Linearization.

Currently, NCHS recommends the use of Taylor Series Linearization methods for variance estimation in all NHANES surveys, and this tutorial and the example code provide assistance on the linearization method only. SUDAAN, Stata, SAS Survey procedures, SPSS, and R can be used to obtain variance estimates by this method. Survey design variables identifying strata and PSU are required in order to utilize these software packages.

Initially, for the NHANES 1999-2000 survey, the delete-one jackknife (replication) method was used to estimate variances, and these weights are available on the public-use file for that cycle. Balanced repeated replication was used for NHANES III. If replication methods are used for other survey cycles, you must compute your own replicate weights.

## Taylor Linearization Procedures

To use the linearization method, information about the first stage of the sample design (strata and PSU variables) must be available on the survey data file. The "true" design variables are not released on the public-use data files in order to protect the confidentiality of information provided by survey participants and to reduce disclosure risks associated with a two-year data release. Instead, Masked Variance Units (MVUs) were created and provided on the demographic data files for each survey cycle. These MVUs produce variance estimates that closely approximate the variances that would have been estimated using the true design variables, and should be used for all analyses on public release data. The variable name for the masked variance unit pseudo-stratum is sdmvstra and the variable name for the masked variance unit pseudo-PSU is sdmvpsu.

As described in Module 2: Sample Design, the first stage of NHANES sampling selects PSUs from strata. This can be treated as sampling "with replacement" because the sampling fraction (the number of PSUs selected compared to the total number of PSUs within each stratum) is small. Therefore, the finite population correction factor (1 minus the sampling fraction) is close to 1 and has a negligible effect on the formula for the design-based estimate of variance.
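As a quick numeric illustration of why the correction can be ignored, the sketch below computes the finite population correction for a small sampling fraction. The counts are hypothetical and are not actual NHANES stratum sizes; the function name is introduced here only for illustration.

```python
from math import sqrt


def finite_population_correction(n_selected: int, n_in_stratum: int) -> float:
    """Finite population correction factor: 1 minus the sampling fraction."""
    return 1.0 - n_selected / n_in_stratum


# Hypothetical stratum: 2 PSUs selected out of 200 available PSUs.
fpc = finite_population_correction(n_selected=2, n_in_stratum=200)

# The variance would be multiplied by fpc, so the standard error is
# multiplied by its square root.
se_multiplier = sqrt(fpc)

print(f"fpc = {fpc:.3f}, standard error multiplier = {se_multiplier:.4f}")
```

With a sampling fraction of 1 percent, applying the correction would change the standard error by about half a percent, which is why treating the first stage as "with replacement" is a reasonable simplification when the fraction is small.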

Frequently, analysts wish to produce estimates for certain demographic subgroups of interest, such as a particular age range or gender. (Such subgroups may also be referenced in the survey literature and software documentation as "subpopulations," "domains," or "subdomains.") The calculations to generate point estimates for statistics such as means, percentages, and totals require only the observations that are within your subgroup of interest. However, to properly estimate the variance of these statistics with Taylor series linearization, your statistical software requires all observations with a non-zero value for your weight variable, as well as an indicator for which records are in your subpopulation of interest. For example, to estimate mean body mass index (BMI) and its standard error for men aged 20 and over, the entire dataset of examined individuals who have an exam weight, including females and those younger than 20 years, must be read into the statistical software procedure.

You may be wondering why the variance estimation requires information on records that are not in your subpopulation of interest. The subpopulation sample size within each PSU is actually a random variable. Conceptually, if samples were drawn repeatedly using the original complex survey design, the number of sampled persons in your subpopulation of interest within each PSU would vary somewhat from sample to sample. The variance estimation takes into account this sample-to-sample variability in the subpopulation sample size in calculating the variability of the estimated statistic (e.g. the standard error of the mean BMI for men aged 20 and over). If you were to subset your dataset and only keep the records that are in your subpopulation, then the variance estimation formula would effectively treat the subpopulation sample size as if it were fixed. This would underestimate the variability of the estimated mean BMI. For more information, refer to the suggested sources or a textbook on survey statistics.
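The effect of subsetting can be seen in a small sketch of the with-replacement Taylor series linearization estimator for a weighted mean. The function below is an illustrative pure-Python implementation, not the code used by any of the software packages discussed in this module, and the dataset is invented: its strata, PSUs, weights, and BMI values are all hypothetical. One PSU contains no one in the subgroup, which is where the two approaches diverge.

```python
from collections import defaultdict
from math import sqrt


def linearized_mean_se(records, in_domain):
    """Weighted domain mean and its with-replacement Taylor linearization SE."""
    domain = [r for r in records if in_domain(r)]
    weight_sum = sum(r["weight"] for r in domain)
    mean = sum(r["weight"] * r["bmi"] for r in domain) / weight_sum

    # Sum the linearized values within each PSU. Records outside the domain
    # contribute zero, but they still register their PSU in the design.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for r in records:
        z = r["weight"] * (r["bmi"] - mean) / weight_sum if in_domain(r) else 0.0
        psu_totals[r["stratum"]][r["psu"]] += z

    variance = 0.0
    for totals in psu_totals.values():
        n_psu = len(totals)
        if n_psu < 2:
            raise ValueError("each stratum needs at least two PSUs")
        center = sum(totals.values()) / n_psu
        variance += n_psu / (n_psu - 1) * sum((t - center) ** 2 for t in totals.values())
    return mean, sqrt(variance)


def rec(stratum, psu, bmi, in_group):
    """Build one hypothetical record with a weight of 1."""
    return {"stratum": stratum, "psu": psu, "weight": 1.0, "bmi": bmi, "in_group": in_group}


def is_in_group(record):
    return record["in_group"]


# Hypothetical sample: 2 strata, 5 PSUs; in_group marks the subgroup of interest.
sample = [
    rec(1, 1, 30, True), rec(1, 1, 30, True), rec(1, 1, 24, False),
    rec(1, 2, 28, True), rec(1, 2, 32, True),
    rec(1, 3, 26, False), rec(1, 3, 23, False),  # PSU with nobody in the subgroup
    rec(2, 1, 20, True), rec(2, 1, 22, True), rec(2, 1, 27, False),
    rec(2, 2, 18, True), rec(2, 2, 20, True),
]

# Proper approach: keep every record and flag the subgroup.
mean_full, se_full = linearized_mean_se(sample, is_in_group)

# Improper approach: drop records outside the subgroup first.
subset_only = [r for r in sample if r["in_group"]]
mean_subset, se_subset = linearized_mean_se(subset_only, is_in_group)

print(f"full design: mean = {mean_full:.2f}, SE = {se_full:.3f}")
print(f"subset only: mean = {mean_subset:.2f}, SE = {se_subset:.3f}")
```

In this toy dataset both approaches give the same mean, but dropping records first removes the third PSU of stratum 1 from the calculation and shrinks the standard error from about 1.35 to 0.50. The size of the difference depends on the data, so these numbers are illustrative only.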

## WARNING

As a general rule when working with complex survey data such as NHANES, you should never drop records from your analysis dataset before executing your analysis procedures. Instead, use the special statements provided in your software's analysis procedure to perform subgroup analyses.

It is important that you do not create a smaller subset of your data based on any non-weight-related groups of interest (e.g., demographic, laboratory, or examination variables) before executing your analysis procedures. For example, you should not create a subset of your data in the SAS data step before executing a SUDAAN procedure. Instead, it is highly recommended that you specify your subgroup of interest using the SUBPOPX statement in the SUDAAN procedure itself. It may be helpful to create a binary indicator variable to define your population of interest in your SAS data step, which you can then use in the SUBPOPX statement.

Each software package that can analyze complex survey data has commands to produce subgroup estimates while properly accounting for the survey design. The following table presents a summary of proper and improper approaches to subgroup analysis in several software packages. See the section below, "Degrees of Freedom when Analyzing Subgroups in NHANES," for additional considerations in subgroup analysis.

Summary of proper and improper approaches to subgroup analysis in selected statistical software:

| Software | Improper approaches | Proper approaches |
| --- | --- | --- |
| SUDAAN | Subsetting your dataset in SAS before executing SUDAAN procedures | SUBPOPN or SUBPOPX statement; TABLES statement to produce cross-tabulations in PROC DESCRIPT |
| SAS Survey | Subsetting your dataset before executing SAS Survey procedures; using a where or if data set option on your input dataset in a SAS Survey procedure; using a BY statement to produce a separate estimate for each level of the "by" variable | DOMAIN statement (procedures other than SURVEYFREQ); TABLES statement (PROC SURVEYFREQ) |
| Stata | Dropping observations from your dataset; using if or in qualifiers to subset your data during an estimation command | subpop option in the svy prefix command; over() option in the svy:mean, svy:proportion, svy:ratio, and svy:total commands (to request estimates at multiple levels of a categorical variable) |
| R ("survey" package) | Subsetting your data frame before defining your survey design object | Subsetting your survey design object to preserve the original survey design information; svyby function |

References

West BT, Berglund P, Heeringa SG. "A closer examination of subpopulation analysis of complex-sample survey data." Stata Journal. 2008;8(4):520-531.

Graubard BI, Korn EL. "Survey inference for subpopulations." Am J Epidemiol. 1996;144(1):102-106.

## Degrees of Freedom for a Complex Survey

Continuous NHANES uses a complex, multistage probability sampling design. The number of independent pieces of information, or degrees of freedom, depends upon the number of PSUs rather than on the number of sample persons. Sample persons within a given PSU are not independent.

For a complex survey, the design degrees of freedom are calculated by subtracting the number of strata in the first stage of sampling from the number of clusters (PSUs) selected in the first stage of sampling:

Degrees of freedom = (number of PSUs) - (number of strata)

Most two-year public data releases of Continuous NHANES have 15 degrees of freedom (30 PSUs - 15 strata).

Degrees of freedom are needed to perform hypothesis tests and to compute confidence intervals. To calculate the correct value for the t-statistic from a t-distribution and a selected level of significance, you must calculate the proper degrees of freedom for the estimate.
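A minimal sketch of this calculation follows, using a hypothetical layout of 15 strata with 2 PSUs each. The function names, the estimate, and the standard error are made up for illustration. The critical value 2.131 is the 97.5th percentile of the t-distribution with 15 degrees of freedom taken from a standard t table; in practice it would come from your statistical software.

```python
def design_degrees_of_freedom(stratum_psu_pairs):
    """Number of distinct PSUs minus number of distinct strata."""
    unique_psus = set(stratum_psu_pairs)
    unique_strata = {stratum for stratum, _ in unique_psus}
    return len(unique_psus) - len(unique_strata)


def confidence_interval(estimate, standard_error, t_critical):
    """Two-sided interval: estimate plus or minus t_critical * standard_error."""
    half_width = t_critical * standard_error
    return estimate - half_width, estimate + half_width


# Hypothetical design: strata 1..15, each containing PSUs 1 and 2.
design = [(stratum, psu) for stratum in range(1, 16) for psu in (1, 2)]
df = design_degrees_of_freedom(design)

T_975_DF15 = 2.131  # t table value for 15 degrees of freedom (two-sided 95%)
lower, upper = confidence_interval(estimate=27.0, standard_error=0.5, t_critical=T_975_DF15)

print(f"degrees of freedom = {df}")
print(f"95% confidence interval = ({lower:.2f}, {upper:.2f})")
```

Note that the t critical value for 15 degrees of freedom is larger than the familiar normal value of 1.96, so an interval based on the design degrees of freedom is wider than one that treats every sample person as an independent observation.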

## Degrees of Freedom when Analyzing Subgroups in NHANES

Estimates are often calculated for various subgroups of interest within the total NHANES population. For an analysis on a subgroup, the degrees of freedom should be based on the number of strata and PSUs containing the observations of interest. When you analyze data on a subgroup of sample persons who may not be represented in all strata and PSUs (e.g., some racial and ethnic groups), those estimates would have fewer degrees of freedom, compared to estimates for the overall sample.

Software packages differ in how they define the degrees of freedom for subgroups, and many packages do not correct for the reduction in the degrees of freedom for subgroups where not all PSUs and strata are represented. Analysts should be aware of how the software package they are using determines the degrees of freedom. It may be necessary to output the number of PSUs and strata from the survey procedure to calculate the correct degrees of freedom. In SUDAAN, the DESCRIPT procedure allows users to output the number of strata and PSUs represented in the subpopulation. In other packages, the user may need to calculate the number of PSUs and strata separately. See the Sample Code page for examples of calculating the correct degrees of freedom for subgroups and using that information to calculate a confidence interval.
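When a package does not report the reduced degrees of freedom, the counts can be tallied separately. The sketch below counts only the strata and PSUs that contain at least one subgroup observation. The record layout, field names, and the pattern of subgroup coverage are hypothetical, and the code is not taken from the Sample Code page.

```python
def subgroup_degrees_of_freedom(records, in_subgroup):
    """Return (PSUs, strata, degrees of freedom) among subgroup observations only."""
    psus = {(r["stratum"], r["psu"]) for r in records if in_subgroup(r)}
    strata = {stratum for stratum, _ in psus}
    return len(psus), len(strata), len(psus) - len(strata)


def subgroup_present(stratum, psu):
    """Hypothetical coverage: full in strata 1-10, partial in 11-12, none in 13-15."""
    return stratum <= 10 or (stratum in (11, 12) and psu == 1)


# Hypothetical data: 15 strata x 2 PSUs. Every PSU has one record outside the
# subgroup, and some PSUs also have one record inside the subgroup.
records = []
for stratum in range(1, 16):
    for psu in (1, 2):
        records.append({"stratum": stratum, "psu": psu, "in_group": False})
        if subgroup_present(stratum, psu):
            records.append({"stratum": stratum, "psu": psu, "in_group": True})

n_psus, n_strata, df_subgroup = subgroup_degrees_of_freedom(records, lambda r: r["in_group"])
_, _, df_overall = subgroup_degrees_of_freedom(records, lambda r: True)

print(f"subgroup: {n_psus} PSUs - {n_strata} strata = {df_subgroup} degrees of freedom")
print(f"overall sample: {df_overall} degrees of freedom")
```

In this made-up example the subgroup appears in 22 PSUs across 12 strata, giving 10 degrees of freedom instead of the 15 available for the overall sample. The smaller value would then be used to look up the t critical value for the subgroup's confidence interval.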

## WARNING

Many software packages do not correct for the reduction in the degrees of freedom for subgroups where not all PSUs and strata are represented. Analysts should be aware of how the software package they are using determines the degrees of freedom.

To understand more about variance estimation methods you may wish to review the Analytic Guidelines on the NHANES web site; read the text by Korn and Graubard (Korn EL and Graubard BI. Analysis of Health Surveys. Wiley Series in Probability and Statistics. 1999. New York, New York.); or take a course in SUDAAN or complex survey sampling.

This section provides a brief overview of how to request the Taylor series linearization method, specify the survey design variables, and correctly calculate the variance for subpopulations of interest using SUDAAN. These code portions include only the statements required to account for the complex sample design of NHANES, and do not include all code required to request statistical estimates. See the sample code page for complete, specific examples. The software tips page contains additional helpful hints about each software package.

In SUDAAN, the user must specify the survey design variables within each procedure step. This example shows the SUDAAN procedure PROC DESCRIPT, but the same statements would be used in other procedures (e.g., PROC CROSSTAB and PROC REGRESS).

PROC SORT data=one;
BY sdmvstra sdmvpsu;
run;

PROC DESCRIPT data=one design=wr;
NEST sdmvstra sdmvpsu;
WEIGHT WTMEC4YR;
SUBPOPX ridageyr>=20;
* more statements...;
run;

Statements Explanation
PROC SORT data=one;
By sdmvstra sdmvpsu;
run;

Use the SAS procedure, proc sort, to sort the data by the design parameters, strata (sdmvstra) and primary sampling units (sdmvpsu), before running the procedure in SUDAAN.

PROC DESCRIPT data=one design=wr;

The design=wr option specifies that the sample design option is Taylor series linearization and that the first-stage sampling can be treated as "with replacement." (WR is the default design option in SUDAAN. If you omit the DESIGN= option from your PROC statement, SUDAAN assumes a WR design and Taylor series linearization.)

NEST sdmvstra sdmvpsu;

The NEST statement specifies that the first-stage sampling is described by the strata (variable sdmvstra) and PSU (variable sdmvpsu) variables.

WEIGHT WTMEC4YR;

The WEIGHT statement specifies the sampling weight to be used for the analysis.

See Module 3: Weighting for more discussion of how to select the correct weight for your analysis.

SUBPOPX ridageyr>=20;

The SUBPOPX statement specifies the subpopulation of interest. In this example, the subpopulation of interest is adults aged 20 and over.

Alternatively, the SUBPOPN statement does the same thing as SUBPOPX, but it has less flexibility in coding; SUBPOPX was added in SUDAAN 11.

* more statements...;
run;

This example only shows the statements required to specify the sample design parameters using SUDAAN procedures. Your code will include additional statements to request estimates for your analytic project.

SUDAAN syntax is described as of release 11.0.0, but the syntax may change in future releases. Review the documentation for the software version you are using for any changes.

This section provides a brief overview of how to request the Taylor series linearization method, specify the survey design variables, and correctly calculate the variance for subpopulations of interest using SAS Survey Procedures. These code portions include only the statements required to account for the complex sample design of NHANES, and do not include all code required to request statistical estimates. See the sample code page for complete, specific examples. The software tips page contains additional helpful hints about each software package.

SAS provides a number of survey analysis procedures that properly account for the sample design. The procedure names all start with SURVEY – such as SURVEYMEANS, SURVEYFREQ, SURVEYREG, and others. (Note that the Base SAS procedures – PROC MEANS, PROC FREQ, etc. – do not account for the complex sample design of NHANES.)

In SAS Survey procedures, the user must specify the survey design variables within each procedure step. This example shows PROC SURVEYMEANS, but the same statements would be used in most other survey procedures (e.g., PROC SURVEYREG and PROC SURVEYLOGISTIC).

PROC SURVEYMEANS data=one varmethod=taylor nomcar;
STRATA sdmvstra;
CLUSTER sdmvpsu;
WEIGHT WTMEC4YR;
DOMAIN Select;
* more statements...;
run;

Statements Explanation
PROC SURVEYMEANS data=one varmethod=taylor nomcar;

The varmethod=taylor option on the procedure statement specifies that the procedure should use Taylor series linearization for variance estimates. This is the default method if you do not specify the varmethod= option.

The nomcar option treats missing values in the variance computation as "not missing completely at random." Use of the nomcar option is recommended in SAS Survey procedures to obtain the correct standard error estimates.

STRATA sdmvstra;

The STRATA statement identifies the variables that form the strata.

CLUSTER sdmvpsu;

The CLUSTER statement identifies the variables that form the clusters (PSUs) in a clustered sample design such as NHANES. If both STRATA and CLUSTER statements are specified, then the SAS Survey procedure assumes the clusters are nested within strata (as is the case for NHANES.)

WEIGHT WTMEC4YR;

The WEIGHT statement specifies the sampling weight to be used for the analysis. See Module 3: Weighting for more discussion of how to select the correct weight for your analysis.

DOMAIN Select;

The DOMAIN statement specifies the subpopulation of interest. In this generic example, the variable Select would have been created in an earlier SAS data step as a binary indicator for whether the observation was in the subpopulation of interest, e.g. adults aged 20 and over.

Note that the DOMAIN statement is not valid syntax in the SURVEYFREQ procedure, and instead the TABLES statement may be used to specify subpopulations of interest. See the Software Tips page and the SAS documentation for more information.

 * more statements...;
run;

This example only shows the statements required to specify the sample design parameters using SAS Survey procedures. Your code will include additional statements to request estimates for your analytic project.

SAS syntax is described as of version 9.4, maintenance release 3 (SAS/STAT version 14.1), but the syntax may change in future releases. Review the documentation for the software version you are using for any changes.

This section provides a brief overview of how to request the Taylor series linearization method, specify the survey design variables, and correctly calculate the variance for subpopulations of interest using Stata. These code portions include only the statements required to account for the complex sample design of NHANES, and do not include all code required to request statistical estimates. See the sample code page for complete, specific examples. The software tips page contains additional helpful hints about each software package.

Stata's SVY commands are a series of commands specifically designed to analyze data from complex survey designs like NHANES. In Stata, the user must first declare the survey design for a dataset using the svyset command. Stata then remembers these survey design characteristics and applies them to every subsequent SVY command. (You can issue a new svyset command if you want to update the survey design specification within your session.) Generally, the survey analysis commands in Stata use syntax similar to that of the standard data analysis commands but require the svy: prefix, which adjusts the results for the survey design specified in the svyset command.

svyset [w=wtmec4yr], psu(sdmvpsu) strata(sdmvstra) vce(linearized)
svy, subpop(inAnalysis): mean Depression_Indicator

Statements Explanation
svyset [w=wtmec4yr], psu(sdmvpsu) strata(sdmvstra) vce(linearized)

The svyset command defines the weight, PSU, and strata variables.

The vce(linearized) option specifies that Taylor series linearization should be used for variance estimation. This is also the default method, if you do not specify the vce option.

svy, subpop(inAnalysis): mean Depression_Indicator

The second command requests the mean of the variable Depression_Indicator for the subpopulation of interest, accounting for the survey design previously specified by the svyset command.

The svy: prefix requests that the survey design be applied, and the subpop option restricts the analysis to the subpopulation of interest. In this generic example, the variable inAnalysis would have been created by an earlier Stata command as a binary indicator for whether the observation was in the subpopulation of interest, e.g. adults aged 20 and over.

Stata syntax is described as of version 15, but the syntax may change in future releases. Review the documentation for the software version you are using for any changes.

This section provides a brief overview of how to request the Taylor series linearization method, specify the survey design variables, and correctly calculate the variance for subpopulations of interest using R. These code portions include only the statements required to account for the complex sample design of NHANES, and do not include all code required to request statistical estimates. See the sample code page for complete, specific examples. The software tips page contains additional helpful hints about each software package.

The R "survey" package provides functions for analyzing data from complex surveys. In R, the user must use the svydesign function to create a "survey design object" that contains the data frame along with all the survey design information required to analyze it. This survey design object is then passed as an argument to the survey analysis functions.

NHANES_all <- svydesign(data=One, id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC4YR, nest=TRUE)
NHANES <- subset(NHANES_all, inAnalysis==1)
svymean(~Depression, NHANES)

Statements Explanation
NHANES_all <- svydesign(data=One, id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC4YR, nest=TRUE)

The svydesign command defines a survey design object called NHANES_all, which contains the data frame One (specified by the data=One argument) and the survey design information specified in the other arguments.

Note that the id (i.e. the PSU), strata, and weights arguments are specified as R formulas, which is why the tilde operator (~) is used.

The nest=TRUE option must be used for continuous NHANES data because the unique PSUs are identified by the combination of the strata and the PSU variables (i.e. the PSU identifiers reuse the same values for the PSUs within each stratum.)

NHANES <- subset(NHANES_all, inAnalysis==1)

The subset statement restricts your survey design object NHANES_all to a subpopulation (where inAnalysis is equal to 1) while keeping the original design information about the number of PSUs and strata, and creates a new survey design object named NHANES.

In this generic example, the variable inAnalysis would have been created by an earlier R command as a binary indicator for whether the observation was in the subpopulation of interest, e.g. adults aged 20 and over.

Note that this command calls the subset function for objects of class "survey.design" (subset.survey.design from the survey package) and is the recommended way to specify your analysis population. If you instead subset your data frame before defining your survey design object, you may produce incorrect variance estimates.

svymean(~Depression, NHANES)

The svymean command requests the mean and standard error of variable Depression for our subpopulation of interest, as defined in the survey design object NHANES.

R syntax is described as of version 3.5.2 and survey package version 3.35.1, but the syntax may change in future releases. Review the documentation for the software version you are using for any changes.