# Module 2: Sample Design

The NHANES samples are not simple random samples. Rather, a complex, multistage, probability sampling design is used to select participants representative of the civilian, non-institutionalized US population. Oversampling of certain population subgroups is also done to increase the reliability and precision of health status indicator estimates for these particular subgroups. Researchers need to take this into account in their analyses by appropriately specifying the sampling design parameters. This can be done using any statistical software that can analyze complex survey designs, such as SAS Survey procedures, SUDAAN, Stata, or R. This module provides an overview of the sample design parameters in NHANES.

## IMPORTANT NOTE

For NHANES datasets, the use of sampling weights and sample design variables is recommended for all analyses because the sample design is both a clustered design and incorporates differential probabilities of selection. If you fail to account for the sampling parameters, you may obtain biased estimates and overstate significance levels.

NHANES is designed to be representative of the civilian, non-institutionalized resident population of the United States. NHANES excludes all persons in supervised care or custody in institutional settings, all active-duty military personnel, active-duty family members living overseas, and any other U.S. citizens residing outside the 50 states and the District of Columbia. Non-institutional group quarters (such as college and university residence halls) are included in the survey. See the Analytic Guidelines, 1999-2010 for more details.

## NHANES Sampling Procedure

The NHANES sampling procedure consists of four stages, described below.

• Stage 1: Primary sampling units (PSUs) are selected. These are mostly single counties or, in a few cases, groups of contiguous counties. PSUs are selected with probability proportional to a measure of size (PPS). PPS sampling means that sampling units with larger populations are more likely to be selected than those with smaller populations. For NHANES, the "measure of size" (MOS) used for PPS is a weighted average of population counts, where the weights are calculated to give relatively higher probability of selection to PSUs with higher proportions of individuals within the demographic subgroups chosen for oversampling.

Some PSUs ("certainty PSUs") have an MOS large enough that they are selected into the sample with probability 1. The remaining non-certainty PSUs are partitioned into mutually exclusive strata, and PSUs are selected from each stratum. Stratification is used to increase the precision of survey estimates for subpopulations important to the survey's objectives.
• Stage 2: The sampled PSUs are divided into segments (generally city blocks or their equivalent). Like the PSUs, segments are selected with PPS.
• Stage 3: Dwelling units or households within each segment are listed, and a sample is randomly drawn. In geographic areas where the population has a greater proportion of a particular age, ethnic, or income group selected for oversampling, the probability of selection for those groups is greater than in other areas.
• Stage 4: Individuals are chosen to participate in NHANES from a list of all persons residing in selected households. Individuals are drawn at random within designated age-sex-race/ethnicity screening subdomains. On average, about 2 sample persons are selected per eligible household.
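The certainty-PSU idea in Stage 1 can be illustrated with a minimal Python sketch that computes PPS inclusion probabilities for a handful of made-up units. The MOS values, the sample size, and the function name `pps_inclusion_probabilities` are assumptions chosen for illustration, not NHANES parameters, and the sketch ignores stratification; the actual procedure is described in the Sample Design documents.

```python
def pps_inclusion_probabilities(mos, n_sample):
    """Return one inclusion probability per unit under a simple PPS scheme.

    Units whose proportional share would reach or exceed 1 are treated as
    certainty units (probability 1). The remaining sample slots are then
    shared among the other units in proportion to their measure of size.
    """
    probs = [None] * len(mos)
    remaining = set(range(len(mos)))
    slots = n_sample
    while True:
        total = sum(mos[i] for i in remaining)
        certainty = [i for i in remaining if slots * mos[i] / total >= 1.0]
        if not certainty:
            break
        for i in certainty:
            probs[i] = 1.0
            remaining.remove(i)
        slots -= len(certainty)
    total = sum(mos[i] for i in remaining)
    for i in remaining:
        probs[i] = slots * mos[i] / total
    return probs


# Hypothetical measures of size for six units; three units are to be sampled.
example_mos = [500, 100, 100, 100, 100, 100]
example_probs = pps_inclusion_probabilities(example_mos, 3)
print(example_probs)
```

In this toy example the first unit is large enough to be selected with certainty, and the two remaining sample slots are spread evenly over the five equally sized units.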

Although this general description of the four-stage sampling procedure applies to all cycles of Continuous NHANES, some details of the sample design have changed over time. Detailed technical information can be found in the Sample Design documents, available on the NHANES Survey Methods and Analytic Guidelines page.

The sample is designed so that each annual sample is nationally representative. However, statistical estimates from single-year data are relatively unstable (have large variance estimates) because NHANES can survey only a small number of locations each year, given the time and cost involved in moving the mobile examination centers (MECs) between study locations. Data are therefore publicly released in two-year cycles. Even so, data from a single NHANES cycle may not be sufficient for certain analyses, such as those that examine subsamples or outcomes with low prevalence. For this reason, combining cycles into samples of four years or more is recommended whenever possible. See the Analytic Guidelines for more information.

## What is a Sample Weight?

A sample weight is assigned to each sample person. It measures the number of people in the population represented by that sample person in NHANES, and it reflects three components: the unequal probability of selection, an adjustment for sample person non-response, and an adjustment to account for differences between the final sample and the total population based on independent population control totals. When selection probabilities are unequal, as in the Continuous NHANES samples, the sample weights are used to produce an unbiased national estimate. More information about sample weights and how they are created can be found in the Weighting module of the tutorial and in the Estimation and Weighting Procedures documentation.
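The three components of a sample weight can be sketched with simple arithmetic. The Python example below uses entirely hypothetical numbers for a single made-up weighting cell, and the variable names are illustrative; the actual NHANES weighting steps are more detailed and are documented in the Estimation and Weighting Procedures reports.

```python
# All numbers are hypothetical and describe one made-up weighting cell.
selection_probability = 1 / 2000     # assumed overall chance of being sampled
sampled_persons = 50                 # persons selected in this cell
responding_persons = 40              # persons who actually participated
population_control_total = 110_000   # assumed independent population count

# Component 1: base weight is the inverse of the selection probability.
base_weight = 1 / selection_probability

# Component 2: respondents also stand in for the non-respondents.
nonresponse_factor = sampled_persons / responding_persons
adjusted_weight = base_weight * nonresponse_factor

# Component 3: scale so the weights sum to the population control total.
poststratification_factor = population_control_total / (
    adjusted_weight * responding_persons
)
final_weight = adjusted_weight * poststratification_factor

print(round(final_weight))
```

After the last step, the final weights of the respondents in the cell sum to the control total, which is the sense in which each respondent "represents" a number of people in the population.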

## Oversampling

NHANES is designed to sample larger numbers of certain subgroups of particular public health interest. Oversampling is done to increase the number of individuals in the sample in particular subgroups and therefore increase the reliability and precision of estimates of health status indicators for these population subgroups. Sample weights allow estimates from these subgroups to be combined to obtain national estimates that reflect the true relative proportions of these groups in the U.S. population as a whole.
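A small numerical sketch shows why the weights matter when a subgroup is oversampled. The Python example below assumes a hypothetical population of 1,000,000 in which group A makes up 90 percent and group B 10 percent, with equal samples of 500 drawn from each group; the group labels, prevalences, and weights are invented for illustration.

```python
# Each record is (group, sample_weight, has_condition). Group B is oversampled:
# it is 10% of the hypothetical population but 50% of the sample.
records = (
    [("A", 1800, 1)] * 50 + [("A", 1800, 0)] * 450    # 10% prevalence in A
    + [("B", 200, 1)] * 150 + [("B", 200, 0)] * 350   # 30% prevalence in B
)

# Ignoring the weights treats the sample as if it mirrored the population.
unweighted_prevalence = sum(y for _, _, y in records) / len(records)

# Weighting restores each group's true share of the population.
weighted_prevalence = (
    sum(w * y for _, w, y in records) / sum(w for _, w, _ in records)
)

print(unweighted_prevalence, weighted_prevalence)
```

The population prevalence implied by these assumptions is 0.9 × 0.10 + 0.1 × 0.30 = 0.12. The weighted estimate recovers that value, while the unweighted estimate is pulled toward the oversampled group.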

The oversampled subgroups in the 2015-2016 survey cycle were as follows:

• Hispanic persons;
• Non-Hispanic black persons;
• Non-Hispanic Asian persons;
• Non-Hispanic white and other † persons at or below 185 percent of the Department of Health and Human Services (HHS) poverty guidelines; and
• Non-Hispanic white and other † persons aged 80 years and older.

† Other: Non-Hispanic persons who reported races other than black, Asian, or white.

Different subgroups have been oversampled in other survey years. For example, survey cycles prior to 2007-2008 oversampled Mexican American persons rather than all Hispanic persons. The oversample of non-Hispanic Asian persons began in the 2011-2012 survey cycle. In future surveys, different subgroups may be oversampled depending on public health trends.

A brief description of the oversampled subgroups can be found on the overview page for each survey cycle (e.g., NHANES 2011-2012 Overview). For more details, consult the Analytic Guidelines and the Sample Design documents on the NHANES Survey Methods and Analytic Guidelines page. In particular, see the National Health and Nutrition Examination Survey: Sample Design, 2011–2014 report, "Table A. Selected sample design parameters: Health and Nutrition Examination Surveys, 1971–2014" for a description of the oversampling domains in the surveys back to NHANES I (1971–1974).

## WARNING

For your own analyses, it is critical to carefully review the documentation for each survey cycle to determine which subgroups were oversampled. Sample sizes may be too small to provide reliable nationally-representative estimates for subgroups that were not oversampled in a particular survey cycle.

For example, NCHS recommends that researchers not calculate estimates for all Hispanic persons for survey periods prior to 2007 or for Hispanic subgroups other than Mexican American in any survey cycle.

## Strata, PSUs, and Masked Variance Units

NHANES visited 12 PSUs in 1999 and 15 PSUs in each year since 2000. In stage 1 of the sampling procedure, these PSUs were selected from strata defined by geography (e.g., census region), metropolitan statistical area status, and various population demographics. Two PSUs were selected from most strata. Together, these strata and the PSUs represent the variance units (sampling units used to estimate sampling error).

Statistical software procedures used to analyze complex survey data need information about the first-stage sampling procedure (i.e., the strata and PSUs) in order to properly estimate the variance.

However, to protect the confidentiality of information provided by survey participants and to reduce disclosure risks, the "true" design strata and PSUs are not released on the public-use data files. Instead, masked variance units (MVUs) are constructed. MVUs are equivalent to the pseudo-PSUs used to estimate sampling errors in past NHANES. The MVUs on the data file are not the "true" design PSUs; they are collections of secondary sampling units aggregated into groups for the purpose of variance estimation. They produce variance estimates that closely approximate the variances that would have been estimated using the "true" design variables. MVUs have been created for each two-year cycle of NHANES in a way that allows them to be used for any combination of data cycles without recoding by the user. These MVUs define the strata and PSU variables on the public release files: the masked variance unit pseudo-stratum is `sdmvstra` and the masked variance unit pseudo-PSU is `sdmvpsu`.
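To show how the pseudo-stratum and pseudo-PSU variables enter a variance calculation, here is a minimal Python sketch of a weighted mean with a Taylor-linearized standard error, using the common with-replacement approximation at the first stage. It is not the code used by any particular survey package. The function name, the `weight` and `y` keys, and the toy rows are assumptions for illustration; only `sdmvstra` and `sdmvpsu` are variable names from the public files. For real analyses, established survey software is the safer choice.

```python
from collections import defaultdict
from math import sqrt


def weighted_mean_with_se(rows, y="y", w="weight",
                          stratum="sdmvstra", psu="sdmvpsu"):
    """Weighted mean and linearized standard error for a stratified,
    clustered design, treating PSUs as sampled with replacement."""
    total_w = sum(r[w] for r in rows)
    mean = sum(r[w] * r[y] for r in rows) / total_w

    # Sum the linearized values w * (y - mean) / total_w within each PSU.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for r in rows:
        psu_totals[r[stratum]][r[psu]] += r[w] * (r[y] - mean) / total_w

    # Between-PSU variance within each stratum, summed over strata.
    variance = 0.0
    for totals in psu_totals.values():
        n_h = len(totals)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        z_bar = sum(totals.values()) / n_h
        variance += n_h / (n_h - 1) * sum((z - z_bar) ** 2
                                          for z in totals.values())
    return mean, sqrt(variance)


# Toy data: two hypothetical strata with two PSUs each, all weights equal to 1.
toy_rows = [
    {"sdmvstra": 1, "sdmvpsu": 1, "weight": 1.0, "y": 1.0},
    {"sdmvstra": 1, "sdmvpsu": 1, "weight": 1.0, "y": 3.0},
    {"sdmvstra": 1, "sdmvpsu": 2, "weight": 1.0, "y": 5.0},
    {"sdmvstra": 1, "sdmvpsu": 2, "weight": 1.0, "y": 7.0},
    {"sdmvstra": 2, "sdmvpsu": 1, "weight": 1.0, "y": 2.0},
    {"sdmvstra": 2, "sdmvpsu": 1, "weight": 1.0, "y": 2.0},
    {"sdmvstra": 2, "sdmvpsu": 2, "weight": 1.0, "y": 4.0},
    {"sdmvstra": 2, "sdmvpsu": 2, "weight": 1.0, "y": 4.0},
]
toy_mean, toy_se = weighted_mean_with_se(toy_rows)
print(toy_mean, toy_se)
```

The variance here depends only on how the PSU-level totals differ within each stratum, which is why software needs the stratum and PSU identifiers in addition to the weights.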