
Module 2: Sample Design

The NHANES Tutorials are currently being reviewed and revised, and are subject to change. Specialized tutorials (e.g. Dietary, etc.) will be included in the future.

NHANES uses a complex, multistage, probability sampling design. Researchers need to take this into account in their analyses by appropriately specifying the sampling design parameters. This can be done using any statistical software that can analyze complex survey designs. Specifying sampling design parameters using SUDAAN and SAS Survey procedures is presented in this module.

NHANES data are not obtained using a simple random sample. Rather, a complex, multistage, probability sampling design is used to select participants representative of the civilian, non-institutionalized US population. Oversampling of certain population subgroups is done to increase the reliability and precision of health status indicator estimates for these groups.

IMPORTANT NOTE

For NHANES datasets, the use of sampling weights and sample design variables is recommended for all analyses because the sample design is a clustered design and incorporates differential probabilities of selection. If you fail to account for the sampling parameters, you may obtain biased estimates and overstate significance levels.

NHANES Survey Design

NHANES data are NOT obtained using a simple random sample. Rather, a complex, multistage, probability sampling design is used to select participants representative of the civilian, non-institutionalized US population. The sample does not include persons residing in nursing homes, members of the armed forces, institutionalized persons, or U.S. nationals living abroad.

NHANES Sampling Procedure

The NHANES sampling procedure consists of 4 stages, shown and described below.

Image of the four stages of the NHANES Sampling Procedure: Stage 1 Counties; Stage 2 Segments; Stage 3 Households; Stage 4 Individuals

Four Stages of NHANES Sampling Procedure

  • Stage 1: Primary sampling units (PSUs) are selected with probability proportional to a measure of size (PPS). PSUs are mostly single counties or, in a few cases, groups of contiguous counties.
  • Stage 2: The PSUs are divided into segments (generally city blocks or their equivalent). As with the PSUs, sample segments are selected with PPS.
  • Stage 3: Households within each segment are listed, and a sample is randomly drawn. In geographic areas where the proportion of age, ethnic, or income groups selected for oversampling is high, the probability of selection for those groups is greater than in other areas.
  • Stage 4: Individuals are chosen to participate in NHANES from a list of all persons residing in selected households. Individuals are drawn at random within designated age-sex-race/ethnicity screening subdomains. On average, 1.6 persons are selected per household.
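Selection with probability proportional to size, used in Stages 1 and 2, can be sketched with systematic PPS sampling. This is a minimal illustration, not the actual NHANES selection algorithm: the county sizes, the function name, and the choice of the systematic method are all assumptions made for the example.

```python
import random


def pps_systematic(sizes, n, start):
    """Select n units with probability proportional to size (systematic PPS).

    sizes: measure of size for each unit, in frame order.
    start: random start, which must lie in [0, sum(sizes) / n).
    Returns the indices of the selected units.
    """
    interval = sum(sizes) / n
    points = [start + k * interval for k in range(n)]
    selected, cumulative, idx = [], 0.0, 0
    for point in points:
        # Advance through the frame until the unit covering this point is reached.
        while cumulative + sizes[idx] <= point:
            cumulative += sizes[idx]
            idx += 1
        selected.append(idx)
    return selected


# Hypothetical measures of size for four counties.
county_sizes = [500, 100, 300, 100]
rng = random.Random(0)
interval = sum(county_sizes) / 2
picked = pps_systematic(county_sizes, n=2, start=rng.random() * interval)
print("selected county indices:", picked)
```

Under this scheme a unit's chance of selection is n times its share of the total size, so a unit whose size reaches the sampling interval is selected with certainty.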

What is a Sample Weight?

A sample weight is assigned to each sample person. It is a measure of the number of people in the population represented by that sample person in NHANES, reflecting the unequal probability of selection, nonresponse adjustment, and adjustment to independent population controls. When unequal selection probability is applied, as in the NHANES 1999-2002 sample, the sample weights are used to produce an unbiased national estimate. More information about sample weights and how they are created can be found in the Weighting module.
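The three components named above (selection probability, nonresponse adjustment, and adjustment to population controls) can be sketched as successive multiplications of a base weight. This is a simplified illustration with made-up numbers and hypothetical function names; the actual NHANES procedure is described in the Weighting module and works within adjustment cells rather than over the whole sample.

```python
def base_weight(selection_prob):
    """Base weight: the inverse of the probability of selection."""
    return 1.0 / selection_prob


def adjust_for_nonresponse(weights, responded):
    """Inflate respondents' weights so they also carry the nonrespondents' weight."""
    total = sum(weights)
    respondent_total = sum(w for w, r in zip(weights, responded) if r)
    factor = total / respondent_total
    return [w * factor for w, r in zip(weights, responded) if r]


def adjust_to_control_total(weights, control_total):
    """Scale weights so that they sum to an independent population control total."""
    factor = control_total / sum(weights)
    return [w * factor for w in weights]


# Hypothetical selection probabilities and response outcomes for four sampled persons.
probs = [0.001, 0.001, 0.002, 0.002]
responded = [True, False, True, True]

w0 = [base_weight(p) for p in probs]
w1 = adjust_for_nonresponse(w0, responded)
w2 = adjust_to_control_total(w1, control_total=3300.0)
print([round(w, 1) for w in w2])
```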

Oversampling

NHANES is designed to sample larger numbers of certain subgroups of particular public health interest. Oversampling is done to increase the reliability and precision of estimates of health status indicators for these population subgroups.

Examples of oversampled subgroups in the 1999-2004 surveys include:

  • African Americans
  • Mexican Americans
  • Low-income White Americans (beginning in 2000)
  • Adolescents aged 12-19 years
  • Persons aged 60 years and older
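A small numerical sketch shows why oversampling makes weights necessary. The population shares, outcome values, and names below are invented for illustration: a group that makes up 10% of the population supplies half of the sample, so the unweighted mean is pulled toward that group.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum of w * y divided by sum of w."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical population of 1,000: 900 in group A (outcome 10), 100 in group B (outcome 20).
# The sample takes 50 persons from each group, so group B is oversampled.
values = [10.0] * 50 + [20.0] * 50
weights = [900 / 50] * 50 + [100 / 50] * 50   # persons represented by each sample person

unweighted = sum(values) / len(values)
weighted = weighted_mean(values, weights)

print(f"unweighted mean = {unweighted}, weighted mean = {weighted}")
```

The weighted mean recovers the population value of (900 × 10 + 100 × 20) / 1000 = 11, while the unweighted mean of 15 does not.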

Different subgroups have been oversampled in other survey years. For example, during the late 1960s and early 1970s, there was concern that people of very low income and women of childbearing age were at greater risk of malnutrition than the general population. Therefore, during the first National Health and Nutrition Examination Survey (NHANES I), conducted in 1971-74, these subgroups were oversampled. In future surveys, different subgroups may be oversampled depending on public health trends.

WARNING

For your own analyses, it is critical to carefully review the documentation for each survey cycle to determine which subgroups were oversampled.

Strata and Masked Variance Units

The counties in PSUs from two panels of the 1995 National Health Interview Survey (NHIS) were used as the sampling frame for NHANES 1999-2001. The PSU samples for NHANES 2002-2006 and NHANES 2007-2010 were selected from a frame of all U.S. counties, using the 2000 census data and associated estimates and projections.

NHANES visited 12 PSUs in 1999 and 15 PSUs in each year from 2000 through 2006. For NHANES 2007-2010, NHANES will again visit 15 PSUs per year. For NHANES 1999-2010, each single year and any combination of consecutive years comprise a nationally representative sample of the U.S. population. However, two years of data are needed to provide sample sizes large enough for stable estimates, so the data are released in two-year cycles.

PSUs are selected from strata defined by geography and proportions of minority populations. Most strata contain two PSUs. Together, these strata and the PSUs represent the variance units (sampling units used to estimate sampling error).

To protect the confidentiality of data obtained from sample persons, masked variance units are constructed. Masked Variance Units (MVUs) are equivalent to Pseudo-PSUs used to estimate sampling errors in past NHANES. The MVUs on the data file are not the "true" design PSUs. They are a collection of secondary sampling units aggregated into groups for the purpose of variance estimation. They produce variance estimates that closely approximate the variances that would have been estimated using the "true" design variables. These MVUs have been created for each two-year cycle of NHANES and have been created in a way that allows them to be used for any combination of data cycles without recoding by the user. These MVUs are used to define the strata and PSU variables on the public release files. The variable name for the stratum is sdmvstra and the variable name for the PSU is sdmvpsu.
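Before running an analysis, it can be useful to tabulate how many masked PSUs fall in each masked stratum. The sketch below does this for a handful of made-up records. The record layout and function names are hypothetical, and the degrees-of-freedom rule shown (number of PSUs minus number of strata) is a common convention for designs like this one that is assumed here rather than taken from this section. Note that PSU codes repeat across strata, so a PSU is identified by the pair (sdmvstra, sdmvpsu).

```python
from collections import defaultdict


def psus_per_stratum(records):
    """Map each stratum code to the set of PSU codes observed within it."""
    design = defaultdict(set)
    for rec in records:
        design[rec["sdmvstra"]].add(rec["sdmvpsu"])
    return dict(design)


def design_degrees_of_freedom(design):
    """Assumed convention: number of PSUs minus number of strata."""
    n_psus = sum(len(psus) for psus in design.values())
    return n_psus - len(design)


# Hypothetical records; the stratum and PSU codes are invented for illustration.
records = [
    {"sdmvstra": 1, "sdmvpsu": 1}, {"sdmvstra": 1, "sdmvpsu": 2},
    {"sdmvstra": 2, "sdmvpsu": 1}, {"sdmvstra": 2, "sdmvpsu": 2},
    {"sdmvstra": 2, "sdmvpsu": 2},
    {"sdmvstra": 3, "sdmvpsu": 1}, {"sdmvstra": 3, "sdmvpsu": 2},
    {"sdmvstra": 3, "sdmvpsu": 3},
]

design = psus_per_stratum(records)
df = design_degrees_of_freedom(design)
print(design, df)
```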

Please refer to the Analytic Guidelines for more information on sampling and masked variance units.

Task 1: Specify Sampling Parameters in NHANES Using SUDAAN or SAS Survey Procedures

SUDAAN and SAS Survey are statistical software packages that can be used to analyze complex survey data such as NHANES.

Specifying Sampling Parameters in NHANES Using SUDAAN, SAS Survey Procedures, and Stata

Accounting for the complex sampling design of NHANES is critical when calculating statistical estimates and estimating standard errors of means, geometric means, percentages and other statistics. Replication and linearization are two statistical methods that can be used to properly address these complex design issues. SAS Survey, SUDAAN, and Stata use linearization for calculating standard errors for a variety of statistics, such as means, geometric means and percentages.
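To make the linearization idea concrete, here is a minimal sketch of a with-replacement Taylor series variance estimate for a weighted mean. It is an illustration under simplifying assumptions (no missing values, no subpopulations, at least two PSUs in every stratum) and is not the code used by SUDAAN, SAS, or Stata. The function name and the four-row data set are hypothetical.

```python
import math
from collections import defaultdict


def linearized_se_of_mean(rows):
    """rows: iterable of (stratum, psu, weight, value). Returns (mean, standard error)."""
    rows = list(rows)
    total_w = sum(w for _, _, w, _ in rows)
    mean = sum(w * y for _, _, w, y in rows) / total_w

    # Sum the linearized values w * (y - mean) / total_w within each PSU.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, y in rows:
        psu_totals[stratum][psu] += w * (y - mean) / total_w

    # Between-PSU variability within each stratum, summed over strata.
    variance = 0.0
    for totals in psu_totals.values():
        n_h = len(totals)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        z_bar = sum(totals.values()) / n_h
        variance += n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in totals.values())
    return mean, math.sqrt(variance)


# Hypothetical data: two strata, two PSUs each, one person per PSU, equal weights.
rows = [(1, 1, 1.0, 1.0), (1, 2, 1.0, 3.0), (2, 1, 1.0, 5.0), (2, 2, 1.0, 7.0)]
mean, se = linearized_se_of_mean(rows)
print(f"mean = {mean}, se = {se:.4f}")
```

Only variation between PSUs within the same stratum enters the estimate, which is why both the stratum and PSU variables must be supplied to the software.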

SUDAAN

Currently, SUDAAN offers six options for designating survey design (see SUDAAN manual for more details about the use and implications of all design options). SUDAAN assumes a with replacement (WR) design if the design parameter is omitted.

In the next task, you will be using the with replacement (WR) design for analyzing NHANES data.

In order to implement the WR sampling option in SUDAAN, design variables specifying the first stage of the cluster design and the sample weight are needed.

For more detailed information and sample code, see "Task 2a: How to Use SUDAAN Code to Specify Sampling Parameters in NHANES."

SAS Survey

In SAS, a group of procedures known as the Survey procedures produces estimates from complex sample survey data. These procedures can also produce variance estimates through linearization (see the variance estimation module) and confidence limits on many estimates. In SAS 9.1 Survey procedures, Taylor Series Linearization is the only variance estimation method available. In SAS 9.2 Survey procedures, Jackknife and Balanced Repeated Replication (BRR) variance estimation methods are also available.

In SAS Survey procedures, the sample design is not specified directly in the proc statement, as it is in SUDAAN. Instead, the strata and PSU variables are specified in separate statements, and the sample weight is likewise specified in its own weight statement. For more detailed information and sample code, see "Task 2b: How to Use SAS Survey Code to Specify Sampling Parameters in NHANES."

Stata

Taylor Series Linearization, Jackknife, Bootstrap and Balanced Repeated Replication (BRR) variance estimation methods are available in Stata.
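Of the replication methods named above, the jackknife is the simplest to sketch. The example below implements a delete-one-PSU jackknife for a weighted mean under simplifying assumptions (no missing values, at least two PSUs per stratum). It is an illustration only and is not necessarily the exact variant that any particular package computes by default. The function names and data are hypothetical.

```python
import math
from collections import defaultdict


def weighted_mean(rows):
    """rows: list of (stratum, psu, weight, value)."""
    return sum(w * y for _, _, w, y in rows) / sum(w for _, _, w, _ in rows)


def jackknife_se_of_mean(rows):
    """Delete-one-PSU jackknife. Returns (mean, standard error)."""
    rows = list(rows)
    full = weighted_mean(rows)

    psus = defaultdict(set)
    for stratum, psu, _, _ in rows:
        psus[stratum].add(psu)

    variance = 0.0
    for stratum, members in psus.items():
        n_h = len(members)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        scale = n_h / (n_h - 1)
        for dropped in members:
            # Drop one PSU and reweight the remaining PSUs in the same stratum.
            replicate = [
                (s, p, (w * scale) if s == stratum else w, y)
                for s, p, w, y in rows
                if not (s == stratum and p == dropped)
            ]
            variance += (n_h - 1) / n_h * (weighted_mean(replicate) - full) ** 2
    return full, math.sqrt(variance)


# Hypothetical data: two strata, two PSUs each, one person per PSU, equal weights.
rows = [(1, 1, 1.0, 1.0), (1, 2, 1.0, 3.0), (2, 1, 1.0, 5.0), (2, 2, 1.0, 7.0)]
jk_mean, jk_se = jackknife_se_of_mean(rows)
print(f"mean = {jk_mean}, jackknife se = {jk_se:.4f}")
```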

Sample Weights

A sample weight is assigned to each sample person. It is a measure of the number of people in the population represented by that sample person in NHANES, reflecting the unequal probability of selection, nonresponse adjustment, and adjustment to independent population controls. When unequal selection probability is applied, as in the NHANES 1999-2002 sample, the sample weights are used to produce an unbiased national estimate. More information about sample weights and how they are created can be found in the Weighting module.

Variance Units

The unmasked first-stage sampling units are not included in the data release files. Instead, masked variance units are released. The sample design variables used in SUDAAN and SAS Survey procedures are masked variance units. Using these masked variance units yields variance estimates that closely approximate those obtained using the unmasked variance units. See the Strata and Masked Variance Units section in "NHANES Survey Design."

Task 2a: How to Use SUDAAN Code to Specify Sampling Parameters in NHANES

Once data are sorted in SAS, SUDAAN can be used to specify the sampling design parameters. In this example, the SUDAAN procedure, proc descript, is used and the name of the dataset is BP_analysis_Data. Proc descript is being used as a generic example, but these statements apply to all SUDAAN procedures.

Step 1: Sorting in SAS

To carry out the appropriate SUDAAN design option for NHANES data, the data from BP_analysis_Data must be sorted by strata first and then PSU (unless the data have already been sorted by PSU within strata). The SAS proc sort statement must precede the SUDAAN statements.

WARNING

Data must always be sorted in SAS before doing analyses in SUDAAN.
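The required ordering, PSU within stratum, is simply a sort on the pair (sdmvstra, sdmvpsu). The sketch below shows the same ordering logic outside SAS, along with a check that a data set is already in that order. The record layout, the "person" field, and the function names are hypothetical.

```python
def sort_by_design(records):
    """Order records by stratum first, then by PSU within stratum."""
    return sorted(records, key=lambda r: (r["sdmvstra"], r["sdmvpsu"]))


def is_sorted_by_design(records):
    """True if records are already ordered by (stratum, PSU)."""
    keys = [(r["sdmvstra"], r["sdmvpsu"]) for r in records]
    return all(a <= b for a, b in zip(keys, keys[1:]))


# Hypothetical records in arbitrary order.
records = [
    {"person": "a", "sdmvstra": 2, "sdmvpsu": 1},
    {"person": "b", "sdmvstra": 1, "sdmvpsu": 2},
    {"person": "c", "sdmvstra": 1, "sdmvpsu": 1},
]

ordered = sort_by_design(records)
print([r["person"] for r in ordered], is_sorted_by_design(ordered))
```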

Step 2: Use proc statement in SUDAAN

This statement immediately follows the sort statement. In this example, the proc descript statement is used. In addition, the data option specifies BP_analysis_Data as the SAS dataset being used and the design option specifies with replacement (WR) as the design.

Step 3: Use nest statement in SUDAAN

The nest statement lists the variables that identify the strata and the PSU. The nest statement is required for the appropriate design option for NHANES to be used.

As in the sort statement, the nest statement lists the stratum variable (i.e., sdmvstra) first, followed by the PSU variable (i.e., sdmvpsu).

Step 4: Use weight statement in SUDAAN

In NHANES, a sample weight is assigned to each sample participant. The sample weight is a measure of the number of individuals in the target population that the sampled individual represents. Sample weights are needed to obtain unbiased estimates of population parameters when the sample participants are chosen with unequal probabilities. (See module on weighting for more details).

In this example, the MEC weight for 4 years of data is used because the dataset combines two 2-year cycles of data. (See module on Weighting for how to choose, combine and use other weight variables.)
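A rough sketch shows why a combined data set needs its own weight. It assumes that each 2-year cycle's weights sum to the full target population, so stacking two cycles without adjustment would count the population twice. The numbers and function name are invented, and the simple division shown is only an illustration of the scaling idea. For the data set in this example, use the released 4-year weight (wtmec4yr) as described above, and see the Weighting module for the actual rules.

```python
def rescale_for_combined_cycles(two_year_weights, n_cycles):
    """Illustrative rescaling: divide stacked 2-year weights by the number of cycles."""
    return [w / n_cycles for w in two_year_weights]


# Hypothetical 2-year weights; each cycle sums to a population of 10,000.
cycle_a = [4000.0, 6000.0]
cycle_b = [5000.0, 5000.0]

stacked = cycle_a + cycle_b                      # sums to 20,000: population counted twice
combined = rescale_for_combined_cycles(stacked, n_cycles=2)

print(sum(stacked), sum(combined))
```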

Summary: Sample SUDAAN code for sorting and specifying sampling design parameters

The following table shows how to combine the statements described above to properly sort the data, and specify the sample design, design parameters, and sample weights. The procedure proc descript is being used as an example, but the design, nest and weight statements can be used in the same manner for all SUDAAN procedures. Additionally, other procedure options can be added to these statements to customize the analysis and output. Consult the SUDAAN manual for specifications on the options for each SUDAAN procedure.

SUDAAN descript Procedure

Statements:
proc sort data=BP_analysis_Data;
by sdmvstra sdmvpsu;
run;

Explanation: Use the SAS procedure, proc sort, to sort the data by the design parameters, strata (sdmvstra) and primary sampling units (sdmvpsu), before running the procedure in SUDAAN.

Statement:
proc descript data=BP_analysis_Data design=WR;

Explanation: Use the proc statement to specify the SUDAAN procedure being used (proc descript here), the data set (BP_analysis_Data), and the sample design (with replacement, WR).

Statement:
nest sdmvstra sdmvpsu;

Explanation: Use the nest statement to specify the strata (sdmvstra) and PSU (sdmvpsu) variables to account for the sample design.

Statement:
weight wtmec4yr;

Explanation: Use the weight statement to account for the unequal probability of sampling, nonresponse, and adjustment to population control totals. In this example, the MEC weight for 4 years of data is being used (wtmec4yr).

Reference

RTI (2004). SUDAAN User's Manual, Release 9.0. Research Triangle Park, NC: Research Triangle Institute.

Task 2b: How to Use SAS Survey Code to Specify Sampling Parameters in NHANES

The code needed to specify sampling design parameters using SAS Survey procedures is described below. In this example, the SAS Survey procedure, proc surveymeans, is used and the name of the dataset is BP_analysis_Data. Proc surveymeans is being used as a generic example, but the strata, cluster and weight statements apply to all SAS Survey procedures.

Step 1: Use data statement

When using SAS Survey procedures, the input dataset must be identified. However, the dataset does not have to be presorted by the sample design variables as it does in SUDAAN. Rather, the design variables—strata and PSU—are specified in subsequent steps.

Step 2: Use strata statement

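The strata statement names the variables that form the strata (SAS also accepts stratum as an alias for this statement, and the sample code in the summary uses that form). For the Continuous NHANES, the variable that identifies the sample strata is named sdmvstra.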

Step 3: Use cluster statement

The cluster statement names the variables that identify the clusters in a clustered sample design such as NHANES. Because NHANES analyses also require a strata statement, SAS Survey procedures treat the clusters as nested within the strata.

In NHANES the variable that represents the sample clusters is named sdmvpsu (masked primary sampling units or PSUs).

Step 4: Use weight statement

In NHANES, a sample weight is assigned to each sample participant. The sample weight is a measure of the number of individuals in the target population that the sampled individual represents. Sample weights are needed to obtain unbiased estimates of population parameters when the sample participants are chosen with unequal probabilities. (See module on weighting for more details).

The weight statement in SAS Survey procedures is required for all NHANES analyses. It identifies the sample weight. In this example, the MEC weight for 4 years of data (wtmec4yr) is used.

Summary: Sample SAS Survey Procedure specifying sampling design parameters

The following table shows how to combine the statements described above to properly specify the sample design parameters and sample weights using SAS Survey procedures. The procedure, proc surveymeans, is used as an example, but the strata, cluster and weight statements can be used in the same manner for all SAS Survey procedures. The steps in this task identify the most basic statements used in SAS Survey procedures to account for the complex sample design of NHANES. Additional procedure options can be added to these statements to customize the variance estimates, statistics and the output from your procedure to suit individual analytic needs. Please consult the SAS/STAT manual for specifications on the options for each SAS Survey procedure.

SAS surveymeans Procedure

Statement:
proc surveymeans data=BP_analysis_Data;

Explanation: Use the SAS Survey procedure, proc surveymeans, to calculate means and standard errors, and specify the data set (BP_analysis_Data).

Statement:
stratum sdmvstra;

Explanation: Use the stratum statement to specify the strata (sdmvstra). This accounts for the design effects of stratification.

Statement:
cluster sdmvpsu;

Explanation: Use the cluster statement to specify the primary sampling unit (sdmvpsu). This accounts for the design effects of clustering.

Statement:
weight wtmec4yr;

Explanation: Use the weight statement to account for the unequal probability of sampling, survey nonresponse, and adjustments to population control totals. In this example, the MEC weight for 4 years of data is used (wtmec4yr).

Reference

SAS Institute Inc., SAS/STAT User's Guide, Version 9.1; see: Survey Means Procedure
