National Health and Nutrition Examination Survey

Sample Code

This module provides sample SAS, SUDAAN, Stata, and R code (see Matrix) for generating an analytic dataset, descriptive statistics, hypothesis testing (including confidence intervals and regression analysis), age standardization, and population counts for select NCHS publications using NHANES data.

Matrix: Sample code for select statistical analysis topics in 7 NCHS publications

Below is a list of sample code demonstrating selected statistical analysis topics from select NCHS publications, using publicly available NHANES data. Each colored box in the matrix links to the sample code for that topic and publication, and the legend shows which software each color represents.

Legend: SAS Code SUDAAN Code STATA Code R Code
NCHS Publications	DB 52 - Blood lead and mercury levels	DB 270 - Sugar sweetened beverages	DB 303 - Depression prevalence	DB 511 - Hypertension prevalence	DB 369 - Prescription medication use	DB 405 - Osteoporosis prevalence	VHS S3 N46 - Anthropometric reference data
Downloading and importing data files
Using correct sample weights
Descriptive Statistics
Frequency distribution and normality
Percentiles
Geometric means
Arithmetic means
Proportions (categorical variable)
Hypothesis Testing and Confidence Intervals
T-test/Pairwise contrasts
Chi-square test
Analysis of trends
Korn and Graubard confidence intervals for proportions (Download required macro: KG_macro.sas)
Age Standardization and Population Estimates
Age-adjusted prevalence rates and means
Population counts
Regression
Linear / logistic regression
Dietary Data
Ratio of means
Means ratios

Sample Code Matrix

These code examples use the standard public-release NHANES datasets and include statements to download the required data files from the NHANES website and import them into the software package. Continuous NHANES data (1999-2000 onward) are released in SAS transport file (.xpt) format, which many software packages can import. See the Datasets and Documentation module for a detailed description of how the data files are organized.

Unless otherwise noted, the example programs use procedure syntax available in the following software versions:

SAS-callable SUDAAN version 11
SAS version 9.4 with SAS/STAT 14.1
R version 3.5.2 with "survey" package version 3.35.1
Stata/SE version 15

If you use a different software version, review its documentation for differences in capabilities or syntax.

IMPORTANT NOTE

Some lines of the code may need to be edited in order to run. For example, the path to a directory where a permanent dataset is saved may need to be edited.

Special Considerations for Analyzing NHANES Data

Orthogonal polynomial contrasts and trends

See the National Center for Health Statistics Guidelines for Analysis of Trends for more information.

Confidence Intervals

Korn and Graubard confidence intervals, along with confidence interval widths, sample size, and degrees of freedom, are the standards used to determine the reliability of estimated proportions. For more information, see the Reliability of Estimates module. Further information on Korn and Graubard confidence intervals for proportions can be found in the National Center for Health Statistics Data Presentation Standards for Proportions.

Code included for:

Age-standardization
Population counts or estimates

Specific notes related to NHANES data analysis:

Age-Adjusted Prevalence Rates and Means

Age standardization, sometimes referred to as age adjustment, is a method that applies observed age-specific rates to a standard age distribution. The method adjusts for the confounding effect of age. Standard age proportions are calculated by dividing the age-specific Census population (P) by the total Census population number (T). The standardizing proportions (P/T) should sum to 1.

There are two steps:

Choose a standard population. The example code provided uses the 2000 Census data.
Multiply the age-specific prevalence from the study population by the proportion of people in that age group in the standard population, then sum the results across age groups to obtain the age-adjusted estimate.
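
The two steps above can be written compactly as follows. This is only a sketch of the notation: the age-group index $i$ and the symbols $\hat{p}_i$ (age-specific prevalence in the study population), $w_i$ (standard age proportion), and $\hat{p}_{\text{adj}}$ (age-adjusted estimate) are introduced here for illustration and do not appear in the original text; $P_i$ and $T$ are the age-specific and total Census populations defined above.

$$\hat{p}_{\text{adj}} = \sum_{i} w_i \, \hat{p}_i, \qquad w_i = \frac{P_i}{T}, \qquad \sum_{i} w_i = 1$$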

Example:

Standard Proportions for 20-year Age Groups Based on the 2000 U.S. Census Standard Population
Age Group	Age-Specific Census Population, P (in thousands)	Total Census Population, T (in thousands)	Standard Age Proportion, P/T
20-39	77,670	195,850	0.396579
40-59	72,816	195,850	0.371795
60+	45,364	195,850	0.231626
Total	195,850		1.000000

$$\frac{\text{77,670 thousand people age 20-39 years}} {\text{195,850 thousand population ages 20+ years}} = 0.396579$$

Equation for the standard age proportion for people 20-39 years old
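
The table and equation above can be reproduced with a short Python sketch. It is an illustration only, not the NCHS sample code: the function names are made up for this example, the age-specific prevalences are hypothetical values chosen to show the arithmetic, and the calculation does not account for the survey design (no standard errors are produced).

```python
# Age-specific 2000 U.S. Census standard populations (in thousands),
# copied from the table above.
CENSUS_2000_THOUSANDS = {"20-39": 77_670, "40-59": 72_816, "60+": 45_364}


def standard_proportions(populations):
    """Return P/T for each age group, where T is the sum over the groups used."""
    total = sum(populations.values())
    return {group: count / total for group, count in populations.items()}


def age_adjusted(age_specific, proportions):
    """Sum the age-specific estimates weighted by the standard proportions."""
    if set(age_specific) != set(proportions):
        raise ValueError("age groups must match the standard population groups")
    return sum(age_specific[g] * proportions[g] for g in proportions)


weights = standard_proportions(CENSUS_2000_THOUSANDS)

# HYPOTHETICAL age-specific prevalences (%), for illustration only;
# these are not NHANES results.
example_prevalence = {"20-39": 10.0, "40-59": 30.0, "60+": 60.0}
adjusted = age_adjusted(example_prevalence, weights)

for group, weight in weights.items():
    print(f"{group}: {weight:.6f}")
print(f"sum of proportions: {sum(weights.values()):.6f}")
print(f"age-adjusted prevalence: {adjusted:.1f}%")
```

Because the total T is computed from whichever age groups are passed in, the same function applies when age groups are combined or restricted to match the age range of a specific analysis.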

IMPORTANT NOTE

See source census data (Table 2 in report). Age groups can be combined to reflect the age range of the population used in the specific analysis.

Population counts

The non-institutionalized population totals are used to calculate the final sample weights for the NHANES survey. However, it is NOT RECOMMENDED to use the sum of the final sample weights for sampled persons with the health condition of interest to calculate population estimates of, or number of people with, the health condition. This is because, if there are exclusions or missing data for a health condition, summing the weights will underestimate the population estimate. Consequently, the steps below are recommended for calculating the population count or number of people with a given condition from NHANES:

Calculate the crude percentage who have the outcome or characteristic overall and by subgroups of interest.
Obtain the relevant population totals for the NHANES survey cycle(s) being used.
Combine population totals (if desired). Ages, sexes, or race/Hispanic origin subgroups can be combined. It also is possible to combine NHANES survey cycles. For example, to combine two survey cycles (e.g., 2015-2016 and 2017-2018), the midpoint of each cycle is used: ½ (NHANES 2015-2016 population totals) + ½ (NHANES 2017-2018 population totals).
In the last step, multiply the prevalence estimates by the corresponding population totals to estimate the total number of civilian, non-institutionalized U.S. residents affected by the health condition. The percent prevalence estimate and its lower and upper 95% confidence limits are each multiplied by the population total for that subgroup. Confidence intervals for the population estimates of those affected by the health condition are estimated using the Wald method and the Korn and Graubard method. To calculate age-, sex-, or race/Hispanic origin-specific population estimates, multiply the prevalence of the health condition in each sub-domain by the population total for the respective sub-domain.
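
The steps above can be sketched in Python as follows. This is a minimal illustration, not the NCHS sample code: the function names are invented for this example, and every number (population totals, prevalence, confidence limits, and the toy records) is hypothetical. In practice the prevalence and its 95% confidence limits would come from survey software that accounts for the NHANES design. The toy records at the end show why summing the weights of affected persons underestimates the count when some records have missing data.

```python
def combine_cycles(totals_by_cycle):
    """Average population totals across cycles (1/2 + 1/2 for two cycles)."""
    return sum(totals_by_cycle) / len(totals_by_cycle)


def population_count(percent, lower, upper, population_total):
    """Multiply a crude prevalence (%) and its 95% limits by a population total."""
    return tuple(round(value / 100 * population_total)
                 for value in (percent, lower, upper))


def crude_prevalence(records):
    """Weighted % with the condition among records with non-missing status."""
    known = [(w, has) for w, has in records if has is not None]
    total_weight = sum(w for w, _ in known)
    case_weight = sum(w for w, has in known if has)
    return 100 * case_weight / total_weight


# HYPOTHETICAL population totals (persons) for one subgroup in two cycles.
total_2015_2016 = 100_000_000
total_2017_2018 = 104_000_000
combined_total = combine_cycles([total_2015_2016, total_2017_2018])

# HYPOTHETICAL crude prevalence and 95% confidence limits (%).
estimate, low, high = population_count(25.0, 22.0, 28.0, combined_total)
print(f"estimated count: {estimate:,} (95% CI {low:,} to {high:,})")

# Toy records: (sample weight, has_condition); None marks missing data.
# The weights sum to a HYPOTHETICAL population total of 10,000.
toy_records = [(2000, True), (6000, False), (2000, None)]
toy_population_total = sum(w for w, _ in toy_records)

summed_case_weights = sum(w for w, has in toy_records if has)
toy_prevalence = crude_prevalence(toy_records)
recommended_count = toy_prevalence / 100 * toy_population_total

print(f"sum of weights for cases: {summed_case_weights}")
print(f"prevalence x population total: {recommended_count:.0f}")
```

In the toy data, the record with missing status contributes nothing to the summed weights, so that total falls short of the count obtained by applying the crude prevalence to the full population total.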

IMPORTANT NOTE

Age standardization of the prevalence estimates is NOT performed because the population counts should be based on the crude (unadjusted) prevalence in the population.

Content source: CDC/National Center for Health Statistics