# Module 6: Sample Code

This module provides sample SAS, SUDAAN, Stata, and R code (see Matrix) for generating an analytic dataset, descriptive statistics, hypothesis testing (including confidence intervals and regression analysis), age standardization, and population counts for select NCHS publications using NHANES data.

The matrix below lists sample code for selected statistical analysis topics in select NCHS publications that use NHANES data. Click a colored box in the matrix to open the sample code. The legend shows which software each color represents.

Matrix: Sample code for select statistical analysis topics in 7 NCHS publications

Legend: SAS Code, SUDAAN Code, Stata Code, R Code

The matrix lists the NCHS publications against the following topics:

• Using correct sample weights
• Descriptive Statistics: frequency distribution and normality; percentiles; geometric means; arithmetic means; proportions (categorical variable)
• Hypothesis Testing and Confidence Intervals: t-test/pairwise contrasts; chi-square test; analysis of trends; Korn and Graubard confidence intervals for proportions
• Age Standardization and Population Estimates: population counts
• Regression: linear/logistic regression
• Dietary Data: ratio of means; means ratios


These code examples use the standard public-release NHANES datasets and include statements that download the required data files from the NHANES website and import them into the software package. The Continuous NHANES data (since 1999-2000) are released in SAS transport file (.xpt) format, which can be imported into many software packages. See Module 1, Datasets and Documentation, for a detailed description of how the data files are organized.

Unless otherwise noted, the example programs use procedure syntax available in the following software versions:

• SAS-callable SUDAAN version 11
• SAS version 9.4 with SAS/STAT 14.1
• R version 3.5.2 with "survey" package version 3.35.1
• Stata/SE version 15

If you use a different software version, review that version's documentation for differences in capabilities or syntax.

## IMPORTANT NOTE

Some lines of the code may need to be edited in order to run. For example, the path to a directory where a permanent dataset is saved may need to be edited.

Orthogonal polynomial contrasts and trends

See the National Center for Health Statistics Guidelines for Analysis of Trends for more information.

Confidence Intervals

Korn and Graubard confidence intervals, together with confidence interval widths, sample size, and degrees of freedom, are the standards used to determine the reliability of estimated proportions. For more information, see Module 5 of the NHANES Tutorials. Further information on Korn and Graubard confidence intervals for proportions can be found in the National Center for Health Statistics Data Presentation Standards for Proportions.

Code included for:

1. Age standardization
2. Population counts or estimates

Specific notes related to NHANES data analysis:

Age standardization, sometimes referred to as age adjustment, is a method that applies observed age-specific rates to a standard age distribution in order to adjust for the confounding effect of age. Standard age proportions are calculated by dividing the age-specific Census population (P) by the total Census population (T). The standardizing proportions (P/T) should sum to 1.

There are two steps:

1. Choose a standard population. The example code provided uses the 2000 Census data.
2. Multiply the age-specific prevalence in the study population by the proportion of people in that age group in the standard population, then sum the results across age groups to obtain the age-adjusted estimate.

Example:

Standard Proportions for 20-year Age Groups Based on the 2000 U.S. Census Standard Population

| Age Group | Age-Specific Census Population, P (in thousands) | Total Census Population, T (in thousands) | Standard Age Proportion, P/T |
|---|---|---|---|
| 20-39 | 77,670 | 195,850 | 0.396579 |
| 40-59 | 72,816 | 195,850 | 0.371795 |
| 60+ | 45,364 | 195,850 | 0.231626 |
| Total | 195,850 | | Sum: 1 |
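The two steps can be sketched in a few lines of Python. This is an illustration only, not the NCHS sample code: the Census counts come from the table above, while the function names and the age-specific prevalence values are hypothetical and chosen purely to show the arithmetic. A real analysis would also use the survey design to obtain standard errors.

```python
# 2000 U.S. Census standard population, in thousands (values from the table above).
CENSUS_POP_THOUSANDS = {"20-39": 77_670, "40-59": 72_816, "60+": 45_364}


def standard_proportions(age_pop):
    """Step 1: divide each age-specific population (P) by the total (T)."""
    total = sum(age_pop.values())
    return {age: count / total for age, count in age_pop.items()}


def age_adjusted_prevalence(prevalence_by_age, proportions):
    """Step 2: weight each age-specific prevalence by P/T and sum."""
    if set(prevalence_by_age) != set(proportions):
        raise ValueError("age groups must match the standard population")
    return sum(prevalence_by_age[age] * proportions[age] for age in proportions)


# Hypothetical age-specific prevalence (as proportions), for illustration only.
example_prevalence = {"20-39": 0.10, "40-59": 0.20, "60+": 0.30}

weights = standard_proportions(CENSUS_POP_THOUSANDS)
adjusted = age_adjusted_prevalence(example_prevalence, weights)

for age, w in weights.items():
    print(f"{age}: {w:.6f}")
print(f"age-adjusted prevalence: {adjusted:.4f}")
```

The age-group keys must match between the prevalence estimates and the standard population. If the analysis uses a different age range, the Census groups would need to be recombined first.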

## IMPORTANT NOTE

See source census data (Table 2 in report). Age groups can be combined to reflect the age range of the population used in the specific analysis.

Population counts

The non-institutionalized population totals are used to calculate the final sample weights for the NHANES survey. However, summing the final sample weights of sample persons who have the health condition of interest is NOT RECOMMENDED as a way to estimate the number of people with that condition. If there are exclusions or missing data for the health condition, the summed weights will underestimate the population count. Instead, the following steps are recommended for calculating the population count, or number of people with a given condition, from NHANES:

1. Calculate the crude percentage who have the outcome or characteristic overall and by subgroups of interest.
2. Obtain the relevant population totals for the NHANES survey cycle(s) being used.
3. Combine population totals (if desired). Age, sex, or race/Hispanic origin subgroups can be combined. It is also possible to combine NHANES survey cycles. For example, to combine two survey cycles (2015-2016 and 2017-2018), the midpoint of each cycle is used: ½ (NHANES 2015-2016 population totals) + ½ (NHANES 2017-2018 population totals).
4. Multiply the prevalence estimates by the corresponding population totals to estimate the total number of civilian, non-institutionalized U.S. residents with the health condition. The percent prevalence estimate and its lower and upper 95% confidence limits are each multiplied by the population total for that subgroup. Confidence intervals for the population estimates are calculated using the Wald and the Korn and Graubard methods. To calculate age-, sex-, or race/ethnicity-specific population estimates, multiply the prevalence of the health condition in each sub-domain by the population total for that sub-domain.
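Steps 3 and 4 reduce to simple arithmetic, sketched below in Python. This is a minimal illustration, not the NCHS sample code: the function names are hypothetical, and the population totals, prevalence, and confidence limits are made-up numbers rather than NHANES values. The prevalence and its 95% confidence limits are assumed to have been estimated already as crude (not age-standardized) values with survey software.

```python
def combine_two_cycles(total_cycle_1, total_cycle_2):
    """Step 3: 1/2 (first cycle total) + 1/2 (second cycle total)."""
    return 0.5 * total_cycle_1 + 0.5 * total_cycle_2


def population_count(prevalence, ci_lower, ci_upper, population_total):
    """Step 4: multiply the prevalence and its confidence limits
    (all expressed as proportions) by the population total."""
    return tuple(
        round(value * population_total)
        for value in (prevalence, ci_lower, ci_upper)
    )


# Hypothetical population totals for one subgroup in two survey cycles.
combined_total = combine_two_cycles(100_000_000, 102_000_000)

# Hypothetical crude prevalence with lower and upper 95% confidence limits.
count, count_lower, count_upper = population_count(
    prevalence=0.25, ci_lower=0.22, ci_upper=0.28,
    population_total=combined_total,
)

print(f"combined population total: {combined_total:,.0f}")
print(f"estimated count: {count:,} ({count_lower:,} to {count_upper:,})")
```

For subgroup-specific counts, the same function would be called once per sub-domain, pairing each subgroup's prevalence with that subgroup's population total.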

## IMPORTANT NOTE

Age standardization of the prevalence estimates is NOT performed because the population counts should be based on the crude (unadjusted) prevalence in the population.