Module 1: Datasets

The NHANES Tutorials are currently being reviewed and revised, and are subject to change. Specialized tutorials (e.g. Dietary, etc.) will be included in the future.

The NHANES website is the most important data source and analytical resource for all data users. The website contains both historic and current datasets, and covers a wide range of critical topics. It is very important to learn to navigate this website and use these resources effectively. This module describes how Continuous NHANES data are structured and organized.

Finding Datasets

Publicly-Released Datasets

Throughout the years, NHANES datasets and related information have been released in a variety of formats and different media. However, since the late 1990s, all publicly available data and related documentation are released and updated in a centralized location: the NHANES website.

Datasets contain data for persons who participated in the selected survey. The National Health and Nutrition Examination Survey (NHANES) datasets are labeled by cycle year. The website contains the public-use data files for each of the National Center for Health Statistics' national surveys, from the initial National Health Examination Survey (NHES) I dataset to the most current dataset. Codebooks and documentation are part of each data file.

There are several pages:

  • The Questionnaires, Data Sets, and Related Documentation page lists all survey cycles from most recent to oldest and includes the NHANES Analytic Guidelines and the suggested citation for NHANES to use in publications.
  • The survey cycle page (titled by the survey cycle, e.g., NHANES 2001-2002) contains documentation about the survey, documentation on how to use the data, and links to each of the five component pages. It is divided into four sections:
    • Contents at a Glance, which has information about the survey in general terms
    • Contents in Detail, which has detailed descriptions about the different items used in the survey including survey questionnaires, manuals, brochures and consent forms
    • Using the Data, which has Analytic and Reporting Guidelines and documentation on how the data was released, and
    • Data, Documentation, Codebooks, SAS Code, which has links to the five component pages:
      • Demographics
      • Dietary
      • Examination
      • Laboratory
      • Questionnaire
  • The component page (titled by the survey cycle and then the component name, e.g., 2001-2002 Demographics) links to the data files, documentation, and variable list for that component. Each data file name is listed with the corresponding links to its Documentation file (Docs) and Data file (Data).

IMPORTANT NOTE

This layout applies only to the Continuous NHANES datasets, starting from 1999. Datasets earlier than 1999 have a different format and layout.

Non-Publicly Released Datasets

High quality data is released according to these guiding principles:

  • as widely as practicable,
  • as soon as possible after data collection, and
  • in as much detail as possible,
  • while maintaining survey participant confidentiality.

As a result, some variables or entire data files are not publicly released due to disclosure concerns, for example, geographic identifiers. These files are only available through the Research Data Center (RDC). You may review the Data Release and Access Policy for more information. For some non-publicly released data, the documentation, including a codebook with frequencies, is available to assist data users preparing proposals to use the Research Data Center.

Finding Survey Background Information

The NHANES survey content is determined after a rigorous evaluation process including consideration of criteria such as public health importance, feasibility of the proposed survey items and burden to survey participants. This information is important as you determine the scope of your analysis and which variables to include. For example, your analysis may require background information, such as:

  • Survey Contents, which shows the years components were collected and when changes to the components occurred,
  • Sample Person Questionnaire, and
  • MEC Components Description, which is the survey protocol for obtaining the physical examination measures in the survey.

Step 1: Find Survey Contents

From the homepage, select the link titled Questionnaires, Data Sets, and Related Documentation, then select the NHANES 2001-2002 link. Now, in the Contents at a Glance section, click the Survey Contents link. After the PDF file of the Survey Contents opens, scroll to the third page, where the chart of when variables were collected is displayed. By reviewing this chart you can determine whether the components you are interested in were collected in all available years or only in some of them, and design your analysis accordingly.

Step 2: Find Questionnaire Questions

Select the Survey Questionnaires, Examination Components and Laboratory Components link in the Contents in Detail section. Now, select the Blood Pressure link under the Sample Person Questionnaire heading. This link opens the blood pressure questions asked of the survey participants and lists the possible responses. Be sure to review these questions, the possible answers, and whether or not each question was part of a skip pattern.

Step 3: Find MEC Component Descriptions

Select the Examination Components link under the Examination and Laboratory heading. A PDF file of the MEC Components Description opens. Review this file for descriptions of the components offered in the MEC. Be sure to review the protocol as necessary.

Data Structure

Cycles

The current NHANES, also known as Continuous NHANES, refers to the two-year cycles of data produced since 1999.

Components

Each cycle is divided into five sections labeled by collection method: Demographics, Dietary, Examination, Laboratory, and Questionnaire.

  • Demographics files contain survey design variables such as weights, strata and primary sampling units, as well as demographic variables.
  • Dietary files contain data collected from participants on their dietary intake, which includes foods, beverages, and dietary supplements.
  • Examination files contain information collected through physical exams and dental exams.
  • Laboratory files contain results from analyses of blood, urine, hair, air, tuberculosis skin test, and household dust and water specimens.
  • Questionnaire files contain data collected through household and mobile examination center interviews.

Component Data Files

Within each section are many components, each a group of related variables packaged in a data file. This division allows each data file to be posted to the website as soon as its component is completed and reviewed, and it keeps download times short.

Each Continuous NHANES survey cycle consists of the five components listed below. Each component may have one or more data files; examples of these data files are listed next to each component.

Components and Data Files

  • Demographics - Demographics (including survey weights)
  • Examination - Audiometry, Blood Pressure, Body Measures, Muscular Strength, Oral Health, Vision Exam, etc.
  • Laboratory - Urine Collection, Hepatitis, HIV, Heavy Metals, Plasma Glucose, Total Cholesterol, Triglycerides, etc.
  • Questionnaire - Alcohol Use, Balance, Blood Pressure, Diabetes, Drug Use, Social Support, Vision, Weight History, etc.
  • Dietary - Dietary Interview, Supplement Use, etc.

Diagram of the Continuous NHANES Components and Data Files

Decide Which NHANES Variables to Include

Your analysis will require a subset of the variables available in NHANES. To decide which variables are needed in your analysis, you need to review the survey documentation. The survey documentation for each component is slightly different. For instance:

  • the questionnaire component contains the questions on the Sample Person Questionnaire and the codes for all possible responses,
  • the examination component contains information about measurements and how each was collected, and
  • the laboratory component contains information about how samples were taken and how each was processed in the lab.

These files, as well as the survey documentation for other components, are available on the Survey Questionnaires, Examination Components and Laboratory Components page for each two-year cycle of the survey. When you search the documentation, read each "hit" in your search results carefully, because not every result returned will be relevant to your analysis.

For example, assume you are preparing for an analysis using blood pressure variables, and search the examination survey documentation file for "blood pressure." Some blood pressure questions and measurements are used for safety exclusions in the Cardiovascular Fitness portion of the examination survey. However, the main collection of standardized blood pressure measurements is conducted on all eligible participants aged 8 years and older in the Physician's Exam section of the survey. These are the blood pressure examination variables you would want to use in your analysis.

You must read the documentation and identify the correct variables for your analysis.

Identifying Variable Names and File Locations

Variables are stored in different data files. Data files are organized by their collection method, which can fall under one of five components: Demographics, Dietary, Examination, Laboratory, and Questionnaire. Usually, an analysis will require more than one component. For instance, age and gender are in the Demographics component, blood pressure measurements are in the Examination component, cholesterol variables are in the Laboratory component, and questions about previous diagnoses or taking medications for hypertension are in the Questionnaire component. All these variables would be required in a complete analysis of cardiovascular disease.

Below is a list of the file types and a summary of their contents.

  • Demographics files: survey design (e.g. weights, design strata) and demographic variables
  • Dietary files: data collected from participants on their dietary intake, which includes foods, beverages and dietary supplements.
  • Examination files: information collected through physical exams, dental exams, and dietary interview components (Note: not every survey participant agreed to a physical examination)
  • Laboratory files: results from analyses of blood, urine, hair, air, tuberculosis skin test, and household dust and water specimens
  • Questionnaire files: data collected through household interview and mobile examination center (MEC) interview

The component's variable list contains all the publicly released variables and their file locations. The variable lists are available as web pages and list the

  • filename the variable is found in,
  • variable name, and
  • a short description of the variable.

Use your browser's find feature to speed up your search for variables relevant to your analysis. Note the names of the files in which the variables are stored; you will use them to identify the data files and documentation to download.
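The search described above can be sketched in Python as a keyword filter over a variable list. This is only an illustration: the rows below are a small hand-typed sample rather than a real download, and the helper names find_variables and files_to_download are hypothetical.

```python
# Illustrative rows only; consult the component variable lists on the
# NHANES website for the actual file names, variables, and descriptions.
VARIABLE_LIST = [
    {"file": "DEMO_B", "variable": "RIDAGEYR",
     "description": "Age at screening (years)"},
    {"file": "BPX_B", "variable": "BPXSY1",
     "description": "Systolic blood pressure, 1st reading"},
    {"file": "BPQ_B", "variable": "BPQ020",
     "description": "Ever told you had high blood pressure"},
]


def find_variables(rows, keyword):
    """Return rows whose description contains keyword (case-insensitive)."""
    needle = keyword.lower()
    return [row for row in rows if needle in row["description"].lower()]


def files_to_download(rows):
    """Return the sorted, de-duplicated file names for the matched rows."""
    return sorted({row["file"] for row in rows})


hits = find_variables(VARIABLE_LIST, "blood pressure")
needed_files = files_to_download(hits)
```

As the tutorial stresses, a keyword match is only a starting point; each hit still has to be checked against the documentation before it is used.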

All variables listed in the component variable lists have been publicly released and are available for download in the associated data file. If you wish to use a variable that is not listed in a component variable list, you will need to use the Research Data Center. You can review the Research Data Center for more information about how to obtain access to non-publicly released variables.

Consulting Documentation

There are three parts of the data documentation for each data file that you will consult to finish gathering background information on your variables.

  • codebook,
  • data file documentation, and
  • frequency tables.

Codebook

The codebook portion lists all the variables in the data file. Use it to determine what the values associated with a variable mean.

The Codebook and Frequencies are contained in the Data Documentation file.

Data File Documentation

Use the data file documentation to determine if the collection or measurement is appropriate for your analysis. The data file documentation outlines

  • brief description of the component,
  • data processing and file preparation, and
  • analytic recommendations and specific notes on using the data file.

Frequency Counts

The frequency file for each data file contains the frequency count for each item in the data file and can be used to verify the sample size for a particular data item. The frequency counts are found with the codebook in the documentation file.
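One way to use the codebook and frequency counts together is to tabulate your own extract and compare the counts with the published ones. The sketch below assumes a hypothetical codebook entry and made-up values; the codes and labels are placeholders to be replaced with those in the actual documentation file.

```python
from collections import Counter

# Hypothetical codebook entry for one categorical variable.
CODEBOOK = {"BPQ020": {1: "Yes", 2: "No", 7: "Refused", 9: "Don't know"}}

# Made-up raw values; None stands for a missing value.
values = [1, 2, 2, 1, 9, None, 2]


def frequency_table(variable, raw_values, codebook):
    """Count each coded response and report it under its codebook label."""
    counts = Counter(raw_values)
    labels = codebook[variable]
    table = {labels[code]: counts[code] for code in labels}
    table["Missing"] = counts[None]
    return table


table = frequency_table("BPQ020", values, CODEBOOK)
```

If the counts in your table differ from the published frequencies, the extract or import step probably needs to be rechecked.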

Download Data Files

NHANES data are saved in a SAS transport (.XPT) file. The SAS transport format allows extraction of the data file on Windows, UNIX, or Macintosh based systems. However, this tutorial focuses on Windows-based systems.

  • If you have SAS installed, your browser will recognize it as a SAS transport file and prompt you to save the file. Navigate to the TEMP folder you created in Task 1 and save the file there.
  • If you do not have SAS installed, clicking the link will result in gibberish appearing in your window as the browser tries to read the file. In this case, you will need to right-click the file and save it to your TEMP folder from Task 1.

After downloading the SAS transport files, you will need to extract or import them as datasets. Transport files are not usable without completing this task. See code in the links below:

SAS:

https://wwwn.cdc.gov/nchs/data/tutorials/DownloadData_Task3.sas

STATA:

https://wwwn.cdc.gov/nchs/data/tutorials/preparing_download_import.do

Append NHANES Data

An NHANES dataset for analysis will typically include data from 2 or more years and variables from more than one component. You will append to combine the years of data and merge to include variables from different components.

The process of combining years is called appending. Check the contents of the data files before appending the data because variable names may be different from cycle to cycle and recoded or derived variables may be added in different cycles. If the names or labels of the variables of interest have changed, you will have to find out whether the wording, definition, and/or response categories have been modified, and then recode the variables to make their names and response categories consistent before appending. Also note that NHANES adds or deletes survey items from time to time.

When appending NHANES data you should always include the SEQN number. SEQN stands for sequence number and is a unique identifier for each observation (participant) in NHANES. Every time you extract variables from an NHANES data file, you should include the SEQN variable in your selection. Failing to do so will lead to problems if you want to sort or merge your data files at a later time.
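A minimal sketch of appending, assuming each cycle has already been read into a list of records. The variable names CHOL_OLD and CHOL and the cycle labels are invented for illustration; in practice you would rename only after confirming in the documentation that the two variables measure the same thing.

```python
def harmonize(rows, rename_map):
    """Rename cycle-specific variable names to a common name."""
    return [{rename_map.get(k, k): v for k, v in row.items()} for row in rows]


def append_cycles(cycles):
    """Stack rows from several cycles, tagging each row with its cycle label."""
    combined = []
    for label, rows in cycles:
        for row in rows:
            combined.append({**row, "cycle": label})
    return combined


# Made-up records; SEQN is kept in every extract.
cycle_a = [{"SEQN": 1, "CHOL_OLD": 180}, {"SEQN": 2, "CHOL_OLD": 210}]
cycle_b = [{"SEQN": 101, "CHOL": 195}]

combined = append_cycles([
    ("cycle A", harmonize(cycle_a, {"CHOL_OLD": "CHOL"})),
    ("cycle B", cycle_b),
])
```

Keeping a cycle label on each row makes it easy to check later that every cycle contributed the expected number of participants.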

Merge NHANES Data

For each data cycle, data files are organized by their collection method, which can fall under one of five components: Demographics, Dietary, Examination, Laboratory, and Questionnaire. Putting variables from these data files together in one dataset is called merging.

The first step in merging data is to sort each of the data files by a unique identifier. In NHANES data, this unique identifier is known as the sequence number (SEQN). NHANES uses SEQN to identify each sample person, so SEQN is the variable you must use to merge data files. To ensure that all observations are ordered in the same way in each data file, you need to sort each data file by the SEQN variable.
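The sort-then-merge step can be sketched as follows, assuming two small made-up files with one record per SEQN. The function merge_on_seqn and the variable names AGE and SBP are hypothetical. The sketch keeps every record from the first file and fills unmatched fields with None, which mirrors the situation where not every participant has data in every component.

```python
def merge_on_seqn(left, right):
    """One-to-one left merge keyed on SEQN; unmatched fields become None."""
    right_by_seqn = {row["SEQN"]: row for row in right}
    right_fields = sorted({k for row in right for k in row if k != "SEQN"})
    merged = []
    # Sorting by SEQN gives every file the same, predictable order.
    for row in sorted(left, key=lambda r: r["SEQN"]):
        match = right_by_seqn.get(row["SEQN"], {})
        extra = {field: match.get(field) for field in right_fields}
        merged.append({**row, **extra})
    return merged


demo = [{"SEQN": 3, "AGE": 61}, {"SEQN": 1, "AGE": 34}, {"SEQN": 2, "AGE": 47}]
exam = [{"SEQN": 2, "SBP": 128}, {"SEQN": 1, "SBP": 117}]

merged = merge_on_seqn(demo, exam)
```

Here participant 3 has demographic data but no examination record, so SBP is None for that row rather than the row being dropped.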

Missing Data in NHANES

Missing values may distort your analysis results. You must evaluate the extent of missing data in your dataset to determine whether the data are usable without additional re-weighting for item non-response. As a general rule, if 10% or less of your data for a variable are missing from your analytic dataset, it is usually acceptable to continue your analysis without further evaluation or adjustment. However, if more than 10% of the data for a variable are missing, you may need to determine whether the missing values are distributed equally across socio-demographic characteristics, and decide whether imputation of missing values or use of adjusted weights is necessary. (Please see the Analytic Guidelines for more information.)

NHANES assigns missing values in the following way:

NHANES codes | Description | Action
. (period) | missing numeric value | None
(blank) | missing character value | None
7 or 77 or 777 | "refused" response | Code as missing (period or blank)
9 or 99 or 999 | "don't know" response | Code as missing (period or blank)
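The recoding and the 10% rule of thumb can be sketched as below. The values are made up, None stands in for a missing value, and the set of special codes is an assumption for a one-digit item. For a variable where 7 or 9 is a legitimate value (a count, for example), the refused and don't know codes will be different, so check the codebook first.

```python
# Assumed "refused" and "don't know" codes for a one-digit item.
REFUSED_OR_DONT_KNOW = {7, 9}


def recode_missing(values, special_codes):
    """Replace refused / don't know codes with None (missing)."""
    return [None if v in special_codes else v for v in values]


def percent_missing(values):
    """Unweighted percentage of records with a missing value."""
    return 100.0 * sum(v is None for v in values) / len(values)


raw = [1, 2, 2, 9, 1, None, 2, 7, 1, 2]
clean = recode_missing(raw, REFUSED_OR_DONT_KNOW)
pct = percent_missing(clean)

# More than 10% missing suggests a closer look before analysis.
needs_review = pct > 10.0
```

In this made-up sample, three of ten values end up missing, so the variable would call for further evaluation.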

Skip Patterns in NHANES Data

The significance of a skip pattern depends on the question leading to the skip pattern, the questions within that skip pattern, and the variables you intend to analyze. If you fail to check for skip patterns, you may obtain only a portion of the population, instead of the entire study population.

For example, in the blood pressure questionnaire, respondents were not asked any other questions relevant to high blood pressure (hence, the questions were skipped over) if they said "No" to question BPQ.020: "Have you ever been told by a doctor or other health professional that you had hypertension?" Those who answered "Yes" to this question were asked additional questions related to blood pressure, such as BPQ.030, "Were you told on 2 or more different visits that you had hypertension, also called high blood pressure?"

If you would like to estimate the prevalence of diagnosed hypertension (defined as at least two occurrences of a person ever being told by a doctor that he or she had hypertension) among US adults, you must recode BPQ.030 to include those who answered "No" in BPQ.020. Thus, until you recode BPQ.030 (or define a new variable based on it) to include those who answered "No" to BPQ.020, these people will be left out of the denominator value. If you fail to do this step, you will obtain the proportion of diagnosed hypertension only among a subpopulation of people who have ever been told by a doctor that they had hypertension, instead of the entire study population.

Check the codebook to determine if a skip pattern affects the variables in your analysis.
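The recode described in the hypertension example can be sketched as a derived variable. The response codes (1 for Yes, 2 for No) and the sample records are assumptions for illustration, and the proportions computed here are unweighted, so they show only the effect on the denominator and are not survey estimates.

```python
YES, NO = 1, 2  # assumed response codes; confirm them in the codebook


def diagnosed_hypertension(bpq020, bpq030):
    """Return 1 (yes), 0 (no), or None (missing) for diagnosed hypertension."""
    if bpq020 == NO:
        return 0  # BPQ030 was skipped, but the person belongs in the denominator
    if bpq020 == YES and bpq030 == YES:
        return 1
    if bpq020 == YES and bpq030 == NO:
        return 0
    return None  # refused, don't know, or missing


# Made-up (BPQ020, BPQ030) pairs; None means the question was not answered.
respondents = [(YES, YES), (YES, NO), (NO, None), (NO, None), (9, None)]

flags = [diagnosed_hypertension(a, b) for a, b in respondents]
valid = [f for f in flags if f is not None]
prevalence_all = sum(valid) / len(valid)

# What you would get by ignoring the skip pattern: only those asked BPQ030.
asked = [b for a, b in respondents if a == YES]
prevalence_skip_only = sum(b == YES for b in asked) / len(asked)
```

With these sample records the derived variable gives 0.25, while restricting to people who were asked BPQ.030 gives 0.5, which illustrates how ignoring the skip pattern inflates the result.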

Outliers in NHANES Data

Before you analyze your data, it is very important that you check the distribution and normality of the data and identify outliers for continuous variables with a univariate analysis. If the distribution is highly skewed, you can do a data transformation to make the distribution of the data closer to normal (the underlying assumption in most statistical analyses is that the distribution of the data is normal). The common types of transformation are LOGIT, LOG, LOG10, SQRT, INVERSE, or ARCSIN.

After checking the distribution and normality of the data, plot the survey weight against the variable to determine which of the extreme values identified in the univariate analysis are outliers. You must also determine if the outliers represent valid values and, if so, also carry extremely large survey weights. Outliers with extremely large weights could have an influential impact on your estimates. Please consult the Analytical Guidelines for more information on this topic.
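The two checks above can be sketched with the Python standard library. The records, the cutoffs, and the helper names are all invented for illustration. The listing of flagged records stands in for the scatter plot of weight against value, and the cutoffs are not NHANES recommendations.

```python
import math
import statistics


def skewness(values):
    """Population skewness (third standardized moment)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def flag_influential(records, value_cut, weight_cut):
    """Return SEQNs whose value and survey weight both exceed the cutoffs."""
    return [r["SEQN"] for r in records
            if r["value"] > value_cut and r["weight"] > weight_cut]


# Made-up, right-skewed measurements with made-up survey weights.
records = [
    {"SEQN": 1, "value": 10.0, "weight": 1000.0},
    {"SEQN": 2, "value": 20.0, "weight": 1200.0},
    {"SEQN": 3, "value": 40.0, "weight": 900.0},
    {"SEQN": 4, "value": 80.0, "weight": 1500.0},
    {"SEQN": 5, "value": 160.0, "weight": 1300.0},
    {"SEQN": 6, "value": 320.0, "weight": 1100.0},
    {"SEQN": 7, "value": 640.0, "weight": 9000.0},
]

raw = [r["value"] for r in records]
logged = [math.log(v) for v in raw]

raw_skew = skewness(raw)      # clearly positive for these values
log_skew = skewness(logged)   # close to zero after the LOG transformation

flagged = flag_influential(records, value_cut=300.0, weight_cut=5000.0)
```

A flagged record is not automatically an error; as noted above, you still need to decide whether the value is valid before changing or excluding it.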

Saving NHANES Data

To save your original dataset, or the dataset in memory, as a permanent file in your directory or library, see the code in the link below (scroll to "Save Datasets"):

https://wwwn.cdc.gov/nchs/nhanes/tutorials/samplecode.aspx
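For readers working outside SAS or Stata, a rough equivalent is to write the analytic dataset to a file and read it back to confirm nothing was lost. The sketch below swaps in a plain CSV file, which is not the format used in the linked sample code; the file name and records are made up.

```python
import csv
import os
import tempfile


def save_dataset(rows, path):
    """Write a list of records to a CSV file with a header row."""
    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def load_dataset(path):
    """Read the CSV file back; note that all values come back as strings."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


analytic = [{"SEQN": 1, "AGE": 34}, {"SEQN": 2, "AGE": 47}]
out_path = os.path.join(tempfile.gettempdir(), "analytic_example.csv")

save_dataset(analytic, out_path)
reloaded = load_dataset(out_path)
```

CSV does not preserve variable types or labels, so a format native to your statistical package is usually the better choice for long-term storage.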
