# Module 1: Datasets and Documentation

The NHANES website is the most important data source and analytical resource for all data users. The website contains both historic and current datasets, and covers a wide range of critical topics. This module describes how Continuous NHANES data and documentation are structured and organized.

Publicly-Released Datasets

Throughout the years, NHANES datasets and related information have been released in a variety of formats and different media. However, since the late 1990s, all publicly available data and related documentation are released and updated in a centralized location: the NHANES website.

Datasets contain data for persons who participated in the selected survey. The National Health and Nutrition Examination Survey (NHANES) datasets are labeled by cycle year. The website contains the public-use data files for each of the National Center for Health Statistics' national surveys, from the initial National Health Examination Survey (NHES) I dataset up to the most current dataset. Codebooks and documentation accompany each data file.

The data and documentation are organized across several types of pages:

• The Questionnaires, Datasets, and Related Documentation page lists all the survey cycles from the most recent to the most historic. It also includes a link to the Survey Methods and Analytic Guidelines and a link to a variable search tool for the current NHANES (1999 and onward). The Survey Methods and Analytic Guidelines link directs the user to information on (1) Plans and Operations, (2) Sample Design, (3) Estimation and Weighting, (4) Analytic Guidelines, (5) Response Rates and Population Totals, and (6) Other Resources, including a suggested citation for NHANES for use in publications.
• Each survey cycle page (titled by the survey cycle, e.g., NHANES 2015-2016) contains documentation about the survey, documentation on how to use the data, and links to each of the component pages. It is divided into four sections:
  • Data, Documentation, Codebooks, SAS Code, which has links to the primary component pages:
    • Demographics Data
    • Dietary Data
    • Examination Data
    • Laboratory Data
    • Questionnaire Data, and
    • Limited Access Data (documentation and codebooks only)
  • Using the Data, which has an overview of data collection methods, another link to the survey methods and analytic guidelines, and documentation on how the data were released,
  • Contents in Detail, which has detailed descriptions of the different items used in the survey, including questionnaire instruments, manuals, brochures, and consent forms, and
  • Contents at a Glance, which has information about the survey in general terms.
• The primary component page (titled by the survey cycle and then the component name, e.g., 2015-2016 Examination Data) links to the individual component data files (data), documentation (docs), and publication dates for each individual component.

## IMPORTANT NOTE

This layout applies to most Continuous NHANES datasets, that is, those from 1999 onward. Datasets from before 1999 have a different format and layout.

Limited Access Released Datasets

High-quality data are released according to these guiding principles:

• as widely as practicable,
• as soon as possible after data collection, and
• in as much detail as possible,

all while maintaining survey participant confidentiality.

As a result, some variables or entire data files are not publicly released due to disclosure concerns, for example, geographic identifiers and some sensitive topics. These files are only available through the Research Data Center (RDC). You may review the Data Release and Access Policy for more information. The Limited Access Data component page for each survey cycle contains documentation, including a codebook with frequencies, to assist data users preparing proposals to use the Research Data Center.

The NHANES survey content is determined after a rigorous evaluation process including consideration of criteria such as public health importance, feasibility of the proposed survey items and burden to survey participants. This information is important as you determine the scope of your analysis and which variables to include. For example, your analysis may require background information, such as:

• Overviews of the survey design, the target population, and data collection procedures,
• Survey Contents, which shows the years in which components were collected and when changes to the components occurred,
• Questionnaire Instruments, and
• Examination and Lab Procedures, which are the survey protocols for obtaining the physical examination measures in the survey.

Find Survey Overviews

On any survey cycle page (e.g., NHANES 2015-2016), the Using the Data section contains links to overviews containing key background information on the overall survey and the laboratory, questionnaire, and examination components.

Find Survey Contents

On the Survey Methods and Analytic Guidelines page, scroll to the Other Resources for Analysts section and click the Survey Contents link to open the PDF file. A link to the survey contents brochure is also available on each survey cycle page (e.g., NHANES 2015-2016) in the Contents at a Glance section.

This PDF file contains tables depicting the questionnaire, examination, and laboratory components that were conducted in each survey cycle, as well as when components changed from the original sample descriptions. By reviewing this chart you can determine if the components you are interested in were collected for all the years available or only some of the years, and design your analysis accordingly.

Find Questionnaire Instruments

From any survey cycle page (e.g., NHANES 2015-2016), select the Questionnaire Instruments link in the Contents in Detail section. This page contains links to the questionnaire instruments used both in participants' homes and in the Mobile Examination Center trailers.

For example, select the Blood Pressure link under the Sample Person Questionnaire heading. This link opens to the blood pressure questions asked of the survey participants and lists possible responses. As you design your analysis, be sure to review these questions, the possible answers, and whether or not the question was part of a skip pattern.

Find Examination and Laboratory Procedures and Laboratory Methods

From any survey cycle page (e.g., NHANES 2015-2016), select the Examination and Laboratory Procedures link in the Contents in Detail section. This page contains links to the manuals documenting the procedures for various examination components.

From any survey cycle page (e.g., NHANES 2015-2016), select the Laboratory Methods link in the Contents in Detail section. This page contains links to the laboratory methods for each laboratory component.

Be sure to review the protocols as necessary.

Cycles

The current NHANES, also known as Continuous NHANES, refers to the two-year cycles of data produced since 1999.

Primary Components

Each cycle is divided into five sections labeled by collection method: Demographics, Dietary, Examination, Laboratory, and Questionnaire.

• Demographics files contain survey design variables such as weights, strata and primary sampling units, as well as demographic variables.
• Dietary files contain data collected from participants on their dietary intake, which includes foods, beverages, and dietary supplements.
• Examination files contain information collected through physical exams and dental exams.
• Laboratory files contain results from analyses of blood, urine, hair, air, tuberculosis skin test, and household dust and water specimens.
• Questionnaire files contain data collected through household and mobile examination center interviews.

Individual Component Data Files

Within each section are many individual components — groups of related variables packaged in a data file. This division allows for efficiency in posting the data files to the website as soon as each component is completed and reviewed and allows for faster download times.

The same data structure applies to each of the Continuous NHANES survey cycles: the five primary components listed previously, each made up of individual data files. Examples of these data files are listed below.

Primary Components and Data Files

• Demographics – One file that includes demographic variables as well as survey weights and other survey design variables
• Examination – Individual files on Audiometry, Blood Pressure, Body Measures, Muscular Strength, Oral Health, Vision Exam, etc.
• Laboratory – Individual files on Urine Collection, Hepatitis A virus, HIV, Heavy Metals, Plasma Glucose, Total Cholesterol, Triglycerides, etc.
• Questionnaire – Individual files on Alcohol Use, Balance, Blood Pressure, Diabetes, Drug Use, Social Support, Vision, Weight History, etc.
• Dietary – Individual files for the Dietary Interview, Supplement Use, etc.

Usually, analysis will require data from more than one component. For instance, age and gender are in the Demographics component, while blood pressure measurements are in the Examination component, cholesterol variables are in the Laboratory component, and questions about previous diagnoses or taking medications for hypertension are in the Questionnaire component. All these variables may be required in a complete analysis of cardiovascular disease.

Your analysis will require a subset of the variables available in NHANES, from one or more survey cycles. To decide which variables are needed in your analysis, you need to identify potential analysis variables and review the survey documentation. There are several ways to identify potential variables for your analysis.

The Survey Content Brochure can help you determine which survey cycles contained questionnaire topic areas or examination and laboratory components relating to your topic of interest and whether the component changed across survey cycles.

To look for specific variables, you can perform a keyword search by following the Search Variables link from the Questionnaires, Datasets, and Related Documentation page. You can search across all survey cycles or restrict your search to a single data release cycle. Read the documentation for each "hit" in your search results carefully, as not every result returned will be relevant to your analysis.

You can also browse variable lists organized by survey cycle and data collection method. On each component page (e.g. 2015-2016 Examination), the first link is a variable list, containing the variable name, short description, and data file name. All variables listed in the component variable lists have been publicly released and are available for download in the associated data file. If you wish to use a variable that is not listed in a component variable list, you will need to use the Research Data Center. You can review the Research Data Center for more information about how to obtain access to limited access variables.

As you identify potential variables for your analysis, note the data file name associated with each variable so you can review the variable's documentation and download the data file.

Read the documentation for each of your potential analysis variables carefully, as not every result returned will be relevant to your analysis. For example, assume you are preparing for an analysis using cholesterol variables, and search by keyword "triglycerides." The Standard Biochemistry Profile file (BIOPRO) contains a variable for Triglycerides (variable LBXSTR). However, the laboratory test results for triglycerides using the reference analytic method (variable LBXTR) are contained in the Cholesterol - LDL & Triglycerides file (TRIGLY). This is the appropriate variable to use for the most accurate data analysis.

You must read the documentation and identify the correct variables for your analysis.

Each primary component page (e.g. 2015-2016 Examination Data) contains a list of individual components that have been publicly released. For each component, there is a data file and a documentation file that contains the data file documentation, codebook, and frequency counts.

Data File Documentation

Use the data file documentation to determine if the collection or measurement is appropriate for your analysis. The data file documentation outlines:

• a brief description of the component
• the eligible sample for this component
• protocol and procedures
• quality assurance and quality control
• data processing and file preparation, and
• analytic recommendations and specific notes on using the data file.

Codebook and Frequency Counts

The codebook portion lists all the variables in the data file. Use it to determine what the values associated with a variable mean. The frequency counts can be used to understand the coding of skip patterns among the variables in the individual component and to verify the sample size for a particular data item.

For more information on the metadata (information about a data item) included in the documentation files, refer to the General Information about NHANES Documentation Files.

NHANES data files are saved in SAS transport (.XPT) format. In addition to SAS, other software packages such as SUDAAN, SPSS, Stata, and R can read SAS transport files. For statistical or analytical packages that do not support the SAS transport file format, you can convert the file to a different format using the free SAS Universal Viewer.

• If you have SAS installed, your browser may recognize it as a SAS transport file and prompt you to save the file on your hard drive when you left-click the link to the data file.
• If you do not have SAS installed, clicking the link may result in gibberish appearing in your window as the browser tries to read the file. In this case, you will need to right-click the file and save it to a folder on your hard drive. After downloading the SAS transport files, you will need to extract or import them as datasets. Transport files are not usable without completing this task.
• Many software packages can also download SAS transport files programmatically (within the code).

The sample code page Downloading and importing Continuous NHANES data files contains sample SAS, Stata, and R code to import SAS transport files from your hard drive or to download the files from the NHANES website programmatically.

An NHANES dataset for analysis will typically include data from two or more years (i.e. one or more survey cycles) and variables from more than one individual component. You will append to combine the years of data and merge to include variables from different components.

The process of combining years is called appending. Check the contents of the data files before appending the data because variable names may be different from cycle to cycle and recoded or derived variables may be added in different cycles. If the names or labels of the variables of interest have changed, you will have to find out whether the wording, definition, and/or response categories have been modified, and then recode the variables to make their names and response categories consistent before appending. Also note that NHANES adds or deletes survey items from time to time.

When appending NHANES data, you should always include the SEQN variable. SEQN stands for sequence number and is a unique identifier for each observation (participant) in NHANES. Every time you extract variables from an NHANES data file, include SEQN in your selection. Failing to do so will lead to problems if you want to sort or merge your data files at a later time.
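The appending step described above can be sketched in plain Python. This is a minimal illustration rather than NHANES sample code: the records are made up, `OLD_NAME` and `NEW_NAME` are hypothetical stand-ins for a variable whose name changed between cycles, and the `cycle` field is an added convenience, not an NHANES variable.

```python
def append_cycles(cycles, rename_maps):
    """Stack records from several survey cycles into one list.

    cycles: dict mapping a cycle label to a list of record dicts.
    rename_maps: dict mapping a cycle label to {old_name: new_name},
        used to harmonize variable names before appending.
    """
    combined = []
    for label, records in cycles.items():
        renames = rename_maps.get(label, {})
        for record in records:
            # Every extracted record must carry SEQN so it can be merged later.
            if "SEQN" not in record:
                raise ValueError(f"record in cycle {label} has no SEQN")
            harmonized = {renames.get(name, name): value
                          for name, value in record.items()}
            harmonized["cycle"] = label  # remember which cycle the row came from
            combined.append(harmonized)
    return combined


# Made-up records: the same measurement is named differently in each cycle.
cycles = {
    "2013-2014": [{"SEQN": 1, "OLD_NAME": 5.1}, {"SEQN": 2, "OLD_NAME": 4.7}],
    "2015-2016": [{"SEQN": 3, "NEW_NAME": 5.4}],
}
rename_maps = {"2013-2014": {"OLD_NAME": "NEW_NAME"}}

combined = append_cycles(cycles, rename_maps)
```

Renaming is only safe once the documentation confirms that the wording, definition, and response categories match across cycles. If they differ, the values need recoding as well, which this sketch does not attempt.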

For each data cycle, data files are organized by their collection method, which falls under one of the five primary components: Demographics, Dietary, Examination, Laboratory, and Questionnaire. Putting the individual component data files together into one dataset is called merging.

The first step in merging data is to sort each of the data files by a unique identifier. In NHANES data, this unique identifier for each sample person is known as the sequence number (SEQN). Most NHANES data files contain exactly one record for each sample person who participated in that component (note, however, that not all sample persons participated in every component). For example, the Demographic Variables and Sample Weights (DEMO) file contains one record for each sample person, and the Body Measures (BMX) file contains one record for each sample person who participated in the MEC examination. For these files, SEQN uniquely identifies each record, so you must use SEQN as the key variable to merge.
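A one-to-one merge on SEQN can be sketched as follows. This is an illustrative sketch with made-up records, and the variable names `age` and `bmi` are hypothetical. A dictionary keyed on SEQN plays the role of the sort step here, and the function refuses to proceed if SEQN is not unique in the second file.

```python
def merge_by_seqn(base, other):
    """Left-merge two person-level files on SEQN.

    Both inputs are lists of dicts with exactly one record per SEQN.
    Every record in `base` is kept; variables from `other` are added
    when that person appears there and are simply absent otherwise.
    """
    other_by_seqn = {}
    for record in other:
        if record["SEQN"] in other_by_seqn:
            raise ValueError(f"SEQN {record['SEQN']} is not unique")
        other_by_seqn[record["SEQN"]] = record

    merged = []
    for record in base:
        row = dict(record)  # copy so the input is not modified
        row.update(other_by_seqn.get(record["SEQN"], {}))
        merged.append(row)
    return merged


# Made-up data: person 2 did not take part in the examination component.
demo = [{"SEQN": 1, "age": 34}, {"SEQN": 2, "age": 51}, {"SEQN": 3, "age": 8}]
body = [{"SEQN": 1, "bmi": 27.3}, {"SEQN": 3, "bmi": 16.9}]

merged = merge_by_seqn(demo, body)
```

Keeping every record from the demographics file means that people who did not take part in the second component stay in the dataset with that variable missing.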

However, some NHANES data files may contain multiple records for each sample person. For files with this structure, SEQN is not a unique identifier. Some examples of data files with this multiple record structure include:

• Prescription Medications (RXQ_RX),
• Dietary Interview - Total Nutrient Intakes:
  • First Day (DR1TOT) and
  • Second Day (DR2TOT)
• Dietary Interview - Individual Foods:
  • First Day (DR1IFF) and
  • Second Day (DR2IFF)
• Dietary Supplement Use - Total Dietary Supplements:
  • First Day of 24-Hour Recall (DS1TOT)
  • Second Day of 24-Hour Recall (DS2TOT)
  • 30-Day (DSQTOT)
• Dietary Supplement Use - Individual Dietary Supplements:
  • First Day of 24-Hour Recall (DS1IDS)
  • Second Day of 24-Hour Recall (DS2IDS)
  • 30-Day (DSQIDS)
• Physical Activity Monitor (PAXRAW) for NHANES 2003-2006

Analysts need to be aware of this data structure when merging files. For example, analysts using the Prescription Medications data (RXQ_RX) would need to transform the detailed drug-level file into a person-level file (with one record for each person) before merging it with NHANES demographic and other data files by using SEQN as the unique identifier.

Review the data file documentation for information about the file structure for each data file you are using, and check your record counts after each merge step to ensure that your analysis dataset is created as intended.
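The sketch below illustrates one way to collapse a multiple-record file to one record per person and then check the record count after merging. The records and the `drug` field are made up, `n_drugs` is a hypothetical derived variable, and treating people who are absent from the drug-level file as having a count of zero is an assumption to verify against the data file documentation.

```python
from collections import defaultdict


def collapse_to_person_level(drug_records):
    """Reduce a drug-level file to one record per SEQN with a drug count."""
    counts = defaultdict(int)
    for record in drug_records:
        counts[record["SEQN"]] += 1
    return [{"SEQN": seqn, "n_drugs": n} for seqn, n in sorted(counts.items())]


# Made-up data: person 1 has two drug records, person 2 has none.
demo = [{"SEQN": 1}, {"SEQN": 2}, {"SEQN": 3}]
drugs = [
    {"SEQN": 1, "drug": "A"},
    {"SEQN": 1, "drug": "B"},
    {"SEQN": 3, "drug": "C"},
]

person_level = collapse_to_person_level(drugs)
by_seqn = {r["SEQN"]: r for r in person_level}

# Merge onto the person-level demographics records.
merged = [
    {**person, "n_drugs": by_seqn.get(person["SEQN"], {}).get("n_drugs", 0)}
    for person in demo
]

# Record-count check: merging must not add or drop people.
if len(merged) != len(demo):
    raise RuntimeError("record count changed during merge")
```

Merging the drug-level file directly would have produced four rows for three people, which is the kind of error the record-count check is meant to catch.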

Missing values may distort your analysis results. You must evaluate the extent of missing data in your dataset to determine whether the data are usable without additional re-weighting for item non-response. As a general rule, if 10% or less of data for the main outcome variable for a specific component are missing from your analytic dataset, it is usually acceptable to continue your analysis without further evaluation or adjustment. However, if more than 10% of the data for a variable are missing, you may need to further examine respondents and non-respondents with respect to the main outcome variable, and decide whether imputation of missing values or use of adjusted weights is necessary. (Please see the Analytic Guidelines for more information.)
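The 10% rule of thumb can be checked with a few lines of code. This is a minimal sketch on made-up records: `outcome` is a hypothetical variable name, missing values are represented as `None`, and the records passed in are assumed to be those eligible for the item.

```python
def missing_fraction(records, variable):
    """Share of records whose value for `variable` is missing (None)."""
    if not records:
        raise ValueError("no records")
    n_missing = sum(1 for r in records if r.get(variable) is None)
    return n_missing / len(records)


# Made-up data: 40 eligible records, every eighth one missing the outcome.
records = [
    {"SEQN": i, "outcome": None if i % 8 == 0 else 1.0}
    for i in range(1, 41)
]

fraction = missing_fraction(records, "outcome")
needs_review = fraction > 0.10  # threshold taken from the rule of thumb above
```

Here 5 of the 40 records are missing the outcome, so the fraction exceeds 10% and the variable would call for a closer look at respondents and non-respondents.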

NHANES assigns missing values in the following way:

| NHANES code | Description | Action |
|---|---|---|
| . (period) | Missing numeric value | None |
| (blank) | Missing character value | None |
| 7 or 77 or 777 | "Refused" response | Code as missing (period or blank) |
| 9 or 99 or 999 | "Don't know" response | Code as missing (period or blank) |

Check the codebook to identify the missing value codes for your analysis variables. Failing to identify these other types of missing data, and treating the assigned values for "refused" or "don't know" as numerical values, may distort analysis results.
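Recoding "refused" and "don't know" responses to missing might look like the sketch below. The variable names `two_level_item` and `count_item` and their code sets are hypothetical; in practice the codes for each variable come from its codebook entry, because a value such as 7 is a missing-value code for one item and a legitimate value for another.

```python
def recode_missing(value, missing_codes):
    """Return None when `value` is one of the item's missing-value codes."""
    return None if value in missing_codes else value


# Hypothetical per-variable codes, as they would be read from the codebook.
MISSING_CODES = {
    "two_level_item": {7, 9},      # 7 = refused, 9 = don't know
    "count_item": {777, 999},      # here 7 and 9 are valid counts
}

records = [
    {"SEQN": 1, "two_level_item": 1, "count_item": 7},
    {"SEQN": 2, "two_level_item": 7, "count_item": 999},
    {"SEQN": 3, "two_level_item": 9, "count_item": 12},
]

cleaned = [
    {name: recode_missing(value, MISSING_CODES.get(name, set()))
     for name, value in record.items()}
    for record in records
]
```

Keeping the code sets per variable avoids the mistake of blanking every 7 or 9 in the dataset, which would destroy valid values in count or measurement variables.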

If the codebook and frequency counts show that an item has a large number of "true" missing values (i.e. a period for missing numeric value or blank for missing character value), review the documentation to understand why that item was not assessed for those respondents. The item may have a more restricted target gender or target age than other items in that data file, or it may be part of a skip pattern. Be sure to account for any such features when you design your analysis.

## WARNING

Do not drop any records from your analysis dataset, even records for participants with missing values for a variable of interest. In order to properly account for the sample design and obtain the correct variance estimates, you must include all individuals with a positive sample weight in your analysis procedure.

See Module 4: Variance Estimation for more details.

The significance of a skip pattern depends on the question leading to the skip pattern, the questions within that skip pattern, and the variables you intend to analyze. If you fail to check for skip patterns, your estimates may describe only a portion of the study population instead of the entire study population.

For example, suppose you want to conduct an analysis of current cigarette smoking status among adults aged 18 and over. The questions in the Smoking - Cigarette Use (SMQ) component allow data users to categorize smoking status using the same categories (current smokers, former smokers, and never smokers) as in the National Health Interview Survey. Current smokers are defined as adults who have smoked at least 100 cigarettes in their lifetimes and who currently smoke cigarettes every day or some days. (See the NHIS Adult Tobacco Use Information website for more details on the definitions of smoking status used in the NHIS.)

Participants aged 18 and over are first asked whether they have smoked at least 100 cigarettes in their entire life (variable SMQ020). Only those participants who answer "Yes" to this question are asked question SMQ040, "Do you now smoke cigarettes?" If you would like to estimate the percentage of the US population aged 18 and over who currently smoke cigarettes, you must recode variable SMQ040 (or create a new variable based on it) to include those who answered "No" to question SMQ020 ("never smokers"). Until you do, these people are left out of the denominator. If you skip this step, you will instead obtain the proportion of current smokers among the subpopulation of adults who have ever smoked 100 or more cigarettes (i.e., "current smokers" plus "former smokers").
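The recode can be sketched as a derived smoking-status variable. The numeric codes used here (SMQ020: 1 = Yes, 2 = No; SMQ040: 1 = every day, 2 = some days, 3 = not at all) are assumptions to confirm against the codebook for your cycle. The records are made up, and the proportions are unweighted, for illustration only.

```python
def smoking_status(smq020, smq040):
    """Derive a three-level smoking status from SMQ020 and SMQ040.

    Assumed codes (verify in the codebook):
      SMQ020: 1 = Yes, 2 = No
      SMQ040: 1 = every day, 2 = some days, 3 = not at all
    Any other combination (refused, don't know, missing) returns None.
    """
    if smq020 == 2:
        return "never"      # skipped SMQ040, but belongs in the denominator
    if smq020 == 1:
        if smq040 in (1, 2):
            return "current"
        if smq040 == 3:
            return "former"
    return None


# Made-up adult records as (SMQ020, SMQ040); None marks a skipped question.
adults = [(1, 1), (1, 2), (1, 3), (2, None), (2, None), (2, None), (2, None), (1, 3)]
statuses = [smoking_status(a, b) for a, b in adults]

n_current = statuses.count("current")
# Correct denominator: all adults with a known status.
share_all_adults = n_current / sum(s is not None for s in statuses)
# Denominator if the skip pattern is ignored: ever-smokers only.
share_ever_smokers = n_current / sum(s in ("current", "former") for s in statuses)
```

In this toy example the two denominators give 25% and 50%, which shows how far an estimate can shift when never smokers are left out.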

Check the codebook to determine if a skip pattern affects the variables in your analysis. Even if your variable of interest is in the middle of the documentation file, review the codebook from the beginning. Be aware that responses to earlier items may trigger a skip pattern that affects your item; this would be indicated in the "skip to item" column of the frequency tables. The codebooks for some data files also include numbered items labeled "check item" that provide skip pattern instructions. If the codebook and frequencies show a large number of "true" missing values (i.e. a period for missing numeric value or blank for missing character value), this may be a sign that the item is part of a skip pattern.

For example, consider the skip pattern described above for the cigarette smoking questions. Take a moment to review the documentation for the Smoking – Cigarette Use component of NHANES 2015–2016. The data file documentation at the top of the page provides important information about this component as a whole. Participants aged 12 years and older were eligible for this component, and youths aged 12-17 years were asked a limited set of the questions. Next, examine the codebook entry for variable SMQ020 (smoked at least 100 cigarettes in life). Note that the target for this question is participants aged 18 and over, and that there are 1,009 missing values. These missing values are for youths aged 12-17, who were not asked this question. The codebook entry shows that this variable is part of a skip pattern, since the "Skip to Item" column is filled in for participants who answer "No", "Refused", or "Don't know" to this question. Finally, examine the codebook entry for variable SMQ040 ("Do you now smoke cigarettes?"). Again, the target is adults aged 18 and over, but there are 4,579 missing values for this question. This question has more missing values because it is part of a skip pattern. The missing values are for the 1,009 youths aged 12-17 plus the 3,570 adults aged 18 and over who answered "No", "Refused", or "Don't know" to the earlier question SMQ020.

Before you analyze your data, check the distribution and normality of the data and identify outliers for continuous variables with a univariate analysis. These descriptive statistics are useful in determining whether parametric or non-parametric methods are appropriate to use, and whether you need to recode or transform data to account for extreme values and outliers. If the distribution is highly skewed, you can transform the data to bring the distribution closer to normal. Common transformations include LOGIT, LOG, LOG10, SQRT, INVERSE, and ARCSIN.
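As a small illustration of how a transformation changes the shape of a distribution, the sketch below compares a moment-based skewness statistic before and after a natural-log transformation. The values are made up and unweighted. The log transformation assumes strictly positive data, and the skewness formula used is the simple population form, not the output of any particular statistical package.

```python
import math
import statistics


def skewness(values):
    """Population skewness: mean cubed deviation over the cubed std. deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("skewness is undefined for constant data")
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


# Made-up, right-skewed measurements (all strictly positive).
values = [1, 2, 2, 3, 3, 4, 5, 8, 20, 60]
logged = [math.log(v) for v in values]

raw_skew = skewness(values)
log_skew = skewness(logged)
```

For these values the log-transformed data are noticeably less skewed than the raw data, although still not symmetric. Whether a transformation is adequate should be judged from the full univariate output rather than from a single statistic.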

After checking the distribution and normality of the data, plot the survey weight against the variable of interest to examine whether the extreme values identified in the univariate analysis may be influential outliers. Records with large sample weights can be influential in an analysis, especially when extreme weights are associated with extreme data points for the variable of interest. In analyses that involve categorizing a continuous variable, records with large sample weights can be influential if the value of the analysis variable lies near the threshold between categories. For example, in an analysis of the prevalence of obesity in adults, records with large sample weights and with Body Mass Index (BMI) values just above or just below 30 kg/m² could be influential outliers, since obesity is defined as BMI ≥ 30.
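A simple screening rule for the obesity example might look like the sketch below. It is only one possible approach: the records are made up, and the band of 0.5 kg/m² around the threshold and the cutoff of three times the median weight are hypothetical choices, not NHANES recommendations.

```python
import statistics


def flag_near_threshold(records, threshold=30.0, band=0.5, weight_ratio=3.0):
    """Return SEQNs of records with a large weight and a value near the threshold.

    A weight counts as "large" when it is at least `weight_ratio` times the
    median weight; a value counts as "near" when it lies within `band` of
    `threshold`. Both rules are illustrative assumptions.
    """
    median_weight = statistics.median([r["weight"] for r in records])
    flagged = []
    for r in records:
        heavy = r["weight"] >= weight_ratio * median_weight
        close = abs(r["bmi"] - threshold) <= band
        if heavy and close:
            flagged.append(r["SEQN"])
    return flagged


# Made-up records: sample weight and BMI for six adults.
records = [
    {"SEQN": 1, "weight": 1000, "bmi": 22.0},
    {"SEQN": 2, "weight": 1200, "bmi": 29.8},
    {"SEQN": 3, "weight": 900, "bmi": 30.2},
    {"SEQN": 4, "weight": 1100, "bmi": 35.0},
    {"SEQN": 5, "weight": 5000, "bmi": 29.9},
    {"SEQN": 6, "weight": 4800, "bmi": 41.0},
]

flagged = flag_near_threshold(records)
```

Only record 5 is flagged: it combines a large weight with a BMI just below 30, so a small change in its measured value would move a large share of the weighted total across the obesity cutoff. Flagged records are candidates for review, not for deletion.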

For a more detailed discussion and examples of identifying influential observations in NHANES data, see the paper by Carroll and Curtin (2002). The sample code page also contains examples of checking for outliers and influential observations.

Reference

Carroll, MD and Curtin, LR. "Influential Observations in the National Health and Nutrition Examination Survey, 1999-2000." Proceedings of the Survey Research Methods Section, ASA (2002) http://www.asasrms.org/Proceedings/y2002/Files/JSM2002-000924.pdf