First published: May 27, 2021
Last Updated: June 22, 2021
A full report is available here: National Health and Nutrition Examination Survey, 2017–March 2020 Prepandemic File: Sample Design, Estimation, and Analytic Guidelines
The coronavirus disease 2019 (COVID-19) pandemic required suspension of the National Health and Nutrition Examination Survey (NHANES) 2019-2020 field operations in March 2020. As a result, data collection for the NHANES 2019-2020 cycle was not completed and the collected data are not nationally representative. By March 2020, data collection had been completed in 18 of the 30 locations, or primary sampling units (PSUs; usually a county or a group of counties), in the 2019-2020 sample. Data collection was canceled for the remaining 12 PSUs. Because the data collected from these 18 PSUs are not nationally representative, analyses based solely on the partial 2019-2020 data would not be generalizable to the U.S. civilian non-institutionalized population. Therefore, the partial 2019-2020 data (herein 2019-March 2020 data) were combined with the full data set from the previous cycle (2017-2018) to create nationally representative 2017-March 2020 pre-pandemic data files.
A comprehensive National Center for Health Statistics (NCHS) Series 2 report describing the sample design, weighting methodology, response rates, and variance estimation procedures for these files will be released later in 2021. Estimates of selected prevalent health conditions from these files and more information about the suggested uses of the files are presented in an NCHS National Health Statistics Report. The following brief overview discusses the process used to construct and release the 2017-March 2020 pre-pandemic data files, which may be used to make nationally representative estimates for this period, and the implications of combining a partial data cycle (2019-March 2020) with a previously released data cycle (2017-2018).
Impact of the pandemic on 2019-2020 data collection
The sample for the NHANES 2019-2020 survey cycle was selected using a 4-year sample design for 2019-2022. Following methods from previous sample designs (1), each 2-year data cycle from this design was intended to be nationally representative and to include 30 PSUs, for a total of 60 PSUs in the 4-year 2019-2022 sample. However, the 2019-2020 NHANES operations were suspended in mid-March 2020 due to the pandemic, and all data collection for the remainder of the cycle was canceled. Because data collection was truncated, the 2019-March 2020 sample is not nationally representative and unbiased estimates cannot be produced from this partial cycle alone. To provide nationally representative estimates, the 2019-March 2020 data were combined with the 2017-2018 data using additional weighting procedures. The resulting files are referred to as the NHANES 2017-March 2020 pre-pandemic data files.
Creation of the 2017-March 2020 pre-pandemic data set
Combining NHANES 2017-2018 and 2019-March 2020 data presented challenges because PSUs for these cycles were selected from different 4-year sample designs: the 2017-2018 cycle was drawn from the 2015-2018 sample design (1) and the 2019-2020 cycle was drawn from the 2019-2022 sample design. In the first sampling stage of both sample designs, states were categorized into four health groups according to health index values, and PSUs were split into three or four major strata within each health group based on urban-rural population distribution and other characteristics of the locality, creating 14 major strata in total (1). One PSU was selected from each stratum each year, with probability related to its measure of size (a weighted population count), resulting in an equal number of sampled PSUs per major stratum in each NHANES survey design (1). An additional “certainty” PSU (a large metropolitan area with such a large measure of size that its inclusion in the survey is guaranteed) was also selected each year (1). Thus, both designs called for 15 PSUs to be sampled each year, for a total of 30 PSUs for each 2-year cycle and 60 PSUs for the 4-year period covered by the sample design. However, because population size and other characteristics that determine major strata membership change over time, the strata comprised different PSUs in the 2015-2018 design than in the 2019-2022 design.
An operational change in data collection, starting in 2019, had additional implications for the 2019-March 2020 data. Prior to 2019, the 15 PSUs fielded each year were nationally representative (i.e., the 2017 and 2018 samples were each nationally representative). However, to reduce travel costs in the 2019-2020 cycle, the sequence in which PSUs were visited for data collection was reordered by geography to minimize travel time and maximize time spent in each location for data collection; consequently, only the full 2-year sample was representative. That is, neither the PSUs fielded in 2019 nor the PSUs fielded in 2019-March 2020 were nationally representative.
Because the 18 PSUs in the 2019-March 2020 sample did not represent the nation (or any defined population), the 2019-2020 PSUs were reassigned to the 2015-2018 sample design strata and combined with the 2017-2018 data to create a data set that could be used for nationally representative estimates. An alternative approach, reclassifying the 2017-2018 PSUs into the 2019-2022 strata, was initially considered. However, at the time field operations were halted, some of the 2019-2022 sample design strata included no PSUs with data collected in either 2019-March 2020 or 2017-2018, so analyzing the completed PSUs with respect to the 2019-2022 sample design strata could not yield nationally representative estimates. Consequently, the 2015-2018 sample design strata were used instead. The combined 2017-March 2020 pre-pandemic data file thus comprises a total of 48 PSUs: 30 PSUs from the 2017-2018 cycle and 18 PSUs from the 2019-March 2020 data collection, with unequal numbers of PSUs per major stratum. Special weighting measures were needed to calibrate the data set back to an equal number of PSUs across major strata, as specified in the 2015-2018 sample design.
Weighting and adjustment factor for incomplete data collection
A PSU-level adjustment factor was created to equalize the contribution of each stratum to the total survey sample and was applied to participant base weights. This approach is similar to one that could be used to combine a probability sample with a non-probability sample, modeling the contribution of the 2019-2020 data based on the 2015-2018 stratification. In this application, the fact that the 2019-2020 PSUs were a subset of a probability sample adds credibility to the approach. The PSU-level adjustment was derived to effectively increase the weights from underrepresented strata and reduce the weights from overrepresented strata, while improving the efficiency of the combined sample (2). Sample weights (interview weights and mobile examination center [MEC] weights) were then calculated using previous methodology (1).
Evaluation of weighting procedures
The performance of the interview weights was assessed by comparing the demographic characteristics of the weighted NHANES 2017-March 2020 pre-pandemic sample to nationally representative estimates from the 2018 5-year American Community Survey (ACS). The 2018 5-year ACS and NHANES 2017-March 2020 pre-pandemic estimates were moderately or well matched for adult education, marital status, household composition, and health insurance coverage (comparisons used the same methodology used to compare 2017-2018 NHANES estimates to ACS estimates) (3). However, there was a statistically significant difference in urban/rural distribution between NHANES and ACS across the four categories: large central, large fringe, medium/small metropolitan areas, and micropolitan/noncore. The weights were therefore recalibrated to NCHS rural-urban code categories in addition to the three dimensions used for the NHANES 2017-2018 weighting calibration (race-Hispanic origin-age-sex demographic subgroups, race-Hispanic origin-sex-education level subgroups, and area-level household income) (1). With this additional dimension, the weighted 2017-March 2020 sample was well matched to the urban/rural distribution of the 2018 5-year ACS file, with no statistically significant differences between NHANES and ACS in the proportion of the sample in each urban/rural designation category.
Recommendations for analyses of NHANES 2017-March 2020 pre-pandemic data files
Although a thorough evaluation was conducted to assess the reliability of using this dataset to produce national estimates, analysts should keep the following considerations in mind before analyzing these data.
First, the sample weights that appear on the data file should be used to calculate estimates using the combined 2017-March 2020 pre-pandemic data. The weights are designed to yield nationally representative estimates for the entire period covered by the 2017-March 2020 pre-pandemic files. As for all analyses using NHANES data, the complex survey design should be taken into account for estimation of variance (4).
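As an illustration of what taking the complex survey design into account can involve, the following minimal Python sketch computes a weighted mean and a standard error using Taylor series linearization with a with-replacement approximation at the PSU level. It is a generic textbook calculation shown under stated assumptions, not the NCHS procedure: the function name, the tuple layout, and the toy records are hypothetical, and the actual weight and masked design variable names should be taken from the data file documentation. Established survey software is generally preferable for real analyses.

```python
from collections import defaultdict
from math import sqrt


def weighted_mean_with_se(records):
    """Return (weighted mean, design-based standard error).

    `records` is an iterable of (stratum, psu, weight, value) tuples.
    Assumes at least two PSUs per stratum and treats PSUs as sampled
    with replacement within strata (a common approximation).
    """
    records = list(records)
    total_weight = sum(w for _, _, w, _ in records)
    mean = sum(w * y for _, _, w, y in records) / total_weight

    # Sum the linearized scores w * (y - mean) / total_weight within each PSU.
    psu_scores = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, y in records:
        psu_scores[stratum][psu] += w * (y - mean) / total_weight

    # Between-PSU variance within each stratum, summed over strata.
    variance = 0.0
    for stratum, scores_by_psu in psu_scores.items():
        scores = list(scores_by_psu.values())
        n_psu = len(scores)
        if n_psu < 2:
            raise ValueError(f"stratum {stratum!r} needs at least 2 PSUs")
        centre = sum(scores) / n_psu
        variance += n_psu / (n_psu - 1) * sum((s - centre) ** 2 for s in scores)

    return mean, sqrt(variance)


# Made-up toy records, not NHANES data: (stratum, psu, weight, value)
toy = [(1, 1, 1.0, 1.0), (1, 2, 1.0, 3.0), (2, 1, 1.0, 2.0), (2, 2, 1.0, 2.0)]
mean, se = weighted_mean_with_se(toy)
```

Ignoring the strata and PSUs and treating the records as a simple random sample would generally give a different, and often too small, standard error.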
Second, comparisons or examination of trends between 2017-2018 and 2019-March 2020 data are not possible and should not be conducted because the 2019-March 2020 data do not represent any defined population. To prevent data analysts from separating the two cycles, changes have been made to respondent sequence identification numbers, and PSU and strata variables have been masked. Furthermore, there are no appropriate survey weights to make national estimates using 2019-March 2020 data.
Third, PSU-level adjustments to survey weights were designed for overall estimates and not for specific subgroups. Any trend comparisons for subgroups (e.g., by age, sex, or race and Hispanic origin) between the 2017-March 2020 pre-pandemic file and previous NHANES cycles should be interpreted with caution. The magnitude and direction of the trend within a certain subgroup may differ from the overall trend. In conducting analyses and interpreting results, analysts should consider the historical context of the trends in addition to the methodological approach used to create the 2017-March 2020 pre-pandemic file. When analyzing trends that include 2017-March 2020 data, analysts should also take unequally spaced intervals into account. For example, if observed time points are 2011-2012, 2013-2014, 2015-2016, and 2017-March 2020, then the interval midpoints could be used to represent these time points in a trend model (i.e., 2012, 2014, 2016, and 2018.6) (5).
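The midpoint calculation in the example above can be sketched in a few lines of Python. The decimal-year representation, the helper name `interval_midpoint`, and the rounding to one decimal place are illustrative assumptions; the 3.2-year length of the final period is the figure given for the combined files.

```python
# (label, start of period in decimal years, length of period in years)
cycles = [
    ("2011-2012", 2011.0, 2.0),
    ("2013-2014", 2013.0, 2.0),
    ("2015-2016", 2015.0, 2.0),
    ("2017-March 2020", 2017.0, 3.2),  # 3.2-year period for the combined files
]


def interval_midpoint(start, length_years):
    """Midpoint of a data collection period, in decimal years."""
    return round(start + length_years / 2, 1)


midpoints = [interval_midpoint(start, length) for _, start, length in cycles]

# Gaps between consecutive midpoints show that the last interval is wider.
spacing = [round(later - earlier, 1) for earlier, later in zip(midpoints, midpoints[1:])]

print(midpoints)  # [2012.0, 2014.0, 2016.0, 2018.6]
print(spacing)    # [2.0, 2.0, 2.6]
```

Using these midpoints as the time variable, rather than an index such as 1, 2, 3, 4, keeps the wider final interval from being treated as if it were the same length as the earlier ones.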
Fourth, the variances of estimates from the 2017-March 2020 pre-pandemic files are generally smaller than those from the 2017-2018 files because of the increased sample size achieved by combining data cycles. However, for some estimates and some demographic subgroups, adding the 2019-March 2020 PSUs may increase the variance estimates because of the increased variation in the sampling weights and in the underlying variables.
Fifth, the 2017-March 2020 pre-pandemic files represent a 3.2-year period, in contrast to previous data releases which represent a 2-year period. Analysts may wish to combine 2017-March 2020 data files with previous cycles to increase sample size for outcomes with low prevalence or for subgroups. If done, the survey weights should be adjusted to reflect the longer period and larger population represented by the 2017-March 2020 files. For example, combining the 2015-2016 and 2017-March 2020 files would result in a data file representing a 5.2-year period, and the survey weights should be adjusted as follows: 2015-2016 survey weights should be multiplied by 2/5.2 (the fraction of the 5.2-year period represented by the 2015-2016 cycle) and likewise, the 2017-March 2020 survey weights should be multiplied by 3.2/5.2.
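The weight adjustment described in this example can be written out as a short Python sketch. The function names and the dictionary labels are hypothetical and chosen only for illustration; the period lengths (2 and 3.2 years) are those given above.

```python
def period_share_factors(period_lengths):
    """Map each file label to its share of the total period covered."""
    total_years = sum(period_lengths.values())
    return {label: years / total_years for label, years in period_lengths.items()}


# Period lengths in years for the two files being combined.
factors = period_share_factors({"2015-2016": 2.0, "2017-March 2020": 3.2})
# factors["2015-2016"] is 2/5.2 and factors["2017-March 2020"] is 3.2/5.2


def rescale_weight(weight, file_label):
    """Multiply a participant's survey weight by the factor for their file."""
    return weight * factors[file_label]
```

Because the factors sum to 1, the rescaled weights from the two files together represent the average population over the combined 5.2-year period rather than counting it twice.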
Lastly, only a limited number of 2017-March 2020 pre-pandemic data files are being released online. Because of the unusual circumstances surrounding the construction of this file and the need to reduce disclosure risk for participants in the partially completed cycle, the 2017-March 2020 pre-pandemic data release is more limited in content than the previous 2-year files. Additional files will be released to the Research Data Center, where extra measures can be taken to protect confidentiality.
Specific data file documentation can be found via the link next to the respective data file on the NHANES website. This documentation is always the most up-to-date source of information about the variables on each data file.
1. Chen TC, Clark J, Riddles MK, Mohadjer LK, Fakhouri THI. National Health and Nutrition Examination Survey, 2015−2018: Sample design and estimation procedures. National Center for Health Statistics. Vital Health Stat 2(184). 2020. https://www.cdc.gov/nchs/data/series/sr_02/sr02-184-508.pdf.
2. Krenzke T, Mohadjer L. Application of probability-based link-tracing and nonprobability approaches to sampling out-of-school youth in developing countries. Journal of Survey Statistics and Methodology. 2020;0:1-26. https://doi.org/10.1093/jssam/smaa010.
3. Fakhouri THI, Martin CB, Chen TC, Akinbami LJ, Ogden CL, Paulose-Ram R, et al. An investigation of nonresponse bias and survey location variability in the 2017−2018 National Health and Nutrition Examination Survey. National Center for Health Statistics. Vital Health Stat 2(185). 2020. https://www.cdc.gov/nchs/data/series/sr_02/s02-185-508.pdf.
4. National Center for Health Statistics. National Health and Nutrition Examination Survey: Analytic guidelines, 2011–2014 and 2015–2016. Available from: https://wwwn.cdc.gov/nchs/data/nhanes/analyticguidelines/11-16-analytic-guidelines.pdf.
5. Ingram DD, Malec DJ, Makuc DM, Kruszon-Moran D, Gindi RM, Albert M, et al. National Center for Health Statistics guidelines for analysis of trends. National Center for Health Statistics. Vital Health Stat 2(179). 2018. Available from: https://www.cdc.gov/nchs/data/series/sr_02/sr02_179.pdf.