Preparing an NHANES Dietary Analytic Dataset
Dietary data are among the most complex of all the data in NHANES. For this reason, preparing a dataset for dietary analysis involves critical steps and often may be more time-consuming than the analysis itself.
Analysts working with NHANES dietary data frequently want to answer the following types of questions:
- What is the mean intake of a given food?
- What is the mean intake of a given nutrient from all foods and beverages?
- What is the mean intake of a given nutrient from supplements?
- Which foods are the major sources of a given nutrient?
- What is the distribution of intake of a given food or nutrient across a selected population?
- How does dietary intake relate to a health parameter?
The basic unit of analysis in NHANES is the individual participant, identified by the participant sequence ID (SEQN). However, because of the way the dietary data are structured—with individual participants having multiple food and dietary supplement records, which in turn have their own accompanying sets of variables—the unit of analysis for some types of analyses is at the level of the food or supplement, rather than the individual.
Dietary data are also challenging to work with because many analyses require creating new variables from those found in the survey data files, for example to group similar foods together. To answer the question "What is the mean intake of milk among survey participants?", the analyst must first define "milk" (e.g., all types of fluid milk consumed as a beverage, milk also consumed as an ingredient in other foods, or servings of milk), which may require creating several new variables based on analytic needs.
Analysts can group foods for their own purposes or use previously developed grouping schemes. Examples of such schemes include the What We Eat in America Food Categories and the Food Patterns Equivalents Database described earlier.
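The grouping step can be sketched as follows. This is a minimal illustration, not NHANES code: the record layout, field names (seqn, food_code, grams), and food codes are hypothetical, and the definition of "milk" is the analyst's own choice. It shows food-level records being collapsed to one value per participant, with non-consumers kept at zero.

```python
from collections import defaultdict

# Hypothetical food-level records: one row per food reported by a participant.
# Field names and food codes are illustrative, not actual NHANES values.
food_records = [
    {"seqn": 1, "food_code": "A100", "grams": 244.0},  # fluid milk
    {"seqn": 1, "food_code": "B200", "grams": 30.0},   # some other food
    {"seqn": 2, "food_code": "A101", "grams": 122.0},  # another type of fluid milk
    {"seqn": 3, "food_code": "B200", "grams": 55.0},   # no milk reported
]

# The analyst's own definition of "milk" (assumed here: fluid milk as a beverage).
MILK_CODES = {"A100", "A101"}


def milk_grams_per_person(records, milk_codes):
    """Sum grams of foods in the milk group for each participant.

    Participants with food records but no milk get 0.0, so they can be
    retained as non-consumers in later analyses.
    """
    totals = defaultdict(float)
    for rec in records:
        totals[rec["seqn"]] += rec["grams"] if rec["food_code"] in milk_codes else 0.0
    return dict(totals)


person_level = milk_grams_per_person(food_records, MILK_CODES)
print(person_level)
```

The result has one entry per SEQN, which moves the unit of analysis from the food back to the individual.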
Special Considerations in Analysis of Dietary Data
Different Types of Data
The dietary recalls, food frequency questionnaire, and supplement questionnaire each measure different aspects of the diet. Each type of dietary data is gathered differently, which could lead to differences in the cognitive processes involved (comprehension, recall, decisions and judgment, and response processes) and in how individuals respond. Because of these differences, the various types of dietary data lend themselves to different types of analyses and require different assumptions, as described below. For more information on the analysis of different types of dietary data, see the Dietary Assessment Primer from the National Cancer Institute (NCI), including the Summary Tables: Recommendations on Potential Approaches to Dietary Assessment for Different Research Objectives Requiring Group-level Estimates.
The recall data and supplement questionnaire data can each be used alone, but neither alone represents total nutrient intake. Similarly, the NHANES 2003-2006 food frequency data were not designed to be used alone, but as supplementary (covariate) information in modeling data from the 24-hour recalls to estimate usual intakes when examining them in relation to some other variable of interest. A single 24-hour recall is sufficient for analyzing mean nutrient intakes from foods and beverages, whereas both days of data are required when estimating the usual intake distribution and prevalence of nutrient intake from foods and beverages. There may be a sequence effect (that is, the number and amount of foods are sometimes lower on the first versus subsequent recalls) that can be controlled for by adding a variable for recall day (first versus second) in the usual intake analysis. See Usual Dietary Intakes on the NCI Risk Factor Assessment Branch website for more information and sample statistical programs on the NCI method for estimating usual intake.
Note: Using the 24-hour recall and supplement data to estimate mean intakes assumes that the 24-hour recall is unbiased; hence, for simple analyses such as mean intakes, no sophisticated statistical adjustment is required. Estimating the population distribution of usual intake, however, requires statistical modeling.
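The recall-day variable described in this section can be prepared by stacking the two recalls into one long dataset. The sketch below only shows that data preparation step under assumed inputs (two hypothetical dictionaries of intake keyed by SEQN); it does not implement the NCI method or any usual intake model.

```python
# Hypothetical nutrient intakes (mg) keyed by participant ID (SEQN).
day1_intake = {1: 950.0, 2: 700.0}
day2_intake = {1: 900.0}  # participant 2 has no second recall


def stack_recalls(day1, day2):
    """Return one row per participant-recall with a recall-day indicator.

    second_recall is 0 for the first recall and 1 for the second, so it can be
    entered as a covariate to account for a possible sequence effect.
    """
    rows = []
    for day_index, recalls in enumerate((day1, day2), start=1):
        for seqn, intake in sorted(recalls.items()):
            rows.append({
                "seqn": seqn,
                "recall_day": day_index,
                "second_recall": int(day_index == 2),
                "intake": intake,
            })
    return rows


long_rows = stack_recalls(day1_intake, day2_intake)
for row in long_rows:
    print(row)
```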
Choosing Whether or Not to Include Non-Consumers
Another consideration in estimating mean food intake is whether the mean is among all persons in the population or only among consumers of the food. If interested in the per capita amount consumed, non-consumers with their intake value of zero must be included; if interested in the average amount consumed by users of the food on days when the food is consumed, non-consumers must be excluded.
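The distinction can be illustrated with a small weighted calculation. The intake values and sample weights below are made up for illustration, and the sketch produces point estimates only; it does not compute standard errors, which require survey procedures for complex samples.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean of values using the given sample weights."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical one-day milk intakes (grams) and sample weights.
intakes = [244.0, 122.0, 0.0, 0.0]
weights = [1.0, 2.0, 1.0, 1.0]

# Per capita mean: non-consumers stay in with an intake of zero.
per_capita = weighted_mean(intakes, weights)

# Consumers-only mean: drop records with zero intake before averaging.
consumers = [(v, w) for v, w in zip(intakes, weights) if v > 0]
consumers_only = weighted_mean([v for v, _ in consumers], [w for _, w in consumers])

print(round(per_capita, 1), round(consumers_only, 1))
```

With these made-up numbers the per capita mean is 97.6 g and the consumers-only mean is about 162.7 g, which shows how much the choice of denominator can change the estimate.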
Measurement Error
Measurement error in dietary data seriously attenuates the association between dietary data and other factors, such as a health outcome. That is, the analyses would be less likely to indicate a relationship between diet and disease even if one truly existed. Measurement error models can be used to analyze diet-disease relationships, and methods have been developed to estimate usual intakes that adjust for the problems associated with large within-person variation. For more information on measurement error, see Measurement Error: Impact on Nutrition Research and Adjustment for its Effects on the NCI website.
Data Symmetry
An underlying assumption in many statistical analyses is that the distribution of the data is normal. However, almost all distributions of dietary data are skewed. For some dietary constituents, many people have zero intake, and a few people may have very large intakes. Skewness does not affect simple analyses, such as differences in mean intakes between population subgroups; therefore, no special corrections are necessary for them. However, for more complex analyses, such as estimating the distribution of usual intakes, skewness must be considered.
Weighting
Appropriate sample weights, including dietary weights, should be applied if the data are being used to represent the U.S. civilian noninstitutionalized population. The dietary sample weights for each recall day account for day of the week of the 24-hour recall in addition to nonresponse, noncoverage, and unequal probabilities of selection. For more information on selecting correct weights for your analysis, see the Weighting module.
In addition, special statistical procedures are required to estimate standard errors when using data from a complex sample such as the NHANES. For more information, see the Variance Estimation module.
Statistical Analysis Methods for Dietary Data
See the Sample Code module for example code to download and modify.
Usual Intake and Day-to-Day Variation in Dietary Intakes
For most surveillance, epidemiologic, and behavioral research purposes, dietary analyses are concerned with measuring usual intake (that is, long-term average daily intake). This is because dietary recommendations are intended to be met over time, and diet-health hypotheses are based on dietary intakes over the long term. For more information on usual intake estimation, please see Usual Dietary Intakes on the National Cancer Institute (NCI), Risk Factor Assessment Branch website.
Estimating Mean Food Intakes
Although dietary recall data are known to contain random errors, especially large day-to-day variability, these errors are assumed to cancel out when estimating means. The mean of the population's intake on a given day can be estimated from a sample of individuals' 24-hour recalls, without sophisticated statistical adjustment if the data are collected evenly throughout the year and the days of the week are evenly represented. The second day of dietary recall is generally not used to estimate means but is used for more advanced analyses, such as Usual Intake analysis.
Dietary recall data also are known to contain bias, at least insofar as a tendency toward underreporting of energy intake. Little is known regarding the extent to which energy intake underreporting extends to underreporting of different foods. For that reason, and for practical purposes, the current statistical convention is to assume that the recalls are not biased (i.e., that no underreporting of foods occurs). However, this assumption is more troubling than the one regarding random error and should be noted as a limitation or caveat in any analysis of food intake.
Estimating Mean Nutrient Intakes from Dietary Supplements
Very little is known regarding the extent of bias or random error associated with dietary supplement data. For that reason, and for practical purposes, the supplement data are generally treated as though neither type of error occurs. However, the possibility of both should be noted as a caveat in an analysis of dietary supplement intake.
There are a few key points to note when calculating supplement intake.
- Intakes of 34 nutrients are summarized in the dietary supplement files, so researchers do not need to estimate intake for these specific nutrients. To calculate intake of nutrients that are not among those 34, the following need to be considered:
  - Each supplement could be reported with a different frequency, based on use over the past 30 days, so care must be taken in deriving the intakes from all supplements.
  - The measurement unit for a given supplement may not be the same across all brands, so conversions may need to be made to combine nutrient values.
  - Some nutrients may be listed as compounds, and thus may need to be converted to elemental form and amounts (e.g., calcium carbonate would need to be converted to the corresponding amount of elemental calcium to determine total calcium). This affects very few products, such as antacids, but still needs to be considered.
  - When linking supplement ID from the product-level files (e.g., DSPI) and the participant-level files, it is important to note that NHANES 1999-2016 cycles are linked by the old ID code (variable DSDSUPID) and NHANES 2017-2020 cycles are linked by the new ID code (variable DSDPID).
- Missing data can be a limitation with several of the dietary supplement variables. The number of cases of missing data and the possible remedies vary by variable, as follows:
  - Number of days the supplement was taken in the past 30 days: Because this variable is needed to determine usual intake, analysts can either impute the number of days or drop these records from the dataset. Imputation requires an assumption that the supplement was taken regularly and is usually based on some other information the respondent provided, such as the number of days that the respondent reported taking certain other types of dietary supplements.
  - Variable on quantity and units consumed, "On the days you took the supplement, how much did you take?": Analysts may want to impute data, which requires an assumption that the respondent took the serving sizes recorded in the variables that capture label information.
  - Missing supplement name: It is assumed that these individuals did take a supplement even though its name is unknown, so they should be retained for prevalence estimates. It may be best to exclude these records from analyses in which mean intakes are being estimated. This action also would reduce missing data for some other variables.
- The calculated total calcium intake from supplements (available on the Total Dietary Supplements files DS1TOT and DS2TOT) includes calcium from antacids. Caution is advised because antacids may be used as a medication and not as a supplement. Usual intake estimates for calcium may be skewed, overestimating intake for some individuals.
- Similar to the consideration for food intake, mean supplement amount can be obtained among all persons in the population or only among users of the supplement. If interested in the per capita amount consumed, non-consumers with intake value of zero should be included. If interested in the average amount consumed by users of the supplement, non-consumers should be excluded.
NOTE: When estimating the mean of the population distribution of usual nutrient intakes from supplement data, no standard convention for statistical adjustment currently exists.
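One common way to handle the differing 30-day reporting frequencies noted above is to convert each supplement to an average daily amount and then sum across supplements. The sketch below assumes hypothetical field names and values, and it assumes complete data on days taken and quantity; it is an illustration of the arithmetic, not a prescribed NHANES procedure.

```python
def average_daily_amount(days_taken, servings_per_day, amount_per_serving,
                         reference_days=30):
    """Average daily nutrient amount over the reference period.

    days_taken: number of days the supplement was taken in the reference period.
    servings_per_day: quantity taken on the days the supplement was used.
    amount_per_serving: nutrient amount per serving, in a single common unit.
    """
    if not 0 <= days_taken <= reference_days:
        raise ValueError("days_taken must be between 0 and reference_days")
    return (days_taken / reference_days) * servings_per_day * amount_per_serving


# Hypothetical supplements reported by one participant (calcium in mg).
supplements = [
    {"days_taken": 30, "servings_per_day": 1, "calcium_mg_per_serving": 200.0},
    {"days_taken": 15, "servings_per_day": 2, "calcium_mg_per_serving": 300.0},
]

total_daily_calcium_mg = sum(
    average_daily_amount(s["days_taken"], s["servings_per_day"],
                         s["calcium_mg_per_serving"])
    for s in supplements
)
print(total_daily_calcium_mg)  # 200 + (15/30) * 2 * 300
```

Records with a missing number of days or quantity would need to be imputed or dropped before this step, as discussed in the list above.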
Estimating Mean Total Nutrient Intakes from Foods and Supplements
Estimating total nutrient intake requires using data from both the 24-hour recalls and the dietary supplement questionnaire. Since 2007, 24-hour dietary supplement data have been collected during the 24-hour recalls, so nutrients from all sources can easily be combined. For more information, see the documentation (Doc) file for each 24-hour dietary supplement data file.
The 30-day dietary supplement intake files have different reference periods and measurement error characteristics. Therefore, some data manipulation is required to combine and summarize these data with 24-hour recalls (i.e., for survey cycles before 2007). For more information on merging data files, see Continuous NHANES Tutorials. Also, the study sample sizes may differ because some participants who report supplement use do not complete the dietary recall interview and participants who complete the dietary recall may not report supplement use. Exploratory analyses are useful to identify the characteristics of participants who report supplement use versus participants who complete dietary recalls.
All the key concepts and caveats regarding estimating nutrient intakes from dietary (foods and beverages) intake apply when estimating total nutrient intake from both foods and supplements. For these analyses, the sample of participants with reliable data for both supplement intake and the first-day recall is selected. Then, for each participant, nutrient intake from 24-hour dietary supplements is added to the nutrient intake from the 24-hour recall; for survey cycles before 2007, the average daily nutrient intake from supplements is instead determined from the 30-day supplement use files and added to the nutrient intake from the 24-hour recall. Finally, a weighted mean of those values is obtained. The assumption is that the sample of participants with satisfactory 24-hour recall and supplement data is representative of the population.
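These steps (select participants with reliable data, add food and supplement intake per person, then take a weighted mean) can be sketched as follows. The records, field names, and weights are hypothetical, and the "reliable" flag stands in for whatever recall status and supplement data checks the analyst applies.

```python
# Hypothetical participant-level records (calcium in mg, made-up sample weights).
participants = [
    {"seqn": 1, "food_mg": 800.0, "supp_mg": 200.0, "weight": 1.0, "reliable": True},
    {"seqn": 2, "food_mg": 600.0, "supp_mg": 0.0, "weight": 3.0, "reliable": True},
    {"seqn": 3, "food_mg": 900.0, "supp_mg": None, "weight": 2.0, "reliable": False},
]


def mean_total_intake(records):
    """Weighted mean of food plus supplement intake among reliable records."""
    usable = [r for r in records if r["reliable"]]
    total_weight = sum(r["weight"] for r in usable)
    if total_weight == 0:
        raise ValueError("no usable records")
    weighted_sum = sum((r["food_mg"] + r["supp_mg"]) * r["weight"] for r in usable)
    return weighted_sum / total_weight


mean_total_mg = mean_total_intake(participants)
print(mean_total_mg)  # (1000 * 1 + 600 * 3) / 4
```

Supplement non-users are kept with a supplement intake of zero, so the result is a per capita mean of total intake.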
If the units of measure in the supplement data differ from those in the 24-hour recalls, ingredient units (DSDUNIT) for each nutrient of interest on the supplement files need to be converted to the units used in the dietary intake data files. Also, nutrients listed as compounds need to be converted to elemental form and amounts. For example, there may be some instances of calcium carbonate, which will need to be converted to the corresponding amount of elemental calcium.
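Both conversions are simple arithmetic. The sketch below assumes a small hypothetical unit table and derives the elemental fraction of calcium carbonate (CaCO3) from approximate standard atomic masses; the helper names are illustrative and the amounts are made up.

```python
# Conversion factors to milligrams for a few mass units (assumed unit labels).
TO_MG = {"g": 1000.0, "mg": 1.0, "mcg": 0.001}

# Approximate standard atomic masses (g/mol).
ATOMIC_MASS = {"Ca": 40.078, "C": 12.011, "O": 15.999}


def to_milligrams(amount, unit):
    """Convert an amount in g, mg, or mcg to milligrams."""
    return amount * TO_MG[unit]


def elemental_fraction(element, formula_counts):
    """Mass fraction of one element in a compound given its atom counts."""
    compound_mass = sum(ATOMIC_MASS[el] * n for el, n in formula_counts.items())
    return ATOMIC_MASS[element] * formula_counts[element] / compound_mass


CALCIUM_CARBONATE = {"Ca": 1, "C": 1, "O": 3}

# A hypothetical product listing 1.25 g of calcium carbonate per serving.
compound_mg = to_milligrams(1.25, "g")
calcium_fraction = elemental_fraction("Ca", CALCIUM_CARBONATE)  # roughly 0.40
elemental_calcium_mg = compound_mg * calcium_fraction           # roughly 500 mg

print(round(calcium_fraction, 3), round(elemental_calcium_mg, 1))
```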
As in the case of estimating nutrient intakes from supplements alone, analysts must consider the possibility of missing data and whether to include antacids (in the case of calcium or magnesium).
Means should be examined along with their standard errors to get an indication of the variation about the mean. If the data are highly skewed, as dietary data often are, means may not provide a very good representation of central tendency, and the median should also be estimated. The simple median of reported intakes from a sample based on one 24-hour recall per participant represents the median on a given day, not usual intake. In addition to medians, transformations of skewed dietary data that result in approximately normally distributed values could be considered.
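The effect of skewness on the mean can be seen with a small made-up example. The weighted median below uses one simple definition (the smallest value at which cumulative weight reaches half the total); other definitions exist, and survey software may interpolate differently. Values and weights are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half of the total weight."""
    half = sum(weights) / 2.0
    cumulative = 0.0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= half:
            return value
    raise ValueError("no data")


# Hypothetical one-day intakes with a single very large value (right skew).
intakes = [100.0, 200.0, 300.0, 400.0, 4000.0]
weights = [1.0, 1.0, 1.0, 1.0, 1.0]

mean_intake = weighted_mean(intakes, weights)
median_intake = weighted_median(intakes, weights)
print(mean_intake, median_intake)
```

Here the single large intake pulls the mean to 1000 while the median stays at 300, which is why both are worth reporting for skewed data.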
Estimating Ratios
Ratios depict the value of one variable divided by the value of another. The mathematical properties of ratios are the same, whether one is considering simple ratios, proportions, or percentages. A proportion, often expressed as a percentage, is a kind of ratio that can be used to represent the value of a single variable for one class divided by the value for all classes combined.
Whenever multiple ratios are involved, either across many individuals in a group or over numerous days of intake for each individual, analysts can summarize them in different ways. These different calculations can lead to different answers because the calculations involve both summation and division, and an elementary principle of mathematics dictates that the order of these operations matters.
In survey analyses involving multiple dietary recalls per person, consideration of which kind of summary ratio to use must be made at both the group and individual levels. Two different, but equally correct, answers can be given in response to a question such as "What proportion of the calcium that is consumed comes from milk?" This is because the question can have two different meanings:
- "How much of all the calcium consumed by the group comes from milk?" (Ratio of Means) or
- "What is the group's daily contribution of milk to calcium intake?" (Mean Ratio)
Ratio of Means
The ratio of means yields information about the diet of the whole population because both the numerator and the denominator are computed for the whole population before the ratio is derived. The ratio of means can be obtained for various subgroups in the population if comparisons are warranted. The ratio of means is used to answer questions such as, "How much of all the calcium consumed by the group comes from milk?" It is calculated by summing the amount of calcium from milk for all persons and then dividing that by the sum of the calcium from all foods for all persons. The answer would be the same if both the numerator and denominator were divided by a constant, such as the sample size. Therefore, it can also be calculated by dividing the group's mean amount of calcium from milk by the group's mean total calcium, and for this reason it is described as a ratio of means.
The ratio of means has been used to identify sources of nutrients in the US diet as a whole and to examine diet quality. There are two different ways to consider food sources of nutrients—as either "important" or "rich" sources. Important sources are those that contribute the most to a population's dietary intake; rich sources are those foods with the greatest concentration of a nutrient. For example, sardines are a rich source of calcium, but they are not a very important source in the US diet because they are consumed relatively infrequently. A food composition table or database can provide information about rich sources of nutrients, whereas population intake as well as food composition data are needed to identify important sources.
NOTE: When data from one or two 24-hour recalls are used to estimate a ratio of means, the mean in the numerator and the mean in the denominator can each be considered an estimate of usual intake. Therefore, no specific statistical adjustments are necessary.
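The ratio of means calculation can be sketched with made-up values. The calcium amounts and sample weights below are hypothetical; the sketch also checks that dividing the numerator and denominator by the same constant (the total weight) leaves the ratio unchanged, which is why the ratio of sums equals the ratio of means.

```python
# Hypothetical person-level data: calcium from milk, calcium from all foods (mg),
# and made-up sample weights.
milk_calcium = [300.0, 0.0, 600.0]
total_calcium = [1000.0, 500.0, 1200.0]
weights = [1.0, 2.0, 1.0]


def weighted_sum(values, weights):
    """Sum of values multiplied by their sample weights."""
    return sum(v * w for v, w in zip(values, weights))


# Ratio of weighted sums across the whole group.
ratio_of_sums = weighted_sum(milk_calcium, weights) / weighted_sum(total_calcium, weights)

# Ratio of weighted means: both sums divided by the same total weight.
total_weight = sum(weights)
mean_milk = weighted_sum(milk_calcium, weights) / total_weight
mean_total = weighted_sum(total_calcium, weights) / total_weight
ratio_of_means = mean_milk / mean_total

print(ratio_of_sums, ratio_of_means)  # both 900 / 3200 = 0.28125
```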
Mean Ratio
When the intent is to say something about how intake varies among the population, or how the ratio relates to other factors, deriving the ratio for each person before summarizing (as with the mean ratio) is the method of choice. The mean ratio is used to answer questions such as, "What is the group's daily contribution of milk to calcium intake?" It is determined by first calculating the proportion of calcium from milk for each person and then taking an arithmetic mean of all the proportions. Often, the mean ratio is close to the ratio of means; however, sometimes they are quite different, depending on the variability in the ratio, variation in the denominator, and the correlation between the ratio and the denominator.
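The difference between the two summaries can be seen in a small made-up example. The values and weights are hypothetical, and the sketch assumes every person has a nonzero denominator; persons with zero total intake would need separate handling because their ratio is undefined.

```python
# Hypothetical person-level data: calcium from milk, calcium from all foods (mg),
# and made-up sample weights.
milk_calcium = [300.0, 0.0, 600.0]
total_calcium = [1000.0, 500.0, 1200.0]
weights = [1.0, 2.0, 1.0]

total_weight = sum(weights)

# Mean ratio: compute each person's proportion first, then average the proportions.
person_ratios = [m / t for m, t in zip(milk_calcium, total_calcium)]
mean_ratio = sum(r * w for r, w in zip(person_ratios, weights)) / total_weight

# Ratio of means: summarize numerator and denominator first, then divide.
ratio_of_means = (
    sum(m * w for m, w in zip(milk_calcium, weights))
    / sum(t * w for t, w in zip(total_calcium, weights))
)

print(round(mean_ratio, 4), round(ratio_of_means, 4))
```

With these numbers the mean ratio is 0.20 and the ratio of means is about 0.28, because the heavily weighted person with a low total calcium intake and no milk counts fully toward the mean ratio but contributes little to the group totals.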
When the ratio itself varies among the population, its distribution can be examined. The distribution of ratios provides other summary statistics, such as the mean and median, the ratio at other percentiles, and the proportion of the population above or below a certain cut-off. The generalizability of mean ratios is subject to whatever data limitations the individual ratios impose. For example, if the individual ratios each represent only a single day, then the mean ratio can only be used to make inferences "for a given day," and relating a single day's ratio to some other factor is rarely of interest.
Individual-Level Ratios
As with group-level ratios, two different questions could be posed: "How much of all the calcium consumed by this person has come from milk?" or "What is the person's daily contribution of milk to calcium intake?" And again, because the ratios would involve the division of one variable by another, these two ratios could be different from one another. If using only one observation per person (such as a single 24-hour recall), then there is only one value for the numerator and one for the denominator and, therefore, only one way to derive the individual-level ratio. If data were available for each person's intake on every day over an extended period, then the individual's daily ratios would need to be summarized.