1999-2004 Dual Energy X-ray Absorptiometry (DXA) Multiple Imputation Data Files
Frequently Asked Questions
Question 1: If we use only a subgroup for our analysis (i.e., the group with measured DXA data or those with only a single imputed value) will there be any issues with the weights that might affect this group disproportionately?
If you use a single imputation, you will have a complete dataset, but the sampling errors will be underestimated because they will not include the imputation variance. You also open yourself to a reviewer asking whether you "selected" the one of the 5 imputation datasets that best fits your analysis. NCHS recommends using all 5 imputations. If you try to analyze only the cases with measured DXA data, then you have a classic missing data problem: the weights as provided are not appropriate because they are not adjusted for DXA non-response, and DXA non-response is clearly NOT missing at random. Because the missingness is directly related to the outcome, you should expect the cases with complete data to differ from the cases with imputed values. For example, the percentage with a BMI greater than 30 will be higher for imputed cases than for complete cases. If this were not the case, NCHS probably would have simply re-weighted the data for non-response rather than go through the imputation process. NCHS does not recommend analysis based on only complete cases with the original sampling weights.
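One common way to carry the imputation variance into the standard error is the set of combining rules usually attributed to Rubin. This FAQ does not spell those rules out, so the sketch below is an illustration under that assumption, not a statement of the NCHS procedure. The function name and all numbers are hypothetical; in practice each per-imputation variance would come from survey software that accounts for the NHANES sample design.

```python
import math

def combine_imputations(estimates, variances):
    """Pool one estimate per imputed dataset (assumed: Rubin's combining rules)."""
    m = len(estimates)
    q_bar = sum(estimates) / m                               # pooled point estimate
    within = sum(variances) / m                              # average sampling variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # imputation variance
    total = within + (1 + 1 / m) * between                   # total variance
    return q_bar, math.sqrt(total)

# Hypothetical estimates and design-based variances from the 5 imputed datasets.
estimates = [27.1, 27.4, 26.9, 27.3, 27.2]
variances = [0.040, 0.042, 0.039, 0.041, 0.040]

pooled, pooled_se = combine_imputations(estimates, variances)
single_se = math.sqrt(variances[0])   # what a single imputation alone would report
print(pooled, pooled_se, single_se)
```

In this made-up example the pooled standard error is larger than the standard error from any single imputed dataset, which is the underestimation the answer above warns about.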
Question 2: Should we expect any major changes as we move from unweighted to weighted data analysis?
No more so than usual. Sample weights vary by age, sex, race/ethnicity as well as those variables included in the non-response weighting adjustment. So, for example, if your outcome variable differs by race/ethnicity then you could expect differences in the weighted and unweighted measures. For multiple imputation datasets, the sample weights do not change from imputation to imputation.
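As a small illustration of why weighted and unweighted measures can differ, the sketch below compares the two means for a handful of made-up records. The values and weights are hypothetical and are not drawn from NHANES; the point is only that when the outcome is related to the size of the weight, the two means move apart.

```python
def unweighted_mean(values):
    return sum(values) / len(values)

def weighted_mean(values, weights):
    # Each record counts in proportion to its sample weight.
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical outcome values; the last two records carry larger weights
# and also have larger outcome values.
values = [20.0, 22.0, 30.0, 32.0]
weights = [1000.0, 1000.0, 3000.0, 3000.0]

plain = unweighted_mean(values)
weighted = weighted_mean(values, weights)
print(plain, weighted)
```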
Question 3: Are there statistical outliers that will be resolved through weighting?
Outliers are not "resolved" through weighting unless the weighting specifications call for trimming adjustments based on values of the outcome variable. As NHANES is a multi-purpose survey, a trimming adjustment for one variable may not be appropriate for other variables. As with any NHANES dataset, there will be influential data values and influential sample weights. You can run a cross-tab or generate a scatter diagram to identify any observations that have BOTH a large weight and a large value. If you, as the analyst, feel that such observations are overly influential, you can delete, trim, or otherwise adjust them. In the imputation model, influential observations were examined carefully, but none were removed from the model fitting exercise. The imputed values also were examined; some of the outer percentile points in the dataset are imputed values because either the average imputed value was large as predicted by the model or both the average value and the imputation variance were large. In a statistical sense, because you are dealing with 5 imputations, large imputed values were left in the dataset because they accurately reflect the imputation variance. Early in the modeling it was noted that some transformations were generating values that seemed to be too large (or too small), so transformations that seemed to elongate the tails of the distribution of imputed values were avoided.
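The check for observations with both a large weight and a large value can be sketched as follows. This is one possible screen, not an NCHS rule: the percentile cutoff, the helper names, and the column names "value" and "weight" are all hypothetical, and the records are made up.

```python
def percentile_cut(xs, pct):
    """Simple cutoff: the value at position len * pct / 100 in sorted order."""
    s = sorted(xs)
    return s[min(len(s) - 1, len(s) * pct // 100)]

def flag_influential(rows, value_key, weight_key, pct=95):
    """Return rows that are at or above the cutoff on BOTH value and weight."""
    v_cut = percentile_cut([r[value_key] for r in rows], pct)
    w_cut = percentile_cut([r[weight_key] for r in rows], pct)
    return [r for r in rows if r[value_key] >= v_cut and r[weight_key] >= w_cut]

# Hypothetical records: ten sample persons, two of whom have very large weights.
values = [20, 21, 22, 23, 24, 25, 26, 27, 28, 45]
weights = [1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 90000, 95000]
rows = [{"SEQN": i + 1, "value": v, "weight": w}
        for i, (v, w) in enumerate(zip(values, weights))]

flagged = flag_influential(rows, "value", "weight", pct=80)
print([r["SEQN"] for r in flagged])
```

Whether a flagged record is then deleted, trimmed, or left alone remains the analyst's judgment, as the answer above notes.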
Question 4: Does the data set contain ONLY imputed data? That is, if a sample person had a measured value, was that measured value replaced by an imputed value or does the data set contain some measured values and some imputed values?
DXA data items/variables were imputed ONLY for participants whose observations on the DXA variable were invalid or missing. If a participant's DXA data are complete, then no imputations were done on the DXA data. For these participants, each of the 5 data records will be identical, so any individual DXA value and any summary statistics computed on those with measured data only should not change among the 5 datasets. You should be able to verify this with a few line listings of the DXA records for sample persons with the same SEQN across the 5 datasets. That is, sort the 5 data files by SEQN, merge the datasets by SEQN, take the first 100 or so records (or a middle 100 or the last 100), and print the line listing for all or some of the DXA variables. For participants with multiply imputed data, each of the 5 datasets will contain a different set of imputed values.
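The verification described above can also be done programmatically instead of by line listing. The sketch below groups one DXA variable by SEQN across the five datasets and reports which sample persons have identical values and which do not. The variable name "total_fat", the helper name, and the records are hypothetical; only SEQN comes from the files themselves.

```python
def split_by_agreement(datasets, var, key="SEQN"):
    """Group one variable by SEQN across datasets and compare the values."""
    by_seqn = {}
    for data in datasets:
        for row in data:
            by_seqn.setdefault(row[key], []).append(row[var])
    identical = sorted(s for s, vals in by_seqn.items() if len(set(vals)) == 1)
    varying = sorted(s for s, vals in by_seqn.items() if len(set(vals)) > 1)
    return identical, varying

# Hypothetical data: SEQN 1 and 2 were measured, SEQN 3 was imputed.
imputed_for_3 = [25.1, 26.4, 24.8, 25.9, 26.0]
datasets = [
    [{"SEQN": 1, "total_fat": 20.5},
     {"SEQN": 2, "total_fat": 31.0},
     {"SEQN": 3, "total_fat": v}]
    for v in imputed_for_3
]

identical, varying = split_by_agreement(datasets, "total_fat")
print("same in all 5 datasets:", identical)
print("differs across datasets:", varying)
```

A sample person whose value is identical in all five files is expected to be a measured case, although identical values alone do not prove it.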
In doing your analysis, you will have to merge the DXA data with other NHANES datasets to get the demographic variables, sample design information, and covariates. For some data files, there may be some missing covariates. Make sure your sort/merge steps keep all cases/sample persons (you don't want case-wise deletion based on missing covariates).
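A merge that keeps every sample person, even when a covariate record is missing, might look like the sketch below. It assumes the DXA file is the one that defines the set of cases to keep; the helper name, the "age" covariate, and the records are hypothetical.

```python
def merge_keep_all(dxa_rows, covariate_rows, key="SEQN"):
    """Attach covariates to every DXA record; missing covariates become None."""
    cov_by_key = {row[key]: row for row in covariate_rows}
    cov_cols = sorted({c for row in covariate_rows for c in row if c != key})
    merged = []
    for row in sorted(dxa_rows, key=lambda r: r[key]):   # sort by SEQN first
        match = cov_by_key.get(row[key], {})
        out = dict(row)
        for c in cov_cols:
            out[c] = match.get(c)    # None marks a missing covariate, row is kept
        merged.append(out)
    return merged

# Hypothetical records: SEQN 2 has no covariate record.
dxa_rows = [{"SEQN": 3, "total_fat": 25.1},
            {"SEQN": 1, "total_fat": 20.5},
            {"SEQN": 2, "total_fat": 31.0}]
covariate_rows = [{"SEQN": 1, "age": 34}, {"SEQN": 3, "age": 51}]

merged = merge_keep_all(dxa_rows, covariate_rows)
print(merged)
```

The design choice is that a missing covariate produces an empty field rather than a dropped row, so the full sample is still available for variance estimation.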
Question 5: How do I compute standard deviations for DXA measures using the multiply imputed data to determine the prevalence of a disease? We would like to use the mean and standard deviations of various measures to create gender and gender/race specific cut points that will be used to define this disease.
Take each of the five complete (i.e., measured and imputed values) datasets separately; calculate the standard deviation in each, giving five standard deviations, one for each dataset; then take the average of the five.
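The averaging step can be sketched as follows. The sketch uses the ordinary unweighted sample standard deviation from the Python standard library purely for illustration; an analysis that uses the sample weights would substitute a weighted standard deviation in its place. The helper name and the values are hypothetical.

```python
from statistics import mean, stdev

def average_sd(datasets):
    """Compute one standard deviation per completed dataset, then average them."""
    return mean(stdev(values) for values in datasets)

# Hypothetical values of one DXA measure in each of the 5 completed datasets.
datasets = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [1.0, 2.0, 3.0],
]

sd_for_cut_points = average_sd(datasets)
print(sd_for_cut_points)
```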