National Health and Nutrition Examination Survey

Software Tips

This page contains tips for using SUDAAN, SAS Survey, Stata, and R software to analyze NHANES data.

SUDAAN

Tip 1: Do not drop observations from the SAS dataset before calling SUDAAN procedures.

To properly calculate the standard errors of your statistics (such as means and percentages), the Taylor series linearization method requires information on ALL records with a non-zero value for your weight variable, including those survey participants who are not in your population of interest. For example, to estimate mean body mass index (BMI) and its standard error for men aged 20 years and over, the DESCRIPT procedure needs to read in the entire dataset of examined individuals who have an exam weight, including females and those younger than 20 years.

As a rule of thumb, it is recommended that you NEVER drop records from your SAS dataset prior to calling SUDAAN procedures. Instead, you should use the SUBPOPX statement to specify your population of interest.

For more details on analyzing subgroups, see Module 4: Variance Estimation.

WARNING

Do not drop observations from the dataset. This may affect variance estimation.

Tip 2: Remember to sort your SAS dataset by the design variables before calling SUDAAN procedures.

SUDAAN procedures expect the input dataset to be sorted by the design variables specified on the NEST statement. You should use the SORT procedure in SAS to sort your analysis dataset by the strata and cluster variables (SDMVSTRA and SDMVPSU) before calling any SUDAAN procedures.

Alternatively, the notsorted option on the SUDAAN procedure call will request that the SUDAAN procedure make a temporary copy of the input dataset and sort it by the required variables prior to conducting any calculations. However, since an analysis session typically includes multiple SUDAAN procedure calls with the same input dataset, it is generally more computationally efficient to sort the dataset once, prior to running any SUDAAN procedures.

Tip 3: Check for minor syntax differences if running the example code in stand-alone SUDAAN.

There are a few minor syntax differences between the stand-alone and SAS-callable versions of SUDAAN. For example, several statements are renamed in SAS-callable SUDAAN to avoid conflicts with existing SAS commands. The code examples were developed using SAS-callable SUDAAN, so you may need to modify the example code slightly to run it successfully in stand-alone SUDAAN.

A few differences are highlighted below. Review the SUDAAN documentation for more information.

SAS-callable SUDAAN command    Stand-alone SUDAAN command
RLOGIST                        LOGISTIC
RTITLE                         TITLE
RFOOTNOTE                      FOOTNOTE
RFORMAT                        FORMAT

SAS Survey

Tip 1: Use the SAS Survey procedures to properly account for the complex sample design of NHANES.

SAS provides a number of survey analysis procedures that properly account for the complex sample design used in NHANES. These procedure names all start with SURVEY – such as SURVEYMEANS, SURVEYFREQ, SURVEYREG, and others. Note that the Base SAS procedures (PROC MEANS, PROC FREQ, etc.) do not account for the complex sample design.

Tip 2: Do not drop observations from the SAS dataset.

To properly calculate the standard errors of your statistics (such as means and percentages), the Taylor series linearization method requires information on ALL records with a non-zero value for your weight variable, including those survey participants who are not in your population of interest. For example, to estimate mean body mass index (BMI) and its standard error for men aged 20 years and over, the SURVEYMEANS procedure needs to read in the entire dataset of examined individuals who have an exam weight, including females and those younger than 20 years.

As a rule of thumb, it is recommended that you NEVER drop records from your SAS dataset. Instead, you should use the DOMAIN statement (or the TABLES statement for PROC SURVEYFREQ) to specify your population of interest.

For more details on analyzing subgroups, see Module 4: Variance Estimation.

WARNING

Do not drop observations from the dataset. This may affect variance estimation.

Tip 3: Use the NOMCAR option in estimation procedures to treat missing values as not missing completely at random.

The NOMCAR option requests that the procedure treat missing values in the variance computation as not missing completely at random (NOMCAR) for Taylor series variance estimation. When you specify this option, the procedure computes variance estimates by analyzing the non-missing analysis values as a domain (subpopulation), where the entire population includes both non-missing and missing domains.

Use of the NOMCAR option is recommended because the default assumption, that non-respondents do not differ in any relevant respect from respondents and so may be treated as missing completely at random, is often not appropriate.

The NOMCAR option is available on procedures including SURVEYFREQ, SURVEYMEANS, SURVEYREG, and SURVEYLOGISTIC, in SAS version 9.2 and later. See the SAS documentation for more details.

Tip 4: Be aware that SAS Survey procedures generally do not correct for the reduction in the degrees of freedom for subgroups where not all PSUs and strata are represented.

The degrees of freedom associated with an estimated statistic are needed to perform hypothesis tests and to compute confidence intervals. For analyses on a subgroup of the NHANES population, the degrees of freedom should be based on the number of strata and PSUs containing the observations of interest. SAS Survey procedures generally calculate the degrees of freedom based on the number of strata and PSUs represented in the overall dataset (especially if the NOMCAR option is used, as is recommended to estimate the standard errors correctly). Estimates for some subgroups of interest will have fewer degrees of freedom than are available in the overall analytic dataset. (See Module 4: Variance Estimation for more information.)

In particular, although PROC SURVEYFREQ has an option to compute Clopper-Pearson confidence limits for proportions according to the approach of Korn and Graubard (type=clopperpearson option on the TABLES statement), these confidence intervals are not based on the correct degrees of freedom for subgroups where not all strata and PSUs are represented. See the code example about diabetes prevalence (which replicates a portion of National Health Statistics Report 123) for code to compute the degrees of freedom for subgroups and then calculate the Korn and Graubard confidence intervals.

Tip 5: When using PROC SURVEYREG to calculate age-standardized estimates or to test for differences between subpopulations, specify the noint and vadjust=none options on the MODEL statement.

PROC SURVEYMEANS does not support direct age standardization of prevalence estimates or pairwise testing for differences in prevalence estimates between subpopulations. Instead, you can use PROC SURVEYREG for these estimates. See Module 7: Hypothesis Testing for information on conducting t-tests with PROC SURVEYREG, and see Module 8: Age Standardization and Population Estimates for details about how to calculate age-standardized prevalence estimates using PROC SURVEYREG.

The model statement will specify only one effect, which is either a classification variable or the crossed effect (interaction) of multiple classification variables.

Use the noint option on the MODEL statement to omit the intercept from the model. The parameter estimates will then represent the mean value of the dependent variable at each level of the effect (e.g., the mean for each age group when producing age-standardized estimates).

Use the vadjust=none option on the MODEL statement to request that the procedure not apply a degrees of freedom adjustment in the computation of the matrix for the variance estimation. The vadjust=none option produces the same estimates for the standard error of the estimated parameter (i.e., the standard error of the mean) as would be calculated using the default options in PROC SURVEYMEANS. If you do not specify a value for the vadjust option, PROC SURVEYREG uses vadjust=df by default.

Stata

Tip 1: There are two series of commands.

There are two series of commands you can use to analyze NHANES data in Stata.

SVY Commands

SVY commands are a series of commands specifically designed to analyze complex survey designs like NHANES. To calculate the means and standard errors, you would use Stata survey (svy) commands because they account for the complex survey design of NHANES data when determining variance estimates. These commands can be used for simple random samples also.

Whenever you want to use SVY commands, you need to set up Stata by defining the survey design variables using the svyset command. This command has the general structure:

svyset [w= weight], psu(psu variable) strata(strata variable)

Here is the command using the 4-year weight for data collected in the MEC and the output:

svyset [w= wtmec4yr], psu(sdmvpsu) strata(sdmvstra)
(sampling weights assumed)
   pweight: wtmec4yr
       VCE: linearized
Single unit: missing
  Strata 1: sdmvstra
      SU 1: sdmvpsu
     FPC 1: <zero>

Once you do this, Stata remembers these variables and applies them to every subsequent SVY command. If you save the dataset, Stata will remember these variables and apply them automatically when you reopen the data set.

You can change these variables at any time by typing a new svyset command.
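
As a minimal sketch of the workflow described above, the commands below declare the design once and then run a single svy estimation command. The variable bmxbmi (body mass index) is an assumption about your analysis file and is not named on this page; the weight is written as pweight explicitly rather than as w.

```stata
* Declare the survey design once per dataset.
svyset [pweight = wtmec4yr], psu(sdmvpsu) strata(sdmvstra)

* Typing svyset with no arguments displays the current settings,
* which is a quick way to confirm what Stata has remembered.
svyset

* Every svy: command now uses the declared weight, strata, and PSUs.
* bmxbmi is assumed to be the BMI variable in your dataset.
svy: mean bmxbmi
```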

Standard commands

Standard commands are regular Stata commands that can incorporate sampling weights. For example, if standard errors are not needed, you can calculate means with a regular Stata command and the weight variable (e.g., mean with the weight variable).

You only need to use these commands when there is no corresponding SVY command. When you use these commands, keep in mind that:

  • Not all standard commands will take weights.
  • With weights, these analyses will generate accurate point estimates.
  • Because standard commands do not use the design variables (i.e. strata, psu), they will NOT generate accurate standard errors.
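
The points above can be illustrated with a short comparison. This is a sketch only: bmxbmi is an assumed name for the BMI variable, and both commands are expected to report the same weighted mean but different standard errors.

```stata
* Declare the design so that the svy: command below can use it.
svyset [pweight = wtmec4yr], psu(sdmvpsu) strata(sdmvstra)

* Standard command with a sampling weight: the point estimate is
* weighted, but the standard error ignores the strata and PSUs.
mean bmxbmi [pweight = wtmec4yr]

* Survey command: same point estimate, with a standard error that
* reflects the strata and PSUs declared in svyset.
svy: mean bmxbmi
```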

Tip 2: Do not drop observations from the Stata dataset.

To properly calculate the standard errors of your statistics (such as means and percentages), the Taylor series linearization method requires information on ALL records with a non-zero value for your weight variable, including those survey participants who are not in your population of interest. For example, to estimate mean body mass index (BMI) and its standard error for men aged 20 and over, the svy:mean command needs to access the entire dataset of examined individuals who have an exam weight, including females and those younger than 20 years.

As a rule of thumb, it is recommended that you NEVER drop records from your Stata dataset. Instead, you should use the subpop option available on the svy commands to specify your subpopulations of interest.
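
A minimal sketch of the subpop approach for the example above (men aged 20 and over) might look like the following. The variable names riagendr (1 = male), ridageyr (age in years), and bmxbmi (BMI) are assumptions about your data file and should be checked against its documentation; men20 is a hypothetical indicator created for illustration.

```stata
* Declare the design on the full dataset, before any subsetting.
svyset [pweight = wtmec4yr], psu(sdmvpsu) strata(sdmvstra)

* Flag the subpopulation instead of dropping everyone else.
* The "< ." guard keeps a missing age from counting as 20 or over.
generate byte men20 = (riagendr == 1 & ridageyr >= 20 & ridageyr < .)

* All records stay in the dataset; only flagged records enter the estimate.
svy, subpop(men20): mean bmxbmi
```

Using an indicator variable rather than keep or drop leaves the full set of strata and PSUs available to the variance calculation.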

For more details on analyzing subgroups, see Module 4: Variance Estimation.

WARNING

Do not drop observations from the dataset. This may affect variance estimation.

Tip 3: Stata is case-sensitive.

Stata distinguishes between uppercase and lowercase letters, so you must refer to NHANES variables using the lowercase names provided on the data files. For example, you must refer to the respondent sequence number (the key variable) as seqn with all lowercase letters, not as SEQN in uppercase letters. If you refer to variable SEQN, you will get an error message saying variable SEQN not found. To Stata, seqn and SEQN are two different variables.

When you generate your own derived variables, you may choose to name them using uppercase characters, lowercase characters, or a mix of the two. However, you must type the variable name consistently in all of your code.

Stata commands are also case-sensitive. There is a svyset command (in lowercase letters), but there is no SVYSET command (in uppercase letters).

Tip 4: Missing numeric values are represented by large numeric values.

Stata treats missing numeric values (".") as larger than any non-missing number. So, unlike SAS Survey Procedures or SUDAAN, which would place missing values at the bottom of the range, Stata will place them at the top of the range. As a result, a comparison such as "greater than 0" is also true for a missing value unless you exclude missing values explicitly.

For example, to test whether the fasting sample weight (wtsaf2yr) is non-missing and has a positive value, you could use either of the following expressions:

wtsaf2yr < . & wtsaf2yr > 0
!missing(wtsaf2yr) & wtsaf2yr > 0
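
To see why the guard against missing values matters, the sketch below counts records three ways. It assumes only that wtsaf2yr is present in the dataset in memory.

```stata
* Counts records with a positive weight AND records with a missing
* weight, because missing (.) compares as larger than any number.
count if wtsaf2yr > 0

* Counts only records with a non-missing, positive weight.
count if wtsaf2yr > 0 & wtsaf2yr < .

* Equivalent test using the missing() function.
count if !missing(wtsaf2yr) & wtsaf2yr > 0
```

If the first count is larger than the other two, the difference is the number of records with a missing fasting weight.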

Tip 5: Be aware that Stata procedures generally do not correct for the reduction in the degrees of freedom for subgroups where not all PSUs and strata are represented.

The degrees of freedom associated with an estimated statistic are needed to perform hypothesis tests and to compute confidence intervals. For analyses on a subgroup of the NHANES population, the degrees of freedom should be based on the number of strata and PSUs containing the observations of interest. Stata procedures generally calculate the degrees of freedom based on the number of strata and PSUs represented in the overall dataset. Estimates for some subgroups of interest will have fewer degrees of freedom than are available in the overall analytic dataset. (See Module 4: Variance Estimation for more information.)
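
One way to count the strata and PSUs that contain subgroup observations is sketched below. It is not the code from the diabetes prevalence example. It assumes the usual rule that the degrees of freedom equal the number of PSUs minus the number of strata, and it uses hypothetical names: insub is an indicator you would define for your own subgroup (here using the assumed variables riagendr and ridageyr), typically also restricted to records with a positive weight and a non-missing outcome.

```stata
* Hypothetical subgroup indicator: men aged 20 and over.
* riagendr and ridageyr are assumed variable names.
generate byte insub = (riagendr == 1 & ridageyr >= 20 & ridageyr < .)

* Tag one record per stratum, and one per stratum-PSU pair, among
* subgroup members only. egen tag() sets excluded records to 0.
egen byte str_tag = tag(sdmvstra) if insub == 1
egen byte psu_tag = tag(sdmvstra sdmvpsu) if insub == 1

* Count strata and PSUs that contain at least one subgroup record.
quietly count if str_tag == 1
scalar n_strata = r(N)
quietly count if psu_tag == 1
scalar n_psu = r(N)

* Subgroup degrees of freedom under the PSUs-minus-strata rule.
scalar df_sub = n_psu - n_strata
display "Subgroup degrees of freedom: " df_sub
```

Comparing df_sub with the design degrees of freedom reported by the svy commands shows whether the subgroup is missing any strata or PSUs.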

In particular, although the svy:prop command as of Stata 15 has an option citype(exact) to compute Clopper-Pearson ("exact") confidence limits for proportions, these confidence intervals are not based on the correct degrees of freedom for subgroups where not all strata and PSUs are represented. See the code example about diabetes prevalence (which replicates a portion of National Health Statistics Report 123) for code to compute the degrees of freedom for subgroups and then calculate the Korn and Graubard confidence intervals.

R

Tip 1: Do not subset your data frame before defining your survey design object.

To properly calculate the standard errors of your statistics (such as means and percentages), the Taylor series linearization method requires information on ALL records with a non-zero value for your weight variable, including those survey participants who are not in your population of interest. For example, to estimate mean body mass index (BMI) and its standard error for men aged 20 years and over, the svymean function needs information about all examined individuals who have an exam weight, including females and those younger than 20 years.

As a rule of thumb, it is recommended that you NEVER subset your data frame prior to using the svydesign function to define a survey design object. You can then use the subset function on the survey design object to create a new survey design object that keeps the original design information about the number of PSUs and strata but contains only your subpopulation of interest.

For more details on analyzing subgroups, see Module 4: Variance Estimation.

WARNING

Do not drop observations from the dataset. This may affect variance estimation.

Page last reviewed: 8/4/2020