Tip 1: There are two series of commands.
There are two series of commands you can use to analyze NHANES data in Stata.
SVY Commands
SVY commands are a series of commands specifically designed to analyze data from complex survey designs like NHANES. To calculate means and standard errors, you would use Stata survey (svy) commands because they account for the complex survey design of NHANES data when determining variance estimates. These commands can also be used for simple random samples.
Whenever you want to use SVY commands, you must first set up Stata by defining the survey design variables using the svyset command. This command has the general structure:
svyset [w= weight], psu(psu variable) strata(strata variable)
Here is the command using the 4-year weight for data collected in the MEC and the output:
svyset [w= wtmec4yr], psu(sdmvpsu) strata(sdmvstra)
(sampling weights assumed)
pweight: wtmec4yr
VCE: linearized
Single unit: missing
Strata 1: sdmvstra
SU 1: sdmvpsu
FPC 1: <zero>
Once you do this, Stata remembers these variables and applies them to every subsequent SVY command. If you save the dataset, Stata will remember these variables and apply them automatically when you reopen the dataset.
You can change these variables at any time by typing a new svyset command.
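The same declaration can also be written with the PSU variable named first and the weight given as an explicit pweight, which is the form shown in current Stata documentation. The lines below are a minimal sketch, assuming an NHANES dataset containing the variables used above is already in memory:

```stata
* Declare the design: PSU first, then the sampling weight, then the strata
svyset sdmvpsu [pweight = wtmec4yr], strata(sdmvstra)

* Typing svyset with no arguments redisplays the current settings
svyset

* svydescribe lists the strata and the number of PSUs in each one
svydescribe
```

Checking the svydescribe table before running an analysis is a quick way to confirm that the design variables were picked up as intended.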
Standard commands
Standard commands are regular Stata commands that can incorporate sampling weights. If standard errors are not needed, you can simply use a regular Stata command with the weight variable (e.g., mean with the weight variable) to calculate means.
You only need to use these commands when there is no corresponding SVY command. When you use these commands, keep in mind that:
- Not all standard commands will take weights.
- With weights, these analyses will generate accurate point estimates.
- Because standard commands do not use the design variables (i.e. strata, psu), they will NOT generate accurate standard errors.
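The points above can be illustrated by running a standard command and a SVY command on the same variable. This is only a sketch: it assumes the design has already been declared with svyset, and it assumes bmxbmi is the name of the body mass index variable in your dataset.

```stata
* Standard command with a weight: use it for the point estimate only.
* summarize accepts aweights but not pweights.
summarize bmxbmi [aweight = wtmec4yr]

* Survey command: same weighted mean, with a standard error that
* uses the strata and PSU variables declared by svyset
svy: mean bmxbmi
```

The two means should agree. Only the svy: mean output gives a standard error that reflects the strata and PSUs.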
Tip 2: Do not drop observations from the Stata dataset.
To properly calculate the standard errors of your statistics (such as means and percentages), the Taylor series linearization method requires information on ALL records with a non-zero value for your weight variable, including those survey participants who are not in your population of interest. For example, to estimate mean body mass index (BMI) and its standard error for men aged 20 and over, the svy: mean command needs to access the entire dataset of examined individuals who have an exam weight, including females and those younger than 20 years.
As a rule of thumb, you should NEVER drop records from your Stata dataset. Instead, use the subpop option available on the svy commands to specify your subpopulations of interest.
For more details on analyzing subgroups, see Module 4: Variance Estimation.
WARNING
Do not drop observations from the dataset. This may affect variance estimation.
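As a sketch of the subpop approach for the example of men aged 20 and over, the lines below build a 0/1 indicator and pass it to the svy prefix. The variable names riagendr (1 = male), ridageyr (age in years), and bmxbmi (BMI) are assumptions about your dataset, and men20 is a hypothetical name chosen for illustration.

```stata
* Assumes the design has already been declared with svyset (see Tip 1)

* 1 for men aged 20 and over, 0 for everyone else; the "< ." test keeps
* records with a missing age out of the subpopulation
generate byte men20 = (riagendr == 1 & ridageyr >= 20 & ridageyr < .)

* All records stay in the dataset; only men20 == 1 contributes to the mean
svy, subpop(men20): mean bmxbmi
```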
Tip 3: Stata is case-sensitive.
Stata cares about the case of the letters, so you must refer to NHANES variables using the lowercase names provided on the data files. For example, you must refer to the respondent sequence number (the key variable) as seqn with all lowercase letters, not as SEQN in uppercase letters. If you refer to variable SEQN, you will get an error message saying variable SEQN not found. To Stata, seqn and SEQN represent two different variables.
When you generate your own derived variables, you may choose to name them using uppercase characters, lowercase characters, or a mix of the two. However, you must type the variable name consistently in all of your code.
Stata commands are also case-sensitive. There is a svyset command (in lowercase letters), but there is no SVYSET command (in uppercase letters).
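A quick way to see this behavior without stopping a do-file is to test for each spelling of the variable name. This is a small sketch that assumes an NHANES dataset with the lowercase variable seqn is in memory:

```stata
* Lowercase name: describes the variable as expected
describe seqn

* Uppercase name: confirm fails because no variable named SEQN exists;
* capture suppresses the error and stores the return code in _rc
capture confirm variable SEQN
display _rc
```

A non-zero value of _rc indicates that the uppercase name was not found.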
Tip 4: Missing numeric values are represented by large numeric values.
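Stata represents missing numeric values (".") as large numeric values, larger than any non-missing number. So, unlike SAS Survey Procedures or SUDAAN, which place missing values at the bottom of the range, Stata places them at the top of the range. As a result, a comparison such as "greater than 0" is also true for a missing value.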
For example, to test whether the fasting sample weight (wtsaf2yr) is non-missing and has a positive value, you could use either of the following expressions:
wtsaf2yr < . & wtsaf2yr > 0
!missing(wtsaf2yr) & wtsaf2yr > 0
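To see why the missing-value test matters, you can compare counts with and without it. This is a sketch that assumes wtsaf2yr is present in the dataset in memory; fastwt is a hypothetical variable name used only for illustration.

```stata
* Missing values compare as larger than any number, so this count
* also includes records where wtsaf2yr is missing
count if wtsaf2yr > 0

* Excluding missing values gives the intended count
count if wtsaf2yr > 0 & wtsaf2yr < .

* 0/1 flag for records with a positive, non-missing fasting weight
generate byte fastwt = (wtsaf2yr > 0 & !missing(wtsaf2yr))
```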
Tip 5: Be aware that Stata procedures generally do not correct for the reduction in the degrees of freedom for subgroups where not all PSUs and strata are represented.
The degrees of freedom associated with an estimated statistic are needed to perform hypothesis tests and to compute confidence intervals. For analyses on a subgroup of the NHANES population, the degrees of freedom should be based on the number of strata and PSUs containing the observations of interest. Stata procedures generally calculate the degrees of freedom based on the number of strata and PSUs represented in the overall dataset. Estimates for some subgroups of interest will have fewer degrees of freedom than are available in the overall analytic dataset. (See Module 4: Variance Estimation for more information.)
In particular, although the svy: proportion command as of Stata 15 has an option citype(exact) to compute Clopper-Pearson ("exact") confidence limits for proportions, these confidence intervals are not based on the correct degrees of freedom for subgroups where not all strata and PSUs are represented. See the code example about diabetes prevalence (which replicates a portion of National Health Statistics Report 123) for code to compute the degrees of freedom for subgroups and then calculate the Korn and Graubard confidence intervals.
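One way to count the strata and PSUs that actually contain subgroup records is sketched below. This is an illustration under stated assumptions, not the code from the diabetes prevalence example, and it does not compute Korn and Graubard intervals. It assumes the design has been declared with svyset, that insub is a hypothetical 0/1 subgroup indicator, and that bmxbmi is the analysis variable. Run it as one do-file so the local macros remain defined.

```stata
* Tag one record per PSU and one per stratum among subgroup records
* that have a positive, non-missing weight
egen byte psutag = tag(sdmvstra sdmvpsu) if insub == 1 & wtmec4yr > 0 & wtmec4yr < .
egen byte stratag = tag(sdmvstra) if insub == 1 & wtmec4yr > 0 & wtmec4yr < .

* Count the tagged records
count if psutag == 1
local npsu = r(N)
count if stratag == 1
local nstrata = r(N)

* Degrees of freedom: PSUs minus strata that contain subgroup records
local df = `npsu' - `nstrata'
display "Subgroup degrees of freedom: `df'"

* Pass the value to the svy prefix through its dof() option
svy, subpop(insub) dof(`df'): mean bmxbmi
```

A stricter version would also require a non-missing value of the analysis variable when tagging. Whether that is appropriate depends on the analysis.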