In this example, you will assess the association between high density lipoprotein (HDL) cholesterol — the outcome variable — and body mass index (bmxbmi
) — the exposure variable — after controlling for selected covariates in NHANES 19992002. These covariates include gender (riagendr
), race/ethnicity (ridreth1
), age (ridageyr
), smoking (smoke
r, derived from SMQ020
and SMQ040; smoker
=1 if nonsmoker, 2 if past smoker and 3 if current smoker) and education (dmdeduc
).
IMPORTANT NOTE
There are several things you should be aware of while analyzing NHANES data with Stata. Please see the Stata Tips page to review them before continuing.
Step 1: Use svyset
to define survey design variables
Remember that you need to define the SVYSET before using the SVY series of commands. The general format of this command is below:
svyset [w=weightvar], psu(psuvar) strata(stratavar) vce(linearized)
To define the survey design variables for your high density lipoprotein cholesterol analysis, use the weight variable for fouryears of MEC data (wtmec4yr
), the PSU variable (sdmvpsu
), and strata variable (sdmvstra
) .The vce
option specifies the method for calculating the variance and the default is "linearized" which is Taylor linearization. Here is the svyset
command for four years of MEC data:
svyset [w= wtmec4yr], psu(sdmvpsu) strata(sdmvstra) vce(linearized)
Step 2: Determine how to specify variables in the model
For continuous variables, you have a choice of using the variable in its original form (continuous) or changing it into a categorical variable (e.g. based on standard cutoffs, quartiles or common practice). The categorical variables should reflect the underlying distribution of the continuous variable and not create categories where there are only a few observations.
It is important to exam the data both ways, since the assumption that a dependent variable has a continuous relationship with the outcome may not be true. Looking at the categorical version of the variable will help you to know whether this assumption is true.
In this example, you could look at BMI as a continuous variable or convert it into a categorical variable based on standard BMI definitions of underweight, normal weight, overweight and obese. Here is how categorical BMI variables are created:
Table of code to generate categorical BMI variable
Code to generate categorical BMI variables

BMI Category 
gen bmicat=1 if bmxbmi>=0 & bmxbmi<18.5

underweight 
replace bmicat=2 if bmxbmi>=18.5 & bmxbmi<25

normal weight 
replace bmicat=3 if bmxbmi>=25

overweight 
replace bmicat=4 if bmxbmi>=30 & bmxbmi<.

obese 
Step 3: Determine the reference group for categorical variables
For all categorical variables, you need to decide which category to use as the reference group. If you do not specify the reference group options, Stata will choose the lowest numbered group by default.
Use the following general command to specify the reference group:
char var[omit]reference group value
For these analyses, use the following commands to specify the following reference groups.
Stata command 
Reference group 
char ridreth1[omit]3

NonHispanic White 
char smoker[omit]3

Current Smokers 
char educ[omit]3

Greater than high school education 
char bmicat[omit]2

Normal weight 
Step 4: Create simple linear regression models to understand relationships
Before you perform a regression on the data, the data needs to meet a requirement — the dependent variable must be a continuous variable and the independent variables may be either discrete, ordinal, or continuous. The association between the dependent (or outcome) and independent (or exposure) variables is expressed using the svy:regress
command. The general form of the command is:
svy:regress depvar indvar
Here is the command (and output) for the BMIHDL example. This example uses the subpop (if eligible==1
) statement to restrict the analysis to individuals with complete data for all the variables used in the final multiple regression model. The eligible
variable is defined in the program available on the Sample Code and Datasets page.
svy, subpop(if eligible==1): regress lbdhdl bmxbmi
And, here is the output of the statement.
This analysis says that for each 1 unit increase of BMI, on average, HDL decreases by 0.69 mg/dl. Or, you could do the simple regression using the BMI categories:
To perform the same analysis using the categorical BMI variable, bmicat
, the statement would be:
svy, subpop(if eligible==1): regress lbdhdl bmicat
And, the output of that statement would be:
This model says that, on average, HDL levels decrease by 5.6 mg/dl between the underweight BMI category and the normal weight BMI category, or the normal weight BMI category to the overweight BMI category.
Using the interaction expansion function (xi
) to expand categorical variables into indicator variable sets
Delving deeper into this relationship, you will look at each comparison separately to see whether this continuous relationship really holds. Stata has a function (called xi
or interaction expansion
) which creates the "indicator variables" to allow you to see these relationships. The xi
function will expand terms containing categorical variables (denoted i.varname
) into indicator (also called dummy) variable sets. It has this general form:
xi:svy:regress depvar i.indvar
For this example, you will use the HDL variable as the dependent variable and the BMI categorical variable (bmicat
) as the independent variable, denoted with the i.
prefix. This example uses the subpop(if ridageyr >= 20 & ridageyr =.
) statement to select participants who were age 20 years and older and did not have a missing value for the age variable.
xi:svy, subpop(if ridageyr >=20 & ridageyr =.): regress lbdhdl i.bmicat
Here are the results for this analysis which use "normal weight"  bmicat2
as the reference category:
This analysis using the BMI categorical variable (BMICAT
) shows that the relationship is not linear and the difference in HDL is between underweight and normal is 3.2 compared to a 7.5 difference between normal weight and overweight.
IMPORTANT NOTE
You can also just use the xi
option to generate the indicator variables or interaction terms (rather than using it with the model
command). The advantage of creating the indicators prior to the model, is that you do not need to write a command to set the reference category (you will do this implicitly by selecting the indicators you include in the model) and the output is easier to read (the xi model
command repeats the individual components in interaction terms).
It is also possible to generate indicator variables using the tab, generate
command:
tab var, gen(newvar
); for example: tab bmicat, gen(ibmicat)
This command generates four variables: ibmicat1
, ibmicat2
, ibmicat3
, and ibmicat4.
Step 5: Specify multiple linear regression models
Multiple linear regression uses the same command structure but now includes other independent variables. And if you want to create indicator variables for categorical variables, you will want to use the xi
option. So, the general structure looks the same:
xi: svy: regress depvar indvar i.var
This example will use the HDL variable (lbdhdl
) as the dependent variable. The independent categorical variables (riagendr
, ridreth1
, smoker
, educ
, and bmicat
) are specified with the i.
prefix, while ridageyr
remains an independent continuous variable. Again, it uses the subpop(if ridageyr >= 20 & ridageyr =.
) statement to select participants who were age 20 years and older and did not have a missing value for the age variable.
xi: svy, subpop(if ridageyr >=20 & ridageyr <.): regress lbdhdl i.riagendr i.ridreth1 ridageyr i.smoker i.educ i.bmicat
In this example, the output is:
Later in this module, the results of this multiple regression will be presented in a summary table comparing it with the univariate regression.
Step 6: Calculate means "adjusted" for the covariates in the model
Sometimes you may want to calculate means which are adjusted for the covariates specified in the model to allow you to see the effect of a given predictor variable. Stata has a built in command, adjust
, to do this. Adjust
is a postestimation command
The adjust
command uses only the sample mean, not the mean based on the survey design, when performing its computations. Therefore, if you want to use the survey mean, you would need to calculate it first and specify it explicitly in the adjust
command. The following commands use summarize
which is an rclass command and will not cause any trouble if run between the svy: regress
and adjust
commands; whereas svy:mean
is an eclass
command and cannot be used in between these commands. Here is the general form of the command:
sum _cat_1 [aw=weight] if conditions & e(sample)
local cat1 = r(mean)
...
adjust indvar1=cat1 indvar2=cat2.... if e(sample), by(indvar3)
The variables have to appear just as they are in the regression model. Independent categorical variables are specified with the _I
prefix, while the independent continuous variable, ridageyr
, doesn't not require the prefix. The following command, will generate mean HDL levels for BMI categories (by (bmicat
)), adjusting for every other variable in the model.
sum _Ibmicat_1 [aw=wtmec4yr] if ridageyr >=20 & ridageyr <. & e(sample)
local bmicat1 = r(mean)
sum _Ibmicat_3 [aw=wtmec4yr] if ridageyr >=20 & ridageyr <. & e(sample)
local bmicat3 = r(mean)
sum _Ibmicat_4 [aw=wtmec4yr] if ridageyr >=20 & ridageyr <. & e(sample)
local bmicat4 = r(mean)
adjust _Iriagendr_2='riagendr2' _Iridreth1_1='rid1' _Iridreth1_2='rid2' ///
_Iridreth1_4='rid4' _Iridreth1_5='rid5' ridageyr='ridage' ///
_Ieduc_1='educ1' _Ieduc_2='educ2' _Ismoker_1='smoke1' ///
_Ismoker_2='smoke2' if ridageyr >=20 & ridageyr <. & e(sample), by(bmicat) se
The output for this example is:
Later in this module, the results of this adjusted means calculation will be presented in a summary table comparing it with the crude mean.
Step 7: Compare results of crude analysis (simple linear regression) and adjusted analysis (multiple linear regression)
To understand how much adjustment matters, it is helpful to compare the regression coefficient from the simple and multiple regression models. To help you review the results, the following summary tables present the crude analysis (simple linear regression) and adjusted analysis (multiple linear regression).
Table Comparing Differences between Crude Analysis (Simple Linear Regression) and Adjusted Analysis (Multiple Linear Regression)  BMI
BMI 
Crude Analysis Mean HDL 
Adjusted Analysis Mean HDL 
Crude Analysis Coefficient* (95% CI) 
Adjusted Analysis Coefficient* (95% CI) 
Crude Analysis p value 
Adjusted Analysis p value 
underweight 
60.44 
59.40 
3.26 (.37 — 6.15) 
2.30 (.44 — 5.04) 
.028 
.097 
normal 
57.18 
57.10 
Reference Group 
Reference Group 
Reference Group 
Reference Group 
overweight 
49.69 
50.55 
7.48 (8.50 — 6.46) 
6.55 (7.38 — 5.71) 
<.001 
<.001 
obese 
45.94 
45.10 
11.24 (12.17 — 10.31) 
12.00 (12.79 — 11.22) 
<.001 
<.001 
Here are the summary tables of the results for the other covariates in the model.
Table Comparing Differences between Crude Analysis (Simple Linear Regression) and Adjusted Analysis (Multiple Linear Regression)  Smoking
Smoking 
Crude Analysis Mean HDL 
Adjusted Analysis Mean HDL 
Crude Analysis Coefficient* (95% CI) 
Adjusted Analysis Coefficient* (95% CI) 
Crude Analysis p value 
Adjusted Analysis p value 
current 
49.35 
51.27 
Reference Group 
Reference Group 


past 
51.64 
52.38 
2.76 (1.29 — 4.23) 
2.33 (1.00 — 3.66) 
.001 
.001 
never 
52 
50.05 
2.32 (.85 — 3.79) 
1.22 (.22 — 2.66) 
.003 
.095 
Table Comparing Differences between Crude Analysis (Simple Linear Regression) and Adjusted Analysis (Multiple Linear Regression)  Sex
Sex 
Crude Analysis Mean HDL 
Adjusted Analysis Mean HDL 
Crude Analysis Coefficient* (95% CI) 
Adjusted Analysis Coefficient* (95% CI) 
Crude Analysis p value 
Adjusted Analysis p value 
men 
45.91 
46.08 
Ref erence Group 
Ref erence Group 


women 
56.21 
56.06 
10.30 (9.54 — 11.06) 
9.98 (9.3 — 10.64) 
<.001 
<.001 
Table Comparing Differences between Crude Analysis (Simple Linear Regression) and Adjusted Analysis (Multiple Linear Regression)  Race/Ethnicity
Race/Ethnicity 
Crude Analysis Mean HDL 
Adjusted Analysis Mean HDL 
Crude Analysis Coefficient* (95% CI) 
Adjusted Analysis Coefficient* (95% CI) 
Crude Analysis p value 
Adjusted Analysis p value 
NonHispanic White 
51.38 
50.88 
Reference Group 
Reference Group 


NonHispanic Black 
54.5 
55.83 
3.12 (1.54 — 4.70) 
4.95 (3.61 — 6.29) 
<.001 
<.001 
Other Hispanic 
47.71 
48.64 
3.67 (5.47 — 1.88) 
2.24 (3.59 — .89) 
<.001 
.002 
MexicanAmerican 
48.92 
51.55 
2.46 (3.59 — 1.33) 
.67 (.46 — 1.80) 
<.001 
.235 
Other 
50.91 
50.32 
.47 (3.28 — 2.33) 
.56 (2.83 — 1.71) 
.733 
.619 
Table Comparing Differences between Crude Analysis (Simple Linear Regression) and Adjusted Analysis (Multiple Linear Regression)  Education
Education 
Crude Analysis Mean HDL 
Adjusted Analysis Mean HDL 
Crude Analysis Coefficient* (95% CI) 
Adjusted Analysis Coefficient* (95% CI) 
Crude Analysis p value 
Adjusted Analysis p value 
< high school 
49.37 
49.34 
3.10 (4.41 — 1.79) 
3.03 (4.16 — 1.90) 
<.001 
<.001> 
high school 
50.30 
50.52 
2.18 (3.23 — 1.12) 
1.85 (2.98 — .73) 
<.001 
.002 
> high school 
52.47 
52.37 
Reference Group 
Reference Group 


Table Comparing Differences between Crude Analysis (Simple Linear Regression) and Adjusted Analysis (Multiple Linear Regression)  Age
Age 
Crude Analysis Mean HDL 
Adjusted Analysis Mean HDL 
Crude Analysis Coefficient* (95% CI) 
Adjusted Analysis Coefficient* (95% CI) 
Crude Analysis p value 
Adjusted Analysis p value 
Age (years) 
 
 
.11 (.07 — .12) 
 
<.001 
<.001 
Step 8: Perform postestimation test
Use the test
post estimation command to produce the Wald F statistic and the corresponding pvalue. Use the nosvyadjust
option to produce the unadjusted Wald F. In the example, the command test
is used to test all coefficients together; all coefficients separately; and to test the hypothesis that HDL cholesterol for nonsmokers is the same as that for past smokers.
The general form of this statement is below.
test indvar 1 ind var 2 ..., [nosvyadjust]
This example tests all of the coefficients.
test
Here are the results of that statement:
The results of this test will be discussed in the next step.
This example tests the gender coefficient and, using the nosvyadjust
option, produces the unadjusted Wald F. Tests for the additional variables are included in the program available on the Sample Downloads and Datasets page.
test _Iriagendr_2, nosvyadjust
Here are the results of that statement:
The results of this test will be discussed in the next step.
This example tests the hypothesis that HDL cholesterol for nonsmokers is the same as that for past smokers.
test _Ismoker_1  _Ismoker_2 = 0
Here are the results of the statement:
The results of this test will be discussed in the next step.
Step 9: Review output
In this step, the Stata output is reviewed.
 HDL cholesterol is 6.55 mg/dL higher for overweight adults compared to normal weight adults, as defined by BMI.
 HDL cholesterol is 12.00 mg/dL higher for obese adults compared to normal weight adults, as defined by BMI.
 HDL cholesterol is 2.30 mg/dL lower for underweight adults compared to normal weight adults, as defined by BMI.
 HDL cholesterol is 9.98 mg/dL higher for women than for men, after adjusting for all other variables in the model.
 The F test for gender shows a significant effect (p < 0.001) of gender for HDL cholesterol when controlling for other covariates in the model.
 HDL cholesterol is 4.95 mg/dL higher for nonHispanic Blacks compared to nonHispanic Whites, after adjusting for all other variables in the model.
 HDL cholesterol increases 0.11 mg/dL per unit increase in age.
Stata svy:regres
Command for Multiple Linear Regression
Statements 
Explanation 
use "C:\Stata\tutorial\analysis_data.dta", clear

Use the use command to load the Stataformat dataset.
Use the clear option to replace any data in memory.

svyset sdmvpsu [pweight=wtmec4yr], strata(sdmvstra) vce(linearized)

Use the svyset command to declare the survey design for the dataset. Specifiy the psu variable sdmvpsu .
Use the [pweight=] option to account for the unequal probability of sampling and nonresponse. In this example, the MEC weight for four years of data (wtmec4yr ) is used. Use the strata ( ) option to specify the stratum identifier (sdmvstra ). Use the vce( ) option to specific the variance estimation method (linearized ) for Taylor linearization.

char ridreth1[omit]3
char smoker[omit]3
char educ[omit]3
char bmicat[omit]2

Use these options to choose your reference group for the categorical variables. For example, the 3rd race/ethnicity (ridreth1 ) category (nonHispanic White) is chosen as the reference group. If you do not specify the reference group options, Stata will choose the lowest numbered group by default.

xi: svy, subpop(if ridageyr >=20) vce(linearized): regress lbdhdl i.riagendr i.ridreth1 ridageyr bmxbmi i.smoker i.educ i.bmicat

Use the xi command to expand terms containing categorical variables (denoted i.varname ) into indicator (also called dummy) variable sets. Use the svy: regress command to perform multiple linear regression to specify the dependent variable HDL cholesterol (lbdhdl ) and independent variables, including: gender, race, age, body mass index, smoking, education and BMI category.

test
*******************************
test _Iriagendr_2, nosvyadjust
test _Iridreth1_1 _Iridreth1_2 _Iridreth1_4 _Iridreth1_5, nosvyadjust
test ridageyr, nosvyadjust
test _Ibmicat_1 _Ibmicat_3 _Ibmicat_4, nosvyadjust
test _Ismoker_1 _Ismoker_2, nosvyadjust
test _Ieduc_1 _Ieduc_2, nosvyadjust
*******************************
test _Ismoker_1  _Ismoker_2 =0

Use the test postestimation command to produce the Wald F statistic and the corresponding pvalue. Use the nosvyadjust option to produce the unadjusted Wald F. In the example, the command test is used to test all coefficient together; all coefficients separately; and to test the hypothesis that HDL cholesterol for nonsmokers is the same as that for past smokers.

Special topic: Interactions
If you want to look for interactions, use the xi
option to create interaction terms. The general form for interaction terms is:
i.var1*i.var2
See the sample code in Sample Datasets and Code for a model with interaction term included.