In this module, you will use simple logistic regression to analyze NHANES data to assess the association between gender (riagendr), the exposure or independent variable, and the likelihood of having hypertension (based on bpxsar and bpxdar), the outcome or dependent variable, among participants 20 years old and older. You will then use multiple logistic regression to assess the relationship after controlling for selected covariates. The covariates include age (ridageyr), cholesterol (lbxtc), body mass index (bmxbmi), and fasting triglycerides (lbxtr).
IMPORTANT NOTE
There are several things you should be aware of while analyzing NHANES data with Stata. Please see the Stata Tips page to review them before continuing.
Step 1: Use svyset to define survey design variables
Remember that you need to define the SVYSET before using the SVY series of commands. The general format of this command is below:
svyset [w=weightvar], psu(psuvar) strata(stratavar) vce(linearized)
To define the survey design variables for your hypertension analysis, use the weight variable for four years of MEC data (wtmec4yr), the PSU variable (sdmvpsu), and the strata variable (sdmvstra). The vce option specifies the method for calculating the variance; the default is "linearized", which is Taylor linearization. Here is the svyset command for four years of MEC data:
svyset [w= wtmec4yr], psu(sdmvpsu) strata(sdmvstra) vce(linearized)
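As a quick check that a design has been declared as intended, the sketch below runs svyset on a tiny made-up dataset and then inspects the stored settings. The four rows are hypothetical values, not NHANES records, and the command is written in the form that places the PSU variable first and the weight in [pweight=], which is the form used later in this module's procedure table.

```stata
* Toy design data (hypothetical values, not NHANES records)
clear
input sdmvstra sdmvpsu wtmec4yr
1 1 1000
1 2 1500
2 1 1200
2 2  800
end

* PSU first, sampling weight in brackets, strata and variance method as options
svyset sdmvpsu [pweight=wtmec4yr], strata(sdmvstra) vce(linearized)

* With no arguments, svyset redisplays the current settings
svyset

* Summarize the number of strata and PSUs to confirm the design
svydescribe
```

With the real analysis file loaded, the same two checks (svyset and svydescribe) can be run right after declaring the design, before any svy: estimation command.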
Step 2: Create dependent dichotomous variable
For continuous variables, you have a choice of using the variable in its original form (continuous) or changing it into a categorical variable (e.g. based on standard cutoffs, quartiles or common practice). The categorical variables should reflect the underlying distribution of the continuous variable and not create categories where there are only a few observations.
For the dependent variable, you will create a dichotomous variable, hyper, which defines people as having (or not having) hypertension. Specifically, a person is said to have hypertension if their systolic blood pressure (measured in the MEC) is 140 or higher, or their diastolic blood pressure is 90 or higher, or if they are taking blood pressure medication. Remember that for logistic regression to work in Stata, this variable needs to be defined as 0 (meaning the outcome did not occur; here, the person does not have hypertension) or 1 (the outcome occurred; here, the person has hypertension). In Stata, missing values compare as larger than any number, so each blood pressure comparison is paired with a check that the reading is nonmissing (< .). The code to create this variable is below:
gen hyper=1 if (bpxsar>=140 & bpxsar<.) | (bpxdar>=90 & bpxdar<.) | bpq050a==1
replace hyper=0 if hyper !=1 & (bpxsar !=. & bpxdar !=.)
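The effect of the nonmissing checks can be seen on a few made-up records. The sketch below is only an illustration: the six rows are hypothetical, and the variable naive is introduced here purely to show what happens when the check is left out.

```stata
* Hypothetical records covering complete, partly missing, and fully missing readings
clear
input bpxsar bpxdar bpq050a
150 80 2
120 70 2
  . 95 2
  .  . 1
  .  . 2
120  . 2
end

* Without a nonmissing check, a missing systolic reading counts as >= 140
gen naive = bpxsar>=140

* Definition used in this step: high systolic, high diastolic, or on medication
gen hyper=1 if (bpxsar>=140 & bpxsar<.) | (bpxdar>=90 & bpxdar<.) | bpq050a==1
* Assign 0 only when both readings were actually measured
replace hyper=0 if hyper !=1 & (bpxsar !=. & bpxdar !=.)

list, clean
```

In the listing, the fifth record has naive equal to 1 even though no blood pressure was measured, while hyper stays missing for the fifth and sixth records because their status cannot be determined.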
Step 3: Create independent categorical variables
In addition to creating the dichotomous dependent variable, this example will also create additional independent categorical variables (age, hichol, bmigrp) from the continuous age, cholesterol, and BMI variables to use in this analysis.
Code to generate independent categorical variables

Age:
gen age=1 if ridageyr >=20 & ridageyr <40
replace age=2 if ridageyr >=40 & ridageyr <60
replace age=3 if ridageyr >=60 & ridageyr <.

High cholesterol:
gen hichol =1 if (lbxtc >=240 & lbxtc<.) | bpq100d==1
replace hichol =0 if hichol !=1 & lbxtc !=.

BMI category:
gen bmigrp=1 if bmxbmi<25
replace bmigrp=2 if bmxbmi>=25 & bmxbmi <30
replace bmigrp=3 if bmxbmi>=30 & bmxbmi <.
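It is worth confirming that the cut points put boundary values in the intended category and that out-of-range or missing ages stay missing. The sketch below applies the age recode to a handful of hypothetical ages chosen to sit on each boundary; the values are invented for illustration.

```stata
* Hypothetical ages on each side of the category boundaries, plus a missing value
clear
input ridageyr
19
20
39
40
59
60
85
.
end

gen age=1 if ridageyr >=20 & ridageyr <40
replace age=2 if ridageyr >=40 & ridageyr <60
replace age=3 if ridageyr >=60 & ridageyr <.

* Show category counts, including the missing category
tabulate age, missing
* Show the minimum and maximum original age within each category
tabstat ridageyr, by(age) statistics(min max n)
```

The same tabulate and tabstat checks can be run on the real data for age, hichol, and bmigrp to spot categories with very few observations.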

Step 4: Transform highly skewed variables
Because the triglycerides variable (lbxtr) is highly skewed, you will use a log transformation to create a new variable to use in this analysis.
gen logtrig = log(lbxtr)
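One way to see what the transformation does is to compare the skewness reported by summarize, detail before and after taking logs. The sketch below uses seven hypothetical triglyceride values with a long right tail; the scalar names skew_raw and skew_log are introduced here for illustration. Note that Stata's log() is the natural logarithm.

```stata
* Hypothetical right-skewed triglyceride values (mg/dL)
clear
input lbxtr
50
80
100
120
150
400
1200
end

gen logtrig = log(lbxtr)

* summarize, detail leaves the skewness coefficient in r(skewness)
quietly summarize lbxtr, detail
scalar skew_raw = r(skewness)
quietly summarize logtrig, detail
scalar skew_log = r(skewness)

display "skewness of lbxtr   = " skew_raw
display "skewness of logtrig = " skew_log
```

On these toy values the log-transformed variable is noticeably less skewed than the original; the same comparison on the real data shows how much the transformation helps.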
Step 5: Choose reference groups for categorical variables
For all categorical variables, you need to decide which category to use as the reference group. If you do not specify the reference group options, Stata will choose the lowest numbered group by default. You can use the following general command to tell Stata the reference group:
char var[omit] reference_group_value
For your analyses, use the following commands to specify the following reference groups:
Code to specify reference groups
Variable | Code to specify reference group | Reference group
Gender | char riagendr[omit] 2 | Women
Age | char age[omit] 2 | 40-59 year olds
BMI | char bmigrp[omit] 2 | Overweight (BMI 25-29)
Cholesterol | char hichol[omit] 1 | High cholesterol (hichol = 1)
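The choice of reference group changes how an odds ratio reads, not the underlying fit. The sketch below illustrates this on a hypothetical 2x2 table entered as frequency counts (not NHANES data, and without survey weights). It uses factor-variable notation (ib2., ib1.), which is an alternative to the char and xi approach shown above and assumes Stata 11 or newer.

```stata
* Hypothetical counts: 20 of 100 men and 25 of 100 women have hypertension
clear
input hyper riagendr n
1 1 20
0 1 80
1 2 25
0 2 75
end

* Women (2) as the reference group: odds ratio for men versus women
logistic hyper ib2.riagendr [fweight=n]
scalar or_men = exp(_b[1.riagendr])

* Men (1) as the reference group: odds ratio for women versus men
logistic hyper ib1.riagendr [fweight=n]
scalar or_women = exp(_b[2.riagendr])

display "men vs women = " or_men "   women vs men = " or_women
```

The two odds ratios are reciprocals of each other: (20/80)/(25/75) = 0.75 for men versus women, and 1/0.75 for women versus men.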
Step 6: Create eligibility variable
Because not every participant in NHANES responded to every question asked, the level of item nonresponse may differ from variable to variable. To ensure that your analyses are done on the same set of respondents, create a variable called eligible, which is 1 for individuals who have a nonmissing value for each of the variables used in the analyses, and 0 otherwise. Although this is a univariate analysis using only exam variables, the fasting subsample weight (wtsaf4yr) is included in determining the eligible variable. This is because you will be conducting a multivariate analysis using the triglycerides variable later and will limit the sample to persons included in both analyses. The Stata code defining eligible is:
gen eligible=1 if wtsaf4yr!=. & hyper!=. & riagendr!=. & age!=. & hichol!=. & bmigrp!=. & logtrig!=. & wtsaf4yr!=0
replace eligible=0 if eligible!=1
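The same flag can be written more compactly with Stata's missing() function, which returns 1 if any of its arguments is missing. The sketch below is an alternative formulation, not the tutorial's own code, and the four rows are hypothetical records chosen so that only the first one qualifies.

```stata
* Hypothetical records: complete, zero weight, missing weight, missing covariate
clear
input wtsaf4yr hyper riagendr age hichol bmigrp logtrig
5000 1 1 2 0 2 4.6
   0 0 2 1 0 1 4.2
   . 1 1 3 1 3 5.1
4000 0 2 2 . 2 4.9
end

* 1 when every analysis variable is present and the fasting weight is nonzero, else 0
gen byte eligible = !missing(wtsaf4yr, hyper, riagendr, age, hichol, bmigrp, logtrig) & wtsaf4yr != 0

tabulate eligible
```

Because the expression evaluates to 0 or 1 for every record, no second statement is needed to fill in the zeros.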
Step 7: Create simple logistic regression model to understand relationships
The association between the dependent (or outcome) and independent (or exposure) variables is expressed using the svy: logit command. The dependent variable must be a dichotomous variable, and the independent variables may be discrete, ordinal, or continuous.
The general form of the command to get beta coefficients is:
xi: svy, subpop(condition): logit depvar i.indvar
To get odds ratios with the logit command, use the or option:
xi: svy, subpop(condition): logit depvar i.indvar, or
Odds ratios are automatically produced by the logistic command:
xi: svy, subpop(condition): logistic depvar i.indvar
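The three forms fit the same model; an odds ratio is simply the exponentiated beta coefficient. The sketch below shows this on a hypothetical 2x2 table entered as frequency counts, without survey weights, so the svy prefix and subpop() are left out. The counts and the scalar name beta are made up for illustration.

```stata
* Hypothetical counts: 20 of 100 men and 25 of 100 women have hypertension
clear
input hyper riagendr n
1 1 20
0 1 80
1 2 25
0 2 75
end

* Use women (2) as the reference group, so xi creates _Iriagendr_1 for men
char riagendr[omit] 2

* Beta coefficient (log odds ratio)
xi: logit hyper i.riagendr [fweight=n]
scalar beta = _b[_Iriagendr_1]

* Same model reported as odds ratios, two equivalent ways
xi: logit hyper i.riagendr [fweight=n], or
xi: logistic hyper i.riagendr [fweight=n]

display "beta = " beta "   exp(beta) = " exp(beta)
```

Here exp(beta) equals (20/80)/(25/75) = 0.75, which is the odds ratio that both logit with the or option and logistic display.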
An example command analyzing the relationship between gender and hypertension using the logistic command is shown below:
xi: svy, subpop(if eligible==1): logistic hyper i.riagendr
In this example, the output for the logistic command is:
Highlights in the output include:
 The estimated odds of hypertension for men are 0.89 times the odds for women.
 Assuming a p-value less than 0.05 indicates statistical significance, gender is not significantly associated with hypertension: the p-value for this relationship is 0.156.
Step 8: Specify multiple logistic regression model
Multiple logistic regression uses the same command structure but now includes other independent variables. If you want to create indicator variables for categorical variables, use the xi prefix. The general structure remains the same:
xi: svy, subpop(condition): logistic depvar indvar i.indvar
For this example, you will be using these commands to analyze the effects of gender, age, high cholesterol, BMI, and triglycerides on hypertension. Please note that the svyset command uses the fasting subsample weight, wtsaf4yr, because this analysis includes the triglycerides variable, which was collected only on a subsample of the survey.
svyset [w=wtsaf4yr], psu(sdmvpsu) strata(sdmvstra)
xi: svy, subpop(if eligible==1): logistic hyper i.riagendr i.age i.hichol i.bmigrp logtrig
In this example, this output is:
Highlights from the output include:
 All covariates are statistically significant at p-value < 0.05, except for gender.
 Odds ratios should be interpreted as adjusted odds ratios because there are multiple covariates in the model. The adjusted odds ratio for hypertension is 1.29 (95% CI 1.03 - 1.61) for each unit increase in the log of triglycerides.
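Because logtrig is a natural log, a one-unit increase corresponds to multiplying triglycerides by e (about 2.72), which is hard to picture. A common re-expression is the odds ratio for a doubling, obtained by raising the per-unit odds ratio to the power ln(2). The sketch below applies that arithmetic to the 1.29 quoted above; the scalar names are introduced here for illustration.

```stata
* Odds ratio per one-unit increase in ln(triglycerides), as quoted above
scalar or_per_log_unit = 1.29

* One unit on the natural-log scale is an e-fold change in triglycerides
display "fold change per log unit = " %5.3f exp(1)

* Doubling triglycerides adds ln(2) log units, so the odds ratio is OR^ln(2)
scalar or_doubling = or_per_log_unit^ln(2)
display "odds ratio per doubling  = " %5.3f or_doubling
```

On this scale, a doubling of fasting triglycerides is associated with roughly 1.19 times the adjusted odds of hypertension.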
Step 9: Compare results of simple and multiple logistic regressions
To understand how much adjustment matters, it is helpful to compare the odds ratio from the simple and multiple regression models. The following tables summarize the results.
Table Comparing Differences between Crude Analysis (Simple Logistic Regression) and Adjusted Analysis (Multiple Logistic Regression): Sex
Sex | Crude Analysis % with hypertension | Crude Analysis Odds Ratio* (95% CI) | Adjusted Analysis Odds Ratio* (95% CI) | Crude Analysis p value | Adjusted Analysis p value
men | 27% | 0.89 (0.75 - 1.05) | 0.94 (0.76 - 1.16) | 0.16 | 0.55
women | 30% | Reference Group | Reference Group | Reference Group | Reference Group
Table Comparing Differences between Crude Analysis (Simple Logistic Regression) and Adjusted Analysis (Multiple Logistic Regression): Age
Age (years) | Crude Analysis % with hypertension | Crude Analysis Odds Ratio* (95% CI) | Adjusted Analysis Odds Ratio* (95% CI) | Crude Analysis p value | Adjusted Analysis p value
20-39 | 9% | 0.25 (0.18 - 0.34) | 0.28 (0.21 - 0.38) | <0.001 | <0.001
40-59 | 28% | Reference Group | Reference Group | Reference Group | Reference Group
60+ | 66% | 4.87 (3.76 - 6.3) | 5.27 (4.00 - 6.94) | <0.001 | <0.001
Table Comparing Differences between Crude Analysis (Simple Logistic Regression) and Adjusted Analysis (Multiple Logistic Regression): BMI
BMI | Crude Analysis % with hypertension | Crude Analysis Odds Ratio* (95% CI) | Adjusted Analysis Odds Ratio* (95% CI) | Crude Analysis p value | Adjusted Analysis p value
underweight/normal | 18% | 0.58 (0.46 - 0.72) | 0.67 (0.51 - 0.87) | <0.001 | 0.004
overweight | 28% | Reference Group | Reference Group | Reference Group | Reference Group
obese | 42% | 1.85 (1.52 - 2.25) | 2.18 (1.70 - 2.80) | <0.001 | <0.001
Table Comparing Differences between Crude Analysis (Simple Logistic Regression) and Adjusted Analysis (Multiple Logistic Regression): Cholesterol
Cholesterol | Crude Analysis % with hypertension | Crude Analysis Odds Ratio* (95% CI) | Adjusted Analysis Odds Ratio* (95% CI) | Crude Analysis p value | Adjusted Analysis p value
High | 43% | Reference Group | Reference Group | Reference Group | Reference Group
Low/Normal | 24% | 0.41 (0.34 - 0.49) | 0.78 (0.62 - 0.97) | <0.001 | 0.028
Table Comparing Differences between Crude Analysis (Simple Logistic Regression) and Adjusted Analysis (Multiple Logistic Regression): Triglycerides
Triglycerides | Crude Analysis % with hypertension | Crude Analysis Odds Ratio* (95% CI) | Adjusted Analysis Odds Ratio* (95% CI) | Crude Analysis p value | Adjusted Analysis p value
Triglycerides (log) | N/A | 1.98 (1.65 - 2.37) | 1.28 (1.03 - 1.61) | <0.001 | 0.029
Stata Multivariate Logistic Procedure
Statements 
Explanation 
use "C:\Stata\tutorial\analysis_data.dta", clear

Use the use command to load the Stata-format dataset. Use the clear option to replace any data in memory.

svyset sdmvpsu [pweight=wtsaf4yr], strata(sdmvstra) vce(linearized)

Use the svyset command to declare the survey design for the dataset. Specify the PSU variable sdmvpsu. Use the [pweight=] option to account for the unequal probability of sampling and nonresponse. In this example, the MEC fasting weight for four years of data (wtsaf4yr) is used. Use the strata() option to specify the stratum identifier (sdmvstra). Use the vce() option to specify the variance estimation method (linearized) for Taylor linearization.

char age[omit] 2
char riagendr[omit] 2
char bmigrp[omit] 2
char hichol[omit] 1

Use these statements to choose the reference group for each categorical variable. For example, the 2nd age category (age 40-59) is chosen as the reference group.
If you do not specify the reference group options, Stata will choose the lowest numbered group by default.

xi: svy, subpop(if ridageyr >=20) vce(linearized): logit hyper i.age i.riagendr i.hichol i.bmigrp logtrig

Use the xi command to expand terms containing categorical variables (denoted i.varname) into indicator (also called dummy) variable sets. Use the svy: logit command to perform multiple logistic regression to assess the association between hypertension and multiple risk factors, including age, gender, high cholesterol, body mass index, and fasting triglycerides. Use the subpop() option to select a subpopulation for analysis, rather than selecting the study population in the Stata program while preparing the data file. (Note: without the or option, which appears in the next statement, estimates are reported as coefficients.)

xi: svy, subpop(if ridageyr >=20) vce(linearized): logit hyper i.age i.riagendr i.hichol i.bmigrp logtrig, or

Use the xi command to expand terms containing categorical variables into indicator (also called dummy) variable sets. Use the svy: logit command to perform multiple logistic regressions to assess the association between hypertension and multiple risk factors, including: age, gender, high cholesterol, body mass index, and fasting triglycerides.
Use the subpop( ) option to select a subpopulation for analysis, rather than select the study population in the Stata program while preparing the data file. Use the or option to produce estimates as odds ratios.

test
*******************************
test _Iage_1 _Iage_3, nosvyadjust
test _Iriagendr_1, nosvyadjust
test _Ihichol_0, nosvyadjust
test _Ibmigrp_1 _Ibmigrp_3, nosvyadjust
test logtrig, nosvyadjust

Use the test postestimation command to produce the Wald F statistic and the corresponding p-value. Use the nosvyadjust option to produce the unadjusted Wald F. In the example, the test command is used to test all coefficients together and then each set of coefficients separately.

WARNING
The Stata command svy: logit produces the adjusted and unadjusted Wald statistics and their p values. It does not produce the Satterthwaite χ² or the Satterthwaite F and the corresponding p values recommended for NHANES analyses.
Step 10: Postestimation
You may want to know whether different comparisons (other than those against the reference categories you specified) are significant. In that case, you can use a postestimation command (i.e., a command that can only be run after you have run the logit model command). If you want the unadjusted Wald F, this takes the general form:
test vargroup, nosvyadjust
This example uses this command to test whether the likelihood of having hypertension differs significantly between the youngest and the oldest age groups:
test _Iage_1 = _Iage_3, nosvyadjust
The results for this example are:
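Alongside the significance test, it can help to see the odds ratio for the same comparison. The sketch below pairs test with lincom and its or option on hypothetical counts for three age groups (not NHANES data, and without survey weights, so the svy prefix and nosvyadjust are left out). After the survey model in Step 8, the same lincom statement could be run once the model has been fit.

```stata
* Hypothetical counts with hypertension in age groups 1, 2, 3: 10, 30, 60 of 100
clear
input hyper age n
1 1 10
0 1 90
1 2 30
0 2 70
1 3 60
0 3 40
end

* Middle age group as reference, so xi creates _Iage_1 and _Iage_3
char age[omit] 2
xi: logistic hyper i.age [fweight=n]

* Test whether the youngest and oldest groups differ from each other
test _Iage_1 = _Iage_3

* Odds ratio comparing the youngest group with the oldest group
lincom _Iage_1 - _Iage_3, or
```

On these toy counts the odds ratio reported by lincom is (10/90)/(60/40), about 0.074, which is the ratio of the two groups' odds ratios relative to the shared reference group.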