Module 9: Logistic Regression

The NHANES Tutorials are currently being reviewed and revised, and are subject to change. Specialized tutorials (e.g. Dietary, etc.) will be included in the future.

Logistic regression is a statistical method used to assess the likelihood of a disease or health condition as a function of a risk factor (and covariates). There are two kinds of logistic regression, simple and multiple. Both assess the association between one or more independent variables (Xi), sometimes called exposure or predictor variables, and a dichotomous dependent variable (Y), sometimes called the outcome or response variable.

Logistic Regression

Both simple and multiple logistic regression assess the association between independent variable(s) (Xi) and a dichotomous dependent variable (Y). Logistic regression analysis tells you how much an increment in a given exposure variable affects the odds of the outcome.

Simple logistic regression is used to explore associations between one (dichotomous) outcome and one (continuous, ordinal, or categorical) exposure variable. Simple logistic regression lets you answer questions like, "How does gender affect the probability of having hypertension?"

Multiple logistic regression is used to explore associations between one (dichotomous) outcome variable and two or more exposure variables (which may be continuous, ordinal or categorical). The purpose of multiple logistic regression is to let you isolate the relationship between the exposure variable and the outcome variable from the effects of one or more other variables (called covariates or confounders). Multiple logistic regression lets you answer the question, "how does gender affect the probability of having hypertension, after accounting for — or unconfounded by — or independent of — age, income, etc.?" This process — accounting for covariates or confounders — is also called adjustment.

Comparing the results of simple and multiple logistic regression can help to answer the question "how much did the covariates in the model alter the relationship between exposure and outcome (i.e., how much confounding was there)?"

Research Question

In this module, you will assess the association between gender (the exposure variable) and the likelihood of having hypertension (the outcome). You will look at both simple logistic regression and then multiple logistic regression. The multiple logistic regression will include the covariates of age, cholesterol, body mass index (BMI) and fasting triglycerides. This analysis will answer the question, what is the effect of gender on the likelihood of having hypertension — after controlling for age, cholesterol, BMI, and fasting triglycerides?

Dependent Variable and Independent Variables

As noted, the dependent variable Yi for a logistic regression is dichotomous, which means that it can take on one of two possible values. NHANES includes many questions that people must answer with either "yes" or "no", such as "Has a doctor ever told you that you have congestive heart failure?" You can also create dichotomous variables by setting a threshold (e.g., "diabetes" = fasting blood sugar > 126) or by combining information from several variables. In this module, you will create a dichotomous variable called "hyper" based on two kinds of information: measured blood pressure and use of blood pressure medication. In SUDAAN, SAS Survey, and Stata, the dependent variable is coded as 1 (for having the outcome) and 0 (for not having the outcome). In this example, people with elevated measured blood pressure or reported use of blood pressure medication have a value of 1 for the hypertension variable, while people whose measured blood pressure is below the thresholds and who do not report taking blood pressure medication have a value of 0.

The independent variables Xj can be dichotomous (e.g. gender, "high cholesterol"), ordinal (e.g. age groups, BMI categories), or continuous (e.g. fasting triglycerides).

Logit Function

Since you are trying to find associations between risk factors and a condition, you need a formula that will allow you to link these variables. The logit function that you use in logistic regression is also known as the link function because it connects, or links, the values of the independent variables to the probability of occurrence of the event defined by the dependent variable.

Logit Model

Logit model formula

In the logit formula above, E(Yi) = pi means that the expected value of Yi equals the probability that Yi = 1. Here, "log" denotes the natural logarithm.
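In standard notation, a logit model consistent with the definitions in this section can be written as shown below. This is a sketch: the intercept symbol b_0, the number of covariates k, and the double subscript X_{ij} (individual i, variable j) are notational assumptions, not taken from the original formula.

```latex
\operatorname{logit}(p_i) = \log\!\left(\frac{p_i}{1 - p_i}\right)
  = b_0 + b_1 X_{i1} + b_2 X_{i2} + \cdots + b_k X_{ik},
\qquad
p_i = E(Y_i) = \frac{1}{1 + e^{-(b_0 + b_1 X_{i1} + \cdots + b_k X_{ik})}}
```

The second expression is the first one solved for p_i, which is why predicted probabilities always fall between 0 and 1.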

Output of Logistic Regression

The statistics of primary interest in logistic regression are the b coefficients (b1, b2, b3, ...), their standard errors, and their p-values. As with other statistics, the standard errors are used to calculate confidence intervals around the beta coefficients.

The interpretation of the beta coefficients for different types of independent variables is as follows:

If Xj is a dichotomous variable with values of 1 or 0, then the b coefficient represents the log odds ratio, that is, the difference in the log odds of the event between a person with Xj=1 and a person with Xj=0. In a multivariate model, this b coefficient is the independent effect of variable Xj on Yi after adjusting for all other covariates in the model.

If Xj is a continuous variable, then e^b represents the odds ratio comparing an individual with Xj=m+1 to an individual with Xj=m. In other words, for every one-unit increase in Xj, the odds of having the event Yi are multiplied by e^b, adjusting for all other covariates in a multivariate model.

A summary table about interpretation of beta coefficients is provided below:

Table: What does the b Coefficient Mean?

Independent variable type: Continuous
Example variables: height, weight, LDL
The b coefficient in simple logistic regression: The change in the log odds of the dependent variable per 1-unit change in the independent variable.
The b coefficient in multiple logistic regression: The change in the log odds of the dependent variable per 1-unit change in the independent variable, after controlling for the confounding effects of the covariates in the model.

Independent variable type: Categorical (also known as discrete)
Example variables: sex (two subgroups, men and women; this example uses men as the reference group)
The b coefficient in simple logistic regression: The difference in the log odds of the dependent variable for one value of the categorical variable vs. the reference group (for example, between women and the reference group, men).
The b coefficient in multiple logistic regression: The difference in the log odds of the dependent variable for one value of the categorical variable vs. the reference group (for example, between women and the reference group, men), after controlling for the confounding effects of the covariates in the model.

It is easy to transform the b coefficients into a more interpretable format, the odds ratio, as follows:

e^b = odds ratio
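As a minimal sketch of this transformation (written in Python purely for illustration, since it is not one of the tutorial's packages), the snippet below exponentiates a coefficient and its confidence limits. The coefficient and standard error are made-up values, not NHANES output, and the 1.96 critical value is a normal approximation; survey software may use a t critical value based on the design degrees of freedom, giving slightly wider limits.

```python
import math


def odds_ratio_with_ci(b, se, z=1.96):
    """Return (odds ratio, lower limit, upper limit) for a logit coefficient.

    The interval is built on the log-odds scale and then exponentiated.
    z=1.96 is the normal-approximation critical value for a 95% interval.
    """
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)


# Hypothetical coefficient and standard error, for illustration only.
b, se = 0.25, 0.10
odds_ratio, lower, upper = odds_ratio_with_ci(b, se)
print(f"OR = {odds_ratio:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Because the interval is computed on the log scale, it is not symmetric around the odds ratio, and it excludes 1 exactly when the interval for b excludes 0.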

IMPORTANT NOTE

Odds and odds ratios are not the same as risk and relative risks.

Odds and probability are two different ways to express the likelihood of an outcome.

Here are their definitions and some examples.

Table of Differences between Odds and Probability

Odds
Definition: (# of times something happens) / (# of times it does NOT happen)
Example, getting heads in 1 flip of a coin: 1/1 = 1 (or 1:1)
Example, getting a 1 in a single roll of a die: 1/5 = 0.2 (or 1:5)

Probability
Definition: (# of times something happens) / (# of times it could happen)
Example, getting heads in 1 flip of a coin: 1/2 = 0.5 (or 50%)
Example, getting a 1 in a single roll of a die: 1/6 ≈ 0.17 (or about 17%)

Few people think in terms of odds. Many people equate odds with probability and thus equate odds ratios with risk ratios. When the outcome of interest is uncommon (i.e. it occurs less than 10% of the time), such confusion makes little difference, since odds ratios and risk ratios are approximately equal. When the outcome is more common, however, the odds ratio increasingly overstates the risk ratio. So, to avoid confusion, when event rates are high, odds ratios should be converted to risk ratios. (Schwartz LM, Woloshin S, Welch HG. Misunderstandings about the effects of race and sex on physicians’ referrals for cardiac catheterization. N Engl J Med 1999;341:279—83) There are simple methods of conversion for both crude and adjusted data. (Zhang J, Yu KF. What's the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes. JAMA 1998;280:1690-1691. Davies HT, Crombie IK, Tavakoli M. When can odds ratios mislead? BMJ 1998;316:989-991)
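One of the conversion methods cited above (Zhang and Yu, 1998) can be sketched as follows, again in Python only for illustration. The function name and the example values are hypothetical; p0 stands for the probability of the outcome in the reference (unexposed) group. Treat the result as an approximation, particularly when the odds ratio comes from an adjusted model.

```python
def odds_ratio_to_risk_ratio(odds_ratio, p0):
    """Approximate a risk ratio from an odds ratio.

    Uses RR = OR / ((1 - p0) + p0 * OR), where p0 is the probability of
    the outcome in the reference (unexposed) group.
    """
    if not 0.0 <= p0 < 1.0:
        raise ValueError("p0 must be in [0, 1)")
    return odds_ratio / ((1.0 - p0) + p0 * odds_ratio)


# The same odds ratio of 2.0 implies a smaller risk ratio as the outcome
# becomes more common in the reference group (hypothetical values).
for p0 in (0.01, 0.10, 0.40):
    print(p0, round(odds_ratio_to_risk_ratio(2.0, p0), 2))
```

When p0 is small the denominator is close to 1, which is why the odds ratio and risk ratio nearly coincide for uncommon outcomes.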

The following formulas demonstrate how you can go between probability and odds.

Odds = Probability / (1 - Probability)

Probability = Odds / (1 + Odds)
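The two conversions can be checked against the coin and die examples in the table above. This is a small illustrative sketch in Python; the function names are hypothetical.

```python
def probability_to_odds(p):
    """Convert a probability in [0, 1) to odds."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return p / (1.0 - p)


def odds_to_probability(odds):
    """Convert non-negative odds back to a probability."""
    if odds < 0.0:
        raise ValueError("odds must be non-negative")
    return odds / (1.0 + odds)


coin_odds = probability_to_odds(1 / 2)  # heads in one flip of a coin
die_odds = probability_to_odds(1 / 6)   # a 1 in one roll of a die
print(round(coin_odds, 3), round(die_odds, 3))
print(round(odds_to_probability(die_odds), 3))
```

Applying one function and then the other returns the starting value, so nothing is lost by moving between the two scales.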

References:

Logistic Regression Using the SAS System
By Paul D. Allison

Epidemiology
By Leon Gordis

Setting Up a Logistic Regression in NHANES

Simple logistic regression is used for univariate analyses when there is one dependent variable and one independent variable, while a multiple logistic regression model contains one dependent variable and multiple independent variables. To run univariate and multiple logistic regression in SAS-callable SUDAAN, SAS, and Stata, you will need to provide three things:

  • the correct weight,
  • the appropriate procedure, and
  • a model statement.

Determine the appropriate weight for the data used

It is always important to check all the variables in the model and to use the weight for the smallest subsample from which any of those variables comes (the "smallest common denominator"). In the univariate analysis example, the 4-year MEC weight is used, because the hypertension variable is from the MEC examination. In the multivariate analysis example, the 4-year MEC morning fasting subsample weight is used, because the fasting triglycerides variable comes from the morning fasting subsample of the lab component, which is the smallest subsample among all variables in the model.

Examples

Simple logistic regressions for gender, age, cholesterol, and BMI:

Because these analyses use 4 years of data and include variables that come from the household interview and the MEC (e.g. blood pressure, BMI, HDL cholesterol), the MEC 4-year weight (wtmec4yr) is the right one.

Simple logistic regression for fasting triglyceride:

Because this analysis uses 4 years of data and fasting triglycerides were measured only in the morning fasting subsample, the MEC morning fasting subsample 4-year weight (wtsaf4yr) is the right one.

Multiple logistic regression:

Because this analysis uses 4 years of data and includes variables from the household interview, the MEC, and the morning fasting subsample of the MEC, the weight for the smallest group, the morning fasting subsample 4-year weight (wtsaf4yr), is the right one.

  • See the Weighting module for more information on weighting and combining weights.
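The weight-selection rule in the examples above can be sketched as a small lookup, shown here in Python for illustration only. The component labels and the mapping of variables to components are hypothetical, and wtint4yr (a 4-year interview weight) is an assumption that does not appear in this module. The sketch only covers the three nested samples discussed here; other NHANES subsamples have their own weights.

```python
# Components ordered from the largest sample to the smallest subsample.
# Labels are hypothetical; wtmec4yr and wtsaf4yr are named in this tutorial,
# while wtint4yr is assumed for interview-only analyses.
COMPONENT_WEIGHTS = [
    ("interview", "wtint4yr"),
    ("mec", "wtmec4yr"),
    ("fasting", "wtsaf4yr"),
]


def choose_weight(variable_components):
    """Return the weight for the most restrictive component in the model."""
    order = [component for component, _ in COMPONENT_WEIGHTS]
    most_restrictive = max(variable_components.values(), key=order.index)
    return dict(COMPONENT_WEIGHTS)[most_restrictive]


simple_model = {"riagendr": "interview", "hyper": "mec"}
multiple_model = {**simple_model, "hichol": "mec", "bmigrp": "mec", "logtrig": "fasting"}

print(choose_weight(simple_model))    # wtmec4yr
print(choose_weight(multiple_model))  # wtsaf4yr
```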

Determine the appropriate procedure

You can run logistic regression with stand-alone SUDAAN, SAS-callable SUDAAN, SAS Survey procedure, or Stata Survey commands. However, note that each version of SUDAAN, SAS-callable SUDAAN, and SAS Survey procedures has its own unique commands for executing logistic regression analysis. You need to use the correct command for the software that you are using. Please also note that different versions of SAS and SUDAAN use slightly different statements to specify categorical variables and reference groups. Make sure that you are using the correct commands for the version of software on your computer.

If you use

  • the stand-alone version of SUDAAN, the procedure is logistic
  • SAS-callable SUDAAN, the procedure is called rlogist
  • SAS survey procedures, the procedure is surveylogistic

Be sure you are using the correct procedure name, because SAS also has its own procedure named logistic, which is intended for simple random samples and not for complex survey datasets like NHANES. Using logistic in SAS will yield different results from the logistic procedure in stand-alone SUDAAN.

Provide a model statement

Remember that when you run logistic regression analyses, you must provide a model statement to specify the dependent variable and independent variable(s), and you can have only one model statement each time you run a logistic regression analysis.

Task 2a: How to Use SUDAAN Code to Perform Logistic Regression

In this module, you will use simple logistic regression to analyze NHANES data to assess the association between gender (riagendr), the exposure or independent variable, and the likelihood of having hypertension (based on bpxsar, bpxdar, and bpq050a), the outcome or dependent variable, among participants 20 years old and older. You will then use multiple logistic regression to assess the relationship after controlling for selected covariates. The covariates include age (ridageyr), cholesterol (lbxtc), body mass index (bmxbmi) and fasting triglycerides (lbxtr).

Step 1: Create dependent dichotomous variable

For continuous variables, you have a choice of using the variable in its original form (continuous) or changing it into a categorical variable (e.g. based on standard cutoffs, quartiles or common practice). The categorical variables should reflect the underlying distribution of the continuous variable and not create categories where there are only a few observations.

For the dependent variable, you will create a dichotomous variable, hyper, which defines people as having (or not having) hypertension. Specifically, a person is said to have hypertension if their systolic blood pressure (measured in the MEC) is 140 mm Hg or higher, their diastolic blood pressure is 90 mm Hg or higher, or they are taking blood pressure medication. Remember that for logistic regression to work in SUDAAN, this variable needs to be defined as 0 (meaning the outcome did not occur; here, the person does not have hypertension) or 1 (the outcome occurred; here, the person has hypertension). The code to create this variable is below:

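/* Hypertension: elevated measured blood pressure or taking medication */
if (bpxsar >= 140 or bpxdar >= 90 or bpq050a = 1) then hyper = 1;
/* No hypertension: only when both readings are present */
else if (bpxsar ne . and bpxdar ne .) then hyper = 0;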
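The way this recode treats missing values is easy to overlook, so the same logic is sketched below in Python for illustration (it is not part of the tutorial's SAS program). None stands in for a SAS missing value, and the function name is hypothetical.

```python
def hypertension_flag(bpxsar, bpxdar, bpq050a):
    """Return 1 (hypertension), 0 (no hypertension), or None (undetermined).

    A missing reading never counts as elevated, which matches how the
    SAS comparisons behave, and 0 is assigned only when both blood
    pressure readings are present.
    """
    elevated_systolic = bpxsar is not None and bpxsar >= 140
    elevated_diastolic = bpxdar is not None and bpxdar >= 90
    on_medication = bpq050a == 1
    if elevated_systolic or elevated_diastolic or on_medication:
        return 1
    if bpxsar is not None and bpxdar is not None:
        return 0
    return None


print(hypertension_flag(150, 80, None))  # elevated systolic reading
print(hypertension_flag(120, 80, 2))     # normal readings, no medication
print(hypertension_flag(None, 80, 2))    # cannot be classified
```

A person with a missing reading can still be classified as hypertensive through the other reading or through medication use, but is never classified as non-hypertensive.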

Step 2: Create independent categorical variables

In addition to creating the dependent dichotomous variable, this example will also create independent categorical variables (age, hichol, bmigrp) from the continuous age, cholesterol, and BMI variables to use in this analysis.

Code to generate independent categorical variables
Independent variable Code to generate independent categorical variables
Age
if 20 <= ridageyr < 40 then age = 1;
else if 40 <= ridageyr < 60 then age = 2;
else if ridageyr >= 60 then age = 3;
High cholesterol
if (lbxtc >= 240 or bpq100d = 1) then hichol = 1;
else if (lbxtc ne .) then hichol = 0;
BMI category
if 0 <= bmxbmi < 25 then bmigrp = 1;
else if 25 <= bmxbmi < 30 then bmigrp = 2;
else if bmxbmi >= 30 then bmigrp = 3;
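The intended category boundaries can be sketched in Python, for illustration only, under the assumption that the last branch of each recode captures ages 60 and over and BMI values of 30 and over. The function names are hypothetical, and None stands in for a missing value.

```python
def age_group(ridageyr):
    """1 = 20-39 years, 2 = 40-59 years, 3 = 60 and over, None otherwise."""
    if ridageyr is None:
        return None
    if 20 <= ridageyr < 40:
        return 1
    if 40 <= ridageyr < 60:
        return 2
    if ridageyr >= 60:
        return 3
    return None  # under 20: outside the analysis population


def bmi_group(bmxbmi):
    """1 = BMI under 25, 2 = 25 to under 30, 3 = 30 and over."""
    if bmxbmi is None:
        return None
    if 0 <= bmxbmi < 25:
        return 1
    if 25 <= bmxbmi < 30:
        return 2
    if bmxbmi >= 30:
        return 3
    return None


def high_cholesterol(lbxtc, bpq100d):
    """1 if total cholesterol >= 240 or bpq100d equals 1, else 0 when measured."""
    if (lbxtc is not None and lbxtc >= 240) or bpq100d == 1:
        return 1
    if lbxtc is not None:
        return 0
    return None


print(age_group(45), bmi_group(27.3), high_cholesterol(205, 2))
```

Using half-open intervals (lower bound included, upper bound excluded) ensures that every value falls into exactly one category.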

Step 3: Transform highly skewed variables

Because the triglycerides variable (lbxtr) is highly skewed, you will use a natural log transformation to create a new variable to use in this analysis.

logtrig=log(lbxtr);

Step 4: Create eligibility variable

Because not every participant in NHANES responded to every question asked, there may be a different level of item non-response to each variable. To ensure that your analyses are done on the same number of respondents, create a variable called eligible which is 1 for individuals who have a non-blank value for each of the variables used in the analyses, and 0 otherwise. Although this is a univariate analysis using only exam variables, the fasting subsample weight (wtsaf4yr) is included in determining the eligible variable. This is because you will be conducting a multivariate analysis using the triglycerides variable later and will limit the sample to persons included in both analyses. The SAS code defining eligible is:

if hyper ne . and hichol ne . and bmigrp ne . and age ne . and logtrig ne . and wtsaf4yr ne 0 then eligible = 1;
else eligible = 0;
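The complete-case idea behind eligible can be sketched in Python for illustration; this is not the tutorial's SAS code. The record layout (a dictionary per participant) is hypothetical, and None stands in for a missing value. As an added assumption, this sketch also requires the fasting weight to be non-missing, which is what the Stata version of this step later in the module does.

```python
import math


def log_triglycerides(lbxtr):
    """Natural log of triglycerides, or None when missing or non-positive."""
    return math.log(lbxtr) if lbxtr is not None and lbxtr > 0 else None


def eligible(record):
    """Return 1 when every analysis variable is present, else 0."""
    required = ("hyper", "hichol", "bmigrp", "age", "logtrig")
    complete = all(record.get(name) is not None for name in required)
    weight = record.get("wtsaf4yr")
    # Assumption: a missing or zero fasting weight makes the record ineligible.
    has_weight = weight is not None and weight != 0
    return 1 if complete and has_weight else 0


participant = {
    "hyper": 0, "hichol": 1, "bmigrp": 2, "age": 2,
    "logtrig": log_triglycerides(150), "wtsaf4yr": 52000.0,
}
print(eligible(participant))
print(eligible({**participant, "logtrig": None}))
```

Defining one flag up front keeps the simple and multiple models on the same set of respondents, so differences between them reflect adjustment rather than a change in sample.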

Step 5: Set up SUDAAN univariate logistic procedure

This step introduces you to the SUDAAN Univariate Logistic Regression procedure (proc rlogist). You can read the explanations in the summary table below.


IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

SUDAAN Univariate Logistic Procedure
Statements Explanation
proc sort data=analysis_data;
by sdmvstra sdmvpsu;
run;
Use the proc sort procedure to sort the data by strata and primary sampling units (PSU) before running the procedure.
proc rlogist data=analysis_data;
Use the SUDAAN procedure, proc rlogist, to run logistic regression.
nest sdmvstra sdmvpsu;
Use the nest statement with strata and PSU to account for the design effects.
weight wtmec4yr;
Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for four years of data is used.
subpopn eligible=1;

Use the subpopn statement to limit the analysis to the observations included in the final logistic model. Please note that for accurate estimates, it is preferable to use subpopn in SUDAAN to select a subgroup for analysis, rather than to select the study subgroup in the SAS program while preparing the data file.

class riagendr;
Use a class statement for categorical variables in version 9.0 and later. In earlier versions, you need a subgroup and levels statement.
reflevel riagendr=2 ;
Use the reflevel statement to choose your reference group for the categorical variables. By default SUDAAN uses the highest category as the reference group.
model hyper=riagendr;
Use the model statement to specify dependent variable and independent variable(s) in your Logistic Regression model.
test waldf satadjf satadjchi;
Use the test statement to produce statistics and p-values for the Satterthwaite adjusted chi-square (satadjchi), the Satterthwaite adjusted F (satadjf), and the Satterthwaite adjusted degrees of freedom (printed by default). If this statement is omitted, only the nominal degrees of freedom, the Wald F statistic (WALDF), and its corresponding p-value (WALDP) are produced.
rformat riagendr sexfmt. ;
rformat hyper bpfmt. ;
Use the rformat statement to read the SAS formats into SUDAAN.

Step 6: Review SUDAAN univariate logistic regression output

In this step, the SUDAAN output is reviewed.

  • 1,304 respondents have hypertension and 2,515 do not.
  • Men are less likely to have hypertension than women. Their odds of hypertension are 0.89 times the odds of women.
  • Assuming a p-value less than 0.05 indicates statistical significance, note that gender is not significantly associated with hypertension based on the p-value for Satterthwaite χ2 or F test, which gives the overall p-value for gender. The Satterthwaite adjusted F gives the most conservative estimate of the test statistics. The p-value of 0.156 indicates that this relationship is not statistically significant.
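The odds ratio of 0.89 reported above depends on which group is the reference. As a small arithmetic sketch (in Python, for illustration only), switching the reference group flips the sign of the b coefficient and therefore inverts the odds ratio:

```python
import math

or_men_vs_women = 0.89            # odds ratio reported above (women as reference)
b = math.log(or_men_vs_women)     # the corresponding b coefficient
or_women_vs_men = math.exp(-b)    # same model with men as the reference group

print(round(b, 3))
print(round(or_women_vs_men, 2))  # 1.12
```

The p-value and the strength of the association are unchanged; only the direction in which the comparison is expressed differs.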

Step 7: Set up SUDAAN multivariate logistic procedure

The SUDAAN Multivariate Logistic Regression procedure is similar to the univariate procedure explained in the table above. You can follow the steps outlined below to perform a multivariate logistic regression.

SUDAAN Multivariate Logistic Procedure
Statements Explanation
proc sort data=analysis_data;
by sdmvstra sdmvpsu;
run;

Use the SAS procedure, proc sort, to sort the data by strata and primary sampling units (PSU) before running the procedure.

proc rlogist data=analysis_data;
Use the SUDAAN procedure, proc rlogist, to run logistic regression.
nest sdmvstra sdmvpsu;
Use the nest statement with strata and primary sampling unit to account for design effects.
weight WTSAF4YR;
Use the fasting subsample weight because the log of fasting triglycerides variable comes from a subsample of the lab data file. Not all respondents were tested on triglycerides.
subpopn eligible=1 ;

Use the subpopn statement to limit the sample to the observations included in the final logistic model.

Please note that for accurate estimates, it is preferable to use subpopn in SUDAAN to select a subgroup for analysis, rather than select the study subgroup in the SAS program while preparing the dataset.

class age riagendr hichol bmigrp;

Use the class statement to specify all categorical variables in the model.

Use a class statement for categorical variables in version 9.0 and later. In earlier versions, you need a subgroup and levels statement.

reflevel age=2 2 ;

Use the reflevel statement to choose your reference group for the categorical variables. By default, SUDAAN uses the highest category as the reference group.

model hyper=age riagendr hichol bmigrp logtrig;
Use the model statement to specify dependent variable and all independent variable(s) in your Logistic Regression model.
test waldf satadjf satadjchi;
Use the test statement to produce statistics and p-values for the Satterthwaite adjusted chi-square (satadjchi), the Satterthwaite adjusted F (satadjf), and the Satterthwaite adjusted degrees of freedom (printed by default). If this statement is omitted, only the nominal degrees of freedom, the Wald F statistic (WALDF), and its corresponding p-value (WALDP) are produced.

Step 8: Review SUDAAN multivariate logistic procedure output

This step reviews the SUDAAN multivariate logistic procedure output.

  • 1,304 respondents have hypertension and 2,515 do not.
  • All covariates are statistically significant at p-value<0.05, except for gender. The Satterthwaite adjusted F gives the most conservative estimate of the test statistics.
  • Odds ratios should be interpreted as adjusted odds ratios because there are multiple covariates in the model. For example, the adjusted odds ratio for hypertension is 1.29 (95% C.I. 1.03-1.61) for each one-unit increase in the log of triglycerides.
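A one-unit increase in the natural log of triglycerides corresponds to multiplying triglycerides by e (about 2.72), which is not an intuitive comparison. The Python sketch below, for illustration only, re-expresses the reported odds ratio for other fold changes, assuming logtrig is the natural log of lbxtr as defined earlier. The function name is hypothetical.

```python
import math

or_per_log_unit = 1.29          # adjusted odds ratio reported above
b = math.log(or_per_log_unit)   # coefficient for logtrig


def odds_ratio_for_fold_change(fold, coefficient=b):
    """Odds ratio when triglycerides are multiplied by `fold`."""
    return math.exp(coefficient * math.log(fold))


print(round(odds_ratio_for_fold_change(2), 2))       # doubling triglycerides
print(round(odds_ratio_for_fold_change(math.e), 2))  # one log unit: 1.29
```

The same rescaling applies to the confidence limits, because each limit is also an exponentiated multiple of the coefficient scale.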

Task 2b: How to Use SAS 9.2 Survey Code to Perform Logistic Regression

In this module, you will use NHANES data to assess the association between several risk factors and the likelihood of having hypertension for participants 20 years and older. The dependent variable Y is hypertension, and the independent variables Xj, or covariates, are age, gender, high cholesterol, body mass index, and fasting triglycerides. In this task, you will only be reviewing the Multivariate Logistic Procedure.

Step 1: Create Variable to Subset Population

In order to subset the data in SAS Survey Procedures, you will need to create a variable for the population of interest. You should not use a where clause or by-group processing in order to analyze a subpopulation with the SAS Survey Procedures.

In this example, the sel variable is set to 1 if the sample person is 20 years or older, and 2 if the sample person is younger than 20 years. Then this variable is used in the domain statement to specify the population of interest (those 20 years and older).

if ridageyr GE 20 then sel = 1;
else sel = 2;

Step 2: Review SAS Multivariate Logistic Procedure

This step introduces you to the SAS multivariate survey Logistic Regression procedure, proc surveylogistic. There is a summary table of the SAS program below.

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

SAS Multivariate Logistic Procedure
Statements Explanation
PROC SURVEYLOGISTIC DATA = Analysis_Data nomcar;
Use the proc surveylogistic procedure to perform multiple logistic regression to assess the association between hypertension and multiple risk factors, including: age, gender, high cholesterol, body mass index, and fasting triglycerides. Use the nomcar option to read all observations.
STRATUM sdmvstra;
Use the stratum statement to specify strata to account for design effects of stratification.
CLUSTER sdmvpsu;
Use the cluster statement to specify primary sampling unit (PSU) to account for design effects of clustering.
WEIGHT wtsaf4yr;
Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the 4-year fasting weight variable is used.
DOMAIN sel;
Use the domain statement to specify the subpopulation of interest.
CLASS age (PARAM=REF REF= '40-59 yrs')
riagendr (PARAM=REF REF='Female')
hichol (PARAM=REF REF='high cholesterol')
bmigrp (PARAM=REF REF='25<=BMI<30');

Use the class statement to specify all categorical variables in the model.

Use the param and ref options to choose your reference group for the categorical variables.

MODEL hyper (desc)=age riagendr hichol bmigrp logtrig/ vadjust=none;
Use the model statement to specify the dependent variable and all independent variable(s) in your Logistic Regression model. The vadjust option specifies whether or not to use variance adjustment.
format age agefmt. riagendr sexfmt. hichol chfmt. bmigrp bmifmt. ; run ;
Use the format statement to read the SAS formats for all formatted variables.

IMPORTANT NOTE

The SAS Survey Procedure, proc surveylogistic, produces the Wald statistic and its p value. It does not produce the Satterthwaite χ2 or the Satterthwaite F and the corresponding p values recommended for NHANES analyses. For this reason, it is recommended that you use proc rlogist in SUDAAN for logistic regression.

Step 3: Review SAS Multivariate Logistic Regression Output

In this step, the SAS output is reviewed. You can compare your results with the sample output, which you can download from the Sample Code and Datasets page. Highlights in the output include:

  • 1,304 respondents have hypertension and 2,515 do not.
  • The beta coefficients and odds ratio point estimates are identical to the SUDAAN estimates.

Task 2c: How to Use Stata Code to Perform Logistic Regression

In this module, you will use simple logistic regression to analyze NHANES data to assess the association between gender (riagendr) — the exposure or independent variable — and the likelihood of having hypertension (based on bpxsar, bpxdar) — the outcome or dependent variable, among participants 20 years old and older. You will then use multiple logistic regression to assess the relationship after controlling for selected covariates. The covariates include age (ridageyr), cholesterol (lbxtc), body mass index (bmxbmi) and fasting triglycerides (lbxtr).

IMPORTANT NOTE

There are several things you should be aware of while analyzing NHANES data with Stata. Please see the Stata Tips page to review them before continuing.

Step 1: Use svyset to define survey design variables

Remember that you need to define the SVYSET before using the SVY series of commands. The general format of this command is below:

svyset [w=weightvar], psu(psuvar) strata(stratavar) vce(linearized)

To define the survey design variables for your hypertension analysis, use the weight variable for four years of MEC data (wtmec4yr), the PSU variable (sdmvpsu), and the strata variable (sdmvstra). The vce option specifies the method for calculating the variance; the default is "linearized", which is Taylor linearization. Here is the svyset command for four years of MEC data:

svyset [w= wtmec4yr], psu(sdmvpsu) strata(sdmvstra) vce(linearized)

Step 2: Create dependent dichotomous variable

For continuous variables, you have a choice of using the variable in its original form (continuous) or changing it into a categorical variable (e.g. based on standard cutoffs, quartiles or common practice). The categorical variables should reflect the underlying distribution of the continuous variable and not create categories where there are only a few observations.

For the dependent variable, you will create a dichotomous variable, hyper, which defines people as having (or not having) hypertension. Specifically, a person is said to have hypertension if their systolic blood pressure (measured in the MEC) is 140 mm Hg or higher, their diastolic blood pressure is 90 mm Hg or higher, or they are taking blood pressure medication. Remember that for logistic regression to work in Stata, this variable needs to be defined as 0 (meaning the outcome did not occur; here, the person does not have hypertension) or 1 (the outcome occurred; here, the person has hypertension). The code to create this variable is below:

gen hyper=1 if (bpxsar>=140 & bpxsar<. | bpxdar>=90 & bpxdar<.) | bpq050a==1
replace hyper=0 if hyper !=1 & (bpxsar !=. & bpxdar !=.)

Step 3: Create independent categorical variables

In addition to creating the dichotomous dependent variable, this example will also create independent categorical variables (age, hichol, bmigrp) from the continuous age, cholesterol, and BMI variables to use in this analysis.

Code to generate independent categorical variables
Independent variable Code to generate independent categorical variables
Age
gen age=1 if ridageyr >=20 & ridageyr <40
replace age=2 if ridageyr >=40 & ridageyr <60
replace age=3 if ridageyr >=60 & ridageyr <.
High cholesterol
gen hichol =1 if lbxtc >=240 & lbxtc<. | bpq100d==1
replace hichol =0 if hichol ~=1 & lbxtc !=.
BMI category
gen bmigrp=1 if bmxbmi<25
replace bmigrp=2 if bmxbmi>=25 & bmxbmi <30
replace bmigrp=3 if bmxbmi>=30 & bmxbmi <.

Step 4: Transform highly skewed variables

Because the triglycerides variable (lbxtr) is highly skewed, you will use a natural log transformation to create a new variable to use in this analysis.

gen logtrig = log(lbxtr)

Step 5: Choose reference groups for categorical variables

For all categorical variables, you need to decide which category to use as the reference group. If you do not specify the reference group options, Stata will choose the lowest numbered group by default. You can use the following general command to tell Stata the reference group:

char var [omit] reference_group_value

For your analyses, use the following commands to specify the following reference groups:

Code to specify reference groups
Variable Code to specify reference group Reference group
Gender
char riagendr [omit] 2
Women
Age
char age [omit] 2
40-59 year olds
BMI
char bmigrp [omit] 2
overweight (BMI 25-29)
Cholesterol
char hichol [omit] 1
high cholesterol (hichol=1, matching the reference group used in the SAS example)

Step 6: Create eligibility variable

Because not every participant in NHANES responded to every question asked, there may be a different level of item non-response to each variable. To ensure that your analyses are done on the same number of respondents, create a variable called eligible which is 1 for individuals who have a non-blank value for each of the variables used in the analyses, and 0 otherwise. Although this is a univariate analysis using only exam variables, the fasting subsample weight (wtsaf4yr) is included in determining the eligible variable. This is because you will be conducting a multivariate analysis using the triglycerides variable later and will limit the sample to persons included in both analyses. The Stata code defining eligible is:

gen eligible=1 if wtsaf4yr!=. & hyper!=. & riagendr!=. & age!=. & hichol!=. & bmigrp!=. & logtrig!=. & wtsaf4yr!=0

Step 7: Create simple logistic regression model to understand relationships

The association between the dependent (or outcome) and independent (or exposure) variables is expressed using the svy:logit command. The dependent variable must be a dichotomous variable and the independent variables may be either discrete, ordinal, or continuous.

The general form of the command to get beta coefficients is:

xi: svy, subpop(condition): logit depvar i.indvar

To get odds ratios with the logit command, use the or option:

xi: svy, subpop(condition): logit depvar i.indvar, or

Odds ratios are automatically produced by the logistic command:

xi: svy, subpop(condition): logistic depvar i.indvar
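The logit and logistic commands fit the same model and differ only in how the estimates are displayed. For a single exposure variable x, with p the probability of the outcome, the standard relationship between the beta coefficient and the odds ratio is (the symbols here are introduced for illustration):

```latex
\operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x,
\qquad
\mathrm{OR} = e^{\beta_1}
```

An odds ratio of 1 therefore corresponds to a coefficient of 0, that is, no association.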

An example command analyzing the relationship between gender and hypertension using the logistic command is shown below:

xi: svy, subpop(if eligible==1): logistic hyper i.riagendr

In this example, the output for the logistic command is:

[Image: output for the logistic command]

Highlights in the output include:

  • Men are less likely to have hypertension than women. Their odds of hypertension are 0.89 times the odds of women.
  • Using p < 0.05 as the threshold for statistical significance, gender is not significantly associated with hypertension (p = 0.156).

Step 8: Specify multiple logistic regression model

Multiple logistic regression uses the same command structure but now includes other independent variables. If you want to create indicator variables for categorical variables, use the xi prefix command. The general structure remains the same:

xi: svy, subpop(condition): logistic depvar indvar i.indvar

For this example, you will use these commands to analyze the effects of gender, age, high cholesterol, BMI, and triglycerides on hypertension. Note that the svyset command uses the fasting subsample weight, wtsaf4yr, because this analysis includes the triglycerides variable, which was collected only on a subsample of the survey.

svyset sdmvpsu [pweight=wtsaf4yr], strata(sdmvstra) vce(linearized)
xi: svy, subpop(if eligible==1): logistic hyper i.riagendr i.age i.hichol i.bmigrp logtrig
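In Stata 11 and later, factor-variable notation offers an alternative to the xi prefix and the char settings: ib2.age marks category 2 as the reference group directly in the model statement. The sketch below runs on simulated stand-in data. The seed, sample size, weights, and outcome model are all made up for illustration, and only the variable names follow the tutorial, so the estimates themselves are meaningless.

```stata
* Simulated stand-in data: values are hypothetical, names follow the tutorial
clear
set seed 2024
set obs 800
gen sdmvstra = ceil(_n/200)                // 4 strata
gen sdmvpsu  = mod(_n, 2) + 1              // 2 PSUs per stratum
gen wtsaf4yr = 1000 + 500*runiform()       // made-up sampling weights
gen riagendr = 1 + (runiform() < 0.5)      // 1 = men, 2 = women
gen age      = 1 + floor(3*runiform())     // 1 = 20-39, 2 = 40-59, 3 = 60+
gen hyper    = runiform() < invlogit(-2.5 + 0.9*age)
gen eligible = 1

svyset sdmvpsu [pweight=wtsaf4yr], strata(sdmvstra) vce(linearized)

* ib2. sets category 2 as the reference group, replacing char and xi
svy, subpop(if eligible==1): logistic hyper ib2.riagendr ib2.age
```

With this notation the reference group is visible in the model statement itself rather than stored as a variable characteristic set earlier in the session.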

In this example, this output is:

[Image: output of the effects of gender, age, high cholesterol, BMI, and triglycerides on hypertension]

Highlights from the output include:

  • All covariates are statistically significant at p-value<0.05, except for gender.
  • Odds ratios should be interpreted as adjusted odds ratios because there are multiple covariates in the model. The adjusted odds ratio for hypertension is 1.29 (95% CI 1.03-1.61) for each one-unit increase in the log of triglycerides.
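Because the model uses the natural log of triglycerides, a one-unit increase in logtrig corresponds to multiplying the triglyceride level by e (about 2.72). The odds ratio for any other fold change c follows from the same coefficient. As a worked example using the point estimate quoted above (an illustration, not part of the Stata output), the odds ratio associated with a doubling of triglycerides is:

```latex
\mathrm{OR}_{c\text{-fold}} = e^{\beta \ln c} = \left(\mathrm{OR}_{\text{per unit}}\right)^{\ln c},
\qquad
\mathrm{OR}_{2\text{-fold}} = 1.29^{\ln 2} \approx 1.29^{0.693} \approx 1.19
```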

Step 9: Compare results of simple and multiple logistic regressions

To understand how much adjustment matters, it is helpful to compare the odds ratios from the simple and multiple regression models. The following tables summarize the results.

Table Comparing Differences between Crude Analysis (Simple Logistic Regression) and Adjusted Analysis (Multiple Logistic Regression) - Sex

| Sex | Crude Analysis % with hypertension | Crude Analysis Odds Ratio* (95% CI) | Adjusted Analysis Odds Ratio* (95% CI) | Crude Analysis p value | Adjusted Analysis p value |
|---|---|---|---|---|---|
| men | 27% | 0.89 (0.75 - 1.05) | 0.94 (0.76 - 1.16) | 0.16 | 0.55 |
| women | 30% | Reference Group | Reference Group | Reference Group | Reference Group |

Table Comparing Differences between Crude Analysis (Simple Logistic Regression) and Adjusted Analysis (Multiple Logistic Regression) - Age

| Age (years) | Crude Analysis % with hypertension | Crude Analysis Odds Ratio* (95% CI) | Adjusted Analysis Odds Ratio* (95% CI) | Crude Analysis p value | Adjusted Analysis p value |
|---|---|---|---|---|---|
| 20-39 | 9% | 0.25 (0.18 - 0.34) | 0.28 (0.21 - 0.38) | <0.001 | <0.001 |
| 40-59 | 28% | Reference Group | Reference Group | Reference Group | Reference Group |
| 60+ | 66% | 4.87 (3.76 - 6.30) | 5.27 (4.00 - 6.94) | <0.001 | <0.001 |

Table Comparing Differences between Crude Analysis (Simple Logistic Regression) and Adjusted Analysis (Multiple Logistic Regression) - BMI

| BMI | Crude Analysis % with hypertension | Crude Analysis Odds Ratio* (95% CI) | Adjusted Analysis Odds Ratio* (95% CI) | Crude Analysis p value | Adjusted Analysis p value |
|---|---|---|---|---|---|
| underweight/normal | 18% | 0.58 (0.46 - 0.72) | 0.67 (0.51 - 0.87) | <0.001 | 0.004 |
| overweight | 28% | Reference Group | Reference Group | Reference Group | Reference Group |
| obese | 42% | 1.85 (1.52 - 2.25) | 2.18 (1.70 - 2.80) | <0.001 | <0.001 |

Table Comparing Differences between Crude Analysis (Simple Logistic Regression) and Adjusted Analysis (Multiple Logistic Regression) - Cholesterol

| Cholesterol | Crude Analysis % with hypertension | Crude Analysis Odds Ratio* (95% CI) | Adjusted Analysis Odds Ratio* (95% CI) | Crude Analysis p value | Adjusted Analysis p value |
|---|---|---|---|---|---|
| High | 43% | Reference Group | Reference Group | Reference Group | Reference Group |
| Low/Normal | 24% | 0.41 (0.34 - 0.49) | 0.78 (0.62 - 0.97) | <0.001 | 0.028 |

Table Comparing Differences between Crude Analysis (Simple Logistic Regression) and Adjusted Analysis (Multiple Logistic Regression) - Triglycerides

| Triglycerides | Crude Analysis % with hypertension | Crude Analysis Odds Ratio* (95% CI) | Adjusted Analysis Odds Ratio* (95% CI) | Crude Analysis p value | Adjusted Analysis p value |
|---|---|---|---|---|---|
| Triglycerides | N/A | 1.98 (1.65 - 2.37) | 1.28 (1.03 - 1.61) | <0.001 | 0.029 |

Stata Multivariate Logistic Procedure
Statements Explanation
use "C:\Stata\tutorial\analysis_data.dta", clear

Use the use command to load the Stata-format dataset. Use the clear option to replace any data in memory.

svyset sdmvpsu [pweight=wtsaf4yr], strata(sdmvstra) vce(linearized)

Use the svyset command to declare the survey design for the dataset. Specify the PSU variable sdmvpsu. Use the [pweight=] option to account for the unequal probability of sampling and non-response. In this example, the MEC fasting weight for four years of data (wtsaf4yr) is used. Use the strata() option to specify the stratum identifier (sdmvstra). Use the vce() option to specify the variance estimation method (linearized, for Taylor linearization).

char age[omit] 2
char riagendr[omit] 2
char bmigrp[omit] 2
char hichol[omit] 1

Use these commands to choose the reference group for each categorical variable. For example, the 2nd age category (age 40-59) is chosen as the reference group.

If you do not specify the reference group options, Stata will choose the lowest numbered group by default.

xi: svy, subpop(if ridageyr >=20) vce(linearized): logit hyper i.age i.riagendr i.hichol i.bmigrp logtrig

Use the xi command to expand terms containing categorical variables (denoted i.varname) into indicator (also called dummy) variable sets. Use the svy: logit command to perform multiple logistic regression to assess the association between hypertension and multiple risk factors, including age, gender, high cholesterol, body mass index, and fasting triglycerides. Use the subpop() option to select a subpopulation for analysis, rather than selecting the study population in the Stata program while preparing the data file. (Note: without the or option, which is used in the next statement, estimates are reported as coefficients.)

xi: svy, subpop(if ridageyr >=20) vce(linearized): logit hyper i.age i.riagendr i.hichol i.bmigrp logtrig, or

Use the xi command to expand terms containing categorical variables into indicator (also called dummy) variable sets. Use the svy: logit command to perform multiple logistic regression to assess the association between hypertension and multiple risk factors, including age, gender, high cholesterol, body mass index, and fasting triglycerides.

Use the subpop( ) option to select a subpopulation for analysis, rather than select the study population in the Stata program while preparing the data file. Use the or option to produce estimates as odds ratios.

test
*******************************
test _Iage_1 _Iage_3, nosvyadjust
test _Iriagendr_1, nosvyadjust
test _Ihichol_0, nosvyadjust
test _Ibmigrp_1 _Ibmigrp_3, nosvyadjust
test logtrig, nosvyadjust

Use the test postestimation command to produce the Wald F statistic and the corresponding p-value. Use the nosvyadjust option to produce the unadjusted Wald F. In the example, the test command is used to test all coefficients together and then each set of coefficients separately.
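For reference, the two versions of the statistic are related as follows, based on Stata's documentation for test after survey estimation. Here W is the Wald chi-squared statistic, k is the number of coefficients tested, and d is the design degrees of freedom (number of PSUs minus number of strata); these symbols are introduced here and do not appear in the tutorial.

```latex
F_{\text{unadjusted}} = \frac{W}{k} \sim F(k,\, d),
\qquad
F_{\text{adjusted}} = \frac{d - k + 1}{k\,d}\, W \sim F(k,\, d - k + 1)
```

When a single coefficient is tested (k = 1), the two statistics coincide.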

WARNING

The Stata command, svy:logit, produces the adjusted and unadjusted Wald statistic and its p value. It does not produce the Satterthwaite χ2 or the Satterthwaite F and the corresponding p values recommended for NHANES analyses.

Step 10: Post-estimation

You may want to know whether comparisons other than those against the reference categories you specified are significant. In that case, you can use a post-estimation command (that is, a command that can be run only after the logit model command). If you want the unadjusted Wald F, the general form is:

test vargroup, nosvyadjust

This example uses the command to test whether the likelihood of hypertension in the youngest age group differs significantly from that in the oldest age group:

test _Iage_1 = _Iage_3, nosvyadjust

The results for this example are:

F(1, 29) = 443.30; Prob > F = 0.0000
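The hypothesis being tested is that the coefficients for the youngest and oldest age groups are equal, which is equivalent to an odds ratio of 1 between those two groups. As a rough illustration, the point estimate of that odds ratio can be read off the adjusted odds ratios in the Step 9 table (0.28 and 5.27, each relative to ages 40-59). This is a back-of-the-envelope calculation under the assumption that the test follows the same adjusted model; it is not output from the test command.

```latex
H_0:\ \beta_{\text{age}_1} = \beta_{\text{age}_3}
\quad\Longleftrightarrow\quad
\frac{\mathrm{OR}_{20\text{-}39}}{\mathrm{OR}_{60+}} = e^{\beta_{\text{age}_1} - \beta_{\text{age}_3}} = 1,
\qquad
\text{point estimate: } \frac{0.28}{5.27} \approx 0.053
```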

Differences Between SUDAAN and SAS Survey Procedures Logistic Regression Output

If you ran both the SAS Survey and SUDAAN programs (or reviewed the output provided on the Sample Code and Datasets page), you may have noticed slight differences in the output. These differences can be caused by missing data in any paired PSU or by how each software program handles degrees of freedom.

  • Both programs calculate that 1,304 respondents have hypertension and 2,515 do not.
  • The beta coefficient and odds ratio estimates are identical.
  • The variance estimates and standard errors are identical if there are no missing data in any paired PSUs (which was the case in this example). They will be different if any one of the paired PSUs contains missing data, as SAS and SUDAAN handle stratum contribution from the missing cells differently.
  • The confidence intervals are slightly different because SAS and SUDAAN handle degrees of freedom differently.