
Module 6: Hypothesis Testing

The NHANES Tutorials are currently being reviewed and revised, and are subject to change. Specialized tutorials (e.g. Dietary, etc.) will be included in the future.

The t-test and chi-square statistics are used to test statistical hypotheses about population parameters. This module will demonstrate the use of these statistics in NHANES data analysis.

Using the t-test Statistic

The t-test is used to test the null hypothesis that two population means or proportions, θ1 and θ2, are equal or, equivalently, that the difference between two population means or proportions is zero. To test this hypothesis, assuming the covariance is small, as is the case with NHANES data, the following formula is used:

(1) $$t = \frac{\hat{\theta}_1 - \hat{\theta}_2}{\sqrt{\widehat{SE}(\hat{\theta}_1)^2 + \widehat{SE}(\hat{\theta}_2)^2}}$$

where,

$\hat{\theta}_1$ is an estimate of θ1 based on a probability sample,

$\widehat{SE}(\hat{\theta}_1)$ is an estimate of the standard error of $\hat{\theta}_1$,

$\hat{\theta}_2$ is an estimate of θ2,

and $\widehat{SE}(\hat{\theta}_2)$ is an estimate of the standard error of $\hat{\theta}_2$.

In instances where the t statistic is based on a small number of independent pieces of information (i.e., a small number of degrees of freedom [<30]), the statistic given in equation 1 follows a Student's t distribution with n degrees of freedom, which is centered at 0. In the NHANES 1999-2002 sample, the degrees of freedom depend on the number of first-stage units, or PSUs, containing observations and are defined as the number of PSUs minus the number of strata. (See Sample Design module for more information.)

The equality of means is usually tested at the .05 level of significance.
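As a rough illustration of equation 1 and of the degrees-of-freedom rule described above, the following Python sketch computes the t statistic from two estimates and their standard errors. The function names, the estimates, the standard errors, and the counts of 30 PSUs and 15 strata are all hypothetical values chosen for illustration; they are not NHANES results.

```python
import math

def t_statistic(est1, se1, est2, se2):
    """Difference of two estimates divided by the standard error of the
    difference, assuming the covariance between the estimates is negligible."""
    return (est1 - est2) / math.sqrt(se1 ** 2 + se2 ** 2)

def design_df(n_psus, n_strata):
    """Degrees of freedom: PSUs containing observations minus strata."""
    return n_psus - n_strata

# Hypothetical inputs (not NHANES output).
t = t_statistic(124.0, 0.6, 122.0, 0.5)
df = design_df(30, 15)
print(round(t, 2), df)
```

In practice the standard errors must come from a design-based variance estimator, as in the SUDAAN, SAS, and Stata tasks that follow.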

References

Cochran, WG. Sampling Techniques. John Wiley & Sons. 1977.

Lohr SL. Sampling: Design and Analysis. Duxbury Press. Pacific Grove 1999.

Task 1a: How to Set Up a t-test in NHANES Using SUDAAN

In this task, you will use SUDAAN to calculate a t-statistic and assess whether the mean systolic blood pressures (SBP) in males and females age 20 years and older are statistically different.

Step 1: Set Up SUDAAN to Produce Means

Follow the steps in the summary table below to produce the mean SBP using the SUDAAN procedure proc descript.

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

SUDAAN proc descript Procedure for Means
Statements Explanation
proc sort data =analysis_data;
by sdmvstra sdmvpsu;
run ;

Use the SAS procedure, proc sort, to sort the data by strata (sdmvstra) and PSU (sdmvpsu).

proc descript data=analysis_data design=wr;

Use the proc descript procedure to generate means and specify the sample design using the design option WR (with replacement).

nest sdmvstra sdmvpsu;

Use the nest statement with strata and PSU to account for the design effects.

weight wtmec4yr;

Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for 4 years of data (wtmec4yr) is used.

subpopn ridageyr >= 20 ;

Use the subpopn statement to select those 20 years and older.

Because only those 20 years and older are of interest in this example, use the subpopn statement to select this subgroup. Please note that for accurate estimates of the standard error, it is preferable to use subpopn in SUDAAN to select a subgroup for analysis, rather than select the study subgroup in SAS when preparing the data file. (See Section 5.4 of Korn and Graubard Analysis of Data from Health Surveys, pp 207-211.)

class riagendr/NoFREQ;

Use a class statement for categorical variables in version 9.0. In earlier versions, you need a subgroup and levels statement. Use the nofreq option to suppress frequencies.

var bpxsar;

Use the var statement to choose the continuous variable, systolic blood pressure (bpxsar).

print nsum mean semean/style=nchs;

Use the print statement to obtain the N (nsum), mean (mean) and standard error of the mean (semean) for the t-test.

rformat riagendr sexfmt. ;

Use the rformat statement to read the SAS formats into SUDAAN.

rtitle "Mean systolic blood pressure: NHANES 1999-2002" ;
run ;

Use the rtitle statement to title the output.

Step 2: Review SUDAAN Means Output

  • 9,056 respondents had information on systolic blood pressure (SBP).
  • The results indicate the mean SBP was 124 for males and 122 for females.

Step 3: Perform t-test to Test for Significance

A t-test is used to test whether the difference in mean SBP between males and females obtained in the previous step is statistically significant.

Request the t-test from the SUDAAN procedure proc descript and follow the steps in the summary table below.

IMPORTANT NOTE

Note that this program and the previous program to produce means in Step 1 are identical up to the var statement.


SUDAAN Procedure for Significance Test
Statements Explanation
proc sort data =analysis_data;
by sdmvstra sdmvpsu;
run ;

Use the SAS procedure, proc sort, to sort the data by strata (sdmvstra) and PSU (sdmvpsu).

proc descript
data=analysis_data design=wr;

Use the proc descript procedure to generate means and specify the sample design using the design option WR (with replacement).

nest sdmvstra sdmvpsu;

Use the nest statement with strata and PSU to account for the design effects.

weight wtmec4yr;

Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for 4 years of data (wtmec4yr) is used.

subpopn ridageyr >= 20 ;

Use the subpopn statement to select those 20 years and older.

Because only those 20 years and older are of interest in this example, use the subpopn statement to select this subgroup. Please note that for accurate estimates of the standard error, it is preferable to use subpopn in SUDAAN to select a subgroup for analysis, rather than select the study subgroup in the SAS program while preparing the data file. (See Section 5.4 of Korn and Graubard Analysis of Data from Health Surveys, pp 207-211.)

class riagendr/NoFREQ;

Use a class statement for categorical variables in version 9.0. In earlier versions, you need a subgroup and levels statement. Use the nofreq option to suppress frequencies.

var bpxsar;

Use the var statement to choose the continuous variable, systolic blood pressure.

contrast riagendr = (1 -1) / name = "Males vs. Females" ;

Use the contrast statement to test the hypothesis that the difference equals 0, that is, that the mean SBP for males equals the mean SBP for females.

print nsum t_mean p_mean/style=nchs;

Use the print statement to obtain the N (nsum), t-test, and p-value for the t-test.

rformat riagendr sexfmt. ;

Use the rformat statement to read the SAS formats into SUDAAN.

rtitle "Significance test for difference between mean systolic blood pressure for males and females" ;
rtitle2 "NHANES 1999-2002" ;
run ;

Use the rtitle statement to title the output.

Step 4: Review SUDAAN t-test Output

  • 9,056 respondents had information on systolic blood pressure, and the test has 29 degrees of freedom.
  • To test the hypothesis that the difference between the two means is zero, the t-statistic with 29 degrees of freedom is computed as 2.64. The p-value is 0.0132, which is the probability, if the null hypothesis is true, of obtaining a t-statistic whose absolute value is greater than or equal to 2.64.
  • Therefore, the null hypothesis is rejected at the 0.05 level.
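
To see where a p-value like the one reported above comes from, the two-sided tail probability of a Student's t distribution can be approximated numerically. The sketch below uses only the Python standard library and Simpson's rule; the helper names t_pdf and two_sided_p are illustrative, and statistical packages use more accurate routines than this.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Approximate P(|T| >= |t|) by integrating the density over [0, |t|]
    with Simpson's rule (steps must be even)."""
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3      # P(0 <= T <= |t|)
    return 1 - 2 * central_half

p = two_sided_p(2.64, 29)
print(p < 0.05)  # True means the null hypothesis is rejected at the 0.05 level
```

Because the published t-statistic is rounded to 2.64, the value computed this way may differ slightly in the last digits from the p-value printed by SUDAAN.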

Task 1b: How to Set Up a t-test in NHANES Using SAS 9.2 Survey Procedures

In this task, you will use SAS Survey Procedures to calculate a t-statistic and assess whether the mean systolic blood pressures (SBP) in males and females age 20 years and older are statistically different.

Step 1: Create Variable to Subset Population

In order to subset the data in SAS Survey Procedures, you will need to create a variable for the population of interest. In this example, the sel variable is set to 1 if the sample person is 20 years or older, and 2 if the sample person is younger than 20 years. Then this variable is used in the domain statement to specify the population of interest (those 20 years and older).

if ridageyr GE 20 then sel = 1;
else sel = 2;
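
The point of flagging rather than deleting records can be sketched in a few lines of Python. The records below are hypothetical (stratum, PSU, age) triples, not NHANES data, and the helper names are made up for illustration. The sketch assumes no missing ages; in SAS, a missing age would compare as less than 20 and receive sel = 2.

```python
# Hypothetical sample persons: (stratum, psu, age in years).
records = [
    (1, 1, 25), (1, 1, 40),
    (1, 2, 12), (1, 2, 15),
    (2, 1, 30),
    (2, 2, 50), (2, 2, 8),
]

def add_sel(rows):
    """Append sel: 1 if age >= 20, otherwise 2 (mirrors the statements above)."""
    return [(s, p, age, 1 if age >= 20 else 2) for s, p, age in rows]

def psu_count(rows):
    """Number of distinct stratum/PSU combinations present in the rows."""
    return len({(r[0], r[1]) for r in rows})

flagged = add_sel(records)                   # every record is kept
subset = [r for r in flagged if r[3] == 1]   # what deleting records would leave

print(psu_count(flagged), psu_count(subset))
```

In this toy example one PSU contains no adults, so deleting the younger records would hide that PSU from the software, while the flagged file still carries the full design.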

Step 2: Set Up SAS Survey Procedures to Produce Means

Follow the explanations in the summary table below to produce the mean SBP using the SAS Survey procedure proc surveymeans.

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

SAS proc surveymeans Procedure for Means
Statements Explanation
proc surveymeans data=analysis_data nobs mean stderr;

Use the SAS Survey procedure, proc surveymeans, to count the number of observations (nobs) and calculate means (mean) and standard errors (stderr), and specify the dataset (analysis_Data).

strata sdmvstra;

Use the strata statement to specify the strata (sdmvstra) and account for design effects of stratification.

cluster sdmvpsu;

Use the cluster statement to specify PSU (sdmvpsu) to account for design effects of clustering.

class riagendr;

Use the class statement to specify the discrete variables that define the subpopulations of interest. In this example, the subpopulations of interest are defined by gender (riagendr).

weight wtmec4yr;

Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for 4 years of data (wtmec4yr) is used.

domain sel sel*riagendr;

Use the domain statement to select those 20 years and older (sel) by gender (riagendr).

IMPORTANT NOTE

When using proc surveymeans, use a domain statement to select the population of interest. Do not use a where or by-group statement to analyze subpopulations with the SAS Survey Procedures.

var bpxsar;

Use the var statement to indicate variable(s) for which descriptive measures are requested. In this example, the systolic blood pressure variable (bpxsar) is used.

ods output domain(match_all)=domain;

Use the ods statement to output the datasets of estimates for the domains listed on the domain statement above. This statement will output one dataset for each domain specification in the domain statement above (domain for sel; domain1 for sel*riagendr).

data all;
set domain domain1;
if sel= 1 ;

Use the data step to create a temporary SAS dataset (all) that appends the two datasets created in the previous step and keeps only the rows for those 20 years and older (sel = 1).

proc print ;
var riagendr n mean stderr;
title "Mean systolic blood pressure: NHANES 1999-2002" ;
run ;

Use the proc print procedure to print the number of observations, the mean, and standard error of the mean in a printer-friendly format.

Step 3: Review SAS Means Output

  • 9,056 respondents had information on systolic blood pressure (SBP).
  • The results indicate the mean SBP was 124 for males and 122 for females.

Step 4: Set up SAS Survey Procedures to Test for Significance

A t-test is used to test whether the difference in mean SBP between males and females obtained in the previous step is statistically significant.

Request the t-test from the SAS Proc Surveyreg procedure and follow the steps in the summary table below.

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

Use SAS Survey Procedures to Calculate Significance
Statements Explanation
PROC SURVEYREG DATA=analysis_data nomcar;

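Use the SAS Survey procedure, proc surveyreg, to calculate significance. Use the nomcar option to treat missing values as not missing completely at random, so that observations with missing values are accounted for in the variance estimation.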

STRATA sdmvstra;

Use the strata statement to specify the strata (sdmvstra) and account for design effects of stratification.

CLUSTER sdmvpsu;

Use the cluster statement to specify PSU (sdmvpsu) to account for design effects of clustering.

CLASS riagendr;

Use the class statement to specify the discrete variables used to select the subpopulations of interest (i.e., gender [riagendr]).

WEIGHT wtmec4yr;

Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for 4 years of data (wtmec4yr) is used.

DOMAIN sel;

Use the domain statement to specify the subpopulations of interest.

IMPORTANT NOTE

When using proc surveyreg, use a domain statement to select the population of interest. Do not use a where or by-group statement to analyze subpopulations with the SAS Survey Procedures.

MODEL bpxsar = riagendr/ vadjust=none;

Use a model statement to specify the dependent variable for systolic blood pressure (bpxsar) as a function of the independent variable gender (riagendr). The vadjust option specifies whether or not to use variance adjustment.

TITLE 'Significance test for difference between mean systolic blood pressure for males and females NHANES 1999-2002';

Use the title statement to label the output.

Step 5: Review Output

  • 9,056 respondents had information on systolic blood pressure
  • The number of degrees of freedom in this example is equal to 29.
  • The t-statistic, with 29 degrees of freedom, is equal to 2.64. The p-value of 0.0132 is the probability, if the null hypothesis is true, of obtaining a t-statistic whose absolute value is greater than or equal to 2.64. Therefore, the null hypothesis is rejected at the 0.05 level.

Task 1c: How to Set Up a t-test in NHANES Using Stata

In this task, you will use Stata commands to calculate a t-statistic and assess whether the mean systolic blood pressures (SBP) in males and females age 20 years and older are statistically different.

Step 1: Set Up Stata to Produce Means

Follow the steps below to produce the mean SBP using the Stata command svy:mean, and then to test whether the difference in mean SBP between males and females is statistically significant.

IMPORTANT NOTE

There are several things you should be aware of while analyzing NHANES data with Stata. Please see the Stata Tips page to review them before continuing.

Step 2: Use svyset to define survey design variables

Remember that you need to define the SVYSET before using the SVY series of commands. The general format of this command is below:

svyset [w=weightvar], psu(psuvar) strata(stratavar) vce(linearized)

To define the survey design variables for your SBP analysis, use the weight variable for 4 years of MEC data (wtmec4yr), the PSU variable (sdmvpsu), and strata variable (sdmvstra). The vce option specifies the method for calculating the variance and the default is "linearized" which is Taylor linearization. Here is the svyset command for four years of MEC data:

svyset [w= wtmec4yr], psu(sdmvpsu) strata(sdmvstra) vce(linearized)

Step 3: Use svy:mean to generate means and standard errors in Stata

Now that the svyset has been defined, you can use the Stata command svy: mean to generate means and standard errors. The general command for obtaining weighted means and standard errors of a subpopulation is below.

svy: mean varname, subpop(if condition)

Use the svy : mean command with the systolic blood pressure variable (bpxsar) to estimate the mean systolic blood pressure for people age 20 years and older. Use the subpop() option to select a subpopulation for analysis, rather than select the study population in the Stata program while preparing the data file. This example uses an if statement to define the subpopulation based on the age variable's (ridageyr) value. Another option is to create a dichotomous variable where the subpopulation of interest is assigned a value of 1, and everyone else is assigned a value of 0.

svy: mean bpxsar, subpop(if ridageyr>=20 & ridageyr<.)
Output of svy:mean

Step 4: Use over option of svy:mean command to generate means and standard errors for different subgroups in Stata

You can also add the over() option to the svy:mean command to generate the means for different subgroups. When you do this, you can type a second command, estat size, to have the output display the subgroup observation numbers. Here is the general format of these commands for this example:

svy: mean varname, subpop(if condition) over(var1 var2)
estat size

Use the svy: mean command with the systolic blood pressure variable (bpxsar) to estimate the mean systolic blood pressure for people age 20 years and older. Use the subpop() option to select a subpopulation for analysis, rather than select the study population in the Stata program while preparing the data file. This example uses an if statement to define the subpopulation based on the value of the age variable (ridageyr). Another option is to create a dichotomous variable where the subpopulation of interest is assigned a value of 1, and everyone else is assigned a value of 0. Use the over() option to get stratified results; this example produces estimates by gender. Use the estat size postestimation command to display the number of subpopulation observations and weighted numbers.

svy: mean bpxsar, subpop(if ridageyr>=20 & ridageyr<.) over(riagendr)
estat size, obs size
Output of svy:mean with over option

Step 5a: Test the hypothesis using the lincom post estimation command

If you have already done some estimations, then you can use the lincom command to test the hypothesis that the difference between the means for the subpopulations equals 0. Use square brackets around the variable you are estimating. After the variable in square brackets, put the value of the stratifier that you want to test (e.g., a value of the variable in the over() option). If you used labels for the variable, you can use labels instead of the coded values. Here is the general format of this command for this example:

lincom [varname]stratval1 - [varname]stratval2

Because you have done some prior estimation, you can use the lincom postestimation command to test the hypothesis that the difference between mean SBP (bpxsar) for males and females equals 0. This example uses labeled values (male, female) instead of the coded values (1, 2) for the gender variable (riagendr).

lincom [bpxsar]male - [bpxsar]female
Output of lincom post estimation command

Step 5b: Test the hypothesis using svy:reg command

The svy:reg command could also be used to calculate the t-statistic. The difference between using svy:reg and lincom is that svy:reg can be used without prior estimation. The xi prefix is placed before the command so that categorical variables, which are marked with the i. prefix, are expanded into indicator variables. Here is the general format of this command for this example:

xi: svy, subpop(if condition): reg dependentvar i.varname

Use the svy:reg command with the xi prefix to calculate the t-statistic and assess whether the mean SBP (bpxsar) for males and females age 20 years and older are statistically different. The i. prefix denotes the categorical variable, which in this example is riagendr. Use the char command to choose the reference group for the categorical variable.

char riagendr[omit] 2
xi: svy, subpop(if ridageyr>=20 & ridageyr<.): reg bpxsar i.riagendr
Output of svy:reg command
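
The reason a regression on a two-level categorical variable reproduces the comparison of two means can be checked with a small Python sketch: with an intercept and a single 0/1 indicator, the weighted least-squares slope equals the difference between the two weighted group means. The blood pressure values, weights, and function names below are hypothetical and only illustrate the point estimate; the standard error of the difference still has to come from a design-based method such as the one svy uses.

```python
def weighted_mean(values, weights):
    """Weighted mean of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def weighted_slope(x, y, w):
    """Slope of the weighted least-squares line of y on x (with intercept)."""
    xbar = weighted_mean(x, w)
    ybar = weighted_mean(y, w)
    num = sum(wi * (xi - xbar) * (yi - ybar) for xi, yi, wi in zip(x, y, w))
    den = sum(wi * (xi - xbar) ** 2 for xi, wi in zip(x, w))
    return num / den

# Hypothetical data: systolic blood pressure, male indicator, sample weight.
sbp = [130.0, 118.0, 126.0, 121.0, 115.0, 124.0]
male = [1, 1, 1, 0, 0, 0]
wt = [1.5, 2.0, 1.0, 2.5, 1.0, 1.5]

mean_m = weighted_mean([y for y, m in zip(sbp, male) if m == 1],
                       [w for w, m in zip(wt, male) if m == 1])
mean_f = weighted_mean([y for y, m in zip(sbp, male) if m == 0],
                       [w for w, m in zip(wt, male) if m == 0])

diff = mean_m - mean_f
slope = weighted_slope(male, sbp, wt)
print(abs(diff - slope) < 1e-9)
```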

Step 6: Review Stata means and t-test output

Here is a table summarizing the results of the previous analyses:

Summary of Results

Variable | Subpopulation analyzed | Number of respondents with data | Mean | p value
Systolic blood pressure (bpxsar) | Adults age 20 and older | 9,056 | 123 | n/a
Systolic blood pressure (bpxsar) | Men age 20 and older | 4,301 | 124 | 0.0132 (men vs women)
Systolic blood pressure (bpxsar) | Women age 20 and older | 4,755 | 122 |

According to the stratified analysis, men's mean systolic blood pressure is 2 points higher than women's; 9,056 respondents had information on systolic blood pressure (SBP). This difference is statistically significant: in a sample of this size, a difference this large or larger would occur by chance alone only about 1.3% of the time if the population means were equal.

Confidence Intervals

Typically, point estimates from a sufficiently large probability sample are approximately normally distributed. The endpoints of the confidence interval are then a function of the estimate ($\hat{\theta}$), its standard error ($\widehat{SE}(\hat{\theta})$), and a percentile of the normal distribution with zero mean and unit variance, referred to as the standard normal deviate (z score), and are given by

(1) $$\hat{\theta} \pm z_{1-\alpha/2}\,\widehat{SE}(\hat{\theta})$$

The NHANES 1999-2002 sample is a multistage, area, probability sample. The number of independent pieces of information, or degrees of freedom, depends upon the number of PSUs rather than on the number of sample persons. (See "Degrees of Freedom for Performing Statistical Tests and Calculating Confidence Limits" in the Variance Estimation module for a more detailed discussion about correctly determining the number of the degrees of freedom.) Sample persons within a given PSU are not independent. (See module on Sample Design for more information.) Therefore, the standard normal deviate is replaced by a t-statistic with degrees of freedom equal to the difference between the number of PSUs and the number of strata containing observations. The endpoints for a confidence interval for the NHANES 1999-2002 survey are given by

(2) $$\hat{\theta} \pm t_{1-\alpha/2,\,df}\,\widehat{SE}(\hat{\theta}), \qquad df = \text{number of PSUs} - \text{number of strata}$$
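
The effect of replacing the standard normal deviate with a t percentile in (2) can be sketched as follows. The estimate of 124.0 and standard error of 0.5 are hypothetical; 2.045 is the usual table value for the 97.5th percentile of a t distribution with 29 degrees of freedom, and 1.960 is the corresponding standard normal value. The function name is illustrative.

```python
def confidence_interval(estimate, std_error, critical_value):
    """Return (lower, upper) endpoints: estimate +/- critical_value * std_error."""
    half_width = critical_value * std_error
    return estimate - half_width, estimate + half_width

T_975_DF29 = 2.045   # t table value, 29 degrees of freedom
Z_975 = 1.960        # standard normal table value

# Hypothetical estimate and standard error (not NHANES output).
t_ci = confidence_interval(124.0, 0.5, T_975_DF29)
z_ci = confidence_interval(124.0, 0.5, Z_975)

print(t_ci, z_ci)
```

With few degrees of freedom the t-based interval is wider than the normal-based one, which is the reason for using the design degrees of freedom rather than the number of sample persons.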

Sample weights must be incorporated in calculating the estimate and its standard error (see the Weighting module for more information) and design-based methods must be used to estimate the standard error (see the Variance Estimation module for more information). Taylor Series Linearization is one example of a design-based method. The design variables needed to obtain estimates of standard errors through this method are provided on the demographic files for the continuous NHANES (see below for an example of a program).

Interpretation

Confidence intervals, as constructed above, are based on one possible sample from a finite population. Many possible samples of the same size can be obtained using the same procedures and measurements. For each of these samples, a confidence interval can be constructed. For a 95% CI, 95 percent of these intervals would then contain θ.

  • See Degrees of Freedom for Performing Statistical Tests and Calculating Confidence Limits in the Variance Estimation module for a more detailed discussion about correctly determining the number of the degrees of freedom.

Transformations

Some variables in NHANES are highly skewed. In this case, transformations are recommended. One of the most common transformations used in the literature is the natural logarithm (log base e). We recommend that users verify that the transformed variable is approximately normally distributed before proceeding to construct a confidence interval. This can be done using SAS proc univariate with the plot and the normal options included. The output from this procedure includes a plot of the distribution of the transformed variable and a Q-Q plot, i.e., a plot of the un-weighted variable against the standard normal variable. If the plot is a straight line through the origin at a 45° angle, the variable is approximately normally distributed. Also included in the output are estimates of the third (skewness) and fourth (kurtosis) moments about the mean. Once users verify that the log transformed variable is approximately normally distributed, they can estimate the geometric mean and standard error and can then construct a 95% CI.

In order to do this, you can construct the 95 percent confidence interval of your estimate on the log scale using the standard t-statistic and then back transform the upper and lower limits. Alternatively, the geometric mean and its standard error can be obtained directly from SUDAAN proc descript and then output to a SAS dataset where the confidence interval can be constructed directly.

At the present time, SAS proc surveymeans does not have an option to produce geometric means and their standard errors. However, they can be obtained by running proc surveymeans on the log transformed variables to produce means and standard errors of the log transformed variable, constructing the confidence interval on the log-transformed scale, and then back transforming the endpoints.
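
The back-transformation described above can be sketched in Python. The mean and standard error on the log scale are hypothetical inputs (in practice they would come from a design-based procedure run on the log transformed variable), 2.045 is the t table value for 29 degrees of freedom, and the function name is made up for illustration.

```python
import math

def geometric_mean_ci(mean_log, se_log, critical_value):
    """Build the interval on the log scale, then exponentiate the point
    estimate and both endpoints to return to the original units."""
    gm = math.exp(mean_log)
    lower = math.exp(mean_log - critical_value * se_log)
    upper = math.exp(mean_log + critical_value * se_log)
    return gm, lower, upper

# Hypothetical log-scale mean and standard error (not NHANES output).
gm, lower, upper = geometric_mean_ci(math.log(120.0), 0.02, 2.045)
print(round(gm, 1), round(lower, 1), round(upper, 1))
```

An interval built this way is symmetric on the log scale but not in the original units: the upper limit lies farther from the geometric mean than the lower limit does.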

Applying the log-transformation does not necessarily yield a normally distributed random variable. Furthermore, in instances in which 0 is a plausible value, the log is undefined. We recommend that users try other transformations, for example the square root, in these instances. (Reference Visualizing Data by William Cleveland.)

Task 2: How to Calculate Confidence Intervals for Geometric Means Using SUDAAN

Step 1: Calculating Confidence Intervals for Geometric Means Using SUDAAN

This task will outline how to calculate confidence intervals for geometric means. See "How to Perform Statistical Tests and Calculate Confidence Limits with Degrees of Freedom" in the Variance Estimation module for basic programming steps for calculating confidence limits.

When the data are highly skewed you will need to transform them. For example, you can obtain the geometric mean by applying a log transformation to the data.

In this example, you will be calculating geometric means for the fasting serum triglyceride variable. Obtain the geometric mean and its standard error directly from the SUDAAN proc descript procedure and then output them to a SAS dataset where the CI can be constructed directly. The explanations in the summary table below provide an example that you can follow.

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

Generating the Geometric Mean and Standard Error from SUDAAN
Statements Explanation
proc sort data=analysis_data;
by sdmvstra sdmvpsu;
run ;

Use the SAS procedure, proc sort, to sort the data by strata (sdmvstra) and PSU (sdmvpsu).

proc descript data=analysis_data geometric atlevel1=1 atlevel2=2 ;

Use proc descript and specify the dataset (analysis_data).

Use the geometric option to compute and print geometric means and their standard errors.

The ATLEVEL1=1 and ATLEVEL2=2 options specify the sampling stages (in NHANES, strata are level 1 and PSUs are level 2) for which you want counts per table cell. ATLEV1 is the number of strata with at least one valid observation and ATLEV2 is the number of PSUs with at least one valid observation. These numbers are used to calculate degrees of freedom.

nest sdmvstra sdmvpsu;

Use the nest statement with strata (sdmvstra) and PSU (sdmvpsu) to account for the design effects.

weight wtsaf4yr;

Use the morning fasting sample weight for 4 years of data (wtsaf4yr) because serum triglyceride was obtained in persons examined in the morning who fasted for 9+ hours.

subpopn ridageyr>=20 /name= "Adults 20 years and older" ;

Use the subpopn statement to select the subpopulation of interest. In this example, sample persons 20 years and older (ridageyr>=20) are used.

class age1 riagendr/nofreq;

Use a class statement to list discrete variables upon which subgroups are based. In this example, gender (riagendr) and age (age1) are used.

var lbxtr;

Use a var statement to select the serum triglyceride variable (lbxtr) as your variable of interest.

table riagendr*age1;

Use the table statement to request estimates of serum triglyceride by gender (riagendr) within each age group (age1).

print nsum geomean segeomean/style=NCHS geomeanfmt=f6.0 segeomeanfmt= f6.1 ;

Use the print statement to print the number of observations (nsum), geometric means (geomean), and standard errors of geometric means (segeomean).

output nsum geomean segeomean atlev1 atlev2/filename=tg9902 replace;
run ;

Use an output statement to output the number of observations (nsum), geometric mean (geomean), standard error of the geometric mean (segeomean), number of strata (atlev1), and number of PSUs (atlev2) to a SAS file named tg9902.

Calculate Confidence Intervals from SAS Output Dataset
Statements Explanation
data newtg9902;
set tg9902;
df=atlev2-atlev1;

Use the data statement to create a new dataset (newtg9902) from the SAS dataset created previously (tg9902).

Calculate the degrees of freedom (df) from the number of PSU (atlev2) minus the number of strata (atlev1).

drop PROCNUM TABLENO VARIABLE _C1 _C2 ATLEV1 ATLEV2;

Use a drop statement to drop the selected variables from the dataset.

ll=round(geomean+tinv(.025 ,df)*segeomean);
ul=round(geomean+tinv(.975 ,df)*segeomean);
geomean=round(geomean);segeomean=round(segeomean,.1 );
ciwidth=ul-ll;

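Use these statements to calculate the lower limit (ll) and upper limit (ul) of the confidence interval, round the geometric mean (geomean) and its standard error (segeomean), and calculate the width of the confidence interval (ciwidth).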

proc print split='/' noobs ;
format age1 age1fmt. riagendr sexfmt. nsum 7.0 geomean 6.0 segeomean 6.1 df 2.0 ;
label ll='Lower/Limit' ul='Upper/limit' df='Degrees/of/freedom'
ciwidth='Confidence/interval/width' ;
title1 'Geometric mean of serum triglyceride and 95% Confidence' ;
title2 'interval of adults 20 years and older:' ;
title3 'United States, 1999-2002' ;
run ;

Use the proc print procedure to output the age group (age1), gender (riagendr), number of observations (nsum), geometric means (geomean), standard error of the geometric mean (segeomean), and degrees of freedom (df), along with the confidence limits (ll, ul) and the confidence interval width (ciwidth).

Step 2: Review Output

  • If you used the proc univariate procedure on the fasting serum triglycerides and compared the mean and median values, you would see that the difference is substantial as triglyceride is a highly skewed variable. Therefore, you should use geometric means.
  • The geometric mean for males increases up to age 40-49 years and then declines.
  • The geometric mean for females increases up to age 60-69 years and then declines.
  • The width of the confidence interval (CI) is wider for males than for females, and is the largest for males 40-49 years, indicating more variability in the mean serum triglycerides in this group.
  • Confidence intervals can also be used as a first glance to see if two groups are different, for example the CI for mean serum triglycerides for total males (CI 124, 137) and total females (CI 111, 118) do not overlap, indicating that the two groups are likely to be different. However, a test for statistical difference, such as a t-test, should be performed in order to definitively determine a significant difference between the mean for two population sub-groups.

Task 2c: How to Obtain Confidence Intervals for Geometric Means Using Stata

This task will provide you with a method to obtain confidence intervals for geometric means.

When the data are highly skewed you will need to transform them. For example, you can obtain the geometric mean by applying a log transformation to the data.

In this example, you will obtain geometric means for the fasting serum triglyceride variable. You can see that fasting triglycerides has a right skew by looking at the distribution with the command sum lbxtr [w=wtsaf4yr], det, which shows that the median value is 106 but the mean is 135. So, the geometric mean is a better representation of central tendency than the arithmetic mean.
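
The behavior of the three summary measures on right-skewed data can be illustrated with a small unweighted Python sketch. The seven values below are hypothetical and were chosen only so that one large value pulls the arithmetic mean upward; they are not NHANES triglyceride data, and no sample weights are applied.

```python
import statistics

# Hypothetical right-skewed values (not NHANES data, unweighted).
values = [60, 75, 90, 106, 130, 170, 420]

mean = statistics.mean(values)
median = statistics.median(values)
geo_mean = statistics.geometric_mean(values)  # exp of the mean of the logs

print(round(mean, 1), median, round(geo_mean, 1))
```

In this toy sample the single large value raises the arithmetic mean well above the median, while the geometric mean stays closer to the median.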

Obtain the mean and its standard error of the log transformed fasting serum triglyceride variable from the Stata command svy:mean and then use ereturn display, eform() to display the exponentiated coefficients (geometric mean, standard error and confidence interval). The explanations in the summary table below provide an example that you can follow.

IMPORTANT NOTE

There are several things you should be aware of while analyzing NHANES data with Stata. Please see the Stata Tips page to review them before continuing.

Step 1: Use svyset to define survey design variables

Remember that you need to define the SVYSET before using the SVY series of commands. The general format of this command is below:

svyset [w=weightvar], psu(psuvar) strata(stratavar) vce(linearized)

To define the survey design variables for your fasting serum triglyceride analysis, use the weight variable for four years of MEC data obtained from persons who fasted nine hours and were examined in the morning at the MEC (wtsaf4yr), the PSU variable (sdmvpsu), and the strata variable (sdmvstra). The vce option specifies the method for calculating the variance; the default is "linearized", which is Taylor linearization. Here is the svyset command for four years of MEC data obtained from persons who fasted nine hours and were examined in the morning:

svyset [w= wtsaf4yr], psu(sdmvpsu) strata(sdmvstra) vce(linearized)

Step 2: Create log transformed variable

The gen command is used to create new variables. The ln() function returns the natural log of the variable of interest. The general format of this command is below.

gen logvar=ln(var)

In this example, you will create the log transformed triglycerides variable (lnlbxtr) for the triglycerides variable (lbxtr) using this command:

gen lnlbxtr=ln(lbxtr)

Step 3: Use svy:mean to generate geometric means and standard errors in Stata

Now that the survey design has been defined with svyset, you can use the Stata command svy: mean to generate means and standard errors. To display the geometric mean in the original units of the variable, use the ereturn display command with the eform option. The general command for obtaining weighted means and standard errors of a subpopulation is below.

svy: mean varname, subpop(if condition)
ereturn display, eform(varname)

Use the svy: mean command with the log transformed triglyceride variable (lnlbxtr) to estimate the mean of the log values, from which the geometric mean of triglycerides for people age 20 years and older is obtained. Use the subpop() option to select a subpopulation for analysis, rather than selecting the study population in the Stata program while preparing the data file. This example uses an if statement to define the subpopulation based on the value of the age variable (ridageyr). Another option is to create a dichotomous variable where the subpopulation of interest is assigned a value of 1, and everyone else is assigned a value of 0. Use ereturn display, eform(geo_mean) to display the exponentiated coefficients, that is, the geometric mean in the original units of triglyceride, together with its standard error and confidence interval; geo_mean is the label given to the exponentiated column.

svy:mean lnlbxtr, subpop(if ridageyr>=20 & ridageyr<.)
ereturn display, eform(geo_mean)

Output of svy:mean
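The back-transformation that eform performs can be sketched in Python, outside the tutorial's own software. This is a minimal sketch under stated assumptions: the log-scale mean and standard error below are hypothetical values rather than tutorial output, the critical value assumes 29 design degrees of freedom, and the standard error on the original scale uses the delta-method approximation.

```python
import math


def back_transform(log_mean, log_se, t_crit):
    """Return (geometric mean, delta-method SE, lower, upper) in original units."""
    geo_mean = math.exp(log_mean)
    geo_se = geo_mean * log_se  # delta-method approximation
    lower = math.exp(log_mean - t_crit * log_se)
    upper = math.exp(log_mean + t_crit * log_se)
    return geo_mean, geo_se, lower, upper


# Hypothetical log-scale estimates; not taken from the tutorial output.
LOG_MEAN = 4.80
LOG_SE = 0.016
T_CRIT = 2.045  # two-sided 95% critical value of t with 29 df (assumed df)

gm, se, lo, hi = back_transform(LOG_MEAN, LOG_SE, T_CRIT)
print(f"geometric mean {gm:.1f}, SE {se:.2f}, 95% CI {lo:.1f} to {hi:.1f}")
```

Because the interval endpoints are exponentiated, the confidence interval is not symmetric around the geometric mean: the upper arm is slightly longer than the lower arm.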

Step 4: Use over option of svy:mean command to generate geometric means and standard errors for different subgroups in Stata

You can also add the over() option to the svy:mean command to generate the means for different subgroups. To display the geometric mean in the original units of the variable, use the ereturn display command with the eform option. Here is the general format of these commands for this example:

svy: mean varname, subpop(if condition) over(var1 var2)
ereturn display, eform(varname)

Use the svy: mean command with the log transformed triglyceride variable (lnlbxtr) to estimate the mean of the log values, from which the geometric mean of triglycerides for people age 20 years and older is obtained. Use the subpop() option to select a subpopulation for analysis, rather than selecting the study population in the Stata program while preparing the data file. This example uses an if statement to define the subpopulation based on the value of the age variable (ridageyr). Another option is to create a dichotomous variable where the subpopulation of interest is assigned a value of 1, and everyone else is assigned a value of 0. Use the over() option to get stratified results; this example produces estimates by gender and age. Use ereturn display, eform(geo_mean) to display the exponentiated coefficients, that is, the geometric mean in the original units of triglyceride, together with its standard error and confidence interval.

svy:mean lnlbxtr, subpop(if ridageyr>=20 & ridageyr<.) over(riagendr age1)
ereturn display, eform(geo_mean)

Output of svy:mean command with over option

Step 5: Review Output

Here is a table summarizing the output for the variable fasting triglyceride (lbxtr):

Summary output for the variable fasting triglyceride (lbxtr)
Subpopulation analyzed     Number of respondents with data   Geometric mean   95% confidence interval
Adults age 20 and older    3,982                             122              118-126
Men age 20 and older       1,893                             130              124-137
Women age 20 and older     2,089                             114              111-118
Men age 20-29                                                103              96-111
Men age 30-39                                                122              115-129
Men age 40-49                                                153              136-172
Men age 50-59                                                148              135-162
Men age 60-69                                                141              129-154
Men age 70+                                                  125              117-134
Women age 20-29                                              97               91-104
Women age 30-39                                              102              96-107
Women age 40-49                                              104              96-112
Women age 50-59                                              133              123-143
Women age 60-69                                              144              136-152
Women age 70+                                                142              133-151

According to the stratified analysis, the geometric mean of fasting triglycerides is 16 points higher for men than for women (130 versus 114). Confidence intervals can also be used as a first glance to see if two groups are different. For example, the CIs for mean serum triglycerides for total males (CI 124, 137) and total females (CI 111, 118) do not overlap, indicating that the two groups are likely to be different. However, a test for statistical difference, such as a t-test, should be performed in order to definitively determine a significant difference between the means for two population sub-groups. The geometric mean for males increases up to age 40-49 years and then declines. The geometric mean for females increases up to age 60-69 years and then declines. The confidence interval (CI) is wider for males than for females, and is the widest for males 40-49 years, indicating more variability in the mean serum triglycerides in this group.
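The first-glance comparison described above can be sketched in Python, outside the tutorial's own software. The intervals are copied from the summary table; the function name is chosen for this example, and the overlap check is only a screening aid, not a substitute for a t-test.

```python
def intervals_overlap(a, b):
    """True if closed intervals a=(low, high) and b=(low, high) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# 95% confidence intervals copied from the summary table.
men_total = (124, 137)
women_total = (111, 118)
men_by_age = {
    "20-29": (96, 111),
    "30-39": (115, 129),
    "40-49": (136, 172),
    "50-59": (135, 162),
    "60-69": (129, 154),
    "70+": (117, 134),
}

print("total CIs overlap:", intervals_overlap(men_total, women_total))

# Width of each male age-group interval, and the group with the widest one.
widths = {age: high - low for age, (low, high) in men_by_age.items()}
widest = max(widths, key=widths.get)
print("widest male CI:", widest, "width", widths[widest])
```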

Chi-Square Test

The chi-square test is used to test the independence of two variables cross classified in a two-way table. (A chi-square statistic with n degrees of freedom is based on a statistic equal to the sum of the squares of n independent normally distributed random variables with mean=0 and unit variance.)

For example, suppose we wished to test the hypothesis that blood pressure cuff size is independent of gender and that we have the following observed frequencies obtained as a result of the cross-classification of blood pressure cuff sizes and gender.

Blood pressure cuff size
                  1       2       3       4    Total
Men              63    1387    2409     453     4312
Women           222    2065    2002     493     4782
Both genders    285    3452    4411     946     9094

In a simple random sample setting (unweighted data), the expected cell frequencies under the null hypothesis that blood pressure cuff size and gender are independent could be obtained by multiplying the marginal total for the jth column by the proportion of individuals in the ith row.

For example, the expected value of blood pressure cuff size 1 for men would be 285*(4312/9094)=135; the expected value of blood pressure cuff size 4 for women would be 946*(4782/9094)=497.

Thus, if Oij = the observed frequency in the ith row and jth column, where i = 1, 2, …, r and j = 1, 2, …, c for a table with r rows and c columns, and

Eij = the expected frequency in the ith row and jth column

Then the formula to test the null hypothesis of independence, using the chi-square statistic, would be

$$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \qquad (1)$$

This statistic has degrees of freedom equal to the number of rows minus 1, multiplied by the number of columns minus 1.
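The simple random sample calculation described above can be sketched in Python, outside the tutorial's own software. The counts are the unweighted frequencies from the cuff size table; the function names are chosen for this example. Note that this is the unadjusted Pearson statistic, not the design-adjusted statistic recommended for NHANES.

```python
# Unweighted observed counts from the cuff size table (cuff sizes 1 to 4).
OBSERVED = {
    "Men": [63, 1387, 2409, 453],
    "Women": [222, 2065, 2002, 493],
}


def expected_counts(table):
    """Expected count per cell: column total * (row total / grand total)."""
    n_cols = len(next(iter(table.values())))
    row_totals = {row: sum(counts) for row, counts in table.items()}
    col_totals = [sum(counts[j] for counts in table.values()) for j in range(n_cols)]
    grand_total = sum(row_totals.values())
    return {
        row: [col_totals[j] * row_totals[row] / grand_total for j in range(n_cols)]
        for row in table
    }


def pearson_chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a two-way table."""
    expected = expected_counts(table)
    statistic = sum(
        (o - e) ** 2 / e
        for row in table
        for o, e in zip(table[row], expected[row])
    )
    n_rows = len(table)
    n_cols = len(next(iter(table.values())))
    return statistic, (n_rows - 1) * (n_cols - 1)


expected = expected_counts(OBSERVED)
statistic, df = pearson_chi_square(OBSERVED)
print("expected, men, cuff size 1:  ", round(expected["Men"][0]))
print("expected, women, cuff size 4:", round(expected["Women"][3]))
print(f"chi-square = {statistic:.2f} with {df} degrees of freedom")
```

The two expected counts printed first correspond to the worked values of 135 and 497 given in the text.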

In a complex sample setting, you would use a statistic similar to equation (1) above, modified to account for the survey design, with degrees of freedom equal to the number of PSUs minus the number of strata containing observations. This statistic can be obtained through SAS proc surveyfreq (CHISQ, based on the Rao-Scott chi-square with an adjusted F statistic). The analogous procedure in SUDAAN version 9.0 (proc crosstab) provides limited chi-square statistics based on the Wald chi-square and does not provide an F adjusted p-value. However, SUDAAN regression models do provide F adjusted chi-square statistics, which are recommended for analyzing NHANES data.
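The design degrees of freedom rule can be sketched in Python, outside the tutorial's own software. The records below are a hypothetical toy design, not NHANES data, and the sketch assumes that PSUs are numbered within strata, so a PSU is identified by its (stratum, PSU) pair.

```python
def design_degrees_of_freedom(records):
    """Number of PSUs minus number of strata, counting only those with observations.

    records: iterable of (stratum, psu) pairs, one per observation in the analysis.
    """
    psus = {(stratum, psu) for stratum, psu in records}
    strata = {stratum for stratum, _ in psus}
    return len(psus) - len(strata)


# Hypothetical toy design: 3 strata with 2 PSUs each (not NHANES data).
toy_records = [
    (1, 1), (1, 1), (1, 2),
    (2, 1), (2, 2), (2, 2),
    (3, 1), (3, 2),
]

print(design_degrees_of_freedom(toy_records))  # 6 PSUs - 3 strata
```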

The Cochran-Mantel-Haenszel test, an extension of the Pearson chi-square, can be applied to stratified two-way tables to test for homogeneity or independence in a non-survey setting. For a complex sample, its analogue can be obtained in SUDAAN proc crosstab (cmh).

References

Agresti A. An Introduction to Categorical Data Analysis. Wiley Series in Probability and Statistics. 1996. New York.

Task 3a: How to Perform Chi-Square Test Using SUDAAN

In this task, you will use the chi-square test to determine whether gender and blood pressure cuff size are independent of each other.

Step 1: Set Up SUDAAN to Perform Chi-Square Test

The chi-square statistic is requested from the SUDAAN procedure proc crosstab. The summary table below provides an example of how to code for a chi-square test in SUDAAN.

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

Calculating chi-square Using SUDAAN Procedure proc crosstab
Statements Explanation
proc sort data =analysis_data;
by sdmvstra sdmvpsu;
run ;

Use the SAS procedure proc sort to sort the data by strata (sdmvstra) and PSUs (sdmvpsu) before running the procedure in SUDAAN.

proc crosstab
data=analysis_data design=wr;

Use proc crosstab to examine the relationship between two categorical variables.

nest sdmvstra sdmvpsu;

Use the nest statement with strata (sdmvstra) and PSU (sdmvpsu) to account for the design effects.

weight wtmec4yr;

Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for 4 years of data (wtmec4yr) is used.

subpopn ridageyr >= 20 ;

Use the subpopn statement to select those 20 years and older.

Please note that for accurate estimates of the standard error, it is preferable to use subpopn in SUDAAN to select a subgroup for analysis, rather than selecting the study subgroup in SAS when preparing the data file. (See Section 5.4 of Korn and Graubard, Analysis of Data from Health Surveys, pp 207-211.)

recode bpacsz = (1 3 4 5);

Use the recode statement to regroup blood pressure cuff size from five categories to four categories. This collapses the infant and child groups.

class riagendr bpacsz/NoFreq;

Use the class statement for categorical variables in version 9.0. In earlier versions, you need a subgroup and levels statement. The NoFreq option suppresses printing frequencies in the output.

table riagendr*bpacsz;

Use the table statement to choose the categorical variables gender (riagendr) and blood pressure cuff size (bpacsz) for cross tabulation.

print nsum rowper colper/tests=all;

Use the print statement to obtain the N (nsum), row percent (rowper), and column percent (colper). Use the tests option to request all available statistics.

rformat riagendr sexfmt. ;
rformat bpacsz csz1fmt. ;

Use the rformat statement to read the SAS formats into SUDAAN.

rtitle "Chi-square test for blood pressure cuff size: NHANES 1999-2002" ;
run ;

Use the rtitle statement to title the output.

IMPORTANT NOTE

SUDAAN Version 9.0 proc crosstab provides only limited chi-square results (Wald) with p-values based on unadjusted F-statistics (not the recommended statistic for complex survey data). However, the SUDAAN regression procedures do produce the recommended F adjusted chi-square statistics (e.g. Rao-Scott and Satterthwaite) for use in analyzing NHANES data.

Step 2: Review Output

  • 9,094 respondents have information on blood pressure cuff size.
  • The row percentages indicate that males tend to have a larger cuff size than females.
  • Because the p-value is less than 0.05, you would reject the null hypothesis that gender and blood pressure cuff size are independent. The probability of obtaining a value of 274.74 or more is approximately zero.

Task 3b: How to Perform Chi-Square Test Using SAS Survey Procedures

In this task, you will use the chi-square test in SAS to determine whether gender and blood pressure cuff size are independent of each other.

Step 1: Set Up SAS to Perform Chi-Square Test

The chi-square statistic is requested from the SAS survey procedure proc surveyfreq. The summary table below provides an example of how to code for a chi-square test in SAS.

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

Calculating chi-square Using SAS Survey Procedures proc surveyfreq
Statements Explanation
proc surveyfreq data=analysis_data;

Use the SAS Survey procedure, proc surveyfreq, to examine the relationship between two categorical variables.

strata sdmvstra;

Use the strata statement to specify the strata variable (sdmvstra) and account for design effects of stratification.

cluster sdmvpsu;

Use the cluster statement to specify the PSU variable (sdmvpsu) and account for design effects of clustering.

weight wtmec4yr;

Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for 4 years of data (wtmec4yr) is used.

table sel*riagendr*bpacsz/col row nostd nowt wchisq wllchisq chisq chisq1;

Use the table statement to specify cross-tabulations for which estimates are requested. In the example, the estimates are for age greater than or equal to 20 (sel) by gender (riagendr) and by blood pressure cuff size (bpacsz). The options after the slash will output the column percent (col), row percent (row), Wald chi-square (wchisq), and Wald log linear chi-square (wllchisq), and suppress the standard deviation (nostd) and weighted sums (nowt). Use the chisq option to obtain the Rao-Scott chi-square and the chisq1 option to obtain the Rao-Scott modified chi-square.

format riagendr sexfmt. bpacsz csz2fmt. ;
run ;

Use the format statement to read the SAS formats.

IMPORTANT NOTE

For complex survey data such as NHANES, we recommend using the Rao-Scott F adjusted chi-square statistic since it yields a more conservative interpretation than the Wald chi-square.

Step 2: Review output

  • 9,094 respondents have information on blood pressure cuff size.
  • The row percentages indicate that males tend to have a larger cuff size than females.
  • Because the F adjusted p-value is less than 0.05, you would reject the null hypothesis that gender and blood pressure cuff size are independent. The probability of obtaining a value of 125.55 or more is approximately zero.

Task 3c: How to Perform Chi-Square Test Using Stata

In this task, you will use the chi-square test in Stata to determine whether gender and blood pressure cuff size are independent of each other. The chi-square statistic is requested from the Stata command svy: tabulate.

IMPORTANT NOTE

There are several things you should be aware of while analyzing NHANES data with Stata. Please see the Stata Tips page to review them before continuing.

Step 1: Use svyset to define survey design variables

Remember that you need to define the SVYSET before using the SVY series of commands. The general format of this command is below:

svyset [w=weightvar], psu(psuvar) strata(stratavar) vce(linearized)

To define the survey design variables for your blood pressure cuff size (bpacsz) analysis, use the weight variable for four years of MEC data (wtmec4yr), the PSU variable (sdmvpsu), and the strata variable (sdmvstra). The vce option specifies the method for calculating the variance; the default is "linearized", which is Taylor linearization. Here is the svyset command for four years of MEC data:

svyset [w= wtmec4yr], psu(sdmvpsu) strata(sdmvstra) vce(linearized)

Step 2: Regroup blood pressure cuff size variable

In this example, a new variable (cuff_size) is created to regroup blood pressure cuff size (bpacsz) from five categories to four categories. This collapses the infant (1) and child (2) groups. Use the gen command to create a new variable.

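* Collapse infant (1) and child (2) into the first category
gen cuff_size=1 if bpacsz==1 | bpacsz==2
* Renumber the remaining categories so that they run from 2 to 4
replace cuff_size=2 if bpacsz==3
replace cuff_size=3 if bpacsz==4
replace cuff_size=4 if bpacsz==5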
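The same regrouping can be expressed as a lookup table. The Python sketch below is outside the tutorial's own software; the function name is chosen for this example, and None stands in for a missing value, mirroring how the Stata commands leave cuff_size missing when bpacsz matches none of the conditions.

```python
# Map the five original cuff size codes to four regrouped categories.
CUFF_SIZE_MAP = {1: 1, 2: 1, 3: 2, 4: 3, 5: 4}


def recode_cuff_size(bpacsz):
    """Return the regrouped cuff size, or None if the code is missing or unknown."""
    return CUFF_SIZE_MAP.get(bpacsz)


for code in (1, 2, 3, 4, 5, None):
    print(code, "->", recode_cuff_size(code))
```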

Step 3: Generate chi-square statistics using svy:tabulate

Now that the survey design has been defined with svyset, you can use the Stata command svy: tabulate to produce two-way tabulations with tests of independence. Some of the options for the tabulate command include:

  • column and row to display column and row percentages (if you do not specify this you will get cell proportions);
  • obs lists the number of observations in each cell; count lists the weighted n in each cell and by adding format(%11.0fc) you will display the counts with commas rather than scientific notation;
  • ci gives the confidence interval around each estimate, but can only be used with either row or column, not both; and
  • the Pearson (Rao-Scott correction F-statistic) chi-square (pearson), null-based (null), and Wald (wald) test statistics.

The general command for generating two-way tabulations is below.

svy: tabulate var1 var2, subpop(if condition) options

Use the svy: tabulate command to produce two-way tabulations for gender (riagendr) and blood pressure cuff size (cuff_size) with tests of independence for people age 20 years and older. Use the subpop() option to select a subpopulation for analysis, rather than selecting the study population in the Stata program while preparing the data file (see Section 5.4 of Korn and Graubard, Analysis of Data from Health Surveys, pp 207-211). This example uses an if statement to define the subpopulation based on the value of the age variable (ridageyr). Another option is to create a dichotomous variable where the subpopulation of interest is assigned a value of 1, and everyone else is assigned a value of 0. This example specifies the column, row, obs, percent, pearson, null, and wald options.

svy:tab riagendr cuff_size, subpop(if ridageyr>=20 & ridageyr<.) column row obs percent pearson null wald

Output of svy:tabulate command with column, row, obs, percent, pearson, null and wald options

Step 4: Review output

Here is a table summarizing the output:

Cuff size     Men age 20 and older (n=4312)   Women age 20 and older (n=4782)   p value
(1) Infant    0%                              0%                                <0.0001
(2) Child     1.5%                            5%
3 Adult       29%                             44%
4 Large       58%                             41%
5 Thigh       12%                             10%

Men have a larger cuff size than women. For example, 70% of men had a cuff size of 4 or 5, compared to 51% of women. Cuff size varies significantly according to gender (p<0.0001). NOTE: The grayed cells have too few observations to create stable estimates and should probably not be reported.
