
Module 7: Age Standardization and Population Estimates

The NHANES Tutorials are currently being reviewed and revised, and are subject to change. Specialized tutorials (e.g. Dietary, etc.) will be included in the future.

This module covers two issues that commonly arise when researchers analyze population data: age standardization and population counts (estimated numbers of persons in the U.S. with a particular characteristic). Addressing these issues in NHANES analyses requires the use of Census population data.

Age is one of the most common and important confounding factors in health studies. Age can confound comparisons when the groups being compared have different age distributions and age is related to the outcome of interest (e.g., death or the prevalence of disease). Age standardization removes the confounding effect of age so that you can make fair comparisons.

To understand the public health impact of a problem, it is often helpful to calculate population counts in addition to the prevalence of a health condition. By quantifying the number of people with a particular condition, counts directly speak to the magnitude of a public health problem.

Age Standardization

Age standardization, sometimes referred to as age adjustment, is a method that applies observed age-specific rates to a standard age distribution. It is used when comparing two or more populations at one point in time, or one population at two or more points in time.

This method removes the confounding effect of age, which can distort comparisons between groups with different age distributions when age is related to the outcome of interest. While many factors affect health outcomes, age is generally the strongest, since the chance of developing or dying from chronic health conditions typically increases with age; different age groups may also differ in their exposure to behavioral or environmental risks. For example, imagine you were interested in comparing the burden of hypertension among non-Hispanic Whites and Mexican-Americans, and you found that hypertension was much less common among Mexican-Americans. You would need to be careful before drawing any conclusions, because the prevalence of hypertension increases with age and the Mexican-American population is substantially younger than the non-Hispanic White population.

There are a number of ways to solve this problem. One way would be to compare the prevalence of the condition within similar age groups. For example, compare the prevalence of hypertension among non-Hispanic Whites and Mexican-Americans age 20-24 years, age 25-29 years, age 30-34 years, and so on. This approach is tedious, and it makes it hard to draw an overall conclusion. Age-standardized estimates let you compare the overall prevalence of the condition in the groups after removing any differences in age. They let you say what the prevalence would be if the groups under consideration had exactly the same age structure.

There are two widely used methods of age standardization: direct and indirect. In both cases, the general idea is to construct an estimate based on what would be seen if the age distributions in the comparison groups were the same. This tutorial will use the direct method. There are two basic steps:

  1. Choosing a standard population. In general, the standard population can be either a single study group, a combined study group, or an external population (like the US population). For Continuous NHANES, the recommended standard population is the 2000 Census data.
  2. Applying the age-specific prevalence of the outcome observed in the study population (i.e., the population you want to age-adjust) to the standard population. This is typically done in 5- or 10-year age groups. This just means multiplying the age-specific prevalence from the study population by the proportion of people in that age group in the standard population, and summing the results to get the age-adjusted estimate:
$$\text{age-adjusted prevalence} = \sum_{i} p_i \, w_i$$

where $p_i$ is the prevalence of the condition in age group $i$ of the study population, and $w_i$ is the proportion of people in age group $i$ in the standard population.
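The weighted sum described above can be sketched in a few lines of Python. This is only an illustration and is not part of the NHANES programs: the function name `age_adjusted` and the age-specific prevalences are hypothetical, and only the standard proportions (ages 20-39, 40-59, and 60+) are taken from this module.

```python
import math

def age_adjusted(prevalences, std_proportions):
    """Direct standardization: sum of p_i * w_i over age groups."""
    if len(prevalences) != len(std_proportions):
        raise ValueError("need one standard proportion per age group")
    if not math.isclose(sum(std_proportions), 1.0, abs_tol=1e-3):
        raise ValueError("standard proportions should sum to 1")
    return sum(p * w for p, w in zip(prevalences, std_proportions))

# Standard proportions for ages 20-39, 40-59, 60+ (2000 Census, from this module).
STD_WEIGHTS = [0.3966, 0.3718, 0.2316]

# Hypothetical age-specific prevalences (percent), for illustration only.
example_prevalences = [10.0, 30.0, 60.0]

adjusted = age_adjusted(example_prevalences, STD_WEIGHTS)
print(f"Age-adjusted prevalence: {adjusted:.2f}%")
```

The length and sum checks mirror the two requirements stated in this module: one proportion per age group, and proportions that sum to 1.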

Keep in mind that the age-adjusted estimate is a hypothetical value: it is the prevalence the study population would have if it had the age structure of the standard population. The observed (or unadjusted, or crude) prevalence in the study population is the real one. Age-standardized estimates are nonetheless extremely useful because they are not confounded by age. That is, as long as you use the same standard population, you can safely make comparisons across groups even if their underlying age structures vary substantially.
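A small sketch can make the contrast between crude and age-adjusted estimates concrete. In the Python example below, two groups are given identical age-specific prevalences but different age structures. All of the group data are invented for illustration; only the standard proportions come from this module.

```python
# Standard proportions for ages 20-39, 40-59, 60+ (2000 Census, from this module).
STD_WEIGHTS = [0.3966, 0.3718, 0.2316]

# Hypothetical inputs: both groups share the same age-specific prevalence (percent)
# but have different age structures (share of each group in each age band).
AGE_SPECIFIC_PREVALENCE = [8.0, 28.0, 62.0]
AGE_SHARES = {
    "younger group": [0.60, 0.30, 0.10],
    "older group": [0.30, 0.40, 0.30],
}

def weighted_prevalence(prevalences, shares):
    """Weight age-specific prevalences by a set of age shares that sum to 1."""
    return sum(p * s for p, s in zip(prevalences, shares))

results = {}
for group, shares in AGE_SHARES.items():
    # Crude: weighted by the group's own age structure.
    crude = weighted_prevalence(AGE_SPECIFIC_PREVALENCE, shares)
    # Adjusted: weighted by the common standard population.
    adjusted = weighted_prevalence(AGE_SPECIFIC_PREVALENCE, STD_WEIGHTS)
    results[group] = (crude, adjusted)
    print(f"{group}: crude {crude:.1f}%, age-adjusted {adjusted:.1f}%")
```

In this toy setup the crude prevalences differ only because the age structures differ, while the age-adjusted values coincide because both groups are weighted by the same standard population.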

IMPORTANT NOTE

When comparing health outcomes between subgroups, age-adjusted rates can differ considerably from crude rates when there is substantial confounding by age. This usually occurs because the age distribution of the subgroups differs from that of the standard population, and because the health conditions and risk factors used in the analysis are associated with age. It is generally good practice to use age-adjusted estimates when comparing health outcomes among subgroups, or at least to compare the age-adjusted estimates with the crude rates to make sure there are no substantial differences before using the crude estimates.

References

Klein RJ, Schoenborn CA. Age adjustment using the 2000 projected U.S. population. Healthy People Statistical Notes, no. 20. Hyattsville, Maryland: National Center for Health Statistics. January 2001.

Buescher PA. Age-adjusted death rates. Statistical Primer. State Center for Health Statistics. Raleigh, NC.

Task 1a: How to Generate Age-Adjusted Prevalence Rates and Means in SUDAAN

In this example, you will generate age-adjusted prevalence rates and standard errors for high blood pressure (HBP) by sex and race in persons 20 years and older. An optional second example is available demonstrating how to generate age-adjusted means and standard errors for Body Mass Index (BMI) by sex and race/ethnicity for persons 20 years and older.

To calculate age-adjusted prevalence rates, you will need to know the age standardizing proportions that you want to use, and then apply them to the populations under comparison. This is called the direct method for age standardization. Typically, Census data are used as the standard population structure. For age standardization in NHANES, NCHS recommends using the 2000 Census population. A spreadsheet with the year 2000 U.S. population structure by age is attached below. Calculate the standard age proportions by dividing the age-specific Census population (P) by the total Census population number (T). The standardizing proportions (P/T) should sum to 1 (see the table below for the standard age proportions used in this module.)

Attachment

For your convenience, standard proportions for different NHANES population age groupings are provided in the Excel spreadsheet attached below. This file uses the 2000 Census as the standard population. The adjustment factors were calculated for four age groupings:

  1. all ages,
  2. ages 6 years and older,
  3. ages 20 years and older using 10 year age intervals, and
  4. for the blood pressure example in this module, for ages 20 years and older using 20 year age intervals.

For other age groupings, you can combine the smaller age groups provided in order to reflect the age and subpopulation you are using in your analysis.

Standard Proportions for NHANES Population Groupings link: ageadjtwt.xls
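As a rough illustration of combining age groups, the Python sketch below pools the 40-59 and 60+ groups into a single 40+ group, and separately recomputes proportions for an analysis restricted to adults 40 and older. The counts are the 20-year Census figures tabulated in this module; the helper names `combine` and `standard_proportions` are hypothetical.

```python
# 2000 Census standard population counts in thousands (20-year groups from this module).
CENSUS_COUNTS = {"20-39": 77670, "40-59": 72816, "60+": 45364}

def combine(counts, groups):
    """Pool several age groups into one count (in thousands)."""
    return sum(counts[g] for g in groups)

def standard_proportions(counts, groups):
    """Return P/T for the listed age groups, where T is their combined total."""
    total = sum(counts[g] for g in groups)
    return {g: counts[g] / total for g in groups}

# Pool 40-59 and 60+ into one 40+ group, keeping all adults 20+ as the base.
pooled = {
    "20-39": CENSUS_COUNTS["20-39"],
    "40+": combine(CENSUS_COUNTS, ["40-59", "60+"]),
}
pooled_props = standard_proportions(pooled, ["20-39", "40+"])
print(pooled_props)

# Restrict the analysis to adults 40+ and recompute proportions on that base.
restricted_props = standard_proportions(CENSUS_COUNTS, ["40-59", "60+"])
print(restricted_props)
```

In both cases the proportions are recomputed against the total of the groups actually analyzed, so they still sum to 1.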

Example of How to Calculate Standard Age Proportions

Here is an example of how to calculate the standard age proportions by dividing the age-specific Census population (P) by the total Census population number (T). The standardizing proportions should sum to 1.

Standard Proportions for 20-year Age Groups Based on the 2000 U.S. Census Standard Population

Age Group | Age-Specific Census Population, P (in thousands) | Total Census Population, T (in thousands) | Standard Age Proportion, P/T
20-39 | 77,670 | 195,850 | 0.396579
40-59 | 72,816 | 195,850 | 0.371795
60+ | 45,364 | 195,850 | 0.231626
Total | 195,850 | | Sum: 1

As you can see, each "standard age proportion," also referred to as an "age adjustment weight," is simply the proportion of people in the 2000 Census (the standard population) who fall in a specific age category. For example, the standard age proportion for people 20-39 years old is:

$$\frac{P}{T} = \frac{77{,}670 \text{ thousand people aged 20-39 years}}{195{,}850 \text{ thousand people aged 20 years and older}} = 0.396579$$
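The same arithmetic can be checked for all three age groups with a short Python sketch. It uses only the Census counts from the table above; the variable names are arbitrary, and the rounding to four decimals reflects the form in which the proportions appear on the stdwgt statement in this module.

```python
# 2000 Census standard population, ages 20+, in thousands (from the table above).
census_counts = {"20-39": 77670, "40-59": 72816, "60+": 45364}

total = sum(census_counts.values())  # T

# Standard age proportion P/T for each age group.
proportions = {group: count / total for group, count in census_counts.items()}

# Rounded to four decimals, the form used on the stdwgt statement in this module.
rounded = {group: round(p, 4) for group, p in proportions.items()}

for group in census_counts:
    print(f"{group}: P/T = {proportions[group]:.6f}, rounded = {rounded[group]:.4f}")
print(f"Sum of proportions: {sum(proportions.values()):.4f}")
```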


Step 1: Generate Age-Adjusted Prevalence Rates

The SUDAAN procedure, proc descript, is used to generate age-adjusted percentages (prevalence rates) and standard errors. The age standardization variable and proportions are provided in the STDVAR and STDWGT statements. The age-adjusted estimates are requested on the print statement along with the sample size (nsum). The SUDAAN program used to obtain weighted age-adjusted prevalence rates and standard errors for high blood pressure by sex and race, among persons 20 years and older follows here.

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

SUDAAN Procedure for Generating Age-Adjusted Prevalence Rates
Statements Explanation
proc sort data =analysis_data; 
by sdmvstra sdmvpsu; 
run ;
Use the proc sort procedure to sort the dataset by strata (sdmvstra) and PSU (sdmvpsu).
proc descript data =analysis_data design=wr;
Use the proc descript procedure to generate the age-adjusted percentages and specify the sample design using the design option WR (with replacement).
subpopn ridageyr >=20 ;

Use the subpopn statement to select the sample persons 20 years and older (ridageyr >=20) because only those individuals are of interest in this example.

Note: For accurate estimates, it is preferable to use subpopn in SUDAAN to select a subpopulation for analysis, rather than select the study population in the SAS program while preparing the data file.

NEST sdmvstra sdmvpsu;
Use the nest statement with strata (sdmvstra) and PSU (sdmvpsu) to account for the design effects.
WEIGHT wtmec4yr;
Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for four years of data (wtmec4yr) is used.
subgroup riagendr age race ;
Use the subgroup statement to list the categorical variables for which statistics are requested. These names will also appear in the table statement below. In this example, gender (riagendr), age (age), and race-ethnicity (race) are of interest.
levels 2 3 4 ;
Use the levels statement to define the number of categories in each of the subgroup variables. The level must be an integer greater than 0. In this example, there are two genders, three age groups, and four race-ethnicity groups.
var hbp;

Use the var statement to name the variable(s) to be analyzed. In this example, the high blood pressure variable (hbp) is used.

catlevel 1 ;

Use the catlevel statement to indicate the var statement variable(s) are categorical and select the level of each variable to be analyzed. In this example, you are interested in hbp=1, i.e., persons who have high blood pressure.

table riagendr * race ;
Use the table statement to specify the cross-tabulations for which estimates are requested. If a table statement is not present, a one-dimensional distribution is generated for each variable on the subgroup statement. In this example, the estimates are for gender (riagendr) by race-ethnicity (race).

stdvar age; 
stdwgt 0.3966 0.3718 0.2316 ;

Use the stdvar and stdwgt statements to yield standardized estimates of the mean. In the example, age is the standardizing variable as defined on the stdvar statement (note that age must also appear on the subgroup statement). The stdwgt statement specifies the population proportions based on the 2000 Census estimates. The number of proportions listed should equal the number of levels in the stdvar variable and should be listed in the same order as the respective level of the variable (see levels statement above). Their sum should equal 1.
print 
nsum= "Sample Size" percent= "Percent" sepercent= "SE" ;
Use the print statement to assign names and format the desired statistics and to view the output. If the statement print is used alone, all of the default statistics are printed with default labels and formats.

In this example, the sample size (nsum), adjusted percent (percent), and standard error of the percent (sepercent) are requested.

Note: For a complete list of statistics that can be requested on the print statement see SUDAAN Users Manual.

rtitle " Age-standardized prevalence of persons 20 years and older with high blood pressure: NHANES 1999-2002" ;
Use the rtitle statement to assign a heading to the output.
rfootnote "Age adjusted by the direct method to the year 2000 Census population projections using the age groups 20-39, 40-59, and 60+" ;
Use the rfootnote statement to specify a footnote to the tables.
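The requirements on the stdwgt values described above (one proportion per level of the stdvar variable, summing to 1) are easy to check before running the procedure. The Python sketch below is a hypothetical helper, not a SUDAAN feature; the rounding tolerance `tol` is an assumption of this sketch.

```python
import math

def check_std_weights(weights, n_levels, tol=1e-3):
    """Check standardizing proportions against the rules described above.

    weights  -- proportions in the order of the standardizing variable's levels
    n_levels -- number of levels of the standardizing variable
    tol      -- allowed rounding slack in the sum (an assumption of this sketch)
    """
    problems = []
    if len(weights) != n_levels:
        problems.append(f"expected {n_levels} proportions, got {len(weights)}")
    if not math.isclose(sum(weights), 1.0, abs_tol=tol):
        problems.append(f"proportions sum to {sum(weights):.4f}, not 1")
    return problems

# Values from the stdwgt statement; age has 3 levels on the levels statement.
ok = check_std_weights([0.3966, 0.3718, 0.2316], n_levels=3)
print(ok)

# A deliberately incomplete list, to show what the check reports.
bad = check_std_weights([0.3966, 0.3718], n_levels=3)
print(bad)
```

An empty list means both checks passed. The helper cannot verify that the proportions are listed in the same order as the levels of the age variable, which still has to be confirmed by hand.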

IMPORTANT NOTE

Note: To calculate the unadjusted prevalence, use the program code above but omit the stdvar and stdwgt statements.

Highlights from the output include:

  • The output lists the sample sizes, age adjusted percentages (prevalence rates), and their standard errors.
  • Hypertension prevalence in Mexican Americans changed from a crude prevalence of 17% to 26% in the age-standardized estimate. The Mexican Americans are younger with a mean age of 38 years, and because hypertension increases with age, the age-adjusted estimate among Mexican Americans is higher than the crude estimate.
  • Similarly, the non-Hispanic whites are somewhat older, so their age-adjusted prevalence is lower than the crude estimate.
  • According to the unadjusted estimates, HBP prevalence differs between the Mexican-American and non-Hispanic white groups by approximately 12 percentage points.
  • However, the age-adjusted estimates show a difference of only about 2 percentage points between the Mexican-American and non-Hispanic white groups.
  • Non-Hispanic blacks have a higher age-adjusted prevalence of HBP (41%) than do other race/ethnic groups.

Step 2: Generate Age-Adjusted Means (Optional)

The SUDAAN procedure, proc descript, is used to generate age-adjusted means and standard errors. The age standardization variable and proportions are provided in the stdvar and stdwgt statements. These age-adjusted estimates are requested in the print statement. The SUDAAN program used to obtain weighted adjusted means and standard errors for BMI, by sex and race among persons 20 years and older follows here.

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

SUDAAN Procedure for Generating Adjusted Means
Statements Explanations
proc sort data =analysis_data;
by sdmvstra sdmvpsu;
run ;

Use the proc sort procedure to sort the dataset by strata (sdmvstra) and PSU (sdmvpsu).

PROC DESCRIPT data =analysis_data design=wr;

Use the proc descript procedure to generate adjusted means and specify the sample design using the design option WR (with replacement).

subpopn ridageyr >=20 ;

Use the subpopn statement to select the sample persons 20 years and older (ridageyr >=20) because only those individuals are of interest in this example.

Note: For accurate estimates, it is preferable to use the subpopn in SUDAAN to select a subpopulation for analysis, rather than to select the study population in the SAS program while preparing the data file.

NEST sdmvstra sdmvpsu;

Use the nest statement with strata (sdmvstra) and PSU (sdmvpsu) to account for the design effects.

WEIGHT wtmec4yr;

Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for 4 years of data (wtmec4yr) is used.

subgroup riagendr age race ;

Use the subgroup statement to list the categorical variables for which estimates are requested. These names will also appear in the table statement below. In this example, gender (riagendr), age (age), and race-ethnicity (race) are of interest.

levels 2 3 4 ;

Use the levels statement to define the number of categories in each of the subgroup variables. The level must be an integer greater than 0. This example has two genders, three age groups, and four race-ethnicity groups.

var bmxbmi;

Use the var statement to name the variable(s) to be analyzed. In this example, the BMI variable (bmxbmi) is used.

table riagendr * race ;

Use the table statement to specify cross-tabulations for which estimates are requested. If a table statement is not present, a one-dimensional distribution is generated for each variable on the subgroup statement. In this example, the estimates are for gender (riagendr) by race-ethnicity (race).

stdvar age;
stdwgt 0.3966 0.3718 0.2316 ;

Use the stdvar and stdwgt statements to yield standardized estimates of the mean. In this example, age is the standardizing variable, as defined in the stdvar statement (note that age must also appear in the subgroup statement). The stdwgt statement specifies the population proportions based on the 2000 Census estimates. The number of proportions listed should equal the number of levels in the stdvar variable and should be listed in the same order as the respective level of the variable (see levels statement above). Their sum should equal 1.

PRINT 
nsum= "Sample Size" mean= "Adjusted Mean" 
semean="Standard Error"
/ 
nohead notime
style=nchs 
nsumfmt=F7.0 
meanfmt=F9.2 
semeanfmt=F9.3 ;

Use the print statement to assign names and format the desired statistics and to view the output. If the print statement is used alone, all of the default statistics will be printed with default labels and formats.

In this example, the sample size (nsum), adjusted mean (mean), and standard error of the mean (semean) are requested.

Note: For a complete list of statistics that can be requested on the print statement see SUDAAN Users Manual.

rtitle " Age-adjusted means & standard errors of body mass index: NHANES 1999-2002";

Use the rtitle statement to assign a heading to the output.

rfootnote "Age adjusted by the direct method to the year 2000 Census population projections using the age groups 20-39, 40-59, and 60+" ;

Use the rfootnote statement to specify a footnote to the tables.
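Conceptually, an age-adjusted mean applies the same standard proportions to weighted age-specific means. The Python sketch below shows only the point estimate, under stated assumptions: the records are invented, the age variable is assumed to be coded 1 to 3 in the order of the stdwgt values, and no standard errors are computed (the procedure above derives those from the survey design).

```python
from collections import defaultdict

# Standard proportions keyed by an assumed age coding: 1 = 20-39, 2 = 40-59, 3 = 60+.
STD_WEIGHTS = {1: 0.3966, 2: 0.3718, 3: 0.2316}

def age_adjusted_mean(records, std_weights):
    """records: iterable of (age_level, survey_weight, value) tuples.

    Every age level in std_weights must appear in records, otherwise the
    within-level mean is undefined.
    """
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for age_level, survey_weight, value in records:
        weighted_sum[age_level] += survey_weight * value
        weight_total[age_level] += survey_weight
    # Weighted mean within each age level, then the standardized sum.
    return sum(
        std_weights[level] * weighted_sum[level] / weight_total[level]
        for level in std_weights
    )

# Hypothetical records: (age level, survey weight, BMI).
toy_records = [
    (1, 1000.0, 24.0), (1, 3000.0, 26.0),
    (2, 2000.0, 28.0), (2, 2000.0, 30.0),
    (3, 1500.0, 27.0), (3, 500.0, 31.0),
]
adjusted_mean = age_adjusted_mean(toy_records, STD_WEIGHTS)
print(f"Age-adjusted mean BMI: {adjusted_mean:.2f}")
```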

IMPORTANT NOTE

Note: To calculate the unadjusted mean, use the program code above but omit the stdvar and stdwgt statements.

Highlights from the output include:

  • The output lists the sample sizes, adjusted means, and their standard errors.
  • After adjusting for age, there appears to be no significant difference in BMI by race/ethnicity.
  • The unadjusted and age adjusted means for BMI by race/ethnicity do not appear to be different.

Task 1b: How to Generate Age-Adjusted Prevalence Rates and Means Using SAS 9.2 Survey Procedures

In this task, you will generate age-adjusted prevalence rates and standard errors for high blood pressure (HBP) by sex and race in persons 20 years and older. An optional second example is available demonstrating how to generate age-adjusted means and standard errors for Body Mass Index (BMI) by sex and race/ethnicity for persons 20 years and older.

To calculate age-adjusted prevalence rates, you need the age-standardizing proportions, which you then apply to the populations under comparison. This is the direct method of age standardization. For NHANES, NCHS recommends using the 2000 Census population as the standard (Klein and Schoenborn, 2001). This task uses the same standard age proportions as Task 1a, where their calculation from the Census counts is shown: 0.3966 for ages 20-39, 0.3718 for ages 40-59, and 0.2316 for ages 60 and older.

Step 1: Recode High Blood Pressure Variable

You will recode the discrete variable, hbp, as (0, 100), for absence (0) or presence (100) of the health condition of interest, to use in the SAS Surveyreg procedure.

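data analysis_data;           /* dataset name assumed; it matches the procedures below */
  set analysis_data;
  if hbp=1 then hbpx=100;     /* high blood pressure present */
  if hbp=2 then hbpx=0;       /* high blood pressure absent */
run;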
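The reason for coding the variable as 0 and 100 is that the weighted mean of such a variable is the weighted percentage of persons with the condition, so a procedure that estimates means can report prevalence directly. A minimal Python sketch of that idea follows; the hbp codes and survey weights are invented for illustration.

```python
def recode_hbp(hbp):
    """Map hbp (1 = present, 2 = absent) to 100 / 0; other values stay missing."""
    return {1: 100, 2: 0}.get(hbp)

def weighted_mean(values, weights):
    """Weighted mean that skips missing (None) values."""
    pairs = [(v, w) for v, w in zip(values, weights) if v is not None]
    return sum(v * w for v, w in pairs) / sum(w for _, w in pairs)

# Hypothetical hbp codes and survey weights, for illustration only.
hbp_codes = [1, 2, 2, 1, None, 2]
survey_weights = [2000.0, 1000.0, 3000.0, 1000.0, 500.0, 3000.0]

hbpx = [recode_hbp(code) for code in hbp_codes]
prevalence = weighted_mean(hbpx, survey_weights)
print(hbpx)
print(f"Weighted prevalence: {prevalence:.1f}%")
```

Here the weights of persons with the condition total 3,000 out of 10,000 non-missing, so the weighted mean of hbpx is 30, which reads directly as 30 percent.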

Step 2: Generate Age-Adjusted Prevalence Rates

The SAS Surveyreg procedure is used to generate age-adjusted percentages (prevalence rates) and standard errors. The SAS Survey program used to obtain weighted age-adjusted prevalence rates and standard errors for high blood pressure by race, among persons 20 years and older follows here.

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

SAS Survey Procedure for Generating Age-Adjusted Prevalence Rates
Statements Explanation
PROC SURVEYREG DATA=analysis_data nomcar;

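Use the SAS Survey procedure, proc surveyreg, to fit the regression model from which the age-adjusted estimates and their standard errors are computed. The nomcar option treats missing values as not missing completely at random, so that observations with missing values are accounted for in the variance estimation.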

STRATA sdmvstra;

Use the strata statement to specify the strata (sdmvstra) and account for design effects of stratification.

CLUSTER sdmvpsu;

Use the cluster statement to specify PSU (sdmvpsu) to account for design effects of clustering.

CLASS race age;

Use the class statement to specify the discrete variables used to select the subpopulations of interest (i.e., race [race] and age [age]).

WEIGHT wtmec4yr;

Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for 4 years of data (wtmec4yr) is used.

DOMAIN sel;

Use the domain statement to specify the subpopulations of interest. In this example, sel identifies persons 20 years and older.

WARNING

When using proc surveyreg, use a domain statement to select the population of interest. Do not use a where or by-group statement to analyze subpopulations with the SAS Survey Procedures.

MODEL hbpx=race age race*age /noint solution VADJUST=none;

Use a model statement with the noint option to produce HBP means for the 12 possible race and age combinations (race has four groups and age has three, and 4 × 3 = 12). The solution option prints the parameter estimates. The vadjust=none option requests that no variance adjustment be applied.

ESTIMATE 'NH White' race 1 0 0 0 age .3966 .3718 .2316 race*age .3966 .3718 .2316 0 0 0 0 0 0 0 0 0;

Use the estimate statement to produce the age-adjusted prevalence of HBP for non-Hispanic whites. Please refer to the estimate statement in the SAS Manual for more information about specifying coefficient vectors. The coefficients 1 0 0 0 after race select non-Hispanic whites; the coefficients .3966, .3718, and .2316 are the proportions of adults aged 20-39, 40-59, and 60+ years in the standard U.S. population (Klein and Schoenborn, 2001). In the race*age list, these proportions occupy the three positions that belong to non-Hispanic whites, and the remaining nine positions are 0.

ESTIMATE 'NH Black' race 0 1 0 0 age .3966 .3718 .2316 race*age 0 0 0 .3966 .3718 .2316 0 0 0 0 0 0;

Use the estimate statement to produce the age-adjusted prevalence of HBP for non-Hispanic blacks. The coefficients 0 1 0 0 after race select non-Hispanic blacks; the coefficients .3966, .3718, and .2316 are the proportions of adults aged 20-39, 40-59, and 60+ years in the standard U.S. population (Klein and Schoenborn, 2001).

ESTIMATE 'Mex Amer' race 0 0 1 0 age .3966 .3718 .2316 race*age 0 0 0 0 0 0 .3966 .3718 .2316 0 0 0;

Use the estimate statement to produce the age-adjusted prevalence of HBP for Mexican-Americans. The coefficients 0 0 1 0 after race select Mexican-Americans; the coefficients .3966, .3718, and .2316 are the proportions of adults aged 20-39, 40-59, and 60+ years in the standard U.S. population (Klein and Schoenborn, 2001).

TITLE 'Age-standardized prevalence of persons 20 years and older with high blood pressure: NHANES 1999-2002';

Use the title statement to label the output.

proc print data=ageadj_prev1;
var estimate label estimate stderr;
title 'Age-standardized prevalence of persons 20 years and older with high blood pressure: NHANES 1999-2002';
run;

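Use the proc print procedure to print the estimates and standard errors stored in the dataset ageadj_prev1 (the statement that saves the estimate results to this dataset is not shown here).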
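The coefficient lists on the estimate statements follow a regular pattern: a 1 in the position of the chosen race, the three standard proportions for age, and the same three proportions in that race's block of the race*age list. Because the proportions sum to 1 and the model has no intercept, each estimate works out to the sum over age groups of the standard proportion times the fitted mean for that race and age group. The Python sketch below only generates the statement text; the function name is hypothetical, and it assumes age varies fastest within race, as in the statements above.

```python
STD_WEIGHTS = [0.3966, 0.3718, 0.2316]  # ages 20-39, 40-59, 60+ (from this module)
N_RACE = 4                              # number of race levels on the class statement

def estimate_coefficients(race_index, weights=STD_WEIGHTS, n_race=N_RACE):
    """Build the race, age, and race*age coefficient lists for one race level.

    race_index is zero-based (0 = first race level). The race*age list assumes
    age varies fastest within race, matching the estimate statements above.
    """
    race = [1 if r == race_index else 0 for r in range(n_race)]
    age = list(weights)
    race_by_age = []
    for r in range(n_race):
        race_by_age.extend(weights if r == race_index else [0] * len(weights))
    return race, age, race_by_age

def fmt(values):
    """Join coefficients with spaces, e.g. '0 1 0 0'."""
    return " ".join(f"{v:g}" for v in values)

for label, idx in [("NH White", 0), ("NH Black", 1), ("Mex Amer", 2)]:
    race, age, race_by_age = estimate_coefficients(idx)
    print(f"ESTIMATE '{label}' race {fmt(race)} age {fmt(age)} "
          f"race*age {fmt(race_by_age)};")
```

Generating the lists this way mainly guards against misplacing the block of proportions within the twelve race*age positions, which is easy to do by hand.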

IMPORTANT NOTE

Note: Program code to produce age-adjusted estimates by race-ethnicity is provided above. To see program code to produce age-adjusted estimates by race-ethnicity and gender and for gender only, please go to the Sample Code and Datasets page to download the programs.

The code for estimating the crude (unadjusted) prevalence for HBP by race/ethnicity and gender follows:

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

SAS Survey Procedure for Generating Unadjusted Prevalence Rates

Statements Explanation
proc surveymeans data=analysis_data mean nobs stderr;

Use the proc surveymeans procedure to obtain the mean (mean), number of observations (nobs), and standard error (stderr).

strata sdmvstra;

Use the strata statement to define the strata variable (sdmvstra).

cluster sdmvpsu;

Use the cluster statement to define the PSU variable (sdmvpsu).

class riagendr race;

Use the class statement to specify the discrete variables used to select the subpopulations of interest (i.e., gender [riagendr] and race [race]).

weight wtmec4yr;

Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for 4 years of data (wtmec4yr) is used.

var hbpx;

Use the var statement to specify which variable(s) will be analyzed. In this example, the high blood pressure variable (hbpx) is used.

domain sel sel*riagendr sel*race sel*riagendr*race;

Use the domain statement to specify the subpopulations of interest.

ods OUTPUT domain(match_all)=unadj;
run;

Use the ods statement to output SAS datasets of estimates for the subdomains listed on the domain statement. These commands output four datasets, one for each domain specified in the domain statement above (unadj for sel, unadj1 for sel*riagendr, unadj2 for sel*race, and unadj3 for sel*riagendr*race).

data stats;
set unadj unadj1 unadj2 unadj3;
if sel=1;
run;

Use the data step to create the temporary SAS dataset stats by appending the four datasets created in the previous step, keeping only the rows where sel=1 (age greater than or equal to 20).

proc print;
var race riagendr n mean stderr;
run;

Use the proc print procedure to print the number of observations, the mean, and the standard error of the mean.

Highlights from the output include:

  • The output lists the adjusted percentages (prevalence rates) and their standard errors.
  • Hypertension prevalence in Mexican Americans changed from a crude prevalence of 17% to 26% in the age-standardized estimate. The Mexican Americans are younger with a mean age of 38 years, and because hypertension increases with age, the age-adjusted estimate among Mexican Americans is higher than the crude estimate.
  • Similarly, the non-Hispanic whites are somewhat older, so their age-adjusted prevalence is lower than the crude estimate.
  • According to the unadjusted estimates, HBP prevalence differs between the Mexican-American and non-Hispanic white groups by approximately 12 percentage points.
  • However, the age-adjusted estimates show a difference of only about 2 percentage points between the Mexican-American and non-Hispanic white groups.
  • Non-Hispanic blacks have a higher age-adjusted prevalence of HBP (41%) than other race/ethnicity groups.

Step 3: Generate Age-Adjusted Means (Optional)

The SAS Surveyreg procedure is used to generate age-adjusted means and standard errors. The SAS Survey program used to obtain weighted adjusted means and standard errors for BMI, by race among persons 20 years and older follows here.

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

SAS Survey Procedure for Generating Adjusted Means
Statements Explanation
PROC SURVEYREG DATA=analysis_data nomcar;

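Use the SAS Survey procedure, proc surveyreg, to fit the regression model from which the age-adjusted means and their standard errors are computed. The nomcar option treats missing values as not missing completely at random, so that observations with missing values are accounted for in the variance estimation.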

STRATA sdmvstra;

Use the strata statement to specify the strata (sdmvstra) and account for design effects of stratification.

CLUSTER sdmvpsu;

Use the cluster statement to specify PSU (sdmvpsu) to account for design effects of clustering.

CLASS race age;

Use the class statement to specify the discrete variables used to select the subpopulations of interest (i.e., race [race] and age [age]).

WEIGHT wtmec4yr;

Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for 4 years of data (wtmec4yr) is used.

DOMAIN sel;

Use the domain statement to specify the subpopulations of interest.

MODEL bmxbmi=race age race*age /noint solution vadjust=none;

Use a model statement with the noint option to produce BMI means for the 12 possible race and age combinations (race has four groups and age has three, and 4 × 3 = 12). The solution option prints the parameter estimates. The vadjust=none option requests that no variance adjustment be applied.

ESTIMATE 'NH White' race 1 0 0 0 age .3966 .3718 .2316 race*age .3966 .3718 .2316 0 0 0 0 0 0 0 0 0;

Use the estimate statement to produce the age-adjusted mean BMI for non-Hispanic whites. Please refer to the estimate statement in the SAS Manual for more information about using vectors. The vector 1 0 0 0 (vectors are location indicators) points to the non-Hispanic whites; the values .3966, .3718, and .2316 correspond to the proportions of adults age 20-39, 40-59, and 60+ years in the U.S. population (Klein and Schoenborn, 2001).

ESTIMATE 'NH Black' race 0 1 0 0 age .3966 .3718 .2316 race*age 0 0 0 .3966 .3718 .2316 0 0 0 0 0 0;

Use the estimate statement to produce the age-adjusted mean BMI for non-Hispanic blacks. The vector 0 1 0 0 points to the non-Hispanic blacks; the values .3966, .3718, and .2316 correspond to the proportions of adults age 20-39, 40-59, and 60+ years in the U.S. population (Klein and Schoenborn, 2001).

ESTIMATE 'Mex Amer' race 0 0 1 0 age .3966 .3718 .2316 race*age 0 0 0 0 0 0 .3966 .3718 .2316 0 0 0;

Use the estimate statement to produce the age-adjusted mean BMI for Mexican-Americans. The vector 0 0 1 0 points to the Mexican-Americans; the values .3966, .3718, and .2316 correspond to the proportions of adults age 20-39, 40-59, and 60+ years in the U.S. population (Klein and Schoenborn, 2001).

TITLE 'Age-adjusted means & standard errors of body mass index: NHANES 1999-2002';

Use the title statement to label the output.

proc print data=ageadj_mean1;
var estimatelabel estimate stderr;
title 'Age-adjusted means & standard errors of body mass index: NHANES 1999-2002';
run;

Use the proc print procedure to print the estimate and standard error.
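
The layout of the coefficient vectors in the estimate statements above can be hard to read at a glance. The following Python sketch is only an illustration of that layout, not part of the SAS program: it assumes the interaction term has one slot per race-by-age cell with race varying slowest, as the three estimate statements suggest, and the function name estimate_coefficients is hypothetical.

```python
# Standard age proportions for ages 20-39, 40-59, and 60+ (from the tutorial).
AGE_PROPORTIONS = [0.3966, 0.3718, 0.2316]
N_RACE = 4  # race has 4 levels in this example


def estimate_coefficients(race_index):
    """Return (race, age, race*age) coefficient lists for one race group.

    race_index is 0-based: 0 = NH White, 1 = NH Black, 2 = Mex Amer, 3 = Other.
    """
    race = [1 if i == race_index else 0 for i in range(N_RACE)]
    age = list(AGE_PROPORTIONS)
    # One slot per race-by-age cell; only the block that belongs to the
    # chosen race receives the age proportions, all other slots are 0.
    interaction = []
    for i in range(N_RACE):
        if i == race_index:
            interaction.extend(AGE_PROPORTIONS)
        else:
            interaction.extend([0] * len(AGE_PROPORTIONS))
    return race, age, interaction


race, age, interaction = estimate_coefficients(1)  # NH Black
print(race, age, interaction)
```

Calling estimate_coefficients(0) or estimate_coefficients(2) gives the vectors used for the 'NH White' and 'Mex Amer' statements in the same way.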

IMPORTANT NOTE

Note: Program code to produce age-adjusted estimates by race-ethnicity is provided above. To see program code to produce age-adjusted estimates by race-ethnicity and gender and for gender only, please go to the Sample Code and Datasets page to download the programs.

The code for estimating the crude (unadjusted) mean Body Mass Index by race/ethnicity and gender follows:

SAS Survey Procedure for Generating Unadjusted Means

Statements Explanations
proc surveymeans data=ANALYSIS_DATA nobs mean stderr;

Use the proc surveymeans procedure to obtain the number of observations, mean, and standard error.

strata sdmvstra;

Use the strata statement to define the strata variable (sdmvstra).

cluster sdmvpsu;

Use the cluster statement to define the PSU variable (sdmvpsu).

class riagendr race;
Use the class statement to specify the discrete variables used to select the subpopulations of interest (i.e., gender [riagendr] and race [race]).
var bmxbmi;
Use the var statement to specify which variable(s) will be analyzed. In this example, the Body Mass Index variable (bmxbmi) is used.
weight wtmec4yr;
Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for 4 years of data (wtmec4yr) is used.
domain sel sel*riagendr sel*race sel*riagendr*race;
Use the domain statement to specify the subpopulations of interest.
ods OUTPUT domain(match_all)=unadj;
run;

Use the ods statement to output SAS datasets of estimates from the subdomains listed on the domain statement. This set of commands will output four datasets, one for each domain specified in the domain statement above (unadj for sel, unadj1 for sel*riagendr, unadj2 for sel*race, and unadj3 for sel*riagendr*race).

data stats;
set unadj unadj1 unadj2 unadj3;
if sel=1;

Use the data statement to name the temporary SAS dataset (stats), append the four datasets created in the previous step, and keep only persons age 20 years and older (sel=1).

proc print;
var race riagendr n mean stderr;
title "Mean Body Mass Index: NHANES 1999-2002";
run;

Use the proc print procedure to print the number of observations, the mean, and the standard error of the mean in a printer-friendly format.

Highlights from the output include:

  • The output lists the sample sizes, adjusted means, and their standard errors.
  • After adjusting for age, there appears to be no significant difference in BMI by race/ethnicity.
  • The unadjusted and adjusted means for BMI by race/ethnicity do not appear to be different.

Task 1c: How to Generate Age-Adjusted Proportions or Prevalence Rates and Means Using Stata

In this module, you will generate age-adjusted prevalence rates and standard errors for high blood pressure (HBP) in persons 20 years and older in the United States by sex and race/ethnicity. An optional second example is available demonstrating how to generate age-adjusted means and standard errors for Body Mass Index (BMI) in persons 20 years and older in the United States by sex and race/ethnicity.

To calculate age-adjusted prevalence rates, you will need to know the age standardizing proportions that you want to use, and then apply them to the populations under comparison. This is called the direct method for age standardization. Typically, Census data are used as the standard population structure. For age standardization in NHANES, the National Center for Health Statistics (NCHS) recommends using the 2000 Census population. A spreadsheet with the year 2000 U.S. population structure by age is attached below. Calculate the standard age proportions by dividing the age-specific Census population (P) by the total Census population number (T). The standardizing proportions (P/T) should sum to 1 (see the table below in Step 2 for the standard age proportions used in this module.)
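
As a rough illustration of the direct method described above, the following Python sketch applies one set of age-specific prevalences to two different age distributions. The age-specific prevalences and the "younger population" age mix are hypothetical values chosen for illustration, not NHANES results; only the standard proportions come from this module. The sketch also ignores survey weights and design effects, which the survey commands in the steps below handle.

```python
# Standard age proportions (2000 Census; ages 20-39, 40-59, 60+) from this module.
STD_WEIGHTS = [0.3966, 0.3718, 0.2316]

# HYPOTHETICAL inputs, for illustration only (not NHANES results):
age_specific_prev = [10.0, 30.0, 60.0]  # % with the condition in each age group
own_age_dist = [0.60, 0.30, 0.10]       # a younger population's own age mix


def weighted_rate(rates, weights):
    """Sum of age-specific rates times age weights (weights should sum to 1)."""
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("weights must sum to 1")
    return sum(r * w for r, w in zip(rates, weights))


# Crude rate: age-specific rates weighted by the population's own age mix.
crude = weighted_rate(age_specific_prev, own_age_dist)
# Age-adjusted rate: the same rates weighted by the standard population.
age_adjusted = weighted_rate(age_specific_prev, STD_WEIGHTS)
print(f"crude = {crude:.1f}%, age-adjusted = {age_adjusted:.1f}%")
```

Because the hypothetical population is younger than the standard population, its age-adjusted rate comes out higher than its crude rate, which is the same direction of change the tutorial reports for Mexican Americans.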

Step 1: Use svyset to define survey design variables

Remember that you need to define the SVYSET before using the SVY series of commands. The general format of this command is below:

svyset [w=weightvar], psu(psuvar) strata(stratavar) vce(variance method)

To define the survey design variables for your high blood pressure analysis, use the weight variable for 4 years of MEC data (wtmec4yr), the PSU variable (sdmvpsu), and the strata variable (sdmvstra). The vce option specifies the method for calculating the variance; the default is "linearized", which is Taylor linearization. Here is the svyset command for 4 years of MEC data:

svyset [w= wtmec4yr], psu(sdmvpsu) strata(sdmvstra) vce(linearized)

Step 2: Create age standard proportions

For age standardization in NHANES, NCHS recommends using the 2000 Census population. To get the correct Census age distribution, you need to know two things: the age group of interest (e.g. all ages, ages 6 and older, adults 20 and older) and how wide the age strata are for adjustment (e.g. 5 year, 10 year or 20 year age intervals). In general, the more tightly you want to control for age, the narrower the age strata should be.

Attachment

For your convenience, standard proportions for different NHANES population age groupings are provided in the Excel spreadsheet attached below. This file uses the 2000 Census as the standard population. The adjustment factors were calculated for four age groupings:

  1. all ages,
  2. ages 6 years and older,
  3. ages 20 years and older using 10 year age intervals, and
  4. for the blood pressure example in this module, for ages 20 years and older using 20 year age intervals.

For other age groupings, you can combine the smaller age groups provided in order to reflect the age and subpopulation you are using in your analysis.

Standard Proportions for NHANES Population Groupings link: ageadjtwt.xls

Example of How to Calculate Standard Age Proportions

Here is an example of how to calculate the standard age proportions by dividing the age-specific Census population (P) by the total Census population number (T). The standardizing proportions should sum to 1.

Standard Proportions for 20-year Age Groups Based on the 2000 U.S. Census Standard Population
Age Group | Age-Specific Census Population, P (in thousands) | Total Census Population, T (in thousands) | Standard Age Proportion, P/T
20-39 | 77,670 | 195,850 | .396579
40-59 | 72,816 | 195,850 | .371795
60+ | 45,364 | 195,850 | .231626
Total | 195,850 | | Sum: 1

As you can see each "standard age proportion", also referred to as "age adjustment weight", is simply the proportion of people in the 2000 Census - the standard population - in a specific age category. For example, the standard age proportion for people 20-39 years old is:

$$\frac{P_{20\text{-}39}}{T} = \frac{77{,}670 \text{ thousand people age 20-39 years}}{195{,}850 \text{ thousand people age 20+ years}} = 0.396579$$
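
The same P/T calculation can be repeated for all three age groups. This short Python sketch uses the Census counts from the table above; the variable names are illustrative only.

```python
# Census 2000 standard population, in thousands (from the table above).
census_pop = {"20-39": 77_670, "40-59": 72_816, "60+": 45_364}

total = sum(census_pop.values())  # T, the total for ages 20 and older
proportions = {group: p / total for group, p in census_pop.items()}  # P/T

for group, prop in proportions.items():
    print(f"{group}: {prop:.6f}")
print(f"sum = {sum(proportions.values()):.6f}")
```

Rounding these proportions to four decimal places gives the .3966, .3718, and .2316 values used as standard weights in this module.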

Reference

Klein RJ, Schoenborn CA. Age adjustment using the 2000 projected U.S. population. Healthy People Statistical Notes, no. 20. Hyattsville, Maryland: National Center for Health Statistics. January 2001.

Step 3: Create new variables for analysis

You will need to create variables for age, race, standard weight, and high blood pressure. First, create variables to apply the age standard proportions and race/ethnicity groups to your analyses. Next, assign the census proportion for each corresponding age stratum using the std_wgt variable. This variable is usually referred to as the standard weight in statistical manuals. Finally, code the outcome variable as a dichotomous variable, where the absence of the outcome is coded as 0 and the presence of the outcome is coded as 100. Using 100 will express the proportion as a percentage (e.g., 0.23 would be represented as 23). The dichotomous variable, hbp, is already coded as 2 for the absence of the outcome and 1 for the presence of the outcome, so it will need to be recoded as a new variable, hbpx. Here is the code for creating the variables:

Code to generate variables
Variable Code to generate variables
Age
gen age=1 if ridageyr >=20 & ridageyr <40
replace age=2 if ridageyr >=40 & ridageyr <60
replace age=3 if ridageyr >=60 & ridageyr <.
Race
gen race =1 if ridreth1 == 3 
replace race =2 if ridreth1 == 4 
replace race =3 if ridreth1 == 1 
replace race =4 if ridreth1 == 2 | ridreth1 ==5
Standard Weight
gen std_wgt=.3966 if age==1
replace std_wgt=.3718 if age==2
replace std_wgt=.2316 if age==3
High Blood Pressure
gen hbpx=100 if hbp==1
replace hbpx=0 if hbp==2
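
For readers who want to check the recoding logic outside Stata, here is a minimal Python sketch of the same rules. It is an illustration only: the function names are hypothetical, and missing values are represented as None rather than Stata's missing value.

```python
STD_WGT = {1: 0.3966, 2: 0.3718, 3: 0.2316}  # age group -> standard proportion


def age_group(ridageyr):
    """Return 1 (20-39), 2 (40-59), 3 (60+), or None if under 20 or missing."""
    if ridageyr is None or ridageyr < 20:
        return None
    if ridageyr < 40:
        return 1
    if ridageyr < 60:
        return 2
    return 3


def recode_race(ridreth1):
    """Collapse ridreth1 into the four race groups used in this module."""
    return {3: 1, 4: 2, 1: 3, 2: 4, 5: 4}.get(ridreth1)


def recode_hbp(hbp):
    """Map hbp (1 = present, 2 = absent) to hbpx (100 or 0); otherwise None."""
    return {1: 100, 2: 0}.get(hbp)


# Example record with made-up values.
group = age_group(45)
print(group, STD_WGT[group], recode_race(3), recode_hbp(1))
```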

Step 4: Generate age-adjusted proportions

The Stata command, svy: mean, is used to generate age-adjusted proportions and standard errors. Using svy:mean is not a mistake - a proportion is the mean of a dichotomous variable.

The general form of the command is just like the mean command from descriptive statistics but uses the stdize and stdweight options.

svy, subpop(condition): mean depvar, stdize(agevar) stdweight(ageweightvar)

Here is the Stata command and output for the age-adjusted prevalence of high blood pressure and standard errors for men and women age 20 years and older:

svy, subpop(if ridageyr >=20 & ridageyr <.): mean hbpx, stdize(age) stdweight(std_wgt) over(riagendr)

Stata output of mean high blood pressure by gender

And here is the Stata command and output for the age-adjusted prevalence of high blood pressure and standard errors by race/ethnicity (non-Hispanic white, non-Hispanic black, Mexican American, and other) for persons age 20 years and older.

svy, subpop(if ridageyr >=20 & ridageyr <.): mean hbpx, stdize(age) stdweight(std_wgt) over(race)

Stata output of mean high blood pressure by race/ethnicity

IMPORTANT NOTE

To calculate the unadjusted prevalence, use the program code above, EXCEPT DO NOT USE the stdize and stdweight options.

Step 5: Compare results of crude and age-adjusted estimates

To understand how much age standardization matters, it is helpful to compare the estimates from the crude and age adjusted analyses. The following table summarizes the results:

Crude proportion with hypertension
Variable Mean Age % with hypertension Standard error
Gender
Male 45 27% 1.22
Female 47 31% 1.07
Race
Non-Hispanic white 48 30% 1.15
Non-Hispanic black 43 37% 1.51
Mexican American 38 17% 1.21
Other 43 28% 2.30
Age-Adjusted proportion with hypertension
Variable Mean Age % with hypertension Standard error
Gender
Male 45 28% 1.21
Female 47 30% 0.71
Race
Non-Hispanic white 48 28% 0.97
Non-Hispanic black 43 41% 1.05
Mexican American 38 26% 0.98
Other 43 31% 2.25

Highlights from the output include:

  • Using the more appropriate age-adjusted analyses for subgroup comparisons, non-Hispanic blacks have a higher prevalence of HBP (41%) than do other race/ethnic groups.
  • Hypertension prevalence in Mexican Americans changed from a crude prevalence of 17% to 26% in the age-standardized estimate. The Mexican Americans are younger with a mean age of 38 years, and because hypertension increases with age, the age-adjusted estimate among Mexican Americans is higher than the crude estimate.
  • Similarly, the non-Hispanic whites are somewhat older, so their age-adjusted prevalence is lower than the crude estimate.
  • According to the unadjusted estimates, the difference between HBP prevalence for Mexican-American and non-Hispanic white groups is approximately 12 percentage points.
  • However, the age-adjusted estimates show a difference of only about 2 percentage points between HBP prevalence for Mexican-American and non-Hispanic white groups.

Stata Commands for Generating Age-Adjusted Prevalence Rates
Statements Explanations
use "C:\nhanes\data\analysis_data.dta", clear
Use the use command to load the Stata-format dataset. Use the clear option to replace any data in memory.
svyset sdmvpsu [pweight=wtmec4yr], strata(sdmvstra) vce(linearized)
Use the svyset command to declare the survey design for the dataset. Specify the psu variable sdmvpsu. Use the [pweight=] option to account for the unequal probability of sampling and non-response. In this example, the MEC weight for four years of data (wtmec4yr) is used. Use the strata ( ) option to specify the stratum identifier (sdmvstra). Use the vce( ) option to specify the variance estimation method (linearized) for Taylor linearization.
svy, subpop(if ridageyr >=20 & ridageyr < .): mean hbpx, stdize(age) stdweight(std_wgt) over(riagendr race)
Use the svy : mean command with the high blood pressure variable (hbpx) to estimate the prevalence of HBP. Use the subpop( ) option to select a subpopulation for analysis, rather than select the study population in the Stata program while preparing the data file. Use the stdize( ) and stdweight( ) options to yield standardized estimates of the mean. In the example, age is the standardizing variable as defined by the stdize( ) statement. The stdweight( ) option specifies the population proportions based on the 2000 Census estimates as defined by the variable std_wgt. Use the over( ) option to specify subgroup cross-tabulations for estimates requested. In this example, gender (riagendr), and race-ethnicity (race) are of interest.

Optional Step: Generate age-adjusted means

To generate age-adjusted means, follow Steps 1-3 of How to Generate Age-Adjusted Proportions or Prevalence Rates and Means Using Stata (see above).

The Stata command, svy: mean, is used to generate age-adjusted means and standard errors.

The general form of the command is like the svy:mean command used in the Descriptive Statistics module, but uses the stdize and stdweight options:

svy, subpop(condition): mean depvar, stdize(agevar) stdweight(ageweightvar)

Here is the Stata command and output for the age-adjusted mean Body Mass Index (BMI) (and standard errors) for men and women age 20 years and older:

svy, subpop(if ridageyr >=20 & ridageyr <.): mean bmxbmi, stdize(age) stdweight(std_wgt) over(riagendr)

Stata output of age-adjusted estimates of mean Body Mass Index by gender

And here is the Stata command and output for the age-adjusted mean BMIs and standard errors for the race/ethnicity groups (non-Hispanic white, non-Hispanic black, Mexican American, and other) age 20 years and older.

svy, subpop(if ridageyr >=20 & ridageyr <.): mean bmxbmi, stdize(age) stdweight(std_wgt) over(race)

Stata output of age-adjusted estimates of mean Body Mass Index by race

Crude estimates of mean BMI
Variable Mean Age Mean BMI Standard error
Gender
Male 45 27.8 0.13
Female 47 28.2 0.17
Race
Non-Hispanic white 48 27.8 0.16
Non-Hispanic black 43 29.5 0.20
Mexican American 38 28.2 0.19
Other 43 27.7 0.34
Age-adjusted estimates of mean BMI
Variable Mean Age Mean BMI Standard error
Gender
Male 45 27.8 0.13
Female 47 28.2 0.17
Race
Non-Hispanic white 48 27.8 0.16
Non-Hispanic black 43 29.6 0.19
Mexican American 38 28.5 0.20
Other 43 27.7 0.33

IMPORTANT NOTE

To calculate the unadjusted means, use the program code above, EXCEPT DO NOT USE the stdize and stdweight options.

Highlights from the output include:

  • After adjusting for age, there appears to be no significant difference in BMI by race/ethnicity.
  • The unadjusted and age adjusted means for BMI by race/ethnicity do not appear to be different.

Population Counts

Calculating population counts for a given condition from NHANES follows these steps:

  1. Calculate the percentage who have the outcome or characteristic by age, sex, or race/ethnicity subgroups, in which you are interested. You will output these results to a SAS or Stata file.

IMPORTANT NOTE

Note: Age standardization of the prevalence estimates is NOT performed because the population counts should be based on the crude (unadjusted) prevalence in the population.

  2. Use the relevant population totals from the Current Population Survey (CPS) to determine population estimates in NHANES. Since NHANES is a nationally representative survey of the non-institutionalized U.S. population, population estimates are based on the CPS totals for this aspect of the U.S. population. Use CPS totals for the midpoint of each survey cycle. CPS-based population tables for NHANES by race/ethnicity, gender and age are located at: https://wwwn.cdc.gov/nchs/nhanes/ResponseRates.aspx.

IMPORTANT NOTE

Note: Population totals generated in NHANES can only be representative of the number of individuals with the health condition in the non-institutionalized U.S. population.

  3. If you wish to report multiple age, gender or race/ethnic subgroups, you can combine these population totals. It is also possible to combine NHANES survey cycles. For example, to combine two survey cycles (e.g., 2001-2002 and 2003-2004), you must use the midpoint of each cycle, and combine them as follows: ½ (NHANES 2001-2002 population totals) + ½ (NHANES 2003-2004 population totals) in order to get a population total for 2001-2004. Similarly, you would do this for each of the age, sex, or race/ethnicity groups you wanted to combine to get a population total for that group. Once CPS totals are combined, results should be output to a file.

IMPORTANT NOTE

Note: The only exception would be when combining NHANES 1999-2000 with 2001-2002 data. As stated in the weighting module, these survey years used a different reference population for sampling, so population totals for 1999-2002 are provided by NCHS.

  4. Multiply the prevalence of the health condition of interest by the corresponding CPS-based population total to obtain an estimate of the number of non-institutionalized U.S. individuals with the condition. To calculate age-, sex-, or race/ethnicity-specific population estimates, multiply the prevalence of the health condition in each sub-domain by the CPS population total for the respective sub-domain.
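
The last two steps of the list above (averaging the CPS totals of two survey cycles, then multiplying by the crude prevalence) amount to simple arithmetic. The Python sketch below shows that arithmetic under stated assumptions: the CPS totals and the prevalence are hypothetical placeholders, not NCHS figures, and the function names are illustrative.

```python
def combine_cycles(total_cycle_a, total_cycle_b):
    """Average two 2-year CPS totals: 1/2 * A + 1/2 * B."""
    return 0.5 * total_cycle_a + 0.5 * total_cycle_b


def population_count(prevalence_pct, cps_total):
    """Crude prevalence (in percent) times the matching CPS population total."""
    return prevalence_pct / 100.0 * cps_total


# HYPOTHETICAL CPS totals for one subgroup (placeholders, not NCHS figures).
cps_2001_2002 = 196_000_000
cps_2003_2004 = 200_000_000

combined_total = combine_cycles(cps_2001_2002, cps_2003_2004)
# HYPOTHETICAL crude prevalence of 25% for the same subgroup.
count = population_count(25.0, combined_total)
print(f"combined total = {combined_total:,.0f}, estimated count = {count:,.0f}")
```

The same two functions would be applied separately to each age, sex, or race/ethnicity sub-domain, always pairing a sub-domain's crude prevalence with that sub-domain's own CPS total.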

Since the non-institutionalized CPS population totals are used to calculate the final sampling weights for the NHANES survey, you may wonder why you cannot just sum the final sampling weights for all sample persons with the health condition of interest, in order to arrive at population estimates for the health condition. For example, the total population estimate for a given health condition from the interviewed sample should equal the sum of the final interview weights for that health condition within the demographic domains among all interviewed persons. However, if there are a significant number of exclusions or missing data for a health condition, summing the weights will not produce an accurate population estimate. Therefore, using this method is NOT RECOMMENDED. The differences in population estimates by the calculated method versus the summed weight method are illustrated in the table below.

Comparison of Population Estimates using Calculated and Summed Methods
Sample Domain | % of U.S. Population with HBP | Calculated Method (Correct Estimate) | Summed-Weight Method (Incorrect Estimate)
Total | 29.2% | 57,859,000 | 55,362,000
Male | 27.3% | 25,844,000 | 24,855,000
Female | 30.9% | 32,039,000 | 30,506,000
Non-Hispanic Blacks | 37.0% | 8,103,000 | 7,277,000
Mexican American | 17.1% | 2,409,000 | 2,182,000

IMPORTANT NOTE

DO NOT use the summed weight method to determine population estimates for a given health condition because the potential for exclusions or missing data for that health condition may lead to population underestimates.
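
To see how large the underestimate is in each domain, the short Python sketch below subtracts the summed-weight estimate from the calculated estimate, using the figures in the comparison table above. It is only a convenience calculation; the variable names are illustrative.

```python
# (calculated-method estimate, summed-weight estimate) from the table above.
estimates = {
    "Total": (57_859_000, 55_362_000),
    "Male": (25_844_000, 24_855_000),
    "Female": (32_039_000, 30_506_000),
    "Non-Hispanic Blacks": (8_103_000, 7_277_000),
    "Mexican American": (2_409_000, 2_182_000),
}

shortfall = {}
for domain, (calculated, summed) in estimates.items():
    diff = calculated - summed             # people missed by summing weights
    pct = 100.0 * diff / calculated        # shortfall relative to the correct estimate
    shortfall[domain] = (diff, pct)
    print(f"{domain}: {diff:,} fewer people ({pct:.1f}% lower)")
```

In every domain shown, the summed-weight estimate falls below the calculated estimate, which is consistent with the warning above about exclusions and missing data.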

Task 2a: How to Generate Population Counts in SUDAAN

In this example, you will use SUDAAN to combine age subgroups and generate population estimates for high blood pressure (HBP) by sex and race/ethnicity for persons 20 years and older. The method outlined in this module uses a SAS data file with CPS population totals. The process for combining subgroups and calculating population estimates is then automated using the code outlined below.

Alternatively, you can use the CPS population totals located on the respective survey cycle NHANES web page (referred to in Key Concepts), plus the results from a proc descript or proc crosstab procedure and manually calculate population estimates within a spreadsheet. If you choose this option, you will need to define the age, race/ethnicity and gender subgroups of interest and calculate population totals within the spreadsheet on your own.

Step 1: Calculate Prevalence of the Health Condition of Interest

The proc descript procedure in SUDAAN and additional SAS code will be used to generate population estimates using the 3-step process outlined below.

In Step 1 of this method you will calculate the prevalence of the health condition of interest (i.e., HBP) by each sub-domain of interest (i.e., sex and race/ethnicity) using the proc descript procedure. You will need to use appropriate weights, especially when combining across survey cycles.

The health outcome must be coded as a dichotomous variable (hbpx) with the values 0 and 100 for absence (0) or presence (100) of the health condition of interest.

hbpx=.;
if hbp= 1 then hbpx= 100 ;
else if hbp= 2 then hbpx= 0;

Similar to the proc descript procedure used in Task 1, you will output the results to a SAS data file using the output statement, as shown below. Population estimates will not be age-standardized so that they reflect the true population sampled.

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

SUDAAN Procedure for Generating Prevalence Rates
Statements Explanation
proc sort data=analysis_data;
by sdmvstra sdmvpsu;
run ;

Use the proc sort procedure to sort the dataset by strata (sdmvstra) and PSU (sdmvpsu).

PROC descript data=analysis_data design=wr noprint mean
atlevel1=1 atlevel2= 2 ;

Use proc descript to generate means and specify the sample design using the design option WR (with replacement).

subpopn ridageyr >=20 ;

Use the subpopn statement to select sample persons 20 years and older (ridageyr >=20) because only those individuals are of interest in this example.

Note: For accurate estimates, it is preferable to use the subpopn statement in SUDAAN to select a subpopulation for analysis, rather than to select the study population in the SAS program while preparing the data file.

NEST sdmvstra sdmvpsu;

Use the nest statement with strata (sdmvstra) and PSU (sdmvpsu) to account for the design effects.

WEIGHT wtmec4yr;

Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for 4 years of data (wtmec4yr) is used.

subgroup riagendr age race ;

Use the subgroup statement to list the categorical variables for which estimates are requested. These names will also appear in the table statement below. In this example, gender (riagendr), age (age), and race-ethnicity (race) are of interest.

levels 2 3 4;

Use the levels statement to define the number of categories in each of the subgroup variables. The level must be an integer greater than 0. This example has two genders, three age groups, and four race-ethnicity groups.

var hbpx ;

Use the var statement to name the variable(s) to be analyzed. In this example, the high blood pressure variable (hbpx) is used.

table riagendr * race ;

Use the table statement to specify cross-tabulations for which estimates are requested. If a table statement is not present, a one-dimensional distribution is generated for each variable in the subgroup statement. In this example, the estimates are for gender (riagendr) by race-ethnicity (race).

OUTPUT NSUM MEAN SEMEAN atlev2 atlev1 /
FILENAME=nh9902 FILETYPE=SAS REPLACE;
run ;

Use an output statement to output the number of observations (nsum), mean (mean), standard error of the mean (semean), number of strata (atlev1), and number of PSUs (atlev2) to a SAS file named nh9902.

These data will be fed into a SAS program where degrees of freedom, t-statistics, and 95% confidence limits will be calculated for each prevalence estimate, as shown below. Values also will be labeled and formatted.

Calculate Degrees of Freedom, T-statistic, and Confidence Intervals from SAS Output Dataset
Statements Explanation
DATA nh9902c; SET nh9902;
df=atlev2-atlev1;

Use the data statement to create a new dataset (nh9902c) from the SAS dataset created in Step 1 above (nh9902).

Calculate the degrees of freedom (df) by subtracting the number of strata (atlev1) from the number of PSUs (atlev2).

tcritl=tinv( .025 ,df);
tcritu=tinv(.975 ,df);

Use these statements to calculate the lower and upper critical values of the t-distribution (tcritl and tcritu) based on the calculated degrees of freedom (df).

ll=round((mean+tcritl*semean),.01);
ul=round((mean+tcritu*semean),.01);

Use these statements to calculate the lower limit (ll), and upper limit (ul) for the Wald 95% confidence intervals.

percent=round(mean,.01);
sepercent=round(semean,.01);
run ;

Use these statements to round the estimates and rename them to percent and sepercent.
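
The calculations in the data step above can be summarized as the following formulas. The symbols are notation introduced here for illustration: the estimated prevalence (mean) is written as p-hat, its standard error (semean) as SE, and t with subscripts denotes the quantile returned by tinv for the given probability and degrees of freedom. Because the 0.025 quantile is negative, adding it to the mean gives the lower limit.

```latex
\begin{aligned}
df &= n_{\text{PSU}} - n_{\text{strata}} \\
LL &= \hat{p} + t_{0.025,\,df}\,\mathrm{SE}(\hat{p}) \\
UL &= \hat{p} + t_{0.975,\,df}\,\mathrm{SE}(\hat{p})
\end{aligned}
```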

Step 2: Combine CPS Population Totals

In this step, you will combine appropriate CPS population totals across survey cycles AND across years of age to reflect the subpopulation of interest (i.e., those 20 and older by sex and race).

In this module, CPS population totals are supplied as a SAS dataset with values for: age (CTUTAGE) ranging from 0 to 85+ years; gender (CTUTGNDR); race/ethnicity (CTUTRACE), where 1=non-Hispanic white, 2=non-Hispanic black, 3=Mexican American and 4=other; race/ethnicity (CTUTRETH), where 1=Mexican American, 2=non-Hispanic other, 3=non-Hispanic white, 4=non-Hispanic black, 5=other Hispanic; ethnicity (CTUTHISP), where 1=Hispanic and 2=non-Hispanic; survey cycle (CTUTSRVY); and the population total (CTUTPOPT). Appropriate age, race/ethnicity, and gender groups were created in a previous step.

The proc descript procedure in SUDAAN will be used to calculate CPS population totals for the sub-domains of interest (i.e., sex and race/ethnicity) for the subpopulation of interest (age 20 and older). In this case, no sample design factors or weights need to be used. The design specified is SRS (simple random sampling). The variable is CTUTPOPT (the population totals). Subgroup totals are output to another SAS data set (pt9902) for use in the next step. Nothing is printed.

SUDAAN Procedure to Calculate CPS Population Totals
Statements Explanation
PROC descript data= nh.cpstot9902 design=srs noprint means;

Use proc descript to generate means and specify the sample design using the design option SRS (simple random sample). Use the supplied SAS dataset (cpstot9902) to read the CPS population totals.

SUBPOPN ctutage >= 20 ;

Use the subpopn statement to select sample persons 20 years and older (ctutage>=20) because only those individuals are of interest in this example.

Note: For accurate estimates, it is preferable to use the subpopn statement in SUDAAN to select a subpopulation for analysis, rather than to select the study population in the SAS program while preparing the data file.

SUBGROUP ctutgndr ctutrace ;

Use the subgroup statement to list the categorical variables for which estimates are requested. These names will also appear in the table statement below. In this example, gender (ctutgndr) and race-ethnicity (ctutrace) are of interest.

LEVELS 2 4 ;

Use the levels statement to define the number of categories in each of the subgroup variables. The level must be an integer greater than 0. This example has two genders and four race-ethnicity groups.

VAR ctutpopt;

Use the var statement to name the variable(s) to be analyzed. In this example, the population total variable (ctutpopt) is used.

TABLES ctutgndr*ctutrace ;

Use the table statement to specify cross-tabulations for which estimates are requested. If a table statement is not present, a one-dimensional distribution is generated for each variable on the subgroup statement. In this example, the estimates are for gender (ctutgndr) by race-ethnicity (ctutrace).

OUTPUT total / FILENAME=pt9902 FILETYPE=SAS REPLACE;
Run ;

Use an output statement to output the subgroup population totals (total) to a SAS file named pt9902.

Step 3: Multiply Prevalence Estimates with CPS Population Totals

In this last step, you will multiply prevalence estimates by the corresponding CPS population totals to estimate the total number of non-institutionalized U.S. adults with HBP.

Note that the datasets produced in Step 1 and Step 2 will be sorted on the sub-domain variables and merged. The new dataset will be used in the final SAS program. Percent prevalence estimates as well as lower and upper 95% confidence limits will be multiplied by the corresponding population total for that subgroup. Results will be rounded, formatted, and printed in SAS.

Calculate Population Estimates from SAS Output Dataset
Statements Explanation
proc sort data =nh9902c;by riagendr race; run ;
proc sort data =pt9902(rename=(ctutgndr=riagendr ctutrace=race));
by riagendr race;
run ;

Use the proc sort procedure to sort the two datasets by sex and race. In the second dataset, rename the CPS total race and gender (ctutrace and ctutgndr) variables to match the variable names used in the original dataset.

data comb;
merge nh9902c( in =a) pt9902 ;
by riagendr race ;
if a ;

Use the data statement to create a new dataset (comb) by merging the SAS datasets created previously (nh9902c and pt9902).

popmean=(percent/100 )*total ;
popl=ll/100 *total ;
popu=ul/100 *total ;

Use these statements to calculate the population counts by applying the population totals (total) to the prevalence estimate (percent) and the 95% confidence interval limits.

poplr=round(popl,1000);
popur=round(popu,1000);
popmeanr=round(popmean,1000);
totalr=round(total,1000) ;

Use these statements to round and format the estimates to the nearest thousand.

proc print noobs split = '/' double ;
var riagendr race percent sepercent ll ul df nsum
totalr popmeanr poplr popur ;
format race racefmt. riagendr sexfmt. nsum 5.0 percent 5.2 sepercent 5.2
ll 4.2 ul 4.2 df 2.0 ;
label
percent='%/with/high/bp'
nsum='Num/bp/status'
sepercent='Std/error'
ll='Lower/95%/Wald/CI'
ul='Upper/95%/Wald/CI'
df='degrees/of/freedom'
popmeanr='Pop/Est/US/with/high/bp'
totalr='Pop/total/US'
poplr='Pop Est/Lower/95%/WALD/CI'
popur='Pop Est/Upper/95%/WALD/CI' ;
title1 'Prevalence of persons with high Bp - US, 1999-2002' ;
title2 'Percent and population estimates of number with high Bp-Wald CI' ;
run ;

Use the proc print procedure to print the variables of interest.
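
The multiplication and rounding performed in the data step above can be sketched in Python as follows. This is an illustration under stated assumptions, not a replacement for the SAS program: the input values are hypothetical placeholders rather than tutorial output, and the helper round_to_thousand is a hypothetical stand-in for the SAS round(x,1000) call that handles non-negative values only.

```python
def round_to_thousand(x):
    """Round a non-negative number to the nearest 1,000 (halves round up)."""
    return int(x / 1000 + 0.5) * 1000


def population_estimates(percent, ll, ul, total):
    """Apply a prevalence (%) and its CI limits (%) to a population total."""
    return {
        "popmeanr": round_to_thousand(percent / 100 * total),
        "poplr": round_to_thousand(ll / 100 * total),
        "popur": round_to_thousand(ul / 100 * total),
        "totalr": round_to_thousand(total),
    }


# HYPOTHETICAL subgroup values (placeholders, not tutorial output).
result = population_estimates(percent=25.0, ll=22.5, ul=27.5, total=10_000_400)
print(result)
```

Applying the confidence limits to the same population total, as the SAS code does, treats the CPS total as a fixed quantity, so the interval for the count reflects only the uncertainty in the prevalence estimate.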

Highlights from the output include:

  • Nearly 58 million non-institutionalized U.S. adults have high blood pressure, with approximately 26 million adult men and 32 million adult women affected.
  • The number of non-institutionalized U.S. adults with hypertension, by race/ethnicity, is as follows: approximately 42.7 million non-Hispanic whites, 8.1 million non-Hispanic Blacks, and 2.4 million Mexican-Americans.

Task 2b: How to Generate Population Counts in SAS Survey Procedures

In this example, you will use SAS Survey Procedures to combine age subgroups and generate population estimates for high blood pressure (HBP) by sex and race/ethnicity for persons 20 years and older. The method outlined in this module uses a SAS data file with CPS population totals. The process for combining subgroups and calculating population estimates is then automated using the code outlined below.

Alternatively, you can use the CPS population totals located on the respective survey cycle NHANES web page (referred to in Key Concepts), plus the results from a proc surveymeans procedure and manually calculate population estimates within a spreadsheet. If you choose this option, you will need to define the age, race/ethnicity and gender subgroups of interest and calculate population totals within the spreadsheet on your own.

Step 1: Calculate Prevalence of Health Condition of Interest

The SAS Survey Procedure, proc surveymeans, is used to generate population estimates. The general program for obtaining population estimates is outlined in the 3-step process below:

In the first step, you will calculate the prevalence of the health condition (i.e. HBP) by sub-domains of interest. You will need to use appropriate weights, especially when combining across survey cycles.

The health outcome must be coded as a dichotomous (0, 100) variable for absence (0) or presence (100) of the health condition of interest (i.e. HBP and HBPX).

hbpx=. ;
if hbp=1 then hbpx=100 ;
else if hbp=2 then hbpx=0 ;

A new variable (sel) will be created to reflect the study subpopulation of interest (age 20 years and older) used in the domain statement of the proc surveymeans procedure.

sel=. ;
If ridageyr ge 20 then sel=1;
Else sel=2;
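The reason for the 0/100 coding can be shown with a short Python sketch. It assumes that hbp is coded 1 for presence and 2 for absence of high blood pressure, as described in the Stata task of this module; the function name and the sample records are hypothetical. The mean shown is a simple unweighted mean, used only to illustrate that the mean of a 0/100 variable reads directly as a percentage.

```python
def recode(hbp, ridageyr):
    """Return (hbpx, sel) following the coding described above."""
    # hbpx: 100 = condition present, 0 = absent, None = missing (SAS '.').
    hbpx = {1: 100, 2: 0}.get(hbp)
    # sel: 1 = age 20 and older, 2 = everyone else. A missing age falls
    # into sel=2 here, mirroring how SAS compares a missing value with 20.
    sel = 1 if (ridageyr is not None and ridageyr >= 20) else 2
    return hbpx, sel


# Hypothetical records (assumption): (hbp, age in years).
records = [(1, 45), (2, 30), (2, 60), (1, 70), (2, 12)]
recoded = [recode(hbp, age) for hbp, age in records]

# Unweighted mean of hbpx among adults with a non-missing outcome.
adult_values = [hbpx for hbpx, sel in recoded if sel == 1 and hbpx is not None]
percent_with_hbp = sum(adult_values) / len(adult_values)
print(percent_with_hbp)
```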

Population estimates will not be age standardized, so the estimates reflect the true population sampled. The results will be output to a SAS data file using the ods output statement below.

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

SAS Survey Procedure for Generating Prevalence Rates
Statements Explanation
proc surveymeans data=ANALYSIS_DATA nobs mean stderr clm;

Use the proc surveymeans procedure to obtain number of observations, mean, standard error and confidence intervals.

strata sdmvstra;

Use the strata statement to define the strata variable (sdmvstra).

cluster sdmvpsu;

Use the cluster statement to define the PSU variable (sdmvpsu).

class riagendr race;

Use the class statement to specify the discrete variables used to select the subpopulations of interest (i.e., gender [riagendr] and race [race]).

var hbpx;

Use the var statement to specify which variable(s) will be analyzed. In this example, the HBP variable (hbpx) is used.

weight wtmec4yr;

Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for 4 years of data (wtmec4yr) is used.

domain sel sel*riagendr sel*race sel*riagendr*race;

Use the domain statement to specify the subpopulations of interest.

ods OUTPUT domain(match_all)=unadj;
run ;

Use the ods output statement to save the estimates for the subdomains listed in the domain statement as SAS datasets. This set of commands will output four datasets, one for each domain specified in the domain statement above (unadj for sel, unadj1 for sel*riagendr, unadj2 for sel*race, and unadj3 for sel*riagendr*race).
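The point estimate that a domain analysis produces for each subgroup is a weighted mean: the sum of weight times hbpx divided by the sum of weights within the subgroup. The Python sketch below illustrates only that point estimate. It does not reproduce the design-based standard errors or confidence limits, which depend on the strata and PSU variables. The function name, the "wt" key, and the sample rows are hypothetical.

```python
from collections import defaultdict


def weighted_domain_means(rows, domain_keys):
    """Weighted mean of hbpx within each combination of the domain variables."""
    sums = defaultdict(lambda: [0.0, 0.0])  # key -> [sum(w * y), sum(w)]
    for row in rows:
        if row["hbpx"] is None:
            continue  # skip records with a missing outcome
        key = tuple(row[k] for k in domain_keys)
        sums[key][0] += row["wt"] * row["hbpx"]
        sums[key][1] += row["wt"]
    return {key: num / den for key, (num, den) in sums.items()}


# Hypothetical records (assumption): "wt" stands in for the survey weight.
rows = [
    {"sel": 1, "riagendr": 1, "hbpx": 100, "wt": 2.0},
    {"sel": 1, "riagendr": 1, "hbpx": 0, "wt": 6.0},
    {"sel": 1, "riagendr": 2, "hbpx": 100, "wt": 1.0},
    {"sel": 1, "riagendr": 2, "hbpx": 0, "wt": 1.0},
    {"sel": 2, "riagendr": 1, "hbpx": 0, "wt": 5.0},
]

by_sel = weighted_domain_means(rows, ["sel"])
by_sel_sex = weighted_domain_means(rows, ["sel", "riagendr"])
print(by_sel)
print(by_sel_sex)
```

Each call corresponds to one term of the domain statement: one result set for sel alone and another for sel crossed with riagendr.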

Format Data from SAS Output Dataset
Statements Explanation
data bp_stats;
set unadj unadj1 unadj2 unadj3;

Use the data statement to create a new dataset (bp_stats) from the SAS datasets created previously (unadj, unadj1, unadj2, and unadj3).

if sel= 1 ;
if race= . then race= 0 ;
if riagendr= . then riagendr= 0 ;

Use the if statement to select the subgroups of interest. Use if, then statements to recode missing values to 0 for race and riagendr.

ll=round(lowerclmean,.01);
ul=round(upperclmean,.01);

Use these statements to round and rename the lower limit (lowerclmean to ll), and upper limit (upperclmean to ul) of the Wald 95% confidence intervals.

percent=round(mean,.01);
sepercent=round(stderr,.01);
run ;

Use these statements to round the mean and standard error estimates and rename them to percent and sepercent, respectively.

Step 2: Combine CPS Population Totals

In Step 2, you will combine appropriate CPS population totals across survey cycles AND across years of age to reflect the subpopulation of interest (i.e., those 20 and older).

In this module, CPS population totals are supplied as a SAS dataset with values for: age (CTUTAGE), ranging from 0 to 85+ years; gender (CTUTGNDR); race/ethnicity (CTUTRACE), where 1=non-Hispanic white, 2=non-Hispanic black, 3=Mexican American and 4=other; race/ethnicity (CTUTRETH), where 1=Mexican American, 2=non-Hispanic other, 3=non-Hispanic white, 4=non-Hispanic black, 5=other Hispanic; ethnicity (CTUTHISP), where 1=Hispanic and 2=non-Hispanic; survey cycle (CTUTSRVY); and the population total (CTUTPOPT). Appropriate age, race/ethnicity, and gender groups were created in a previous step.

The proc means procedure for simple random samples in SAS will be used to calculate CPS population totals for the sub-domains of interest (i.e., sex and race) for the subpopulation of interest (age 20 and older). In this case, no sample design factors or weights need to be used. Subgroup totals are output to another SAS data set (saspt9902) for use in Step 3.

SAS Procedure to Calculate CPS Population Totals
Statements Explanation
Proc means data =nh.cpstot9902; where ctutage >= 20 ;

Use the proc means procedure and the where statement to calculate totals for persons 20 years of age and older.

var ctutpopt;

Use the var statement to select the variable of interest (ctutpopt).

output out =d1 n = n sum = sum ;
run ;

Use the output statement to create a dataset (d1) for the population totals (sum).

proc sort data =nh.cpstot9902;by ctutgndr;
run ;

Use the proc sort procedure to sort the dataset by sex.

proc means data =nh.cpstot9902; where ctutage >= 20 ;

Use the proc means procedure and the where statement to calculate totals for persons 20 years of age and older.

var ctutpopt;

Use the var statement to select the variable of interest (ctutpopt).

by ctutgndr;

Use the by statement to generate population totals by sex (ctutgndr).

output out =d2 n = n sum = sum ;
run ;

Use the output statement to create a dataset (d2) for the population totals (sum).

proc sort data =nh.cpstot9902; by ctutrace; run ;

Use the proc sort procedure to sort the dataset by race.

proc means data =nh.cpstot9902; where ctutage >= 20 ;

Use the proc means procedure and the where statement to calculate totals for persons 20 years of age and older.

var ctutpopt;

Use the var statement to select the variable of interest (ctutpopt).

by ctutrace;

Use the by statement to generate population totals by race.

output out =d3 n = n sum = sum ;
run ;

Use the output statement to create a dataset (d3) for the population totals (sum).

proc sort data =nh.cpstot9902; by ctutgndr ctutrace; run ;

Use the proc sort procedure to sort the dataset by sex and race.

proc means data =nh.cpstot9902;
where ctutage >= 20 ;

Use the proc means procedure and the where statement to calculate totals for persons 20 years of age and older.

var ctutpopt;
by ctutgndr ctutrace;

Use the var statement to select the variable of interest (ctutpopt).

Use the by statement to generate population totals by sex and race.

output out =d4 n = n sum = sum ;
run ;

Use the output statement to create a dataset (d4) for the population totals (sum).

data saspt9902;
set d1 d2 d3 d4;
if ctutrace= . then ctutrace= 0 ;
if ctutgndr= . then ctutgndr= 0 ;
run ;

This data step consolidates the datasets created above into a single dataset for use in the next step (saspt9902).
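The logic of Step 2 (sum the CPS cells for ages 20 and older overall, by sex, by race, and by sex and race, using 0 to mean "all") can be sketched in Python as follows. This is an illustration under stated assumptions rather than a translation of the SAS program: the function name is hypothetical and the cell values are made up.

```python
from collections import defaultdict


def subgroup_totals(cells, min_age=20):
    """Sum population totals by (gender, race), with 0 meaning 'all'."""
    totals = defaultdict(int)
    for age, gender, race, pop in cells:
        if age < min_age:
            continue  # keep only the subpopulation of interest
        # Each cell contributes to the overall total, its sex total,
        # its race total, and its sex-by-race total.
        for key in ((0, 0), (gender, 0), (0, race), (gender, race)):
            totals[key] += pop
    return dict(totals)


# Hypothetical CPS-style cells (assumption): (age, gender, race, population).
cells = [
    (19, 1, 1, 500),
    (20, 1, 1, 100),
    (40, 1, 2, 200),
    (85, 2, 1, 300),
    (30, 2, 2, 400),
]
totals = subgroup_totals(cells)
print(totals)
```

Coding "all" as 0 in both the totals and the prevalence estimates is what allows the two datasets to be matched on the same pair of key variables in the next step.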

Step 3: Multiply Prevalence Estimates by CPS Population Totals

In this last step, you will multiply prevalence estimates by the corresponding CPS population totals to estimate the total number of non-institutionalized U.S. adults with HBP.

Note that the datasets produced in Step 1 and Step 2 will be sorted on the sub-domain variables and merged. The new dataset will be used in the final SAS program. Percent prevalence estimates as well as lower and upper 95% confidence limits will be multiplied by the corresponding population total for that subgroup. Results will be rounded, formatted, and printed in SAS.

Calculate Population Estimates from SAS Output Dataset
Statements Explanation
proc sort data =bp_stats; by riagendr race ; run ;
proc sort data =saspt9902(rename=(ctutgndr=riagendr ctutrace=race)); by riagendr race ; run ;

Use the proc sort procedure to sort the two datasets by sex and race. In the second dataset, rename the CPS total race and gender (ctutrace and ctutgndr) variables to match the variable names used in the original dataset.

data comb;
merge bp_stats( in =a) saspt9902 ;
by riagendr race ;
if a ;

Use the data statement to create a new dataset (comb) by merging the SAS datasets created previously (bp_stats and saspt9902). Keep only the observations whose values for race and sex exist in bp_stats (in=a).

popmean=(percent/100 )*sum ;
popl=ll/100 *sum ;
popu=ul/100 *sum ;

Use these statements to calculate the population counts by applying the population totals (sum) to the prevalence estimate (percent) and the 95% confidence interval limits.

poplr=round(popl,1000);
popur=round(popu,1000);
popmeanr=round(popmean,1000);
totalr=round(sum,1000) ;

Use these statements to round and format the estimates to the nearest thousand.

proc print noobs split= '/' double;
var riagendr race percent sepercent ll ul n
totalr popmeanr poplr popur ;
format race racefmt. riagendr sexfmt. n 5.0 percent 5.2 sepercent 5.2
ll 4.2 ul 4.2 ;
label
percent='%' / 'with' / 'high' / 'bp'
n='Num' / 'bp' / 'status'
sepercent='Std' / 'error'
ll='Lower' / '95%' / 'Wald' / 'CI'
ul='Upper' / '95%' / 'Wald' / 'CI'
popmeanr='Pop' / 'Est' / 'US' / 'with' / 'high' / 'bp'
totalr='Pop' / 'total' / 'US'
poplr='Pop Est' / 'Lower' / '95%' / 'WALD' / 'CI'
popur='Pop Est' / 'Upper' / '95%' / 'WALD' / 'CI' ;
title1 'Prevalence of persons with high Bp - US, 1999-2002' ;
title2 'Percent and population estimates of number with high Bp-Wald CI' ;
run ;

Use the proc print procedure to print the variables of interest.
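The match-merge in this table keeps every row of the prevalence dataset and attaches the population total that shares the same sex and race key; totals with no matching prevalence row are dropped. A minimal Python sketch of that behavior is shown below. The function name and the values are hypothetical, and dictionaries keyed by (riagendr, race) stand in for the sorted SAS datasets.

```python
def merge_keep_left(estimates, totals):
    """Attach a population total to each estimate row, matching on the key."""
    merged = {}
    for key in sorted(estimates):
        row = dict(estimates[key])
        # None marks an estimate with no matching population total.
        row["sum"] = totals.get(key)
        merged[key] = row
    return merged


# Hypothetical inputs (assumption), keyed by (riagendr, race); 0 means "all".
estimates = {(0, 0): {"percent": 30.0}, (1, 0): {"percent": 25.0}}
totals = {(0, 0): 1000, (1, 0): 300, (2, 0): 700}

comb = merge_keep_left(estimates, totals)
print(comb)
```

Here the (2, 0) total has no matching estimate, so it does not appear in the result, which is the effect of the in= flag and the subsetting if statement.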

Highlights from the output include:

  • Nearly 58 million non-institutionalized U.S. adults have high blood pressure, with approximately 26 million adult men and 32 million adult women affected.
  • The number of non-institutionalized U.S. adults with hypertension, by race/ethnicity, is as follows: approximately 42.7 million non-Hispanic whites, 8.1 million non-Hispanic Blacks, and 2.4 million Mexican-Americans.

Task 2c: How to Generate Population Counts in Stata

In this example, you will use Stata to combine age subgroups and generate population estimates for high blood pressure (HBP) by sex and race/ethnicity for persons 20 years and older. The method outlined in this module uses a Stata data file with CPS population totals. The process for combining subgroups and calculating population estimates is then automated using the code outlined below.

Alternatively, you can use the CPS population totals located on the respective survey cycle NHANES web page (referred to in Key Concepts), plus the results from a svy: mean command and manually calculate population estimates within a spreadsheet. If you choose this option, you will need to define the age, race/ethnicity and gender subgroups of interest and calculate population totals within the spreadsheet on your own.

Step 1: Download and install parmest command

The program outlined in the following steps uses a command called parmest that saves a model fit as a dataset. If you do not have this command installed and run the program, it will report an error that the parmest command is not recognized. You will use Stata Help to locate, download and install this command. In Stata, open the Help menu in the top toolbar. Then select Search. In the dialog box, select the radio button next to "Search All", and enter "parmest" in the search field. In the results, click the result "dm65." If you do not have this command installed, you will see a brief description and "(Click here to install)" in blue on the right side of the window. Click it to install. (If you already have this command installed, you will get the complete Help file and list of options to use with the command.)

IMPORTANT NOTE

Do not install dm65_1. The program will generate errors if dm65_1 is installed instead of dm65. This is because the parmest command that is installed with dm65_1 does not restore the existing dataset (analysis_data) after the command parmest, (save filename) is used to save the most recently requested parameter estimates. If you accidentally installed or already installed dm65_1, go back to the search results and click dm65 and select the option to replace dm65_1 with dm65.

Step 2: Use svyset to define survey design variables

Remember that you need to define the SVYSET before using the SVY series of commands. The general format of this command is below:

svyset psuvar [pweight=weightvar], strata(stratavar) vce(variance_method)

To define the survey design variables for your high blood pressure analysis, use the weight variable for four years of MEC data (wtmec4yr), the PSU variable (sdmvpsu), and the strata variable (sdmvstra). The vce option specifies the method for calculating the variance; the default is "linearized", which is Taylor linearization. Here is the svyset command for four years of MEC data:

svyset sdmvpsu [pweight=wtmec4yr], strata(sdmvstra) vce(linearized)

Step 3: Create new variables

You will need to create variables for race/ethnicity and high blood pressure. First, create a variable for the race/ethnicity groups in your analyses. Then, code the outcome variable as a dichotomous variable, where the absence of the outcome is coded as 0 and the presence of the outcome is coded as 100. Using 100 will express the proportion as a percentage (e.g., 0.23 would be represented as 23). The dichotomous variable, hbp, is already coded as 2 for the absence of outcome and 1 for the presence of outcome, so it will need to be recoded as a new variable, hbpx. Here is the code for creating the variables:

Code to generate variables
Variable Code to generate variables
Race
gen race =1 if ridreth1 == 3
replace race =2 if ridreth1 == 4
replace race =3 if ridreth1 == 1
replace race =4 if ridreth1 == 2 | ridreth1 ==5
High Blood Pressure
gen hbpx=100 if hbp==1
replace hbpx=0 if hbp==2
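The race recode above is a fixed lookup from ridreth1 to the four analysis groups. A small Python sketch of the same mapping follows; the dictionary and function names are hypothetical, and the group labels in the comments are the ones this module uses for the race variable.

```python
# Lookup that mirrors the gen/replace statements above:
# race 1 = non-Hispanic White, 2 = non-Hispanic Black,
# 3 = Mexican American, 4 = other.
RACE_FROM_RIDRETH1 = {3: 1, 4: 2, 1: 3, 2: 4, 5: 4}


def recode_race(ridreth1):
    """Return the analysis race group, or None if ridreth1 is not recognized."""
    return RACE_FROM_RIDRETH1.get(ridreth1)


recoded = [recode_race(code) for code in (1, 2, 3, 4, 5)]
print(recoded)
```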

Step 4: Generate proportions for the outcome of interest and save estimates

The Stata command svy: mean and additional Stata code will be used to generate population estimates. Similar to the svy: mean command used in Task 1, you will output the results to a Stata data file using the parmest command, as shown below. Population estimates will not be age-standardized so that they reflect the true population sampled.

quietly svy, subpop(condition): mean var
estat size
parmest, saving("path\to\file", [option])

Use the prefix quietly before the svy command to suppress terminal output. Use the svy: mean command with the high blood pressure variable (hbpx) to estimate the prevalence of HBP. Use the subpop() option to select a subpopulation for analysis, rather than select the study population in the Stata program while preparing the data file. Use the estat size postestimation command to display subpopulation sizes. Use the parmest command with the saving option to create a new Stata dataset of the most recently requested parameter estimates.

quietly svy, subpop(if ridageyr >=20 & ridageyr < .): mean hbpx 
estat size
parmest, saving("c:\NHANES\data\popmean1", replace)

The output data will be formatted so that it can be merged in a later step.

use "c:\NHANES\data\popmean1", clear
gen riagendr=0 
gen race=0 
drop parm
save "c:\NHANES\data\popmean1", replace
Stata Commands for Generating Prevalence Rates
Statements Explanation
use "C:\nhanes\data\analysis_data.dta", clear

Use the use command to load the Stata-format dataset.

Use the clear option to replace any data in memory.

svyset sdmvpsu [pweight=wtmec4yr], strata(sdmvstra) vce(linearized)
Use the svyset command to declare the survey design for the dataset. Specify the PSU variable sdmvpsu. Use the [pweight=] option to account for the unequal probability of sampling and non-response. In this example, the MEC weight for four years of data (wtmec4yr) is used. Use the strata( ) option to specify the stratum identifier (sdmvstra). Use the vce( ) option to specify the variance estimation method (linearized) for Taylor linearization.
quietly svy, subpop(if ridageyr >=20 & ridageyr <.): mean hbpx
estat size
parmest, saving("c:\NHANES\Data\popmean1", replace)

Use the prefix quietly before the svy command to suppress terminal output. Use the svy : mean command with the high blood pressure variable (hbpx) to estimate the prevalence of HBP. Use the subpop( ) option to select a subpopulation for analysis, rather than select the study population in the Stata program while preparing the data file.

Use the estat size postestimation command to display subpopulation sizes.

Use the parmest command with the saving option to create a new Stata dataset of the most recently requested parameter estimates.

quietly svy, subpop(if ridageyr >=20 & ridageyr <.) vce(linearized): mean hbpx, over(riagendr)
estat size
parmest, saving("c:\NHANES\Data\popmean2", replace)

Use the prefix quietly before the svy command to suppress terminal output. Use the svy : mean command with the high blood pressure variable (hbpx) to estimate the prevalence of HBP. Use the subpop( ) option to select a subpopulation for analysis, rather than select the study population in the Stata program while preparing the data file. Use the over( ) option to specify subgroup tabulations for estimates requested. In this example, gender (riagendr) is of interest.

Use the estat size postestimation command to display subpopulation sizes. Use the parmest command with the saving option to create a new Stata dataset of the most recently requested parameter estimates.

quietly svy, subpop(if ridageyr >=20 & ridageyr <.) vce(linearized): mean hbpx, over(race)
estat size
parmest, saving("c:\NHANES\Data\popmean3", replace)

Use the prefix quietly before the svy command to suppress terminal output. Use the svy : mean command with the high blood pressure variable (hbpx) to estimate the prevalence of HBP. Use the subpop( ) option to select a subpopulation for analysis, rather than select the study population in the Stata program while preparing the data file. Use the over( ) option to specify subgroup cross-tabulations for estimates requested. In this example, race-ethnicity (race) is of interest.

Use the estat size postestimation command to display subpopulation sizes.

Use the parmest command with the saving option to create a new Stata dataset of the most recently requested parameter estimates.

quietly svy, subpop(if ridageyr >=20 & ridageyr <.) vce(linearized): mean hbpx, over(riagendr race)
estat size
parmest, saving("c:\NHANES\Data\popmean4", replace)

Use the prefix quietly before the svy command to suppress terminal output. Use the svy : mean command with the high blood pressure variable (hbpx) to estimate the prevalence of HBP. Use the subpop( ) option to select a subpopulation for analysis, rather than select the study population in the Stata program while preparing the data file. Use the over( ) option to specify subgroup cross-tabulations for estimates requested. In this example, gender (riagendr), and race-ethnicity (race) are of interest.

Use the estat size postestimation command to display subpopulation sizes.

Use the parmest command with the saving option to create a new Stata dataset of the most recently requested parameter estimates.

Create Race and Gender Variables from Stata Parameters Output Dataset
Statements Explanation
use "c:\NHANES\Data\popmean1", clear
gen riagendr=0
gen race=0
drop parm
save "c:\NHANES\Data\popmean1", replace

Use the use command to load the Stata-format dataset. Use the clear option to replace any data in memory.

Use the generate (gen) command to create gender (riagendr) and race variables equal to 0 to indicate all genders and races. Use the drop command to remove the parm variable.

Use the save command to save the dataset.

use "c:\NHANES\Data\popmean2", clear
gen riagendr=1 if parm=="male"
replace riagendr=2 if parm=="female"
gen race=0
drop parm
save "c:\NHANES\Data\popmean2", replace

Use the use command to load the Stata-format dataset. Use the clear option to replace any data in memory.

Use the generate (gen) and replace commands to create gender (riagendr) equal 1 (male) and 2 (female) and race equal 0 for all races. Use the drop command to remove the parm variable.

Use the save command to save the dataset.

use "c:\NHANES\Data\popmean3", clear
gen riagendr=0
gen race=1 if parm=="_subpop_1"
replace race=2 if parm=="_subpop_2"
replace race=3 if parm=="_subpop_3"
replace race=4 if parm=="_subpop_4"
drop parm
save "c:\NHANES\Data\popmean3", replace

Use the use command to load the Stata-format dataset. Use the clear option to replace any data in memory.

Use the generate (gen) and replace commands to create race equal 1 (NH White), 2 (NH Black), 3 (Mexican American) and 4 (other races) and riagendr equal to 0 for all genders. Use the drop command to remove the parm variable.

Use the save command to save the dataset.

use "c:\NHANES\Data\popmean4", clear
gen riagendr=1 if parm=="_subpop_1" | parm== "_subpop_2" | parm== "_subpop_3" | parm== "_subpop_4"
replace riagendr=2 if parm=="_subpop_5" | parm== "_subpop_6" | parm== "_subpop_7" | parm== "_subpop_8"
gen race=1 if parm=="_subpop_1" | parm== "_subpop_5"
replace race=2 if parm=="_subpop_2" | parm== "_subpop_6"
replace race=3 if parm=="_subpop_3" | parm== "_subpop_7"
replace race=4 if parm=="_subpop_4" | parm== "_subpop_8"
drop parm
save "c:\NHANES\Data\popmean4", replace

Use the use command to load the Stata-format dataset. Use the clear option to replace any data in memory.

Use the generate (gen) and replace commands to create gender (riagendr) equal 1 (male) and 2 (female) and race equal 1 (NH White), 2 (NH Black), 3 (Mexican American) and 4 (other races). Use the drop command to remove the parm variable.

Use the save command to save the dataset.

use "c:\NHANES\Data\popmean1", clear
append using "C:\NHANES\Data\popmean2"
append using "C:\NHANES\Data\popmean3"
append using "C:\NHANES\Data\popmean4"
sort riagendr race
save "c:\NHANES\Data\popmeans", replace

Use the use command to load the Stata-format dataset. Use the clear option to replace any data in memory.

Use the append command to combine all the datasets created above into one dataset.

Use the sort command to sort the variables by gender and race.

Use the save command to save the dataset.
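The popmean4 recode above relies on the eight subpopulation labels being numbered with gender varying slowest and race varying fastest. Assuming that ordering, and assuming that all eight gender-by-race cells are present, the same mapping can be written as integer arithmetic. The Python sketch below is only an illustration of that pattern; the function name is hypothetical.

```python
def subpop_to_groups(parm, n_race=4):
    """Map a label such as '_subpop_5' to (riagendr, race)."""
    k = int(parm.rsplit("_", 1)[1])   # trailing index, 1 through 8
    riagendr = (k - 1) // n_race + 1  # 1 for indexes 1-4, 2 for indexes 5-8
    race = (k - 1) % n_race + 1       # cycles 1, 2, 3, 4 within each gender
    return riagendr, race


mapping = {k: subpop_to_groups(f"_subpop_{k}") for k in range(1, 9)}
print(mapping)
```

If a subgroup were empty, the labels would no longer line up with this pattern, so it is worth checking the estat size output before relying on either the arithmetic or the explicit recode.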

Step 5: Combine CPS population tables

In this step, you will combine appropriate CPS population totals across survey cycles AND across years of age to reflect the subpopulation of interest (i.e., those 20 and older by sex and race).

In this module, CPS population totals are supplied as a Stata dataset with values for: age (CTUTAGE), ranging from 0 to 85+ years; gender (CTUTGNDR); race/ethnicity (CTUTRACE), where 1=non-Hispanic white, 2=non-Hispanic black, 3=Mexican American and 4=other; race/ethnicity (CTUTRETH), where 1=Mexican American, 2=non-Hispanic other, 3=non-Hispanic white, 4=non-Hispanic black, 5=other Hispanic; ethnicity (CTUTHISP), where 1=Hispanic and 2=non-Hispanic; survey cycle (CTUTSRVY); and the population total (CTUTPOPT). Appropriate age, race/ethnicity, and gender groups were created in a previous step.

The collapse command in Stata will be used to calculate CPS population totals for the sub-domains of interest (i.e., sex and race/ethnicity) for the subpopulation of interest (age 20 and older). In this case, no sample design factors or weights need to be used. Use the use command to load the Stata-supplied dataset (cpstot9902) to read the CPS population totals. The variable is CTUTPOPT (the population totals). Subgroup totals are output to another Stata dataset (poptot9902) for use in the next step. Nothing is printed. Use the collapse command to convert the current dataset into a dataset of population total sums for ages greater than or equal to 20 years. Use the save command to save the dataset.

use "C:\NHANES\data\cpstot9902.dta", clear
collapse (sum) ctutpopt if ctutage >=20 & ctutage <.
save "c:\NHANES\data\tot9902a", replace
Stata Procedure to Calculate CPS Population Totals
Statements Explanation
use "c:\NHANES\Data\cpstot9902.dta", clear
collapse (sum) ctutpopt if ctutage >=20
save "c:\NHANES\Data\tot9902a", replace

Use the use command to load the Stata-supplied dataset (cpstot9902) to read the CPS population totals.

Use the collapse command to convert the current dataset into a dataset of population total sums for ages greater than or equal to 20 years.

Use the save command to save the dataset.

use "c:\NHANES\Data\cpstot9902.dta", clear
collapse (sum) ctutpopt if ctutage >=20, by(ctutgndr)
save "c:\NHANES\Data\tot9902b", replace

Use the use command to load the Stata-supplied dataset (cpstot9902) to read the CPS population totals.

Use the collapse command to convert the current dataset into a dataset of population total sums for ages greater than or equal to 20 years by gender.

Use the save command to save the dataset.

use "c:\NHANES\Data\cpstot9902.dta", clear
collapse (sum) ctutpopt if ctutage >=20, by(ctutrace)
save "c:\NHANES\Data\tot9902c", replace

Use the use command to load the Stata-supplied dataset (cpstot9902) to read the CPS population totals.

Use the collapse command to convert the current dataset into a dataset of population total sums for ages greater than or equal to 20 years by race.

Use the save command to save the dataset.

use "c:\NHANES\Data\cpstot9902.dta", clear
collapse (sum) ctutpopt if ctutage >=20, by(ctutgndr ctutrace)
save "c:\NHANES\Data\tot9902d", replace

Use the use command to load the Stata-supplied dataset (cpstot9902) to read the CPS population totals.

Use the collapse command to convert the current dataset into a dataset of population total sums for ages greater than or equal to 20 years by gender and race.

Use the save command to save the dataset.

use "c:\NHANES\Data\tot9902a", clear
append using "c:\NHANES\Data\tot9902b"
append using "c:\NHANES\Data\tot9902c"
append using "c:\NHANES\Data\tot9902d"
rename ctutgndr riagendr
rename ctutrace race
replace riagendr=0 if riagendr==.
replace race=0 if race==.
sort riagendr race
save "c:\NHANES\Data\poptot9902", replace

Use the use command to load the Stata-format dataset. Use the clear option to replace any data in memory.

Use the append command to combine all the datasets created above into one dataset.

Use the rename command to change the names of the variables.

Use the replace command to change the value of the variables from missing (.) to 0.

Use the sort command to sort the variables by gender and race.

Use the save command to save the dataset.

Step 6: Multiply prevalence estimates by CPS population totals

In this step, you will multiply prevalence estimates by the corresponding CPS population totals to estimate the total number of non-institutionalized U.S. adults with HBP.

Note that the datasets produced in the previous steps (popmeans, poptot9902) were sorted on the sub-domain variables (riagendr, race) to be merged. After merging, the prevalence estimates output from the datasets are rounded.

use "c:\NHANES\data\popmeans", clear
merge riagendr race using "c:\NHANES\data\poptot9902.dta"
gen est=round(estimate,.01) 
gen se=round(stderr,.01)
gen ll=round(min95,.01)
gen ul=round(max95,.01)

Then, percent prevalence estimates (est), as well as lower and upper 95% confidence limits (ll, ul), will be multiplied by the corresponding population total for that subgroup (ctutpopt).

gen popmean=(est/100)*ctutpopt
gen popl=(ll/100)*ctutpopt
gen popu=(ul/100)*ctutpopt

Results will be rounded, saved, and printed.

gen popmeanr=round(popmean,1000)
gen poplr=round(popl,1000)
gen popur=round(popu,1000)
gen poptot_r =round(ctutpopt,1000) 
save "c:\NHANES\data\popmeans", replace
list riagendr race est se ll ul poptot_r popmeanr poplr popur, clean

Stata output of population totals for high blood pressure by gender and race

Highlights from the output include:

  • Nearly 58 million non-institutionalized U.S. adults have high blood pressure, with approximately 26 million adult men and 32 million adult women affected.
  • The number of non-institutionalized U.S. adults with hypertension, by race/ethnicity, is as follows: approximately 42.7 million non-Hispanic Whites, 8.1 million non-Hispanic Blacks, and 2.4 million Mexican-Americans.