
Module 5: Descriptive Statistics

The NHANES Tutorials are currently being reviewed and revised, and are subject to change. Specialized tutorials (e.g. Dietary, etc.) will be included in the future.

NHANES data are often used to provide national estimates on important public health issues. This module introduces how to generate the descriptive statistics for NHANES data that are most often used to obtain these estimates. Topics covered in this module include checking frequency distribution and normality, generating percentiles, generating means, and generating proportions.

It is highly recommended that you examine the frequency distribution and normality of the data before starting any analysis. These descriptive statistics are useful in determining whether parametric or non-parametric methods are appropriate to use, and whether you need to recode or transform data to account for extreme values and outliers.

Checking Frequency Distribution and Normality

Frequency Distribution

A frequency distribution shows the number of individuals located in each category of a categorical variable. For continuous variables, frequencies are displayed for values that appear at least one time in the dataset. Frequency distributions provide an organized picture of the data, and allow you to see how individual scores are distributed on a specified scale of measurement. For instance, a frequency distribution shows whether the data values are generally high or low, and whether they are concentrated in one area or spread out across the entire measurement scale.

A frequency distribution not only presents an organized picture of how individual scores are distributed on a measurement scale, but also reveals extreme values and outliers. Researchers can make decisions on whether and how to recode or perform data transformation based on the distribution statistics.

Frequency distributions can be structured as tables or graphs, but either should show the original measurement scale and the frequencies associated with each category. Because NHANES data have very large sample sizes with a potentially long list of different values for continuous variables, it is recommended that you use a graphic format to check the distribution for continuous variables, and either frequency tables or graphic forms for nominal or ordinal variables.
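As a rough illustration of a tabular frequency distribution, the following Python sketch counts how often each category occurs and reports the percentage. It is not part of the tutorial's SAS or Stata programs; the function name frequency_table and the small gender list are hypothetical and only stand in for a coded categorical variable such as riagendr.

```python
from collections import Counter


def frequency_table(values):
    """Return (value, count, percent) rows sorted by value."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(v, c, 100.0 * c / total) for v, c in sorted(counts.items())]


# Hypothetical coded categorical variable (1 = male, 2 = female).
gender = [1, 2, 2, 1, 2, 2, 1, 2]
table = frequency_table(gender)
for value, count, pct in table:
    print(f"{value}: n={count} ({pct:.1f}%)")
```

For a continuous variable with many distinct values, the same table would become very long, which is why a graph is usually the better choice there.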

Statistics of Normality (for Continuous Variables)

Statistics of normality reveal whether a data distribution is normal and symmetrically bell-shaped or highly skewed. It is important to use these statistics to check the normality of a distribution because they help determine whether you will use parametric tests (which assume a normal distribution) or non-parametric tests, and whether you need to transform the data in your analysis.

IMPORTANT NOTE

Note: Before you analyze the data, it is important to check the distribution of the variables to identify outliers and determine whether parametric (for a normal distribution) or non-parametric tests are appropriate to use.

NHANES 1999-2002 is a large, representative sample of the U.S. population, and most continuous variables from this sample are expected to be normally distributed. Because the sample sizes are extremely large, formal tests for normality would be significant for most variables: even the slightest deviation from normality could result in rejecting the null hypothesis. Therefore, users are discouraged from depending solely on these tests for normality. Instead, you can also request a Q-Q plot to examine normality.

A Q-Q plot, or quantile-quantile plot, is a graphical data analysis technique for assessing whether the data follow a particular distribution. In a normal Q-Q plot, the quantiles of the variable in question are plotted against the corresponding quantiles of a normal distribution. If the variable of interest is normally distributed, the points fall approximately along a straight line, which is at a 45 degree angle when both axes use the same scale.
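The idea behind a normal Q-Q plot can be sketched in a few lines of Python. This is an unweighted illustration only, not the procedure SAS or Stata uses; the function name normal_qq_points, the plotting positions (i - 0.5) / n, and the sample values are assumptions made for the example.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(values):
    """Pair each sorted observation with the matching normal quantile."""
    xs = sorted(values)
    n = len(xs)
    # Normal distribution with the same mean and standard deviation as the data.
    fitted = NormalDist(mean(xs), stdev(xs))
    # Plotting positions (i - 0.5) / n avoid the undefined 0% and 100% quantiles.
    theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, xs))


# Hypothetical total cholesterol values with one large value in the upper tail.
sample = [172, 185, 190, 198, 201, 207, 214, 222, 236, 310]
points = normal_qq_points(sample)
for expected, observed in points:
    print(f"expected {expected:7.1f}   observed {observed}")
```

Points that sit well away from the line "observed = expected", such as the largest value here, are the ones worth reviewing as possible outliers.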

Standard Deviation

The standard deviation is a measure of the variability of the distribution of a random variable. To estimate the standard deviation:

  1. calculate the weighted sum of the squares of the differences of the observations in a simple random sample from the sample mean
  2. divide the result obtained in 1 by an estimate of the population size minus 1
  3. take the square root of the result obtained in 2.
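The three steps above can be sketched in Python as follows. This is a minimal illustration, assuming the differences are taken from the weighted mean and that the sum of the sample weights serves as the estimate of the population size; the function name weighted_sd and the example values are hypothetical.

```python
import math


def weighted_sd(values, weights):
    """Weighted standard deviation following the three steps above."""
    total_weight = sum(weights)  # estimate of the population size
    mean = sum(w * x for x, w in zip(values, weights)) / total_weight
    # Step 1: weighted sum of squared differences from the mean.
    ss = sum(w * (x - mean) ** 2 for x, w in zip(values, weights))
    # Step 2: divide by the estimated population size minus 1.
    variance = ss / (total_weight - 1)
    # Step 3: take the square root.
    return math.sqrt(variance)


# Hypothetical cholesterol values and sample weights.
values = [180.0, 200.0, 220.0]
weights = [1000.0, 2000.0, 1000.0]
sd = weighted_sd(values, weights)
print(round(sd, 2))
```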

Skewness

Skewness is a measure of the departure of the distribution of a random variable from symmetry. The skewness of a normally distributed random variable is 0.

Kurtosis

Kurtosis is a measure of the peakedness of the distribution. The kurtosis of a normally distributed random variable depends on the formula used. One formula subtracts 3, as used by SAS, which makes the value for a normal distribution equal to 0. The other formula does not subtract 3, as used by Stata, which makes the value for a normal distribution equal to 3. A kurtosis exceeding the value for a normal distribution indicates excess values close to the mean and at the tails of the distribution. A kurtosis of less than the value for a normal distribution indicates a distribution with a flatter top.

SAS Support Link: http://support.sas.com/publishing/bbu/companion_site/update/lsb_kurtosis.html
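To make the two kurtosis conventions concrete, here is a small unweighted Python sketch based on simple central moments. Statistical packages may apply small-sample corrections, so these functions (central_moment, skewness, kurtosis, all hypothetical names) are not expected to reproduce SAS or Stata output exactly; they only show where the "subtract 3" difference comes from.

```python
def central_moment(values, order):
    """Average of (x - mean) ** order over the data."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** order for x in values) / n


def skewness(values):
    """Third central moment scaled by the variance to the power 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values, excess=False):
    """Fourth central moment over the squared variance.

    With excess=True, 3 is subtracted so a normal distribution scores 0;
    otherwise a normal distribution scores 3.
    """
    k = central_moment(values, 4) / central_moment(values, 2) ** 2
    return k - 3.0 if excess else k


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.0, 2.0, 2.0, 10.0]
print(skewness(symmetric), skewness(right_skewed))
print(kurtosis(right_skewed), kurtosis(right_skewed, excess=True))
```

When comparing results across packages, check which convention is in use before interpreting a kurtosis value as high or low.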

Standard Error of the Mean

The standard error of the mean based on data from a simple random sample is estimated by dividing the estimated standard deviation by the square root of the sample size. The value of the standard error obtained from SAS proc univariate using the freq option with the sample weight (i.e. freq appropriate sample weight) is obtained by dividing the estimated standard deviation (see above) by the square root of the sum of the sample weights (i.e. an estimate of the population size), so it is far too small. In order to obtain the "correct" estimate of the simple random sample (SRS) standard error of the mean, divide the estimated standard deviation by the square root of the sample size. The SRS estimate of the standard error of the mean thus obtained serves as a benchmark against which to compare the design-based estimate of the standard error of the mean, which can be obtained from SUDAAN proc descript. (See the Variance Estimation module for more information.)
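A quick numerical sketch in Python shows why the distinction matters. It assumes that a frequency-weighted run treats the sum of the sample weights as if it were the number of observations; the function name srs_standard_error and all numbers are hypothetical.

```python
import math


def srs_standard_error(sd, sample_size):
    """Simple random sample standard error of the mean: sd / sqrt(n)."""
    return sd / math.sqrt(sample_size)


sd = 40.0                     # estimated standard deviation (hypothetical)
n = 400                       # number of sample persons (hypothetical)
sum_of_weights = 4_000_000.0  # estimated population size (hypothetical)

# Benchmark: divide by the square root of the actual sample size.
se_srs = srs_standard_error(sd, n)
# What results if the sum of the weights is treated as the sample size.
se_freq = sd / math.sqrt(sum_of_weights)

print(se_srs, se_freq)
```

The second value is much smaller than the first, which is why it should not be reported as a standard error.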

Task 1a: How to Check Frequency Distribution and Normality in SAS

The SAS procedure, proc univariate, generates descriptive and summary statistics that are useful in describing the characteristics of a distribution. These statistics can also be used to determine whether parametric (for a normal distribution) or non-parametric tests are appropriate to use in your analysis. As noted in the Clean & Recode Data module it is advisable to check for extreme weights and outliers before starting any analysis.

Step 1: Use the univariate procedure to generate descriptive statistics in SAS

Use the SAS procedure, proc univariate, to generate descriptive statistics. The frequency distribution can be presented in table or graphic format. The freq option generates the frequency distribution in tabular form by listing the number of observations for each value of the variable. Due to the large sample size and the possibility of a long list of different values, it is not reasonable to request the freq option for variables that are not nominal or ordinal. The plot option generates the frequency distribution in graphic form (histogram, box, and normal probability plots), and the normal option generates statistics to test the normality of the distribution.

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

SAS Univariate Procedure for Descriptive Statistics
Statements Explanation
proc sort data=analysis_data ;
by riagendr age;
run ;

Use the sort procedure to sort data by the same variables used in the by statement of the univariate procedure. In the example, data is sorted by gender (riagendr) and age (age).

PROC UNIVARIATE PLOT NORMAL ;

Use the univariate procedure to generate descriptive statistics, which include number of missing values, mean, standard errors, percentiles, and extreme values. Use the plot option to generate histogram, box and normal probability plots, and the normal option to generate statistics to test normality.

In this example, plots (plot) and normality test statistics (normal) are requested and the results will be sorted and generated separately for each combination of the variables on the by statement.

where ridageyr >= 20 ;

Use the where statement to select those 20 years and older.

by riagendr age;

The by statement determines the groups (all combinations of the variables listed on the by statement) for which separate descriptive statistics will be produced. This statement should match the by statement in the sort procedure preceding it.

VAR lbxtc;

Use the var statement to indicate variable(s) for which descriptive measures are requested. In this example, the total cholesterol variable (lbxtc) is used.

FREQ wtmec4yr;
run ;

Using the freq option with the appropriate sample weight yields an estimate of the standard deviation whose denominator is the estimated population size. In this example, the 4-year examination weight (wtmec4yr) is used.

WARNING

The freq option, with the appropriate sample weight, yields an estimate of the standard deviation whose denominator is an estimate of the population size, i.e., the sum of the sample weights. Using the weight option instead of the freq option yields an estimate of the standard error whose denominator is the sample size.

Step 2: Check output of descriptive statistics

The univariate procedure generates extensive descriptive statistics, including moments, percentiles, extremes, missing values, basic statistical measures, and tests for location. Below is a snapshot from the extensive output of the SAS program which shows the result of using the plot and normal options.

  • The output is arranged by gender and age group so you can see the results for each combination.
  • The standard deviation is a measure of the deviation of the observations from the mean.
  • Kurtosis is a measure of the peakedness of the distribution. For SAS, the kurtosis of a normally distributed random variable is 0. A kurtosis greater than 0, as in this example, indicates excess values close to the mean and at the tails of the distribution.
  • Skewness is a measure of the departure of the distribution of a random variable from symmetry. The skewness of a normally distributed random variable is 0.
  • The standard error of the mean is not correctly calculated and will not be used in this example.
  • The output also contains the five lowest and highest values, which are useful for review.
  • The histogram for a normally distributed random variable is symmetric and bell-shaped. For variables based on data collected in a survey, such as NHANES 1999-2002, the distribution will deviate at least slightly from normality. Note the one outlier on the upper tail of the distribution.
  • The variable of interest is plotted against a normally distributed random variable. The resulting plot is called a Q-Q plot. If the variable of interest is normally distributed a straight line intersecting the y axis at a 45 degree angle would be obtained. For this example note the outliers in the upper tail of this distribution.

Step 3: Request selective statistics and output results to SAS dataset

In some instances, you may not need all of the statistics generated by proc univariate. You can use proc univariate to select a few descriptive statistics and output the results to a SAS dataset to view.

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

SAS univariate procedure for displaying selected statistics
Statements Explanation
proc sort data=analysis_data;
by riagendr age;
run ;

Use the sort procedure to sort data by the same variables that will be used in the by statement of the univariate procedure. In the example, the data are sorted by gender (riagendr) and age (age).

PROC UNIVARIATE NOPRINT;

Use the univariate procedure to generate descriptive statistics. Use the noprint option to suppress the detailed default descriptive statistics.

where ridageyr >= 20 ;

Use the where statement to select those 20 years and older.

by riagendr age;

The by statement determines the groups (all combinations of the variables listed on the by statement) for which separate descriptive statistics will be produced. This statement should match the by statement in the sort procedure preceding it.

VAR lbxtc;

Use the var statement to indicate variable(s) for which descriptive measures are requested. In this example, the total cholesterol variable (lbxtc) is used.

FREQ wtmec4yr;

Using the freq option with the appropriate sample weight yields an estimate of the standard deviation whose denominator is the estimated population size. In this example, the 4-year examination weight (wtmec4yr) is used.

WARNING

The freq option, with the appropriate sample weight, yields an estimate of the standard deviation whose denominator is an estimate of the population size, i.e., the sum of the sample weights. Using the weight option instead of the freq option yields an estimate of the standard error whose denominator is the sample size.

OUTPUT out= SASdataset mean=mean Q1=p_25 median=median Q3=p_75;
run ;

Use the output statement to write the results to the new SAS dataset, SASdataset, which will contain the statistics of interest. The requested statistics are labeled with the names given after the equal sign. In this example, the mean and the 25th, 50th, and 75th percentiles are requested. (For a complete list of statistics that can be requested, see the proc univariate entry in the SAS manual.)

proc print DATA=SASdataset;
run ;

Use proc print to view the results in the new SAS dataset, SASdataset.

Step 4: Check output of selective statistics

The output is sent to a SAS dataset, which is printed to view. See results below. Note that the new SAS dataset contains only the statistics requested on the output statement.

  • Because this example used the noprint option, there is only one page of output with the requested statistics — mean, 25th percentile, median, and 75th percentile.

Task 1c: How to Check Frequency Distribution and Normality in Stata

The frequency distribution can be presented in table or graphic format. In this task, you will learn how to use the standard Stata commands - summarize, histogram, graph box, and tabstat - to generate these representations of data distributions. These statistics can also be used to determine whether parametric (for a normal distribution) or non-parametric tests are appropriate to use in your analysis. As noted in the Clean & Recode Data module it is advisable to check for extreme weights and outliers before starting any analysis.

WARNING

There are several things you should be aware of while analyzing NHANES data with Stata. Please see the Stata Tips page to review them before continuing.

Step 1: Use the summarize command to generate weighted summary statistics for a population subset

The Stata command, summarize, generates descriptive and summary statistics that are useful in describing the characteristics of a distribution. Because the SVY series of commands do not include the summarize command, you will need to use the standard summarize command, but tell Stata to incorporate weights. Below are instructions on how to write these commands and interpret the output.

This command has the general structure:

summarize varname [w=weightvar], detail

IMPORTANT NOTE

Without the detail option you just get obs, mean, std. dev., minimum and maximum.

You can generate summary statistics for various population subsets (e.g. young men, young women, etc). The example below adds the by varname: prefix to the previous example to create this general format.

by var1 var2, sort: sum varname [w=weightvar] if (condition), detail

Here is the command to generate the summary statistics for six population subsets defined by gender (riagendr) and three age categories (age). The command also includes an if statement, which further restricts the analysis to persons aged 20 years and older (ridageyr >= 20) who have been both interviewed and examined (ridstatr==2).

by riagendr age, sort : sum lbxtc [w = wtmec4yr] if (ridageyr >=20 & ridageyr <.) & ridstatr==2, detail

IMPORTANT NOTE

Stata represents missing numeric values (".") as large positive values. Therefore, a missing numeric value would be the highest value, which is why the condition above also requires ridageyr <. to exclude missing ages. Please see the Stata Tips page for more information.

Portion of Output from Example Stata Statement


Reviewing the output, notice that

  • The output is arranged by gender and age group so you can see the results for each combination.
  • The standard deviation is a measure of the deviation of the observations from the mean.
  • Kurtosis is a measure of the peakedness of the distribution. For Stata, the kurtosis of a normally distributed random variable is 3. A kurtosis greater than 3, as in this example, indicates excess values close to the mean and at the tails of the distribution.
  • Skewness is a measure of the departure of the distribution of a random variable from symmetry. The skewness of a normally distributed random variable is 0.
  • The output also contains the four lowest and highest values, which are useful for review.
Stata Non-Survey Command for Descriptive Statistics
Statements Explanation
use "C:\Stata\tutorial\analysis_data.dta", clear
Use the use command to load the Stata-format dataset. Use the clear option to replace any data in memory.

by riagendr age, sort : summarize lbxtc [aweight = wtmec4yr] if (ridageyr >=20 & ridageyr <.) & ridstatr==2, detail
Use the sort option with the by prefix to sort and display the data by gender (riagendr) and age (age). Use the summarize command to generate univariate summary statistics (number of observations, sum of weights, mean, standard deviation) for the total cholesterol variable (lbxtc), for those who are 20 years and older and have been both interviewed and examined (ridstatr==2). Use the [aweight=] option to account for the NHANES sampling weights (obtain survey weighted estimates). In this example, the MEC weight for four years of data [aweight=wtmec4yr] is used. Note that here the aweight option simply carries the sampling weights; it is NOT used in Stata's usual sense of weights that are inversely proportional to the variance of an observation.
histogram lbxtc if (ridageyr >=20 & ridageyr <.) & ridstatr==2, by(riagendr age) normal
Use the histogram command to draw a histogram of the total cholesterol variable (lbxtc) for a select subpopulation (ages 20 and over). Use the by() option to draw a separate histogram for each gender and age group, and the normal option to overlay each histogram with a normal density. Note that the if condition comes before the comma and all options follow it.
graph box lbxtc [pweight = wtmec4yr] if (ridageyr >=20 & ridageyr <.) & ridstatr==2, medtype(line) over(riagendr) over(age)
Use the graph box command to draw box plots of the total cholesterol data, by gender and age, for those who are 20 years and older and have been both interviewed and examined (ridstatr==2). Use the [pweight=] option to account for the unequal probability of sampling and non-response. In this example, the MEC weight for four years of data (wtmec4yr) is used. Use the medtype option to specify how the median is drawn in the box.

Step 2: Generate histograms and box plots

To generate graphs of the distributions of a continuous variable, use the histogram and graph box commands.

In this example, the general structure of the histogram command is:

histogram varname if (condition), by(var1 var2) [options]

In this example, the general structure of the graph box command, including the medtype() option to specify how the median is indicated and the over() option to specify different subgroups, is:

graph box varname [w=weightvar] if (condition), medtype(line) over(var1) over(var2)

The commands to generate histograms and box plots for six population subsets defined by gender (riagendr) and three age categories (age) are below. The commands also include if conditions, which further restrict the analysis to persons aged 20 years and older (ridageyr >=20 & ridageyr <.) who have been both interviewed and examined (ridstatr==2). In addition, the histogram command uses the normal option to add a normal density to the graph.

histogram lbxtc if (ridageyr >=20 & ridageyr <.) & ridstatr==2, by(riagendr age) normal
graph box lbxtc [pweight = wtmec4yr] if (ridageyr >=20 & ridageyr <.) & ridstatr==2, medtype(line) over(riagendr) over(age)
Output from Histogram Statement


Output from Graph Box statement


Reviewing the output of these commands, notice that:

  • The histogram for a normally distributed random variable is symmetric and bell-shaped. For variables based on data collected in a survey, such as NHANES 1999-2002, the distribution will deviate at least slightly from normality.
  • The box plot of the weighted total cholesterol data shows three outliers with values above 600 mg/dL.

Step 3: Use tabstat to request selective statistics

In some instances, you may not need all of the statistics generated by summarize. You can use the tabstat command as a useful alternative to summarize because it allows specification of the statistics to be displayed.

The general structure for the tabstat command is very similar to the summarize command, but you can specify the statistics you want. Using the tabstat command also arranges the output in a table.

tabstat varname [w=weightvar], statistics(statname)

Here is the same cholesterol (lbxtc) analysis for six population subsets defined by gender (riagendr) and three age categories (age). The command also includes an if condition, which further restricts the analysis to persons aged 20 years and older (ridageyr >=20 & ridageyr <.) who have been both interviewed and examined (ridstatr==2). This time only the mean, 25th percentile (p25), median, and 75th percentile (p75) are reported.

by riagendr: tabstat lbxtc [w=wtmec4yr] if (ridageyr >=20 & ridageyr <.) & ridstatr==2, by(age) stat(mean p25 median p75)
Output from Example Tabstat Command


Note that there are two tables - one for each gender with three age categories and that only the statistics requested by the statistics option are displayed.

Percentiles

IMPORTANT NOTE

Although SAS 9.1 and Stata have commands for calculating estimates of weighted percentiles, they do not have commands to directly produce standard errors for the percentiles. So this tutorial will not provide sample programs in SAS 9.1 and Stata for percentiles and their standard errors. In SAS 9.2 Survey Procedures, variance estimation for percentiles using the Woodruff method is available. See the SAS 9.2 documentation for information on using this method.

The rank or percentile rank of a raw score is the percentage of individuals in the distribution with scores at or below that particular score. When a raw score is identified by its percentile rank, the score is called a percentile. In mathematical terms, for a proportion p between 0 and 1, the (100p)th percentile is a value, Y(p), such that at most (100p)% of the measurements are less than this value and at most 100(1 - p)% are greater.
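The definition can be illustrated with a simple weighted percentile in Python. This sketch returns the smallest observed value whose cumulative share of the total weight reaches p; it does not interpolate and is not the estimator used by SUDAAN, SAS, or Stata. The function name weighted_percentile and the example values and weights are hypothetical.

```python
def weighted_percentile(values, weights, p):
    """Smallest value whose cumulative weight share reaches p (0 < p <= 1)."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= p * total:
            return value
    return pairs[-1][0]  # guard against floating-point round-off


# Hypothetical cholesterol values and sample weights.
values = [150, 180, 200, 220, 260]
weights = [10, 20, 40, 20, 10]

print(weighted_percentile(values, weights, 0.25))
print(weighted_percentile(values, weights, 0.50))
print(weighted_percentile(values, weights, 0.75))
```

Because the weights enter through the cumulative sum, an observation that represents more people moves the percentile more than one that represents few.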

Percentiles are useful because raw scores, or X values, do not provide enough information by themselves. For example, if you are told that a boy is 27 inches tall and weighs 30 pounds, you may not be able to tell how well the boy is doing. You need additional information such as the average score of his age group, or the number of boys who score above or below this boy in his group. To determine the relative position of the boy's measurements in his group, you need to transform the raw scores into percentiles in order to compare. Therefore, it is much more informative if you could transform the height and weight of the boy into percentile rank, such as 75th percentile in height, and 50th percentile in weight in his age group.

In summary, percentiles provide additional information about the distribution of values. Percentiles represent the relative position of the measured values within a distribution.

Task 2: How to Generate Percentiles in SUDAAN

In this example, you will use SAS-callable SUDAAN to generate percentiles and standard errors for total cholesterol levels of persons 20 years and older by sex and age group.

Step 1: Sort data

To calculate the percentiles and standard errors, you will use SAS-callable SUDAAN because this software takes into account the complex survey design of NHANES data when determining variance estimates. The data from analysis_Data must be sorted by strata first and then PSU (unless the data have already been sorted by PSU within strata). The SAS proc sort statement must precede the SUDAAN statements.

WARNING

The design variables, sdmvstra and sdmvpsu, are provided in the demographic data files and are used to calculate variance estimates. Before you call SUDAAN into SAS, the data must first be sorted by these variables.

Step 2: Use proc descript to generate percentiles in SUDAAN

The SUDAAN procedure proc descript is used to generate percentiles and standard errors. These estimates are requested on the print statement along with the sample size (nsum). The general program for obtaining weighted percentiles and standard errors is below.

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

Generate Percentiles in SUDAAN
Statements Explanation
PROC SORT DATA =analysis_data;
BY  sdmvstra sdmvpsu ;
RUN ;

Use the proc sort procedure to sort the dataset by strata (sdmvstra) and PSU (sdmvpsu). The data statement refers to the dataset, analysis_Data.

proc descript data=analysis_data design=wr;

Use the proc descript procedure to generate percentiles and specify the sample design using the design option WR (with replacement).

subpopn ridageyr >= 20 ;

Use the subpopn statement to select the sample persons 20 years and older (ridageyr >=20) because only those individuals are of interest in this example. Please note that for accurate estimates, it is preferable to use subpopn in SUDAAN to select a subpopulation for analysis, rather than select the study population in the SAS program while preparing the data file.

NEST sdmvstra sdmvpsu;

Use the nest statement with strata (sdmvstra) and PSU (sdmvpsu) to account for the design effects.

weight wtmec4yr;

Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for 4 years of data (wtmec4yr) is used.

subgroup riagendr age ;

Use the subgroup statement to list the categorical variables for which statistics are requested. This example uses gender (riagendr) and age (age). These variables will also appear in the table statement.

levels 2 3 ;

Use the levels statement to define the number of categories in each of the subgroup variables. The level must be an integer greater than 0. This example uses two genders and three age groups.

var lbxtc;

Use the var statement to name the variable(s) to be analyzed. In this example, the total cholesterol variable (lbxtc) is used.

percentile 5 25 50 75 95 ;

Use the percentile statement to request select percentiles.

table riagendr * age;

Use the table statement to specify cross-tabulations for which estimates are requested. If a table statement is not present, a one-dimensional distribution is generated for each variable in the subgroup statement. In this example, the estimates are for gender (riagendr) by age (age).

PRINT
nsum= "Sample Size"
qtile= "Quantile"
nohead notime
style=nchs
nsumfmt= F7.0
qtilefmt= F9.2 ;

Use the print statement to assign names, format the statistics desired, and view the output. If the statement print is used alone, all of the default statistics are printed with default labels and formats.

In this example, the sample size (nsum) and quantile (qtile) are requested.

Note: For a complete list of statistics that can be requested on the print statement see SUDAAN Users Manual.

Use the style option equal to NCHS to produce output which parallels a table style used at NCHS.

rtitle "Percentiles of total cholesterol by sex and age: NHANES 1999-2002" ;

Use the rtitle statement to assign a heading for each page of output.

Step 3: Review output

The output will list the sample sizes, percentiles and their standard errors.

  • Reviewing the output of the program, note that 50% of the sampled population has a total cholesterol measurement less than the 50th percentile and 50% of the sampled population has a total cholesterol measurement greater than the 50th percentile.

Means

Means are measures of central tendency. In this section, you will learn about three types of means:

  • arithmetic,
  • weighted arithmetic, and
  • geometric.

Arithmetic Means

The finite population mean of X1, X2, …, XN is defined as the sum of the values Xi divided by the population size N. Typically, in a non-survey setting, the population mean is estimated by taking a simple random sample of the finite population, x1, x2, …, xn, summing the values and dividing by the sample size n.

Equation for Arithmetic Mean

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

This is often referred to as the arithmetic mean. On average, over repeated simple random samples, the arithmetic mean would be expected to equal the population mean.

Weighted arithmetic means

For NHANES 1999-2002 a sample weight, wi, is associated with each sample person. The sample weight is a measure of the number of people in the population represented by that person. For more information on sample weights, please see the Weighting module. To obtain an unbiased estimate of the population mean, based on data from the NHANES 1999-2002 sample, it is necessary to take a weighted arithmetic mean.

Equation for Weighted Arithmetic Mean

\[ \bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]
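The difference between the unweighted and weighted arithmetic means can be seen in a short Python sketch. The function names and the three values and weights are hypothetical; the point is only that a person with a large sample weight pulls the weighted mean toward his or her value.

```python
def arithmetic_mean(values):
    """Sum of the values divided by the sample size."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Hypothetical total cholesterol values and sample weights.
cholesterol = [180.0, 200.0, 250.0]
weights = [3000.0, 1000.0, 1000.0]

print(arithmetic_mean(cholesterol))
print(weighted_mean(cholesterol, weights))
```

Here the first person represents three times as many people as each of the others, so the weighted mean lies closer to 180 than the unweighted mean does.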

Geometric Means

In instances where the data are highly skewed, geometric means can be used. A geometric mean, unlike an arithmetic mean, reduces the effect of very high or low values, which could bias the mean if a straight average (arithmetic mean) were calculated. The geometric mean is the N-th root of the product of N positive numbers; equivalently, it is obtained by averaging the log-transformed values and then exponentiating the result.
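As a small unweighted illustration, the Python sketch below computes a geometric mean through the log transformation. The function name geometric_mean and the example values are hypothetical, and all values are assumed to be positive, since the logarithm is not defined otherwise.

```python
import math


def geometric_mean(values):
    """Exponentiate the average of the natural logs (values must be > 0)."""
    return math.exp(sum(math.log(x) for x in values) / len(values))


# Hypothetical right-skewed measurements.
values = [1.0, 10.0, 100.0]

arithmetic = sum(values) / len(values)
geometric = geometric_mean(values)
print(arithmetic, round(geometric, 6))
```

The single large value dominates the arithmetic mean, while the geometric mean stays near the middle value.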

Task 3a: How to Generate Means Using SUDAAN

In this example, you will use SAS-callable SUDAAN to generate tables of means and standard errors for average cholesterol levels of persons 20 years and older by sex and age group.

Step 1: Sort data

To calculate the means and standard errors, you will use SAS-callable SUDAAN because this software takes into account the complex survey design of NHANES data when determining variance estimates. Note that if standard errors are not needed, you can simply use a SAS procedure, i.e., proc means with the weight statement to calculate means. The data from analysis_Data must be sorted by strata first and then PSU (unless the data have already been sorted by PSU within strata). The SAS proc sort statement must precede the SUDAAN statements.

WARNING

The design variables, sdmvstra and sdmvpsu, are provided in the demographic data files and are used to calculate variance estimates. Before you call SUDAAN into SAS, the data must be sorted by these variables.

Step 2: Use proc descript to generate means in SUDAAN

The SUDAAN procedure, proc descript, is used to generate means and standard errors. The print statement is used to output those estimates along with the sample size (nsum), i.e., the number of survey participants with known values for the variable of interest. The general program for obtaining weighted means and standard errors is below.

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

Generate Means in SUDAAN
Statements Explanation
PROC SORT DATA =analysis_data;
BY  sdmvstra sdmvpsu ;
RUN ;

Use the proc sort procedure to sort the dataset by strata (sdmvstra) and PSU (sdmvpsu). The data statement refers to the dataset, analysis_Data.

proc descript data=analysis_data design=wr;

Use the proc descript procedure to generate means and specify the sample design using the design option WR (with replacement).

subpopn ridageyr >= 20 ;

Use the subpopn statement to select the sample persons 20 years and older (ridageyr >=20) because only those individuals are of interest in this example. Please note that for accurate estimates, it is preferable to use subpopn in SUDAAN to select a subpopulation for analysis, rather than select the study population in the SAS program while preparing the data file.

NEST sdmvstra sdmvpsu;

Use the nest statement with strata (sdmvstra) and PSU (sdmvpsu) to account for the design effects.

weight wtmec4yr;

Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for four years of data (wtmec4yr) is used.

subgroup riagendr age ;

Use the subgroup statement to list the categorical variables for which statistics are requested. This example uses gender (riagendr) and age (age). These variables also appear in the table statement.

levels  2 3 ;

Use the levels statement to define the number of categories in each of the subgroup variables. The level must be an integer greater than 0. This example uses two genders and three age groups.

var lbxtc;

Use the var statement to name the variable(s) to be analyzed. In this example, the total cholesterol variable (lbxtc) is used.

table riagendr * age;

Use the table statement to specify cross-tabulations for which estimates are requested. If a table statement is not present, a one-dimensional distribution is generated for each variable in the subgroup statement. In this example the estimates are for gender (riagendr) by age (age).

PRINT
nsum= "Sample Size"
mean= "Mean"
semean= "Standard Error" 
nohead notime 
style=nchs
nsumfmt= F7.0
meanfmt= F9.2
semeanfmt= F9.3 ;

Use the print statement to assign names, format the statistics desired, and view the output. If the statement print is used alone, all of the default statistics are printed with default labels and formats.

In this example, the sample size (nsum), mean (mean), and standard error of the mean (semean) are requested.

Note: For a complete list of statistics that can be requested on the print statement see SUDAAN Users Manual.

Use the style option equal to NCHS to produce output that parallels a table style used at NCHS.

rtitle "Means of total cholesterol and standard errors by sex and age: NHANES 1999-2002" ;

Use the rtitle statement to assign a heading for each page of output.

run ;

The run statement signifies the end of the program.
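The roles of the design, nest, weight, and subpopn statements can be sketched in Python. This is a minimal illustration of the general Taylor linearization formula for a weighted mean under a with-replacement design, not SUDAAN's implementation; the function name survey_mean, the record layout, and the toy data are assumptions made for the example. Note how records outside the subpopulation still contribute their PSU to the variance calculation (with a value of zero), which is the reason for selecting the subpopulation inside the procedure rather than deleting records beforehand.

```python
import math
from collections import defaultdict


def survey_mean(records, in_domain=lambda r: True):
    """Weighted mean and linearized standard error.

    records: dicts with keys "stratum", "psu", "weight", "y".
    in_domain: function selecting the subpopulation of interest.
    """
    domain = [r for r in records if in_domain(r) and r["y"] is not None]
    total_w = sum(r["weight"] for r in domain)
    mean = sum(r["weight"] * r["y"] for r in domain) / total_w

    # Sum the linearized values within each PSU. Records outside the domain
    # add zero, so their PSUs are still counted in the design.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for r in records:
        z = 0.0
        if in_domain(r) and r["y"] is not None:
            z = r["weight"] * (r["y"] - mean) / total_w
        psu_totals[r["stratum"]][r["psu"]] += z

    # Between-PSU variance within each stratum, summed over strata.
    variance = 0.0
    for totals in psu_totals.values():
        n_psu = len(totals)
        if n_psu < 2:
            continue  # assumption: a single-PSU stratum adds no variance here
        z_bar = sum(totals.values()) / n_psu
        variance += n_psu / (n_psu - 1) * sum((z - z_bar) ** 2 for z in totals.values())
    return mean, math.sqrt(variance)


# Hypothetical toy sample: two strata with two PSUs each, all weights equal to 1.
toy = [
    {"stratum": 1, "psu": 1, "weight": 1.0, "y": 1.0},
    {"stratum": 1, "psu": 1, "weight": 1.0, "y": 3.0},
    {"stratum": 1, "psu": 2, "weight": 1.0, "y": 5.0},
    {"stratum": 2, "psu": 1, "weight": 1.0, "y": 2.0},
    {"stratum": 2, "psu": 2, "weight": 1.0, "y": 4.0},
]
mean, se = survey_mean(toy)
print(mean, se)
```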

Step 3: Review output

The output will list the sample sizes, means, and their standard errors.

  • The output shows the sample size, mean, and standard error sorted into total, male and female groups with age subgroups.
  • Also notice that the mean for each group is very near the median results (50th percentile) from the descriptive program in Task 1.

Step 4: Use proc descript to generate geometric means

If you need to generate geometric means instead of arithmetic means, indicate this with the geometric option on the proc descript statement, as shown below.

WARNING

The example below is for illustrative purposes only. Geometric means are not recommended for use with normally distributed data, such as the analysis_Data dataset.

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

Generate Geometric Means in SUDAAN
Statements Explanation
PROC SORT DATA =analysis_data;
BY  sdmvstra sdmvpsu ;
RUN ;

Use the proc sort procedure to sort the dataset by strata (sdmvstra) and PSU (sdmvpsu). The data statement refers to the dataset, analysis_data.

proc descript data=analysis_data  geometric  design=wr ;

Use the proc descript procedure to generate means and specify geometric as an option to compute geometric means. Specify the sample design using the design option WR (with replacement).

subpopn ridageyr >= 20 ;

Use the subpopn statement to select sample persons 20 years and older (ridageyr >=20) because only those individuals are of interest in this example. Please note that for accurate estimates, it is preferable to use subpopn in SUDAAN to select a subpopulation for analysis, rather than select the study population in the SAS program while preparing the data file.

NEST sdmvstra sdmvpsu;

Use the nest statement with strata (sdmvstra) and PSU (sdmvpsu) to account for the design effects.

weight wtmec4yr;

Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for 4 years of data (wtmec4yr) is used.

subgroup riagendr age ;

Use the subgroup statement to list the categorical variables for which statistics are requested. This example uses gender (riagendr) and age (age). These variables will also appear in the table statement.

levels 2 3 ;

Use the levels statement to define the number of categories in each of the subgroup variables. The level must be an integer greater than 0. This example uses two genders and three age groups.

var lbxtc;

Use the var statement to name the variable(s) to be analyzed. In this example, the total cholesterol variable (lbxtc) is used.

table riagendr * age;

Use the table statement to specify cross-tabulations for which estimates are requested. If a table statement is not present, a one-dimensional distribution is generated for each variable on the subgroup statement. This example uses the estimates for gender (riagendr) by age (age).

PRINT
nsum= "Sample Size"
geomean= "Geometric Mean"
segeomean= "Standard Error"
/ 
nohead notime 
style=nchs nsumfmt= F7.0
geomeanfmt= F9.2
segeomeanfmt= F9.3 ;
output nsum geomean segeomean;

Use the print statement to assign names, format the statistics desired, and view the output. If the statement print is used alone, all of the default statistics are printed with default labels and formats.

In this example, the sample size (nsum), geometric mean (geomean), and standard error of the geometric mean (segeomean) were requested.

Note: For a complete list of statistics that can be requested on the print statement see SUDAAN Users Manual.

Use the style option equal to NCHS to produce output that parallels a table style used at NCHS.

rtitle  "Geometric means of total cholesterol and standard errors by sex and age: NHANES 1999-2002" ;
run ;

Use the rtitle statement to assign a title (heading) to each page of output.

Task 3b: How to Generate Means Using SAS Survey Procedures

In this example, you will use SAS Survey Procedures to generate tables of means and standard errors for average cholesterol levels of persons 20 years and older, by gender and age.

Step 1: Create Variable to Subset Population

In order to subset the data in SAS Survey Procedures, you will need to create a variable for the population of interest. In this example, the sel variable is set to 1 if the sample person is 20 years or older, and 2 if the sample person is younger than 20 years. Then this variable is used in the domain statement to specify the population of interest (those 20 years and older).

if ridageyr GE 20 then sel = 1;
else sel = 2;

Step 2: Use proc surveymeans to generate means in SAS Survey Procedures

The SAS procedure, proc surveymeans, is used to generate means and standard errors. The general program for obtaining weighted means and standard errors is below.

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

Generate Means in SAS Survey Procedures
Statements Explanation
proc surveymeans data=ANALYSIS_DATA nobs mean stderr;

Use the proc surveymeans procedure to obtain number of observations, mean, and standard error.

stratum sdmvstra;

Use the stratum statement to define the strata variable (sdmvstra).

cluster sdmvpsu;

Use the cluster statement to define the PSU variable (sdmvpsu).

class riagendr age;

Use the class statement to specify the discrete variables used to form the subpopulations of interest. In this example, the subpopulations of interest are defined by gender (riagendr) and age (age).

var lbxtc;

Use the var statement to name the variable(s) to be analyzed. In this example, the total cholesterol variable (lbxtc) is used.

weight wtmec4yr;

Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for four years of data (wtmec4yr) is used.

domain sel sel*riagendr*age;

Use the domain statement to specify the subpopulations of interest.

ods output domain(match_all)=domain;
run ;

Use the ods statement to output the datasets of estimates for the domains listed on the domain statement above. This set of commands will output two datasets, one for each domain specified in the domain statement (domain for sel; domain1 for sel*riagendr*age).

data all;
set domain domain1;
if sel='Age ge 20' ;
run ;

Use the data statement to name the temporary SAS dataset (all), append the two datasets created in the previous step, and keep only the estimates for persons 20 years and older (sel).

proc print noobs data =all split = '/';
var riagendr age N mean stderr;
format n 5.0 mean 4.2 stderr 4.2 ;
label N = 'Sample'/'Size'
stderr='Standard'/'error'/'of the' / 'mean'
mean='Mean';
title1 'Mean serum total cholesterol of adults 20 years and older, 1999-2002' ;
run ;

Use the print statement to print the number of observations, the mean, and standard error of the mean in a printer-friendly format.

Step 3: Review output

The output lists the sample sizes, means and their standard errors.

  • Reviewing the output, note that the mean for the total sample population for SAS Survey is the same as the mean reported in SUDAAN.
  • Looking further at the output, you will find the table that breaks down the genders by age group. These means are very similar to the medians reported in the descriptive statistics program in Task 1.
  • As in the SUDAAN program output, the SAS Survey output shows that the age 40-59 group has the highest mean cholesterol for the males, and the age 60+ female group has the highest mean for all groups.

Task 3c: How to Generate Means Using Stata

In this example, you will use Stata to generate tables of means and standard errors for average cholesterol levels of persons 20 years and older by sex and age. That example is followed by an example of calculating geometric means.

WARNING

There are several things you should be aware of while analyzing NHANES data with Stata. Please see the Stata Tips page to review them before continuing.

Step 1: Use svyset to define survey design variables

Remember that you need to define the SVYSET before using the SVY series of commands. The general format of this command is below:

svyset [w=weightvar], psu(psuvar) strata(stratavar) vce(linearized)

To define the svyset for your cholesterol analysis, use the weight variable for four years of MEC data (wtmec4yr), the PSU variable (sdmvpsu), and the strata variable (sdmvstra). The vce option specifies the method for calculating the variance; the default is "linearized", which is Taylor linearization. Here is the svyset command for four years of MEC data:

svyset [w= wtmec4yr], psu( sdmvpsu) strata(sdmvstra) vce(linearized)

Step 2: Use svy:mean to generate means and standard errors in Stata

Now that the svyset has been defined, you can use the Stata command, svy: mean, to generate means and standard errors. The general command for obtaining weighted means and standard errors of a subpopulation is below.

svy: mean varname, subpop(if condition)

Here is the command to generate the mean cholesterol (lbxtc) for the subpopulation of adults aged 20 years and older (ridageyr>=20 & ridageyr <.):

svy: mean lbxtc, subpop(if ridageyr >=20 & ridageyr <.)
Output of Example survey:means Statement

Step 3: Use over option of svy:mean command to generate means and standard errors for different subgroups in Stata

You can also add the over() option to the svy:mean command to generate the means for different subgroups. When you do this, you can type a second command, estat size, to have the output display the subgroup observation numbers. Here is the general format of these commands for this example:

svy: mean varname, subpop(if condition) over(var1 var2)
estat size

The prefix quietly before any svy command suppresses the output of that command on the screen. In the following example, the first command is run "quietly"; the second command is executed to show the mean, standard error, and number of observations in each category. Below is the command to generate the mean cholesterol (lbxtc) for the subpopulation of adults aged 20 years and older (ridageyr>=20 & ridageyr <.) by gender (riagendr).

quietly svy: mean lbxtc, subpop(if ridageyr>=20 & ridageyr <. ) over(riagendr)
estat size
Output of svy:mean With over Option

Additionally, the over option can take multiple variables. To generate means for the six gender-age groups you will need to add the age variable to the over option, as in the example below.

quietly svy: mean lbxtc, subpop(if ridageyr>=20 & ridageyr <.) over(riagendr age)
estat size
Output of svy:mean With over Option by Gender and Age

The output will list the sample sizes, means, and their standard errors for each of the six gender-age groups.

  • The output shows the sample size, mean, and standard error sorted into total, male and female groups with age subgroups.
  • Also notice that the mean for each group is very near the median results (50th percentile) from the descriptive program in Task 1.
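The subgroup estimates requested with the over() option, together with the observation counts shown by estat size, can be mimicked in a short Python sketch. This shows only the point estimates and counts (not the standard errors); the function name weighted_means_by_group and the four rows of data are hypothetical.

```python
from collections import defaultdict


def weighted_means_by_group(rows, group_keys, value_key, weight_key):
    """Return {group: (number of observations, weighted mean)}."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # [sum of w*x, sum of w, count]
    for row in rows:
        if row[value_key] is None:
            continue  # skip persons with a missing value
        acc = sums[tuple(row[k] for k in group_keys)]
        acc[0] += row[weight_key] * row[value_key]
        acc[1] += row[weight_key]
        acc[2] += 1
    return {key: (n, wx / w) for key, (wx, w, n) in sums.items()}


# Hypothetical rows using the tutorial's variable names.
rows = [
    {"riagendr": 1, "age": 1, "lbxtc": 190.0, "wtmec4yr": 2.0},
    {"riagendr": 1, "age": 1, "lbxtc": 210.0, "wtmec4yr": 2.0},
    {"riagendr": 2, "age": 1, "lbxtc": 180.0, "wtmec4yr": 1.0},
    {"riagendr": 2, "age": 1, "lbxtc": 220.0, "wtmec4yr": 3.0},
]
result = weighted_means_by_group(rows, ["riagendr", "age"], "lbxtc", "wtmec4yr")
for group, (n, mean) in sorted(result.items()):
    print(group, n, mean)
```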

Step 4: Use svy:mean to generate geometric means

If you need to generate geometric means instead of arithmetic means, you would first log transform the variable of interest. Then, use the svy:mean command to obtain the mean of the transformed variable. Finally, display the exponentiated form of the variable. The general format of these commands is:

generate ln_varname=ln(varname)
quietly svy: mean ln_varname, subpop(if condition) over(var1)
ereturn display, eform(geo_mean)

To generate geometric means of the cholesterol variable for persons aged 20 years and older by gender using the previous dataset, you would need to run the following commands and options.

WARNING

The example below is for illustrative purposes only. Geometric means are not recommended for use with normally distributed data, such as the cholesterol variables in this dataset.

First, create a new variable which is equal to the natural log of the variable of interest. In this example, the variable of interest is the cholesterol variable (lbxtc).

generate ln_lbxtc=ln(lbxtc)

Then, estimate the mean of the log transformed cholesterol variable (ln_lbxtc) for persons aged 20 years and older (ridageyr>=20 & ridageyr <.) by gender (riagendr). The quietly prefix is used to suppress the output.

quietly svy: mean ln_lbxtc, subpop(if ridageyr>=20 & ridageyr <. ) over(riagendr)

Finally, display the output in original units. Stata lets you do this automatically with the eform(geo_mean) option of ereturn display, which displays the exponentiated coefficients for the mean, standard error, and 95% CI (i.e., it calculates e to the power of the mean of ln_lbxtc).

ereturn display, eform(geo_mean)
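The back-transformation in the last step can be sketched in Python. The sketch assumes the usual convention for exponentiated estimates: the point estimate and the confidence limits are exponentiated directly, and the standard error is approximated by the delta method. The function name back_transform, the default critical value of 1.96, and the input numbers are assumptions for illustration; Stata's survey commands base the critical value on the design degrees of freedom, so its limits will differ slightly.

```python
import math


def back_transform(log_mean, log_se, critical_value=1.96):
    """Convert a log-scale mean and standard error to original units."""
    geo_mean = math.exp(log_mean)
    return {
        "geo_mean": geo_mean,
        # Delta-method approximation of the standard error in original units.
        "se": geo_mean * log_se,
        # Confidence limits are computed on the log scale, then exponentiated.
        "ci_lower": math.exp(log_mean - critical_value * log_se),
        "ci_upper": math.exp(log_mean + critical_value * log_se),
    }


# Hypothetical log-scale estimate: mean of ln(lbxtc) and its standard error.
estimate = back_transform(math.log(200.0), 0.01)
print(estimate)
```

The resulting interval is not symmetric around the geometric mean, because it is symmetric on the log scale instead.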

Proportions

Proportions or prevalence estimates are very useful in epidemiological studies. For a national cross-sectional survey such as NHANES, you often need to generate a prevalence estimate of a particular disease, condition, or risk factor in the U.S. population. Prevalence estimates are also used to compare prevalence rates between different subgroups.

In this example, to determine the prevalence rate of high blood pressure in the U.S., you will identify persons who have high blood pressure according to the conventional health care definition set out by the Joint National Committee on Prevention, Detection, Evaluation, and Treatment of High Blood Pressure. According to the Committee, a person with hypertension is defined as either having elevated blood pressure (systolic pressure of at least 140 mmHg or diastolic of at least 90 mmHg) or taking antihypertensive medication.

Task 4a: How to Generate Proportions Using SUDAAN

In this example, you will look at the proportion of examined persons 20 years and older with measured high blood pressure by sex, age, and race-ethnicity.

Step 1: Determine variables of interest

According to the Joint National Committee on Prevention, Detection, Evaluation, and Treatment of High Blood Pressure, a person with hypertension is defined as either having elevated blood pressure (systolic pressure of at least 140 mmHg or diastolic of at least 90 mmHg) or taking antihypertensive medication. You will need to define a categorical variable (hbp) indicating persons with high blood pressure (1= high blood pressure; 2= no high blood pressure).
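The coding rule for hbp can be written out as a small Python function. This is only a sketch of the definition above: the function name code_hbp and its three arguments are hypothetical (they are not NHANES variable names), and returning a missing value when hypertension cannot be ruled out is an assumption rather than something the tutorial specifies.

```python
def code_hbp(systolic, diastolic, on_medication):
    """Return 1 (high blood pressure), 2 (no high blood pressure), or None.

    systolic, diastolic: blood pressure in mmHg, or None if not measured.
    on_medication: True/False for antihypertensive medication, or None if unknown.
    """
    elevated = (systolic is not None and systolic >= 140) or (
        diastolic is not None and diastolic >= 90
    )
    if elevated or on_medication is True:
        return 1
    # Assumption: if any piece of information is missing and nothing so far
    # indicates hypertension, the status is left missing rather than coded 2.
    if systolic is None or diastolic is None or on_medication is None:
        return None
    return 2


print(code_hbp(150, 80, False))     # elevated systolic pressure
print(code_hbp(120, 80, True))      # normal readings, but on medication
print(code_hbp(120, 80, False))     # neither criterion met
print(code_hbp(None, None, False))  # not enough information
```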

Step 2: Sort data

To calculate the proportions and standard errors, use SAS-callable SUDAAN because the software takes into account the complex survey design of NHANES data when determining variance estimates. If the standard errors are not needed, you simply could use a SAS procedure, i.e., proc freq with the weight statement. The data from analysis_Data must be sorted by strata first and then PSU (unless the data have already been sorted by PSU within strata). The SAS proc sort statement must precede the SUDAAN statements.

WARNING

The design variables sdmvstra and sdmvpsu are provided in the demographic data files and are used to calculate variance estimates. Before you call SUDAAN into SAS, the data must be sorted by these variables.

Step 3: Use proc descript to generate proportions

In this example, you will use proc descript in SUDAAN to generate proportions. Previously, you created a categorical variable, hbp, to indicate whether or not a person had high blood pressure. That categorical variable will be identified in the procedure and the weighted percent (prevalence) of sample persons with the value hbp=1 (high blood pressure) will be estimated along with the standard error.

You can code your variables in this example in two possible ways. With the catlevel statement in SUDAAN, persons with high blood pressure, as defined above, are assigned a value of 1, and all other sample persons are assigned a value of 2. The weighted percentage of sample persons with a value equal to 1 is an estimate of the prevalence of high blood pressure in the U.S. An alternate method of coding the variables is to assign persons with high blood pressure, as defined above, a value of 100, and persons without high blood pressure a value of 0. The weighted mean of this variable, which is expressed as a percent, is an estimate of the prevalence of high blood pressure in the U.S. To see this method, which does not need the catlevel statement, in SAS Survey Procedures, see Task 4b: How to Generate Proportions using SAS Survey Procedures.
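That the two coding schemes lead to the same prevalence estimate can be checked with a short Python sketch. The function names and the four-person dataset are made up for illustration.

```python
def weighted_percent(codes, weights, level=1):
    """Weighted percent of persons whose code equals the chosen level."""
    known = [(c, w) for c, w in zip(codes, weights) if c is not None]
    numerator = sum(w for c, w in known if c == level)
    denominator = sum(w for _, w in known)
    return 100.0 * numerator / denominator


def weighted_mean(values, weights):
    """Weighted arithmetic mean of the values."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Hypothetical data: 1 = high blood pressure, 2 = no high blood pressure.
hbp = [1, 2, 2, 1]
weights = [10.0, 20.0, 30.0, 40.0]

# Alternate coding: 100 = high blood pressure, 0 = no high blood pressure.
hbpx = [100 if code == 1 else 0 for code in hbp]

print(weighted_percent(hbp, weights))  # percent with hbp = 1
print(weighted_mean(hbpx, weights))    # mean of the 100/0 variable
```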

The SUDAAN procedure, proc descript, is used to generate percents and standard errors. You request those estimates on the print statement along with the sample size (nsum). The general program for obtaining weighted percents and standard errors is shown below.

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

Generate Proportions in SUDAAN
Statements Explanation
PROC SORT DATA =analysis_data;
BY sdmvstra sdmvpsu ;
RUN ;

Use the proc sort procedure to sort the dataset by strata (sdmvstra) and PSU (sdmvpsu). The data statement refers to the dataset, analysis_data.

PROC descript data= analysis_data design=wr ;

Use the proc descript procedure to generate proportions and specify the sample design using the design option WR (with replacement).

subpopn ridageyr >=20 ;

Use the subpopn statement to select sample persons 20 years and older (ridageyr >=20) because only those individuals are of interest in this example. Please note that for accurate estimates, it is preferable to use subpopn in SUDAAN to select a subpopulation for analysis, rather than select the study population in the SAS program while preparing the data file.

NEST sdmvstra sdmvpsu;

Use the nest statement with strata (sdmvstra) and PSU (sdmvpsu) to account for the design effects.

weight wtmec4yr;

Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for 4 years of data (wtmec4yr) is used.

subgroup riagendr age race;

Use the subgroup statement to list the categorical variables for which statistics are requested. This example uses gender (riagendr), age (age), and race/ethnicity (race). These variables will also appear in the table statement.

levels 2 3 4 ;

Use the levels statement to define the number of categories in each of the subgroup variables. The level must be an integer greater than 0. This example uses two genders, three age groups, and four race/ethnicity categories.

var hbp;

Use the var statement to name the variable(s) to be analyzed. In this example, the high blood pressure variable (hbp) is used.

catlevel 1 ;

Use the catlevel statement to indicate that the variable(s) on the var statement are categorical and to select the level of each variable to be analyzed. This example indicates that the variable hbp is categorical and that the level to be analyzed is hbp=1, i.e., persons who have high blood pressure.

IMPORTANT NOTE

Note that the catlevel statement may be omitted if you code the variable as 100 equals has HBP and 0 equals does not have HBP.

table riagendr * age * race ;

Use the table statement to specify the cross-tabulations for which estimates are requested. In this example, the estimates are for gender (riagendr) by age (age) by race/ethnicity (race).

print nsum= "Sample Size"
percent="Percent"
sepercent="SE"
nohead notime
style=NCHS
nsumfmt=f8.0
percentfmt=f8.4
sepercentfmt=f8.4 ;

Use the print statement to assign names, format the statistics desired, and view the output. If the statement print is used alone, all of the default statistics are printed with default labels and formats.

In this example, sample size (nsum), percent (percent), and standard error of the percent (sepercent) are requested. The percent represents the proportion of persons with hbp=1 or with high blood pressure.

Note: For a complete list of statistics that can be requested on the print statement see SUDAAN Users Manual.

Use the style option equal to NCHS to produce output which parallels a table style used at NCHS.

rtitle "Prevalence of SPs with measured high blood pressure : NHANES 1999-2002" ;
run ;

Use the rtitle statement to assign a heading for each page of output.

Step 4: Review Output

The percents in the output are the proportions of sample persons with high blood pressure.

  • Reviewing the output, you will see tables for both genders, males only, and females only sorted by age group followed by race/ethnicity.
  • The "Other" race/ethnicity category is only included to complete the totals. It is not reported.
  • In the table for females, notice that the proportion of black females with high blood pressure is twice that of other races in the 20-39 years age group, and nearly twice that of other races in the 40-59 years age group.
  • Given the low proportion of high blood pressure in the 20-39 years age group, you will also want to consider using an arcsine or Clopper-Pearson transformation for standard error estimation.
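To give a sense of what the Clopper-Pearson approach mentioned in the last point does, here is a minimal Python sketch of the plain binomial (simple random sample) version, which finds the interval limits by bisection. It is not the procedure used by SUDAAN or any other package, and it does not account for the complex survey design; analyses of survey data commonly adapt the method, for example by substituting an effective sample size, which is not shown here. The function names and the example counts are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def solve(func, target, increasing, tol=1e-10):
    """Bisection on [0, 1] for a function that is monotone in p."""
    low, high = 0.0, 1.0
    while high - low > tol:
        mid = (low + high) / 2.0
        if (func(mid) < target) == increasing:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


def clopper_pearson(x, n, alpha=0.05):
    """Exact binomial confidence limits for x events among n persons."""
    # Lower limit: the p at which observing x or more events has probability alpha/2.
    lower = 0.0 if x == 0 else solve(
        lambda p: 1.0 - binom_cdf(x - 1, n, p), alpha / 2.0, increasing=True
    )
    # Upper limit: the p at which observing x or fewer events has probability alpha/2.
    upper = 1.0 if x == n else solve(
        lambda p: binom_cdf(x, n, p), alpha / 2.0, increasing=False
    )
    return lower, upper


# Hypothetical example: 3 persons with high blood pressure among 100.
print(clopper_pearson(3, 100))
```

Unlike an interval built from the estimate plus or minus a multiple of the standard error, these limits never fall below 0 or above 1, which is why such methods are attractive for small proportions.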

Task 4b: How to Generate Proportions using SAS Survey Procedures

In this example, you will be looking at the proportion of examined persons 20 years and older with measured high blood pressure, by sex, age, and race-ethnicity.

Step 1: Determine variables of interest

According to the Joint National Committee on Prevention, Detection, Evaluation, and Treatment of High Blood Pressure, a person with hypertension is defined as either having elevated blood pressure (systolic pressure of at least 140 mmHg or diastolic of at least 90 mmHg) or taking antihypertensive medication. You will need to define a categorical variable (hbpx) indicating persons with high blood pressure (100= high blood pressure; 0= no high blood pressure).

Step 2: Create Variable to Subset Population

In order to subset the data in SAS Survey Procedures, you will need to create a variable for the population of interest. In this example, the sel variable is set to 1 if the sample person is 20 years or older, and 2 if the sample person is younger than 20 years. Then this variable is used in the domain statement to specify the population of interest (those 20 years and older).

if ridageyr GE 20 then sel = 1;
else sel = 2;

Step 3: Use proc surveymeans to generate proportions and their standard errors in SAS Survey Procedures

In SAS Survey Procedures, persons with high blood pressure, as defined above, are assigned a value of 100, and persons without high blood pressure are assigned a value of 0. The weighted mean of this variable, which is expressed as a percent, is an estimate of the prevalence of high blood pressure in the U.S.

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

Generate Proportions in SAS Survey Procedures
Statements Explanation
ods trace on ;

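Use the ods trace on statement to write the names of the output objects that the procedure produces to the SAS log. These names are needed for the ods output statement below.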

proc surveymeans data=analysis_Data nobs mean stderr;

Use the proc surveymeans procedure to obtain number of observations, mean, and standard error.

stratum sdmvstra;

Use the stratum statement to define the strata variable (sdmvstra).

cluster sdmvpsu;

Use the cluster statement to define the PSU variable (sdmvpsu).

class riagendr age race;

Use the class statement to specify the discrete variables used to form the subpopulations of interest. In this example, the subpopulations of interest are defined by gender (riagendr), age (age), and race/ethnicity (race).

domain sel sel*riagendr*age*race;

Use the domain statement to specify the table layout to form the subpopulations of interest. This example uses age greater than or equal to 20 (sel) by gender (riagendr) by age (age) and by race/ethnicity (race).

var hbpx;

Use the var statement to name the variable(s) to be analyzed. In this example, the high blood pressure variable (hbpx) is used. If the sample person has high blood pressure, then the value equals 100. If the sample person does not have high blood pressure, then the value equals 0.

IMPORTANT NOTE

To estimate a percent in this way, the SAS Survey procedure, proc surveymeans, requires the variable to be coded as 100 and 0.

weight wtmec4yr;

Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for 4 years of data (wtmec4yr) is used.

ods output domain(match_all)=domain;
run;

Use the ods statement to output the datasets of estimates for the domains listed on the domain statement above. This set of commands will output two datasets, one for each domain specified in the domain statement (domain for sel; domain1 for sel*riagendr*age*race).

data all;
set domain domain1;
if sel='Age ge 20';
run;

Use the data statement to name the temporary SAS dataset (all), append the two datasets created in the previous step, and keep only the estimates for persons 20 years and older (sel).

proc print noobs data =all split = '/' ;
var riagendr age race N mean stderr ;
format n 5.0 mean 4.4 stderr 4.2 ;
label N = 'Sample' / 'size'
mean='Percent'
stderr='Standard' / 'error' / 'of the' / 'percent';
title1 'Percent of adults 20 years and older with high blood pressure, 1999-2002' ;
run ;

Use the print statement to print the number of observations, the mean, and standard error of the mean in a printer-friendly format.

Step 4: Review output

The percents in the output are the proportions of sample persons with high blood pressure:

  • Reviewing the output, you will see tables for both genders, males only, and females only, sorted by age group and then race/ethnicity.
  • The "Other" race/ethnicity category is only included to complete the totals. It is not reported.
  • In the table for females, notice that the proportion of black females with high blood pressure is twice that of the other races in the 20-39 years age group, and nearly twice that of the other races in the 40-59 years age group.
  • Given the low proportion of high blood pressure in the 20-39 years age group, you will also want to consider using an arcsine or Clopper-Pearson transformation for standard error estimation.

Task 4c: How to Generate Proportions using Stata

Stata software can be used to calculate proportions and standard errors for NHANES data because the software takes into account the complex survey design of NHANES data when determining variance estimates. If the standard errors are not needed, you could simply use a standard Stata command, e.g., proportion with weights. In this example, you will be looking at the proportion of examined persons 20 years and older with measured high blood pressure, by sex, age, and race-ethnicity.

WARNING

There are several things you should be aware of while analyzing NHANES data with Stata. Please see the Stata Tips page to review them before continuing.

Step 1: Determine variables of interest

According to the Joint National Committee on Prevention, Detection, Evaluation, and Treatment of High Blood Pressure, a person with hypertension is defined as either having elevated blood pressure (systolic pressure of at least 140 mmHg or diastolic of at least 90 mmHg) or taking antihypertensive medication.

You can code your variables in this example in two possible ways. Persons with high blood pressure, as defined above, are assigned a value of 1. All other sample persons are assigned a value of 2. The weighted percentage of sample persons with a value equal to 1 is an estimate of the prevalence of high blood pressure in the U.S.

IMPORTANT NOTE

An alternate method of coding the variables is to assign persons with high blood pressure, as defined above, a value of 100, and persons without high blood pressure a value of 0. The weighted mean of this variable, which is expressed as a percent, is an estimate of the prevalence of high blood pressure in the U.S. This method can be used with SAS Survey Procedures.

Step 2: Use svyset to define survey design variables

Remember that you need to define the SVYSET before using the SVY series of commands. The general format of this command is below:

svyset [w=weightvar], psu(psuvar) strata(stratavar) vce(linearized)

To define the survey design variables for your high blood pressure analysis, use the weight variable for four years of MEC data (wtmec4yr), the PSU variable (sdmvpsu), and the strata variable (sdmvstra). The vce option specifies the method for calculating the variance; the default is "linearized", which is Taylor linearization. Here is the svyset command for four years of MEC data:

svyset [w= wtmec4yr], psu(sdmvpsu) strata(sdmvstra) vce(linearized)

Step 3: Use svy:proportion to generate proportions

In this example, you will use svy: proportion in Stata to generate proportions. You created a categorical variable, hbp, to indicate whether or not a person had high blood pressure. That categorical variable will be identified in the procedure and the weighted percent (prevalence) of sample persons with the value hbp=1 (high blood pressure) will be estimated along with the standard error.

The general format of the svy:proportion command is:

svy, subpop(if condition) vce(linearized): proportion varname

To generate the proportion of persons aged 20 years and older (ridageyr >=20 & ridageyr <.) with high blood pressure (hbp), the command would be:

svy, subpop(if ridageyr >=20 & ridageyr <. ) vce(linearized): prop hbp

Output of svy:prop for High Blood Pressure Variable
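The following is a minimal, self-contained sketch of the same command on made-up data, so you can see how the subpop option behaves. The values are invented and only the variable names follow the text. The subpop option restricts the estimate to adults while keeping all sample persons in the design for variance estimation, which is why it is used instead of dropping persons under 20 from the dataset.

```stata
* Toy data only (made-up values); variable names follow the text
clear
input sdmvstra sdmvpsu wtmec4yr ridageyr hbp
1 1 1000 25 2
1 1 1200 47 1
1 1  900 15 2
1 2 1100 63 1
1 2  950 34 2
1 2 1050 12 2
2 1 1500 52 1
2 1 1300 29 2
2 1  700 71 1
2 2  800 44 2
2 2 1000 18 2
2 2 1250 66 2
end

* Declare the design (documented current form of svyset)
svyset sdmvpsu [pweight = wtmec4yr], strata(sdmvstra) vce(linearized)

* Proportion with hbp = 1 and hbp = 2 among persons aged 20 and older
svy, subpop(if ridageyr >= 20 & ridageyr < .) vce(linearized): proportion hbp

* Observations in the subpopulation versus in the full design
display "Subpopulation obs: " e(N_sub) "   Total obs: " e(N)
```

The header of the output reports both the total number of observations and the number of subpopulation observations. If the two are equal, the subpop condition did not exclude anyone and should be rechecked.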

Step 4: Use over option of svy:proportion command to generate proportions and standard errors for different subgroups in Stata

The general format of the svy:proportion command with the over option is:

svy, subpop(if condition) vce(linearized): proportion varname, over(var1)

Here is the command to generate the proportion of people aged 20 years and older (ridageyr >=20 & ridageyr <.) by gender (riagendr) with hypertension (hbp):

svy, subpop(if ridageyr >=20 & ridageyr <. ) vce(linearized): proportion hbp, over(riagendr)

Output of svy:prop by Gender

Here is the command to generate the proportion of people aged 20 years and older (ridageyr >=20 & ridageyr <.) by gender (riagendr), race-ethnicity (race), and age (ridageyr) with hypertension (hbp):

svy, subpop(if ridageyr >=20 & ridageyr <. ) vce(linearized): proportion hbp, over(riagendr race ridageyr)

Output of svy:prop by Gender, Age, and Race-Ethnicity
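Because the over option creates one group per distinct value of each variable, a continuous age variable is usually collapsed into categories first. The sketch below shows one way to do this on made-up data. The variable name agegrp is hypothetical, and the 60 and older cutoff for the third group is an assumption inferred from the 20-39 and 40-59 groups mentioned in the highlights.

```stata
* Toy data only (made-up values); agegrp is a hypothetical variable name
clear
input sdmvstra sdmvpsu wtmec4yr riagendr ridageyr hbp
1 1 1000 1 25 2
1 1 1200 2 47 1
1 1  900 1 15 2
1 2 1100 2 63 1
1 2  950 1 34 2
1 2 1050 2 12 2
2 1 1500 1 52 1
2 1 1300 2 29 2
2 1  700 1 71 1
2 2  800 2 44 2
2 2 1000 1 18 2
2 2 1250 2 66 2
end

* Collapse age in years into three groups; persons under 20 are set to missing
recode ridageyr (min/19 = .) (20/39 = 1) (40/59 = 2) (60/max = 3), generate(agegrp)
label define agegrp 1 "20-39" 2 "40-59" 3 "60+"
label values agegrp agegrp

* Declare the design (documented current form of svyset)
svyset sdmvpsu [pweight = wtmec4yr], strata(sdmvstra) vce(linearized)

* Proportions by gender and age group among persons aged 20 and older
svy, subpop(if ridageyr >= 20 & ridageyr < .) vce(linearized): proportion hbp, over(riagendr agegrp)
```

With so few rows, several gender-by-age groups contain only one or two persons, so the standard errors here are not meaningful. The point is only the mechanics of grouping before using over.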

Highlights from the output include:

  • The output shows proportions for all persons, both genders, the four race-ethnicity categories, the three age groups, and finally the 24 gender-race-age groups.
  • The percents in the output are the proportions of sample persons with high blood pressure.
  • The "Other" race/ethnicity category is only included to complete the totals. It is not reported.
  • In the groups for females, notice that the proportion of black females with high blood pressure is twice that of other races in the 20-39 years age group, and nearly twice that of other races in the 40-59 years age group.
  • Given the low proportion of high blood pressure in the 20-39 years age group, you will also want to consider using an arcsine transformation or the Clopper-Pearson method when estimating the standard error and confidence limits.
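
As an illustration of the arcsine approach mentioned above, the sketch below builds a 95% confidence interval for a small proportion on the arcsine square-root scale and transforms it back. The input values are hypothetical, the standard error on the transformed scale comes from the delta method, and a normal quantile is used for simplicity. A design-based analysis might instead use a t quantile with the survey degrees of freedom.

```stata
* Hypothetical inputs: replace with a proportion and standard error
* taken from your own svy: proportion output
scalar p  = 0.05
scalar se = 0.01

* Arcsine square-root transformation and its delta-method standard error
scalar phi    = asin(sqrt(p))
scalar se_phi = se / (2 * sqrt(p * (1 - p)))

* 95% limits on the transformed scale, kept inside [0, pi/2]
scalar z      = invnormal(0.975)
scalar phi_lo = max(0, phi - z * se_phi)
scalar phi_hi = min(_pi / 2, phi + z * se_phi)

* Back-transform the limits to the proportion scale
scalar lb = sin(phi_lo)^2
scalar ub = sin(phi_hi)^2
display "95% CI: " %6.4f lb " to " %6.4f ub
```

Unlike the usual interval of the estimate plus or minus a multiple of the standard error, the back-transformed limits cannot fall below 0 or above 1, which is the main reason to consider this approach for proportions near zero.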