# Reliability of Estimates

Analysts must evaluate the statistical reliability of estimates to determine whether the results are appropriate for their intended research objective. This module describes a number of measures that can be used to evaluate the reliability of an estimate, including the sample size, the effective sample size, the design effect, the width and relative width of its confidence interval, the degrees of freedom, and the relative standard error. The National Center for Health Statistics (NCHS) Data Presentation Standards for Proportions is one set of guidelines that analysts may consider when making analytic decisions.

In 2017, NCHS published updated Data Presentation Standards for Proportions. This report described a set of criteria to judge the reliability of estimated proportions in NCHS reports, including those that present estimates from NHANES. Although other NHANES data users are not required to apply these standards, analysts are advised to consider the principles behind these standards in determining whether estimated proportions are reliable for their intended purpose. It may be necessary to combine data from multiple two-year survey cycles or to collapse subgroups within the sample design (e.g. to combine age groups) to produce adequate sample sizes for an analysis.

The Data Presentation Standards for Proportions include criteria based on the effective sample size, the width and relative width of the confidence intervals (CIs), and the degrees of freedom. The tutorial module below provides an overview of the criteria in the standards, but analysts are encouraged to read the full report. The standards do not include thresholds for evaluating the reliability of other types of statistics, such as means or percentiles. However, the principles behind the standards, such as using the effective sample size and CIs to guide decisions, may be relevant to decisions about presenting other types of estimates.

**Reference**

Parker JD, Talih M, Malec DJ, et al. National Center for Health Statistics Data Presentation Standards for Proportions. National Center for Health Statistics. Vital Health Stat 2(175). 2017.

## Sample Size and Effective Sample Size

The sample size is the (unweighted) number of survey participants included in an estimate. The variance of an estimated proportion is inversely related to the sample size.

The effective sample size for an estimate from a complex survey such as NHANES is defined as the sample size divided by the design effect of the estimate. As described in the Variance Estimation module, the design effect is the ratio of the variance of a statistic which properly accounts for the complex sample design (i.e. a design-based estimate of variance that accounts for clustering, stratification and unequal selection probabilities by weighting the data) to the variance of the same statistic based on a simple random sample of the same size. However, there is more than one way to interpret how to calculate the variance of a “simple random sample of the same size,” and so statistical packages may vary in how they calculate the design effect. For example, SUDAAN version 11 provides four different methods for how to calculate the design effect. Method 2 (option “DEFT2”) is the method recommended by NCHS for NHANES data. This design effect option assumes that the subgroup sample size is fixed, and measures the impact of two aspects of the survey design: (a) stratification and clustering and (b) unequal weighting.

The **design effect** for an estimated proportion from NHANES is defined as:

\[ \text{DEFF} = \frac{\widehat{\operatorname{Var}}(\hat{p})}{\hat{p}(1-\hat{p})/n} \]

where \(\hat{p}\) is a weighted estimate of the proportion (using the appropriate sample weights), \(n\) is the sample size, and \( \widehat{\operatorname{Var}}(\hat{p})\) is the design-based variance estimate for \(\hat{p}\) (i.e. accounting for the complex survey design as described in the Variance Estimation module).

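The **effective sample size** for an estimated proportion from NHANES can be calculated as the sample size divided by the design effect, or:

\[ n_e = \min\left(n,\ \frac{n}{\text{DEFF}}\right) = \min\left(n,\ \frac{\hat{p}(1-\hat{p})}{\widehat{\operatorname{Var}}(\hat{p})}\right) \]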

Due to sampling design and variability, it is possible that the design effect of an estimate could be less than one, which would produce an effective sample size that is greater than the actual sample size. In that case, the effective sample size should be capped at the actual sample size when determining whether the minimum sample size criterion is met. The min() function in the formula above implements this cap. (When calculating the confidence interval, the degrees-of-freedom adjusted effective sample size is also capped at the actual sample size. See the section below, "Confidence Intervals for Proportions.")
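As a minimal sketch of the two definitions above, the following Python functions compute the design effect and the capped effective sample size. They assume the simple-random-sample variance of a proportion is \(\hat{p}(1-\hat{p})/n\); the function names and the example numbers are illustrative only and do not come from the standards.

```python
def design_effect(p_hat, var_design, n):
    """Ratio of the design-based variance to the simple-random-sample variance.

    Assumes the SRS variance of a proportion is p_hat * (1 - p_hat) / n.
    """
    srs_var = p_hat * (1.0 - p_hat) / n
    return var_design / srs_var


def effective_sample_size(p_hat, var_design, n):
    """Sample size divided by the design effect, capped at the actual sample size."""
    return min(n, n / design_effect(p_hat, var_design, n))


# Illustrative values: a proportion of 0.20 from 400 participants with a
# design-based standard error of 0.03 (variance 0.0009).
deff = design_effect(0.20, 0.03 ** 2, 400)
n_eff = effective_sample_size(0.20, 0.03 ** 2, 400)

# A design effect below one triggers the cap at the actual sample size.
n_eff_capped = effective_sample_size(0.20, 0.0002, 400)
```

In the first example the design effect is greater than one, so the effective sample size is smaller than 400. In the second, the uncapped value would exceed 400, so the function returns 400.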

Statistical software packages may not calculate a design effect for age-adjusted estimates (also known as age standardization by the direct method). An analyst may first need to estimate the age-specific proportions for each of the age groups used in direct age standardization, then combine them to calculate the variance of an age-adjusted proportion based on a “simple random sample of the same size.”

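The **design effect for an age-adjusted proportion** \(\hat{q}\) based on \(g\) age-standardization groups is defined as:

\[ \text{DEFF} = \frac{\widehat{\operatorname{Var}}(\hat{q})}{\sum_{i=1}^{g} wt_i^2\,\hat{p}_i(1-\hat{p}_i)/n_i} \]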

where \( \widehat{\operatorname{Var}}(\hat{q})\) is the design-based variance estimate for \(\hat{q}\) (i.e. accounting for the complex survey design as described in the Variance Estimation module), \(\hat{p}_i\) is the estimated weighted proportion for the *i*-th age group, \(n_i\) is the sample size for the *i*-th age group, and \(wt_i\) is the population share (weight) represented by the *i*-th age group as used in the direct age standardization.

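Then, the **effective sample size for an estimated age-adjusted proportion** from NHANES can be calculated as:

\[ n_e = \min\left(n,\ \frac{n}{\text{DEFF}}\right) \]

where \(n\) is the total sample size across the age groups and \(\text{DEFF}\) is the design effect for the age-adjusted proportion.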

The NCHS Data Presentation Standards for Proportions require that an estimated proportion be suppressed if it is based on either an actual sample size smaller than 30 or an effective sample size smaller than 30.
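The age-adjusted calculation and the sample size criterion can be sketched in Python as follows. This is an illustration under stated assumptions, not NCHS code: it assumes the "simple random sample" variance of the age-adjusted proportion is the sum of \(wt_i^2\,\hat{p}_i(1-\hat{p}_i)/n_i\) over the age groups, and that the effective sample size is the total sample size divided by the design effect, capped at the total. All function names and input values are hypothetical.

```python
def age_adjusted_design_effect(var_design, p_by_age, n_by_age, wt_by_age):
    """Design effect for a directly age-standardized proportion.

    Assumes the SRS variance is sum(wt_i**2 * p_i * (1 - p_i) / n_i).
    """
    srs_var = sum(
        w * w * p * (1.0 - p) / n
        for p, n, w in zip(p_by_age, n_by_age, wt_by_age)
    )
    return var_design / srs_var


def fails_sample_size_criterion(n, n_eff, minimum=30):
    """True if the actual or the effective sample size is below the minimum."""
    return n < minimum or n_eff < minimum


# Hypothetical inputs for three age groups.
p_by_age = [0.10, 0.20, 0.30]   # age-specific weighted proportions
n_by_age = [100, 150, 50]       # age-specific sample sizes
wt_by_age = [0.5, 0.3, 0.2]     # standard population shares (sum to 1)
var_q = 0.000978                # design-based variance of the adjusted proportion

deff_adj = age_adjusted_design_effect(var_q, p_by_age, n_by_age, wt_by_age)
n_total = sum(n_by_age)
n_eff_adj = min(n_total, n_total / deff_adj)
suppress = fails_sample_size_criterion(n_total, n_eff_adj)
```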

**Reference**

Research Triangle Institute (2012). "Section 12.1: Design Effects Computed in All SUDAAN Descriptive and Modeling Procedures" SUDAAN Language Manual, Volumes 1 and 2, Release 11. Research Triangle Park, NC: Research Triangle Institute.

## Confidence Intervals for Proportions

Confidence intervals (CIs) also provide information about the reliability of an estimate. Conceptually, under repeated sampling from the same population, if a proportion and its 95% confidence interval are estimated from each sample, the true value of the proportion is expected to be contained in 95% of the calculated intervals.

The Wald confidence interval [ \(\hat{p} \pm 1.96 \times \widehat{\operatorname{SE}}(\hat{p}) \), for a two-sided 95% CI] is commonly produced as the default CI method by statistical software but is known to have limitations for proportions. Because proportions are bounded by [0, 1], the upper and lower bounds of a CI should also fall within that same range. However, the Wald CI may produce negative lower bounds for proportions near zero and upper bounds greater than one for large proportions. In addition, the Wald confidence interval may be too narrow; simulation studies have shown that the true proportion is contained within a 95% Wald CI in less than 95% of the simulated CIs. The undercoverage is worse for small and for large proportions.
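The boundary problem described above is easy to reproduce. The short Python sketch below computes a two-sided 95% Wald interval for a small proportion; the proportion and standard error are made-up values chosen only to show a negative lower bound.

```python
def wald_ci(p_hat, se, z=1.96):
    """Two-sided Wald interval: estimate plus or minus z times the standard error."""
    return p_hat - z * se, p_hat + z * se


# Illustrative values: a 2% prevalence with a standard error of 1.5 points.
lower, upper = wald_ci(0.02, 0.015)
lower_is_out_of_range = lower < 0.0   # the lower limit falls below zero
```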

Several alternative methods have been proposed for calculating confidence intervals for estimated proportions, including those from complex surveys. Analysts are advised to consider the properties of the proportion and the analytic goals when selecting an approach. The Data Presentation Standards for Proportions include criteria based on the absolute width and the relative width of the Clopper-Pearson confidence interval, which was adapted for complex surveys by Korn and Graubard.

The calculation of the Korn and Graubard CI depends on the degrees of freedom. For proportions estimated for a subgroup, the degrees of freedom should be calculated as (the number of PSUs with sampled observations in the subgroup of interest) – (the number of strata with sampled observations in the subgroup of interest). For subgroups that are not represented in all primary sampling units (PSUs) or strata (e.g. some racial and ethnic groups), the degrees of freedom will therefore be lower than the degrees of freedom available for the overall estimates. **The default calculations from most statistical software packages do not properly account for the reduction in the degrees of freedom for subgroups that are not represented in all PSUs or strata.** To properly account for the degrees of freedom, analysts may need to output the number of strata and the number of PSUs available for each subgroup, either from the survey procedure or from a separate tabulation, into a data set that can be used to calculate the Korn and Graubard CIs outside the procedure. The code examples provide sample code for calculating Korn and Graubard confidence intervals.
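The subgroup degrees-of-freedom rule above amounts to counting distinct PSUs and distinct strata among the subgroup's sampled records. A minimal Python sketch follows. It assumes each record is available as a (stratum, PSU) pair, for example taken from the design variables on the analysis file (in NHANES public-use files these are typically `SDMVSTRA` and `SDMVPSU`); the function name and the example records are hypothetical.

```python
def subgroup_degrees_of_freedom(records):
    """Number of PSUs minus number of strata with sampled observations.

    `records` is an iterable of (stratum, psu) pairs, one per sampled
    participant in the subgroup. PSU codes are treated as nested in strata.
    """
    psus = {(stratum, psu) for stratum, psu in records}
    strata = {stratum for stratum, _ in psus}
    return len(psus) - len(strata)


# Hypothetical subgroup observed in three strata; stratum 3 contributes
# observations from only one of its two PSUs.
subgroup_records = [(1, 1), (1, 1), (1, 2), (2, 1), (2, 2), (2, 2), (3, 1)]
df_subgroup = subgroup_degrees_of_freedom(subgroup_records)
```

Here five PSUs in three strata contain subgroup observations, so the subgroup has two degrees of freedom rather than the three available if every PSU were represented.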

**Formulas for Korn and Graubard Confidence Limits**

Korn and Graubard (KG) confidence limits are a modification of the Clopper-Pearson ("exact") confidence limits for a binomial proportion, adapted for use with complex survey data. Where the Clopper-Pearson calculation uses the sample size (or "number of trials"), the KG confidence limit substitutes a degrees-of-freedom adjusted effective sample size \( n_e^* \). Where the Clopper-Pearson calculation uses the number of positive responses or "successes", the KG confidence limit substitutes the adjusted effective sample size times the (weighted) estimated proportion \(n_e^*\hat{p}\).

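The Korn and Graubard confidence limits for a proportion (lower confidence limit \(P_L\) and upper confidence limit \(P_U\)) can be formulated in terms of quantiles of the F distribution:

\[ P_L = \left[1 + \frac{n_e^* - x + 1}{x\,F\left(\alpha/2,\ 2x,\ 2(n_e^* - x + 1)\right)}\right]^{-1} \]

\[ P_U = \left[1 + \frac{n_e^* - x}{(x+1)\,F\left(1-\alpha/2,\ 2(x+1),\ 2(n_e^* - x)\right)}\right]^{-1} \]

where \(x = n_e^*\hat{p}\), \(F(\alpha/2, b,c)\) is the \((\alpha/2)\)th percentile of the \(F\) distribution with \(b\) and \(c\) degrees of freedom, and the degrees-of-freedom adjusted effective sample size \(n_e^*\) is defined as:

\[ n_e^* = \min\left(n,\ \frac{n}{\text{DEFF}}\left[\frac{t_{n-1}(1-\alpha/2)}{t_{df}(1-\alpha/2)}\right]^2\right) \]

where \(t_{k}(1-\alpha/2)\) is the \((1-\alpha/2)\)th percentile of the \(t\) distribution with \(k\) degrees of freedom, \(df\) is the degrees of freedom for the estimate, and the design effect \(\text{DEFF}\) is defined above in the section "Sample Size and Effective Sample Size." If the estimated proportion is age-adjusted, then the formula for the design effect for an age-adjusted proportion should be used.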

Note that the degrees-of-freedom adjusted effective sample size \(n_e^*\) is capped at the actual sample size \(n\). This cap could be binding for subgroups where the estimated design effect is less than one.
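The following Python sketch shows one way the Korn and Graubard limits might be computed. It is an illustration under several assumptions, not the NCHS sample code. It uses the beta-quantile form of the Clopper-Pearson limits, which is mathematically equivalent to the F-quantile form. It assumes the adjusted effective sample size is \(n/\text{DEFF}\) multiplied by the squared ratio of \(t\) quantiles with \(n-1\) and \(df\) degrees of freedom, capped at \(n\). So that the example needs only the standard library, the beta and \(t\) quantiles come from small numerical helpers (a continued-fraction incomplete beta function and bisection); a statistics library's quantile functions could replace them. All function names are hypothetical.

```python
import math


def _betacf(a, b, x, max_iter=500, eps=1e-13, tiny=1e-300):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        even = m * (b - m) * x / ((qam + m2) * (a + m2))
        odd = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        for coef in (even, odd):
            d = 1.0 + coef * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + coef / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def beta_cdf(x, a, b):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def _bisect(cdf, target, lo, hi, iters=100):
    """Invert a monotone increasing CDF on [lo, hi] by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def beta_ppf(q, a, b):
    """Quantile of the Beta(a, b) distribution."""
    return _bisect(lambda x: beta_cdf(x, a, b), q, 0.0, 1.0)


def t_ppf(q, df):
    """Quantile of Student's t distribution, for 0.5 < q < 1."""
    def cdf(t):
        return 1.0 - 0.5 * beta_cdf(df / (df + t * t), df / 2.0, 0.5)
    hi = 1.0
    while cdf(hi) < q:
        hi *= 2.0
    return _bisect(cdf, q, 0.0, hi)


def korn_graubard_ci(p_hat, n, deff, df, alpha=0.05):
    """Return (lower, upper, adjusted effective sample size)."""
    q = 1.0 - alpha / 2.0
    t_ratio = t_ppf(q, n - 1) / t_ppf(q, df)
    n_adj = min(n, (n / deff) * t_ratio ** 2)   # capped at the actual n
    x = n_adj * p_hat                            # stands in for the count of "successes"
    lower = beta_ppf(alpha / 2.0, x, n_adj - x + 1.0) if x > 0 else 0.0
    upper = beta_ppf(q, x + 1.0, n_adj - x) if x < n_adj else 1.0
    return lower, upper, n_adj


# Illustrative inputs: a 10% prevalence from 500 participants,
# a design effect of 2.0, and 15 degrees of freedom.
kg_lower, kg_upper, kg_n_adj = korn_graubard_ci(0.10, 500, 2.0, 15)
```

When the cap binds and \(n_e^*\hat{p}\) is a whole number, the result reduces to the ordinary Clopper-Pearson interval for that count, which gives a convenient check on an implementation.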

**References**

Brown LD, Cai TT, Dasgupta A. "Interval estimation for a binomial proportion." Stat Sci 16(2):101–17. 2001.

Clopper CJ, Pearson ES. "The use of confidence or fiducial limits illustrated in the case of the binomial." Biometrika 26(4):404–13. 1934.

Dean N, Pagano M. "Evaluating confidence interval methods for binomial proportions in clustered surveys." J Surv Stat Methodol 3(4):484–503. 2015.

Korn EL, Graubard BI. "Confidence intervals for proportions with small expected number of positive counts estimated from survey data." Surv Methodol 24(2):193–201. 1998.

Graubard BI, Korn EL. "Survey inference for subpopulations." Am J Epidemiol 144(1):102–6. 1996.

Newcombe RG. "Two-sided confidence intervals for the single proportion: Comparison of seven methods." Stat Med 17(8):857–72. 1998.

“The SURVEYFREQ Procedure: Confidence Limits for Proportions.” SAS Institute Inc. 2018. SAS/STAT® 15.1 User’s Guide. Cary, NC: SAS Institute Inc.

From a calculated confidence interval, the absolute CI width (CIW) is calculated by subtracting the lower confidence limit from the upper confidence limit. The relative confidence interval width (RCIW) is the absolute CI width divided by the estimated proportion, expressed as a percentage (i.e. multiplied by 100%).

Because the RCIW is scaled by the proportion, the strictness of a presentation threshold based on the RCIW varies with the level of the estimated proportion. The RCIW can be too conservative for very small proportions (i.e. requiring a very small CI width in order to present the estimate) and too liberal for very large proportions (i.e. possibly allowing an estimate to be presented even if the absolute CIW is wide). This property of the RCIW is the motivation for including criteria based on both the CIW and the RCIW in the Data Presentation Standards for Proportions. The table below describes the criteria related to CIW and RCIW in the standards.

| Absolute CI Width | Relative CI Width | Guidance |
|---|---|---|
| ≥30% | Any level | Suppress |
| >5% and <30% | >130% | Suppress |
| >5% and <30% | ≤130% | Present* |
| ≤5% | Any level | Present* |
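The table's decision rule can be sketched as a small Python function. This is an illustration only: the function name and example intervals are hypothetical, the limits are assumed to be on the proportion scale (0 to 1), and a "present" result covers only the CI width criteria, since the table marks those rows with an asterisk and the sample size and degrees-of-freedom criteria are evaluated separately.

```python
def ci_width_guidance(p_hat, lower, upper):
    """Apply the absolute and relative CI width criteria from the table."""
    ciw = upper - lower                 # absolute CI width
    rciw = 100.0 * ciw / p_hat          # relative CI width, in percent
    if ciw >= 0.30:
        return "suppress"
    if ciw <= 0.05:
        return "present"
    # Absolute width is between 5% and 30%: the relative width decides.
    return "suppress" if rciw > 130.0 else "present"


# Illustrative cases, one per table row.
wide = ci_width_guidance(0.50, 0.30, 0.70)          # CIW of 40 points
rel_wide = ci_width_guidance(0.04, 0.01, 0.09)      # CIW of 8 points, RCIW of 200%
rel_ok = ci_width_guidance(0.10, 0.06, 0.16)        # CIW of 10 points, RCIW of 100%
narrow = ci_width_guidance(0.01, 0.005, 0.03)       # CIW of 2.5 points
```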

The variance of a statistic estimated from the NHANES data is also an estimate, and as such, is subject to its own variability. For complex surveys, such as NHANES, the precision of the estimated variance is approximately related to the square root of the degrees of freedom.

The calculations of effective sample size and CI widths described above are dependent on the estimated variance of the proportion. An estimate based on a small number of degrees of freedom may have an estimated variance with low precision, which in turn may lead to poor reliability of statistical tests and inferences as well as the effective sample size and CI widths.

The NCHS Data Presentation Standards for Proportions require that a proportion based on fewer than 8 degrees of freedom be reviewed by a clearance official to determine whether to present or suppress the proportion. Most population estimates from a public-use data file for a single NHANES cycle are based on 15 degrees of freedom (30 PSUs – 15 strata). However, estimates for subgroups not represented in all locations and subnational estimates produced in the Research Data Center may have fewer than 15 degrees of freedom.

The relative standard error (RSE) of a statistic is defined as the standard error of the estimated statistic divided by the estimated statistic, and is usually expressed as a percentage.

%RSE = (Standard error of estimate / Estimate) * 100

However, the strictness of the RSE measure varies depending on the level of the estimated proportion; the RSE can be too conservative for very small proportions and too liberal for very large proportions.

In the past, NCHS has often used thresholds based on the RSE in determining whether to show an estimate or whether to identify an estimate as unreliable in its reports. The current NCHS Data Presentation Standards for Proportions do not include criteria based on the RSE. However, estimated means published in NCHS reports will continue to be evaluated based on the RSE, and estimated means with RSE greater than or equal to 30% should be identified as unreliable.
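As a brief illustration of the RSE calculation and the 30% threshold for means described above, the Python sketch below uses hypothetical function names and made-up values.

```python
def relative_standard_error(estimate, standard_error):
    """RSE in percent: standard error divided by the estimate, times 100."""
    return 100.0 * standard_error / estimate


def mean_is_unreliable(estimate, standard_error, threshold=30.0):
    """Flag an estimated mean whose RSE is greater than or equal to the threshold."""
    return relative_standard_error(estimate, standard_error) >= threshold


# Illustrative values: a mean of 120 with a standard error of 3.
rse_example = relative_standard_error(120.0, 3.0)
flag_example = mean_is_unreliable(120.0, 3.0)
```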