***************************************************************************
** Module 3 examples - Stata code                                         *
** Examples illustrating the over-sampling of some demographic            *
** groups and demonstrating the importance of using weights in analyses   *
***************************************************************************
** Note to tutorial users: you must update some lines of code (e.g. file paths) 
**    to run this code yourself. Search for comments labeled "TutorialUser"


* Change working directory to a directory where we can save temporary files *
* TutorialUser: Update this path to a valid location on your computer!
cd "C:\Stata_workspace\"

*******************************************************************
** Download data files from NHANES website and import into Stata **
*******************************************************************

* DEMO demographic *
import sasxport "https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT", clear
save "DEMO_I.dta", replace

* BPX blood pressure exam *
import sasxport "https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BPX_I.XPT", clear
save "BPX_I.dta", replace

* BPQ blood pressure questionnaire *
import sasxport "https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BPQ_I.XPT", clear
save "BPQ_I.dta", replace

**********************
** Merge data files **
**********************
use "DEMO_I.dta", clear
merge 1:1 seqn using "BPX_I.dta", nogenerate keepusing(seqn bpxsy1-bpxsy4 bpxdi1-bpxdi4)
merge 1:1 seqn using "BPQ_I.dta", nogenerate keepusing(seqn bpq050a)
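
* Optional sanity check after merging (a sketch, not part of the original tutorial code) *
* isid exits with an error if seqn does not uniquely identify each observation; *
* count displays the number of observations in the merged file *
isid seqn
count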

*********************************
** Generate analysis variables **
*********************************
**Hypertension prevalence**
** Count Number of Nonmissing SBPs & DBPs **
egen n_sbp=rownonmiss( bpxsy1 bpxsy2 bpxsy3 bpxsy4)
egen n_dbp=rownonmiss( bpxdi1 bpxdi2 bpxdi3 bpxdi4)

** Set DBP Values Of 0 To Missing For Calculating Average **
mvdecode bpxdi1-bpxdi4, mv(0)

** Calculate Mean Systolic and Diastolic (over non-missing values) **
egen mean_sbp=rowmean(bpxsy1 bpxsy2 bpxsy3 bpxsy4)
egen mean_dbp=rowmean(bpxdi1 bpxdi2 bpxdi3 bpxdi4)

** Create 0/100 indicator for Hypertension **
* "Old" hypertension definition: taking medication or measured mean BP >= 140/90 *
* (mean systolic >= 140 or mean diastolic >= 90), as used in NCHS Data Brief No. 289 *
* variable bpq050a: now taking prescribed medicine for hypertension, 1 = yes *
* must explicitly check that mean_sbp and mean_dbp are not missing, because Stata treats missing values as larger than any number ("positive infinity") *
gen HTN_old=100 if ( (mean_sbp>=140 & !missing(mean_sbp)) | (mean_dbp >= 90 & !missing(mean_dbp))| bpq050a == 1) 
replace HTN_old = 0 if  (HTN_old==. & n_sbp > 0 & n_dbp > 0)
label define htnLabel 0 "No" 100 "Yes"
label values HTN_old htnLabel
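
* Optional check of the indicator built above (a sketch, not part of the original tutorial code) *
* HTN_old should only take the values 0, 100, or missing, and should be non-missing *
* whenever at least one systolic and one diastolic reading were recorded *
tab HTN_old, missing
assert inlist(HTN_old, 0, 100) | missing(HTN_old)
assert !missing(HTN_old) if n_sbp > 0 & n_dbp > 0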

/*
** For reference: "new" definition of hypertension prevalence, based on taking medication or measured mean BP >= 130/80 **
** From 2017 ACC/AHA hypertension guidelines **
* Not used in Data Brief No. 289 - provided for reference *
gen HTN_new=100 if ( (mean_sbp>=130 & !missing(mean_sbp)) | (mean_dbp >= 80 & !missing(mean_dbp)) | bpq050a == 1) 
replace HTN_new = 0 if  (missing(HTN_new) & n_sbp > 0 & n_dbp > 0)
*/

* Create race and Hispanic ethnicity categories for oversampling analysis *
* combined Non-Hispanic white and Non-Hispanic other and multiple races, to approximate the sampling domains *
recode ridreth3 (3 7 = 4) (4 = 1) (1/2=3) (6 = 2), generate(race1)
label define racelabels1 1 "Non-Hispanic black" 2 "Non-Hispanic Asian" 3 "Hispanic" 4 "Non-Hispanic white and other"
label values race1 racelabels1

* Create race and Hispanic ethnicity categories for hypertension analysis *
recode ridreth3 (3 = 1) (4 = 2) (1/2=4) (6=3) (7 = 5), generate(raceEthCat)
label define raceEthnicity_Labels 1 "Non-Hispanic white" 2 "Non-Hispanic black" 3 "Non-Hispanic Asian" 4 "Hispanic" 5 "NH other race or multiple races"
label values raceEthCat raceEthnicity_Labels

* Create age categories for adults aged 18 and over: ages 18-39, 40-59, 60 and over *
recode ridageyr (0/17 = .) (18/39 = 1) (40/59 = 2) (60/80 = 3), generate(ageCat_18)
label define Age_Labels 1 "18-39" 2 "40-59" 3 "60 and over"
label values ageCat_18 Age_Labels

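* Define subpopulation of interest: non-pregnant adults aged 18 and over who have at least 1 valid systolic OR diastolic BP measure *
* Coded 0/1 (not 1/missing): the svy subpop() option drops observations whose subpop variable is missing from the estimation sample *
generate inAnalysis = (ridageyr >= 18 & ridexprg != 1 & (n_sbp > 0 | n_dbp > 0))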
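
* Optional check (a sketch, not part of the original tutorial code): *
* count the analysis sample overall and by adult age group *
count if inAnalysis == 1
tab ageCat_18 if inAnalysis == 1, missing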


**********************************************************************************************
** Estimates for graph - Distribution of race and Hispanic origin, NHANES 2015-2016          *
* Module 3, Examples Demonstrating the Importance of Using Weights in Your Analyses          *
* Section "Adjusting for oversampling"                                                       *
**********************************************************************************************

* Proportion of unweighted interview sample *
tab race1

* Proportion, weighted with interview weight *
tab race1 [iweight=wtint2yr]

* Proportion of US population *
* Input population totals from the American Community Survey, 2015-2016 *
* available on the NHANES website: https://wwwn.cdc.gov/nchs/nhanes/responserates.aspx#population-totals *
* counts from tab "Both" (for both genders), total row (for all ages) *

scalar NH_White_Other=194849491+10444206
scalar NH_Black=38418696
scalar NH_Asian=17018259
scalar Hispanic=55750392
scalar acsTotal= NH_White_Other + NH_Black + NH_Asian + Hispanic

* use display command as a "hand calculator" to display the proportion comprised by each group *
foreach group in NH_Black NH_Asian Hispanic NH_White_Other {
	display "`group' " %5.1f `group'/acsTotal*100
}


**********************************************************************************************
** Comparison of weighted and unweighted estimates for hypertension, NHANES 2015-2016        *
* Module 3, Examples Demonstrating the Importance of Using Weights in Your Analyses          *
* Section "Why weight?"                                                                      *
**********************************************************************************************

** Prevalence of hypertension among adults aged 18 and over, overall and by race and Hispanic origin group **

* Unweighted estimate - for adults aged 18 and over *
tab HTN_old if inAnalysis==1

* Unweighted estimate - for adults aged 18 and over, by race and Hispanic origin *
tab raceEthCat HTN_old if inAnalysis==1 , row 

* Weighted estimates * 

*** WARNING ***
* The following commands using the tab command are intended to demonstrate the importance of using the sample weight in your analyses.
* The weighted estimate produces the correct POINT ESTIMATES for the prevalence of hypertension.
* However, your analysis must account for the complex survey design of NHANES (e.g. stratification and clustering), 
*   in order to produce correct STANDARD ERRORS (and confidence intervals, statistical tests, etc.).
* Do NOT use this step as a model for producing your own analyses!
* See the Continuous NHANES tutorial Module 4: Variance Estimation for a complete explanation of how to properly account 
*    for the complex survey design using Stata survey commands with the svy: prefix

* Weighted estimates - for adults aged 18 and over *
tab HTN_old [iweight=wtmec2yr] if inAnalysis==1

* Weighted estimates - for adults aged 18 and over, by race and Hispanic origin *
tab raceEthCat HTN_old  [iweight=wtmec2yr] if inAnalysis==1 , row

** Code using Stata svy commands to estimate the prevalence of hypertension, with correct standard errors **
* See Module 4: Variance Estimation for details *
svyset sdmvpsu [pweight=wtmec2yr], strata(sdmvstra) vce(linearized)
* overall adults aged 18 and over *
svy, subpop(inAnalysis): mean HTN_old , cformat(%5.1f)
* adults aged 18 and over, by race and Hispanic origin *
svy, subpop(inAnalysis): mean HTN_old , over(raceEthCat) cformat(%5.1f)


************************

** Age distribution among Hispanic adults, weighted and unweighted **
* The tutorial text states that the unweighted sample over-represents Hispanic adults aged 60 and over,
*  compared with their actual share of the Hispanic adult population. The tabulations below check this. *

* Unweighted age distribution among Hispanic adults in the analysis  *
tab ageCat_18 if inAnalysis==1 & raceEthCat==4
* Unweighted, Hispanic adults aged 60 and over comprise 33% of Hispanic adults in the analysis sample. *

* Weighted age distribution among Hispanic adults in the analysis population *
tab ageCat_18 [iweight=wtmec2yr] if inAnalysis==1 & raceEthCat==4
* When properly weighted, Hispanic adults aged 60 and over comprise 15% of Hispanic adults in the US non-institutionalized civilian population *
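
* Design-based version of the weighted age distribution above (a sketch, not part of the original tutorial code) *
* Assumes the svyset command earlier in this file has already been run. *
* subpop() restricts the estimate to Hispanic adults in the analysis sample, and the standard errors *
* reflect the survey design declared by svyset. The point estimates should match the weighted tab above, *
* but are shown as proportions (0-1) rather than percentages. *
svy, subpop(if inAnalysis==1 & raceEthCat==4): proportion ageCat_18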