Subpopulation variance: How to do it correctly?

Hello, I am doing a study comparing cancer prevalence and related health behaviors between veterans and non-veterans. I am pooling data from 2000 to 2018. I read the article “Analysis and Variance Estimation with IPUMS NHIS,” but I am confused.

  • How do I make sure I produce nationally representative subpopulation variance?
  • Do I have to adjust STRATA or PSU?

NOTE: Cancer prevalence variables are available all years. Some of the health behavior variables are available all years, others are available only certain years (2000, 2003, 2015).

My response below summarizes what is documented in our User Notes on Analysis and Variance Estimation with IPUMS NHIS and on the Use of Sampling Weights with IPUMS NHIS.

  • I am pooling data from 2000 to 2018.

You need to adjust the sampling weights in pooled samples only if you are creating estimates that represent the entire time period. To do this, divide the weights by the number of samples in your pool. If you are instead creating estimates for each year separately, use the weights as-is. In that case, although your extract may contain data from multiple samples, you are not conducting a “pooled” analysis.
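As a minimal sketch of the pooled-weight adjustment described above, assuming a data frame that contains the IPUMS NHIS variables YEAR and PERWEIGHT, with one sample per year. The toy values and the column name PERWEIGHT_POOLED are invented for illustration only:

```r
# Toy extract: three annual samples, two person records each (values invented)
nhis <- data.frame(
  YEAR      = c(2000, 2000, 2001, 2001, 2002, 2002),
  PERWEIGHT = c(1500, 2500, 1600, 2400, 1700, 2300)
)

# Number of distinct samples in the pool (assumes one sample per YEAR)
n_samples <- length(unique(nhis$YEAR))

# Pooled weight: each annual weight divided by the number of samples
# (PERWEIGHT_POOLED is a hypothetical name, not an IPUMS variable)
nhis$PERWEIGHT_POOLED <- nhis$PERWEIGHT / n_samples

# The pooled weights sum to the average annual weighted total
# rather than to the sum of all annual totals
sum(nhis$PERWEIGHT_POOLED)
```

The adjusted weight would then be passed to the survey design in place of the original weight. For year-by-year estimates, skip this step and use PERWEIGHT unchanged.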

  • How do I make sure I produce nationally representative subpopulation variance?

The following R syntax demonstrates, generally, how an analyst can conduct subpopulation analysis using IPUMS NHIS data without compromising the design structure of the data. This approach has the effect of producing estimates for the population of interest, while incorporating the full sample design information for variance estimation. This syntax uses, as an example, the population of those 65 and older.

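library(survey)
library(srvyr)
# Declare the full design on all records first (var1 is a placeholder analysis variable)
data <- as_survey_design(data, ids = PSU, weights = PERWEIGHT, strata = STRATA, nest = TRUE)
# Subset the survey object, not the raw data, so variance estimation keeps the full design
subset(data, AGE >= 65) %>% summarise(var1_mean = survey_mean(var1, na.rm = TRUE))
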
  • Do I have to adjust STRATA or PSU?

The integrated variables STRATA and PSU in the IPUMS NHIS database have been adjusted from the original NHIS design variables to account for sampling design changes across years. Thus, the analyst can simply select the STRATA and PSU variables to use for analysis of one year or for many years of IPUMS NHIS data.

  • NOTE: Cancer prevalence variables are available all years. Some of the health behavior variables are available all years, others are available only certain years (2000, 2003, 2015).

Your approach depends on whether you are pooling samples or creating annual estimates. If you are creating estimates for each year separately, no change to your analysis is needed. If you are pooling data across multiple years for a single estimate, you may need to restrict your pool to the years in which the variables of interest are available, and adjust the weights accordingly (divide your weight variable by the number of samples in your analysis).
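The restriction described above can be sketched as follows. This is only an illustration under stated assumptions: HLTHBEHAV is a hypothetical variable name standing in for a health behavior item collected only in 2000, 2003, and 2015, the toy values are invented, and WEIGHT_POOLED is an illustrative column name. PERWEIGHT is used for simplicity; substitute whichever weight is recommended for your variable.

```r
# Toy pooled extract (values invented). HLTHBEHAV is a hypothetical variable
# that is NA in years when the question was not asked.
nhis <- data.frame(
  YEAR      = c(2000, 2001, 2003, 2010, 2015, 2015),
  PERWEIGHT = c(3000, 3100, 3200, 3300, 1500, 1900),
  HLTHBEHAV = c(1, NA, 0, NA, 1, 0)
)

# Years in which the variable has at least one non-missing value
years_available <- sort(unique(nhis$YEAR[!is.na(nhis$HLTHBEHAV)]))

# Keep every record from those years; whole years are dropped,
# so each retained sample stays complete
pool <- nhis[nhis$YEAR %in% years_available, ]

# Divide the weight by the number of samples actually in the pool
# (assumes one sample per YEAR)
pool$WEIGHT_POOLED <- pool$PERWEIGHT / length(years_available)
```

Note that the divisor is the number of years kept in the pool (three here), not the number of years in the original extract.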

Differential availability across time is not in itself a reason to use a different weight, but it likely indicates that a variable is part of a rotating topical supplement that is asked only of sample adults or sample children. On each variable's webpage, there is a Weights tab that indicates which weight variable is best for that variable. In general, you should use the most restrictive weight that applies to the variables in your analysis, or you could run your analyses separately for combinations of variables that require different weights, as seems appropriate.