I am trying to generate custom multi-year estimates while also calculating their standard errors. For example, I want to produce 3-year estimates by combining three 1-year samples. The guidance I’ve received from the Census Bureau is to divide both the weights and all of the replicate weights by 3. However, when I specify this in my survey design in R using the survey and srvyr packages, my estimates seem correct but my standard errors are far too high.
Does anyone have any experience generating multi-year estimates and calculating the standard errors for them?
If you are combining three ACS samples together, it is typically advisable to use the 3-year files provided by the Census Bureau. These files, available via IPUMS USA, include sampling weights that are already adjusted for the pooling together of multiple single-year files. Additionally, replicate weights are available in these files. If you are not combining three ACS samples together, then the procedure you discuss above (e.g., dividing the sampling weights by the number of samples pooled together) is approximately correct. Strictly speaking, a “more accurate” method is to multiply the sampling weight in sample x by (the sample size in sample x) / (the pooled sample size). If the combined samples all have roughly the same sample size, then the two methods discussed will be approximately equivalent.
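To make the two rescaling rules concrete, here is a minimal sketch in plain Python (just arithmetic, not an IPUMS or Census tool). The function names and the toy weights are hypothetical; each toy sample is set up so that its weights sum to the same population of 1,200.

```python
def rescale_divide_by_k(weights_by_sample):
    """Divide every weight by the number of pooled samples."""
    k = len(weights_by_sample)
    return {name: [w / k for w in ws] for name, ws in weights_by_sample.items()}


def rescale_by_sample_share(weights_by_sample):
    """Multiply each weight by (sample size of its sample) / (pooled sample size)."""
    pooled_n = sum(len(ws) for ws in weights_by_sample.values())
    return {
        name: [w * len(ws) / pooled_n for w in ws]
        for name, ws in weights_by_sample.items()
    }


def pooled_total(weights_by_sample):
    """Sum of all weights across the pooled samples."""
    return sum(sum(ws) for ws in weights_by_sample.values())


# Toy weights (invented): three samples, each summing to 1,200,
# but the 2018 sample has one more respondent than the others.
samples = {
    "2016": [300.0, 300.0, 300.0, 300.0],
    "2017": [300.0, 300.0, 300.0, 300.0],
    "2018": [240.0, 240.0, 240.0, 240.0, 240.0],
}

by_k = rescale_divide_by_k(samples)
by_share = rescale_by_sample_share(samples)

print(pooled_total(by_k), pooled_total(by_share))
print(by_k["2016"][0], by_k["2018"][0])
print(by_share["2016"][0], by_share["2018"][0])
```

In this toy case both rules recover the same pooled total. The difference is in the individual weights: the sample-share rule pulls the per-record weights of the larger and smaller samples toward each other, while dividing by 3 leaves them unequal. When the samples are nearly the same size, the two sets of weights are nearly identical.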
@JeffBloem Thanks for the reply. The 3-year data was discontinued in 2012, so to balance currency of data with sample size, I am trying to combine three 1-year estimates.
I was concerned with using the replicate weights, but I think I just solved the issue by using the STRATA and CLUSTER variables. Can you answer these questions for me?
Does IPUMS calculate in-house replicate weights instead of reporting Census PUMS replicate weights?
Does IPUMS offer CLUSTER and STRATA as optional methods for calculating standard error? → And so are REPWT and REPWTP simply separate variables that an analyst could use to estimate the standard errors?
Why would Census PUMS be better than IPUMS for combining three 1-year files?
Last, can you tell me if this seems correct? The results really do seem accurate (not only the estimate, but the margins of error seem appropriate when the sample sizes are cut).
Take 2016, 2017 and 2018 ACS 1-year estimates (IPUMS).
For household-level estimates, divide HHWT by 3 and filter for PERNUM == 1.
Specify the survey design using the CLUSTER and STRATA fields as well as the revised HHWT field.
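The steps above can be sketched roughly as follows, in plain Python rather than the survey/srvyr workflow itself. Everything here is illustrative: the records are invented, OWNER is a hypothetical 0/1 indicator, HHWT_POOLED is a hypothetical name for the revised weight, and the standard error function is the textbook with-replacement ("ultimate cluster") formula for a total, which I am assuming is comparable to what a design-based package computes from strata and cluster IDs.

```python
from collections import defaultdict
from math import sqrt

# Toy person-level records pooled from three 1-year samples (values invented).
records = [
    {"YEAR": 2016, "PERNUM": 1, "HHWT": 300.0, "STRATA": 1, "CLUSTER": 11, "OWNER": 1},
    {"YEAR": 2016, "PERNUM": 2, "HHWT": 300.0, "STRATA": 1, "CLUSTER": 11, "OWNER": 1},
    {"YEAR": 2017, "PERNUM": 1, "HHWT": 300.0, "STRATA": 1, "CLUSTER": 12, "OWNER": 0},
    {"YEAR": 2018, "PERNUM": 1, "HHWT": 300.0, "STRATA": 1, "CLUSTER": 13, "OWNER": 1},
    {"YEAR": 2016, "PERNUM": 1, "HHWT": 600.0, "STRATA": 2, "CLUSTER": 14, "OWNER": 1},
    {"YEAR": 2017, "PERNUM": 1, "HHWT": 600.0, "STRATA": 2, "CLUSTER": 15, "OWNER": 1},
    {"YEAR": 2017, "PERNUM": 2, "HHWT": 600.0, "STRATA": 2, "CLUSTER": 15, "OWNER": 1},
    {"YEAR": 2018, "PERNUM": 1, "HHWT": 600.0, "STRATA": 2, "CLUSTER": 16, "OWNER": 0},
]


def prepare_households(rows, n_years):
    """Keep one row per household (PERNUM == 1) and divide HHWT by n_years."""
    return [
        dict(r, HHWT_POOLED=r["HHWT"] / n_years) for r in rows if r["PERNUM"] == 1
    ]


def weighted_total(rows, var, weight):
    """Weighted total of a variable."""
    return sum(r[weight] * r[var] for r in rows)


def linearized_se_total(rows, var, weight):
    """SE of a weighted total from between-cluster variation within strata."""
    cluster_totals = defaultdict(lambda: defaultdict(float))
    for r in rows:
        cluster_totals[r["STRATA"]][r["CLUSTER"]] += r[weight] * r[var]
    variance = 0.0
    for clusters in cluster_totals.values():
        totals = list(clusters.values())
        n = len(totals)
        if n < 2:
            continue  # a single-cluster stratum contributes nothing in this sketch
        mean = sum(totals) / n
        variance += n / (n - 1) * sum((t - mean) ** 2 for t in totals)
    return sqrt(variance)


households = prepare_households(records, n_years=3)
owner_total = weighted_total(households, "OWNER", "HHWT_POOLED")
owner_se = linearized_se_total(households, "OWNER", "HHWT_POOLED")
print(len(households), owner_total, owner_se)
```

This is only meant to show the order of operations (filter to one row per household, rescale the weight, then let the strata and clusters drive the variance); a real analysis would hand the prepared data to a survey package instead.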
I will try to answer your questions one at a time.
(1) The replicate weights available in IPUMS USA are the same replicate weights provided by the US Census Bureau.
(2) The use of CLUSTER and STRATA is not necessary to calculate standard errors. They are available as an option for users who feel they will enhance the credibility of their estimates. This page includes much more information about variance estimation with IPUMS USA data using CLUSTER and STRATA.
(3) I’m really not sure. I think this choice ultimately comes down to personal preference. As a frequent user of IPUMS USA, I can’t easily think of a case where I would prefer to use the PUMS data directly from the Census Bureau website. If you prefer to use un-harmonized variables, you can access the source variables directly from IPUMS USA. Just select the “source variables” radio button on the top of the Select Data page.
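Returning to the replicate weights in (1), here is a small sketch of how dividing the weights by 3 plays out in a replicate-weight standard error. It assumes the successive difference replication formula that, as I understand the Census PUMS documentation, applies to the ACS: 4/80 times the sum of squared deviations of the 80 replicate estimates from the full-sample estimate. The toy example uses only 4 replicates, so the factor is generalized to 4 / (number of replicates); the function names and all numbers are hypothetical.

```python
from math import sqrt


def weighted_total(values, weights):
    """Weighted total of a variable."""
    return sum(v * w for v, w in zip(values, weights))


def sdr_standard_error(values, full_weights, replicate_weights):
    """SE = sqrt(4/R * sum((replicate estimate - full estimate)^2))."""
    full = weighted_total(values, full_weights)
    reps = [weighted_total(values, rw) for rw in replicate_weights]
    factor = 4.0 / len(reps)  # 4/80 with the 80 ACS replicate weights
    return sqrt(factor * sum((r - full) ** 2 for r in reps))


def divide(weights, k):
    """Divide every weight in a list by k."""
    return [w / k for w in weights]


# Toy pooled data (invented): a 0/1 indicator, full weights, 4 replicate weights.
values = [1, 0, 1]
full_w = [30.0, 60.0, 90.0]
rep_w = [
    [33.0, 57.0, 90.0],
    [27.0, 63.0, 90.0],
    [30.0, 60.0, 96.0],
    [30.0, 60.0, 84.0],
]

# Unscaled weights.
se_raw = sdr_standard_error(values, full_w, rep_w)
# Full weight AND every replicate weight divided by 3.
se_pooled = sdr_standard_error(values, divide(full_w, 3), [divide(rw, 3) for rw in rep_w])
# Only the full weight divided by 3; replicate weights left unscaled.
se_mismatch = sdr_standard_error(values, divide(full_w, 3), rep_w)

print(se_raw, se_pooled, se_mismatch)
```

When the full weight and every replicate weight are divided by the same constant, the standard error of a total shrinks by that constant. If the replicate weights and the full weight end up on different scales, the deviations are dominated by the scale gap and the standard error balloons, so it is worth checking that all of the weights were rescaled consistently.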
I had the same question as nkobel and came across your reply (which was very helpful). I am just wondering if you could link me to a source from the Census Bureau or IPUMS for this statement: “Strictly speaking, a ‘more accurate’ method is to multiply the sampling weight in sample x by (the sample size in sample x) / (the pooled sample size).” @JeffBloem
I want to preface this by noting that altering the individual weights is only necessary if you’re looking to generate population-level totals or incorporate those into your analysis (e.g., the total number of people who are unemployed). It is not necessary to alter the weights when working with population means (e.g., average income), because a constant factor applied to every weight cancels out of a weighted mean.
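A quick arithmetic check of this point, as a minimal Python sketch with invented incomes and weights (the function names are hypothetical):

```python
def weighted_total(values, weights):
    """Weighted total: sum of value * weight."""
    return sum(v * w for v, w in zip(values, weights))


def weighted_mean(values, weights):
    """Weighted mean: weighted total divided by the sum of the weights."""
    return weighted_total(values, weights) / sum(weights)


# Toy pooled records from three years (values invented).
incomes = [20000.0, 50000.0, 80000.0]
weights = [300.0, 150.0, 150.0]
weights_div3 = [w / 3 for w in weights]

mean_raw = weighted_mean(incomes, weights)
mean_div3 = weighted_mean(incomes, weights_div3)
pop_raw = sum(weights)
pop_div3 = sum(weights_div3)

print(mean_raw, mean_div3)
print(pop_raw, pop_div3)
```

The mean is the same with either set of weights, while the implied population count is three times too large unless the weights are divided by 3.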
I am not aware of a citation regarding Jeff’s statement about this being strictly more accurate. However, I can share the logic behind his idea:
If you are calculating a weighted mean with standardized weights, and your observations are i.i.d., then the variance of your estimate is minimized if all the weights are equal. This article on Wikipedia provides more detail. Assuming that the three years of data are samples from the same (unchanging) population, your variance will be minimized if you weight each sampled individual the same. With oversampling, poststratification, and other sampling processes in the ACS, the assumption that the weights are equal over time does not fully hold. However, the “more equal” the weights are, the lower the variance will be.
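As a rough numerical illustration of that logic: under the i.i.d. assumption, the variance of a weighted mean is the variance of a single observation times (sum of squared weights) / (sum of weights) squared. The sketch below compares that factor for equal and unequal weights; the weights are invented and the function names are hypothetical.

```python
def variance_factor(weights):
    """Sum of squared weights over squared sum of weights.

    For i.i.d. observations with variance sigma^2, the variance of the
    weighted mean is sigma^2 times this factor.
    """
    return sum(w * w for w in weights) / sum(weights) ** 2


def effective_sample_size(weights):
    """Kish's effective sample size, the reciprocal of the factor above."""
    return 1.0 / variance_factor(weights)


equal_weights = [1.0, 1.0, 1.0, 1.0]
unequal_weights = [1.0, 1.0, 1.0, 5.0]

print(variance_factor(equal_weights), effective_sample_size(equal_weights))
print(variance_factor(unequal_weights), effective_sample_size(unequal_weights))
```

With equal weights the factor is 1/n, its smallest possible value, so all four observations count fully. With one weight much larger than the rest, the factor grows and the effective sample size drops well below four.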
The Census Bureau recommends dividing the weights by the number of samples you are pooling together (see page 28 of this technical paper); this is the approach Census uses to modify the weights in the multi-year ACS files. Since neither the population size nor the sample size changes very much over three years of ACS data, the two methods are probably going to give very similar results.