# Issues when combining data from two 5-year ACS

I extracted the 2010 and 2015 5-year ACS samples to increase the sample size of a small immigrant group. I am currently doing person-level analysis, so I use perwt when I want population estimates. I have several questions related to this decision to combine two 5-year ACS samples:

1. When using perwt with pooled data, it effectively doubles the weighted population size. What should I do to get accurate population figures? I was thinking of creating new person weights, computed for the 2010 sample cases like this:

perwt2 = perwt × (2010 ACS population / (2010 ACS population + 2015 ACS population))

and for 2015 sample cases do this:

perwt2 = perwt × (2015 ACS population / (2010 ACS population + 2015 ACS population))

2. When using weighted data to estimate confidence intervals for univariate statistics and significance levels for multivariate tests, I assume perwt needs adjusting so that the weighted number of cases equals the actual sample size. So I plan to create new person weights like this: perwt3 = perwt × (sample size / population size). Again, I would do this separately for the 2010 and 2015 cases. Is this correct?

3. If I want to analyze income levels or home values, are there multipliers available to make income for 2006-2010 cases equivalent to 2011-2015 cases?

4. Using the 2006-2015 pooled data, I'm guessing there are individuals who are respondents twice. Is this correct? If yes, is there something I can/should do to address this?

5. What other problems should I be aware of when using this pooled ACS data (2006-2015)?

My apologies for asking such basic questions, but I couldn't find answers to them in the forum.

I'll answer each question one at a time.

(1) I'd say your method for adjusting the sample weights will work just fine. I was actually going to suggest simply dividing PERWT by 2. Your method, however, is a little more precise.
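For what it's worth, the population-share adjustment from the original question can be sketched in a few lines of pandas. Everything here is illustrative: the sample labels, the toy PERWT values, and the PERWT2 column name are all made up.

```python
import pandas as pd

# Toy pooled extract: SAMPLE marks which 5-year file each person came
# from; the PERWT values are invented for illustration.
df = pd.DataFrame({
    "SAMPLE": ["acs2010", "acs2010", "acs2015", "acs2015"],
    "PERWT":  [120.0, 80.0, 110.0, 90.0],
})

# Population each file's weights sum to, and each file's share of the
# pooled total -- this share is the multiplier from the question above.
pop = df.groupby("SAMPLE")["PERWT"].sum()
share = pop / pop.sum()

df["PERWT2"] = df["PERWT"] * df["SAMPLE"].map(share)

# The adjusted weights now sum to a single population-sized total
# instead of roughly double it.
print(df["PERWT2"].sum())
```

With these toy numbers each file's weights sum to 200, so the pooled PERWT2 total is 200 rather than 400.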

(2) IPUMS USA actually provides replicate weights that can be useful for variance estimation of estimates calculated from public use sample data. See this page for more information.
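To make the replicate-weight mechanics concrete: the ACS replicate weights are designed for the Census Bureau's successive difference replication formula, in which a statistic is estimated once with the full-sample weight and once with each of the (usually 80) replicate weights. A minimal sketch, with invented replicate estimates:

```python
import numpy as np

def sdr_se(theta_full, theta_reps):
    """Successive-difference-replication standard error:
    SE = sqrt((4 / R) * sum_r (theta_r - theta_full)^2),
    where theta_reps holds the R (usually 80) replicate estimates."""
    theta_reps = np.asarray(theta_reps, dtype=float)
    return np.sqrt(4.0 / len(theta_reps)
                   * np.sum((theta_reps - theta_full) ** 2))

# Toy check: 80 replicate estimates, one of which differs from the
# full-sample estimate by 1.0, so SE = sqrt(4/80) ~= 0.2236.
reps = np.full(80, 10.0)
reps[0] = 11.0
print(round(sdr_se(10.0, reps), 4))
```

This is why each regression must be re-run once per replicate weight: every replicate run supplies one theta_r per coefficient.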

(3) There is no ready-made inflation adjustment variable, for converting to present-day dollar values, available in IPUMS USA. We do offer the CPI99 variable for adjusting dollar values into 1999 dollars by year of interview (see the descriptive documentation for more details). You can, however, adjust your income variables using publicly available annual inflation rates. Note that in the multi-year files the actual year of the survey is given by the variable MULTYEAR.
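A sketch of that two-step adjustment: CPI99 (which varies by MULTYEAR in the multi-year files) puts incomes into constant 1999 dollars, and a publicly available inflation factor then moves them to a common target year. The CPI99 values and the 1999-to-2015 factor below are placeholders, not real figures; look up the actual values before using this.

```python
import pandas as pd

# Assumed 1999 -> 2015 inflation factor (placeholder; take the real
# figure from published CPI tables).
CPI99_TO_2015 = 1.40

df = pd.DataFrame({
    "MULTYEAR": [2008, 2013],
    "INCTOT":   [40000.0, 45000.0],
    "CPI99":    [0.80, 0.72],  # illustrative per-year CPI99 values only
})

# Step 1: constant 1999 dollars; step 2: constant 2015 dollars.
df["INC_1999"] = df["INCTOT"] * df["CPI99"]
df["INC_2015"] = df["INC_1999"] * CPI99_TO_2015
```

The same approach works for home values or any other dollar-denominated variable in the pooled file.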

(4) It is highly unlikely that respondents are included twice in the ACS over a 10-year span, because each ACS sample represents roughly a 1% sample of the total population. Even if it did happen, it would be difficult to identify and correct for with public use data.

(5) Well, it is difficult to say without more detail about what you plan to do with the data. One challenge that often comes up is that geographies are a bit tricky in the multi-year files. This is because geographic boundaries sometimes change over time, which complicates things when the data are pooled over a long time period.

I hope this is all helpful. Let us know if you have any additional questions.


Thank you so much for these very helpful responses. I want to follow up on your response to #2.

"(2) IPUMS USA actually provides replicate weights that can be useful for variance estimation of estimates calculated from public use sample data. See this page for more information. "

I reviewed the info on replicate weights. Using them appears to create overwhelming added work for me, since I use SPSS and run and report numerous logistic and OLS regressions. It appears I would need to run each regression model 80 times and then recompute the standard error for each variable to precisely report significance levels. I suppose I could focus only on SEs where the coefficient is barely significant. Do you know if anyone has created a macro or module for SPSS that would streamline this?

The info notes that using replicate weights usually raises SEs modestly and does not usually make significant relationships nonsignificant. Based on your experience with replicate weights, what do you think of these alternative strategies to recomputing SEs and p values for all significant or nearly significant predictors in regressions?

1. Use the replicate weights only when reporting confidence intervals. In regression models where replicate weights were not used, report this and note that the SEs are likely deflated. Use a more rigorous p value to test significance (e.g., p < .025 or maybe p < .01). Report that p values are likely modestly deflated and that this is why we are using a more rigorous p level to test significance.

2. Use replicate weights when reporting confidence intervals. In regression models, use replicate weights only for predictors where p > .01 and p < .05. Report this strategy and note that for other variables the p values may be deflated and that, for example, variables significant at p < .01 may only be significant at p < .05.

Finally, I'm still thrown by how perwt weights each case so the n matches the population. This makes it so that in regressions even tiny relationships are significant at p < .001. I still think I need to adjust perwt as I first described so the n matches the sample size. What am I not getting?

Unfortunately, I don't know of a macro or anything else that would streamline this procedure in SPSS.

I think both of your alternative strategies are reasonable. If it were me, I think I'd lean toward a combination of both strategies. Basically, do strategy number 2, but report the non-replicate-weighted SEs and p values, so that they are all consistent. For predictors where p > 0.01 and p < 0.05, provide some sort of note on whether the results are robust to the use of replicate weights. (I hope that makes sense. If not, let me know or just ignore me.)

For each observation, PERWT indicates how many persons in the US population the observation represents. As mentioned above, if you are pooling two IPUMS samples you'll need to adjust the values of PERWT somehow. Otherwise you'll calculate a population size of roughly twice the actual size. I may not have completely answered your question; if so, feel free to follow up.

Thank you, thank you. Your responses are most helpful.

I think I've got your thoughts on dealing with replicate weights. This question is a follow-up on the last part of that post, which raises a concern about perwt that I'm still stuck on. Here is the exchange; my, hopefully, clarified question follows:

CS: "Finally, I'm still thrown by how perwt weights each case so the n matches the population. This makes it so that in regressions even tiny relationships are significant at p < .001. I still think I need to adjust perwt as I first described so the n matches the sample size. What am I not getting?"

JB: "For each observation, PERWT indicates how many persons in the US population the observation represents. As mentioned above, if you are pooling two IPUMS samples you'll need to adjust the values of PERWT somehow. Otherwise you'll calculate a population size of roughly twice the actual size. I may not have completely answered your question; if so, feel free to follow up."

CS: Yes, I understand the need to approximately halve perwt since I'm using two 5-year ACS files combined. This is a separate issue, so let's assume I'm going to do that as described in my original #1 and leave it aside. I think this other weighting problem stems from SPSS treating the weighted data as if the weighted n (the population n) were the actual sample size when calculating standard errors. A simple illustration: if I had 100 cases, each weighted 10.0, I believe SPSS computes standard errors and significance levels as if the actual n were 1000 (100 × 10). Thus, since perwt, even approximately halved, averages out to each case representing about 10 people, I believe the effect is to greatly lower p values. So when I'm doing OLS regressions explaining occupational earnings score for people ages 25-64 of a small immigrant group with an n of 4,500, SPSS treats it like 45,000 cases, and a number of variables that are not close to significant when I take the weights off are highly significant with the weights on. So my thought is to keep the weighting information in perwt, but rescale it so the weighted n equals the sample size rather than the population size. I can do this by multiplying perwt by sample n / population n. I may be wrong about this, but I don't see my error yet.
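The rescaling described can be sketched in a few lines; the weight values here are toy numbers chosen only to show that relative weights are preserved while the total drops to the sample n.

```python
import numpy as np

# Toy person weights: 4 cases whose weights sum to 40, so SPSS-style
# frequency weighting would behave as if n were 40 rather than 4.
perwt = np.array([8.0, 12.0, 10.0, 10.0])
n = len(perwt)

# Rescale so the weights sum to the actual sample size (mean weight 1.0)
# while preserving each case's weight relative to the others.
perwt3 = perwt * n / perwt.sum()

print(perwt3.sum())  # equals the sample size, 4.0
```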

Many thanks!

Yes, I think you are right that SPSS is not treating the weight properly. How are you integrating the sample weight into the regression? Are you using a built-in SPSS feature, or are you manually multiplying the data by the sample weight? If you are using the built-in feature, SPSS should be smart enough to correct for this potential issue. Or at least that is what I'd expect; I'm not personally an SPSS power user.

I'm using the perwt variable. SPSS's interface has a global "weight cases" selection as well as a choice to select a weighting variable in the OLS regression interface. I've tried both, with the same results, which appear distorted. Thank you.

Okay, well, weighting data will influence your standard errors, so I'm not sure whether this is an error or not. If you'd like, please email your code or syntax for this analysis to ipums@umn.edu. I'll be able to better assist you with a clearer picture of what you are working with.

Thank you for the kind offer on the weighting question. Let me look at some more runs first to get a clearer idea of the source of the deflated SEs.