I’m estimating household counts from PUMS data and getting numbers about 2.5x larger than the ACS estimates for the same period. Is this a problem with my counting method (summing HHWT)? The code I used to generate the counts is below. With the PUMS data I get ~12 million households in IL, whereas the ACS estimate for the same period is ~5 million with a margin of error (MoE) of about 10,000.
library(ipumsr)
library(dplyr)

ddi <- ""
ipums_path <- ""

ipums_data <- read_ipums_micro(ddi = ddi,
                               data_file = ipums_path)

IL_hh_pums <- ipums_data %>%
  filter(GQ == 1, STATEFIP == 17) %>%
  group_by(STATEFIP) %>%
  summarize(households = sum(HHWT))

IL_hh_acs5 <- tidycensus::get_acs(
  survey = "acs5",
  year = 2019,
  geography = "state",
  state = "IL",
  variables = c("households" = "B11012_001"),
  output = "wide"
)

IL_hh_pums # 11,899,895 households
IL_hh_acs5 # 4,846,134 households, 10,459 moe
When calculating household totals from the PUMS data available from IPUMS USA, there are two key considerations. First, restrict the data to one record per household (e.g., PERNUM == 1) so that each household's weight is counted only once. Second, while your code does address group quarters, it keeps only GQ == 1; GQ values of 2 and 5 (additional households under the 1990 and 2000 definitions) should probably be included as well.
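To make the two points above concrete, here is a minimal sketch on a made-up person-level table. The data frame `toy_pums` and all of its values are hypothetical and only mimic the shape of an IPUMS USA extract; the variable names (SERIAL, PERNUM, GQ, HHWT) are the ones discussed in this thread.

```r
library(dplyr)

# Hypothetical toy data: HHWT is repeated on every person row of a household.
toy_pums <- data.frame(
  SERIAL = c(1, 1, 1, 2, 2, 3, 4),
  PERNUM = c(1, 2, 3, 1, 2, 1, 1),
  GQ     = c(1, 1, 1, 2, 2, 5, 3),  # GQ 3 is a group-quarters code, not a household
  HHWT   = c(100, 100, 100, 80, 80, 50, 20)
)

# Naive total: each household weight is added once per person.
naive_total <- toy_pums %>%
  filter(GQ %in% c(1, 2, 5)) %>%
  summarize(households = sum(HHWT)) %>%
  pull(households)

# Household total: keep household GQ codes and one row per household.
hh_total <- toy_pums %>%
  filter(GQ %in% c(1, 2, 5), PERNUM == 1) %>%
  summarize(households = sum(HHWT)) %>%
  pull(households)

naive_total  # 510
hh_total     # 230
```

The naive total is inflated roughly in proportion to average household size, which is consistent with the ~2.5x gap described in the question.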
My data doesn’t include PERNUM, since it only contains household-level variables in the extract. I was working through this with Ivan in a previous thread (see here). Adding more GQ values into my filter would increase the household count, so I’m still unsure why I’m getting such large counts in my data. Is there another variable in household-level data that is equivalent to PERNUM?
I believe I was able to figure this out. There is no PERNUM variable in household-level data extracts, but by ensuring there were no duplicate SERIAL values I was able to get a count within the ACS5 estimate ± the margin of error. See below:
library(ipumsr)
library(dplyr)

ddi <- ""
ipums_path <- ""

ipums_data <- read_ipums_micro(ddi = ddi,
                               data_file = ipums_path)

IL_hh_pums <- ipums_data %>%
  filter(GQ == 1, STATEFIP == 17) %>%
  # keep one row per household so each HHWT is summed only once
  distinct(SERIAL, .keep_all = TRUE) %>%
  group_by(STATEFIP) %>%
  summarize(households = sum(HHWT))

IL_hh_acs5 <- tidycensus::get_acs(
  survey = "acs5",
  year = 2019,
  geography = "state",
  state = "IL",
  variables = c("households" = "B11012_001"),
  output = "wide"
)

IL_hh_pums # 4,844,000
IL_hh_acs5 # 4,846,134; moe 10,459
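One caveat worth sketching: in IPUMS USA, SERIAL identifies a household only within a sample, so deduplicating on SERIAL alone may misbehave if an extract combines several samples or years. The example below uses a made-up table (`toy_extract` and its values are hypothetical) and assumes the extract includes the SAMPLE variable; it is an illustration of the idea, not a diagnosis of any particular extract.

```r
library(dplyr)

# Hypothetical two-sample extract: SERIAL 1 appears in both samples.
toy_extract <- data.frame(
  YEAR   = c(2018, 2018, 2018, 2019, 2019),
  SAMPLE = c(201801, 201801, 201801, 201901, 201901),
  SERIAL = c(1, 1, 2, 1, 1),
  HHWT   = c(100, 100, 60, 90, 90)
)

# Deduplicating on SERIAL alone drops the 2019 household entirely.
by_serial_only <- toy_extract %>%
  distinct(SERIAL, .keep_all = TRUE) %>%
  group_by(YEAR) %>%
  summarize(households = sum(HHWT), .groups = "drop")

# Deduplicating on SAMPLE + SERIAL keeps one row per household per sample.
by_sample_serial <- toy_extract %>%
  distinct(SAMPLE, SERIAL, .keep_all = TRUE) %>%
  group_by(YEAR) %>%
  summarize(households = sum(HHWT), .groups = "drop")

by_serial_only    # one row: 2018, 160
by_sample_serial  # two rows: 2018, 160 and 2019, 90
```

For a single-sample extract like the one above in this thread, the two approaches give the same result.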
Hi Ethan, I am running into the same issue, but still getting larger household counts than reported on somewhere like QuickFacts. I am also seeing large shifts in household counts year to year. I did not use the tidycensus package, instead downloading directly from IPUMS. I accounted for duplicates through SERIAL, but that does not solve the overestimates. Has anyone else had this issue in the past?
I didn’t download the IPUMS data using tidycensus. I masked my file path for my IPUMS data and codebook. The tidycensus call is pulling the ACS data I compared my counts with. Sorry if there was any confusion there.
When you pulled the ACS data through tidycensus did it automatically calculate the moe?
Yes, you can copy/paste the tidycensus call into R and get the same output. The Census API provides MoEs with all ACS estimates.
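As a rough illustration of the "within the margin of error" comparison made earlier in the thread, the check can be written out with the numbers already posted. The variable names below are hypothetical; with `output = "wide"` and a named variable, the tidycensus result should carry the estimate and MoE in columns along the lines of `householdsE` and `householdsM`, but confirm that against your own output.

```r
# Values reported earlier in this thread
pums_households <- 4844000   # deduplicated sum of HHWT
acs_estimate    <- 4846134   # ACS 5-year estimate
acs_moe         <- 10459     # ACS MoEs are published at the 90% confidence level

# Is the PUMS-based count inside estimate +/- MoE?
difference <- abs(pums_households - acs_estimate)
within_moe <- difference <= acs_moe

difference  # 2134
within_moe  # TRUE
```

This only compares the PUMS point estimate against the published ACS interval; it does not account for the sampling error of the PUMS estimate itself.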