I’m using IPUMS microdata from the 1960 census, and I’m clearly overestimating the number of households. I ran this R code:
data |>
  # restrict to one record per household
  distinct(SAMPLE, SERIAL, .keep_all = TRUE) |>
  # add up weights
  summarise(sum(HHWT))
The result is 115,434,597 – more than twice as many households as in the US in 1960. What am I doing wrong?
HHWT is constructed independently for each IPUMS USA sample. For 1960, IPUMS offers two samples for analysis: a 5% density sample and a 1% density sample. Many other decennial census years also have more than one sample. SAMPLE = 196001 identifies records in the 1960 1% sample, and SAMPLE = 196002 identifies those in the 1960 5% sample.
In a sample that includes only household records (or only a single person record per household), the sum of HHWT across all records should equal the number of households in the US population in that year. That means the sum of HHWT in the 1960 5% sample and the sum in the 1960 1% sample will each approximate the total number of households in 1960. More detail on each sample can be found on our samples page and in the User Guide’s section on sample designs.
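Using the code you provided, I obtained an estimate of 57,699,837 households from the 1960 1% sample and 57,734,760 households from the 1960 5% sample. The sum of these two estimates, 115,434,597, matches the result you’re reporting: because your data pools both samples, every household in the population is counted twice. To get a valid estimate, restrict the data to a single sample (for example, by filtering on SAMPLE) before summing HHWT. I hope this clarifies what’s happening with the data.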
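As a rough illustration, here is a minimal sketch of how you might compute the estimate separately for each sample, or restrict to one sample before summing. The toy data frame is made up for demonstration and is not real 1960 data; it assumes your extract has the SAMPLE, SERIAL, and HHWT columns used in your code, that dplyr is installed, and that you are on R 4.1 or later for the native pipe.

```r
library(dplyr)

# Hypothetical toy extract (NOT real IPUMS data): two pooled samples that each
# represent the same 200 households, with one row per person.
data <- data.frame(
  SAMPLE = c(rep(196001, 3), rep(196002, 10)),
  SERIAL = c(1, 1, 2, 1:10),                 # household 1 in the 1% sample has two persons
  HHWT   = c(rep(100, 3), rep(20, 10))
)

# Option 1: one estimate per sample
households_by_sample <- data |>
  # restrict to one record per household within each sample
  distinct(SAMPLE, SERIAL, .keep_all = TRUE) |>
  group_by(SAMPLE) |>
  # add up weights separately for each sample
  summarise(households = sum(HHWT))

# Option 2: keep only one sample (here the 5% sample) before summing
households_5pct <- data |>
  filter(SAMPLE == 196002) |>
  distinct(SERIAL, .keep_all = TRUE) |>
  summarise(households = sum(HHWT))

# For comparison: pooling both samples double-counts the households
households_pooled <- data |>
  distinct(SAMPLE, SERIAL, .keep_all = TRUE) |>
  summarise(households = sum(HHWT))

print(households_by_sample)
print(households_5pct)
print(households_pooled)
```

In the toy data each sample sums to the same household total on its own, while the pooled sum is twice that, which mirrors what happened with your 1960 extract. Either option should work; grouping by SAMPLE is convenient if you want to compare the two estimates side by side.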