Identifying individual households - sample size discrepancy in 2000s?


I have been trying to count all the households in my IPUMS CPS extract, but for some reason I haven’t been able to get the number to match the total number of households listed on the website here.

Following the survey documentation, I have been identifying households by serial * year. However, when I count the number of households by this variable combination (bysort serial year: gen tag = _n == 1), I get approximately 20,000 fewer households in the 2000s than what is listed on the website. I get the same (lower) number when I simply count pernum == 1.

As an example, in the 2013 extract I get a count of 74,821 households using both of the methods listed above, while the number listed at the link above for 2013 is 98,095. Any suggestions about what I am missing here?




The difference between the number of households you are seeing in your rectangular extract and the total number of households listed on the website is made up of “non-interview” or vacant households. Because the default rectangular extract structure places all household information on the person record, households with no associated person records (empty households) are effectively dropped from the extract. In the 2013 March IPUMS-CPS file there are 23,274 empty households, which accounts exactly for the gap between your count of 74,821 and the listed 98,095. Since these households are empty, there is not much information about them in the data. However, if you wish to download an extract with all households, you can choose to create a hierarchical extract from the extract request menu.
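
To make the mechanism concrete, here is a minimal sketch in Python on made-up toy records (not real CPS data). It assumes a hierarchical layout with one "H" record per household followed by "P" records for its members, and treats a rectangular extract as simply the person records. The field names and the helper functions rectangularize and count_households are hypothetical and for illustration only.

```python
from collections import Counter

# Toy hierarchical extract: one "H" record per household, followed by
# one "P" record per person. All values are invented for illustration.
hierarchical = [
    {"rectype": "H", "year": 2013, "serial": 1},
    {"rectype": "P", "year": 2013, "serial": 1, "pernum": 1},
    {"rectype": "P", "year": 2013, "serial": 1, "pernum": 2},
    {"rectype": "H", "year": 2013, "serial": 2},  # empty household: no person records
    {"rectype": "H", "year": 2013, "serial": 3},
    {"rectype": "P", "year": 2013, "serial": 3, "pernum": 1},
]


def rectangularize(records):
    """Keep only person records (assumed model of a rectangular extract)."""
    return [r for r in records if r["rectype"] == "P"]


def count_households(records):
    """Count distinct (year, serial) pairs, grouped by year."""
    ids = {(r["year"], r["serial"]) for r in records}
    return Counter(year for year, _ in ids)


rectangular = rectangularize(hierarchical)
all_households = count_households(
    [r for r in hierarchical if r["rectype"] == "H"]
)
visible_households = count_households(rectangular)
empty_households = all_households[2013] - visible_households[2013]

# Household 2 has no person records, so it is missing from the rectangular count.
print(all_households[2013], visible_households[2013], empty_households)

# The same reconciliation with the 2013 figures quoted in this thread.
assert 74_821 + 23_274 == 98_095
```

In the toy data, the household with serial 2 never appears in the rectangular version because it has no person record to carry its household information, which is the same reason the vacant and non-interview households are missing from a rectangular count.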

I hope this helps.