I’m comparing estimates of total population by county (i.e., STATEFIP and COUNTYFIPS) from the decennial IPUMS samples for 1940 to 2000 with the exact full-count values by county, which I obtained from NHGIS. The two line up well in every year except 1980, and I cannot figure out why.
My best guess is that for some counties, only a subset of individuals in IPUMS have a nonzero COUNTYFIPS value, while the rest have it set to zero.
For example, Allegheny County, PA (which includes Pittsburgh), with FIPS code 42003, has a population of 1,450,085 in 1980 according to NHGIS. However, in the 1980 5% IPUMS sample, this FIPS code has only 9,017 individuals, who represent only 9,017 × 20 = 180,340 of the county's population.
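To make the arithmetic concrete, here is a minimal Python sketch of the comparison for this one county. The helper names `county_fips` and `coverage_ratio` are made up for illustration, and the flat weight of 20 simply reflects the 5% sampling rate; with real extract data you would sum the person weights instead.

```python
def county_fips(statefip, countyfips):
    """Build the 5-digit FIPS code from state and county codes."""
    return f"{statefip:02d}{countyfips:03d}"


def coverage_ratio(sample_persons, weight, full_count):
    """Weighted sample estimate and its share of the full count."""
    estimate = sample_persons * weight
    return estimate, estimate / full_count


# Figures quoted above for Allegheny County, PA in 1980.
estimate, ratio = coverage_ratio(9_017, 20, 1_450_085)
print(county_fips(42, 3), estimate, f"{ratio:.1%}")
```

The weighted estimate covers only about an eighth of the NHGIS full count, which is far too large a gap to be sampling error.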
I can get to a value that is closer to 1.4m by using the county group variable, but that then gets me further away from the actual full counts in other counties. Hence, I haven’t figured out a systematic way of dealing with this issue.
Your hunch about the presence of zeros (i.e., non-identifiable cases) is on the right track. Counties are not directly identifiable in the public use microdata; they are identified only where they are coterminous with other lower-level geographic identifiers, so identification is limited by errors of omission. A county is identified only for residents of areas that lie entirely within a single lower-level geographic identifier. This prevents errors where non-residents of a county are identified as residents, but it puts no limit on errors where residents of a county are not identified as residents. That is what produces the under-counting of county populations in IPUMS USA. This is explained in the COUNTY variable description and will be added to the COUNTYFIPS variable description soon.
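One way to gauge how many cases fall into the non-identified group is to tabulate the weighted share of each state's records with COUNTYFIPS equal to 0. The sketch below is illustrative only: it assumes the extract has been loaded as a list of dictionaries keyed by the variable names STATEFIP, COUNTYFIPS, and PERWT, the function name `summarize` is hypothetical, and the four records are made-up toy data rather than real sample rows.

```python
from collections import defaultdict


def summarize(records):
    """Weighted population by identified county, plus each state's
    weighted share of persons whose county is not identified."""
    by_county = defaultdict(float)
    identified = defaultdict(float)
    total = defaultdict(float)
    for r in records:
        state, county, w = r["STATEFIP"], r["COUNTYFIPS"], r["PERWT"]
        total[state] += w
        if county != 0:  # 0 marks a non-identified county
            by_county[(state, county)] += w
            identified[state] += w
    unidentified_share = {s: 1 - identified[s] / total[s] for s in total}
    return dict(by_county), unidentified_share


# Toy records (not real data): two identified, two non-identified.
toy = [
    {"STATEFIP": 42, "COUNTYFIPS": 3, "PERWT": 20},
    {"STATEFIP": 42, "COUNTYFIPS": 0, "PERWT": 20},
    {"STATEFIP": 42, "COUNTYFIPS": 0, "PERWT": 20},
    {"STATEFIP": 42, "COUNTYFIPS": 3, "PERWT": 20},
]
by_county, share = summarize(toy)
print(by_county, share)
```

A state-level share of zeros does not say which counties the missing residents belong to, so it can flag a problem but cannot repair individual county totals.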
I don’t know with certainty. I suspect it has something to do with how the various boundaries of geographic variables changed in 1980, compared to other years. That is, these changes yielded more errors of omission.
A follow-up: The problem with 1980 county population totals was not in fact due to known limitations in our ability to identify whole counties, nor was it due to an explicit plan to identify only parts of counties. After investigating the problem further, we discovered that our handling of 1980 county codes *was* faulty.
It’s our intent that if a county is identified, then ALL county residents–and ONLY county residents–are coded as residents.
This week we released corrections to our county codes. There were at least a few corrections in most samples from 1970 through 2011, and many corrections in the 1980 samples.
For details, including a spreadsheet identifying all corrections, see this entry in the IPUMS USA Revisions page.
Thanks much for bringing this issue to our attention!