IPUMS USA makes available the 1940 Census 100% sample, where each row in the table represents a person. Just to be specific, when I aggregate the person-level records for the state of Georgia, I get 3,128,132 persons. However, various published summaries of the 1940 Census put the population total at 3,123,723 – see for example the Vital Statistics publication for 1940 on page 23 (warning: 28MB file) (also can be obtained through IPUMS NHGIS). A pretty small discrepancy – something I am not worried about but I am curious nonetheless.
At the county level, I get some discrepancies when breaking down populations by race (White vs. All Other). For Appling County, GA, the aggregated total is 14,511 vs the published value of 14,497. Among Whites, the aggregated count is 12,114 vs a published value of 11,856, and for non-Whites, the aggregated count is 2,397 vs a published value of 2,641 (unless I made a mistake).
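To make the size of these gaps concrete, here is a minimal Python sketch that computes the absolute and percentage differences from the figures quoted above. The `counts` dictionary and the `discrepancy` helper are illustrative names only; they are not part of any IPUMS tool or extract.

```python
# Counts quoted in this post: (aggregated from microdata, published)
counts = {
    "Georgia, total": (3_128_132, 3_123_723),
    "Appling County, total": (14_511, 14_497),
    "Appling County, White": (12_114, 11_856),
    "Appling County, non-White": (2_397, 2_641),
}


def discrepancy(aggregated, published):
    """Return (absolute difference, percent difference relative to published)."""
    diff = aggregated - published
    return diff, 100.0 * diff / published


results = {label: discrepancy(*pair) for label, pair in counts.items()}

for label, (diff, pct) in results.items():
    print(f"{label}: {diff:+,d} ({pct:+.2f}%)")

# Sum of the two race-specific differences for Appling County
race_sum = (
    results["Appling County, White"][0]
    + results["Appling County, non-White"][0]
)
print(f"Appling County, sum of race differences: {race_sum:+,d}")
```

The White and non-White differences (+258 and -244) net to the +14 difference in the county total, so most of the county-level gap appears to come from how people are split between the two race categories rather than from extra or missing records.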
I was just trying to understand why aggregates based on the microdata are slightly different from the published aggregates in various publications. Of course, 1940 was a long time ago and maybe we don’t know the exact filters or database that the Census used, which may be the answer to my question. Apologies if I have missed something obvious or an explanation in the documentation, and thank you for making the data widely available.
You’ve guessed the correct answer. The original summaries (available through NHGIS as you mentioned) were tabulated immediately following the enumeration, whereas the IPUMS USA microdata files are based on transcriptions, made decades later, of the microfilmed original enumeration forms. While it may be safest to assume the original tabulations are the “correct” figures, and these discrepancies represent errors in the IPUMS USA microdata, they may also be due to differences in how missing or unclear information was handled. There were certainly errors introduced as part of both the microfilming and transcription processes (such as duplicated records, duplicated images, transcription errors, etc.), but IPUMS does attempt to identify these, often using the original summary statistics as a guide. However, we do not expect to exactly match these published figures since we can’t exactly match the original tabulation procedures. I don’t think we have an explicit note about this, and that is a great suggestion. I will be sure to pass this along to the IPUMS USA Team.
Thank you so much for the detailed answer; your response helps me understand this much better. I would happily read any documentation or notes that the IPUMS USA team adds about this in the future, or any other readings you can suggest about the process.