Different sample size across versions of 1940 census

My organization has a version of the IPUMS Complete Count 1940 census data stored in a repository. It claims to include all individuals from the 1940 US census. The total number of records in the file is 132,404,766. The citation information provided is:
Steven Ruggles, Katie Genadek, Ronald Goeken, Josiah Grover, and Matthew Sobek. Integrated Public Use Microdata Series: Version 6.0 [dataset]. Minneapolis: University of Minnesota, 2015.

However, when I access the 1940 full count census data on IPUMS.org, I’m seeing that the total number of records is 131,903,910. The citation info currently on the IPUMS website is: Steven Ruggles, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas, and Matthew Sobek. IPUMS USA: Version 9.0 [dataset]. Minneapolis, MN: IPUMS, 2019. D010.V9.0

So, it looks like there is a discrepancy of about 500,000 people. I also noticed in the citation info that the versions are different. I’m a bit confused because it seems like the total N for the 1940 census should be fixed. Is there any reason to expect differing counts across versions of the data?

This is a good question. Typically, revisions to the data are listed on the IPUMS USA Revisions Page. In this case, two notes appear to explain the differing record counts. First, on January 29, 2019, a new version of the 1940 file was released that excluded Alaska and Hawaii. Second, on February 8, 2016, an updated version of the 1940 full count file removed a number of duplicate person records.
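For reference, the exact size of the gap can be worked out from the two record counts quoted in the question. This is only a quick arithmetic sketch; the variable names are made up for illustration, and it does not split the difference between the two revisions, since the revision notes quoted here do not give per-revision counts.

```python
# Record counts quoted in the question above.
v6_records = 132_404_766  # repository copy, cited as Version 6.0 (2015)
v9_records = 131_903_910  # current IPUMS USA extract, cited as Version 9.0 (2019)

# Records present in the older file but not in the current one.
difference = v6_records - v9_records
share = difference / v6_records

print(f"{difference:,} fewer records ({share:.2%} of the older file)")
```

The difference is 500,856 records, which matches the "about 500,000" in the question.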

Hello!

It appears that the version of the 1940 Full Count Dataset (from IPUMS USA | DEMONSTRATION DATA FOR U.S. CENSUS BUREAU DISCLOSURE AVOIDANCE SYSTEM) is out of date. The version there still has a total Persons count of 132,404,766 (as opposed to the 131,903,910 you get if you download an extract). The 131,903,910 matches the total of the case-count view for the full 1940 file (see IPUMS USA: descr: CITIZEN, for example).

Also, many of the variable counts are different. For example, in EXT1940USCB.dat, there are no Persons with Citizen == 5 (Foreign born, citizenship status not reported), where the case count view shows there should be 616,530.

Are these discrepancies due to the different versions, or am I missing something?

If the differences are due to different versions, would it be possible to get an updated version of the EXT1940USCB.dat to use?

Thank you!
Micah

The 1940 Full Count Dataset available on the 1940 Demonstration Data For US Census Bureau Disclosure Avoidance System is the dataset the Census Bureau used to help design their proposed disclosure avoidance system (based on differential privacy). We’ve posted the dataset on the website so that others may run the disclosure avoidance code and see how it impacts the accuracy of output data.

This dataset was provided to the Census Bureau in 2017 or 2018 and will not match the data currently available through the IPUMS data access system. The data currently available in the access system will be the most up-to-date version (released in January 2019 - https://usa.ipums.org/usa-action/revisions#revision_01_29_2019).

I strongly recommend you access the data through the IPUMS data access system to make sure you’re getting the most up-to-date version. The 1940 Full Count Dataset you described was posted for the express purpose of testing the Census Bureau’s disclosure avoidance system. At this time, we have no plan to provide an updated version.
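If you want to confirm which version a local file corresponds to, one option is to compare its record total and a simple tabulation (such as CITIZEN) against the case-count view on the website. The following is a minimal sketch, not an official IPUMS tool. It assumes the extract was downloaded in CSV format with a header row containing a CITIZEN column; the function name `tabulate` is hypothetical. It would not work as-is on a fixed-width file such as EXT1940USCB.dat, which would need the column positions from its codebook.

```python
import csv
from collections import Counter


def tabulate(path, column="CITIZEN"):
    """Return (total_records, Counter of values in `column`) for a CSV extract.

    Assumes a header row; values are kept as the strings found in the file.
    """
    counts = Counter()
    total = 0
    with open(path, newline="") as f:
        # Stream row by row so a full count file does not have to fit in memory.
        for row in csv.DictReader(f):
            total += 1
            counts[row[column]] += 1
    return total, counts
```

Calling `tabulate` on the path to your extract returns the record total and the per-code counts. For the current 1940 full count data, the total should be 131,903,910 and the count for CITIZEN code 5 should be 616,530, according to the figures quoted earlier in this thread.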

That makes sense, thank you for the information!