Undisclosed recent revision of IND1950 for the 1910 (and 1900-1940) full count censuses?


Have there been any recent (undisclosed) revisions to IND1950 since June 2021 in the full count censuses between 1900 and 1940? The IPUMS revisions page does not indicate so, but I feel like I am at my wit’s end here. (Edit: I do see a revision of the full counts from 1900 to 1930 that occurred in January 2022.)

I have attached a screenshot tabulating IND1950 for Aberdeen in 1910. The first part of the image uses data downloaded in June 2021; the second part uses data downloaded today.

I can tell there isn’t a geography definition issue, as the total populations are identical between the two datasets. While I understand that some unassigned codes may be reclassified (e.g., through advances in legibility/deciphering capabilities), 1) does this process reassign individuals already assigned to industries, and 2) are these changes undisclosed?

I think I’m just generally taken aback by how drastic these changes are, without understanding why they occurred. E.g., the new version features a ~2.5x increase in people working in “local public admin” and roughly a 33% drop in workers in the legal services industry. These just seem like shocking changes.

Is there any way to learn about the history of changes to each full counts set in greater detail than what is available on the /revisions site? Ideally, I would also like to know when the datasets were updated.

As you mention, there was an update to the 1900-1930 complete count files on January 18, 2022. Revisions to the full count historical files often include updates to string variables; in the case of the January 2022 release of 1900-1930 data, this specifically included, among other edits, updates to the coding of string combinations of OCC/IND/CLASSWKR and other data quality checks that further refined these codes. These edits targeted strings that either were not coded or were coded incorrectly. These changes would also be reflected in the harmonized IND1950 codes that you are seeing.

You might try examining how respondents were recoded between your two datasets. HISTID is a consistent individual-level identifier that should be included in your older extract and can be used to merge respondents across your datasets to compare individual level changes in industry coding.
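A minimal sketch of that comparison, assuming each extract has been read into a list of records that carry HISTID and IND1950 (for example via `csv.DictReader`). The function name, the toy records, and the use of 246 as a stand-in construction code are illustrative assumptions, not part of any IPUMS tooling; check codes against the IND1950 codebook.

```python
from collections import Counter

def recode_transitions(old_rows, new_rows, id_key="HISTID", var="IND1950"):
    """Count (old_code, new_code) pairs for persons present in both extracts."""
    old_codes = {row[id_key]: row[var] for row in old_rows}
    transitions = Counter()
    for row in new_rows:
        histid = row[id_key]
        if histid in old_codes:  # keep only persons found in both versions
            transitions[(old_codes[histid], row[var])] += 1
    return transitions

# Toy records, made up for illustration (not real census data).
old = [
    {"HISTID": "A1", "IND1950": 0},
    {"HISTID": "A2", "IND1950": 459},
    {"HISTID": "A3", "IND1950": 459},
]
new = [
    {"HISTID": "A1", "IND1950": 459},
    {"HISTID": "A2", "IND1950": 246},  # 246: assumed stand-in for construction
    {"HISTID": "A3", "IND1950": 459},
]

transitions = recode_transitions(old, new)
# Off-diagonal pairs are the individuals whose industry code changed.
changed = {pair: n for pair, n in transitions.items() if pair[0] != pair[1]}
print(changed)
```

Sorting the off-diagonal pairs by count shows which old-to-new recodes account for most of the shift in a given city or industry.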

Thank you so much for this response! In my older version of the full count census, I omitted HISTID to save space. Is it possible to download the older version again so I can make this comparison? I’ve had a hard time figuring out how to do so (if it’s possible).

We can work with you to provide an older version of the data if necessary, but it might be helpful to first take a more detailed look at these industry changes and see if we can provide information about the revisions that occurred and how they affected the data.

Could you describe broadly what changes you are seeing in industry coding between your samples? If you could also share a larger tabulation of IND1950, and IND itself if you have it in your older dataset, that would be particularly helpful. I’m also hoping that you could provide a bit more detail regarding your analysis so that I can understand the scope of these edits that are relevant to your research, including geographic, industry, and/or demographic criteria.

Hey Ivan,

Basically, in my analysis, I am studying changes in occupational composition within cities in response to a specific government policy. This analysis involves producing population and labor force shares of the population that are assigned to a certain value of OCC1950 or IND1950.
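For concreteness, the share calculation can be sketched as below. This is only an illustration under stated assumptions: the records are toy data, `industry_shares` is a hypothetical helper, and a labor-force share would require filtering the rows first (e.g., on a labor force status variable) before calling it.

```python
from collections import Counter, defaultdict

def industry_shares(rows, city_key="CITY", var="IND1950"):
    """Return {city: {code: share of that city's rows with this code}}."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[city_key]][row[var]] += 1
    shares = {}
    for city, code_counts in counts.items():
        total = sum(code_counts.values())
        shares[city] = {code: n / total for code, n in code_counts.items()}
    return shares

# Toy records, made up for illustration (not real census data).
rows = [
    {"CITY": "Aberdeen", "IND1950": 459},
    {"CITY": "Aberdeen", "IND1950": 459},
    {"CITY": "Aberdeen", "IND1950": 0},
    {"CITY": "Aberdeen", "IND1950": 246},
]
shares = industry_shares(rows)
print(shares["Aberdeen"])
```

Running the same function on the old and new extracts and differencing the resulting shares gives a city-by-industry view of how much the revision moved each outcome.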

Attached are four screenshots, entitled “occ1950_old”, “occ1950_new”, “ind1950_old”, and “ind1950_new”, with tabulations of the first 40 or so occupations/industries in the full count census for 1900. Let me know if you would like to see anything more specific.

I guess this isn’t quite your guys’ problem, but I was pretty distressed: since downloading the updated censuses, the results of my DD designs have deteriorated considerably. And while of course it’s entirely in my interest to have good-looking DD results, the types of changes that I’ve seen just seem odd, leading me to suspect that the recoding that occurred might not be perfectly comparable across geography or over time (i.e., in combination with the 1940 and 1950 censuses).

I’ve attached two DD graphs that use identically-specified regressions—one using the old data, and one using the new data.

Ex ante, I would suspect that one should prefer the updated data, as presumably updated means “better”, but perhaps there would be a justification for me to use the pre-update data. Perhaps the updated recoding/processing was implemented differentially over geography, or perhaps it results in a poor comparison with the analogous coding for the 1940 and 1950 full count censuses? Moreover, I have a very hard time believing the fidelity of the recode given the lack of change between 1930 and 1940 in the DD graph that uses the new data. I’m also using an instrument (which I view as valid) in this specification, so I find the emergence of a pre-trend very striking.

I think just more broadly, I want to know a bit more about this recoding of industries and occupations that occurred.

The historical team shared an overview of the types of changes between the two versions of the data with specific attention to IND1950, and took a closer look at changes to the publishing industry (ind1950 code = 459 “Printing, publishing, and allied industries”) to examine how these played out for a specific industry; they chose this industry based on the figures that you shared. There were very few changes in 1900 and no changes in 1940, so the historical team concentrated their focus on 1910-1930. Changes include adding/removing cases (fewer than 0.5% of records), assigning codes that were previously missing to valid industries (this is the largest share at 6-9% of records), and re-assigning codes that were NOT previously missing to a different industry (up to 3% of records). On the whole, the changes seem appropriate and improve data quality.
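The three kinds of change described above can be tallied for any pair of extracts with a sketch like the following. It assumes HISTID-to-IND1950 mappings for each version; the function names and toy data are hypothetical, and the set of "missing" codes (0 and 979, the two values mentioned in this thread) should be treated as an assumption to adjust.

```python
from collections import Counter

# Codes treated as "no industry assigned": 0 (N/A or none reported) and
# 979 (missing). This set is an assumption; adjust it to the codebook.
MISSING_CODES = {0, 979}

def classify_change(old_code, new_code):
    """Label one person's change; None means the record is absent in that version."""
    if old_code is None:
        return "added"
    if new_code is None:
        return "removed"
    if old_code == new_code:
        return "unchanged"
    if old_code in MISSING_CODES and new_code not in MISSING_CODES:
        return "missing_to_valid"
    if old_code not in MISSING_CODES and new_code not in MISSING_CODES:
        return "valid_to_valid"
    return "other"  # e.g. valid to missing, or one missing code to another

def summarize_changes(old_codes, new_codes):
    """Tally change types over every HISTID seen in either version."""
    ids = set(old_codes) | set(new_codes)
    return Counter(classify_change(old_codes.get(i), new_codes.get(i)) for i in ids)

# Toy HISTID -> IND1950 mappings, made up for illustration.
old_codes = {"A1": 979, "A2": 459, "A3": 459, "A4": 246}
new_codes = {"A1": 459, "A2": 246, "A3": 459, "A5": 459}
summary = summarize_changes(old_codes, new_codes)
print(summary)
```

Dividing each tally by the number of records gives shares that can be set against the percentages quoted above, separately by year or by city.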

For the publishing/printing industry specifically, the new counts appear to be closer to the estimated sample counts (the occupation coding in the sample data is generally assumed to be high quality as there are many fewer cases, so it is easier to assign edge cases correctly), there was no evidence of state-level clustering of changes, and the only demographic trend for industry was that more younger people were coded away from publishing/printing to a different industry (but the overall sample was pretty small). Note that because of the similarity in the occupations “PRINTER” and “PAINTER”, a common change was coding cases away from publishing/printing and into construction. It is certainly possible that some of these should have remained as publishing/printing.

Below are additional details about the changes between versions for each sample year:

1910: Added 149k cases to industry code 459 (132k were previously “0 = N/A or none reported”) and reassigned 21k cases from code 459 to something else; approximately 5k were recoded to “construction” (the painter/printer issue described above).
1920: Added 176k cases to industry code 459 (158k were previously “979 = missing”) and reassigned 30k cases from code 459 to something else.
1930: Added 207k cases to industry code 459 (187k were previously “979 = missing”) and reassigned 23k cases from code 459 to something else.
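As a quick worked check on these figures, the net change in code 459 and the share of additions that came from previously uncoded records can be computed as below. This is a rough sketch that assumes "added" means gross additions and uses the rounded thousands quoted in this post, so the results are approximate.

```python
# Figures in thousands, as quoted above for IND1950 code 459.
changes = {
    1910: {"added": 149, "from_missing": 132, "moved_out": 21},
    1920: {"added": 176, "from_missing": 158, "moved_out": 30},
    1930: {"added": 207, "from_missing": 187, "moved_out": 23},
}

# Net change = cases added to 459 minus cases reassigned away from 459.
net_change = {year: c["added"] - c["moved_out"] for year, c in changes.items()}

# Share of the additions that were previously uncoded (N/A or missing).
missing_share = {year: c["from_missing"] / c["added"] for year, c in changes.items()}

for year in sorted(changes):
    print(f"{year}: net {net_change[year]:+d}k, "
          f"{missing_share[year]:.0%} of additions were previously uncoded")
```

On these rounded numbers, the large majority of the growth in code 459 comes from filling in previously uncoded records; moves between valid industries account for a much smaller part.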

Hi Ivan,

Thank you again for your help. I realize that since these corrections were not implemented for 1940 and 1950, those full count censuses aren’t longitudinally comparable with the corrected 1910-1930 full counts.

I.e., if correcting the OCC1950 categorization problem in the 1910-1930 counts resulted in important changes in the measured occupational composition of cities (particularly the differential change measured in the DD that I previously showed, albeit without having disclosed what the two DD groups are), then unless there is good reason to believe that similar problems don’t exist for 1940 and 1950, my analysis shouldn’t use the 1940 and 1950 full counts alongside the corrected 1900-1930 full counts. Whatever was “contaminating” the analysis years 1900-1930 prior to the correction in January 2022 is presumably still contaminating the 1940 and 1950 full counts, thus resulting in poor comparability of 1900-1930 with 1940 and 1950, at least for the purposes of my analysis.

This leaves me with three possibilities:

  1. Wait for similar adjustments to be made for 1940 and 1950
  2. Proceed with using the old version of the full count censuses, explaining that using the old releases 1900-1930 in tandem with 1940 and 1950 preserves the longitudinal comparability of the occupational composition of cities
  3. Justify why the current releases of the 1940 and 1950 full counts shouldn’t see similar problems as the previously uncorrected 1900-1930 full counts and proceed with the new releases

Does this line of thinking make sense?

I received some more information that I can share with you. Overall, I suggest that you prioritize using the most recent and best versions of the data that are available. The versions of the data prior to the January 2022 update were preliminary, and the now-finalized datasets are improved versions of those preliminary data.

Your emphasis on comparability makes sense. However, you should note that all of the years are processed separately based on data availability and data transcription of the variables: 1900 is processed differently than 1910-1930, 1940, and 1950 (and vice versa for each dataset). Errors would only be duplicated within each of these samples, since each dataset is processed individually by HCP and enumeration instructions differ between censuses. Based on this, there is no reason to assume that 1940 has a similar issue as 1910-1930 unless evidence is produced to show that there are systematic coding errors. While the codes are generally comparable, researchers should consider the historical context and census enumeration procedures when deciding whether comparing a particular industry over time is sound. Researchers should be cautious using 1950 because that is still a preliminary dataset: we are still improving occupation coding, and adjustments will still be made. By contrast, 1940 (and any version of 1940 data since April 2021) is based on the final version of 1940, so no more data adjustments are being made to occupation except when errors are found.