Discrepancy Between 1% Sample and Full Count Data



I had been working with the 1% samples. I recently downloaded the full count census data and I’m getting significantly different numbers. I’m trying to figure out where I’m going wrong.

For instance, the 1920 full count data from Oregon shows 9,995 cases where the individual was categorized as working in the logging industry (code 306 in the IND1950 variable).

On the other hand, the 1920 1% sample from Oregon with Person Weight (PERWT) applied shows 16,772 cases where the individual was categorized as working in the logging industry (again using IND1950).

This is a person-level analysis so PERWT is the correct weight to apply to the 1% sample in this case, right?

I understand that the full count data and the 1% sample will not align perfectly. Still, I didn’t expect the numbers to be that different. Like I said, I’m just trying to figure out why the numbers are so far apart. Am I making a mistake in how I’m working with the 1% sample or with the full count data?

Thanks for your help!



The main reason is that the processes for transcribing and coding these two files were very different. For the 1% sample files, our historical team is able to give a fair amount of attention to each record, especially those that seem incorrect. With the full count processing, there is just too much data to comb through in such detail. Instead, the historical team adapts many of our processing methods, the main adaptation being our partnership with Ancestry.com to produce the transcriptions of the original enumeration forms. These differences in transcription and coding can cause a discrepancy like the one you are reporting here.
That being said, our historical data team is currently working on a new version of the 1940 full count file. One of the updates for this file is improved coding of the occupation and industry variables. So, while I can’t know for sure whether this discrepancy will be addressed by the forthcoming update, it may help.
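
As a rough sanity check on the numbers in the question, here is a minimal Python sketch of how a PERWT-weighted count is formed and how large ordinary sampling error might be for the 1% sample estimate. It rests on several assumptions that are not from the thread: every sampled person is taken to carry a flat weight of 100 (actual PERWT values may differ), the sample is treated as a simple random sample of persons (household clustering is ignored), and the standard error uses a Poisson-style approximation. The function names and the toy records are hypothetical, for illustration only.

```python
import math

def weighted_count(records, industry_code=306):
    """Sum PERWT over person records whose IND1950 equals industry_code."""
    return sum(r["PERWT"] for r in records if r["IND1950"] == industry_code)

def approx_se_flat_weight(estimate, weight=100.0):
    """Rough standard error of a weighted count.

    Assumes (hypothetically) equal weights and simple random sampling:
    the estimate is weight * n, and n is treated as roughly Poisson,
    so SE is about weight * sqrt(n).
    """
    n_sample_cases = estimate / weight
    return weight * math.sqrt(n_sample_cases)

# Toy person records (made up): two loggers and one person in another industry.
records = [
    {"IND1950": 306, "PERWT": 100.0},
    {"IND1950": 306, "PERWT": 100.0},
    {"IND1950": 0, "PERWT": 100.0},
]
toy_estimate = weighted_count(records)  # two matching records, 100.0 each

# Figures reported in the question above.
sample_estimate = 16772  # 1% sample, PERWT applied
full_count = 9995        # full count file

se = approx_se_flat_weight(sample_estimate)
gap_in_se = (sample_estimate - full_count) / se

print(f"approximate SE of the sample estimate: {se:.0f}")
print(f"gap between the two figures, in SE units: {gap_in_se:.1f}")
```

Under these simplifying assumptions the standard error comes out at roughly 1,300 people, so the gap of 6,777 is on the order of five standard errors. That is larger than sampling variation alone would usually produce, which is consistent with the explanation above that differences in transcription and coding are the main driver. A design-based variance estimate would give a more careful answer.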