I have the same question that was asked previously (sample size discrepancy between IPUMS CPS and raw CPS):
I am comparing the IPUMS version with the raw CPS files on the Census FTP. It seems that IPUMS has about 12,000 more households than the raw CPS files for each month. I think this issue has not been resolved or documented on the IPUMS website yet. (Or am I just missing it?) What is the source of the discrepancy, and how should I deal with it?
Thanks for following up on this question. We do have a bit more information to share. None of this allows for much clarity, but it does provide some additional explanation for what might be going on here. In short, we suspect this is due to inconsistent inclusion of a SCHIP (State Children’s Health Insurance Program) oversample in this year, but the documentation is both scant and contradictory.
A few details to note. First, from September 2000-August 2001 the basic monthly samples contain (or should contain) an oversample of about 12,000 households for SCHIP purposes (see Technical Paper 63, revised, Appendix J). According to that technical paper, this sample size increase was phased in by November 2000. Second, it seems that not all of the data files between November 2000 and August 2001 actually contain this oversample. In fact, the only samples that appear to have those extra SCHIP households are April 2001, June 2001, July 2001, and August 2001. Finally, the CPS codebooks reflect the discrepancies in record counts that we see in IPUMS CPS data. Case counts for the original input data for April, May, and June 2001 match those listed in the sample-specific codebooks on page 3 under the heading “Technical Description.” This makes us think that, even though the technical paper suggests that all samples from November 2000-August 2001 should contain extra records, not all of them actually do. However, we haven’t been able to find anything in these codebooks to suggest a reason for these differing case counts among samples.
So, in conclusion, the CPS documentation contradicts itself. The technical paper says samples between November 2000 and August 2001 should have oversamples, but not all of them do. Further, the sample-specific documentation reflects the case counts that appear in the actual files.
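If you want to see which months carry the extra households in your own files, one option is to tabulate distinct households per sample month in each source and compare. The following is only a minimal sketch, not an IPUMS-provided tool: the function names and the `(year, month, household_id)` record layout are hypothetical, and you would need to build those tuples yourself from whatever household identifier each file uses.

```python
from collections import defaultdict


def households_per_month(records):
    """Count distinct household IDs for each (year, month).

    `records` is any iterable of (year, month, household_id) tuples.
    This layout is an assumption made for illustration only.
    """
    seen = defaultdict(set)
    for year, month, household_id in records:
        seen[(year, month)].add(household_id)
    return {key: len(ids) for key, ids in seen.items()}


def count_gaps(ipums_counts, raw_counts, threshold=0):
    """Return {(year, month): ipums - raw} where the absolute gap exceeds threshold."""
    gaps = {}
    for key in sorted(set(ipums_counts) | set(raw_counts)):
        diff = ipums_counts.get(key, 0) - raw_counts.get(key, 0)
        if abs(diff) > threshold:
            gaps[key] = diff
    return gaps


# Toy person-level records (made-up values, not real CPS data).
ipums = [(2001, 4, "A"), (2001, 4, "A"), (2001, 4, "B"), (2001, 4, "C"), (2001, 5, "A")]
raw = [(2001, 4, "A"), (2001, 4, "B"), (2001, 5, "A")]

gaps = count_gaps(households_per_month(ipums), households_per_month(raw))
print(gaps)  # {(2001, 4): 1}
```

With real data, a month showing a gap of roughly 12,000 households would be consistent with the oversample being present in one source and absent in the other. Setting `threshold` to a few hundred can help ignore small differences.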