I have been doing some preliminary analysis of the IPUMS-CPS data set, trying to get my accounting correct for the way that person records relate to household records, to fragments, and to group quarters. I have some hypotheses, to the plausibility of which I invite your assessment, and have also found some puzzling results, on which I hope you can shed some light.
Let us call a residence the place where anyone resides, whether or not they are a member of a household. For the rest of this note I will be ignoring weights on both households and individuals, because I assume that before I can get the population accounting straight, I need to get the observation accounting straight.
Until 1976, all residences with SERIAL numbers also had a person with PERNUM=1.
After 1976, the number of residences with a serial number but without anyone with PERNUM=1 increased suddenly and dramatically, amounting in every year to roughly a fifth of the number of households identified with SERIAL numbers.
Also prior to 1976, the number of residences with SERIAL numbers but no householder was roughly equal to, but always slightly greater than, the number of household fragments, and varied closely with the latter. I don’t know who constitutes the non-FRAGMNT residences with no householder in this period, but the number is quite small – less than 40 in every year but 1966. I was thinking that this suggests that most of the fragment households are remains of complete households, the rest of which have been lost, but this assumed that the SERIAL numbers came from the original data. Given that IPUMS assigned the SERIAL numbers, do you know why there are some residences that are not fragments but also do not have householders?
(I love you folks at IPUMS and greatly appreciate what you do, but I wish you would retain things like the original household serial number in your data set. I’d rather not have to retain two separate sets of data and documentation to address these questions, especially since there does not seem to be any unique key adequate for a reliable merger. And some kinds of research, such as investigating alternative ways of linking records across periods, are pretty much impossible without the stripped data.)
From 1977 on, the number of residences with a SERIAL number but no householder (i.e. no person identified with RELATE=0101) rose by an amount similar to but somewhat larger than the number of residences with a SERIAL number but no one with PERNUM=1. These numbers remained close but distinct until 1994, when they became identical, and they have remained so through the present.
Also in 1977, the number of reported fragments dropped to zero.
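For concreteness, the two counts discussed above (residences with no PERNUM=1, and residences with no householder) could be tabulated along the following lines. This is only a minimal sketch: it counts residences as the distinct SERIAL values that actually appear in the person records, the function name `residences_missing` is hypothetical, and the rows are made-up toy records rather than CPS data. It also assumes one sample per year.

```python
from collections import defaultdict

def residences_missing(records):
    """Per year: distinct SERIALs, those lacking PERNUM=1, those lacking RELATE=101."""
    serials = defaultdict(set)
    has_pernum1 = defaultdict(set)
    has_head = defaultdict(set)
    for r in records:
        year = r["YEAR"]
        serials[year].add(r["SERIAL"])
        if r["PERNUM"] == 1:
            has_pernum1[year].add(r["SERIAL"])
        if r["RELATE"] == 101:  # householder
            has_head[year].add(r["SERIAL"])
    return {
        year: {
            "residences": len(s),
            "no_pernum1": len(s - has_pernum1[year]),
            "no_householder": len(s - has_head[year]),
        }
        for year, s in serials.items()
    }

# Toy rows (not CPS data); RELATE values other than 101 are arbitrary.
toy = [
    {"YEAR": 1977, "SERIAL": 1, "PERNUM": 1, "RELATE": 101},
    {"YEAR": 1977, "SERIAL": 1, "PERNUM": 2, "RELATE": 201},
    {"YEAR": 1977, "SERIAL": 2, "PERNUM": 2, "RELATE": 301},   # no PERNUM=1, no head
    {"YEAR": 1977, "SERIAL": 3, "PERNUM": 1, "RELATE": 1260},  # PERNUM=1, no head
]
print(residences_missing(toy))
```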
I counted the SERIAL numbers by taking the maximum for each year. It occurred to me that, if there are unallocated SERIAL numbers, i.e. gaps in the SERIAL number sequence, that could account for the difference with the number of residences with no PERNUM=1 or RELATE=101. I tested this in two ways.
First, for each year, I summed the value of NUMPREC for every line with RELATE=0101. This should give me the number of persons in households. Then I just counted the number of person records in each year. From 1971 through 1993, the count of person records exceeded the sum of NUMPREC for householders by exactly the number of people in group quarters. So for these years, every person in the sample is either a member of a household with a householder or a resident in group quarters. For 1969 and 1967, the difference between the count of person records and the count of people in households is, respectively, 116 less and 2 more than the number of residents in group quarters. Prior to 1967, residents in group quarters are not reported separately, though the difference between persons in households and total person records remains roughly the same. To me this suggests that in those earlier years the sample included people in group quarters, and they are simply not identified. Does this seem plausible to you?
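The accounting check described above can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure used: `household_accounting` is a hypothetical name, `IS_GQ` is a hypothetical flag standing in for however group quarters residents are identified in a given year, and the rows are toy records, not CPS data.

```python
from collections import defaultdict

def household_accounting(records):
    """Per year: person records, sum of NUMPREC over householders, and GQ residents."""
    out = defaultdict(lambda: {"persons": 0, "in_households": 0, "in_gq": 0})
    for r in records:
        row = out[r["YEAR"]]
        row["persons"] += 1
        if r["RELATE"] == 101:  # householder: NUMPREC counts everyone in the household
            row["in_households"] += r["NUMPREC"]
        if r["IS_GQ"]:  # hypothetical group quarters flag
            row["in_gq"] += 1
    for row in out.values():
        # Person records not covered by any householder's NUMPREC.
        row["difference"] = row["persons"] - row["in_households"]
    return dict(out)

# Toy rows (not CPS data). The 1980 GQ resident is not coded as a householder;
# the 1994 GQ resident is coded as head of a one-person household.
toy = [
    {"YEAR": 1980, "SERIAL": 1, "RELATE": 101, "NUMPREC": 2, "IS_GQ": False},
    {"YEAR": 1980, "SERIAL": 1, "RELATE": 201, "NUMPREC": 2, "IS_GQ": False},
    {"YEAR": 1980, "SERIAL": 2, "RELATE": 1260, "NUMPREC": 1, "IS_GQ": True},
    {"YEAR": 1994, "SERIAL": 1, "RELATE": 101, "NUMPREC": 1, "IS_GQ": False},
    {"YEAR": 1994, "SERIAL": 2, "RELATE": 101, "NUMPREC": 1, "IS_GQ": True},
]
for year, row in sorted(household_accounting(toy).items()):
    print(year, row)
```

In the toy 1980 rows the difference equals the group quarters count, and in the toy 1994 rows it is zero, which mirrors the two patterns reported in this note.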
After 1993 the number of people in households is exactly equal to the number of person records. This implies that in 1994 and thereafter, but not before, people in group quarters are each treated as being head of their own household. For consistency, I would recommend recoding the persons residing in group quarters to be household heads in the earlier years as well.
The fact that the number of person records is exactly equal to the sum of NUMPREC for heads of household (RELATE=101) for every year after 1993, and differed by only the number of group quarters residents prior to that, implies that the large increase in the difference between the highest SERIAL number and the number of residences with PERNUM=1 that begins in 1977 is not due to an increase in the number of residences without a householder. Instead it must be due to gaps in consecutive serial numbers. This conclusion is further reinforced by noticing that the maximum SERIAL number increased by 12,806 more than the number of households with PERNUM=1 between 1976 and 1977 – an amount nearly the same as the 12,750 difference between the two that appeared in 1977, and that was maintained (in percentage terms) thereafter. Does this seem correct to you?
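A more direct check of the gap hypothesis would be to compare, for each year, the maximum SERIAL with the number of distinct SERIAL values actually present. The sketch below assumes that SERIAL numbering starts at 1 within each year and that there is one sample per year; both are assumptions, `serial_gaps` is a hypothetical name, and the rows are toy records, not CPS data.

```python
from collections import defaultdict

def serial_gaps(records):
    """Per year: max SERIAL, distinct SERIALs present, and the implied number of gaps."""
    seen = defaultdict(set)
    for r in records:
        seen[r["YEAR"]].add(r["SERIAL"])
    return {
        year: {
            "max_serial": max(s),
            "distinct_serials": len(s),
            # Assumes numbering starts at 1, so unused numbers = max - distinct.
            "gaps": max(s) - len(s),
        }
        for year, s in seen.items()
    }

# Toy rows (not CPS data): 1976 is consecutive, 1977 skips SERIAL 3 and 4.
toy = [
    {"YEAR": 1976, "SERIAL": 1}, {"YEAR": 1976, "SERIAL": 2}, {"YEAR": 1976, "SERIAL": 3},
    {"YEAR": 1977, "SERIAL": 1}, {"YEAR": 1977, "SERIAL": 2}, {"YEAR": 1977, "SERIAL": 5},
]
for year, row in sorted(serial_gaps(toy).items()):
    print(year, row)
```

If the gap explanation is right, the "gaps" figure for 1977 onward should account for most of the difference between the maximum SERIAL and the number of residences with PERNUM=1.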
The preceding three paragraphs were written before I looked at the documentation and found that the SERIAL numbers were assigned by IPUMS rather than the Census. Now I am much more puzzled. I could see reasons why it might be convenient for the Census to distribute blocks of SERIAL numbers to different geographic areas and so develop gaps in numbering. I do not understand how the sudden increase in the ratio of SERIAL numbers to householders in 1977 could arise from any reasonable method of assigning such numbers by IPUMS. Could you shed any light on this?
Finally, I have two questions with respect to person records with FRAGMNT=2. First, have you established that these records are not duplicates or near-duplicates of any other person records from the years in question? And second, have you looked at whether the inclusion of these records makes the CPS population composition more or less similar to the Census composition?
Sincerely, Andrew Hoerner