Some questions about households vs individuals and correctly measuring household size in the IPUMS-CPS


I previously asked whether the first person listed for a household was always the householder, and one of your staff replied that the first person was not necessarily the householder. Since the household weights are only correct if you take only a single row for each household, I have been trying to figure out how to do so in such a way as to represent every individual in the sample while correctly measuring the household size. I want to adjust household income for household size for purposes of adult-equivalency weighting, and also to construct equal-population deciles from household data.

If I take RELATE = 1, I get one row from each household with a household head. Household size is equal to NUMPREC from 1968 on.

If I take GQ = 2, I get 1 row for each person selected from a group quarters. Household size is always 1 for weighting purposes.

If I take FRAGMENT = 1, I get a line for each household fragment. Household size is always 1 for weighting purposes.

Will this combination (RELATE=1 OR GQ=2 OR FRAGMENT=1) pick up one row and only one row from every household type? Combined with family size measurement as described above, does it leave any people unaccounted-for in the years after 1967?

For 1962-1967, your Sample Notes documentation states that children under age 14 are not included in the sample. I take it this means that there is no person record, and hence no line, for such children. In this case, NUMPREC does not correctly count the number of people in the household. Is there any field in a household record or in the records of any member(s) of the household that would allow me to correctly measure household size for the 1962-1967 period?

Sincerely, Andrew Hoerner



Depending on your goal, I see no reason why your filtering combination of RELATE, GQ, and FRAGMENT wouldn’t work (however, do notice that the code for Head/Householder for RELATE is 0101 and not just 1). This method does make some strong assumptions about Group Quarters and Fragments. Firstly, even though, as the GQ description states, persons living in group quarters are generally sampled individually and cannot be linked, many samples do include multiple persons from single GQ units, including possible families. Multiple persons from a single GQ unit will share a SERIAL and HWTSUPP. Also, some individuals in GQs actually have values of FTOTVAL higher than INCTOT, implying that they do not simply represent a family of one. How you deal with these issues is ultimately your choice.
FRAGMENT is dangerous because it represents person records that were essentially “orphaned” in the original data, meaning that they were not listed with their appropriate household. Since there are so few cases, we generally advise people to drop them, but again this is up to you. As to your final question about identifying family size in the 1962-1967 samples, I am afraid I do not know of any dependable way to impute family size from the available variables. These samples were much more interested in individual workers than families. I hope this helps.
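
As a rough illustration of the filter discussed above, here is a minimal Python sketch. It assumes the person records have already been read into a list of dictionaries keyed by IPUMS variable name, that RELATE has been read as an integer (so 0101 appears as 101), and that GQ = 2 and FRAGMENT = 1 mean what the question takes them to mean. The function names and the toy records are hypothetical, and the value 999 is only a placeholder for "any RELATE code other than householder". The second function reports any SERIAL that the filter picks up more than once, which is what can happen when several persons from one GQ unit share a SERIAL.

```python
from collections import Counter

def select_household_rows(persons):
    """Keep rows matching RELATE == 101 OR GQ == 2 OR FRAGMENT == 1."""
    return [
        p for p in persons
        if p["RELATE"] == 101 or p["GQ"] == 2 or p["FRAGMENT"] == 1
    ]

def units_selected_more_than_once(rows):
    """Return {(YEAR, SERIAL): row count} for units selected more than once."""
    counts = Counter((r["YEAR"], r["SERIAL"]) for r in rows)
    return {key: n for key, n in counts.items() if n > 1}

# Toy records for illustration only; these are not real CPS data.
persons = [
    # An ordinary household: a householder plus one other member.
    {"YEAR": 1975, "SERIAL": 1, "PERNUM": 1, "RELATE": 101, "GQ": 1, "FRAGMENT": 0},
    {"YEAR": 1975, "SERIAL": 1, "PERNUM": 2, "RELATE": 999, "GQ": 1, "FRAGMENT": 0},
    # Two group quarters residents who share one SERIAL.
    {"YEAR": 1975, "SERIAL": 2, "PERNUM": 1, "RELATE": 999, "GQ": 2, "FRAGMENT": 0},
    {"YEAR": 1975, "SERIAL": 2, "PERNUM": 2, "RELATE": 999, "GQ": 2, "FRAGMENT": 0},
    # A fragment record.
    {"YEAR": 1975, "SERIAL": 3, "PERNUM": 1, "RELATE": 999, "GQ": 1, "FRAGMENT": 1},
]

selected = select_household_rows(persons)
duplicates = units_selected_more_than_once(selected)
print(len(selected), duplicates)
```

Any SERIAL reported by the second function would need a decision before household weights are applied, for example keeping a single row per SERIAL or treating each GQ resident as a unit of one.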



I have been doing some preliminary analysis of the IPUMS-CPS data set, trying to get my accounting correct for the way that person records relate to household records, to fragments, and to group quarters. I have some hypotheses, to the plausibility of which I invite your assessment, and have also found some puzzling results, on which I hope you can shed some light.

Let us call a residence the place where anyone resides, whether or not they are a member of a household. For the rest of this note I will be ignoring weights on both households and individuals, because I assume that before I can get the population accounting straight, I need to get the observation accounting straight.

Until 1976, all residences with SERIAL numbers also had a person with PERNUM=1.

After 1976, the number of residences with a serial number but without anyone with PERNUM=1 increased suddenly and dramatically, amounting in every year to roughly a fifth of the number of households identified with SERIAL numbers.

Also prior to 1976, the number of residences with SERIAL numbers but no householder was roughly equal to, but always slightly greater than, the number of household fragments, and varied closely with the latter. I don’t know who constitutes the non-FRAGMNT residences with no householder in this period, but the number is quite small – less than 40 in every year but 1966. I was thinking that this suggests that most of the fragment households are remains of complete households, the rest of which have been lost, but this assumed that the SERIAL numbers came from the original data. Given that IPUMS assigned the SERIAL numbers, do you know why there are some residences that are not fragments but also do not have householders?

(I love you folks at IPUMS and greatly appreciate what you do, but I wish you would retain things like the original household serial number in your data set. I’d rather not have to retain two separate sets of data and documentation to address these questions, especially since there does not seem to be any unique key adequate for a reliable merger. And some kinds of research, such as investigating alternative ways of linking records across periods, is pretty much impossible without the stripped data.)

From 1977 on, the number of residences with a SERIAL number but no householder (i.e. no person identified with RELATE=0101) rose by an amount similar to but somewhat larger than the number of residences with a SERIAL number but no one with a PERNUM=1. These numbers remained close but distinct until 1994, when they became identical, and remained so through the present.

Also in 1977, the number of reported fragments dropped to zero.

I counted the SERIAL numbers by taking the maximum for each year. It occurred to me that, if there are unallocated SERIAL numbers, i.e. gaps in the SERIAL number sequence, that could account for the difference with the number of residences with no PERNUM=1 or RELATE=101. I tested this in two ways.

First, for each year, I summed the value of NUMPREC for every line with RELATE=0101. This should give me the number of persons in households. Then I counted the number of person records in each year. From 1971 through 1993, the count of person records exceeded the sum of NUMPREC for householders by exactly the number of people in group quarters. So for these years, every person in the sample is either a member of a household with a householder or a resident in group quarters. For 1969 and 1967, the difference between the count of person records and records of people in households is, respectively, 116 less and 2 more than the number of residents in group quarters. Prior to 1967 residents in group quarters are not reported separately, though the difference between persons in households and total person records remains roughly the same. To me this suggests that in those earlier years the sample included people in group quarters, and they are simply not identified. Does this seem plausible to you?
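
The accounting check described in this paragraph could be sketched roughly as follows. This is only an illustration under stated assumptions: person records are held as a list of dictionaries keyed by IPUMS variable name, RELATE is read as an integer (so 0101 appears as 101), and GQ = 2 marks a group quarters resident as in my earlier message. The function name, the "unexplained" label, and the toy records are hypothetical.

```python
from collections import defaultdict

def accounting_by_year(persons):
    """Per year: person records, sum of NUMPREC over householders, GQ residents."""
    totals = defaultdict(lambda: {"records": 0, "in_households": 0, "in_gq": 0})
    for p in persons:
        t = totals[p["YEAR"]]
        t["records"] += 1
        if p["RELATE"] == 101:      # householder row carries the household size
            t["in_households"] += p["NUMPREC"]
        if p["GQ"] == 2:            # assumed code for group quarters residents
            t["in_gq"] += 1
    # "unexplained" = records not accounted for by households or group quarters
    return {
        year: dict(t, unexplained=t["records"] - t["in_households"] - t["in_gq"])
        for year, t in totals.items()
    }

# Toy records for illustration only: one two-person household and one GQ resident.
persons = [
    {"YEAR": 1980, "SERIAL": 1, "RELATE": 101, "GQ": 1, "NUMPREC": 2},
    {"YEAR": 1980, "SERIAL": 1, "RELATE": 999, "GQ": 1, "NUMPREC": 2},
    {"YEAR": 1980, "SERIAL": 2, "RELATE": 999, "GQ": 2, "NUMPREC": 1},
]

report = accounting_by_year(persons)
print(report[1980])
```

An "unexplained" value of zero corresponds to the 1971 through 1993 pattern. In a year where group quarters residents also carry RELATE=0101, they would be counted on both sides and the value would come out negative by the GQ count.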

After 1993 the number of people in households is exactly equal to the number of person records. This implies that in 1994 and thereafter, but not before, people in group quarters are each treated as being head of their own household. For consistency, I would recommend recoding the persons residing in group quarters to be household heads in the earlier years as well.

The fact that the number of person records is exactly equal to the sum of NUMPREC for heads of household (RELATE=101) for every year after 1993, and differed by only the number of group quarters residents prior to that, implies that the large increase in the difference between the highest SERIAL number and the number of residences with a PERNUM = 1 that begins in 1977 is not due to an increase in the number of residences without a householder. Instead it must be due to gaps in consecutive serial numbers. This conclusion is further reinforced by noticing that the maximum SERIAL number increased by 12,806 more than the number of households with PERNUM = 1 between 1976 and 1977, an amount nearly the same as the 12,750 difference between the two that appeared in 1977, and that was maintained (in percentage terms) thereafter. Does this seem correct to you?

The preceding three paragraphs were written before I looked at the documentation and found that the SERIAL numbers were assigned by IPUMS rather than the Census. Now I am much more puzzled. I could see reasons why it might be convenient for the Census to distribute blocks of SERIAL numbers to different geographic areas and so develop gaps in numbering. I do not understand how the sudden increase in the ratio of SERIAL numbers to householders in 1977 could arise from any reasonable method of assigning such numbers by IPUMS. Could you shed any light on this?

Finally, I have two questions with respect to person records with FRAGMNT=2. First, have you established that these records are not duplicates or near-duplicates of any other person records from the years in question? And second, have you looked at whether the inclusion of these records makes the CPS population composition more or less similar to the Census composition?

Sincerely, Andrew Hoerner



I am not sure this answers all of your questions or addresses all of your concerns but there are two characteristics of the pre-1977 March samples that may shed some light on your analysis.

Firstly, prior to 1977 the March samples do not include non-interview households. This means that starting in 1977, the jump in households without a PERNUM==1 reflects the inclusion of non-response households (which don’t have any person records). This may also explain the gaps you see in SERIAL for rectangular extracts, because the SERIAL numbers that represent non-interview households would not be included in the extract (again, because they don’t have any person records to rectangularize on).
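
One way to see these gaps in a rectangular extract is to compare the highest SERIAL in each year with the number of distinct SERIAL values actually present. The sketch below is illustrative only. It assumes person records are held as a list of dictionaries, and it assumes that SERIAL starts at 1 and increases by 1 within each year, which should be confirmed against the SERIAL documentation before relying on the result. The function name and toy records are hypothetical.

```python
def serial_gap_summary(persons):
    """Per year: highest SERIAL, distinct SERIALs present, and the shortfall."""
    serials_by_year = {}
    for p in persons:
        serials_by_year.setdefault(p["YEAR"], set()).add(p["SERIAL"])
    summary = {}
    for year, serials in serials_by_year.items():
        highest = max(serials)
        summary[year] = {
            "max_serial": highest,
            "distinct_serials": len(serials),
            # Assumes numbering starts at 1 with step 1; otherwise this is not a gap count.
            "missing": highest - len(serials),
        }
    return summary

# Toy records for illustration only: SERIAL 2 and 4 have no person records in 1977.
persons = [
    {"YEAR": 1976, "SERIAL": 1}, {"YEAR": 1976, "SERIAL": 2},
    {"YEAR": 1977, "SERIAL": 1}, {"YEAR": 1977, "SERIAL": 1},
    {"YEAR": 1977, "SERIAL": 3}, {"YEAR": 1977, "SERIAL": 5},
]

gaps = serial_gap_summary(persons)
print(gaps)
```

Under those assumptions, the "missing" count for a year would approximate the number of households with no person records, such as the non-interview households described above.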

Secondly, Fragments are mostly artifacts of old survey and allocation practices that were improved upon in later years. It is generally believed that fragments should be group quarters but were just not properly coded. This assumption is reinforced by the fact that all other “headless” households in the pre-1977 samples are group quarters (except for one strange household in 1968, composed of two teenagers with RELATE codes of “Child”).

If these pieces of information do not answer any of your questions, or if they raise new ones, feel free to continue this correspondence.



Dear Joe:

I see! That is indeed helpful. Are non-response households identified by any particular method, other than PERNUM being missing? Is NUMPREC=0 for these records? Are household level variables like HWTSUPP and HHINCOME empty, zero, hot-deck imputed, or what?

More generally, is there anything in the IPUMS-CPS data that tells whether a particular variable for a particular person or household is measured or imputed?

Thanks again for your help!




NUMPREC will always be zero for non-interview households. Starting in 1977 a variable called HHINTYPE is available which identifies non-interview households and includes a general explanation as to why the household was not interviewed.

As far as imputation markers go, in IPUMS-CPS these are known as data quality flags or just “Flags” for short. You can access these flags by clicking on the “Flags” tab on a variable’s description page, but to add them to a data set you must use the “Select data quality flags” button on the Extract Request page. If a variable has a flag, that means some respondents have imputed values for that variable. If there is no flag, that means none of the respondents’ values were imputed. Be careful, though: constructed variables (like HHINCOME and INCTOT) do not have data quality flags, but the variables that were used to construct them do (e.g. INCWAGE, INCSS…).
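
Since constructed variables carry no flag of their own, one possible workaround is to treat a constructed value as possibly imputed whenever any of its component flags is set. The sketch below is only an illustration. The flag column names (FLAG_INCWAGE, FLAG_INCSS) are hypothetical placeholders rather than the actual IPUMS-CPS flag names, the list of components is deliberately incomplete, and the rule that a nonzero flag value means "imputed" is an assumption to be checked against each flag's documentation.

```python
def possibly_imputed(record, flag_columns):
    """True if any listed component flag is nonzero (assumed to mean imputed)."""
    return any(record.get(col, 0) != 0 for col in flag_columns)

# Hypothetical flag columns for two components of INCTOT; real names will differ.
INCTOT_COMPONENT_FLAGS = ["FLAG_INCWAGE", "FLAG_INCSS"]

# Toy records for illustration only.
records = [
    {"PERNUM": 1, "INCTOT": 30000, "FLAG_INCWAGE": 0, "FLAG_INCSS": 0},
    {"PERNUM": 2, "INCTOT": 12000, "FLAG_INCWAGE": 1, "FLAG_INCSS": 0},
]

flagged = [possibly_imputed(r, INCTOT_COMPONENT_FLAGS) for r in records]
print(flagged)
```

A stricter or looser rule may suit a particular analysis, for example flagging only when the imputed component makes up a large share of the constructed total.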