What don't I understand?

As the associated image documents, when I download one sample (2022) of PUMS 1-year household data, select just City of Chicago, and then look at SERIAL, which is supposed to be unique, 52.4% of household records have duplicate serial numbers.

I replied to your other forum post on the same question, but will reply again here for other users who may have similar questions.

By default, IPUMS USA data are person-level microdata, meaning each row or observation is a person. Each column is a variable that describes a person-level characteristic or household-level characteristic; household-level variables are automatically appended to person records.

In IPUMS USA, the two variables SAMPLE and SERIAL uniquely identify households. The three variables SAMPLE, SERIAL, and PERNUM uniquely identify persons. While SAMPLE and SERIAL uniquely identify households, if your data extract includes person records as well as household records (as you stated), you will still have multiple observations associated with many of the households; many households include multiple household members. If your data extract includes only household records, then you will have just one observation per household.
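
For anyone who wants to check this in their own extract, here is a minimal pandas sketch. The data frame is a made-up toy example (the SAMPLE, SERIAL, and PERNUM values are illustrative, not taken from any real extract); it only shows why repeated SERIAL values are expected in person-level data and how the two keys behave.

```python
import pandas as pd

# Toy person-level extract with illustrative values: one row per person.
persons = pd.DataFrame({
    "SAMPLE": [202201] * 5,
    "SERIAL": [101, 101, 101, 102, 103],
    "PERNUM": [1, 2, 3, 1, 1],
})

# Share of rows whose (SAMPLE, SERIAL) also appears on another row.
# A high share just means many people live in multi-person households.
dup_serial_share = persons.duplicated(subset=["SAMPLE", "SERIAL"], keep=False).mean()

# Number of distinct households.
n_households = persons[["SAMPLE", "SERIAL"]].drop_duplicates().shape[0]

# SAMPLE + SERIAL + PERNUM should never repeat in a person-level file.
person_key_unique = not persons.duplicated(subset=["SAMPLE", "SERIAL", "PERNUM"]).any()

print(dup_serial_share, n_households, person_key_unique)
```

In this toy data, three of the five rows share a household with someone else, so 60% of rows have a "duplicate" SERIAL even though there are exactly three households and every person key is unique.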

Thanks for your prompt reply. I’m quite used to joining person and household data, and since I was only reading the Census Bureau documentation for PUMS, I had no clue that doing extractions using the household and person buttons separately somehow brought in additional unexpected records.

I find that if I A) take one year of my household data (chosen, as above, just using the household button) and select out all duplicated SERIALs, and then B) take the same year of person data (selected only using the person button) and select out all of the duplicated SERIALs, I get a one-to-one match.

However, I am clearly losing Rs from multiple-R households; the resulting cases all have a PERNUM of 1. My question is: how can I purge my person file of “excess” person records so there is one record per R? Then I can do a clean few-to-many join to match household data to person data.

I also tried to download data selecting via a mixture of the household and person buttons. I’m not sure what I have gotten, and I found that a key variable CITY (I just do Chicago) was trashed unless I radically reduced the number of variables selected. This led me to process (I thought) household and person variables separately. And thus my hope to continue downloading and matching separate subfiles.

Thanks again.

In IPUMS USA, the default data format is person-level records that also have household-level variables appended to them. Accordingly, there is no need to match or link household records to person records if you wish to have the information all on the person record. Any information available on the household record can also be added to the person records for each person in the household; you just need to add the household variables you want to your data extract, and then create a person-level extract (as opposed to a hierarchical extract, which contains both household and person records, or a household record only extract). Below is a description of each extract format which I hope will be helpful for your understanding of how household records, person records, household-level variables, and person-level variables work.

Person-level rectangular extracts contain only person records. Each row or observation represents a person. These extracts include all person and household variables that were added to the extract. You do not need to do anything extra to add household variables to person records in these extracts; just add the variables to the data cart. Here is what a person-level rectangular extract looks like:


As you can see, the household-level variables SERIAL, CITY, and STATEFIP are appended to the person records, along with the person-level variables.

Household-only extracts contain only household records. Each row or observation represents a household. There are no person records in these extracts, and no person variables (since there are no persons to apply them to). Here is what a household-records-only extract looks like:


As you can see, there are only household records and no person records or person-level variables.

Hierarchical extracts contain household records and person records. The person records are organized under the household record of the household they are part of. In these extracts, household-level variables are not automatically appended to person records, and they appear on household records only. Here is what a hierarchical extract looks like:

As you can see, there is a household record with household variables on it, and person records underneath the household record of the household the person belongs to. The person records have only person-level variables attached to them, and do not have household variables attached.
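
To make the difference between the hierarchical and rectangular layouts concrete, here is a rough pandas sketch that turns a toy hierarchical table into a person-level one. It rests on several assumptions that are not stated above: that the file has already been loaded into a single table, that a record-type column (called RECTYPE here, with "H" for household rows and "P" for person rows) is present, and that columns not applicable to a row are missing. All values, including the CITY code, are made up.

```python
import pandas as pd

# Toy hierarchical layout (hypothetical values): each "H" row is
# followed by the "P" rows of the people in that household.
hier = pd.DataFrame({
    "RECTYPE": ["H", "P", "P", "H", "P"],
    "SERIAL":  [1, None, None, 2, None],
    "CITY":    [9, None, None, 9, None],   # made-up city code
    "PERNUM":  [None, 1, 2, None, 1],
    "AGE":     [None, 40, 38, None, 67],
})

# Carry each household's variables down onto the person rows below it.
# This relies on the rows being in their original file order.
household_cols = ["SERIAL", "CITY"]
hier[household_cols] = hier[household_cols].ffill()

# Keep only person rows: the result has the person-level rectangular shape.
rect = (
    hier.loc[hier["RECTYPE"] == "P", ["SERIAL", "CITY", "PERNUM", "AGE"]]
    .astype(int)
    .reset_index(drop=True)
)
print(rect)
```

This is only meant to illustrate the relationship between the two formats; requesting a person-level rectangular extract directly avoids the step entirely.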

I took a look at your IPUMS USA extracts and they are all person-level extracts. In this extract type, you will have person records only, with household-level variables included in the person records. You will not have any household records. You can still, however, identify households in these types of files using SERIAL and SAMPLE together. If your extract contains data from only one sample, you can identify households uniquely using SERIAL. This does not mean that there will only be one observation per household, since a household can include multiple person records. If you wish to retain only one person record per household, you can filter on PERNUM=1. Note that filtering on PERNUM=1 will only retain the head of household for each household, and therefore will impact analysis of person-level characteristics like race or sex.
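
As a hedged illustration of the options above, the following pandas sketch works on a made-up person-level table (the column values, including the CITY code and ages, are invented). It shows filtering on PERNUM == 1, building a one-row-per-household table from the household-level columns, and attaching a household-level summary back to every person with a many-to-one merge.

```python
import pandas as pd

# Toy person-level extract with household variables already appended.
persons = pd.DataFrame({
    "SAMPLE": [202201] * 5,
    "SERIAL": [101, 101, 101, 102, 103],
    "PERNUM": [1, 2, 3, 1, 1],
    "CITY":   [9] * 5,               # made-up city code
    "AGE":    [44, 41, 12, 29, 71],
})

# Option 1: keep one person record per household (the PERNUM == 1 record).
first_person = persons[persons["PERNUM"] == 1]

# Option 2: keep only household-level columns, one row per household.
households = persons[["SAMPLE", "SERIAL", "CITY"]].drop_duplicates()

# A household-level summary merged back onto all person records.
# validate="m:1" raises an error if the household key is not unique
# on the right-hand side.
hh_size = (
    persons.groupby(["SAMPLE", "SERIAL"]).size().rename("HHSIZE").reset_index()
)
merged = persons.merge(hh_size, on=["SAMPLE", "SERIAL"], how="left", validate="m:1")
print(merged)
```

Option 2 keeps all persons available in the original table for person-level analysis, while still giving a clean household table to count or summarize.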

The number of variables you include in your extract will not affect the values of other variables, like CITY. CITY is a household variable that, when added to an extract that includes person records, will appear on person records as well. Note that not all cities are identified in the public use microdata; you can read more about this in the comparability of CITY.