Merging data across samples

A colleague has combined three 1-year ACS samples (CY 2009-2011) and generated information that I would like to merge into my 2011 3-year ACS dataset. However, the unique identifiers (SERIAL and PERNUM) differ between the datasets. Does anyone know of a method for doing such a merge? (The value of ADJUST can be used to determine YEAR in the 3-year sample.)


Joyce Morton

The variable SERIAL is generated by IPUMS to uniquely identify households within a sample. Because the multi-year files are treated as separate samples (even though they contain much of the same information as the single-year files), SERIAL is generated as though the multi-year file were completely unrelated to the single-year files. Furthermore, the same SERIAL value may exist in multiple single-year files, so using the single-year SERIAL codes may result in non-unique values.

There is no recommended way to link the single-year files to the multi-year file, but you could attempt to use a combination of variables (such as MULTYEAR*, STATEFIP, PUMA, NUMPREC, PERNUM**, OCC, AGE, SEX, RELATE…) to uniquely identify each individual in the 3-year file. Avoid the weight variables, since they differ between the single-year and multi-year files, and avoid the income variables, since they are automatically converted to dollar values of the most recent year in the sample. You could even create a new variable out of the concatenated values of the identifying variables. In Stata the code would look like this:

egen uniqueid = concat(multyear statefip puma gq numprec pernum occ ind age sex related)

Then you would want to make sure that each value of uniqueid identifies only one person in the file. Though this sounds simple, it is actually almost impossible to achieve for everyone (the sample code above uniquely identifies 98.84% of people in the 2011 3-year sample), since a lot of work has been put into making sure you cannot identify people in the PUMS files. You would then create the complementary variable in the single-year files (using YEAR instead of MULTYEAR, and making sure the order of the variables is the same) and merge on uniqueid. Because this process does not use any variables that were created with the intention of uniquely identifying individuals, I cannot promise that the match will be perfect. You could drop cases that cannot be uniquely identified, but this could introduce error (for the example code above, this would drop 106,729 people).
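As a rough sketch of the steps described above, the whole workflow in Stata might look something like the following. The file names and the variable newvar are hypothetical placeholders, not names from your data. The sketch also assumes that your colleague's single-year samples have already been appended into one file that still contains YEAR and the other key variables. Treat it as a starting point rather than a tested procedure.

```stata
* Sketch only: file names and newvar are hypothetical placeholders.

* 1. Three-year file: build the key and keep only uniquely identified people.
use acs_2011_3yr.dta, clear
egen uniqueid = concat(multyear statefip puma gq numprec pernum occ ind age sex related)
* dup holds the number of OTHER records that share the same key (0 = unique)
duplicates tag uniqueid, generate(dup)
tabulate dup
drop if dup > 0
drop dup
* isid stops with an error if uniqueid still fails to identify records uniquely
isid uniqueid
save acs_3yr_keyed.dta, replace

* 2. Pooled single-year file: same variables in the same order,
*    with YEAR in place of MULTYEAR.
use colleague_1yr_pooled.dta, clear
egen uniqueid = concat(year statefip puma gq numprec pernum occ ind age sex related)
duplicates tag uniqueid, generate(dup)
drop if dup > 0
drop dup
* keep only the key and the information to be carried over
keep uniqueid newvar
save acs_1yr_keyed.dta, replace

* 3. Merge on the key and check how many records matched.
use acs_3yr_keyed.dta, clear
merge 1:1 uniqueid using acs_1yr_keyed.dta
tabulate _merge
```

The tabulate _merge output shows how many people matched and how many appear in only one of the files. One optional refinement, which is my own suggestion and may change the counts quoted above slightly, is to add the punct("_") option to egen concat() so that adjacent values cannot run together into an ambiguous string.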

*The variable MULTYEAR is used in multi-year samples to identify the individual years if you don’t want to use ADJUST, but ADJUST obviously works as well.

**Even though SERIAL is different between the single-year and multi-year files, the persons within each household should still be in the same order.

I hope this helps.