Swap files and cell means 1976 to 2000, 2000 to 2010, and years later than 2010

  1. You state here: https://cps.ipums.org/cps/income_cell_means.shtml that Larrimore et al. provide consistent cell means for 1976 to 2007. Immediately following this, you provide files containing IPUMS-consistent cell means for 1976 through 2000. Are these the Larrimore et al. values, or some other set of values? If the former, is there some reason you do not provide the 2001-2007 values from Larrimore? Or do you provide them elsewhere?

  2. On your page Income Components: Topcodes, Replacement Values, and Swap Values you provide cell mean replacement values for the years 1996 through 2010. Am I correct that these values, and not the top codes, are what is in the IPUMS micro data downloads now? Are these values the same as or different from the Larrimore et al. cell means through 2007? If different, where do they come from? In the various tables presented on this page, are the numbers provided as top codes the original CPS top codes before they were replaced by cell means? Or are they, as they appear to me, the cutoff values above which incomes were top-coded and subsequently replaced by cell means?

I have a strong request concerning this page. The term “top code” is used sometimes to refer to an uninformative code (like 99997) indicating that the observed value was above the top-coding threshold, and sometimes to refer instead to the threshold, and sometimes it is not clear which. So for instance, in the years 1999-2002 in the first table, are the values of 15,000, 20,000, and 25,000 actually top codes? Or are they top code thresholds? And in the various tables like “1996 Income Topcodes” for different years, the values that are identified as topcodes look to me a lot more like top coding thresholds. I’d like somebody to review the language on this page and make sure things are properly identified as top code or top code thresholds.

Were the thresholds originally used as the topcodes in these years, before being replaced by cell means and then subsequently by swap values?

  3. Under the heading Income Component Rank Proximity Swap Values on IPUMS CPS (at the bottom of the page, below the section on cell means) you provide a file of income replacement values for 1976-2010. The link there to the original Census files is broken. I searched the Census website at length for either the originals or updated versions of these files and found no reference to them. If I were to download CPS data from the Census now, would I see top codes, cell means, or swap codes in these years?

  4. I am confused about how the values in this swap values file relate to top codes. I thought that the swap values were supposed to completely replace top codes. However, the file contains 1044 values of 99997, i.e. a bit less than 1 percent of the individual records are so coded. These certainly look like top codes. To me, this seems to imply that the old cell means for values above the top coding threshold have only been replaced by swap values up to a certain threshold value, higher than before, above which they are still top-coded and, moreover, not replaced by cell means. Do you know if this is correct? If so, do we know what the new thresholds are? Do the IPUMS-CPS records for 2011 and following years also contain top codes in addition to swap values? And if so, do we have the thresholds for these top codes?

I thought the CPS had gotten past top-coding. <Grrrr!>

I’ll address each question one at a time.

(1) As noted on the income component cell means replacement values page, Larrimore et al. (2008) use the same technique that the Census Bureau implemented for replacement values starting in 1996. These replacement values cover 1996 through 2010 and are listed on this page. Additionally, at the bottom of this page we provide income component rank proximity swap values for 1976 through 2010.

(2) Yes, for samples from 1996 through 2010 the replacement values noted in the tables on this page are what is found in the data available on IPUMS CPS. Additionally, the cell mean values from Larrimore et al. and the values assigned to top-coded observations in the Census public use and IPUMS data will be nearly identical from 1996 onward. The exception is 2000, where the Census Bureau has acknowledged some data error, as noted in Larrimore et al. (2008). For additional context: the 1996 through 2010 public use files included the replacement values for observations above the top-code. Larrimore et al. extended this method backward for samples from 1976 through 2010. In 2011, the Census Bureau shifted from the average replacement value system to a rank proximity swapping procedure.

(Regarding any confusion on the term “top-codes”) The term “top-code” generally refers to a top-code threshold. This means that any income value above the top-code threshold will be “top-coded” and (for samples from 1996 through 2010) replaced with the average value of the top-coded values. When an observation is not in the universe for a particular income question, or when the respondent did not answer the question, special codes mark NIU or item non-response instead. In any event, I’ll look through this page and see if anything can be more clearly stated.

(3) Thanks for the note about the broken link; I’ll look into fixing this. If you were to download CPS data from the Census Bureau website, what you’d find would depend on the year of the sample. From 1962 through 1995, values exceeding the top-code are simply recoded with the threshold value. For example, all responses for INCWAGE greater than or equal to 50,000 in the 1976 CPS ASEC Survey were replaced with 50,000 in the public use Census data. From 1996 through 2010, the Census Bureau used replacement values to take the place of top-coded values. Top-coded individuals are divided into twelve groups depending on characteristics such as race, gender, and full-time status, and income values are reassigned according to the mean income within each group. From 2011 onward, all income values above the top-code are rounded to two significant digits and then swapped among individuals within a bounded interval. This last method is called “rank proximity” swapping, and it is what is applied consistently in the 1976 through 2010 files at the bottom of this page.
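
To make the three regimes described in (3) concrete, here is a minimal Python sketch on toy data. It is only an illustration, not the Census Bureau's actual procedure: the function names, the group labels, the threshold, and especially the way the "bounded interval" is modeled (shuffling within consecutive rank blocks of a hypothetical size `block`) are all assumptions made for the example.

```python
import random
from math import floor, log10
from statistics import mean


def recode_to_threshold(incomes, threshold):
    """1962-1995 style: any value at or above the threshold becomes the threshold."""
    return [min(value, threshold) for value in incomes]


def replace_with_cell_means(records, threshold):
    """1996-2010 style: each top-coded value becomes the mean of the
    top-coded values in its own group. `records` is a list of (group, income)."""
    topcoded = {}
    for group, income in records:
        if income > threshold:
            topcoded.setdefault(group, []).append(income)
    cell_means = {group: mean(values) for group, values in topcoded.items()}
    return [
        (group, cell_means[group] if income > threshold else income)
        for group, income in records
    ]


def round_sig(value, digits=2):
    """Round a number to `digits` significant digits."""
    if value == 0:
        return 0
    magnitude = floor(log10(abs(value)))
    factor = 10 ** (magnitude - digits + 1)
    return round(value / factor) * factor


def rank_proximity_swap(incomes, threshold, block=2, seed=0):
    """2011+ style, simplified: round values above the threshold to two
    significant digits, then shuffle them among neighbors in income rank.
    ASSUMPTION: the bounded interval is modeled as fixed-size rank blocks."""
    rng = random.Random(seed)
    result = list(incomes)
    # Positions of top-coded values, ordered from lowest to highest income.
    ranked = sorted(
        (i for i, value in enumerate(incomes) if value > threshold),
        key=lambda i: incomes[i],
    )
    for start in range(0, len(ranked), block):
        positions = ranked[start:start + block]
        values = [round_sig(incomes[i]) for i in positions]
        rng.shuffle(values)
        for position, value in zip(positions, values):
            result[position] = value
    return result
```

The point of the comparison is that threshold recoding and cell means both collapse the upper tail (to one value, or to one value per group), while swapping keeps the set of rounded values and only changes which record carries which value.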

(4) As noted in this question and answer, the code 99997 does not identify “top-coded” values; rather, it is the code for item non-response. These codes persist in the swap values file because the swap-values method does not correct for observations where the respondent did not answer the income question.
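
In practice this means 99997 should be set aside as missing before computing anything on the swap values, rather than treated as a very high income. A minimal sketch, assuming the values have been read into a plain Python list; the helper name `split_reported` is hypothetical, and other special codes (such as NIU) vary by variable and would need to be added from the variable's codes page:

```python
# Code for item non-response, as stated in the answer above.
ITEM_NONRESPONSE = 99997


def split_reported(values, missing_codes=(ITEM_NONRESPONSE,)):
    """Return (reported values, count of records carrying a missing code)."""
    reported = [value for value in values if value not in missing_codes]
    return reported, len(values) - len(reported)


# Toy example: two of the four records are item non-response.
reported, n_missing = split_reported([50000, 99997, 120000, 99997])
```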