Totals constructed from swap values, average-above-threshold values, and topcode thresholds

You indicate, here:… that the swap values generally replace cells that have previously been top-coded. So we have three different historical high-income replacement methods: top codes, average incomes for all top-coded values for a demographic group, and swap values; and we have swap values for 2011 on and then a separate set of swap values supplied later for 1975 to 2010.

A preliminary question: I had previously assumed that entire records were swapped when a value exceeded the swap threshold. From looking at the swapvalues.txt .csv file, it now appears to me that only individual cell values are swapped. Is this correct? Are the “income bands” used to select households for the swap bands of the swapped income component, or bands of aggregate income for the individual or household in question?

I have questions concerning the summary values constructed from these values.

First, concerning what is in the standard IPUMS CPS extracts:

  1. For years where only topcodes are available, are summary values constructed by treating the top-codes as if they had the value of the topcode threshold?

  2. For years where average values are available, from Larrimore or otherwise, are summary values reported in IPUMS constructed by replacing the topcoded values with average values and then totaling? Or by totaling the raw, non-topcoded values and then replacing with an average if the raw total exceeds the topcoding threshold? Or is some other method used?

  3. For 2011-on swapped values, does the swapping completely replace topcoding, or is there still a (presumably higher) topcode on swapped values? If the latter, are these values provided anywhere? In either case, are summary values after swapping strictly the sums of the component values, or is some more complicated method used to calculate these totals?

  4. Are the retroactively provided swap values for 1975-2010 calculated in exactly the same way as the 2011-on values? If not, where are the differences?

  5. Was there a special, different method of topcoding summary values in 1990, using state medians above threshold rather than sum of topcoded items? (See this question: How should user's deal with topcoded values for tax--person variables pre 2010 when no swapvalues exist?). Or is this an ACS method? I don’t see any survey identification with the question. The ASEC high-income summaries are generally averages, not medians, correct?

  6. If the answer to any of these questions is “We don’t know,” is there anyone at the Bureau that you would suggest as a good contact?

I’ll aim to address each question one at a time.

(Preliminary question) Yes, the swap values replace only individual cell values. This process helps ensure that individuals with high incomes cannot be individually identified in the microdata, while preserving the ability to calculate accurate estimates of incomes at the upper end of the distribution. In this method, all values greater than or equal to the income topcode are systematically swapped with other values within a bounded interval of a given income variable.

Regarding these “summary values”, I understand this to mean the aggregated income variables (such as INCTOT). If this assumption is incorrect, please correct me.

(1) For years where only topcodes are available (e.g., years prior to 2011), income values above the topcode are replaced with the topcode value. Aggregated income variables therefore treat each topcoded value as if it were equal to the topcode.
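To illustrate the arithmetic in (1), here is a minimal sketch, assuming each component is simply capped at its threshold before being summed. The component names, incomes, and thresholds are made up for illustration and are not actual CPS topcodes.

```python
def apply_topcode(value, threshold):
    """Return the value as it would be released: capped at the threshold."""
    return min(value, threshold)

# Hypothetical component incomes and topcode thresholds for one person.
components = {"wages": 250_000, "interest": 1_200, "dividends": 120_000}
thresholds = {"wages": 150_000, "interest": 50_000, "dividends": 100_000}

# Cap each component, then aggregate the capped values.
released = {name: apply_topcode(value, thresholds[name])
            for name, value in components.items()}
total_released = sum(released.values())  # total built from topcoded components
total_true = sum(components.values())    # total from the uncapped values

print(total_released, total_true)
```

Under these assumptions the aggregate understates the true total by the amount each capped component exceeds its threshold.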

(2) For aggregated income variables that come directly from the Census Bureau, the aggregation of the various components includes the topcoded value.

(3) As noted on this page, from 2011 onward the Census Bureau moved away from the average replacement value system to a rank proximity swapping procedure. In this procedure, all values greater than or equal to the income topcode are ranked from lowest to highest and systematically swapped with other values within a bounded interval. Therefore, the aggregated values of income variables should be the sums of the component values.
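As a rough illustration of the idea in (3), the sketch below ranks the values at or above a threshold and swaps each with a partner a bounded number of ranks away. This is a simplified stand-in, not the Census Bureau's actual procedure: the function name, the `window` parameter, the random partner choice, and the example incomes are all assumptions.

```python
import random

def rank_proximity_swap(values, threshold, window=2, seed=0):
    """Swap values >= threshold with partners within `window` ranks (toy version)."""
    rng = random.Random(seed)
    out = list(values)
    # Positions of values at or above the threshold, ranked lowest to highest.
    ranked = sorted((i for i, v in enumerate(values) if v >= threshold),
                    key=lambda i: values[i])
    unpaired = list(range(len(ranked)))  # rank positions not yet swapped
    while len(unpaired) >= 2:
        r = unpaired.pop(0)
        # Candidate partners: unpaired ranks within the bounded interval.
        candidates = [s for s in unpaired if s - r <= window]
        if not candidates:
            continue  # no partner close enough; the value stays in place
        partner = rng.choice(candidates)
        unpaired.remove(partner)
        i, j = ranked[r], ranked[partner]
        out[i], out[j] = values[j], values[i]
    return out

# Hypothetical incomes for one component; the threshold is illustrative only.
incomes = [30_000, 400_000, 55_000, 250_000, 900_000, 310_000]
swapped = rank_proximity_swap(incomes, threshold=200_000)
```

In this toy version, values below the threshold are untouched and the set of values above it is only rearranged, so the component's overall total is unchanged. With an odd number of values above the threshold, one of them is left unswapped.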

(4) The methods for constructing these retroactively generated swap values are detailed in this paper by Larrimore et al. (2008).

(5) (Note: I don’t see any reference to 1990 topcodes in the User Forum link in the question.) The ACS uses a state-specific topcoding scheme. This method does not apply to the CPS. Also, as far as I am aware, means are used instead of medians wherever appropriate.

(6) We don’t have any specific Census Bureau contacts to suggest in response to these questions. We recommend contacting the Census Bureau through its officially listed channels.

In number 4 above, did you intend to refer to a different paper? This one seems to be about construction of cell means rather than swap values.

I think that IPUMS documentation uses the term “top codes” in a somewhat inconsistent manner. It seems to refer sometimes to what I would call top codes (uninformative codes like 999997 indicating that the observed value was above the topcoding threshold), sometimes to the topcoding threshold value itself, and sometimes to data-based means of persons above the threshold.

I think this is the correct paper. As I noted on this page under the rank proximity swap values 1976-2010 section, “Each file contains income values to replace topcoded income components for every ASEC sample from 1975 to 2010. The purpose of this file, and the related file from Larrimore et al. (2008) (above), is to provide researchers with income data using a consistent topcoding method. We provide the Census Bureau revisions with IPUMS identifiers and income variable names.”
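For anyone applying such a replacement file, here is a minimal sketch of the lookup. It assumes a long layout with one row per person and variable, keyed by year, serial, and pernum. That layout, the field names, and the numbers are hypothetical, so check them against the actual file before adapting this.

```python
# Hypothetical extract rows and replacement rows; all values are made up.
extract = [
    {"year": 2005, "serial": 1, "pernum": 1, "incwage": 200_000, "incint": 500},
    {"year": 2005, "serial": 2, "pernum": 1, "incwage": 40_000, "incint": 100},
]
swap_rows = [
    {"year": 2005, "serial": 1, "pernum": 1,
     "variable": "incwage", "value": 237_500},
]

def apply_replacements(records, replacements):
    """Return copies of records with listed components replaced."""
    lookup = {(r["year"], r["serial"], r["pernum"], r["variable"]): r["value"]
              for r in replacements}
    updated = []
    for rec in records:
        new = dict(rec)  # copy so the original extract is left untouched
        for var in rec:
            key = (rec["year"], rec["serial"], rec["pernum"], var)
            if key in lookup:
                new[var] = lookup[key]
        updated.append(new)
    return updated

patched = apply_replacements(extract, swap_rows)
```

This only replaces the listed components. It does not recompute any aggregated variable.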

I’ve previously noted your feedback about the IPUMS CPS documentation on top-codes. These sorts of revisions take time to implement. We ask for your patience as we consider improvements.