Dear Folks–
I am hoping you can confirm (or, if incorrect, disconfirm) five beliefs I have in regard to top-coded values in the IPUMS-CPS, and perhaps answer two related questions.
Beliefs:
- All of the formerly top-coded values in the IPUMS-CPS from years after 1975 (calendar year 1974) now have swap values provided by the Census Bureau comparable to those from 2011 on (calendar year 2010 on), and this is now the best (except perhaps for understanding state economies?) available CPS data for high-income persons and households.
- Swap values are generally more informative replacements for the cell mean values provided by Larrimore et al. (2008) and referenced at https://cps.ipums.org/cps/income_cell_means.shtml
- Swap values consist of swaps of entire household records, so that correlations between income types and, e.g., demographic variables, or occupation and industry of household members, are correct.
- IPUMS has not itself substituted those values directly into the IPUMS-CPS data as provided by the extract system.
- There is not currently ipumsr code provided to do the required merge for this substitution with R (though it does not sound hard. Are the replacement value vectors the same length as the full IPUMS-CPS files for the covered years?)
Questions:
- There was a moment in time, some years ago, when a Census employee told me that even the swap values max out and are silently replaced by ceiling values at some very high level – he mentioned $1 million (per income type, I think) for all internal-to-Census calculations and published values, and $10 million for data collected and retained (but never used). Do you know if this is currently true, or true for any years post 1975?
- I have seen – but I can not recall where – apparently authoritative claims that records corresponding to the Fortune 100 or 500 richest individuals, or some similarly-defined group, are by design excluded from reported CPS micro data, presumably because even after anonymization their income profiles are too distinctive to provide legally required privacy guarantees. Do you know if this is true, or true for some years?
Warmest regards, Andrew Hoerner
I’ll briefly respond to each of your beliefs about top-coded values in IPUMS CPS before addressing your questions.
(1) Prior to 2011, the income variables have top-coded values to preserve confidentiality of responses. Starting in 2011, the Census Bureau implemented a rank proximity swapping procedure. This is the way the data in IPUMS CPS is provided by default. You can read more information about this on this page. A method developed and described in Larrimore et al. (2008) makes these top-coding or swapping procedures roughly comparable over time. You can download the necessary data files to implement this procedure on this page.
(2) The procedure implemented and described in Larrimore et al. (2008) does allow for slightly better estimates of average incomes and investigations of income inequality.
(3) If I understand the Census Bureau’s documentation correctly, in the rank proximity swap procedure all incomes above the top-code are exchanged among individuals within a bounded interval. Therefore, the swap happens at the individual level rather than the household level. This is done to preserve confidentiality, with some reduction in accuracy.
(4) Yes, correct. The data available in the IPUMS CPS extract system uses the original method for preserving confidentiality as implemented by the Census Bureau.
(5) Yes, correct. There is no R code readily available to perform the required merge. It should be quite straightforward since the files should be the same length.
Questions:
(1) I do not know about such a practice. I suppose it seems plausible, but I have not seen explicit documentation of this practice. You may find it helpful to reach out to the Census Bureau directly about this question.
(2) I also do not know much about this detail. I do know that the Census Bureau goes to great lengths to maintain the confidentiality of respondents in the CPS microdata. Additionally, since the CPS is a sample of the overall population it is quite unlikely that in any given year a specific individual on the Fortune 100 or 500 list is included in the CPS sample.
Hi Jeff! I’ve been looking at the 1976-2010 retrospective proximity swap file, and while the merge appears straightforward, the swap file is not the same length as the underlying CPS file. Only individuals with at least one kind of income above the top-coding threshold are contained in the file, and it appears the merge is to be done on YEAR, SERIAL, PERNUM.
The swap replacement file still has top-coded values in it for about one percent of the people. Do you know anything about this? Although income values are reported as high as 1.7 million dollars for some income types, top-coded values of 99997 are still identifiable because they do not conform to 2-digit rounding. I have not checked data more recent than 2011 yet. If there are still top-coded values despite swapping, that makes me unhappy.
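For anyone wanting to repeat this check, here is a minimal Python sketch of the rounding test described above. It assumes "2-digit rounding" means rounding to two significant digits, which is my reading and not something taken from Census documentation. The function name is hypothetical. A flagged value is only a candidate for a special code; the test by itself does not say what the code means.

```python
def is_two_sig_digit_rounded(value):
    """Return True if an integer dollar amount has at most two significant digits.

    Assumption: "2-digit rounding" means every digit after the first two is zero.
    """
    digits = str(abs(int(value)))
    # Everything after the first two digits must be zero for a rounded value.
    return all(d == "0" for d in digits[2:])


# Values taken from the post above: 1.7 million conforms, 99997 does not.
candidates = [v for v in (1700000, 99997) if not is_two_sig_digit_rounded(v)]
print(candidates)
```

Running this kind of filter over each income column of the swap file would list the values that do not fit the rounding pattern.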
I think this detail requires a brief clarification. The code 99997 does not identify “top-coded” values, rather it is the code for item non-response. These codes will still persist despite swapping values because the swap-values method does not correct for observations where the respondent did not respond to the income question.
I thought all the income item non-response values had been imputed by some hot-deck like procedure? There is this paper “Trouble in the Tails? What We Know About Earnings Nonresponse Thirty Years after Lillard, Smith, and Welch” which notes that the proportion of values that were imputed rather than measured directly has been rising. They do a real match on imputed labor income and SSI income from administrative records, and find that the imputed values are low by about 20 percent.
So are there some missing values that are imputed while other missing values remain missing? How is this determined?
Wait a minute – If a value is missing, how do they know that it exceeds the threshold for swapping? I really don’t understand how this works at all.
Okay, I think I can help clarify. The “missing” value code of 99997 applies to the 1976-2010 swap values data file. You are correct that in the income data, all item non-response cases are imputed. This is not the case for the swap values data file. As is noted on this page, “Every non-zero value in the swap values file should replace a topcoded value in IPUMS. All other values are missing or zero, meaning that the income variable is not available for that year or the income was not topcoded, respectively.”
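Putting the pieces of this thread together, the substitution could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not an official IPUMS procedure. It assumes both files have been read into lists of dictionaries, that the swap file uses the same income variable names as the extract, and that 99997 is the missing code for the variables involved (other variables may use a different code). The function name and the sample records are made up for the example.

```python
MISSING_CODES = {99997}  # assumption: missing code discussed in this thread


def apply_swap_values(cps_rows, swap_rows, income_vars, missing_codes=MISSING_CODES):
    """Left-merge swap values onto CPS person records by YEAR, SERIAL, PERNUM.

    A swap value replaces the extract value only when it is non-zero and not a
    missing code; every other CPS value is left unchanged.
    """
    swap_by_key = {(r["YEAR"], r["SERIAL"], r["PERNUM"]): r for r in swap_rows}
    merged = []
    for row in cps_rows:
        out = dict(row)  # copy so the input records are not modified
        swap = swap_by_key.get((row["YEAR"], row["SERIAL"], row["PERNUM"]))
        if swap is not None:  # most persons have no record in the swap file
            for var in income_vars:
                value = swap.get(var)
                if value is None or value == 0 or value in missing_codes:
                    continue  # zero: not topcoded; missing: nothing to substitute
                out[var] = value
        merged.append(out)
    return merged


# Made-up records for illustration only.
cps = [
    {"YEAR": 1990, "SERIAL": 1, "PERNUM": 1, "INCWAGE": 99999, "INCBUS": 5000},
    {"YEAR": 1990, "SERIAL": 1, "PERNUM": 2, "INCWAGE": 30000, "INCBUS": 0},
    {"YEAR": 1990, "SERIAL": 2, "PERNUM": 1, "INCWAGE": 99999, "INCBUS": 99999},
]
swap = [
    {"YEAR": 1990, "SERIAL": 1, "PERNUM": 1, "INCWAGE": 250000, "INCBUS": 0},
    {"YEAR": 1990, "SERIAL": 2, "PERNUM": 1, "INCWAGE": 99997, "INCBUS": 180000},
]
result = apply_swap_values(cps, swap, ["INCWAGE", "INCBUS"])
```

Because the swap file holds only persons with at least one topcoded income, the merge is a left join from the CPS side: unmatched persons pass through unchanged.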