Best Practices on Top and Bottom Coding

Hey folks,

I’ve been working with HHINCOME in the IPUMS microdata to obtain median household incomes for my region (Greater Boston) and the United States overall. But one thing I’m not sure about is whether to include or exclude the top/bottom coded incomes. Does anyone have a sense of best practices on that front? Should I drop them? Or leave them in? What do other folks do?

Thanks,
Peter

A top code is assigned to all respondents whose income is above a certain threshold. This is done on a variable-by-variable basis, so a respondent might have a top code assigned for wage income (INCWAGE) but not for investment income (INCINVST). Because outlier income values might be used to identify survey respondents, the ACS samples set this threshold at the 99.5th percentile in each state. The top code for each income source variable is the state mean of values above the threshold value. A similar procedure is used for bottom coding for income sources that are allowed to be negative, including business and farm income (INCBUS00), investment income (INCINVST), and other income (INCOTHER). Retaining these observations should not pose issues for calculating medians, since only the top 0.5% of earners receive a top code and the medians that you calculate will likely be below this threshold. Entirely omitting these households, however, would skew the results, because it removes cases from the tails of the distribution rather than at random.
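To make the point about medians concrete, here is a minimal sketch in Python using made-up income values and a made-up threshold (neither comes from the ACS). The helper top_code is a hypothetical name; it simply replaces every value above the threshold with the mean of those values, mirroring the procedure described above in a simplified, single-group form.

```python
from statistics import mean, median

def top_code(values, threshold):
    """Replace every value above `threshold` with the mean of those values."""
    above = [v for v in values if v > threshold]
    if not above:
        return list(values)
    code = mean(above)
    return [code if v > threshold else v for v in values]

# Illustrative incomes only; the threshold is an assumption, not an ACS value.
incomes = [12_000, 30_000, 45_000, 58_000, 70_000, 95_000, 400_000, 2_000_000]
threshold = 300_000

coded = top_code(incomes, threshold)

median_raw = median(incomes)        # median of the true values
median_coded = median(coded)        # median after top coding
median_dropped = median([v for v in incomes if v <= threshold])  # top-coded cases removed

print(median_raw, median_coded, median_dropped)
```

In this toy example the median is identical before and after top coding, because the replaced values stay above the middle of the distribution. Dropping the top-coded cases instead gives a lower median.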

There are two additional complications that you should be aware of:

  1. Household income is not directly reported in the ACS. HHINCOME is created by summing the values of INCTOT (total personal income) for all household members. INCTOT itself is the sum of eight different sources of income (see the comparability tab for INCTOT for the list of variables). While neither HHINCOME nor INCTOT is top coded, the individual components of INCTOT are. Note that HHINCOME = 9999999 is the not-in-universe value assigned to group quarters and vacant unit records; these records should be dropped from your median calculation.

  2. While the Boston metro area is identified by MET2013 = 14460, a small number of respondents who are coded as residing in this metro area are adjacent to, but outside, the metro. The Public Use Microdata Sample (PUMS) file released by the Census Bureau only identifies the State and Public Use Microdata Area (PUMA) of respondent households; IPUMS USA harmonizes these data to allow users to analyze metropolitan areas where the discrepancy between PUMA and metro boundaries is determined to be sufficiently low. You can find a detailed explanation of our protocol in the description tab for MET2013 and an estimate of the number of miscoded cases in the match summary file.
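Putting the two points above together, a median calculation might be sketched as follows in Python. This is an illustration under several assumptions rather than a definitive method: it assumes your extract also includes PERNUM and HHWT, that each record is available as a dictionary, and that keeping only PERNUM = 1 is how you reduce the file to one record per household (HHINCOME is repeated on every person record). The function names and the toy records are made up for the example. The weighted median here is a simple cumulative-weight definition without interpolation, so results may differ slightly from published figures.

```python
NIU = 9999999  # not-in-universe code for HHINCOME

def weighted_median(pairs):
    """Return the smallest value whose cumulative weight reaches half the total."""
    pairs = sorted(pairs)
    total = sum(w for _, w in pairs)
    running = 0
    for value, weight in pairs:
        running += weight
        if running >= total / 2:
            return value
    raise ValueError("no observations")

def median_household_income(records, met=None):
    """Weighted median of HHINCOME, one record per household.

    records: iterable of dicts with HHINCOME, HHWT, PERNUM, MET2013 keys.
    met: optional MET2013 code used to restrict to one metro area.
    """
    households = [
        (r["HHINCOME"], r["HHWT"])
        for r in records
        if r["PERNUM"] == 1                      # one record per household
        and r["HHINCOME"] != NIU                 # drop not-in-universe records
        and (met is None or r["MET2013"] == met)
    ]
    return weighted_median(households)

# Toy person-level records (values are invented for illustration).
records = [
    {"HHINCOME": 50000, "HHWT": 100, "PERNUM": 1, "MET2013": 14460},
    {"HHINCOME": 50000, "HHWT": 100, "PERNUM": 2, "MET2013": 14460},
    {"HHINCOME": 80000, "HHWT": 50, "PERNUM": 1, "MET2013": 14460},
    {"HHINCOME": NIU, "HHWT": 30, "PERNUM": 1, "MET2013": 14460},
    {"HHINCOME": 30000, "HHWT": 200, "PERNUM": 1, "MET2013": 0},
    {"HHINCOME": 120000, "HHWT": 60, "PERNUM": 1, "MET2013": 14460},
]

boston = median_household_income(records, met=14460)
us = median_household_income(records)
print(boston, us)
```

Note that the top-coded households are left in: nothing in the filter removes them, which is consistent with the recommendation above.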

If having access to the person-level microdata records is not crucial for your analysis, I recommend obtaining the median household income data from IPUMS NHGIS. This IPUMS project also releases ACS data, but instead of harmonizing the person-level PUMS file, the project allows researchers to access summary statistics from the ACS published by the Census Bureau. By selecting Household and Family Income as a topic in the Data Finder tool, you can find tables that report the median household income for a variety of geographic areas, including metro areas. Since these are summary statistics generated by Census from an internal file, there are no top codes or misidentified households. We recommend that new users of IPUMS NHGIS familiarize themselves with the website by reviewing the FAQ page and the short video tutorials in the User Guide.