Topcoding of earnings in ATUS

Hey IPUMS users,

As documented, earnings variables in ATUS data are topcoded at a single “cutoff” value across all years (for example, earnweek is topcoded at 2884.61 in every data year). The problem is that average and median salaries rose during the period covered by the data, 2003-2023 (see for example here). As a result, later years are topcoded more coarsely, with a larger share of cases passing the threshold each year. The underestimation of earnings for the topcoded cases also grows over time: the true average earnings of topcoded cases will be higher in later data years, while the data still report 2884.61. As I understand it, adjusting for inflation with the CPI exacerbates the problem, because a 2003 case topcoded at 2884.61 ends up with an even higher real value relative to a 2023 case that is also topcoded at 2884.61.
Is there a way to mitigate this problem? Is it better to topcode values again after applying CPI?

I understand this is not an ATUS question per se, but any help will be very much appreciated.
All the best and thank you for your help.

From a quick tabulation, I’m finding that the number of respondents with the top code for EARNWEEK (2884.61) increased from 202 in the 2003 ATUS to 442 in 2023 (from 1.5% to 8.2% of the total in-universe sample). As you explain, more respondents are expected to be top coded over time due to inflation. This may bias average earnings estimates downward, because reported earnings are capped at the top code for a growing share of the sample. In most cases it will not affect median earnings estimates, since the median will still fall below the top code threshold. The Census Bureau addressed this issue in April 2023 when it instituted a dynamic top code for weekly earnings, set at 3% of the sample in each survey month. This, however, does not address the issue for earlier samples.
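
To make the mean-versus-median point concrete, here is a minimal Python sketch. The earnings values are made up for illustration and are not ATUS data; only the 2884.61 cutoff comes from this thread.

```python
from statistics import mean, median

TOP_CODE = 2884.61  # EARNWEEK top code cited in this thread


def apply_top_code(values, cutoff=TOP_CODE):
    """Cap every value at the cutoff, mimicking a top-coded public-use file."""
    return [min(v, cutoff) for v in values]


# Hypothetical weekly earnings (assumption); the last two exceed the top code.
true_earnings = [400.0, 750.0, 1100.0, 1600.0, 2300.0, 3500.0, 6000.0]
observed = apply_top_code(true_earnings)

true_mean = mean(true_earnings)
observed_mean = mean(observed)
true_median = median(true_earnings)
observed_median = median(observed)

# The mean is pulled down by the cap; the median is unchanged because it
# lies below the top code.
print(f"mean:   true {true_mean:.2f}, observed {observed_mean:.2f}")
print(f"median: true {true_median:.2f}, observed {observed_median:.2f}")
```

The larger the share of cases above the cutoff, the larger the gap between the true and observed mean, which is why the bias grows in later years.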

You might be able to mitigate this effect by applying a consistent top code across your sample years and dropping respondents above it. For example, since EARNWEEK top codes at most 8.2% of respondents (in 2023), you might drop the top 8.2% of weekly earners from each sample that you analyze. Alternatively, if you’re interested in comparing inflation-adjusted earnings, you might observe that the top code value in 2023 is roughly equivalent to $1,752 in 2003 dollars. In that case, you might exclude respondents in each year whose inflation-adjusted weekly earnings exceed the inflation-adjusted top code for your most recent year of analysis. A good strategy may be to perform your analysis with and without respondents with top-coded income values and compare the results.
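
The inflation-adjusted variant could be sketched roughly as follows in Python. The price index values below are placeholders I made up so the example runs; they are not actual CPI figures, and the record layout is also just an assumption. The idea is that, with a fixed nominal top code and rising prices, the most recent year has the lowest top code in real terms, so that value becomes the common cutoff.

```python
TOP_CODE = 2884.61  # nominal EARNWEEK top code, the same in every year

# Placeholder price index (assumption, NOT real CPI values); base year 2003.
PRICE_INDEX = {2003: 100.0, 2013: 125.0, 2023: 165.0}
BASE_YEAR = 2003


def to_base_dollars(amount, year):
    """Express a nominal amount from `year` in base-year dollars."""
    return amount * PRICE_INDEX[BASE_YEAR] / PRICE_INDEX[year]


# Real value of the nominal top code in each year; the smallest one binds.
real_top_codes = {year: to_base_dollars(TOP_CODE, year) for year in PRICE_INDEX}
common_cutoff = min(real_top_codes.values())

# Hypothetical (year, nominal earnweek) records, not ATUS data.
records = [
    (2003, 900.0), (2003, 2000.0), (2003, 2884.61),
    (2013, 2884.61), (2023, 1500.0), (2023, 2884.61),
]


def harmonize(rows, cutoff):
    """Deflate earnings, flag cases at or above the common real cutoff,
    and cap them there."""
    out = []
    for year, earn in rows:
        real = to_base_dollars(earn, year)
        out.append({
            "year": year,
            "real_earnweek": min(real, cutoff),
            "above_cutoff": real >= cutoff,
        })
    return out


harmonized = harmonize(records, common_cutoff)
for row in harmonized:
    print(row)
```

Rows with `above_cutoff` set can then either be dropped or kept at the capped value, and running the analysis both ways gives the comparison suggested above.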

For methods other than dropping top-coded respondents, I recommend reviewing this paper on using the Pareto distribution approach for top-coded values. You will also want to review how other researchers and publications handle this issue.
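
As a rough illustration of the general idea, and not necessarily the exact procedure in the linked paper, a common textbook version assumes the upper tail follows a Pareto distribution, estimates the shape parameter from the counts between a lower cutoff and the top code, and replaces top-coded values with the implied tail mean. The counts and the lower cutoff in this Python sketch are hypothetical; in practice you would likely use weighted counts.

```python
import math


def pareto_alpha(n_between, n_topcoded, lower_cutoff, top_code):
    """Shape parameter implied by the share of cases above the top code
    among all cases above the lower cutoff:
    alpha = ln((n_between + n_topcoded) / n_topcoded) / ln(top_code / lower_cutoff)
    """
    return (math.log((n_between + n_topcoded) / n_topcoded)
            / math.log(top_code / lower_cutoff))


def pareto_tail_mean(alpha, top_code):
    """Mean of a Pareto tail above the top code; finite only if alpha > 1."""
    if alpha <= 1:
        raise ValueError("tail mean is undefined for alpha <= 1")
    return top_code * alpha / (alpha - 1)


TOP_CODE = 2884.61

# Hypothetical inputs (assumptions): a lower cutoff at half the top code,
# 1,326 cases between the two cutoffs, and 442 top-coded cases.
alpha = pareto_alpha(n_between=1326, n_topcoded=442,
                     lower_cutoff=TOP_CODE / 2, top_code=TOP_CODE)
imputed_value = pareto_tail_mean(alpha, TOP_CODE)

print(f"alpha = {alpha:.3f}, imputed mean above top code = {imputed_value:.2f}")
```

The estimate is sensitive to the choice of lower cutoff, so it is worth trying a few values and checking how much the imputed mean moves.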

Hi Ivan, how do we know that EARNWEEK top codes at most 8.2% of respondents in 2023? I’m getting about 10% of respondents with values ≥ 2884.61. Thanks much, SG

. su earnweek if earnweek<99999.99 & year==2023

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    earnweek |      4,381      1308.9    809.4945          0    2884.61

. su earnweek if year==2023 & earnweek>=2884.61 & earnweek<99999.99

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    earnweek |        442     2884.61           0    2884.61    2884.61

Below I have calculated the share of respondents in each ATUS sample whose EARNWEEK value was top coded at 2884.61. Generally, this represents the share of respondents whose exact EARNWEEK value is not known due to the top code imposed in the original data, but who earned weekly income of at least $2884.61. In all cases, I have restricted the dataset to only those who are in-universe for EARNWEEK. I did this by excluding respondents with EARNWEEK=99999.99 (see codes tab).

This first table is weighted using WT06 (for all samples except 2020) and WT20 (for 2020). I created a binary variable, *topcoded*, equal to one if EARNWEEK=2884.61. There are no values of EARNWEEK above this threshold in IPUMS ATUS data, with the exception of 2024, which we discussed in an earlier forum thread. I used the svyset [pweight=weight] command to declare weights for the survey data and produce a weighted cross tabulation of year and the topcoded variable.

This represents the share of the population that had weekly earnings at or above 2884.61 each year. You can see that in the 2023 sample, 8.17 percent of individuals had weekly earnings at or above 2884.61. In Ivan’s post, he wrote that 8.2 percent of respondents had top-coded EARNWEEK values in 2023; it would be more accurate to say that about 8.2 percent of the U.S. population had weekly earnings of 2884.61 or more in 2023, as estimated using the ATUS data.

This second table is an unweighted cross tabulation of the topcoded variable and year. This represents the unweighted share of the sample that had weekly earnings at or above 2884.61 each year. This tells you how many individuals in the dataset are affected by topcoding.
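
For anyone who wants to see the mechanics of the two calculations outside Stata, here is a small Python sketch. The rows and weights are invented for illustration; only the 2884.61 top code, the 99999.99 not-in-universe code, and the 2023 counts (442 of 4,381) come from this thread.

```python
TOP_CODE = 2884.61
NIU = 99999.99  # not-in-universe code for EARNWEEK


def topcoded_shares(rows):
    """Return (unweighted share, weighted share) of top-coded cases among
    in-universe rows, where each row is (earnweek, weight)."""
    in_universe = [(e, w) for e, w in rows if e < NIU]
    n_total = len(in_universe)
    n_top = sum(1 for e, _ in in_universe if e >= TOP_CODE)
    w_total = sum(w for _, w in in_universe)
    w_top = sum(w for e, w in in_universe if e >= TOP_CODE)
    return n_top / n_total, w_top / w_total


# Hypothetical rows (assumption): top-coded cases carry smaller weights,
# so the weighted share falls below the unweighted share.
rows = [
    (500.0, 3000.0),
    (1200.0, 2500.0),
    (2884.61, 1000.0),
    (2884.61, 500.0),
    (99999.99, 4000.0),  # not in universe, excluded from both shares
]
unweighted_share, weighted_share = topcoded_shares(rows)
print(f"unweighted: {unweighted_share:.3f}, weighted: {weighted_share:.3f}")

# Unweighted 2023 share implied by the counts posted earlier in the thread.
sample_share_2023 = 442 / 4381
print(f"2023 unweighted share: {sample_share_2023:.3f}")
```

The unweighted 2023 figure works out to about 10%, matching SG's tabulation, while the 8.17% figure is the weighted population estimate.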

Thanks Isabel, that was incredibly helpful, including the distinction between population estimates and sample values. I wonder if your conclusion that “about 8.2 percent of the U.S. population had weekly earnings of 2884.61 or more in 2023, as estimated using the ATUS data” can be used to check the ATUS earnings variable against other earnings measures, e.g. CPS or PSID. Anyway, thanks again!