The highest one percent of incwage earners are combined into the incwage "Top Code," which is the median of that group?

The Top Code for incwage protects the privacy of the top income earners. The Top Code is set at the 99.5th percentile for the most recent years. As I understand it, this is not a wage cap; rather, the top 1% of incwage earners fall into the Top Code, which is the median of that group. Am I right that it is a grouping of just the top 1% of wage earners, and that the Top Code is the median of that group and not the mean? Thank you for taking the time to answer my question.

Sincerely,

jglickman234

These details depend on the specific sample you are looking at. For example, samples between 1940 and 1980 all have top-coded values. In 1990 and onward the “top-code” represents some measure of the central tendency of income earners above some threshold. In 1990 the value is the median of the state-specific distribution of income earners above $140,000. In 2000 and onward the value is the mean of the state-specific distribution of income earners above the given threshold. These details can be seen on the Codes Tab and on the Variable Appendices page.

Jeff,

Thank you for your answer to my question. But I’m only asking about the most recent years: 2003 and on. The documentation states that the top code is the 99.5th percentile for the state. Let’s just use incwage, 2016, and California as the example. It states that the top code value for incwage is $504,000. I also noticed that the second-highest incwage value for California in 2016 was 300,000. What I am asking is: what percentage of values are affected by the Top Code? All values above $300,000? What percentage of values, or what method, did you use to determine which values are affected by the Top Code? The second part of my question involves the top code itself. It says it is the 99.5th percentile, which I interpret as the median of the top one percent, but then the documentation says that the Top Code is the mean of the values affected by the Top Code. Is the Top Code a mean or a median value? I apologize if the answer is apparent in your documentation, but I can’t seem to confirm this. Thank you for your help explaining how the Top Code works. And if for privacy reasons you can’t go into this specific detail, I understand.

The answers to these questions can be found via the links on the Variable Appendices page. In particular, the link for 2016 ACS Top Codes and State means provides information about the details of how the U.S. Census Bureau determines top-codes and how these values are presented in IPUMS USA data. Additionally, this page includes a link to this spreadsheet that lists the top and bottom codes for 2016 ACS data.

Regarding the second part of your question: There are two distinct values to keep separated. The first value is the highest true value of income that appears in the data. This can be thought of as the cutoff threshold. In ACS samples from 2003 and onward this cutoff threshold varies by state and is the income value at the 99.5th percentile of the income distribution. (I don’t suggest thinking of this as the median of the top 1 percent, as this just makes things confusing.) The second value is the income value that is given to individuals who have true values of income above the cutoff threshold. In ACS samples from 2003 and onward this value is the mean (average) of the values above the cutoff threshold. Sorry for the confusion, as the term “top-code” is often used to represent both of these values.
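To make the two values concrete, here is a minimal Python sketch of the mechanism described above. It is only an illustration: the function name and the five incomes are made up, the cutoff is passed in rather than computed, and the mean is unweighted, which is an assumption rather than a statement of how the Census Bureau does it. The toy numbers were chosen so that the result lines up with the $300,000 and $504,000 figures mentioned earlier in the thread.

```python
def apply_top_code(incomes, cutoff):
    """Replace every income above `cutoff` with the mean of those incomes.

    Hypothetical helper for illustration; unweighted mean is an assumption.
    Returns (replacement_value, top_coded_incomes).
    """
    above = [x for x in incomes if x > cutoff]
    if not above:
        # Nothing exceeds the cutoff, so no value is changed.
        return None, list(incomes)
    replacement = sum(above) / len(above)
    coded = [replacement if x > cutoff else x for x in incomes]
    return replacement, coded


# Made-up incomes; 300_000 plays the role of the cutoff threshold.
toy_incomes = [50_000, 120_000, 300_000, 400_000, 608_000]
replacement, coded = apply_top_code(toy_incomes, cutoff=300_000)
print(replacement)  # 504000.0, the mean of 400_000 and 608_000
print(coded)
```

Note that 300,000 stays in the data as the highest true value, while everything above it shows up as the single replacement value.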

Jeff,

Thanks for your response. It was very thorough, and now I understand the Top Code exactly. I notice in your data, then, that the 99.5% applies to all survey respondents, at least in California, regardless of age. I find that 19.8% of the California respondents are younger than 16. Maybe I did something wrong, but I believe those respondents are included in the calculation that makes the 99.5th percentile of incwage equal to 300,000 (for 2016). I really appreciate the above response, as it clarified the Top Code for me (and others, I hope). Have a good holiday!

Jeff

If you are still using the INCWAGE variable, this does sound a bit strange. I just looked into this and everyone who is less than 16 years old has an INCWAGE value of 999999, indicating “N/A”. This reflects the fact that the universe for INCWAGE is all those who are age 16+. Perhaps look into your code again and try to identify an error. If you find this observation persists, feel free to email your code to ipums@umn.edu.
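One common source of this kind of surprise is leaving the 999999 "N/A" code in the data when computing statistics. A minimal sketch of dropping it before taking a weighted mean follows. The record layout (a list of dicts with INCWAGE and PERWT keys) and the toy values are assumptions for illustration, not the actual extract format.

```python
NA_INCWAGE = 999999  # "N/A" code for people outside the INCWAGE universe


def weighted_mean_incwage(records):
    """Person-weighted mean of INCWAGE, skipping the N/A code.

    `records` is assumed to be a list of dicts with INCWAGE and PERWT keys.
    """
    in_universe = [
        (r["INCWAGE"], r["PERWT"]) for r in records if r["INCWAGE"] != NA_INCWAGE
    ]
    total_weight = sum(w for _, w in in_universe)
    return sum(v * w for v, w in in_universe) / total_weight


# Made-up records: the first one stands for a child under 16.
toy_records = [
    {"INCWAGE": 999999, "PERWT": 100},
    {"INCWAGE": 40_000, "PERWT": 100},
    {"INCWAGE": 60_000, "PERWT": 300},
]
print(weighted_mean_incwage(toy_records))  # 55000.0
```

If the N/A record were kept, it would be treated as a wage of 999,999 and pull the mean far upward.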

Jeff,

What I meant was this, by person weight: if the Top Code includes only the top half percent of the population in terms of income, then the whole population is included in that calculation. This includes everyone, even children. If you take children and non-workers out of the calculation, then the Top Code contains roughly the top 1 percent of income (incwage) earners by person weight (perwt). Note: this finding is only for the State of California.

Ah, I think I understand what you are saying now. You are correct, the 99.5% applies to all survey respondents.
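The arithmetic behind the "roughly 1 percent of earners" observation can be sketched as follows. The earner share of 0.5 is purely an illustrative assumption (it is the value that would turn 0.5% of everyone into 1% of earners), not a figure taken from the data, and the function name is hypothetical.

```python
def share_of_earners_top_coded(share_of_everyone, earner_share):
    """Convert a top-coded share of all respondents into a share of earners.

    share_of_everyone: fraction of the whole weighted population above the cutoff
    earner_share: fraction of the weighted population with wage income (assumed)
    """
    return share_of_everyone / earner_share


# 0.5% of everyone is above the 99.5th percentile; assume half of the
# weighted population are wage earners (an illustrative value only).
print(share_of_earners_top_coded(0.005, 0.5))
```

Under that assumption the result is about 1 percent, which is consistent with what was found for California.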

Jeff,

Thanks for your help so far. I have another question for you. I see that INCTOT doesn’t seem to have a Top Code or a cap. Am I reading this correctly? Do the Top Codes on INCWAGE affect the distribution of INCTOT, or is INCTOT not affected by the Top Codes on INCWAGE or any other income variables?

Sincerely,

Jeff

As noted on the Comparability Tab, INCTOT is the sum of components that are themselves already top-coded. So if INCWAGE is top-coded, for example, then INCTOT will include this top-coded value in the sum, rather than the original income value.
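A small sketch of what that means in practice is below. The component name OTHER_INCOME, the cutoffs, and the replacement values are all hypothetical and chosen only for illustration; the point is that the total is built from the already top-coded components.

```python
def top_coded(value, cutoff, replacement):
    """Return the replacement value if `value` exceeds the cutoff."""
    return replacement if value > cutoff else value


def total_income(true_components, rules):
    """Sum components after top-coding each one with its own (cutoff, replacement)."""
    return sum(
        top_coded(value, *rules[name]) for name, value in true_components.items()
    )


# Hypothetical person: true wages are above the wage cutoff.
true_components = {"INCWAGE": 608_000, "OTHER_INCOME": 10_000}
# Hypothetical (cutoff, replacement) pairs per component.
rules = {"INCWAGE": (300_000, 504_000), "OTHER_INCOME": (50_000, 80_000)}

print(sum(true_components.values()))         # 618000, the true total
print(total_income(true_components, rules))  # 514000, the total built from top-coded parts
```

So a total built this way has no cap of its own, yet it still reflects the top codes applied to its components.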

I have yet another question. When I look at the ACS wage and salary data (incwage) for California, I get a lower mean salary than in the BLS OES survey. This is surprising to me, because the BLS survey doesn’t include bonuses in its wage and salary definition and the ACS survey does. Also, the ACS salary mean is significantly lower than the BEA mean values for California (and both the Bureau of Economic Analysis and the ACS include bonuses). Do you have any explanation of why this might be the case? Thank you for taking the time to answer my question.

Sincerely,

Jglickman234

It is difficult to say without looking at the specific subset of data you are using. Generally speaking, however, it is not all that surprising to see different sources of sample data produce slightly different summary statistics. These discrepancies can be due to a host of factors, including sample selection, weighting methodology, questionnaire wording, etc. If you’d like to send in your code for me to take a look, please email ipums@umn.edu.

Hi Jeff, is INCWAGE also top coded at 5001 in the 100% 1940 census data? Just want to confirm with you as I found many people whose incwage far exceeds 5001 in the full count data (e.g. one person has a value of 540000 on INCWAGE).

The current version of the 1940 full-count data has INCWAGE top-coded at 5001. This was revised in April of this year. Am I correct in guessing that you are using an older version of the file?

Yes you are right. I was using an older version that I downloaded last year.

Hi Matthew,

I downloaded a 1940 full count data set last year and another 1940 full count data set just now, with different sets of variables. You mentioned before that the 1940 full count data was revised this year. I tried to merge the two data sets (using sample, serial, and pernum) but found that a number of observations exist in only the old or only the new version. It looks like one of the three id variables was partly revised. Is it possible to do some adjustment so that I can merge the old and new versions of the 1940 full count data set? Or are the affected observations randomly distributed, so that we can simply drop the observations that cannot be merged? Thank you!

The only reliable way to link between different versions of IPUMS data is by using HISTID. If your old extract contains HISTID then you should be able to link on that. The SERIAL and PERNUM values sometimes change between versions and it can be difficult to determine why a particular case might have changed. If your old extract doesn’t have HISTID, unfortunately I don’t think there’s a simple way to link the datasets for those whose values changed.
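For anyone attempting this, linking on HISTID could look roughly like the sketch below. It uses plain Python dicts to stay self-contained; the row layout, the function name, and the toy HISTID values are made up for illustration, and a real extract would more likely be merged in a statistical package.

```python
def link_on_histid(old_rows, new_rows):
    """Link two extracts on HISTID.

    Returns (linked, unmatched_old). For variables present in both
    extracts, the value from the new version is kept.
    """
    new_by_id = {row["HISTID"]: row for row in new_rows}
    linked, unmatched_old = [], []
    for old in old_rows:
        new = new_by_id.get(old["HISTID"])
        if new is None:
            unmatched_old.append(old)
        else:
            linked.append({**old, **new})  # new values win on shared names
    return linked, unmatched_old


# Made-up rows: SERIAL changed between versions, HISTID did not.
old_rows = [
    {"HISTID": "A", "SERIAL": 1, "PERNUM": 1, "AGE": 30},
    {"HISTID": "B", "SERIAL": 2, "PERNUM": 1, "AGE": 41},
]
new_rows = [{"HISTID": "A", "SERIAL": 7, "PERNUM": 1, "INCWAGE": 1200}]

linked, unmatched_old = link_on_histid(old_rows, new_rows)
print(linked)
print(unmatched_old)
```

Keeping the unmatched rows in a separate list makes it easy to check how many cases fail to link before deciding whether to drop them.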

Thank you! I appreciate the information.