Top-coded variables


Regarding the top-coded salary variable in the Higher Ed dataset: when a value is said to be top-coded, is it (1) anonymised and marked as missing, (2) removed from the dataset, or (3) rounded to the value at which the variable is top-coded?

Thank you!


Your case number 3 is the most accurate. The SALARY variable in IPUMS Higher Ed is top-coded at 150,000 US dollars (except in the SESTAT-NSCG and NSRCG surveys, where it is top-coded at 100,000 US dollars). This means that any SALARY value reported above the top-code is replaced with the top-code value. So, for example, if someone reports a salary of 170,000 US dollars, this value will show up as 150,000 in the data.
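As a rough illustration of what top-coding does to reported values, here is a minimal Python sketch. It is not the actual IPUMS processing code; the function name `top_code` and the sample salaries are made up for the example, and only the 150,000 cap comes from the description above.

```python
def top_code(value, cap=150_000):
    """Return the value unchanged unless it exceeds the cap; then return the cap."""
    return min(value, cap)


# Hypothetical reported salaries (not real survey data).
reported = [42_000, 150_000, 170_000, 310_000]

# Values above the cap collapse to the cap; everything else is untouched.
released = [top_code(v) for v in reported]
print(released)
```

One consequence worth keeping in mind: a released value equal to the top-code may be a salary of exactly that amount or any amount above it, so the two cannot be told apart in the data.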


Thanks for that, I really appreciate it, Jeff! One follow-up question, if you wouldn’t mind. As you mentioned, SALARY in the SESTAT-NSCG and NSRCG surveys is top-coded at $100,000. Examining the data in Stata for the years 1995 and 2013, for example, I see observations from these two surveys actually reaching up to $150,000 in salary. Do you know why this is?

Thank you!


You are right about this! The documentation is misleading in this case. It should say that SALARY is top-coded at 100,000 US dollars in the SESTAT-NSRCG surveys and at 150,000 US dollars in all other surveys. Sorry for the confusion here. We will update the documentation accordingly.


Apologies, but even in SESTAT-NSRCG surveys, it seems observations are top-coded at $150,000? It seems all surveys include observations up to $150,000.


I don’t think so. After cleaning out observations with special codes for skips and missing, I find the following:

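. by surid, sort : summarize salary

-> surid = NSCG

    Variable |       Obs        Mean   Std. Dev.       Min        Max
-------------+-------------------------------------------------------
      salary |    457043    65154.34    36686.89         0     150000

-> surid = NSRCG

    Variable |       Obs        Mean   Std. Dev.       Min        Max
-------------+-------------------------------------------------------
      salary |     90087    36660.75     20529.4         0     100000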
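
For anyone checking this outside Stata, the same per-survey comparison can be sketched in plain Python. This is only an illustration: the rows below are made-up placeholders rather than values from the extract, the name `max_by_survey` is hypothetical, and it assumes the special codes for skips and missing values have already been removed, as in the output above.

```python
from collections import defaultdict

# Hypothetical (surid, salary) rows; real rows would be read from the data extract
# after dropping the special codes for skipped and missing responses.
rows = [
    ("NSCG", 65_000), ("NSCG", 150_000), ("NSCG", 0),
    ("NSRCG", 36_000), ("NSRCG", 100_000),
]

# Track the largest salary seen for each survey identifier.
max_by_survey = defaultdict(int)
for surid, salary in rows:
    max_by_survey[surid] = max(max_by_survey[surid], salary)

for surid, top in sorted(max_by_survey.items()):
    print(surid, top)
```

If the maximum for a survey equals its documented top-code, that is consistent with the top-code being applied at that value.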

Perhaps I am missing something?


Apologies, you’re right, sorry!