Nonexistent variables, topcodes vs nonresponse, & the lengths of 9-strings

Nonexistent variables:
Most of the income variables contain between 30K and 80K actual NAs, or things that register as actual NAs with rlang::are_na. I take these to be the values in years when the income variable in question does not exist.
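
One way to check this reading is to tabulate the NAs by survey year. If the NAs really mark years in which a variable does not exist, each year should be either entirely missing or not missing at all for that variable. The following is only a minimal sketch. It assumes the extract has been loaded as a list of dictionaries with NA stored as None, and the key name `year` and both function names are hypothetical, not taken from the extract.

```python
from collections import defaultdict


def na_counts_by_year(rows, variables, year_key="year"):
    """Return {variable: {year: (n_missing, n_total)}} for the given rows.

    Assumes each row is a dict and that a missing value is stored as None.
    """
    counts = {v: defaultdict(lambda: [0, 0]) for v in variables}
    for row in rows:
        year = row[year_key]
        for v in variables:
            cell = counts[v][year]
            cell[1] += 1                 # every record counts toward the total
            if row.get(v) is None:
                cell[0] += 1             # record is missing this variable
    return {
        v: {year: tuple(cell) for year, cell in by_year.items()}
        for v, by_year in counts.items()
    }


def partially_missing_years(counts):
    """List, per variable, the years that are missing for some records only."""
    return {
        v: sorted(y for y, (miss, total) in by_year.items() if 0 < miss < total)
        for v, by_year in counts.items()
    }
```

Any year returned by `partially_missing_years` would suggest that some NAs in that variable are something other than "variable not available in this sample".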


Top codes or item nonresponse?
All of the swap variables except those corresponding to incss, incwelfr, incssi, incdsa2, and inclongj contain one or more values of “99997”. The shorter version “9997” does not occur in these variables.


it says:

Edited supplemental files.
Changes were made to a Census Bureau income data replacement values file called “swapvalues”, provided in a form that is compatible with IPUMS data extracts on the Income Component Cell Means Replacement Values page. Some income replacement values were coded as 99999, which is typically an NIU code. These values have been replaced with 99997 to indicate that the income value was topcoded due to a limited number of digits, and the codes should be treated as an income value.

However, here:
Jeff Bloem states that these codes represent item non-response.

If these variables are indeed top-coded with the 7-terminal value, may I assume that the value at which they are top-coded is 1 * 10^n, where n is the number of digits in the variable?

Is there anywhere in the DDI that the number of characters in a field is consistently given? It is only sometimes in the coding instruction. May I safely assume that the number of digits that a variable has in the DDI, when given, is the same as the number that the same variable has in the swapcode file?

Questions relating to field widths & 9-strings
Maybe the coding instructions for the original variables will help clarify these issues? The coding instructions for the original values (not the swap files, unless they are the same) give strings of nines from five to eight long as not-in-universe, strings of the same length with a terminal 8 as missing (for only a few variables), and strings with a terminal 7 as top-coded for numerous variables.

I find these variations in length confusing. At first I thought that they were all set at the width of the fields, and varied for that reason, but this is not the case. See, e.g. FTOTVAL, top-coded at 50,000 (five digits), but with an NIU value of 999999 (six digits). Within the swap files, are these strings of nines with varying terminal digits of the same length as the field widths?

These seven variables include some observations coded with “999999” (six nines) in the swap file:
incwage incbus incfarm oincwage oincbus oincfarm oinclongj

All of these variables have five-digit terminal-7 values, so it appears that my guess above, that the lengths are set by the field lengths, is incorrect. In addition, these two variables have length-4 9-strings with a terminal 9: incss and incssi.

How should I interpret these 9-terminal nine-strings in the swap file variables? If they are missing values, do we know anything about how they are missing (and why they have not been imputed in the swap file)?
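
For reference, the kind of tabulation behind the counts above can be sketched as follows. This is an illustration, not the exact code I ran. The function name is hypothetical, and the minimum width of four is an assumption taken from the shortest 9-string mentioned above. A legitimate income of 9,999 would also match, so the results need checking against the codebook.

```python
import re
from collections import Counter

# One or more 9s followed by a terminal 7, 8, or 9.
NINE_STRING = re.compile(r"^9+[789]$")


def nine_string_codes(values, min_width=4):
    """Count values that look like special codes, keyed by their string form.

    None is skipped; min_width is an assumed cutoff so that small
    values such as 97 or 99 are not mistaken for special codes.
    """
    found = Counter()
    for v in values:
        if v is None:
            continue
        s = str(int(v))
        if len(s) >= min_width and NINE_STRING.match(s):
            found[s] += 1
    return found
```

Running this once per swap variable gives, for each variable, the distinct 9-strings present and how often each occurs, which makes the differences in length and terminal digit easy to compare.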

I’ll aim to address each question one at a time.

Yes, any value that is blank (i.e., NA in R) is blank because the variable in question is not available in every sample pooled in your data set.

Whether the swap value of 99997 represents a top-code or item non-response is a good question. In general, the Census Bureau provides little documentation on this specific point. In practice, however, the distinction between whether these values are top-coded (i.e., top-coded in the restricted files) or represent item non-response makes little difference. In either case, the “real” value of these income values is missing from the data.

Note that the number of 9’s in the CPS special missing codes has little meaning. You are correct that this detail is largely determined by the width of the valid values in the given variable. However, the number of leading 9’s may not align perfectly between the “original” harmonized variable and the swap values. In general, any value made up of leading 9’s with a terminal 9 (NIU/blank), 8 (no response), or 7 (don’t know) indicates a missing value of some sort.
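
As a rough illustration of that convention, the sketch below recodes a value to missing based only on its terminal digit, ignoring how many leading 9’s it has. The function name and the minimum width of four are assumptions made for this example, and the labels simply repeat the ones given above. It is not an official IPUMS routine.

```python
# Labels keyed by terminal digit, as described in the paragraph above.
LABELS = {"9": "NIU/blank", "8": "no response", "7": "don't know"}


def split_special_code(value, min_width=4):
    """Return (value, None) for a valid amount or (None, label) for a special code.

    A special code is taken to be a run of 9s followed by a terminal
    7, 8, or 9. min_width is an assumed cutoff that keeps short values
    such as 97 or 997 from being treated as codes.
    """
    s = str(int(value))
    is_code = (
        len(s) >= min_width
        and s[-1] in LABELS
        and set(s[:-1]) == {"9"}   # everything before the last digit is a 9
    )
    if is_code:
        return None, LABELS[s[-1]]
    return value, None
```

Because only the terminal digit is inspected, 99997 and 999997 receive the same label, which matches the point that the number of 9’s carries little meaning.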

Dear Jeff–
You write:
" In practice, however, the distinction between whether these values are top-coded (i.e., top-coded in the restricted files) or represent item non-response makes little difference."

I don’t believe this is correct. If this is a top code, I need to figure out what the income level is at which top-coding begins, and then fit a truncated distribution to the observations up to that level. And the number and weights of such values are potentially important, as they can provide some information about whether the shape of the curve you are using to estimate top incomes and shares changes above the censoring point.

Also, if you fit a truncated distribution and then use that distribution to project the income above the truncation value, which seems like a reasonable procedure, I can tell you (from numerical experiments) that if the swap file is truncated at, e.g., a million dollars, you will get a much bigger number for the income share of the top tenth of a percent than you will if you assume there is no top-coding and just fit the data that is there, dropping all the “nines” values.
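
To make the point concrete, here is a minimal sketch of one such experiment. It is an illustration rather than my actual procedure. It assumes a Pareto tail above a known threshold `x_min` and uses the standard maximum-likelihood estimator for right-censored data, in which censored records add to the denominator but not to the count of observed values. Both function names are hypothetical.

```python
import math


def pareto_alpha(obs, x_min):
    """Weighted MLE of the Pareto shape with right-censored records.

    obs is an iterable of (value, weight, censored). For a censored
    record, value is the censoring point (for example, the top code).
    Records below x_min are ignored.
    """
    observed_weight = 0.0   # total weight of fully observed records
    log_distance = 0.0      # weighted log(value / x_min) over all records
    for value, weight, censored in obs:
        if value < x_min:
            continue
        log_distance += weight * math.log(value / x_min)
        if not censored:
            observed_weight += weight
    if log_distance <= 0.0:
        raise ValueError("need at least one record strictly above x_min")
    return observed_weight / log_distance


def top_share_within_tail(alpha, q):
    """Share of tail income held by the top fraction q of a Pareto tail."""
    if alpha <= 1.0:
        raise ValueError("the mean is infinite when alpha <= 1")
    return q ** (1.0 - 1.0 / alpha)
```

Fitting once with the censored records included and once with them dropped shows the direction of the effect: dropping them yields a larger alpha, a thinner fitted tail, and therefore a smaller projected top share.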

If it is a missing value, we really wish we knew why these particular values were not imputed via hot-deck or whatever. If it is really an NIU value, then it should be NIU for all the income fields and the record should be dropped. That makes it easy. If it is really item non-response, then it is hard to know what to do with it. My own imputation from other income values and demographic variables, maybe? But estimated only on the swap value sample, maybe? That would probably be fine for SSI or pension values, not so great for wages or core capital income.

Am I reading your documentation correctly as saying that all the 99…97 values were 99…9 values in the original files as they came from the Census? Did the swap file originally come with any Census documentation? A contact person? A division name? I have searched the Census site and not found this file at all, or anything related to it, or any mention of it. Do you know the original name of the file?

You are correct about all of this and it seems my previous response lacked critical detail. As is discussed in the attached paper by Larrimore et al., these swap adjustment techniques are limited by the internal censoring of income values in the restricted CPS files. On page 111:

“[…] because cell means [and swap values] are based on the internal March CPS, which is subject to some censoring at the very top of the distribution, the cell means series does not reveal the full scope of the US income distribution. However, while the public use March CPS generally provides complete information for only around 94–98% of the sample population, the internal March CPS is able to provide income information for 99.0 to 99.8% of the sample distribution.”

Therefore, there are some income values in the internal CPS files that are censored. One could call these either “top-coded” or “missing”; in practice, however, we just do not have information about these income values. Therefore, as is noted on this page, values in the swap file coded as 99997 should not replace original values.
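
In code, that rule might look like the sketch below. It is only an illustration. The function name is hypothetical, and keeping the original value when no swap value is present is an assumption rather than documented behavior.

```python
def apply_swap(original, swap, do_not_use=(99997,)):
    """Return the swap value unless it is absent or a do-not-use code.

    original and swap are the two values for one record and one income
    variable; do_not_use lists the swap codes that must not replace
    the original value.
    """
    if swap is None or swap in do_not_use:
        return original
    return swap


def apply_swap_column(originals, swaps, do_not_use=(99997,)):
    """Apply the same rule to two equal-length sequences of values."""
    return [apply_swap(o, s, do_not_use) for o, s in zip(originals, swaps)]
```

The `do_not_use` argument can be extended if other special codes in a given swap variable should be treated the same way.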

Regarding the 99, 98, 97, and 96 special codes: While the corresponding codes in the original Census files are different (and often use negative values), the IPUMS CPS Team re-codes this information using the CPS special codes convention discussed above. Unfortunately, there is very little documentation available from the Census about these original swap files. The documentation we provide is our best effort to document the information available in these files.