Dear Folk–
I am aware of at least 8 kinds of “missingness” in the IPUMS CPS data (and, I think, IPUMS USA as well).
1. There are numerical values that are replaced by specific integers that denote the variable has been top-coded; or
2. bottom-coded, without providing information about the top- or bottom-coded value.
3. There are variable- and year-specific values of top- or bottom-coded variables equal to the mean value of the top- or bottom-coded observations (unless these have all been replaced by swap values now).
4. There are data quality flags that indicate if this is a person’s answer about themselves or somebody else.
5. There is some way of indicating that there was nobody to interview; not sure if this will generally be a value in the variable itself or in the data quality flags.
6. There are refusals to answer specific questions; also not sure if this will be a value in the variable or in data quality flags.
7. There are “don’t know” answers that may be distinct from refusals to answer.
8. Not in universe values, perhaps of several kinds.
For missing types 3 through 6, these values may be blank, or they may be imputed; I’d want to distinguish these cases.
I have two questions about these various kinds of missing values:
First, is this list complete, or are there other kinds of missing that are quantitatively important? (I’m not going to fuss about, e.g. fragment values).
Second, are these kinds of missing consistently coded across questions, in terms of the entries that indicate each kind of missingness, such that a data preparation function could identify and treat them uniformly? Also, are variables consistent as to how information about missingness appears, e.g. in a quality flag versus in an answer? And if so, is this information gathered somewhere that I could see?
You folks are terrific.
Peace, Andrew Hoerner
I’ll do my best to fully address your questions here. While you label this list as “missingness,” it is worth noting that it really covers several distinct categories: top/bottom-coded data, not in universe (NIU) codes, allocated information, unknown or refused answers, and missing data. That said, your list is nearly complete with respect to the types of systematic coding that happen in our data harmonization process, with one exception: quality flags that indicate data allocation. For example, QSEX in IPUMS CPS addresses various types of possible allocations.
As for uniformity in how these different types of data are coded, there is some consistency, but you will need to consult the documentation for each individual variable. For example, in income and tax variables, NIU cases are coded as all 9s, while missing cases are 9s that end with an 8 (e.g. 999998). Here is the overall structure for IPUMS CPS income and tax variables. Detailed top coding information for IPUMS CPS can be found here.
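To make the income and tax pattern concrete, here is a minimal Python sketch. It assumes a variable whose codes are 6 digits wide, as in the 999998 example; the function name `classify_special_value` and the `width` parameter are illustrative only, and the actual width and codes need to be confirmed against each variable's documentation.

```python
def classify_special_value(value, width=6):
    """Label a raw income/tax value as 'blank', 'niu', 'missing', or 'valid'.

    Assumption: NIU is all 9s at the variable's width and missing is the
    same number ending in 8. Verify both against the variable's codes page.
    """
    if value is None:
        return "blank"
    niu_code = int("9" * width)   # e.g. 999999 when width is 6
    missing_code = niu_code - 1   # e.g. 999998 when width is 6
    if value == niu_code:
        return "niu"
    if value == missing_code:
        return "missing"
    return "valid"


for raw in (45000, 999998, 999999, None):
    print(raw, classify_special_value(raw))
```

Keeping the label separate from the number avoids accidentally averaging a special code into an income statistic.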
Similarly, there are set allocation procedures, but the coding varies from variable to variable. For example, compare QSEX to QEDUC. You will see that a code for one does not necessarily carry the same meaning in another.
Again, using EDUC as an example, you can see that NIU is represented there by code 001 (“NIU or blank”).
There are plans in place for making these different special values more systematically identifiable, but for the time being you will need to reference the documentation for each variable.
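Until that happens, one way to approximate a uniform data preparation step is a hand-maintained lookup table keyed by variable name. The sketch below is only an illustration: `SPECIAL_CODES`, `split_value`, and the placeholder variable `INCOME_VAR` are hypothetical names, and the table holds just the two examples mentioned in this thread. Every entry would have to be transcribed from the documentation for that variable.

```python
# Hypothetical lookup table: variable name -> {raw code: status}.
# Only the examples from this thread are filled in; INCOME_VAR is a
# placeholder name, not a real IPUMS variable.
SPECIAL_CODES = {
    "EDUC": {1: "niu"},                                # code 001, "NIU or blank"
    "INCOME_VAR": {999999: "niu", 999998: "missing"},  # assumed 6-digit width
}


def split_value(variable, raw):
    """Return (value, status) so special codes never get treated as data."""
    if variable not in SPECIAL_CODES:
        # Fail loudly rather than guess codes for an undocumented variable.
        raise KeyError(f"no special codes recorded for {variable}")
    if raw is None or raw == "":
        return None, "blank"
    code = int(raw)
    status = SPECIAL_CODES[variable].get(code)
    if status is not None:
        return None, status
    return code, "valid"


row = {"EDUC": "001", "INCOME_VAR": 52000}
print({name: split_value(name, raw) for name, raw in row.items()})
```

Raising an error for variables that are not in the table is a deliberate choice here, since a code in one variable does not necessarily carry the same meaning in another.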
Would I be correct in assuming that “to allocated value” refers to assignment by hot deck or something similar, while “to longitudinal value” means that the same question regarding the same person has been answered in a more revealing way in some other period?
This Census CPS documentation provides a table for the interpretations of different types of allocation. So, yes, “to allocated value” refers to something such as hot deck allocation. “To longitudinal value” means that the variable was changed to the corresponding allocated value from a prior CPS interview.
You may also be interested in the IPUMS documentation on data editing and allocation, found here.