Parsing NIU and Missing codes in the DDI file

I am working with ipumsr to parse an IPUMS extract file, and I have a question about some of the variable metadata.

For categorical variables, most of the information is contained in <catgry> tags, which specify each level or group of the variable along with its numeric code. For example, the code 1 corresponds to the month of January.

However, some variables seem to have categorical-like data saved in the <codInstr> tag instead. For example, the variable INCTOT (“Total Personal Income”) has Codes999999999 = N.I.U.\n999999998 = Missing. (1962-1964 only).

It seems a bit odd to include some categorical information in the codInstr tag. I was wondering how common this case is. Are there other variables that have categorical codings listed in the codInstr tag? Are the references in that tag usually limited to NIU and Missing info, or could there be other codings as well?

Thanks for any help you can provide.

Yes, it is relatively common for continuous IPUMS USA variables to also include values that are categorical. You can view the labels for these values using ipums_val_labels() or by printing the INCTOT variable (e.g. if your data frame is named data, you could run data$INCTOT). Information from both the catgry and codInstr tags is captured there.
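For anyone writing their own parser rather than relying on ipumsr, here is a minimal sketch (in Python rather than R) of pulling "code = label" pairs out of the codInstr text. The helper name parse_cod_instr and the regular expression are illustrative assumptions, not ipumsr's actual implementation, and the sample string is simply the INCTOT text quoted in the question with \n treated as a real line break.

```python
import re

# Hypothetical helper (not part of ipumsr): extract "code = label" pairs
# from the free text of a <codInstr> element. An optional leading "Codes"
# header on the line is skipped.
_PAIR = re.compile(r"^\s*(?:Codes)?\s*(-?\d+)\s*=\s*(.+?)\s*$")


def parse_cod_instr(text):
    """Return {code: label} for every line that looks like 'digits = label'."""
    labels = {}
    for line in text.splitlines():
        match = _PAIR.match(line)
        if match:
            labels[int(match.group(1))] = match.group(2)
    return labels


# The INCTOT text quoted in the question.
inctot_text = "Codes999999999 = N.I.U.\n999999998 = Missing. (1962-1964 only)"
inctot_labels = parse_cod_instr(inctot_text)
```

Lines that do not match the pattern are ignored, so free-form notes inside codInstr would not produce labels. Real codInstr text may follow other layouts, so a pattern like this should be checked against the variables in your own extract.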

These codes are also listed in the codes tab for each variable. The codes tab for INCTOT notes these as specific variable codes. Such codes are most commonly used for missing and NIU data, but they are also used in other cases, such as bottom and top codes. For example, all respondents who report an INCTOT value above a certain threshold are assigned the same top code by the Census Bureau to preserve confidentiality (see the IPUMS User Guide page on threshold values). The top code is also noted in the codes tab on the website. We recommend that users review the codes tabs for all variables in their extract in order to code respondents correctly for their analysis.
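As a rough illustration of why this matters, the sketch below (in Python) recodes the two INCTOT special codes quoted in the question to None before averaging. The function names and the income values are made up for illustration, and other variables or samples will use different special codes.

```python
# Codes taken from the INCTOT codInstr text quoted in the question; other
# variables and samples use different special codes, so check each codes tab.
INCTOT_SPECIAL_CODES = {999999999, 999999998}


def recode_special(values, special_codes):
    """Replace special codes (NIU, missing) with None; keep real values."""
    return [None if v in special_codes else v for v in values]


def mean_ignoring_missing(values):
    """Average the non-missing values, or return None if there are none."""
    kept = [v for v in values if v is not None]
    return sum(kept) / len(kept) if kept else None


# Made-up incomes for illustration only (not real IPUMS data).
raw_inctot = [25000, 999999999, 40000, 999999998]
clean_inctot = recode_special(raw_inctot, INCTOT_SPECIAL_CODES)
```

Averaging raw_inctot directly would treat 999999999 and 999999998 as real incomes and inflate the result badly, while mean_ignoring_missing(clean_inctot) uses only the two real values. Top and bottom codes need separate handling, since they stand for real but censored values rather than missing ones.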

Thank you, Ivan. This was very helpful, and I understand now. I am writing a package to process some of the IPUMS data and was just trying to understand the layout. Thanks for your help.