Some questions re converting ipumsr variables from haven/labeled vectors to Boolean, numeric, factors, etc

I am trying to write a few “first cut” functions to convert variables that ipumsr supplies as haven-style labeled vectors to pairs of vectors, as follows:

First, certain data entries denote values that are in some sense missing, but these missing values are of many types and the type information is often important. I propose to preserve that information in a parallel vector of the same length, which records the form of missingness for each missing value and otherwise holds a standard placeholder, probably an empty string. This will include incorporating quality flag information in some cases. The original vector will have its values replaced by NAs in these locations.
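A minimal sketch of this first step, using only haven. The helper name `split_missing`, the toy codes, and the labels are assumptions made up for illustration; real missing codes have to come from the variable's documentation.

```r
library(haven)

# Hypothetical helper: split a labelled vector into (a) the values with
# missing codes set to NA and (b) a parallel character vector giving the
# reason for missingness ("" where the value is not missing).
split_missing <- function(x, missing_codes) {
  codes  <- zap_labels(x)            # plain codes, labels stripped
  labels <- attr(x, "labels")        # named vector: names = label text, values = codes
  is_missing <- codes %in% missing_codes

  # Look up the label text for each missing code; fall back to the code itself
  # when a missing code happens to have no label.
  reason <- rep("", length(codes))
  label_text <- names(labels)[match(codes[is_missing], labels)]
  reason[is_missing] <- ifelse(is.na(label_text),
                               as.character(codes[is_missing]),
                               label_text)

  # Rebuild the labelled vector without the missing codes or their labels.
  values <- labelled(replace(codes, is_missing, NA),
                     labels[!labels %in% missing_codes])
  list(values = values, reason = reason)
}

# Toy data; the codes below are invented for this example.
inctot <- labelled(
  c(25000, 99999999, 0, 99999998, 41000),
  labels = c("Missing" = 99999998, "N.I.U." = 99999999)
)
out <- split_missing(inctot, c(99999998, 99999999))
```

Here `out$values` carries NA wherever a missing code appeared, and `out$reason` keeps the label text for those positions. Quality flags could be pasted into the same `reason` vector, though how to combine them is left open here.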

Second, once the missing values are removed, I want to distinguish, as best I can, among:

  1. binary-valued variables (e.g. received child support);
  2. numerical-valued vectors (income components);
  3. unordered-factor-like vectors (race, geography); and
  4. ordered factor-like variables (age, and perhaps income ranges rather than values for some of the older data).
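For concreteness, the four target types might be produced along these lines once the type of each variable has been decided by hand. This is only a sketch with haven; every code and label in the toy vectors is invented, and which code means "yes" must be read from the codebook.

```r
library(haven)

# Toy labelled vectors; codes and labels are made up for illustration.
got_support <- labelled(c(1, 2, 2, 1), c("No" = 1, "Yes" = 2))
income      <- labelled(c(25000, 0, 41000), c("None" = 0))
race        <- labelled(c(100, 200, 100, 651),
                        c("White" = 100, "Black" = 200, "Asian" = 651))
age_group   <- labelled(c(2, 1, 3, 2),
                        c("Under 18" = 1, "18 to 64" = 2, "65 and over" = 3))

# 1. Binary: compare the raw code against the code that means "yes".
support_lgl <- zap_labels(got_support) == 2

# 2. Numeric: drop the labels and keep the values.
income_num <- as.numeric(zap_labels(income))

# 3. Unordered factor: label text becomes the factor levels.
race_fct <- as_factor(race)

# 4. Ordered factor: same conversion, with levels ordered by code.
age_ord <- as_factor(age_group, ordered = TRUE)
```

The ordered conversion assumes the numeric codes already run in the intended order, which is worth checking for each variable.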

I’ll be looking at the meta-data files, and assuming that I will need to do some of this by hand. But if you are able to give me a leg up as to whether there are meta-data descriptors that militate for particular data types, I’d be grateful for whatever insight you might be able to share.

Then I have a bunch of questions about missing data descriptions that I will post as a separate question.

Warmest regards, Andrew Hoerner

Have you seen the value-labels vignette? It discusses some functions provided by ipumsr to help with this process. In particular, something like:

lbl_na_if(cps$INCTOT, ~ .val >= 99999990)

is often a great way to convert the missing values (which IPUMS generally codes as large numbers starting with 9) to NA. Several other packages, such as labelled and sjlabelled, also offer alternative interfaces for working with labelled data.
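To make that concrete, here is a small self-contained sketch on a toy vector standing in for cps$INCTOT. The codes and labels are invented for the example; the predicate can test either the value (`.val`) or the label text (`.lbl`).

```r
library(haven)
library(ipumsr)

# Toy stand-in for cps$INCTOT; codes and labels are made up for this example.
inctot <- labelled(
  c(25000, 99999999, 0, 99999998, 41000),
  labels = c("Missing" = 99999998, "N.I.U." = 99999999)
)

# By value: everything at or above the threshold becomes NA.
by_value <- lbl_na_if(inctot, ~ .val >= 99999990)

# By label text: only the "N.I.U." code becomes NA; "Missing" is kept.
by_label <- lbl_na_if(inctot, ~ .lbl %in% c("N.I.U."))
```

Filtering by label is more selective, which may matter if the different kinds of missingness need to be handled separately.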

As you allude to, this process does require a fair amount of time and attention to decide how best to encode the data for your analysis. While it would be nice if you didn’t have to do this, in general our philosophy is that each analysis has different requirements and so there is no way we could provide data that didn’t require some recoding. Therefore we strive to provide documentation that makes it as painless as possible for you to do so.

For the second part, no, I don’t think we provide anything explicit that would distinguish between those types of variables quickly; instead, it will require you to dive into the documentation.
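That said, a rough first pass can be made from the value labels themselves. The function below is a hypothetical heuristic, not something ipumsr provides: it assumes that a variable whose observed values are all labelled is categorical, and that a variable with unlabelled values is numeric with a few labelled special codes. Its guesses still need checking against the documentation, and it cannot tell ordered from unordered categories.

```r
library(haven)

# Hypothetical heuristic (not an ipumsr feature): guess a variable's type
# from the share of distinct observed values that carry a value label.
guess_type <- function(x) {
  codes <- zap_labels(x)
  codes <- unique(codes[!is.na(codes)])
  if (length(codes) == 0) return("unknown")

  labelled_codes <- unname(attr(x, "labels"))
  share_labelled <- mean(codes %in% labelled_codes)

  if (share_labelled == 1 && length(codes) <= 2) {
    "binary"
  } else if (share_labelled == 1) {
    "categorical"   # ordered or unordered: check the codebook
  } else {
    "numeric"       # labelled values are probably special codes
  }
}

# Toy inputs; codes and labels are invented for illustration.
yes_no <- labelled(c(1, 2, 2), c("No" = 1, "Yes" = 2))
region <- labelled(c(11, 12, 21), c("A" = 11, "B" = 12, "C" = 21))
income <- labelled(c(25000, 0, 99999999), c("N.I.U." = 99999999))

guesses <- c(guess_type(yes_no), guess_type(region), guess_type(income))
```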