Data and query provenance -- i.e., reliably recreating data queries for replication


Thanks for the effort in putting IPUMS together. It is a great resource for the research community.

Our current work utilizes data from a variety of IPUMS data sources, including data derived from the full count (100% sample) census datasets.

We would like to provide readers of our paper with a convenient means of recreating our results directly from the raw data.

Currently, we provide a README along with our code that walks the end user through the IPUMS interface, step by step, to recreate our data extracts. However, this is a somewhat tedious process that introduces the potential for errors and confusion.

Question: Is there a way to recreate a query from a script (for example, by uploading a YAML file) or from a stable URL? Ideally, we would be able to give researchers who want to replicate our results a single script, control file, or URL that they could then use to download the data from IPUMS.


Can variable marginals/conditionals (by geographic unit) be redistributed from the IPUMS USA full count data?

This is a neat idea, and it is functionality we could consider adding someday. At present, however, the only way to access IPUMS data is through the online data extract system.
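Until something like that exists, a replication package could at least check that a manually created extract matches the one used in the paper. The following is only a minimal sketch under stated assumptions: it assumes the extract was downloaded in CSV format with variable names as column headers and that it includes a YEAR column. The function name `check_extract` and the example variable list are hypothetical and are not part of any IPUMS tool.

```python
import csv
import io


def check_extract(csv_file, expected_variables, expected_years):
    """Compare a downloaded CSV extract against the specification in the README.

    Returns a list of human-readable problems; an empty list means the
    extract has exactly the expected columns and census years.
    Assumption: the CSV header row holds the variable names.
    """
    reader = csv.DictReader(csv_file)
    columns = set(reader.fieldnames or [])
    expected = set(expected_variables)
    problems = []

    missing = sorted(expected - columns)
    extra = sorted(columns - expected)
    if missing:
        problems.append(f"missing variables: {', '.join(missing)}")
    if extra:
        problems.append(f"unexpected variables: {', '.join(extra)}")

    # Only check the years if the extract actually carries a YEAR column.
    if "YEAR" in columns:
        years_found = {row["YEAR"] for row in reader}
        years_wanted = {str(year) for year in expected_years}
        if years_found != years_wanted:
            problems.append(
                f"years found {sorted(years_found)}, expected {sorted(years_wanted)}"
            )
    return problems


if __name__ == "__main__":
    # Stand-in for a real extract file; the rows here are made up for illustration.
    sample = io.StringIO("YEAR,SERIAL,AGE\n1940,1,30\n1940,2,41\n")
    for problem in check_extract(sample, ["YEAR", "SERIAL", "AGE", "SEX"], [1940]):
        print(problem)
```

A replicator would run this against their own download before running the analysis code, so that a mis-clicked variable or sample in the extract interface is caught early rather than surfacing as a mismatch in the results.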