Can variable marginals/conditionals (by geographic unit) be redistributed from the IPUMS USA full count data?

IPUMS is a great resource for the research community. Thanks for taking the time to put all of these datasets together.


We have a forthcoming paper that makes use of a variety of datasets from IPUMS (census data and ACS data). We are exploring various approaches for providing the data to other researchers for replication purposes and to analyze our proposed approach.

Currently, we provide a README along with our code that guides the end user through the IPUMS interface to recreate our data extracts, as an alternative to a direct URL/etc. (Data and query provenance -- i.e., reliably recreating data queries for replication). However, this is a somewhat time-consuming process for end users, and the underlying data and variable names could change over time as the data is updated. As an alternative, we are considering providing the pre-processed versions of the data in a public location, such as a Dataverse archive. While providing the full data or the aggregated results seems straightforward for academic purposes for the non-full count data (cf. Can I privately share a subset of IPUMS USA data?), we want to clarify what would be possible with the IPUMS USA full count files.


For the purposes of an academic journal article, would it be acceptable to publicly post the following processed information from the IPUMS USA full count files (e.g., 1880-1940) in a .csv (or similar) format:

For each geographic unit (e.g., MCD or county), where X and T are distinct variables (for example, RACE and LIT), provide the constructed 2x2 marginals and conditionals:

p(X), p(not X), p(T), p(not T), p(T|X), p(T|not X), N (number of people/households in the geographic unit).

For example, each line of each .csv file (one file for each X,T pair) would contain float values as follows (the values below are arbitrary and are not real values from the census):

0.2, 0.7, 0.25, 0.75, 0.98, 0.6, 1000.0

Note that the identifier for the geographic unit is not provided. Depending on the geographic unit for the particular variable pair, there could be on the order of 100 to 30,000 such lines for each (X,T) pair.
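To make the proposed file format concrete, here is a minimal sketch of how such rows could be built from person-level records. It is an illustration only, not our actual pipeline: the function names `unit_rows` and `to_csv`, the `(unit_id, x, t)` record layout, and the coding of X and T as booleans are all assumptions made for this example, and the toy records are made up. The unit identifier is used only for grouping and is never written out. A conditional is reported as NaN when its conditioning group is empty.

```python
import csv
import io
from collections import defaultdict


def unit_rows(records):
    """Build one row per geographic unit from (unit_id, x, t) records.

    x and t are booleans (hypothetical recodes of two variables).
    Each row is [p(X), p(not X), p(T), p(not T), p(T|X), p(T|not X), N].
    Rows come out in order of first appearance of each unit, and the
    unit identifier itself is not included in the output.
    """
    # counts[unit] = [n, n_x, n_t, n_x_and_t]
    counts = defaultdict(lambda: [0, 0, 0, 0])
    for unit_id, x, t in records:
        c = counts[unit_id]
        c[0] += 1
        c[1] += bool(x)
        c[2] += bool(t)
        c[3] += bool(x and t)

    rows = []
    for n, n_x, n_t, n_xt in counts.values():
        n_not_x = n - n_x
        p_x = n_x / n
        p_t = n_t / n
        # A conditional is undefined when its conditioning group is empty.
        p_t_given_x = n_xt / n_x if n_x else float("nan")
        p_t_given_not_x = (n_t - n_xt) / n_not_x if n_not_x else float("nan")
        rows.append(
            [p_x, 1 - p_x, p_t, 1 - p_t, p_t_given_x, p_t_given_not_x, float(n)]
        )
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text with no header and no identifier column."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()


# Made-up toy records for two units; these are not census values.
toy = [
    ("A", True, True), ("A", True, False), ("A", False, True), ("A", False, True),
    ("B", False, False), ("B", False, True),
]
rows = unit_rows(toy)
print(to_csv(rows))
```

In this sketch p(not X) and p(not T) are computed as complements, so the first two values of a row sum to 1, as do the third and fourth.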

Thanks for the detailed background information relating to this question. Since you are using the full count files, could you email us a sample of the files that you intend to post? This will give us a better idea of what your replication files will look like.

Thanks for the follow-up. I just sent the requested email to