Estimating Storage Requirements for Complete IPUMS Database

I am embarking on a project to build a system specifically designed to handle the entirety of the IPUMS databases. This includes IPUMS-USA (ACS and decennial Census data), IPUMS CPS, IPUMS International, and any other datasets that are part of the IPUMS collection.

My goal is to download all available years of data for each IPUMS dataset, including all available variables. I understand that this will result in a massive amount of data, and I am trying to estimate the storage requirements to accommodate this.

Could you please provide an estimate of the total file size for all years of uncompressed IPUMS data, with all variables included, across all the different IPUMS datasets? Additionally, do you have any recommendations for additional storage that might be required for data processing and analysis?

I appreciate your assistance and look forward to your guidance.

Thanks for your interest in IPUMS data. I don’t have a direct or definitive answer to your question; we natively use compressed formats of the data because of the massive amount of data we have. Without more clarity on your intended application of the data, I will share two general thoughts. First, given the pace at which IPUMS releases new data or incremental updates, the database you allude to would quickly become out of sync with the actual IPUMS databases; beyond missing new samples, updates to variable codes to accommodate new samples would render the two versions incompatible rather than just leaving yours with fewer variables or observations included. Second, I encourage you to read the terms of use for IPUMS data; please pay particular attention to the redistribution clauses as well as specific restrictions, particularly those around non-commercial use, for a subset of data collections.

My interest is purely academic. However, your interface is very hard to use. Despite having software engineering experience, I find it almost impossible to work out how to obtain any particular dataset with your API. You have multiple required fields with obscure identifier names, and it’s not clear how your metadata corresponds to the datasets. Some of the examples in the docs do not work. I don’t know how other economists are able to use your API. If you allowed users to download the JSON file of the Data Extractor query instead of dat, csv, etc., then it would demystify your API. We could just copy the JSON and use it instead of having to decipher what is or isn’t needed in specific random examples in the API docs.

I understand and appreciate your organization’s robust and perfectionist ideal for every use case, but all I want to do is name the dataset I want, name the variables I want, then read the data into memory for analysis. It would be easier for me to just download the entire datasets manually, then create my own database to query the data I need. There is no practical reason for me to update everything incrementally, except once a year when the next survey is released.

To partially answer my question, focusing on just ACS IPUMS: a 1-year sample is under 12 GB when read into memory, so 20 years is about 240 GB. A 5-year sample is under 60 GB, so 20 of the 5-year samples come out to around 1,200 GB. Together that is roughly 1,440 GB, which easily fits on a 2 TB SSD with space left over for a few more years. A modern workstation could fit all the data into memory, but more investigation is needed to determine the capacity needed to holistically consider the entire IPUMS archive at once.
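
For anyone who wants to rerun this arithmetic, here is a minimal Python sketch. The per-sample sizes are the upper bounds quoted above; the constant names are made up for illustration, and counting a 2 TB drive as 2,000 GB is my assumption.

```python
# Rough ACS storage estimate from the upper-bound figures quoted above.
GB_PER_1YR_SAMPLE = 12   # "under 12 GB" per 1-year sample, read into memory
GB_PER_5YR_SAMPLE = 60   # "under 60 GB" per 5-year sample
N_SAMPLES_EACH = 20      # 20 samples of each kind, as in the estimate above
SSD_CAPACITY_GB = 2000   # assumption: a 2 TB drive counted as 2,000 GB

one_year_total = GB_PER_1YR_SAMPLE * N_SAMPLES_EACH
five_year_total = GB_PER_5YR_SAMPLE * N_SAMPLES_EACH
total = one_year_total + five_year_total
headroom = SSD_CAPACITY_GB - total

# Assumption: each future release year adds one 1-year and one 5-year sample.
extra_years = headroom // (GB_PER_1YR_SAMPLE + GB_PER_5YR_SAMPLE)

print(f"1-year samples: {one_year_total} GB")
print(f"5-year samples: {five_year_total} GB")
print(f"Total:          {total} GB")
print(f"Headroom:       {headroom} GB (about {extra_years} more release years)")
```

This ignores filesystem overhead and any working copies made during processing, so the real headroom will be smaller.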

Thanks for the follow-up message and clarification. It sounds like you are trying to download IPUMS microdata via the IPUMS API and running into issues. Please correct me if I am wrong, for example if you are also trying to access summary file data via the API or have also run into issues using the traditional web interface. Note that we welcome feedback on the API via email (ipums+api@umn.edu) or via the API topic on the User Forum.

If there are specific errors or issues in the API examples and documentation, please let us know about them so we can correct them. Without further detail about the examples or documents that aren’t working, I can share a few general comments that are hopefully useful to you.

First, we provide native-client libraries for R and Python users interested in working with the IPUMS API in those languages. These native-client libraries are designed to streamline the process of submitting API requests without getting into all of the details of the API interface (i.e., just name the variables and samples of interest and submit), bypassing the need to create a JSON definition directly (though these libraries also include functionality that allows users to export the resulting JSON definition). We also provide JSON versions of extract requests submitted via the web interface upon direct request; please contact ipums@umn.edu with the data collection and extract number for which you would like a JSON file if you are interested.

Second, the IPUMS API is relatively new and is currently limited to providing data access for household-person microdata (currently available for IPUMS USA, IPUMS CPS, and IPUMS International) as well as data and metadata access to summary data (currently available for IPUMS NHGIS). I am sharing a full list of supported and unsupported features. We are aware that metadata access via API for the microdata collections will be valuable to users; in the meantime, on the household-person microdata page of our developer portal we provide suggestions for accessing key metadata.

Third, note that IPUMS provides data from a variety of sources and in different formats through our User Interface. These are unique data collections and while they share an underlying metadata structure, the user-facing metadata between the different collections is not designed to be used in conjunction.

Finally, it sounds like you were able to make headway in determining the necessary storage requirements using the 1-year ACS as a reference. As you benchmark storage needs against the ACS PUMS, I will note that our data structures and the sizes of the datasets are very diverse, which makes back-of-the-envelope calculations about the size of the entire IPUMS data collections difficult. Our microdata database contains 2.5 billion records across more than 2,500 datasets, and our summary file data contain nearly 500 billion data cells. Some microdata files will be higher density than the ACS (full count files, some IPUMS International samples), while others will be lower-density samples but include more variables (CPS, NHIS, MEPS, Time Use, DHS, PMA). IPUMS NHGIS includes many millions of geographic identifiers describing geographic areas throughout the U.S. at dozens of levels, going down to individual blocks.

I hope this helps. Please follow up with questions.