Alternatives to Stata for ACS analysis

Is anyone actually using anything other than Stata to work with ACS on a regular basis? Python seems to have close to no survey data support. The R survey package is way too slow to use on an ACS 1% sample. Is everyone just using Stata?

Thanks for your patience – we are catching up on user queries after a holiday and then a weather-related closure in Minnesota. While you are asking about the ability to work with ACS microdata in different statistical packages, it sounds like perhaps the underlying issue is more about processing speed and less about suitability of different statistical packages for working with these data. I will address both items, but please follow up if I have misunderstood your question.

While Stata certainly has many commands that are well suited to working with ACS PUMS files from IPUMS USA or other person-level microdata files, there is largely comparable functionality in SAS, SPSS, and R. Because you mention R specifically, I would recommend the srvyr package, as it has a bit more functionality for certain aspects of these types of data (e.g., support for replicate weights). While I am not aware of a Python equivalent to srvyr, I have seen examples where users directly implement some of the functionality that exists in Stata commands or R packages; for an example, see this forum post about applying replicate weights for the CPS ASEC data in Python.
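To give a rough idea of what such a direct implementation can look like, here is a minimal Python sketch (it is not the code from the linked post). It assumes the extract includes the person weight PERWT and the 80 ACS replicate weights REPWTP1 through REPWTP80, and that the standard error is computed as the square root of 4/R times the sum of squared differences between each replicate estimate and the full-sample estimate. The function names and the 0/1 indicator column are hypothetical, and you should confirm the formula against the replicate weight documentation for your samples.

```python
import math


def weighted_total(rows, value_key, weight_key):
    """Sum of value * weight over all records."""
    return sum(float(r[value_key]) * float(r[weight_key]) for r in rows)


def replicate_se(rows, value_key, main_weight="PERWT",
                 rep_prefix="REPWTP", n_reps=80):
    """Return (estimate, standard error) for a weighted total.

    rows must be a list of dict-like records (not a one-shot iterator),
    because it is traversed once per weight column.
    Assumes the successive-difference factor 4 / n_reps.
    """
    full = weighted_total(rows, value_key, main_weight)
    squared_diffs = 0.0
    for i in range(1, n_reps + 1):
        rep = weighted_total(rows, value_key, f"{rep_prefix}{i}")
        squared_diffs += (rep - full) ** 2
    return full, math.sqrt(4.0 / n_reps * squared_diffs)
```

If value_key names a 0/1 indicator (for example, a hypothetical column flagging the population of interest), the estimate is a weighted population count. For millions of records, a vectorized version of the same arithmetic would likely be much faster than this pure-Python loop.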

While each of these statistical packages has limitations, analyzing a single 1-year ACS sample should not demand much processing time, yet you noted that analyzing the data in R was particularly slow. I reviewed your most recent data extracts and noticed that they did not contain only a 1% ACS sample: each also included the 1960 5% sample as well as 11 other 1% samples. Such an extract would include more than 40 million records and would be a challenge for anyone to analyze without access to large-scale computing resources. You could reduce the size of your custom data extract by requesting samples individually or by using the select cases tool to download data only for your population of interest.
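If you would rather trim an extract you have already downloaded, a similar reduction can be approximated locally. The sketch below is one possible approach, assuming a CSV-format extract that contains the YEAR and AGE variables; the function name, the filter condition, and the tiny in-memory data are made up for illustration. It streams the file one record at a time, so the full extract never has to fit in memory.

```python
import csv
import io


def filter_extract(src, dst, keep):
    """Copy rows for which keep(row) is true from src to dst; return the count kept."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    kept = 0
    for row in reader:  # one record at a time, so memory use stays small
        if keep(row):
            writer.writerow(row)
            kept += 1
    return kept


# Tiny in-memory demonstration. With real data, pass open file handles
# instead (opened with newline="", as the csv module recommends).
demo_in = io.StringIO("YEAR,AGE,PERWT\n1960,70,20\n2019,30,100\n2019,67,95\n")
demo_out = io.StringIO()
n_kept = filter_extract(
    demo_in,
    demo_out,
    keep=lambda r: r["YEAR"] == "2019" and int(r["AGE"]) >= 65,
)
```

Here the filter keeps only 2019 records for people aged 65 and over. Selecting cases when you create the extract remains the simpler option, since it also shortens the download.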

I hope this helps!

Thanks for the detailed reply, Karl. I believe the srvyr package is just a wrapper around the survey package, and so is subject to the same performance and memory limitations.

Yes, indeed the problem is with processing speed, rather than suitability. I am finding that Stata calculates a survey-weighted total almost instantly, while survey/srvyr takes 20 minutes or so for a single total. Maybe I am doing something wrong with srvyr, but that’s what I’m seeing.

I used command-line tools to break the CSV into one file per year, so I’m only working with about 3 million records at a time.

Thanks for the clarification. While I am not an R user, I can share that my colleagues who use R have not run into the processing speed issues you describe. I am linking to the IPUMS USA page on working with replicate weights, which includes sample R code for applying them that may be helpful to you. Finally, I will note that I find running commands with Stata's svy suite is certainly slower, even on the IPUMS servers. When working with these data, I typically test and troubleshoot my code before adding the svy commands.