A couple of ipumsr memory questions

I’m trying to understand why I am running out of memory when using read_ipums_micro() to translate an IPUMS-CPS compressed data extract into an R object, and what I can do about it. A code snippet showing the core of the function I am using is below. I have 20 GB of RAM, and I have successfully unpacked IPUMS-CPS extracts as large as 500 MB (decompressed size 2.4 GB), but it is choking on an 800 MB extract (decompressed size 5.4 GB – note the higher compression ratio).

I figure 3 to 5 GB of RAM goes to system processes, the web browser, etc. That implies the read-translate-save process takes roughly three times as much RAM as the decompressed size of the csv.gz extract. Is that just what it takes, so that I have to cut this extract in two (and another in three or four) if I hope to use it? Is there anything I can do to make the decompression and translation consume less memory, or, say, to save the result to disk as it goes along instead of all at the end?

Ultimately I intend to get everything into PostgreSQL, but I am not ready for that yet.

Also, I have a goal of getting my package to run on ordinary computers, for which my working definition is 3 GB of RAM. I want to be able to tell people what they can and cannot do as far as extracts go. If the “three times decompressed size” rule suggested above holds true, that would appear to imply a maximum compressed file size of about 100 MB to be confident of successful unpacking. Does that conform to your experience?

library(ipumsr)

# Read the DDI codebook, then the microdata it describes
this_ddi <- read_ipums_ddi(this_xml)
this_data <- read_ipums_micro(this_ddi, verbose = FALSE)
if (saveas == "RDS") {
  this_path <- paste0(nm, ".RDS")  # paste0() uses no separator, so sep = "" is unnecessary
  print(this_path)
  saveRDS(this_data, file = this_path)
}

Actually, I believe I was mistaken about the nature of the problem here. It appears that my file names got out of sync with the file contents, and that these two files are just the household- and person-level replicate weights. I’m going to see whether I can solve this using your “chunked” read micro functions, by bind_rows()-ing subsets of the replicate weights, read by rows, back into complete columns; a rough sketch of what I have in mind is below. If this works I’ll post the final code here.
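For what it’s worth, this is roughly what I mean – a minimal sketch using read_ipums_micro_chunked() with an IpumsDataFrameCallback. The column names (SERIAL, the REPWT* replicate weights) and the this_xml path are placeholders for whatever is actually in my extract:

library(ipumsr)
library(dplyr)

this_ddi <- read_ipums_ddi(this_xml)  # this_xml is a placeholder path

# The callback runs on each chunk of rows; its results are row-bound back
# together, so only the selected columns accumulate in memory.
keep_weights <- IpumsDataFrameCallback$new(function(chunk, pos) {
  chunk %>% select(SERIAL, starts_with("REPWT"))  # placeholder variable names
})

repwts <- read_ipums_micro_chunked(
  this_ddi,
  callback   = keep_weights,
  chunk_size = 100000,
  verbose    = FALSE
)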

I don’t think there’s a hard and fast rule to know how much memory you’ll need based on the size of the file.

The size of the uncompressed text file does not necessarily correspond to the size of the data in memory in R, because R converts the text into its own data types. Plus, different versions of R behave differently on this, but many functions will create copies of the data at times that can be surprising. So if you’re running close to the limit of your memory, it can be really easy to accidentally put yourself over the edge. Most of my knowledge on this comes from Hadley Wickham’s Advanced R book - http://adv-r.had.co.nz/memory.html
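If you want to see what a given extract actually costs you, a quick base-R check like this can help (this_csv_gz and this_data here stand in for your own file path and object):

# On-disk size of the compressed extract, in GB (this_csv_gz is a placeholder path)
file.size(this_csv_gz) / 1e9

# In-memory size of the parsed data
print(object.size(this_data), units = "auto")

# tracemem() reports whenever R duplicates the object, which is where the
# surprising extra copies tend to show up
tracemem(this_data)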

It looks like you’re on the right track, but if you’re concerned about running into memory limits, I really think either a database or the chunked functions is the better way to go.
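Since you mentioned that PostgreSQL is the eventual destination, here is a rough sketch of how the two approaches combine – streaming chunks straight into a database table so the full extract never sits in memory at once. The connection details and table name are placeholders, and I’m using DBI/RPostgres here only as an example backend:

library(ipumsr)
library(DBI)

con <- dbConnect(RPostgres::Postgres(), dbname = "cps")  # placeholder connection

this_ddi <- read_ipums_ddi(this_xml)

# IpumsSideEffectCallback runs the function on each chunk and discards the
# result, so each chunk is appended to the table and then freed.
write_chunk <- IpumsSideEffectCallback$new(function(chunk, pos) {
  # zap_labels() strips the IPUMS value labels, which most DB drivers can't store
  dbWriteTable(con, "cps_extract", haven::zap_labels(chunk), append = TRUE)
})

read_ipums_micro_chunked(this_ddi, callback = write_chunk,
                         chunk_size = 100000, verbose = FALSE)

dbDisconnect(con)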

(Also note that in the development version of ipumsr, I’m working on a slightly more flexible way to get chunks, called yields - you can see the beginning of my documentation here: https://github.com/mnpopcenter/ipumsr…)
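Just to give a flavor of the idea (this is still in development, so the interface may well change, and I’m assuming a function name along the lines of read_ipums_micro_yield()): instead of handing control to a callback, you pull rows whenever you want them.

# Hypothetical yield-style loop
yield <- read_ipums_micro_yield(this_ddi)
while (!yield$is_done()) {
  chunk <- yield$yield(n = 100000)
  # ... do something with chunk ...
}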