I’m trying to understand why I am running out of memory when using read_ipums_micro() to translate an IPUMS-CPS compressed data extract into an R object, and what I can do about it. A code snippet showing the core of the function I am using is below. I have 20 GB of RAM, and I have successfully unpacked IPUMS-CPS extracts as large as 500 MB (decompressed size 2.4 GB), but it is choking on an 800 MB extract (decompressed size 5.4 GB – note the higher compression ratio).
I figure 3 to 5 GB of RAM for system stuff, web browser, etc., which leaves roughly 15 GB for R; since the 2.4 GB (decompressed) extract works and the 5.4 GB one does not, the read-translate-save process seems to take about three times as much RAM as the decompressed size of the csv.gz extract. Is that just what it takes, so I have to cut this extract in two (and another into three or four) if I hope to use it? Or is there something I should do to make the decompression and translation consume less memory, or, say, to save the result to disk as it goes along instead of all at the end? (Something like the chunked-reading sketch after my snippet below is what I have in mind.)
Ultimately I intend to get everything into PostgreSQL, but I am not ready for that yet.
Also, I have a goal of getting my package to run on ordinary computers, for which my working definition is 3 GB of RAM, and I want to be able to tell people what they can or cannot do as far as extracts. If the “three times decompressed size” rule suggested above holds true, that would appear to imply a maximum compressed file size of about 100 MB to be confident of successful unpacking. Does that conform to your experience?
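For concreteness, here is the back-of-envelope arithmetic behind that 100 MB figure (the overhead and compression numbers are rough guesses on my part, not measurements):

ram_gb       <- 3     # the "ordinary computer" target
overhead_gb  <- 1     # OS, R itself, etc. (a guess)
ram_multiple <- 3     # observed: ~3x the decompressed extract size
compression  <- 6     # decompressed/compressed ratio I am seeing (roughly 5-7x)

max_compressed_mb <- (ram_gb - overhead_gb) / (ram_multiple * compression) * 1000
max_compressed_mb     # about 111 MB, i.e. roughly the 100 MB figure above

And here is the core of my current function: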
library(ipumsr)

# read the DDI codebook, then the full microdata extract into memory
this_ddi  <- read_ipums_ddi(this_xml)
this_data <- read_ipums_micro(this_ddi, verbose = FALSE)

if (saveas == "RDS") {
  this_path <- paste0(nm, ".RDS")   # paste0() needs no sep argument
  print(this_path)
  saveRDS(this_data, file = this_path)
}
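In case it helps clarify what I mean by “save it to disk as it goes along”: the kind of thing I have in mind is sketched below, based on my reading of the ipumsr documentation for read_ipums_micro_chunked() and IpumsSideEffectCallback. The chunk size, output file name, and the label-stripping step are just my guesses, not something I have tested on the 800 MB extract:

library(ipumsr)

this_ddi  <- read_ipums_ddi(this_xml)
chunk_csv <- paste0(nm, "_chunks.csv")    # hypothetical output path

# write each chunk to disk as it is read, so only one chunk sits in RAM
append_chunk <- IpumsSideEffectCallback$new(function(x, pos) {
  x <- haven::zap_labels(x)               # a csv cannot store value labels anyway
  readr::write_csv(x, chunk_csv, append = pos > 1)
})

read_ipums_micro_chunked(this_ddi, append_chunk,
                         chunk_size = 500000, verbose = FALSE)

Would something along those lines keep memory use down to roughly one chunk at a time, or does the chunked reader have its own overhead I should plan for?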