Problems with subsetting replicate weight variables using ipumsr::read_ipums_micro_chunked

Dear Folks
I am having some difficulty cutting my extracts containing replicate weights into more manageable chunks using read_ipums_micro_chunked. For example, this:

PR_RE1 <- read_ipums_micro_chunked(
  read_ipums_ddi("./CPS_1962-2018/cps_00177.xml"),
  IpumsDataFrameCallback$new(f),
  vars = REPWT1:REPWT40)

gave me the following error:

Error: Error in read_tokens_chunked_(data, callback, chunk_size, tokenizer, col_specs, :
  Evaluation error: values must be length 1,
  but FUN(X[[1]]) result is length 2.

Getting this to work is my top priority.
But what I would ultimately like to do is something more like this. I am confident that it is not only inelegant but also broken in multiple ways, but I hope it conveys what I am trying to do:

peel <- function(ddi_names, vars_lst, human_names,
                 suf = seq_along(vars_lst[[1]]), path = "./") {
  if (length(ddi_names) != length(human_names)) stop(
    "Length of ddi_names and human_names must be equal")
  for (i in seq_along(ddi_names)) {
    # Output file names for this extract, e.g. "./HH_RE1.RDS"
    out_names <- paste0(path, human_names[[i]], suf, ".RDS")
    for (j in seq_along(vars_lst[[i]])) {
      saveRDS(
        read_ipums_micro_chunked(
          read_ipums_ddi(paste0(path, ddi_names[[i]], ".xml")),
          IpumsDataFrameCallback$new(f),
          vars = vars_lst[[i]][[j]]),
        file = out_names[[j]])
    }
  }
}

Which is to say: for each set of replicate weight variables, read, convert, reassemble, and save a bunch of small subsets of those variables as RDS files, each with a distinct name:

peel(ddi_names = c("cps000177", "cps000180"),
     vars_lst = list(HH_REPS = list(paste0("REPWT", 1:40),
                                    paste0("REPWT", 41:80),
                                    paste0("REPWT", 81:120),
                                    paste0("REPWT", 121:160)),
                     PR_REP  = list(paste0("REPWTP", 1:40),
                                    paste0("REPWTP", 41:80),
                                    paste0("REPWTP", 81:120),
                                    paste0("REPWTP", 121:160))),
     human_names = c("HH_RE", "PR_RE"))

This approach is a bit of a Rube Goldberg contraption, and I bet you have some two-line way of doing this with ipumsr. Well, maybe longer for the distinct names.

Hmm… Your forum software does unfortunate things to code indentation.

Can you post what f is? I’m not sure I’m following what you’re after, but there’s a good chance you should be using IpumsSideEffectCallback instead of IpumsDataFrameCallback.
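
For reference, the rough difference between the two (with placeholder logic standing in for f) is:

ddi <- read_ipums_ddi("./CPS_1962-2018/cps_00177.xml")

# IpumsDataFrameCallback keeps each chunk's return value and row-binds
# them into a single tibble at the end:
combined <- read_ipums_micro_chunked(
  ddi,
  IpumsDataFrameCallback$new(function(x, pos) x)
)

# IpumsSideEffectCallback runs the function purely for its side effects
# (pos is the row number where the chunk starts) and returns nothing:
read_ipums_micro_chunked(
  ddi,
  IpumsSideEffectCallback$new(function(x, pos) {
    saveRDS(x, paste0("chunk_starting_at_", pos, ".rds"))
  })
)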

I’m not sure I know what question you are asking. Let me give a more detailed description of what I want the function to do, so you can avoid working your way through my spaghetti code.

For each replicate weight file, I have a short_name prefix, HH_RE & PR_RE.

I want to split each replicate weight file up into n variable ranges, here 4. I do this by handing it a list of ranges like REPWT41:REPWT80. I also have four suffixes, just the numbers 1-4.

I read the whole file in 10K chunks, converting to R format as I go and retaining the metadata.

But to keep the memory load down, I only keep the subset of the replicate weights I am working on at the moment (like REPWT41:REPWT80).

When I have finished reading the whole file, I reassemble the 10K-line chunks into a single tibble or data frame.

I save it as an RDS file, named prefix + suffix + .RDS. Then I go back to the beginning and read the next range. When I have read all the ranges for a file, I go on to the next file.

At the end of the day I have eight files, each of which contains a quarter of the variables in one set of replicate weights.
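
In code, one pass of that cycle would look roughly like this (assuming vars accepts a character vector of names, which is my reading of the docs; the path and output name are just illustrations):

ddi <- read_ipums_ddi("./cps000177.xml")          # one replicate weight file
one_range <- read_ipums_micro_chunked(
  ddi,
  IpumsDataFrameCallback$new(function(x, pos) x), # keep each chunk; rbind at the end
  vars = paste0("REPWT", 41:80),                  # only the range I'm working on
  chunk_size = 10000                              # 10K-line chunks
)
saveRDS(one_range, "./HH_RE2.RDS")                # prefix + suffix + ".RDS"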

Is that clearer, I hope?

I don’t really understand the callback functions very well, perhaps because I have never used readr. The reason I thought I wanted IpumsDataFrameCallback rather than the side-effect version is that I believed the former, but not the latter, reassembled the chunks before returning them. Is that correct? I’d rather do the reassembling and then save, rather than saving, re-reading, assembling, and re-saving.

The true, if somewhat embarrassing, answer to your question about the meaning of f is that I was just parroting some readr code using callbacks without understanding it. In the past, when I have dealt with the problem of files too big for memory, I have usually used connections: I would open a read connection to the big file, read a chunk, pull out some variables, open a write connection to my new, smaller file, and then loop around, keeping both connections open, until I got to the end of the big file.
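
Something like this, schematically (the file names and kept columns are made up):

con_in  <- file("big_file.csv", open = "r")
con_out <- file("small_file.csv", open = "w")
header  <- strsplit(readLines(con_in, n = 1), ",")[[1]]
keep    <- which(header %in% c("VAR1", "VAR2"))  # hypothetical variables
writeLines(paste(header[keep], collapse = ","), con_out)
repeat {
  lines <- readLines(con_in, n = 10000)          # one 10K-line chunk
  if (length(lines) == 0) break
  fields <- strsplit(lines, ",")
  writeLines(vapply(fields, function(flds) paste(flds[keep], collapse = ","),
                    character(1)), con_out)
}
close(con_in)
close(con_out)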

But in this case, following that strategy would leave me without the benefits of ipumsr’s translation/formatting/metadata.

Oh sorry, no need to feel embarrassed. I just meant that I think part of your code is missing. The code you posted includes this:

IpumsDataFrameCallback$new(f),

But you haven’t included the definition of f, so when I try to run it, I get this error:

Error in .subset2(public_bind_env, "initialize")(...) :
  object 'f' not found

Rather than let this answerbase software mangle the code formatting, I’ve posted 3 files in a GitHub gist here:

https://gist.github.com/gergness/060f…

The first of the 3 files is most similar to what I think you’re describing. However, I don’t think this will save you any memory overhead. It is still loading the full dataset into memory.

The second of the 3 files doesn’t use chunks, but makes one pass through the data for each group of variables, so it probably uses less memory than the first.
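
In outline, it does something like this (the groups and file names here are placeholders, not the gist’s exact code):

ddi <- read_ipums_ddi("cps_00177.xml")
groups <- list(paste0("REPWT", 1:40),   paste0("REPWT", 41:80),
               paste0("REPWT", 81:120), paste0("REPWT", 121:160))
for (g in seq_along(groups)) {
  # One full pass over the data per group; only these columns are kept
  dat <- read_ipums_micro(ddi, vars = groups[[g]])
  saveRDS(dat, paste0("HH_RE", g, ".RDS"))
}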

The third is the one that uses the least amount of memory. It uses file connections to save csv data during each chunk. It also shows how you could use the ddi to set the variable attributes even though they wouldn’t have been saved alongside the csv data.
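
Schematically (again with placeholder names, and assuming set_ipums_var_attributes() can take the ddi object directly):

ddi <- read_ipums_ddi("cps_00177.xml")

# Write each chunk's subset straight to csv; nothing accumulates in memory
read_ipums_micro_chunked(
  ddi,
  IpumsSideEffectCallback$new(function(x, pos) {
    readr::write_csv(x, "hh_re_subset.csv", append = pos > 1)
  }),
  vars = paste0("REPWT", 1:40)
)

# Later: read the csv back and re-attach the variable metadata from the ddi
out <- readr::read_csv("hh_re_subset.csv")
out <- set_ipums_var_attributes(out, ddi)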

Perfect, Greg! You’re a wonder!