Data-file size issues?

I’ve been testing the API on a USA full-count census file and have encountered what might be an issue related to the file size.

I first attempted to use pipes for the define_extract_usa(), submit_extract(), wait_for_extract(), download_extract(), and read_ipums_micro() calls. The data file downloaded fairly quickly, but the read_ipums_micro() call threw an exception. Subsequent troubleshooting revealed that the downloaded data file was incomplete (637,644,800 bytes compared to the expected 640,941,957). When I executed those functions as separate calls, not using pipes, the data file was complete, and the read_ipums_micro() call succeeded.
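For reference, the size mismatch was easy to confirm with a quick check along these lines (the file name here is just illustrative of my extract, not the exact name):

# Compare the downloaded data file's size in bytes against the complete-file
# size mentioned above (640,941,957 bytes).
expected_bytes <- 640941957
actual_bytes <- file.size("usa_00011.dat.gz")  # illustrative file name
actual_bytes == expected_bytes  # FALSE for the truncated file (637,644,800 bytes)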

I’ve only run those two trials so far, though (I’ve used the already-downloaded files for other tests), so it might be a coincidence that the exception occurred when using pipes, but there is nonetheless an issue somewhere. Perhaps the problem is with extract_is_completed_and_has_links(), for example if extract$download_links[["data"]][["url"]] is populated before the file is completely ready.
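To illustrate the kind of check I have in mind, something like the following (a rough sketch using get_extract_info(); I haven't verified that this is where a timing problem would actually show up):

# Inspect the extract's status and data download URL directly before calling
# download_extract(), to see whether the URL appears before the file is final.
info <- get_extract_info(submitted_extract)  # placeholder for the object returned by submit_extract()
info$status
info$download_links[["data"]][["url"]]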

Good morning Phil! Last week another user encountered a similar issue. We believe there’s a timeout somewhere in the process that’s causing full-count files to be cut off. When you say "fairly quickly," what do you mean? About how long did it run before it threw the exception?

The team was still working on the issue at the close of last week. Since your incomplete file size is very close to the expected size, it could just be coincidence that the download timed out shortly before completing, rather than anything specific to the use of pipes. But as you say, it could also be an issue with the client implementation. I’ll check in with both the API and API client teams so we can look at this from both perspectives, and I’ll update this thread when we know more. Thanks!

Unfortunately I don’t have the exact timing. What I know is that the data file downloaded during the second 300-second delay, so it was roughly 610-900 seconds into the run. A wait of more than 10 but less than 15 minutes is consistent with the file-system timestamps on the script I had written and on the downloaded file.

When I attempted to use the downloaded file, read_ipums_micro() seemed able to read through most of the file before throwing the exception. I also tried using read_ipums_micro_chunked(), printing each chunk as it was read, which confirmed that most of the file was readable.
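In case the detail is useful, the chunked read looked roughly like this (the DDI file name is illustrative):

# Read the microdata in chunks and print each chunk's dimensions as a side
# effect; most chunks printed fine before the failure near the end of the file.
library(ipumsr)
cb <- IpumsSideEffectCallback$new(function(x, pos) print(dim(x)))
read_ipums_micro_chunked("usa_00011.xml", callback = cb, chunk_size = 100000)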

I should clarify that the piped call to read_ipums_micro() did not throw an exception. The piped sequence of function calls failed after something like three hours because my SSH session dropped. While I was waiting for the piped sequence to complete I noticed that the data file had already been downloaded. My first thought was that perhaps read_ipums_micro() couldn’t handle so large a file, and that’s why I first tried using read_ipums_micro_chunked().

By the way, the exception that I received was:
terminate called after throwing an instance of 'Rcpp::exception'
what(): Could not close file
Aborted (core dumped)

Thanks for notifying us of this issue, Phil. Can you share your code, or at least the details of the extract you had download issues with, so I can try to reproduce the error?

The extract that I had trouble with is USA extract 11 for my account.

The code that I used to create and download it was:

library(tidyverse)
library(ipumsr)

# 1850 full-count sample and the variables requested for the extract
samples <- c("us1850c")
variables <- c(
  "SERIAL", "COUNTYICP", "GQ", "LINE", "PERNUM", "FAMUNIT", "SEX", "AGE",
  "MARRINYR", "RACE", "RACED", "BPL", "SCHOOL", "LIT", "OCC1950", "REALPROP",
  "BLIND", "DEAF", "IDIOTIC", "INSANE", "HISTID", "PAUPER", "CRIME", "URBAN",
  "STATEICP"
)

# Define, submit, wait for, download, and read the extract in one pipeline
usa_1850 <-
  define_extract_usa(
    "USA 1850 full count",
    samples,
    variables
  ) %>%
  submit_extract() %>%
  wait_for_extract() %>%
  download_extract() %>%
  read_ipums_micro()
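For comparison, the step-by-step version that worked for me was essentially the following (object names here are just illustrative):

# Same extract definition as above, but with each step run as a separate call
extract   <- define_extract_usa("USA 1850 full count", samples, variables)
submitted <- submit_extract(extract)
completed <- wait_for_extract(submitted)
ddi_path  <- download_extract(completed)  # returns the path to the DDI file
usa_1850  <- read_ipums_micro(ddi_path)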

Thanks for sharing this code, Phil! I was able to run it successfully both times I tried, so I suspect that the error you received was due to connectivity issues that prevented the full data file from being downloaded. However, if you continue to get similar errors only when running the code in pipes, let us know and we can continue troubleshooting; it could also be some difference between my machine and yours that explains why the code worked for me.

If the issue was connectivity, it would be ideal for the download_extract() function to throw an error indicating that the download has failed, so that users are not surprised when they can’t read in a (partially) downloaded file. I will consult with the API team to see if there are ways we could detect connectivity issues and inform users in this way.
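In the meantime, one possible user-side workaround is to verify that the downloaded .dat.gz decompresses cleanly before reading it. An untested sketch (file name illustrative):

# A truncated gzip file usually produces a warning or error when streamed to
# the end, so this should return FALSE for an incomplete download.
gz_is_complete <- function(path) {
  con <- gzfile(path, open = "rb")
  on.exit(close(con))
  tryCatch({
    while (length(readBin(con, what = raw(), n = 1e6)) > 0) {}
    TRUE
  }, warning = function(w) FALSE, error = function(e) FALSE)
}
gz_is_complete("usa_00011.dat.gz")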