Data-file size issues?

I’ve been testing the API on a USA full-count census file and have encountered what might be an issue related to the file size.

I first attempted to use pipes for the define_extract_usa(), submit_extract(), wait_for_extract(), download_extract(), and read_ipums_micro() calls. The data file downloaded fairly quickly, but the read_ipums_micro() call threw an exception. Subsequent troubleshooting revealed that the downloaded data file was incomplete (637,644,800 bytes compared to the expected 640,941,957). When I executed those functions as separate calls, not using pipes, the data file was complete, and the read_ipums_micro() call succeeded.
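For reference, the size mismatch was easy to confirm with a quick check along these lines (the file name here is just illustrative of my extract, not the exact name):

# Compare the downloaded data file's size in bytes against the complete-file
# size mentioned above (640,941,957 bytes).
expected_bytes <- 640941957
actual_bytes <- file.size("usa_00011.dat.gz")  # illustrative file name
actual_bytes == expected_bytes  # FALSE for the truncated file (637,644,800 bytes)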

I’ve only run those two trials so far, though (I’ve used the already-downloaded files for other tests), so it might be a coincidence that the exception occurred when using pipes, but there is nonetheless an issue somewhere. Perhaps the problem is with extract_is_completed_and_has_links(), for example if extract$download_links[["data"]][["url"]] is populated before the file is completely ready.
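To illustrate the kind of check I have in mind, something like the following (a rough sketch using get_extract_info(); I haven't verified that this is where a timing problem would actually show up):

# Inspect the extract's status and data download URL directly before calling
# download_extract(), to see whether the URL appears before the file is final.
info <- get_extract_info(submitted_extract)  # placeholder for the object returned by submit_extract()
info$status
info$download_links[["data"]][["url"]]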

Good morning Phil! Last week another user encountered a similar issue. We believe there’s a timeout somewhere in the process that’s causing full-count files to be cut off. When you say "fairly quickly," what do you mean? About how long did it run before it threw the exception?

The team was still working on the issue at the close of last week. Since your incomplete file size is very close to the expected size, it could just be coincidence that the download timed out shortly before completing, rather than anything specific to the use of pipes. But as you say, it could also be an issue with the client implementation. I’ll check in with both the API and API client teams so we can look at this from both perspectives, and I’ll update this thread when we know more. Thanks!

Unfortunately I don’t have the exact timing. What I know is that the data file downloaded during the second 300-second delay, so it was roughly 610-900 seconds into the run. A wait of more than 10 but less than 15 minutes is consistent with the file-system timestamps on the script I had written and on the downloaded file.

When I attempted to use the downloaded file, read_ipums_micro() seemed able to read through most of the file before throwing the exception. I also tried using read_ipums_micro_chunked(), printing each chunk as it was read, which confirmed that most of the file was readable.
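In case the detail is useful, the chunked read looked roughly like this (the DDI file name is illustrative):

# Read the microdata in chunks and print each chunk's dimensions as a side
# effect; most chunks printed fine before the failure near the end of the file.
library(ipumsr)
cb <- IpumsSideEffectCallback$new(function(x, pos) print(dim(x)))
read_ipums_micro_chunked("usa_00011.xml", callback = cb, chunk_size = 100000)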

I should clarify that the piped call to read_ipums_micro() did not throw an exception. The piped sequence of function calls failed after something like three hours because my SSH session dropped. While I was waiting for the piped sequence to complete I noticed that the data file had already been downloaded. My first thought was that perhaps read_ipums_micro() couldn’t handle so large a file, and that’s why I first tried using read_ipums_micro_chunked().

By the way, the exception that I received was:
terminate called after throwing an instance of 'Rcpp::exception'
what(): Could not close file
Aborted (core dumped)

Thanks for notifying us of this issue, Phil. Can you share your code, or at least the details of the extract you had download issues with, so I can try to reproduce the error?

The extract that I had trouble with is USA extract 11 for my account.

The code that I used to create and download it was:

library(tidyverse)
library(ipumsr)

# 1850 full-count sample and the variables requested for the extract
samples <- c("us1850c")
variables <- c(
  "SERIAL", "COUNTYICP", "GQ", "LINE", "PERNUM", "FAMUNIT", "SEX", "AGE",
  "MARRINYR", "RACE", "RACED", "BPL", "SCHOOL", "LIT", "OCC1950", "REALPROP",
  "BLIND", "DEAF", "IDIOTIC", "INSANE", "HISTID", "PAUPER", "CRIME", "URBAN",
  "STATEICP"
)

# Define, submit, wait for, download, and read the extract in one pipeline
usa_1850 <-
  define_extract_usa(
    "USA 1850 full count",
    samples,
    variables
  ) %>%
  submit_extract() %>%
  wait_for_extract() %>%
  download_extract() %>%
  read_ipums_micro()
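For comparison, the step-by-step version that worked for me was essentially the following (object names here are just illustrative):

# Same extract definition as above, but with each step run as a separate call
extract   <- define_extract_usa("USA 1850 full count", samples, variables)
submitted <- submit_extract(extract)
completed <- wait_for_extract(submitted)
ddi_path  <- download_extract(completed)  # returns the path to the DDI file
usa_1850  <- read_ipums_micro(ddi_path)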

Thanks for sharing this code, Phil! I was able to run it successfully both times I tried, so I suspect that the error you received was due to connectivity issues that prevented the full data file from being downloaded. However, if you continue to get similar errors only when running the code in pipes, let us know and we can continue troubleshooting; it could also be some difference between my machine and yours that explains why the code worked for me.

If the issue was connectivity, it would be ideal for the download_extract() function to throw an error indicating that the download has failed, so that users are not surprised when they can’t read in a (partially) downloaded file. I will consult with the API team to see if there are ways we could detect connectivity issues and inform users in this way.
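In the meantime, one possible user-side workaround is to verify that the downloaded .dat.gz decompresses cleanly before reading it. An untested sketch (file name illustrative):

# A truncated gzip file usually produces a warning or error when streamed to
# the end, so this should return FALSE for an incomplete download.
gz_is_complete <- function(path) {
  con <- gzfile(path, open = "rb")
  on.exit(close(con))
  tryCatch({
    while (length(readBin(con, what = raw(), n = 1e6)) > 0) {}
    TRUE
  }, warning = function(w) FALSE, error = function(e) FALSE)
}
gz_is_complete("usa_00011.dat.gz")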