Description
The issue
Deposits with 5,000+ files are taking a long time to publish. (I updated this from 15,000 to 5,000 based on testing on stage.) Even if the files are very small, it can take multiple hours to shelve 5,000 files.
We recently changed a timeout from 15 minutes to 10 hours in an attempt to get an item with 23,000 files deposited, and even then it took over 20 hours (two timeouts and three retries) to shelve the last 10,000 files for that item. The item happened to be in progress when there was a systems outage, so it is not clear how long it had spent in shelving before the outage.
Production items that have had this problem recently:
- zg292rq7608 (recently resolved by increasing timeout)
- tm782sf2963 (finished after timing out twice with a 10 hour timeout after the item had already been running for 15+ hours previously)
There are multiple HB errors associated with these many-file deposits:
Observing the progress of these deposits, I think this is what's happening:
- while shelving: the publish timeout results in the job retrying before it can complete, which has led us to increase the timeout to 10 hours in order to avoid rapid retries
- files very slowly get copied to Stacks (the process starts out fast, but slows to <100 every 5 minutes)
- if there are retries, many files are created in `/access-transfer`, many more than are contained in the original deposit. It's likely the retries are rewriting the same files repeatedly.
- when the files finally are all copied to Stacks, the job attempts to publish the purl metadata
- the purl metadata publish takes >15 minutes, which led us to increase a timeout there to 30 minutes (possibly more now?)
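
For scale, here is a rough back-of-the-envelope sketch of what the slowed copy rate implies. It assumes a constant rate of 100 files every 5 minutes, which is the upper bound observed above, so the real durations would be longer. The function name and parameters are hypothetical and only for illustration.

```python
def hours_to_copy(n_files, files_per_interval=100, interval_minutes=5):
    """Best-case hours to copy n_files at a constant rate (assumed, not measured)."""
    intervals = n_files / files_per_interval
    return intervals * interval_minutes / 60


TIMEOUT_HOURS = 10  # the increased publish timeout mentioned above

for n_files in (5_000, 10_000, 23_000):
    hours = hours_to_copy(n_files)
    verdict = "exceeds" if hours > TIMEOUT_HOURS else "fits within"
    print(f"{n_files} files: at least {hours:.1f} h, {verdict} the {TIMEOUT_HOURS} h timeout")
```

Even at that best-case rate, a 23,000-file item would need roughly 19 hours of copying, which is well past a single 10-hour timeout window.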
The result is that transfers are taking much longer than they should, with the most extreme cases taking multiple days and requiring careful monitoring and attention.
See my comment below for more data on how long the copying steps are taking.
To reproduce
Accession an item on stage that has 5,000+ files and watch the shelving progress. You can see that copying files into Stacks takes multiple hours. For example, publish took 2 hours on this item that has 5,000 files but only 23 KB of data. An item with 16,000 files took almost an entire day.
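
To compare the two stage observations, a small sketch of the average throughput each one implies. The 24-hour figure is an assumption standing in for "almost an entire day", and the helper name is hypothetical.

```python
def files_per_minute(n_files, hours):
    """Average files copied per minute over the whole publish run."""
    return n_files / (hours * 60)


# (file count, elapsed hours); the 24 h value is an assumed approximation.
observations = {
    "5,000 files in 2 h": (5_000, 2),
    "16,000 files in ~24 h": (16_000, 24),
}

for label, (n_files, hours) in observations.items():
    print(f"{label}: {files_per_minute(n_files, hours):.1f} files/min")
```

Under those assumptions the larger item averages roughly a quarter of the smaller item's per-minute rate, which is consistent with the copy slowing down as the file count grows rather than scaling linearly.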
Additional background
We don't get many deposits over 5,000 files - and we ask people not to exceed 25,000 files in the H2 deposit form - but we have been supporting deposits in that range since we launched the H2-Globus deposit integration. We did not see this constellation of issues before we re-did the publish/shelve approach during the recent versioning work.