Advice on faster extraction of many files in zip #568
Replies: 2 comments 9 replies
-
It may be possible to do what you are suggesting with the new API. You can use
-
I'm partly answering my own question here; apologies if there's a more appropriate way of doing this. If I had time I'd perhaps submit a pull request, but I'm hoping this might be cleaned up and integrated by someone else. Anyway, I've switched to the minizip-ng native API and managed to get acceptable performance with it. However, I had to make some extensions to the interface to achieve this. The first problem was that I couldn't find any way to retrieve the central directory position in order to open another (memory) stream at the correct location. So I'd like to suggest adding this function:
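The original snippet isn't preserved here; as an illustration only, such an extension might look like the following (the name and signature are hypothetical, not part of the current minizip-ng API):

```c
/* Hypothetical extension (illustrative name/signature, not in minizip-ng):
   report the stream offset at which the central directory starts, so the
   caller can open a second, e.g. memory-backed, stream at that position. */
int32_t mz_zip_get_cd_start_pos(void *handle, int64_t *cd_start_pos);
```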
The next requirement is to go to a file without scanning the central directory again. Note that we've pre-cached the file information so we can pass it back in here:
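Again the proposed snippet isn't shown above; a hypothetical sketch of the idea (illustrative signature only) might be:

```c
/* Hypothetical extension (illustrative signature, not in minizip-ng):
   position the reader on an entry using file information cached from an
   earlier central-directory scan, instead of walking the directory again. */
int32_t mz_zip_goto_entry_info(void *handle, mz_zip_file *file_info);
```

Note that recent minizip-ng revisions appear to expose `mz_zip_get_entry()`/`mz_zip_goto_entry()` for saving and restoring a central-directory position, which may cover part of this; worth checking against the version in use.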
Finally since the file information is available, we'd like to open a file without scanning the local header to minimise data access:
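The corresponding snippet is also missing above; a hypothetical sketch (illustrative signature only) could be:

```c
/* Hypothetical extension (illustrative signature, not in minizip-ng):
   open the current entry for reading while trusting the cached
   central-directory information, skipping the read of the local file
   header to avoid one extra seek-and-read per entry. */
int32_t mz_zip_entry_read_open_info(void *handle, mz_zip_file *file_info,
                                    uint8_t raw, const char *password);
```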
The above is clearly incomplete - we have a hacky solution that I'd rather not submit, and I'm hoping someone will provide a neater one (ideally to go into a future release).
-
Hi - I'm looking for some advice on how minizip-ng can be used to efficiently read a zip with many files. I'm currently using the old minizip API (functions in mz_compat.h), although I'm happy to change if I can find the right calls.
I believe my problem relates to unzGoToFilePos reading from the central directory before it allows the next zip file entry to be read. Unfortunately I'm finding this compromises I/O performance for a zip containing many files. I suspect the repeated reads from the end of the file, then back towards the start, mean that data isn't buffered efficiently.
My comparison is the InfoZip unzip utility. I'm running it on a test input comprising around 40k files, each with around 3000 bytes of compressed data. Even from a network share this can be decompressed and piped to a single output file in under 2 s. The command is something like this: `unzip -p \\network\share\file.zip > test.dat`
However, when I try to emulate this process with calls to minizip-ng, the best I can do is over 10 s (it doesn't sound like a big difference, but it's enough to impair our system). To improve this I would like to read the offsets from the central directory, and then seek to each entry in turn before reading it.
Unfortunately I've hit an obstacle with the first part. I'm using unzGetCurrentFileInfo, but it appears to be missing this line (at least, that's what I'm expecting it to do):

```c
pfile_info->disk_offset = file_info->disk_offset;
```
Even if that worked, I can't see any way to use the offset in the way I hoped - am I overlooking the right function, or is this simply not exposed? In case it's useful (and I'm perhaps doing something really dumb), here is my test code:
Beta Was this translation helpful? Give feedback.