Advice on faster extraction of many files in zip #568

andrew-wroe · 2021-05-03T17:36:07Z

andrew-wroe
May 3, 2021

Hi - I'm looking for some advice on how minizip-ng can be used to efficiently read a zip with many files. I'm currently using the old minizip API (functions in mz_compat.h), although I'm happy to change if I can find the right calls.

I believe my problem relates to unzGoToFilePos reading from the central directory before it allows the next zip file entry to be read. Unfortunately I'm finding this compromises I/O performance for a zip containing many files. I suspect repeated reads from the end of the file then back towards the start mean that it doesn't efficiently buffer data.

My comparison is the InfoZip unzip utility - I'm running it on my test input comprising around 40k files with each one being around 3000 bytes compressed data. Even from a network share this can be decompressed and piped to a single output file in less than <2s. The command is something like this: unzip -p \network\share\file.zip > test.dat

However when I try to emulate this process with calls to minizip-ng the best I can do is >10s (it doesn't sound like a big difference but it's enough to impair our system). To improve this I would like to read offsets from the central directory, and then seek each entry in turn before reading it.

Unfortunately I've hit an obstacle with the first part - I'm using unzGetCurrentFileInfo but it appears to be missing this line (at least it's what I'm expecting it to do):

pfile_info->disk_offset = file_info->disk_offset;

Even if that worked I can't see any way to use the offset in the way I hoped - am I overlooking the right function or is this simply not exposed? In case it's useful (and I'm perhaps doing something really dumb) here's is my test code:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <io.h>
#include <memory>
#include <vector>

#include "mz_compat.h"

#define READ_SIZE 1<<16

int main(int argc, char** argv)
{
	int error = -1;
	if (argc < 2)
		return error;

	std::unique_ptr<void, decltype(&unzClose)> zipfile_ptr(unzOpen(argv[1]), unzClose);
	if (!zipfile_ptr)
		return error;

	unz_global_info global_info;
	unzFile zipfile = zipfile_ptr.get();

	error = unzGetGlobalInfo(zipfile, &global_info);
	if (error != UNZ_OK)
		return error;

	std::unique_ptr<char[]> read_buffer(new char[READ_SIZE]);
	setmode(fileno(stdout), O_BINARY);

	std::vector<unz_file_pos> file_positions;
	file_positions.reserve(global_info.number_entry);

	error = unzGoToFirstFile(zipfile);
	if (error != UNZ_OK)
		return error;

	for (uint32_t i = 0; i < global_info.number_entry; ++i)
	{
		unz_file_pos file_pos;
		error = unzGetFilePos(zipfile, &file_pos);
		if (error != UNZ_OK)
			return error;

		file_positions.emplace_back(file_pos);

		error = unzGoToNextFile(zipfile);
		if (error == UNZ_END_OF_LIST_OF_FILE)
			break;
		if (error != UNZ_OK)
			return error;
	}

	for (auto& file_pos : file_positions)
	{
		error = unzGoToFilePos(zipfile, &file_pos);
		if (error != UNZ_OK)
			return error;

		error = unzOpenCurrentFile(zipfile);
		if (error != UNZ_OK)
			return error;

		error = !0;
		while (error > 0)
		{
			error = unzReadCurrentFile(zipfile, read_buffer.get(), READ_SIZE);
			if (error < 0)
				return error;

			if (error > 0)
				fwrite(read_buffer.get(), error, 1, stdout);
		};

		error = unzCloseCurrentFile(zipfile);
		if (error != UNZ_OK)
			return error;
	}

	return error;
}

nmoinvaz · 2021-05-03T17:47:15Z

nmoinvaz
May 3, 2021
Maintainer

It may be possible to do what you are suggesting with the new API. You can use mz_zip_set_cd_stream to set the stream of the central directory after opening with mz_zip_open. You can open or create the central directory stream using mz_stream_os_* or mz_stream_mem_*. Then you can have mz_zip_set_cd_stream set the stream and start pos of the central directory. This would prevent you from having to seek all over the place. Two streams allows each one to buffer and seek independently. cd_stream was separated from mainstream for use with split-disk archives.

2 replies

andrew-wroe May 3, 2021
Author

@nmoinvaz - thanks for the almost instant feedback. Let me experiment with that setup and I'll get back with results. It certainly sounds like separate streams and thus buffers will alleviate the issue.

nmoinvaz May 3, 2021
Maintainer

FYI when I did testing many years ago against zip archives with many small files we noticed that reallocation of buffers hurt performance. Even the reallocation of the deflate state struct.

This is why some web servers that use zlib will pass it their own memory allocations.

andrew-wroe · 2021-05-22T20:22:09Z

andrew-wroe
May 22, 2021
Author

I'm partly answering my own question here - apologies if there's a more appropriate way of doing this. If I had time I'd perhaps submit a pull request, but I'm hoping this might be cleaned up and integrated by someone else.

Anyway, I've switched to MiniZip-NG native API and managed to get acceptable performance with it. However, I've had to make some extensions on the interface to achieve this.

The first problem was that I couldn't find any way to retrieve the central directory position to open another (memory) stream at the correct location. So I'd like to suggest adding this function:

int64_t mz_zip_get_cd_start_pos(void* handle)
{
    mz_zip* zip = (mz_zip*)handle;
    if (zip == NULL)
        return MZ_PARAM_ERROR;
    return zip->cd_start_pos;
}

The next requirement is to go to a file without scanning the central directory again. Note that we've pre-cached the file information so we can pass it back in here:

int32_t mz_zip_goto_file_entry(void* handle, int64_t cd_pos, const mz_zip_file* file_info)
{
    mz_zip* zip = (mz_zip*)handle;
    if (zip == NULL)
        return MZ_PARAM_ERROR;
    if (file_info == NULL)
        return MZ_PARAM_ERROR;
    if (cd_pos < zip->cd_start_pos  || cd_pos > zip->cd_start_pos + zip->cd_size)
        return MZ_PARAM_ERROR;
    zip->cd_current_pos = cd_pos;
    zip->file_info = *file_info;
    zip->entry_scanned = 1;
    return MZ_OK;
}

Finally since the file information is available, we'd like to open a file without scanning the local header to minimise data access:

int32_t mz_zip_file_entry_read_open(void* handle, uint8_t raw, const char* password)
{
    mz_zip* zip = (mz_zip*)handle;
    if (zip == NULL)
        return MZ_PARAM_ERROR;
    if ((zip->open_mode & MZ_OPEN_MODE_READ) == 0)
        return MZ_PARAM_ERROR;
    if (zip->entry_scanned == 0)
        return MZ_PARAM_ERROR;

    int32_t err = mz_stream_seek(zip->stream, zip->file_info.disk_offset + zip->disk_offset_shift, MZ_SEEK_SET);
    if (err != MZ_OK)
        return err;

    // TODO: Add code to read just filename and extra field size and fill in zip->local_file_info.
    ...

    err = mz_stream_seek(zip->stream, zip->local_file_info.filename_size + zip->local_file_info.extrafield_size, MZ_SEEK_CUR);
    if (err != MZ_OK)
        return err;

    return mz_zip_entry_open_int(handle, raw, 0, password);
}

The above is clearly incomplete - we have some hacky solution that I'd rather not submit and I'm hoping someone will provide a neater solution (ideally to go into future release).

7 replies

nmoinvaz May 24, 2021
Maintainer

Finally since the file information is available, we'd like to open a file without scanning the local header to minimise data access:

I believe it reads the local header because the size can be variable.

nmoinvaz May 24, 2021
Maintainer

Are you using the mz_stream_buffered? Because it should buffer a lot of the small reads/writes. See how mz_zip_reader uses it.

andrew-wroe May 25, 2021
Author

Are you using the mz_stream_buffered? Because it should buffer a lot of the small reads/writes. See how mz_zip_reader uses it.

We're using mz_stream_open and my assumption was that it use a buffer. We do get a performance improvement when reading the header in one chunk though. I'm still investigating this because I agree that reading small values should not be an issue.

nmoinvaz May 25, 2021
Maintainer

mz_stream_os_open just uses whatever the OS file functions are. mz_stream_buffered will do larger reads and cache them. A long time go I did performance testing and noticed in procmon that without mz_stream_buffered it was requesting a lot of I/O operations from the OS when it didn't need to causing slow down in zip with tons of small files.

andrew-wroe May 25, 2021
Author

Have you thought about always opening two streams in mz_zip_open? One for seeking the central directory and one for seeking files?

We've already switched to using a memory stream for the central directory, and the bottleneck we saw here has disappeared. However, we still have performance issues when minizip-ng reads the local file header (hence the attempt to skip over this).

I'm trying mz_stream_buffer instead to see if that helps. I was indeed expecting the OS buffering to be enough, but I guess something odd is happening to prevent it. We'll only need our first two functions if this works (the third one can be disregarded).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Advice on faster extraction of many files in zip #568

{{title}}

Replies: 2 comments 9 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Advice on faster extraction of many files in zip #568

andrew-wroe May 3, 2021

Replies: 2 comments · 9 replies

nmoinvaz May 3, 2021 Maintainer

andrew-wroe May 3, 2021 Author

nmoinvaz May 3, 2021 Maintainer

andrew-wroe May 22, 2021 Author

nmoinvaz May 24, 2021 Maintainer

nmoinvaz May 24, 2021 Maintainer

andrew-wroe May 25, 2021 Author

nmoinvaz May 25, 2021 Maintainer

andrew-wroe May 25, 2021 Author

andrew-wroe
May 3, 2021

Replies: 2 comments 9 replies

nmoinvaz
May 3, 2021
Maintainer

andrew-wroe May 3, 2021
Author

nmoinvaz May 3, 2021
Maintainer

andrew-wroe
May 22, 2021
Author

nmoinvaz May 24, 2021
Maintainer

nmoinvaz May 24, 2021
Maintainer

andrew-wroe May 25, 2021
Author

nmoinvaz May 25, 2021
Maintainer

andrew-wroe May 25, 2021
Author