-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
large number of small file size #80
Comments
This is one of the downsides to the way rechunker and Zarr currently work. If you can think of a workaround, I'd love to know about it. |
What would be great (as far as I'm concerned) is if rechunker was able to detect such situation and directly propose a multipass approach. |
How would you detect this situation? What criterion would you use? |
the intermediate chunk size can probably be related to an equivalent file size which needs to be above the minimum file size threshold |
I actually did that manually by adjusting the target chunk size iteratively |
But don't you care about the total number of intermediate files? AFAIK, that is the thing that stresses the filesystem, not the chunk size per se. |
I had the feeling that as long as the files were large enough the filesystem was fine, no matter the number of files. |
I doubt they would have complained if you created one single 1kb file. So "this situation" is some function of file size and number of files. I'd like to understand that function better before we start brainstorming solutions. |
Ok, let me try to explain. We discussed this a bit in https://discourse.pangeo.io/t/best-practices-to-go-from-1000s-of-netcdf-files-to-analyses-on-a-hpc-cluster/588/13?u=geynard. At CNES, we have a GPFS file system. It can handle 8,5 PiB of data, but has also a limited amount of metadata (inodes) it can record. I'm not sure of the number here, but we've converge to fix a limit on the mean file size of any given space on our storage: 1MiB. This doesn't mean that you can't create smaller files, even lots of them, but just that for each project space, you'll get an upper limit on inodes and thus files and directories number (which can be high if you're allotted space is too). Be aware that some other computing centers are much more conservative about inodes limits. This GPFS file system has also a given bandwidth, up to 50GiB/s for us (shared between all nodes and users), but more importantly a limit on the number of IOps (Input Output operation per second, so file creation, random writes...) it can perform. This limit depends on the nature of the operation and a lot of other parameters, and is very hard to measure. But this is the one we reach the most (and our FS is really optimized for this) : a single user with a few hundred cores can slow down the entire FS with bad IOs. Two principal examples of bad IOs:
So I hope this clarifies things, and in the end you must be careful about two things:
To go deeper, if I have to translate that in pseudo code: if len(chunks) > some_high_limit
do bigger chunks
if len(chunks) > 100_000 and sizeof(chunks) < 10MiB
do bigger chunks
if sizeof(chunks) < 1MiB
do bigger chunks Of course, all these limits will change from HPCs to HPCs, but I'll bet no admin would enjoy lots of files of less than a MiB. |
Very interesting. Will digest for a bit. The tradeoffs remind me a lot of this paper: Predicting and Comparing the Performance of Array Management Libraries, which compares HDF5 and Zarr. |
Note also that object store behave quite differently as you very well know @rabernat. For those, there is no metadata table, no inodes, the only pain point is write latency. This means you'll only want to have chunks big enough to avoid stacking up write or read calls because of the tens of milliseconds latency of each one. |
To be honest, we really designed rechunker with object storage in mind, even though there are probably more users on HPC. Can you look at the algorithm and think of some strategy we could use to mitigate this inode problem? Zarr doesn't support consolidating multiple chunks into one file, although this has been discussed (can't find the github isssue now). An alternative would be to try using TileDB as an intermediate storage format. TileDB should support concurrent writes to the same chunk, allowing us to use bigger chunks. To do that, we would need a PR to rechunker to implement a different array backend. @apatlpo - it would be good to get the exact size of the input chunks, intermediate chunks, and target chunks for your use case. How many files / what size are we talking about? Is this LLC4320? |
To be clear, object storage also has fixed overhead per object. It's less of an issue than in distributed filesystems, but you will still run into performance issues if you store lots of files smaller than 1 MB. This sounds like a reason to revisit #36, since distributed execution engines like Spark and Dataflow are designed to handle lots spilling outs of small intermediate outputs to disk. Within Zarr, one way to solve this would be to make an intermediate "storage" object that maps multiple chunks to different parts of the same underlying files. |
In theory Beam can run on top of Spark, but probably not quite as smoothly as on Cloud Dataflow. (I haven't tried it) |
I took a quick look at it, the most simple thing at first I cant think of is raising a warning if the intermediate chunks are smaller in size than a given limit or there are too many of them. The advice would be to do an intermediate chunking, or to use consolidate reads and writes? In a second time, we could try to find another algorithm to determine the intermediate chunk size better than using the minimum of each axis. E.g. if shape is (4,1) in input and (1,4) in output, see if we could use (2,2) in intermediate if we have enough memory. I think this is what
Yeah totally! Less impact on the overall service (and so other users), but performance would be affected also!
Spark could use HPC node local storage to spill to disk intermediate chunks, thus not affecting shared FS. Maybe it could also consolidate intermediate chunks on shared FS into bigger files using internal format, not sure.
We can run Spark on CNES HPC system, it's a bit more complicated than Dask (no dask-jobqueue equivalent, so we do it by hand). Never tried Beam though. |
@rabernat Yep, original data is llc4320 with the shapes you know by heart I'm sure.
The above approach spills about 200 000 intermediate files that are between 1 and 3MB which is a bit lower than what I was told to use on CNES cluster. So I do only a partial transposition instead, which looks like:
i.e. about 50 000 files that are between 5 and 10 MB So not a blocking point at all in my case as you can guess. |
I'm also using Pangeo and < I better get back to attempting to rechunk this 2TB dataset > |
usage question I guess
I'm working on an HPC system with a GPFS filesystem and I've been told a couple of times by sys admins (ping @guillaumeeb for confirmation) that I should not produce (very) large number of small filesizes (typically <10Mb).
This potentially puts a lower bound on the intermediate file size.
With rechunker, the knobs that control the intermediate file size are the memory per worker and the target chunk size.
The former is limited by the size of your node (say 100GB), it leads to an upper bound for the intermediate fiile size
Does it mean that in some situations you may have no other choices than apply rechunker in multiple passes in order to complete a rechunk that is too ambitious given the constraints above ?
The text was updated successfully, but these errors were encountered: