Added support for I/O from zipfiles and buffers #914

Merged (5 commits) on May 26, 2025

pfebrer (Contributor) commented Apr 27, 2025

Motivation

When dealing with very big datasets (e.g. for ML) you often need to compress them into a zip or tar file to move them around. Having to decompress the archive again just to read the files is a pain, because:

  1. Decompressing the full dataset might take a very long time.
  2. HPC clusters often limit the number of files. (Also, I recently used a cluster that removes unused files every 30 days, which can basically destroy the parts of the dataset that you have not used.)

Why zip and not tar?

Because tar doesn't keep an index of the contained files, accessing files within it can be very slow when the tarfile gets big.

Therefore I decided that zip files are the ones worth putting effort into.
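To make the index argument concrete, here is a minimal sketch using only the standard library (the member names and contents are made up for illustration). The zip archive stores a central directory with the offset of every member, so a single file can be located directly, while a tar archive has to be scanned header by header to find it.

```python
import io
import tarfile
import zipfile

# Made-up dataset: three tiny members.
members = {f"sample_{i}.txt": f"data {i}\n".encode() for i in range(3)}

# Build the same dataset as an in-memory zip and an in-memory tar.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as zf:
    for name, data in members.items():
        zf.writestr(name, data)

tar_buffer = io.BytesIO()
with tarfile.open(fileobj=tar_buffer, mode="w") as tf:
    for name, data in members.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

# zip: the central directory maps every member name to its offset,
# so reading one member does not require touching the others.
with zipfile.ZipFile(zip_buffer) as zf:
    zip_offsets = {info.filename: info.header_offset for info in zf.infolist()}
    zip_last = zf.read("sample_2.txt")

# tar: there is no index, so tarfile walks the member headers
# sequentially until it finds the requested name.
tar_buffer.seek(0)
with tarfile.open(fileobj=tar_buffer, mode="r") as tf:
    tar_last = tf.extractfile("sample_2.txt").read()
```

Both archives return the same bytes; the difference is only in how the member is located, which is what starts to matter once the archive holds many files.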

Approach taken in this PR

It turns out that the zipfile module already has a zipfile.Path object that mimics pathlib.Path. Therefore, zipfile.Path can be used within the sile framework. It just needed some extra functionality to fully work, so I created an extension, ZipPath, which:

  • Implements missing pathlib.Path functions that the sile framework was using, e.g. with_suffix.
  • Wraps the open method so that closing the returned file handle also closes the root zip file. This is especially important when writing.
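As an illustration of the first point, a subclass of zipfile.Path can borrow the suffix logic from pathlib. This is only a sketch under assumptions: the class name SketchZipPath is hypothetical, it is not the ZipPath implemented in this PR, and it leaves out the open/close wrapping.

```python
import io
import pathlib
import zipfile


class SketchZipPath(zipfile.Path):
    """Hypothetical extension of zipfile.Path (not the ZipPath from this PR)."""

    def with_suffix(self, suffix):
        # ``self.at`` is the location inside the archive, ``self.root`` the
        # underlying ZipFile. Reuse pathlib for the suffix manipulation.
        new_at = str(pathlib.PurePosixPath(self.at).with_suffix(suffix))
        return type(self)(self.root, new_at)


# Build a small in-memory archive with made-up contents.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("run/RUN.fdf", "SystemLabel water\n")
    zf.writestr("run/RUN.out", "done\n")

with zipfile.ZipFile(buffer) as zf:
    fdf = SketchZipPath(zf, "run/RUN.fdf")
    out = fdf.with_suffix(".out")
    out_location = out.at
    out_text = out.read_text(encoding="utf-8")
```

The new path shares the same root ZipFile as the original one, which is why closing that root at the right moment (the second point above) needs separate care.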

What works at the moment:

  • Reading and writing text based files.
  • Reading CDF files. This is done through the memory argument of netCDF4.Dataset, which accepts the in-memory bytes.
  • Reading binary files. This is done by writing the contents to a temporary file. I still need some clever way of knowing when the temporary file can be removed, though.
  • Searching for other files in the directory. E.g. when reading from an fdf inside a zipfile, it can find other files (other fdfs, basis files, matrices) that are also inside the zipfile.
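The temporary-file route for binary files could look roughly like the sketch below. The helper names (read_member_via_tempfile, read_three_doubles) and the file layout are assumptions for illustration, not code from this PR. Here the cleanup question is sidestepped by removing the temporary file as soon as the reader returns, which only works if the reader does not keep the file open.

```python
import io
import os
import struct
import tempfile
import zipfile


def read_member_via_tempfile(archive, member, reader):
    """Copy ``member`` out of ``archive`` into a temporary file and call
    ``reader`` with the path of that file (hypothetical helper)."""
    suffix = os.path.splitext(member)[1]
    fd, tmp_path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(archive.read(member))
        return reader(tmp_path)
    finally:
        # The reader is done at this point, so the file can be removed.
        os.remove(tmp_path)


def read_three_doubles(path):
    """Stand-in for a binary reader that only accepts a file path."""
    with open(path, "rb") as handle:
        return struct.unpack("<3d", handle.read(24))


# Made-up binary member inside an in-memory archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("run/values.bin", struct.pack("<3d", 1.0, 2.5, -4.0))

with zipfile.ZipFile(buffer) as zf:
    values = read_member_via_tempfile(zf, "run/values.bin", read_three_doubles)
```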

What doesn't work for now:

  • Writing CDF files. This can also be done through the memory argument.
  • Writing binary files. This can also be done through temporary files.

Comments

I will implement the missing parts and add tests once I get feedback on whether this approach is fine or whether it is preferable to create separate sile classes, as with BufferSile.

I think the handling of CDF and binary files should be generalized to handle generic Python file handles/objects (e.g. bytes/BytesIO) and not just the contents of a zip file.

How to test

Here is a zip file that you can use for testing, which contains a SIESTA run of a water molecule:

run.zip

You can read the Hamiltonian like:

import sisl

sisl.get_sile("/path/to/run.zip/run/RUN.fdf").read_hamiltonian()

The sile framework will automatically detect that the path contains a zip file and do the necessary adjustments.
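The detection step can be pictured as splitting the path at the first component that looks like a zip file. The sketch below is purely lexical and uses a hypothetical helper name (split_zip_path); the actual implementation in sisl may differ, for example by checking that the archive really exists on disk.

```python
from pathlib import PurePosixPath


def split_zip_path(path):
    """Split ``path`` into (archive, path inside archive).

    Returns None when no intermediate component ends in ".zip".
    Hypothetical helper; it does not check the file system.
    """
    parts = PurePosixPath(path).parts
    # Skip the last component: a trailing "x.zip" is the file itself,
    # not a container for something else.
    for i, part in enumerate(parts[:-1]):
        if part.lower().endswith(".zip"):
            archive = str(PurePosixPath(*parts[: i + 1]))
            inner = str(PurePosixPath(*parts[i + 1 :]))
            return archive, inner
    return None


inside_zip = split_zip_path("/path/to/run.zip/run/RUN.fdf")
plain_path = split_zip_path("/path/to/run/RUN.fdf")
```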

Hope you like it!

# is a .zip file in the middle of the path
try:
    filename = ZipPath.from_path(filename, self._mode)
except FileNotFoundError:
Check notice (Code scanning / CodeQL): Empty except (Note)

'except' clause does nothing but pass and there is no explanatory comment.

codecov bot commented Apr 27, 2025

Codecov Report

Attention: Patch coverage is 99.10979% with 3 lines in your changes missing coverage. Please review.

Project coverage is 86.90%. Comparing base (41dd390) to head (8e1b62a).
Report is 7 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/sisl/io/sile.py | 98.25% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #914      +/-   ##
==========================================
+ Coverage   86.82%   86.90%   +0.08%     
==========================================
  Files         409      412       +3     
  Lines       53894    54207     +313     
==========================================
+ Hits        46795    47111     +316     
+ Misses       7099     7096       -3     

zerothi (Owner) commented Apr 27, 2025 via email

pfebrer commented Apr 27, 2025

The basis commits are already removed 👍

zerothi (Owner) left a comment

I'll likely have some more comments, but it looks very clean, simple and effective. Major Q is whether write works, otherwise an error should be raised.

# We therefore store the contents in a temporary file and set this
# as the file to be read from.
if (
name.startswith("read_")

zerothi (Owner):

what about write? Should that even be allowed?

pfebrer (Contributor Author):

Yes, I think so. It is a bit scarier because if we don't manage the write correctly we can end up with a corrupted zip file. That's why I didn't want to rush the implementation; I need to put a bit of thought into it.

zerothi (Owner):

Otherwise, raising a NotImplementedError is sufficient at the moment; writing doesn't seem like something that's very useful ATM.

pfebrer (Contributor Author):

Yes, but on the other hand, if I do it now it will be faster than if I come back to it 2 years later 😅

pfebrer commented Apr 28, 2025

Yes, I think there will be no problem in making write work 👍

pfebrer commented Apr 28, 2025

By the way, what I've done for CDF and binary files can really work with any buffer; it's not like it only works with zipfiles. So I think it would be worth it, now that we are at it, to make it support arbitrary buffers, which is something that has come up in the past.

Any ideas of how things could be structured to support the buffers?
Just extend the BufferSile to handle more buffer types, create a new class...?

zerothi commented Apr 28, 2025

> By the way, what I've done for CDF and binary files can really work with any buffer; it's not like it only works with zipfiles. So I think it would be worth it, now that we are at it, to make it support arbitrary buffers, which is something that has come up in the past.
>
> Any ideas of how things could be structured to support the buffers? Just extend the BufferSile to handle more buffer types, create a new class...?

If the BufferSile could be extended as a base class for binary + text files, just like the sile, then I think that would be the best way of doing things.

So:

classDiagram
    BaseSile <|-- Sile
    BaseSile <|-- SileCDF 
    BaseBufferSile
    BaseBufferSile <|-- BufferSile
    BaseBufferSile <|-- BufferCDF 

pfebrer commented Apr 28, 2025

Ok, will try to do that 👍

By the way, where is the documentation for generating these diagrams? hahah

zerothi commented Apr 28, 2025

> Ok, will try to do that 👍

Nice!

> By the way, where is the documentation for generating these diagrams? hahah

They are Mermaid charts, which are natively supported in GitHub Markdown! ;)

# Write a temporary zipfile
tempfile_path = tempfile.mktemp(suffix=".zip")
with zipfile.ZipFile(tempfile_path, "w") as f:
    ...

Check notice (Code scanning / CodeQL): Statement has no effect (Note, test)

This statement has no effect.
pfebrer changed the title from "WIP: Added support for I/O from zipfiles" to "Added support for I/O from zipfiles and buffers" on May 22, 2025

Up to now, the support for buffers was limited to text buffers. Now,
byte buffers are also supported, so one can read/write binary files
and CDF files from/to buffers.

This commit also introduces full support for zipfiles.

zerothi commented May 23, 2025

I think this is great!

Only one comment.

I am not sure I like that get_sile("hello/test.zip/...") automatically creates test.zip. For reading it is fine, because you can detect whether it is a zip file, but for creation I am a bit worried that it might be ambiguous: should it be a directory or a zip file?
I think for creating files we shouldn't allow that nomenclature... Could one extend ZipPath to have a base zip path, so you could do something like:

import sisl
from sisl.io import ZipPath

zipfile = ZipPath("/hello/test.zip") / "dir_in_zip_file"
sisl.get_sile(zipfile, mode="w")

With the above, there should be no ambiguity.

pfebrer commented May 23, 2025

Hmm ok, I understand, but for writing, the zip file will often already exist; you just want to add files to it. So maybe what we could do is that sisl never creates a new zip file.

If you want to open the zip file externally I think the best thing to do is to use the zipfile classes. I think ZipPath should remain internal to sisl. This is what I show in the last example of the changes rst.

zerothi commented May 23, 2025

> Hmm ok, I understand, but for writing, the zip file will often already exist; you just want to add files to it. So maybe what we could do is that sisl never creates a new zip file.

If it exists, there will be no ambiguity, unless you specify mode='w', in which case you want to overwrite it...

> If you want to open the zip file externally I think the best thing to do is to use the zipfile classes. I think ZipPath should remain internal to sisl. This is what I show in the last example of the changes rst.

Ok!

pfebrer commented May 23, 2025

Yes, when the zipfile is opened inside sisl it is always with mode "r" or "a", never "w".

zerothi commented May 23, 2025

> Yes, when the zipfile is opened inside sisl it is always with mode "r" or "a", never "w".

what happens when the user requests mode='w'?

pfebrer commented May 23, 2025

The mode="w" that the user requests is for the file, not for the container zip file.

It is the same as if we wanted to write to a folder, we want to add things to the folder, never overwrite the full folder to write the file.
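The difference between the two container modes can be checked with the standard zipfile module alone. This is a sketch with made-up member names, not sisl code: opening the archive with "a" adds a member next to the existing ones (like writing a new file into a folder), while opening it with "w" replaces the whole archive.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp_dir:
    archive_path = os.path.join(tmp_dir, "run.zip")

    # Create an archive with one member.
    with zipfile.ZipFile(archive_path, "w") as zf:
        zf.writestr("run/RUN.fdf", "SystemLabel water\n")

    # Container mode "a": the new member is added, the old one survives.
    with zipfile.ZipFile(archive_path, "a") as zf:
        zf.writestr("run/water.xyz", "3\n\n")
    with zipfile.ZipFile(archive_path) as zf:
        names_after_append = sorted(zf.namelist())

    # Container mode "w": the archive is truncated before writing.
    with zipfile.ZipFile(archive_path, "w") as zf:
        zf.writestr("run/other.xyz", "0\n\n")
    with zipfile.ZipFile(archive_path) as zf:
        names_after_overwrite = zf.namelist()
```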

zerothi commented May 23, 2025

> The mode="w" that the user requests is for the file, not for the container zip file.
>
> It is the same as if we wanted to write to a folder, we want to add things to the folder, never overwrite the full folder to write the file.

Hmm... Yeah, I see... :)

pfebrer commented May 25, 2025

Ok, so this is now super ready, I'm very happy with the final result, everything works:

Read/write from/to buffers and zipfiles for ALL siles. 🥳

Plus, all write/read operations for buffers and zipfiles are tested (text, binary and CDF). So things are quite robust and I'm confident it should work fine.

I just have to figure out why the tests with temporary files fail in the CI (I think only with Python 3.9). Could you try the tests on your computer and see if everything is fine?

I also refactored the I/O documentation to include information on how to do I/O with zipfiles and buffers, and added a link to the I/O documentation in the introduction section, because it seemed to me that it was not visible enough and it is an important part of the package 😅

pfebrer commented May 25, 2025

Ok, I can reproduce the problems in python 3.9 😓

pfebrer commented May 25, 2025

So it is quite a mess to support zipfiles both in python 3.9 and python>3.9 because there was one key change in between: python/cpython#84744 (comment).

This means that we have to have many extra checks and code for a python version that we are going to drop in October :( The easiest fix also involves requiring an extra dependency zipp.

So I suggest it is not worth it to support zipfiles in python 3.9, we can just raise a meaningful error to update to python >= 3.10 if the user wants to use zipfiles. Is this ok?
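A version guard along these lines would be enough to give that error. This is a minimal sketch: the function name, the exception type and the message are assumptions, and the check actually added to sisl may look different.

```python
import sys


def check_zip_support(version_info=sys.version_info):
    """Raise a clear error when zip file I/O is requested on Python < 3.10.

    Hypothetical helper; ``version_info`` is a parameter only so that the
    check can be exercised with arbitrary versions.
    """
    if tuple(version_info[:2]) < (3, 10):
        raise RuntimeError(
            "Reading from or writing to zip files requires Python >= 3.10. "
            "Please update your Python version to use this feature."
        )


# Example: simulate the check on Python 3.9 and on Python 3.10.
try:
    check_zip_support((3, 9, 0))
    raised_on_39 = False
except RuntimeError:
    raised_on_39 = True

check_zip_support((3, 10, 0))  # passes silently
```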

zerothi commented May 26, 2025

> So it is quite a mess to support zipfiles both in python 3.9 and python>3.9 because there was one key change in between: python/cpython#84744 (comment).
>
> This means that we have to have many extra checks and code for a python version that we are going to drop in October :( The easiest fix also involves requiring an extra dependency zipp.
>
> So I suggest it is not worth it to support zipfiles in python 3.9, we can just raise a meaningful error to update to python >= 3.10 if the user wants to use zipfiles. Is this ok?

Yes, that is fine. :)
Thanks, and sorry, this week is busy up to the workshop, so I'll likely only have time to get things in after the workshop...

@zerothi zerothi merged commit 699f64e into zerothi:main May 26, 2025
17 checks passed