-
-
Notifications
You must be signed in to change notification settings - Fork 62
Added support for I/O from zipfiles and buffers #914
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #914 +/- ##
==========================================
+ Coverage 86.82% 86.90% +0.08%
==========================================
Files 409 412 +3
Lines 53894 54207 +313
==========================================
+ Hits 46795 47111 +316
+ Misses 7099 7096 -3 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
The basis commits are already removed 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll likely have some more comments, but it looks very clean, simple and effective. Major Q is whether write
works, otherwise an error should be raised.
src/sisl/io/sile.py
Outdated
# We therefore store the contents in a temporary file and set this | ||
# as the file to be read from. | ||
if ( | ||
name.startswith("read_") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about write? Should that even be allowed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think so. It is a bit more scary because if we don't manage the write correctly it can end up in a corrupted zip file, that's why I didn't want to implement it fast, I need to put a bit of thought into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise, doing an NotImplementedError
is sufficiently at the moment, it doesn't seem like it is something that's very useful ATM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes but on the other hand if I do it now it will be faster than if I come back at it 2 years later 😅
Yes, I think there will be no problem in making write work 👍 |
By the way, what I've done for CDF and binary files can work really with any buffer, its not like it just works with zipfiles. So I think it would be worth it now that we are at it to make it support arbitrary buffers, which is something that has come up in the past. Any ideas of how things could be structured to support the buffers? |
If the So: classDiagram
BaseSile <|-- Sile
BaseSile <|-- SileCDF
BaseBufferSile
BaseBufferSile <|-- BufferSile
BaseBufferSile <|-- BufferCDF
|
Ok, will try to do that 👍 By the way, where is the documentation for generating this diagrams? hahah |
They are mermaid charts which are inherently allowed in Github text! ;) |
Up to now, the support for buffers was limited to text buffers. Now, byte buffers are also supported, so one can read/write binary files and CDF files from/to buffers. This commit also introduces full support for zipfiles.
I think this is great! Only one comment. I am not sure I like the from sisl.io import ZipPath
zipfile = ZipPath("/hello/test.zip") / "dir_in_zip_file"
si.get_sile(zipfile, mode="w") With the above, there should be no ambiguity. |
Hmm ok I understand but, for writing, the zip file will often also exist already, you just want to add files to It. So what we could do maybe is that If you want to open the zip file externally I think the best thing to do is to use the |
|
Yes, when the zipfile is opened inside sisl it is always with mode "r" or "a", never "w". |
what happens when the user requests |
The It is the same as if we wanted to write to a folder, we want to add things to the folder, never overwrite the full folder to write the file. |
Hmm... Yeah, I see... :) |
Ok, so this is now super ready, I'm very happy with the final result, everything works: Read/write from/to buffers and zipfiles for ALL siles. 🥳 Plus all write/read operations for buffers and zipfiles are tested (both text, binary and CDF). So things are quite robust and I'm confident it should work fine. I just have to figure out why the tests with temporary files fail in the CI (I think only with python 3.9). Could you try the tests in your computer and see if everything is fine? I also refactored the I/O documentation to include information on how to do I/O with zipfiles and buffers, and added a link to the I/O documentation in the introduction section, because it seemed to me that it was not visible enough and it is an important part of the package 😅 |
Ok, I can reproduce the problems in python 3.9 😓 |
So it is quite a mess to support zipfiles both in python 3.9 and python>3.9 because there was one key change in between: python/cpython#84744 (comment). This means that we have to have many extra checks and code for a python version that we are going to drop in October :( The easiest fix also involves requiring an extra dependency So I suggest it is not worth it to support zipfiles in python 3.9, we can just raise a meaningful error to update to python >= 3.10 if the user wants to use zipfiles. Is this ok? |
Yes, that is fine. :) |
Motivation
When dealing with very big datasets (e.g. ML) you often need to compress it into a zip or tar file to move it around. Needing to then decompress it for reading the files is a pain, because:
Why zip and not tar?
Because tar doesn't keep an index of the contained files, so it can be very slow to access files within it when the tarfile gets big.
Therefore I have determined that putting effort into zip files is much more worth it.
Approach taken in this PR
Turns out that the
zipfile
library already had azipfile.Path
object that mimicspathlib.Path
. Therefore,zipfile.Path
can be used within the sile framework. It just needed some extra functionality to fully work, so I created an extensionZipPath
, which:pathlib.Path
functions that the sile framework was using, e.g.with_suffix
.open
method of the returned file handles so that on close we can close also the root zip file. This is specially important when writing.What works at the moment:
memory
argument ofnetCDF4.Dataset
, which accepts the in memory bytes.What doesn't work for now:
memory
argument.Comments
I will implement the missing parts and also add tests when I get feedback on whether this approach is fine or it is preferrable to create separate sile classes as with
BufferSile
.The handling of CDF and binary files I think it should be generalized to handle generic python filehandles/objects (e.g.
bytes
/BytesIO
) and not just contents of a zip file.How to test
Here is a zip file that you can use for testing, which contains a SIESTA run of a water molecule:
run.zip
You can read the Hamiltonian like:
The sile framework will automatically detect that the path contains a zip file and do the necessary adjustments.
Hope you like it!