Implement fsspec in place of open #168
Thanks for working on this @hardingnj! I'll take a look and see why tests aren't passing. The use of BGZF is meant to allow block-based access to gzip-compressed files (https://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html). The Bio.bgzf module is in fact using the Python gzip module, but with extra logic to allow construction of virtual offsets into these compressed blocks.
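To make the "virtual offsets" idea concrete, here is a minimal pure-Python sketch of how a BGZF virtual offset packs two numbers into one integer: the upper 48 bits hold the file position where a compressed block starts, and the lower 16 bits hold the position inside that block once it is decompressed. The helper names `pack_virtual_offset` and `unpack_virtual_offset` are made up for this illustration and are not taken from Bio.bgzf.

```python
def pack_virtual_offset(block_start, within_block):
    """Combine a compressed block start and an in-block offset into one int."""
    if not 0 <= within_block < 2 ** 16:
        raise ValueError("within-block offset must fit in 16 bits")
    if not 0 <= block_start < 2 ** 48:
        raise ValueError("block start must fit in 48 bits")
    return (block_start << 16) | within_block


def unpack_virtual_offset(virtual_offset):
    """Split a virtual offset back into (block_start, within_block)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


# Round trip: a block starting at byte 100000 of the compressed file,
# and the 10th byte of that block after decompression.
voffset = pack_virtual_offset(100000, 10)
block_start, within_block = unpack_virtual_offset(voffset)
```

Because the in-block part only has 16 bits, each block can hold at most 64 KiB of uncompressed data, which is where the block size limit comes from.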
deprecate bgzf lib, handled via compression = 'infer'
added explicit __array__ method in FastaRecord
Hi @mdshw5, Have added a bit more work. I still have test failures, but these I think are expected failures that might be handled ok in Also added an
Thanks. The test that is failing is due to Python's universal newline support.
@@ -514,8 +488,8 @@ def read_fai(self):

    def build_index(self):
        try:
            with self._fasta_opener(self.filename, 'rb') as fastafile:
                with open(self.indexname, 'w') as indexfile:
            with self._fasta_opener(self.filename, 'r', compression=self._compression) as fastafile:
@hardingnj The mode here should be 'rb' if possible so that later code can see the line terminator characters.
It seems like the
Ahah- I thought it was something to do with that... though I didn't understand why you expect different line lengths depending on the presence of Windows line endings.
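A small illustration of why the open mode matters here, assuming the index needs the on-disk width of each line including its terminator (as a `.fai` index does): in text mode Python's universal newline handling turns `\r\n` into `\n`, so every line looks one byte shorter than it really is. The FASTA content below is invented for the demonstration.

```python
import os
import tempfile

# A tiny FASTA file written with Windows (CRLF) line endings.
raw = b">chr1\r\nACGT\r\nAC\r\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".fa") as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Text mode: universal newlines translate "\r\n" to "\n" while reading.
    with open(path, "r") as fh:
        text_lengths = [len(line) for line in fh]

    # Binary mode: line terminators are left untouched.
    with open(path, "rb") as fh:
        byte_lengths = [len(line) for line in fh]
finally:
    os.remove(path)

# The two lists differ by one per line, so byte offsets computed from
# text-mode lengths would drift further off with every line.
drift_per_line = [b - t for b, t in zip(byte_lengths, text_lengths)]
```

With `'rb'` the lengths match the bytes actually stored in the file, which is what a seek-based index has to count.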
I'm not sure why it's necessary though? We're not compressing the file, just constructing the index from blocks, unless I am misunderstanding something?
@@ -1021,6 +999,17 @@ def __getitem__(self, rname):
        if isinstance(rname, integer_types):
            rname = next(islice(self.records.keys(), rname, None))
        try:
            #
cruft- crept in when trying to debug!
It's necessary to be able to seek past all of the 64k blocks that don't contain the sequence you need, and also seek to the correct position within the block of interest. @peterjc did a fantastic job documenting his module, and the section on file handle offsets is more than worth your time to read. Thanks for your contributions here - I do think the fsspec module makes sense but I do want to make sure to incorporate BGZF support properly. I've also looked more closely at the test results and my BGZF tests are indeed failing without the
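The seek-then-offset access described above can be sketched with the standard library alone. This is only an approximation of the idea, not how Bio.bgzf is implemented: it writes plain gzip members as stand-in blocks and keeps a separate in-memory index, whereas real BGZF blocks carry their own size in a gzip extra field. The names `write_blocks` and `read_at` and the 8-byte block size are hypothetical, chosen to keep the example small.

```python
import bisect
import gzip
import io
import zlib


def write_blocks(data, block_size):
    """Compress data as independent gzip members and record where each starts."""
    buf = io.BytesIO()
    index = []  # (uncompressed_start, compressed_start) per block
    for start in range(0, len(data), block_size):
        index.append((start, buf.tell()))
        buf.write(gzip.compress(data[start:start + block_size]))
    return buf.getvalue(), index


def read_at(blob, index, pos, size):
    """Read up to `size` bytes at uncompressed position `pos` from one block."""
    starts = [ustart for ustart, _ in index]
    i = bisect.bisect_right(starts, pos) - 1   # block containing pos
    ustart, cstart = index[i]
    # Jump straight to the block; earlier blocks are never decompressed.
    # wbits=31 expects a gzip header; decoding stops at the end of this member.
    block = zlib.decompressobj(wbits=31).decompress(blob[cstart:])
    within = pos - ustart                      # offset inside the block
    return block[within:within + size]         # does not cross into the next block


sequence = b"ACGTACGTTTTTGGGGCCCCAAAA"
blob, index = write_blocks(sequence, block_size=8)
chunk = read_at(blob, index, pos=10, size=3)
```

A plain gzip stream without such block boundaries would have to be decompressed from the start to reach the same position.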
Thanks @mdshw5. Of course- very happy to make the effort to integrate properly. I'll need to read a bit further into
I came across this PR yesterday when exploring if some of our APIs could be changed to support

    def __init__(self, filename, ...):
        if isinstance(filename, str):
            self.file = fsspec.open(filename)
        else:
            self.file = filename  # fsspec.OpenFile

The advantage of this choice is that users can configure any file object they'd like (auth, caching, etc). It's important to note that

    with self.file as f:
        f  # now low-level file object is created

This means it's OK to pass around these objects rather than an opener function. In addition, the

    with self.file as f:
        if self._bgzf:
            f = bgzf.BgzfReader(fileobj=f)
        ...

I tried to implement something on my own yesterday, but ran into some issues with how to handle the indexfile.
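The pattern in this comment can be tried without fsspec installed. The sketch below uses a hypothetical `LazyFile` class as a stand-in for the deferred-open behaviour described above (nothing is opened until the `with` block is entered); it is not fsspec's own class, and `Reader` is likewise an invented name rather than anything from pyfaidx.

```python
import io


class LazyFile:
    """Hypothetical stand-in: holds a recipe for opening, not an open handle."""

    def __init__(self, opener):
        self._opener = opener
        self._handle = None

    def __enter__(self):
        # The low-level file object is only created here.
        self._handle = self._opener()
        return self._handle

    def __exit__(self, *exc_info):
        self._handle.close()
        self._handle = None
        return False


class Reader:
    def __init__(self, filename):
        if isinstance(filename, str):
            # Plain path: build the lazy handle ourselves.
            self.file = LazyFile(lambda: open(filename, "rb"))
        else:
            # Caller supplied a preconfigured lazy handle (auth, caching, ...).
            self.file = filename

    def first_line(self):
        with self.file as f:
            return f.readline()


# The caller decides how the bytes are obtained; here, an in-memory buffer.
reader = Reader(LazyFile(lambda: io.BytesIO(b">chr1\nACGT\n")))
header = reader.first_line()
header_again = reader.first_line()  # reopening works: each `with` starts fresh
```

Since the object stored on `self.file` is cheap and reusable, it can be kept on the instance and passed around, which is the point made above about not needing a separate opener function.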
@manzt Thanks for the detailed explanation and the example code. I have to admit that I did not understand fsspec but am now a bit more clear about why this is useful. If you want to contribute any code, even if it doesn't fully work I would be glad to collaborate.
Speaking for the
On first approximation, I don't think there will be any required changes to
Hi, as discussed in #165 here is a PR leveraging fsspec.
The code changes are straightforward, though I am having trouble getting tests to pass. Any pointers would be appreciated.
The other thing that might be worth considering is the use of Bio.bgzf. I think that reading of gz files can be done directly, as bgz should be fully compatible. It might be a case of using fsspec.open() with the compression set as "auto". Although there might be a use of bgz I am not aware of?
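As a rough check of the compatibility claim, the sketch below concatenates two ordinary gzip members (a stand-in for BGZF blocks, which are also a series of gzip members) and reads them back sequentially with the standard `gzip` module. Note that the commit message in this thread spells the fsspec option as compression = 'infer' rather than "auto".

```python
import gzip
import io

# Two independently compressed members, joined end to end like BGZF blocks.
members = [gzip.compress(b">chr1\n"), gzip.compress(b"ACGT\n")]
blob = b"".join(members)

# A plain gzip reader walks through every member in order.
with gzip.open(io.BytesIO(blob), "rb") as fh:
    data = fh.read()
```

This only covers sequential reading. Jumping to a position without decompressing everything before it still needs the block offsets.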