Implement fsspec in place of open #168

hardingnj · 2020-08-06T09:26:58Z

Hi, as discussed in #165 here is a PR leveraging fsspec.

The code changes are straightforward. Though I am having trouble getting tests to pass. Any pointers would be appreciated.

The other thing that might be worth considering is the use of Bio.bgzf. I think that reading of gz files can be done directly, as bgz should be fully compatible. It might be a case of using fsspec.open() with the compression set as "auto". Although there might be a use of bgz I am not aware of?

mdshw5 · 2020-08-06T14:12:41Z

Thanks for working on this @hardingnj! I'll take a look and see why tests aren't passing. The use of BGZF is meant to allow block-based access to gzip compressed files (https://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html). The Bio.bgzf module is in fact using the python gzip module, but with extra logic to allow construction of virtual offsets into these compressed blocks.

deprecate bgzf lib, handled via compression = 'infer' added explicit __array__ method in FastaRecord

…implement_fsspec

hardingnj · 2020-08-06T16:16:46Z

Hi @mdshw5 ,

Have added a bit more work. I still have test failures, but these I think are expected failures that might be handled ok in fsspec.

Also added an __array__ method to FastaRecord, which I think is the preferred way of creating numpy arrays.

mdshw5 · 2020-08-06T16:31:58Z

Thanks. The test that is failing is due to python's universal newline support. pyfaidx needs to account for the number of line terminator characters to properly build its index (f314b59). The solution is to revert to open the file in binary mode so we can count the line terminator characters.

mdshw5 · 2020-08-06T16:33:58Z

pyfaidx/__init__.py

@@ -514,8 +488,8 @@ def read_fai(self):

    def build_index(self):
        try:
-            with self._fasta_opener(self.filename, 'rb') as fastafile:
-                with open(self.indexname, 'w') as indexfile:
+            with self._fasta_opener(self.filename, 'r', compression=self._compression) as fastafile:


@hardingnj The mode here should be 'rb' if possible so that later code can see the line terminator characters.

mdshw5 · 2020-08-06T17:09:04Z

It seems like the Bio.bgzf module may be able to be registered using fsspec: https://github.com/intake/filesystem_spec/blob/master/fsspec/compression.py#L19. This will be necessary - and I'm actually confused about why my BGZF tests are currently passing without it.

hardingnj · 2020-08-06T17:57:26Z

Thanks. The test that is failing is due to python's universal newline support. pyfaidx needs to account for the number of line terminator characters to properly build its index (f314b59). The solution is to revert to open the file in binary mode so we can count the line terminator characters.

Ahah- I thought it was something to do with that... though I didn't understand why you expect different line lengths depending on the presence of windows line endings.

hardingnj · 2020-08-06T17:58:19Z

It seems like the Bio.bgzf module may be able to be registered using fsspec: https://github.com/intake/filesystem_spec/blob/master/fsspec/compression.py#L19. This will be necessary - and I'm actually confused about why my BGZF tests are currently passing without it.

I'm not sure why it's necessary though? We're not compressing the file, just constructing the index from blocks, unless I am misunderstanding something?

hardingnj · 2020-08-06T17:59:53Z

pyfaidx/__init__.py

@@ -1021,6 +999,17 @@ def __getitem__(self, rname):
        if isinstance(rname, integer_types):
            rname = next(islice(self.records.keys(), rname, None))
        try:
+            #


cruft- crept in when trying to debug!

mdshw5 · 2020-08-06T23:51:09Z

It seems like the Bio.bgzf module may be able to be registered using fsspec: https://github.com/intake/filesystem_spec/blob/master/fsspec/compression.py#L19. This will be necessary - and I'm actually confused about why my BGZF tests are currently passing without it.

I'm not sure why it's necessary though? We're not compressing the file, just constructing the index from blocks, unless I am misunderstanding something?

It's necessary to be able to seek past all of the 64k blocks that don't contain the sequence you need, and also seek to the correct position within the block of interest. @peterjc did a fantastic job documenting his module, and the section on file handle offsets is more than worth your time to read. Thanks for your contributions here - I do think the fsspec module makes sense but I do want to make sure to incorporate BGZF support properly.

I've also looked more closely at the test results and my BGZF tests are indeed failing without the Bio.bgzf module. I've got some work in #164 that will make the BGZF support more robust and improve performance as well.

hardingnj · 2020-08-12T13:47:37Z

Thanks @mdshw5 . Of course- very happy to make the effort to integrate properly. I'll need to read a bit further into bgzf first though

manzt · 2022-02-08T16:43:47Z

I came across this PR yesterday when exploring if some of our API's could be changed to support fsspec file-like objects. It would be desirable to allow a users to pass their own fsspec.OpenFile if possible and only create a new one in __init__ if a string is provided. E.g.

    def __init__(self, filename, ...):
       if isinstance(filename, str):
         self.file = fsspec.open(filename)
       else:
         self.file = filename # fsspec.OpenFile

The advantage of this choice is that users can configure any file object they'd like (auth, caching, etc). It's important to note that fsspec.OpenFile objects don't actually hold an low-level file object but are containers. Low-level file objects are only created when the with context manger is used:

with self.file as f:
  f # now low-level file object is created

This means it's OK to pass around these objects rather than an opener function. In addition, the Bio.bgzf.BgzfReader can be configured explicitly to wrap a file object via bgzf.BgzfReader(fileobj=fileobj) so useage could look something like:

with self.file as f:
  if self._bgzf:
      f = bgzf.BgzfReader(fileobj=f)
  ...

I tried to implement something on my own yesterday, but ran into some issues with how to handle the indexfile.

mdshw5 · 2022-02-08T19:01:07Z

@manzt Thanks for the detailed explanation and the example code. I have to admit that I did not understand fsspec but am now a bit more clear about why this is useful. If you want to contribute any code, even if it doesn't fully work I would be glad to collaborate.

hardingnj · 2022-02-08T20:44:22Z

Thanks @manzt and @mdshw5 , I really think this is worthwhile- happy to help develop the PR to address the above issues.

peterjc · 2022-02-08T23:04:49Z

Speaking for the Bio/bgzf.py code in Biopython, I'm open to (hopefully backward compatible) changes, but as yet am unfamiliar with fsspec. It may be only changes here in pyfaidx are needed.

manzt · 2022-02-08T23:21:09Z

It may be only changes here in pyfaidx are needed.

On first approximation, I don't think there will be any required changes to Bio/bgzf.py since BgzfReader can take a file-like object instead of a path.

Nick Harding added 4 commits July 16, 2020 14:25

Add fsspec for open

477db4d

Remove print statement

8a7de5b

Add conda compat

e0e9a38

Add fsspec to ci

393804b

mdshw5 and others added 3 commits August 6, 2020 10:14

Print resulting index to inspect

bc21ce3

Further work on fsspec

9679b97

deprecate bgzf lib, handled via compression = 'infer' added explicit __array__ method in FastaRecord

Merge branch 'implement_fsspec' of github.com:hardingnj/pyfaidx into …

6b3af27

…implement_fsspec

mdshw5 reviewed Aug 6, 2020

View reviewed changes

hardingnj commented Aug 6, 2020

View reviewed changes

manzt mentioned this pull request Feb 7, 2022

Ability to support file-like objects in clodius.tiles higlass/clodius#142

Open

ap-- mentioned this pull request May 31, 2022

Add fsspec support #190

Merged

mdshw5 closed this Jun 1, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement fsspec in place of open #168

Implement fsspec in place of open #168

hardingnj commented Aug 6, 2020

mdshw5 commented Aug 6, 2020

hardingnj commented Aug 6, 2020

mdshw5 commented Aug 6, 2020

mdshw5 Aug 6, 2020

mdshw5 commented Aug 6, 2020

hardingnj commented Aug 6, 2020

hardingnj commented Aug 6, 2020

hardingnj Aug 6, 2020

mdshw5 commented Aug 6, 2020

hardingnj commented Aug 12, 2020

manzt commented Feb 8, 2022 •

edited

Loading

mdshw5 commented Feb 8, 2022

hardingnj commented Feb 8, 2022

peterjc commented Feb 8, 2022

manzt commented Feb 8, 2022 •

edited

Loading

Implement fsspec in place of open #168

Implement fsspec in place of open #168

Conversation

hardingnj commented Aug 6, 2020

mdshw5 commented Aug 6, 2020

hardingnj commented Aug 6, 2020

mdshw5 commented Aug 6, 2020

mdshw5 Aug 6, 2020

Choose a reason for hiding this comment

mdshw5 commented Aug 6, 2020

hardingnj commented Aug 6, 2020

hardingnj commented Aug 6, 2020

hardingnj Aug 6, 2020

Choose a reason for hiding this comment

mdshw5 commented Aug 6, 2020

hardingnj commented Aug 12, 2020

manzt commented Feb 8, 2022 • edited Loading

mdshw5 commented Feb 8, 2022

hardingnj commented Feb 8, 2022

peterjc commented Feb 8, 2022

manzt commented Feb 8, 2022 • edited Loading

manzt commented Feb 8, 2022 •

edited

Loading

manzt commented Feb 8, 2022 •

edited

Loading