glob performance regression #641

@mhfrantz

When GCSFileSystem.glob is called with a pattern like "bucket-name/prefix*suffix", version 2023.9.0 shows a performance regression. Previously, this glob was resolved with an efficient API call whose cost was proportional to the number of matching objects. Since 2023.9.0, the cost appears to scale with the total number of objects in the bucket. In my system, the buckets have a "flat" pseudo-folder structure with 1e5+ objects.

Debug output from 2023.6.0:

DEBUG:gcsfs:GET: b/{}/o, ('bucket-name',), None
DEBUG:gcsfs.credentials:GCS refresh
DEBUG:google.auth.transport.requests:Making request: POST https://oauth2.googleapis.com/token
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): oauth2.googleapis.com:443
DEBUG:urllib3.connectionpool:https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None

Debug output from 2023.9.0 (and more recent versions like 2024.6.0):

DEBUG:asyncio:Using selector: EpollSelector
DEBUG:gcsfs:GET: b/{}/o, ('bucket-name',), None
DEBUG:gcsfs.credentials:GCS refresh
DEBUG:google.auth.transport.requests:Making request: POST https://oauth2.googleapis.com/token
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): oauth2.googleapis.com:443
DEBUG:urllib3.connectionpool:https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None
DEBUG:gcsfs:GET: b/{}/o, ('bucket-name',), None
[repeated 100+ times]
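To turn the log comparison above into a number, one option is to count the listing requests that appear on the `gcsfs` logger while a call runs. The following is a minimal sketch using only the standard `logging` module. The names `ListRequestCounter` and `count_list_requests` are hypothetical helpers, not part of gcsfs, and the marker string is an assumption copied from the debug lines above.

```python
import logging


class ListRequestCounter(logging.Handler):
    """Count debug records that look like object-listing requests."""

    def __init__(self, marker="GET: b/{}/o,"):
        # The trailing comma in the marker (an assumption based on the log
        # lines above) avoids matching longer paths such as "b/{}/o/{}".
        super().__init__(level=logging.DEBUG)
        self.marker = marker
        self.count = 0

    def emit(self, record):
        if self.marker in record.getMessage():
            self.count += 1


def count_list_requests(action, logger_name="gcsfs"):
    """Run `action()` and return how many listing requests were logged."""
    logger = logging.getLogger(logger_name)
    handler = ListRequestCounter()
    old_level = logger.level
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    try:
        action()
    finally:
        # Restore the logger so the measurement leaves no side effects.
        logger.removeHandler(handler)
        logger.setLevel(old_level)
    return handler.count
```

With a real `GCSFileSystem` instance, passing a callable that runs the glob from this report to `count_list_requests` should give a small count on 2023.6.0 and a much larger one on 2023.9.0, if the logs above are representative.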

Perhaps the prefix argument is no longer being passed to the GCS backend (e.g. in GCSFileSystem._list_objects). I've been studying the differences between 2023.6.0 and 2023.9.0 in both this repo and filesystem_spec, but I haven't found evidence that this change was explicit or intentional. The unit tests for glob appear to check only functional correctness, so they would not catch a performance regression.
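For illustration, here is a minimal sketch of the behavior I would expect: take the literal part of the pattern before the first wildcard, use it as the listing prefix, and only then apply the glob match. This is not the gcsfs or fsspec implementation. The helper names are hypothetical, the in-memory list stands in for the server-side `prefix` filter of the GCS objects listing API, and `fnmatch` lets `*` match `/`, which is a simplification that is acceptable for a flat bucket.

```python
import fnmatch

GLOB_CHARS = "*?["


def literal_prefix(pattern):
    """Return the part of `pattern` before the first glob metacharacter."""
    cut = len(pattern)
    for ch in GLOB_CHARS:
        idx = pattern.find(ch)
        if idx != -1:
            cut = min(cut, idx)
    return pattern[:cut]


def split_bucket(path):
    """Split "bucket/key-pattern" into (bucket, key-pattern)."""
    bucket, _, key = path.partition("/")
    return bucket, key


def glob_with_prefix(object_names, key_pattern):
    """Match `key_pattern`, scanning only names under its literal prefix.

    Returns (matches, number_of_names_scanned). The startswith filter
    stands in for a prefix-restricted listing request.
    """
    prefix = literal_prefix(key_pattern)
    candidates = [name for name in object_names if name.startswith(prefix)]
    matches = [n for n in candidates if fnmatch.fnmatchcase(n, key_pattern)]
    return matches, len(candidates)


# Hypothetical flat bucket contents, for illustration only.
names = ["prefix-a-suffix", "prefix-b-other", "zzz-suffix", "prefix-c-suffix"]
bucket, key_pattern = split_bucket("bucket-name/prefix*suffix")
matches, scanned = glob_with_prefix(names, key_pattern)
```

In this toy case only the three names starting with "prefix" are scanned rather than the whole bucket, which is the scaling behavior 2023.6.0 appeared to have.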
