Changes from 58 commits
65 commits
ecbec0e
adds logic for handling requests from new adapter if bucket type is s…
ankitaluthra1 Sep 27, 2025
d506a60
adds logic for handling requests from new adapter if bucket type is z…
ankitaluthra1 Oct 1, 2025
9a70933
refactor bucket type enum to include value for HNS bucket type | Fixe…
ankitaluthra1 Oct 3, 2025
eb209f5
adds feature toggle behind which experimental feature support would b…
ankitaluthra1 Oct 6, 2025
0b606b1
updates tests to include experimental feature toggle
ankitaluthra1 Oct 6, 2025
409c86e
minor fix
ankitaluthra1 Oct 6, 2025
8a4ca85
moves mrd creation inside core.py
ankitaluthra1 Oct 11, 2025
c5a9eae
moves mrd creation to separate GCSFile
ankitaluthra1 Oct 12, 2025
e95495f
minor comment
ankitaluthra1 Oct 12, 2025
95440d5
Extend gcsfs to create new filesystem
suni72 Oct 11, 2025
dfcbb7f
Extend gcsfs and gcsfile to override methods. download happens synchr…
suni72 Oct 11, 2025
f3d1031
Merge pull request from ankitaluthra1/lankita-poc-zonal
ankitaluthra1 Oct 13, 2025
20b36fc
fixes new test in test_core.py
ankitaluthra1 Oct 15, 2025
c9f569a
refactors/renames GCS Filesystem Adapter
ankitaluthra1 Oct 15, 2025
e5fc935
Override _cat_file to handle other read methods of gcsfs
suni72 Oct 15, 2025
50d30b3
creates grpc client inside GCSHNSFilesystem
ankitaluthra1 Oct 16, 2025
cbf00b3
refactors to reuse grpc client in cat_ranges
ankitaluthra1 Oct 16, 2025
a202579
Move classmethods in ZonalFile to a util file
suni72 Oct 17, 2025
5064a97
Implement logic to process limits as offset and length
suni72 Oct 19, 2025
ac276f6
Add fallback logic for non zonal buckets
suni72 Oct 19, 2025
dbecb12
Add unit tests for zb_hns_utils
suni72 Oct 20, 2025
37a495b
Add test for GCSFSAdapter with read_block
suni72 Oct 21, 2025
cba7d27
Update _get_storage_layout to use a single return statement
suni72 Oct 24, 2025
4ff9fe4
Updated gcs_adapter.open to pass on correct default values to GCSFile
suni72 Oct 27, 2025
389a4b0
Move logic for handling 0 length in MRD to zb_hns_utils
suni72 Oct 27, 2025
133b4fa
Add comments for clarity
suni72 Oct 27, 2025
31b2a2f
Merge pull request #4 from ankitaluthra1/zb-features
suni72 Oct 27, 2025
e36b7ed
Updated zonal file to only create mrd for read mode.
suni72 Oct 27, 2025
25cd0ef
Updated gcs_adapter fixture to not setup bucket when real gcs endpoin…
suni72 Oct 27, 2025
8425464
Updated gcs_adapter.open to pass on correct default values to GCSFile
suni72 Oct 27, 2025
75fecce
Move logic for handling 0 length in MRD to zb_hns_utils
suni72 Oct 27, 2025
2b6af9c
Updated zonal file to only create mrd for read mode.
suni72 Oct 27, 2025
b648df4
Updated gcs_adapter fixture to not setup bucket when real gcs endpoin…
suni72 Oct 27, 2025
f71f4e8
Merge branch 'zb-ft-2' into zb-features
suni72 Oct 27, 2025
1c99137
Update test_read_block_zb to use subtests to avoid frequent setup run
suni72 Oct 28, 2025
1099375
fix: Optimizes info() and exists() methods
Mahalaxmibejugam Oct 29, 2025
d834b07
fix: Optimizes info() and exists() methods
Mahalaxmibejugam Oct 29, 2025
bfd513f
fixes lint errors
ankitaluthra1 Oct 29, 2025
efabe35
fixes comments
ankitaluthra1 Oct 29, 2025
2ed3cc6
fixes lint errors
ankitaluthra1 Oct 29, 2025
957d7b5
adds grpc and google-iam dependency
ankitaluthra1 Oct 29, 2025
cd222cb
Fix missing argument in open
suni72 Oct 30, 2025
17618d5
Fix: Raise NotImplementedError for modes other than read in Zonal bucket
suni72 Oct 30, 2025
064c286
Add ClientInfo in AsyncGrpcClient
suni72 Oct 30, 2025
8b2a8d9
refactors storage layout to use sdk control client
ankitaluthra1 Oct 31, 2025
b1f0117
fixes lint errors
ankitaluthra1 Oct 31, 2025
b254a14
Merge branch 'internal-main' into zb-features
ankitaluthra1 Nov 2, 2025
7983528
Merge pull request #5 from ankitaluthra1/zb-features
ankitaluthra1 Nov 2, 2025
cdf574a
Merge pull request from ankitaluthra1/internal-main
ankitaluthra1 Nov 2, 2025
b411bef
Merge branch 'fsspec:main' into main
ankitaluthra1 Nov 2, 2025
c47b53f
fixes lint errors
ankitaluthra1 Nov 3, 2025
e37d0fb
fixes lint errors
ankitaluthra1 Nov 3, 2025
11af000
fixes conda install error
ankitaluthra1 Nov 3, 2025
c4cf777
mocks fake test credentials for grpc client
ankitaluthra1 Nov 3, 2025
5714012
fixes conflicting lint rules black and isort
ankitaluthra1 Nov 3, 2025
fccd43a
adds missing pytest package in conda install
ankitaluthra1 Nov 3, 2025
11dd722
refactor get bucket type
ankitaluthra1 Nov 4, 2025
a2b5077
Implement Zonal Read Stream Cleanup (#7)
suni72 Nov 4, 2025
0c3df8e
adds GCSFS_EXPERIMENTAL_ZB_HNS_SUPPORT as env variable instead of kwargs
ankitaluthra1 Nov 9, 2025
0e9c2ab
removes timeout from pytest fixture coming as mark has no effect on t…
ankitaluthra1 Nov 9, 2025
c3ad9b2
Refactor: Rename GcsFileSystemAdapter to ExtendedGcsFileSystem & Fix …
suni72 Nov 12, 2025
a8945c2
Replaces __new__ with conditional import in init
ankitaluthra1 Nov 12, 2025
774b53d
Merge branch 'main' into main
ankitaluthra1 Nov 14, 2025
90d0cf4
fixes lint errors
ankitaluthra1 Nov 14, 2025
8672c28
simplified logic in cleanup_gcs for unit tests
suni72 Nov 15, 2025
1 change: 1 addition & 0 deletions .isort.cfg
@@ -1,2 +1,3 @@
[settings]
profile = black
known_third_party = aiohttp,click,decorator,fsspec,fuse,google,google_auth_oauthlib,pytest,requests,setuptools
2 changes: 2 additions & 0 deletions environment_gcsfs.yaml
@@ -13,9 +13,11 @@ dependencies:
  - google-auth-oauthlib
  - google-cloud-core
  - google-cloud-storage
  - grpcio
  - pytest
  - pytest-timeout
  - pytest-asyncio
  - pytest-subtests
  - requests
  - ujson
  - pip:
16 changes: 16 additions & 0 deletions gcsfs/core.py
@@ -282,6 +282,20 @@ class GCSFileSystem(asyn.AsyncFileSystem):
     protocol = "gs", "gcs"
     async_impl = True

+    def __new__(cls, *args, **kwargs):
+        """
+        Factory to return a GCSFileSystemAdapter instance if the experimental
+        flag is enabled.
+        """
+        experimental_support = kwargs.pop("experimental_zb_hns_support", False)
+
+        if experimental_support:
+            from .gcsfs_adapter import GCSFileSystemAdapter
+
+            return object.__new__(GCSFileSystemAdapter)
+        else:
+            return object.__new__(cls)
+
     def __init__(
         self,
         project=DEFAULT_PROJECT,
@@ -301,6 +315,7 @@ def __init__(
         endpoint_url=None,
         default_location=None,
         version_aware=False,
+        experimental_zb_hns_support=False,
         **kwargs,
     ):
         if cache_timeout is not None:
@@ -327,6 +342,7 @@ def __init__(
         self.session_kwargs = session_kwargs or {}
         self.default_location = default_location
         self.version_aware = version_aware
+        self.experimental_zb_hns_support = experimental_zb_hns_support

         if check_connection:
             warnings.warn(
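The `__new__` hook in this hunk acts as a small factory: the base class decides at construction time which concrete class to instantiate. The following is a minimal standalone sketch of that pattern, not the gcsfs code itself. `BaseFS`, `AdapterFS` and the `experimental` flag are hypothetical stand-ins, and fsspec's instance caching is ignored. It also shows why the subclass still has to drop the flag in its own `__init__`: popping a key from `kwargs` inside `__new__` only changes a local copy, so Python passes the original keyword arguments on to `__init__`.

```python
class BaseFS:
    """Hypothetical stand-in for the base filesystem class."""

    def __new__(cls, *args, **kwargs):
        # kwargs is a fresh dict for this call; popping here does not
        # change the keyword arguments later passed to __init__.
        experimental = kwargs.pop("experimental", False)
        if experimental:
            return object.__new__(AdapterFS)
        return object.__new__(cls)

    def __init__(self, name="fs", experimental=False):
        self.name = name
        self.experimental = experimental


class AdapterFS(BaseFS):
    """Hypothetical stand-in for the experimental subclass."""

    def __init__(self, *args, **kwargs):
        # __init__ still receives the flag, so it is removed here
        # before delegating to the base initializer.
        kwargs.pop("experimental", None)
        super().__init__(*args, **kwargs)
        self.extra = "adapter-only state"


if __name__ == "__main__":
    print(type(BaseFS()).__name__)
    print(type(BaseFS(experimental=True)).__name__)
```

Because `AdapterFS` is a subclass of `BaseFS`, the object returned from `__new__` passes the `isinstance` check that Python applies before calling `__init__`, so the subclass initializer runs as usual.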
228 changes: 228 additions & 0 deletions gcsfs/gcsfs_adapter.py
@@ -0,0 +1,228 @@
import logging
from enum import Enum

from fsspec import asyn
from google.api_core import exceptions as api_exceptions
from google.api_core import gapic_v1
from google.api_core.client_info import ClientInfo
from google.cloud import storage_control_v2
from google.cloud.storage._experimental.asyncio.async_grpc_client import AsyncGrpcClient

from . import __version__ as version
from . import zb_hns_utils
from .core import GCSFile, GCSFileSystem
from .zonal_file import ZonalFile

logger = logging.getLogger("gcsfs")

USER_AGENT = "python-gcsfs"


class BucketType(Enum):
    ZONAL_HIERARCHICAL = "ZONAL_HIERARCHICAL"
    HIERARCHICAL = "HIERARCHICAL"
    NON_HIERARCHICAL = "NON_HIERARCHICAL"
    UNKNOWN = "UNKNOWN"


gcs_file_types = {
    BucketType.ZONAL_HIERARCHICAL: ZonalFile,
    BucketType.NON_HIERARCHICAL: GCSFile,
    BucketType.HIERARCHICAL: GCSFile,
    BucketType.UNKNOWN: GCSFile,
}


class GCSFileSystemAdapter(GCSFileSystem):
    """
    This class will be used when experimental_zb_hns_support is set to true for all bucket types.
    GCSFileSystemAdapter is a subclass of GCSFileSystem that adds specialized
    logic to support Zonal and Hierarchical buckets.
    """

    def __init__(self, *args, **kwargs):
        kwargs.pop("experimental_zb_hns_support", None)
        super().__init__(*args, **kwargs)
        self.grpc_client = None
        self.storage_control_client = None
        # initializing grpc and storage control client for Hierarchical and
        # zonal bucket operations
        self.grpc_client = asyn.sync(self.loop, self._create_grpc_client)
        self._storage_control_client = asyn.sync(
            self.loop, self._create_control_plane_client
        )
        self._storage_layout_cache = {}

    async def _create_grpc_client(self):
        if self.grpc_client is None:
            return AsyncGrpcClient(
                client_info=ClientInfo(user_agent=f"{USER_AGENT}/{version}"),
            ).grpc_client
        else:
            return self.grpc_client

    async def _create_control_plane_client(self):
        # Initialize the storage control plane client for bucket
        # metadata operations
        client_info = gapic_v1.client_info.ClientInfo(
            user_agent=f"{USER_AGENT}/{version}"
        )
        return storage_control_v2.StorageControlAsyncClient(
            credentials=self.credentials.credentials, client_info=client_info
        )

    async def _get_bucket_type(self, bucket):
        if bucket in self._storage_layout_cache:
            return self._storage_layout_cache[bucket]
        try:

            # Bucket name details
            bucket_name_value = f"projects/_/buckets/{bucket}/storageLayout"
            # Make the request to get bucket type
            response = await self._storage_control_client.get_storage_layout(
                name=bucket_name_value
            )

            if response.location_type == "zone":
                return BucketType.ZONAL_HIERARCHICAL
            else:
                # This should be updated to include HNS in the future
                return BucketType.NON_HIERARCHICAL
        except api_exceptions.NotFound:
            print(f"Error: Bucket {bucket} not found or you lack permissions.")
            return BucketType.UNKNOWN
        except Exception as e:
            logger.error(
                f"Could not determine bucket type for bucket name {bucket}: {e}"
            )
            # Default to UNKNOWN
            return BucketType.UNKNOWN

    _sync_get_bucket_type = asyn.sync_wrapper(_get_bucket_type)

    def _open(
        self,
        path,
        mode="rb",
        block_size=None,
        cache_options=None,
        acl=None,
        consistency=None,
        metadata=None,
        autocommit=True,
        fixed_key_metadata=None,
        generation=None,
        **kwargs,
    ):
        """
        Open a file.
        """
        bucket, _, _ = self.split_path(path)
        bucket_type = self._sync_get_bucket_type(bucket)
        self._storage_layout_cache[bucket] = bucket_type
        return gcs_file_types[bucket_type](
            self,
            path,
            mode,
            block_size,
            cache_options=cache_options,
            consistency=consistency,
            metadata=metadata,
            acl=acl,
            autocommit=autocommit,
            fixed_key_metadata=fixed_key_metadata,
            generation=generation,
            **kwargs,
        )

    # Replacement method for _process_limits to support new params (offset and length) for MRD.
    async def _process_limits_to_offset_and_length(self, path, start, end):
        """
        Calculates the read offset and length from start and end parameters.

        Args:
            path (str): The path to the file.
            start (int | None): The starting byte position.
            end (int | None): The ending byte position.

        Returns:
            tuple: A tuple containing (offset, length).

        Raises:
            ValueError: If the calculated range is invalid.
        """
        size = None

        if start is None:
            offset = 0
        elif start < 0:
            size = size or (await self._info(path))["size"]
            offset = size + start
        else:
            offset = start

        if end is None:
            size = size or (await self._info(path))["size"]
            effective_end = size
        elif end < 0:
            size = size or (await self._info(path))["size"]
            effective_end = size + end
        else:
            effective_end = end

        size = size or (await self._info(path))["size"]
        if offset < 0:
            raise ValueError(f"Calculated start offset ({offset}) cannot be negative.")
        if effective_end < offset:
            raise ValueError(
                f"Calculated end position ({effective_end}) cannot be before start offset ({offset})."
            )
        elif effective_end == offset:
            length = 0  # Handle zero-length slice
        elif effective_end > size:
            length = max(0, size - offset)  # Clamp and ensure non-negative
        else:
            length = effective_end - offset  # Normal case

        return offset, length

    sync_process_limits_to_offset_and_length = asyn.sync_wrapper(
        _process_limits_to_offset_and_length
    )

    async def _is_zonal_bucket(self, bucket):
        bucket_type = await self._get_bucket_type(bucket)
        self._storage_layout_cache[bucket] = bucket_type
        return bucket_type == BucketType.ZONAL_HIERARCHICAL

    async def _cat_file(self, path, start=None, end=None, **kwargs):
        """
        Fetch a file's contents as bytes.
        """
        mrd = kwargs.pop("mrd", None)
        mrd_created = False

        # A new MRD is required when read is done directly by the
        # GCSFilesystem class without creating a GCSFile object first.
        if mrd is None:
            bucket, object_name, generation = self.split_path(path)
            # Fall back to default implementation if not a zonal bucket
            if not await self._is_zonal_bucket(bucket):
                return await super()._cat_file(path, start=start, end=end, **kwargs)

            mrd = await zb_hns_utils.create_mrd(
                self.grpc_client, bucket, object_name, generation
            )
            mrd_created = True

        offset, length = await self._process_limits_to_offset_and_length(
            path, start, end
        )
        try:
            return await zb_hns_utils.download_range(
                offset=offset, length=length, mrd=mrd
            )
        finally:
            # Explicit cleanup if we created the MRD and it has a close method
            if mrd_created:
                await mrd.close()
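To make the range arithmetic in `_process_limits_to_offset_and_length` easier to check by hand, here is a synchronous restatement as a pure function. It is a sketch under assumptions, not the PR's code: the name `limits_to_offset_and_length` is hypothetical, and the object size is passed in directly instead of being fetched with `_info`. The semantics follow Python slicing (`start` inclusive, `end` exclusive, negative values counted from the end), with the end clamped to the object size.

```python
def limits_to_offset_and_length(size, start=None, end=None):
    """Translate slice-style (start, end) limits into (offset, length).

    `size` is the object size in bytes; in the PR it comes from `_info(path)`.
    """
    # Resolve the start position; negative values count from the end.
    if start is None:
        offset = 0
    elif start < 0:
        offset = size + start
    else:
        offset = start

    # Resolve the exclusive end position the same way.
    if end is None:
        effective_end = size
    elif end < 0:
        effective_end = size + end
    else:
        effective_end = end

    if offset < 0:
        raise ValueError(f"Calculated start offset ({offset}) cannot be negative.")
    if effective_end < offset:
        raise ValueError(
            f"Calculated end position ({effective_end}) cannot be before "
            f"start offset ({offset})."
        )

    # Clamp the end to the object size and never return a negative length.
    effective_end = min(effective_end, size)
    return offset, max(0, effective_end - offset)


if __name__ == "__main__":
    for limits in [(None, None), (2, 5), (-3, None), (0, -2), (4, 100)]:
        print(limits, "->", limits_to_offset_and_length(10, *limits))
```

One detail that may be worth checking in the async version: `size = size or (await self._info(path))["size"]` treats a size of 0 as "not fetched yet", so an empty object can trigger more than one `_info` call.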
49 changes: 46 additions & 3 deletions gcsfs/tests/conftest.py
@@ -2,6 +2,8 @@
 import shlex
 import subprocess
 import time
+from contextlib import nullcontext
+from unittest.mock import patch

 import fsspec
 import pytest
@@ -91,10 +93,9 @@ def docker_gcs():
 def gcs_factory(docker_gcs):
     params["endpoint_url"] = docker_gcs

-    def factory(default_location=None):
+    def factory(**kwargs):
         GCSFileSystem.clear_instance_cache()
-        params["default_location"] = default_location
-        return fsspec.filesystem("gcs", **params)
+        return fsspec.filesystem("gcs", **params, **kwargs)

     return factory

@@ -125,6 +126,48 @@ def gcs(gcs_factory, populate=True):
             pass


+@pytest.fixture
+def gcs_adapter(gcs_factory, populate=True):
+    # Check if we are running against a real GCS endpoint
+    is_real_gcs = (
+        os.environ.get("STORAGE_EMULATOR_HOST") == "https://storage.googleapis.com"
+    )
+
+    patch_manager = (
+        patch("google.auth.default", return_value=(None, "fake-project"))
+        if not is_real_gcs
+        else nullcontext()
+    )
+
+    with patch_manager:
+        gcs_adapter = gcs_factory(experimental_zb_hns_support=True)
+        try:
+            # Only create/delete/populate the bucket if we are NOT using the real GCS endpoint
+            if not is_real_gcs:
+                try:
+                    gcs_adapter.rm(TEST_BUCKET, recursive=True)
+                except FileNotFoundError:
+                    pass
+                try:
+                    gcs_adapter.mkdir(TEST_BUCKET)
+                except Exception:
+                    pass
+                if populate:
+                    gcs_adapter.pipe(
+                        {TEST_BUCKET + "/" + k: v for k, v in allfiles.items()}
+                    )
+            gcs_adapter.invalidate_cache()
+            yield gcs_adapter
Review comment (Member):
If we fail to reach here because of a setup exception, we will exit the fixture and return None to the test function? It would be better to expose the error (which would cause an E in a test run).
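One way to picture the concern in this comment is a fixture layout in which setup runs before the `try`, so a setup failure surfaces as an error, and only teardown sits in `finally`. The sketch below uses `contextlib.contextmanager` as a stand-in for a pytest yield fixture (both are generator based); `bucket_resource`, `log` and `fail_setup` are hypothetical names introduced for illustration and do not appear in the PR.

```python
from contextlib import contextmanager


@contextmanager
def bucket_resource(log, fail_setup=False):
    """Hypothetical resource with explicit setup and teardown phases."""
    # Setup is outside the try block: if it raises, the error propagates
    # to the caller instead of being hidden, and nothing is yielded.
    if fail_setup:
        raise RuntimeError("setup failed")
    log.append("setup")
    try:
        yield "bucket"
    finally:
        # Teardown always runs once setup has succeeded.
        log.append("teardown")


if __name__ == "__main__":
    events = []
    with bucket_resource(events) as bucket:
        events.append(f"test body using {bucket}")
    print(events)
```

In a pytest fixture written this way, an exception raised before the `yield` is reported as an error for the test that requested the fixture, which is the behaviour the comment asks for.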

+        finally:
+            try:
+                # Only remove the bucket/contents if we are NOT using the real GCS
+                if not is_real_gcs:
+                    gcs_adapter.rm(gcs_adapter.find(TEST_BUCKET), recursive=True)
+                    gcs_adapter.rm(TEST_BUCKET)
Review comment (Member):

Suggested change
-                    gcs_adapter.rm(gcs_adapter.find(TEST_BUCKET), recursive=True)
-                    gcs_adapter.rm(TEST_BUCKET)
+                    gcs_adapter.rm(TEST_BUCKET, recursive=True)

+            except Exception:
+                pass
+
+
 @pytest.fixture
 def gcs_versioned(gcs_factory):
     gcs = gcs_factory()
22 changes: 22 additions & 0 deletions gcsfs/tests/test_core.py
@@ -4,6 +4,7 @@
 from datetime import datetime, timezone
 from itertools import chain
 from unittest import mock
+from unittest.mock import patch
 from urllib.parse import parse_qs, unquote, urlparse
 from uuid import uuid4

@@ -18,6 +19,7 @@
 from gcsfs import __version__ as version
 from gcsfs.core import GCSFileSystem, quote
 from gcsfs.credentials import GoogleCredentials
+from gcsfs.gcsfs_adapter import GCSFileSystemAdapter
 from gcsfs.tests.conftest import a, allfiles, b, csv_files, files, text_files
 from gcsfs.tests.utils import tempdir, tmpfile

@@ -1731,3 +1733,23 @@ def test_near_find(gcs):
 def test_get_error(gcs):
     with pytest.raises(FileNotFoundError):
         gcs.get_file(f"{TEST_BUCKET}/doesnotexist", "other")
+
+
+def test_gcs_filesystem_when_experimental_zonal_toggle_is_not_passed(gcs_factory):
+    gcs = gcs_factory()
+
+    assert isinstance(
+        gcs, gcsfs.GCSFileSystem
+    ), "Expected File system instance to be GCSFileSystem"
+    assert not isinstance(
+        gcs, GCSFileSystemAdapter
+    ), "Expected File system instance to be GCSFileSystem"
+
+
+def test_gcs_filesystem_adapter_when_experimental_zonal_toggle_is_true(gcs_factory):
+    with patch("google.auth.default", return_value=(None, "fake-project")):
+        gcs = gcs_factory(experimental_zb_hns_support=True)
+
+    assert isinstance(
+        gcs, GCSFileSystemAdapter
+    ), "Expected File system instance to be GCSFileSystemAdapter"