Unable to fetch checkpoints from AzureBlobStorage when using TupleAzureBlobStorageBackend with a prefix "first_part/second_part" in Windows #11258

@Rohitjoshi07

Describe the bug
I configured my checkpoint store in the DataContextConfig with class_name: TupleAzureBlobStoreBackend, a prefix, and Azure credentials (account_url and sas_token). When I load the gx_context and call gx_context.checkpoints.get(checkpoint_name), it raises a BlobNotFound error. Adding a checkpoint works, however: it is written to Azure Blob Storage under the given prefix.

The checkpoint I am fetching is already present in Azure Blob Storage with the correct name. Digging deeper, I found the cause in tuple_store_backend.py (class: TupleAzureBlobStoreBackend), in the list_keys() method, which calls os.path.relpath on obj.name from the Azure blob store:
ex: obj.name = data_quality/checkpoint/ck.json
is converted to: az_blob_key = data_quality\checkpoint\ck.json
but my prefix is: data_quality/checkpoint
and on Windows: os.path.sep = '\'
Hence the first if condition, which checks whether az_blob_key starts with our prefix followed by the OS separator, fails, and the prefix is never stripped from the key.
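
The mismatch can be reproduced on any OS with the standard-library `ntpath` and `posixpath` modules, which are what `os.path` resolves to on Windows and POSIX respectively. This is only a sketch: `strip_prefix_as_in_issue` is a hypothetical helper that mirrors the prefix check in list_keys(), and `normpath` is used as a stand-in for `os.path.relpath` on the assumption that both give the same result for a relative blob name without `..` segments.

```python
import ntpath      # Windows flavour of os.path
import posixpath   # POSIX flavour of os.path

BLOB_NAME = "data_quality/checkpoint/ck.json"  # blob names always use "/"
PREFIX = "data_quality/checkpoint"


def strip_prefix_as_in_issue(blob_name, prefix, pathmod):
    """Hypothetical helper mirroring the prefix check quoted in this issue."""
    # Stand-in for os.path.relpath(obj.name): on Windows this turns "/" into "\".
    key = pathmod.normpath(blob_name)
    # The prefix itself is NOT normalized, so it still contains "/".
    if key.startswith(f"{prefix}{pathmod.sep}"):
        key = key[len(prefix) + 1:]
    return key


print(strip_prefix_as_in_issue(BLOB_NAME, PREFIX, posixpath))
# ck.json
print(strip_prefix_as_in_issue(BLOB_NAME, PREFIX, ntpath))
# data_quality\checkpoint\ck.json  (prefix not stripped)
```

With the Windows path module the key keeps its prefix, so the key later built from it no longer matches the name of the stored blob.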

```python
@override
def list_keys(self, prefix: Tuple = ()) -> List[Tuple]:
    # Note that the prefix arg is only included to maintain consistency with the parent class signature  # noqa: E501 # FIXME CoP
    key_list = []

    for obj in self._container_client.list_blobs(name_starts_with=self.prefix):
        az_blob_key = os.path.relpath(obj.name)
        if az_blob_key.startswith(f"{self.prefix}{os.path.sep}"):
            az_blob_key = az_blob_key[len(self.prefix) + 1 :]

        if self._is_missing_prefix_or_suffix(
            filepath_prefix=self.filepath_prefix,
            filepath_suffix=self.filepath_suffix,
            key=az_blob_key,
        ):
            continue
        key = self._convert_filepath_to_key(az_blob_key)

        key_list.append(key)
    return key_list

@override
def _get_all(self) -> list[Any]:
    keys = self.list_keys()
    return [self._get(key) for key in keys]
```

To Reproduce

Configure a GX context (File or Ephemeral) with a checkpoint store (class_name: TupleAzureBlobStoreBackend and prefix: "data_quality/checkpoints"), then try to fetch a checkpoint using
gx_context.checkpoints.get(checkpoint_name)  # checkpoint_name without the .json suffix

Expected behavior
The list_keys method should resolve the paths consistently: since it normalizes the blob key, it should normalize the prefix too before comparing the two.
In the list_keys method, the check if az_blob_key.startswith(f"{self.prefix}{os.path.sep}"): should be changed to
if az_blob_key.startswith(f"{os.path.relpath(self.prefix)}{os.path.sep}"):
With that change it works: the checkpoint is read and returned if it is present in Azure Blob Storage under the configured prefix.
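
A minimal sketch of why normalizing the prefix helps, again emulating Windows with `ntpath`. `strip_prefix_normalized` is a hypothetical helper, not the library's code, and `normpath` stands in for `os.path.relpath` on the assumption that the two agree for relative paths without `..` segments. It also slices by the length of the normalized prefix, which would matter if normalization ever changed the prefix length (for example a trailing slash).

```python
import ntpath      # Windows flavour of os.path
import posixpath   # POSIX flavour of os.path


def strip_prefix_normalized(blob_name, prefix, pathmod):
    """Hypothetical variant of the check with the prefix normalized as well."""
    key = pathmod.normpath(blob_name)
    # Normalize the prefix the same way as the key so the separators agree.
    lead = f"{pathmod.normpath(prefix)}{pathmod.sep}"
    if key.startswith(lead):
        key = key[len(lead):]
    return key


blob_name = "data_quality/checkpoint/ck.json"
prefix = "data_quality/checkpoint"

print(strip_prefix_normalized(blob_name, prefix, ntpath))     # ck.json
print(strip_prefix_normalized(blob_name, prefix, posixpath))  # ck.json
```

An alternative that might be worth considering is to skip `os.path` for blob names altogether, since Azure blob names use "/" regardless of the client OS.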

Environment (please complete the following information):

  • Operating System: Windows
  • Great Expectations Version: 1.5.1
  • Data Source: pandas
  • Cloud environment: Azure

Additional context
Currently running locally (two screenshots were attached to the original issue).
