Description
Describe the bug
I configured my checkpoint store in DataContextConfig with class_name: TupleAzureBlobStoreBackend, a prefix, and Azure credentials (account_url and sas_token). When I load the gx_context and call gx_context.checkpoints.get(checkpoint_name), it raises a BlobNotFound error. Adding a checkpoint works, however: it is written to Azure Blob Storage under the given prefix.
The checkpoint I am fetching is already present in Azure Blob Storage with the correct name. Digging deeper, I found that in tuple_store_backend.py (class TupleAzureBlobStoreBackend), the list_keys() method applies os.path.relpath to the obj.name returned by Azure Blob Storage:
e.g. obj.name = data_quality/checkpoint/ck.json
is converted to: az_blob_key = data_quality\checkpoint\ck.json
but my prefix is: data_quality/checkpoint
and on Windows os.path.sep = '\'
so the first if condition, which checks whether az_blob_key starts with our prefix followed by the OS separator, fails.
```python
@override
def list_keys(self, prefix: Tuple = ()) -> List[Tuple]:
    # Note that the prefix arg is only included to maintain consistency with the parent class signature  # noqa: E501 # FIXME CoP
    key_list = []
    for obj in self._container_client.list_blobs(name_starts_with=self.prefix):
        az_blob_key = os.path.relpath(obj.name)
        if az_blob_key.startswith(f"{self.prefix}{os.path.sep}"):
            az_blob_key = az_blob_key[len(self.prefix) + 1 :]
        if self._is_missing_prefix_or_suffix(
            filepath_prefix=self.filepath_prefix,
            filepath_suffix=self.filepath_suffix,
            key=az_blob_key,
        ):
            continue
        key = self._convert_filepath_to_key(az_blob_key)
        key_list.append(key)
    return key_list

@override
def _get_all(self) -> list[Any]:
    keys = self.list_keys()
    return [self._get(key) for key in keys]
```
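The mismatch in that startswith check can be reproduced without Azure or a Windows machine. This is a minimal sketch, not the library's code: it uses ntpath (the Windows implementation behind os.path) and ntpath.normpath as a stand-in for os.path.relpath(obj.name), on the assumption that for a relative blob name both simply swap "/" for "\". The variable names are illustrative.

```python
import ntpath  # Windows path rules, importable on any OS

# Values taken from the example in this report.
prefix = "data_quality/checkpoint"
blob_name = "data_quality/checkpoint/ck.json"

# Stand-in (assumption) for os.path.relpath(obj.name) on Windows:
# normalization replaces forward slashes with backslashes.
az_blob_key = ntpath.normpath(blob_name)

# Same comparison as in list_keys: the prefix still contains "/",
# while the key now contains "\", so the check does not match.
prefix_matches = az_blob_key.startswith(f"{prefix}{ntpath.sep}")

print(repr(az_blob_key))
print(prefix_matches)
```

Because the check does not match, the prefix is never stripped from the key, and the key that is later resolved no longer corresponds to the blob name that was stored.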
To Reproduce
Configure a GX context (File or Ephemeral) with a checkpoint store (class_name: TupleAzureBlobStoreBackend and prefix: "data_quality/checkpoints") and try to fetch a checkpoint using
gx_context.checkpoints.get(checkpoint_name)  # checkpoint_name without the .json suffix
Expected behavior
The list_keys method should resolve paths consistently: if it normalizes the blob key, it should normalize the prefix the same way before comparing the two.
In list_keys, the condition if az_blob_key.startswith(f"{self.prefix}{os.path.sep}"): should be changed to
if az_blob_key.startswith(f"{os.path.relpath(self.prefix)}{os.path.sep}"):
With that change it works: the checkpoint is read and returned if it is present in Azure Blob Storage under the configured prefix.
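The effect of the proposed change can be checked in isolation. This is a hedged sketch under the same assumption as the report's example: ntpath.normpath stands in for os.path.relpath on Windows, and the variable names are illustrative rather than taken from the library.

```python
import ntpath  # Windows path rules, importable on any OS

prefix = "data_quality/checkpoint"
blob_name = "data_quality/checkpoint/ck.json"

# Stand-in (assumption) for os.path.relpath(...) on Windows.
az_blob_key = ntpath.normpath(blob_name)
normalized_prefix = ntpath.normpath(prefix)

# Proposed comparison: both sides now use the same separator.
if az_blob_key.startswith(f"{normalized_prefix}{ntpath.sep}"):
    # Strip the prefix and the separator that follows it.
    az_blob_key = az_blob_key[len(normalized_prefix) + 1 :]

print(az_blob_key)
```

In this example the normalized prefix has the same length as the original one, so the existing slice by len(self.prefix) + 1 would still remove the right number of characters.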
Environment (please complete the following information):
- Operating System: Windows
- Great Expectations Version: 1.5.1
- Data Source: pandas
- Cloud environment: Azure
Additional context
Currently running in local: