Open
Description
Hi,
We went down a rabbit hole trying to track this one down:
apache/arrow#31339
It turns out that pandas can't read partitioned Parquet files from a directory because PyArrow relies on GCSFS.
However, there seems to be no mention of this in this repo. Are you aware of any situation where the library is non-deterministic or has caching issues when listing a directory?
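import gcsfs
# Directory that holds the partitioned Parquet files (no trailing slash)
PATH = "bucket-dev-storage/kedro/staging/data/05_model_input/drugs_diseases_nodes"
fs = gcsfs.GCSFileSystem()
# Call info() three times on the same path, with no writes in between
print(fs.info(PATH))
print(fs.info(PATH))
print(fs.info(PATH))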
Returns:
{'kind': 'storage#object', 'id': 'bucket-dev-storage/kedro/staging/data/05_model_input/drugs_diseases_nodes//1721313663057121', 'selfLink': 'https://www.googleapis.com/storage/v1/b/bucket-dev-storage/o/kedro%2Fstaging%2Fdata%2F05_model_input%2Fdrugs_diseases_nodes%2F', 'mediaLink': 'https://storage.googleapis.com/download/storage/v1/b/bucket-dev-storage/o/kedro%2Fstaging%2Fdata%2F05_model_input%2Fdrugs_diseases_nodes%2F?generation=1721313663057121&alt=media', 'name': 'bucket-dev-storage/kedro/staging/data/05_model_input/drugs_diseases_nodes/', 'bucket': 'bucket-dev-storage', 'generation': '1721313663057121', 'metageneration': '1', 'contentType': 'application/octet-stream', 'storageClass': 'STANDARD', 'size': 0, 'md5Hash': '1B2M2Y8AsgTpgAmY7PhCfg==', 'crc32c': 'AAAAAA==', 'etag': 'COH5u4vpsIcDEAE=', 'timeCreated': '2024-07-18T14:41:03.059Z', 'updated': '2024-07-18T14:41:03.059Z', 'timeStorageClassUpdated': '2024-07-18T14:41:03.059Z', 'type': 'file'}
{'bucket': 'bucket-dev-storage', 'name': 'bucket-dev-storage/kedro/staging/data/05_model_input/drugs_diseases_nodes', 'size': 0, 'storageClass': 'DIRECTORY', 'type': 'directory'}
{'bucket': 'bucket-dev-storage', 'name': 'bucket-dev-storage/kedro/staging/data/05_model_input/drugs_diseases_nodes', 'size': 0, 'storageClass': 'DIRECTORY', 'type': 'directory'}
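Note that the first call returns a different result from the second and third: the first reports a zero-byte object whose name ends in "/" with 'type': 'file', while the other two report 'type': 'directory' for the same path. What's up with that?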
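In case it helps to narrow this down, here is a minimal sketch of how the comparison above could be repeated with and without clearing the listing cache. The helper names `info_types` and `is_consistent` are made up for illustration; the only filesystem calls used are `info()` and `invalidate_cache()`, which fsspec-style filesystems such as `gcsfs.GCSFileSystem` provide. Whether caching is actually the cause is an assumption we have not confirmed.

```python
def info_types(fs, path, calls=3, invalidate=False):
    """Call fs.info(path) repeatedly and collect the reported 'type' values.

    `fs` is any fsspec-style filesystem, e.g. gcsfs.GCSFileSystem().
    With invalidate=True the listing cache is cleared before every call,
    so no call can be answered from listings cached by an earlier one.
    """
    types = []
    for _ in range(calls):
        if invalidate:
            fs.invalidate_cache()
        types.append(fs.info(path)["type"])
    return types


def is_consistent(types):
    """True when every call reported the same type."""
    return len(set(types)) <= 1


# Hypothetical usage against the bucket from this report (needs GCS access):
#
#   import gcsfs
#   fs = gcsfs.GCSFileSystem()
#   print(info_types(fs, PATH))                   # cache left alone
#   print(info_types(fs, PATH, invalidate=True))  # cache cleared each time
```

If the types only disagree when the cache is left alone, that would point at the cached directory listing rather than at the bucket contents.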