Description
I recently updated a code base from an older version of fastparquet (2023.2.* to 2024.11.0), and reading in some data started to raise a cryptic error:
ValueError: Timestamp('1-01-01 00:00:00') is not in list
I was able to track this down to two specific partitions in my data. My data is stored with partitions like hospital=XYZ, and the two offending partitions were hospital=RD1 and hospital=RD8. (For reference, these are codes identifying hospitals in England, which largely follow the pattern of 3 characters: an R followed by 2 other alphanumeric characters, for example R0A, RNA.)
Minimal Complete Verifiable Example:
This simpler example does not raise the same exception, but it does show the partition values being misread:
import os
import tempfile
import pandas as pd
tmpdir = tempfile.mkdtemp()
os.makedirs(f"{tmpdir}/hospital=RD1", exist_ok=True)
os.makedirs(f"{tmpdir}/hospital=RD8", exist_ok=True)
df1 = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})
df2 = pd.DataFrame({"id": [4, 5, 6], "value": [40, 50, 60]})
df1.to_parquet(f"{tmpdir}/hospital=RD1/part-00000.parquet")
df2.to_parquet(f"{tmpdir}/hospital=RD8/part-00000.parquet")
pd.read_parquet(tmpdir, engine="fastparquet")
This gives me
id | value | hospital |
---|---|---|
1 | 10 | 0001-01-01 00:00:00 |
2 | 20 | 0001-01-01 00:00:00 |
3 | 30 | 0001-01-01 00:00:00 |
4 | 40 | 0001-01-02 00:00:00 |
5 | 50 | 0001-01-02 00:00:00 |
6 | 60 | 0001-01-02 00:00:00 |
I am expecting the partition column hospital to contain the strings RD1 and RD8, but instead they seem to get converted to dates. The error itself only occurs with the real data (with ~150 partitions). I can prevent the error from occurring by instructing pandas not to read in the hospital column, but this is not a practical solution as the column is required.
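To check whether pandas' own timestamp parsing could be what turns these codes into dates, the values can be probed directly. This is only a sketch: it assumes the partition values end up going through something like pd.Timestamp, which I have not confirmed in the fastparquet source, and probe_partition_values is a hypothetical helper name. No expected output is shown because the result may differ between pandas versions.

```python
import pandas as pd


def probe_partition_values(values):
    """Map each raw partition string to what pd.Timestamp makes of it.

    If parsing fails, the exception is stored instead of a Timestamp.
    """
    results = {}
    for value in values:
        try:
            results[value] = pd.Timestamp(value)
        except Exception as exc:  # e.g. ValueError for unparseable strings
            results[value] = exc
    return results


# Hospital codes taken from the description above.
for value, parsed in probe_partition_values(["RD1", "RD8", "R0A", "RNA"]).items():
    print(f"{value!r} -> {parsed!r}")
```

Any code that comes back as a Timestamp rather than an exception is a candidate for the same misparsing.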
If I switch the engine to pyarrow the error goes away, but the loading of data is significantly slower.
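A possible stopgap that keeps fastparquet as the engine is to read each partition file on its own and attach the partition value as a plain string taken from the directory name. This is a rough sketch, not a fix: partition_value and read_with_string_partitions are hypothetical helpers, and it assumes a single-level hospital=XYZ layout. The column is assigned explicitly, so it holds strings regardless of what the engine infers for a single file.

```python
from pathlib import Path

import pandas as pd


def partition_value(path, key="hospital"):
    """Return the raw string from a hive-style `key=value` directory in `path`."""
    prefix = f"{key}="
    for part in Path(path).parts:
        if part.startswith(prefix):
            return part[len(prefix):]
    raise ValueError(f"no {key}=... directory found in {path}")


def read_with_string_partitions(root, key="hospital", engine="fastparquet"):
    """Read every `key=*/*.parquet` file under `root`, keeping `key` as a string."""
    frames = []
    for file in sorted(Path(root).glob(f"{key}=*/*.parquet")):
        frame = pd.read_parquet(file, engine=engine)
        # Overwrite (or create) the partition column with the untouched string.
        frame[key] = partition_value(file, key)
        frames.append(frame)
    if not frames:
        raise FileNotFoundError(f"no {key}=*/*.parquet files under {root}")
    return pd.concat(frames, ignore_index=True)
```

With the example above, read_with_string_partitions(tmpdir) should give the same rows with hospital as a string column. I have not measured how this compares in speed to a single read_parquet call over the whole directory.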
Anything else we need to know?:
Environment:
- Dask version: n/a
- Python version: 3.17.7
- Operating System: Windows 11 2023H2
- Install method (conda, pip, source): uv + pip