Issue loading partitioned data with values of RDx #958

@tomjemmett

Description

I recently updated a code base from an older version of fastparquet (2023.2.*) to 2024.11.0, and reading in some data started to fail with this cryptic error:

ValueError: Timestamp('1-01-01 00:00:00') is not in list

I was able to track this down to two specific partitions in my data. My data is stored with partitions like hospital=XYZ, and the two offending partitions were hospital=RD1 and hospital=RD8. (For reference, these are codes identifying hospitals in England, which largely follow the pattern of 3 characters starting with an R followed by 2 other alphanumeric values, for example R0A, RNA).
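
For reference, the expectation described here (partition values stay plain strings) can be sketched with the standard library alone. The helper name `partition_values` and the example path are hypothetical and only illustrate the `key=value` directory layout; this is not how fastparquet parses paths.

```python
from pathlib import PurePosixPath


def partition_values(path: str) -> dict[str, str]:
    """Return the key=value directory segments of a path as plain strings."""
    values = {}
    for segment in PurePosixPath(path).parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            # No type inference is attempted: "RD1" stays the string "RD1".
            values[key] = value
    return values


print(partition_values("data/hospital=RD1/part-00000.parquet"))
# {'hospital': 'RD1'}
```

Segments without an `=` (such as the file name) are ignored, so a path with no partition directories yields an empty dict.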

Minimal Complete Verifiable Example:

This simpler example does not result in the same exception, but it does show the partition values being misread.

import os
import tempfile
import pandas as pd

# Build a small hive-style partitioned dataset in a temporary directory.
tmpdir = tempfile.mkdtemp()

os.makedirs(f"{tmpdir}/hospital=RD1", exist_ok=True)
os.makedirs(f"{tmpdir}/hospital=RD8", exist_ok=True)

df1 = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})
df2 = pd.DataFrame({"id": [4, 5, 6], "value": [40, 50, 60]})

# One file per partition; the hospital code exists only in the directory name.
df1.to_parquet(f"{tmpdir}/hospital=RD1/part-00000.parquet")
df2.to_parquet(f"{tmpdir}/hospital=RD8/part-00000.parquet")

# Read the whole directory back with the fastparquet engine.
pd.read_parquet(tmpdir, engine="fastparquet")

This gives me

id  value             hospital
 1     10  0001-01-01 00:00:00
 2     20  0001-01-01 00:00:00
 3     30  0001-01-01 00:00:00
 4     40  0001-01-02 00:00:00
 5     50  0001-01-02 00:00:00
 6     60  0001-01-02 00:00:00

I expect the partition column hospital to contain the strings RD1 and RD8, but instead the values seem to be converted to dates. The error itself only occurs with the real data (~150 partitions). I can prevent the error from occurring by instructing pandas not to read the hospital column, but this is not a practical solution because the column is required.

If I switch the engine to pyarrow the error goes away, but loading the data is significantly slower.
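
One possible stopgap, sketched here under assumptions rather than tested against fastparquet, is to take the hospital code from the directory name directly and attach it as a string after reading each file on its own. The helper `files_with_labels` below is hypothetical and uses only the standard library to pair each parquet file with its partition label.

```python
from pathlib import Path


def files_with_labels(root, key="hospital"):
    """Yield (file_path, label) pairs, taking the label from the key=... directory."""
    prefix = f"{key}="
    for directory in sorted(Path(root).iterdir()):
        if directory.is_dir() and directory.name.startswith(prefix):
            # Everything after "hospital=" is kept verbatim as a string.
            label = directory.name[len(prefix):]
            for file_path in sorted(directory.glob("*.parquet")):
                yield file_path, label
```

Each file could then be read individually with `pd.read_parquet(file_path, engine="fastparquet")`, the label added with `DataFrame.assign(hospital=label)`, and the pieces combined with `pd.concat`. Whether a single-file read avoids the date conversion is an assumption that has not been verified here.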

Anything else we need to know?:

Environment:

  • Dask version: n/a
  • Python version: 3.17.7
  • Operating System: Windows 11 2023H2
  • Install method (conda, pip, source): uv + pip
