Naming related to data files #231

@SergeiPatiakin

There are three concepts related to Iceberg data files:

A: files defining rows to add
B: files defining rows to remove
C: files defining either rows to add or rows to remove (encompasses both A and B)

The term "delete file" unambiguously refers to B. On the other hand, "data file" can refer to A or to C. When one sees a variable or field called data_files, it is hard to know whether that refers to A or C. For example at

it refers to A but at
let data_files_iter = delete_files.iter().chain(data_files.iter());
it refers to C.

I wonder if we can come up with distinct canonical terms for A, B, C. E.g.

A: "data file"
B: "delete file"
C: "content file"

Or alternatively:

A: "insert file"
B: "delete file"
C: "data file"

Or any other 3 distinct terms for A, B, C.
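To make the first option concrete, here is a minimal sketch of how the three terms could map onto types. All names here (`ContentKind`, `ContentFile`, `all_content_files`, `split_content_files`) are hypothetical and only for illustration; they are not taken from the existing codebase.

```rust
/// What a file contributes to the table (concepts A and B).
/// Hypothetical type, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ContentKind {
    /// A: the file defines rows to add ("data file").
    Data,
    /// B: the file defines rows to remove ("delete file").
    Delete,
}

/// C: any file tracked by the table, whether it adds or removes rows
/// ("content file"). Hypothetical type, for illustration only.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ContentFile {
    path: String,
    kind: ContentKind,
}

/// Chains delete files (B) and data files (A) into one list (C).
/// The result is named after C, so it cannot be mistaken for A.
fn all_content_files(
    data_files: &[ContentFile],
    delete_files: &[ContentFile],
) -> Vec<ContentFile> {
    delete_files.iter().chain(data_files.iter()).cloned().collect()
}

/// Splits a mixed list (C) back into data files (A) and delete files (B).
fn split_content_files(
    content_files: Vec<ContentFile>,
) -> (Vec<ContentFile>, Vec<ContentFile>) {
    content_files
        .into_iter()
        .partition(|f| f.kind == ContentKind::Data)
}
```

With names like these, a variable called `data_files` would always mean A, and anything that mixes A and B would be called `content_files`.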

Unfortunately, the Iceberg spec suffers from the same ambiguity, so we cannot use it for inspiration.
