Skip to content

[FEA] Refactor to a hash-based dictionary implementation in libcudf #20234

@GregoryKimball

Description

@GregoryKimball

Is your feature request related to a problem? Please describe.
Libcudf includes a dictionary type for columns (type_id::DICTIONARY32), with child columns for keys and indices. For instance the following column:
["A", "B", "A", "A"]
could be encoded as:
keys: ["A", "B"]
indices: [0, 1, 0, 0]

As of 25.12, the libcudf dictionary uses sorted keys. Dictionary operators like encode, merge, add_keys rely on cudf::distinct followed by cudf::sort to provide keys and then cudf::lower_bound to compute the indices.

The Apache Arrow spec for dictionary encoding does not require sorted keys, so from_arrow on a dictionary would not be zero-copy as long as cudf requires a sort and reindex.

Describe the solution you'd like
Libcudf dictionaries could use a hash-based implementation instead.

Then the dictionary columns would have null mask, indices child column and unordered keys child column. They could use the column data member to point to a contiguous hash table for key_exists and get_index_of_key checks.

Adding or removing keys would be faster because we would not need to re-sort the keys and re-map the indices.

get_index could turn from O(log n) to O(1).

If the key count grows too large for the hash map, then we might need to build a new map or simply use a dynamic hash map. Also if we remove a lot of keys, we might need a way to re-build the hash table and key column to purge unused elements.

More scoping is needed.

Describe alternatives you've considered
We will want to do more performance analysis of the current sort-based implementation to be sure that key sorting and searching the sorted keys are important bottlenecks.

Additional context
Please feel free to edit and refine this proposal as appropriate.

Also see #15199 about expanding dictionary use in libcudf.

The dictionary encoded layout in Arrow does not require sorted keys.
https://arrow.apache.org/docs/format/Intro.html#dictionary-encoded-layout
Image

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions