[FEA] Refactor to a hash-based dictionary implementation in libcudf

**Is your feature request related to a problem? Please describe.**
Libcudf includes a dictionary type for columns (`type_id::DICTIONARY32`), with child columns for keys and indices. For instance the following column:
`["A", "B", "A", "A"]`
could be encoded as:
`keys: ["A", "B"]`
`indices: [0, 1, 0, 0]`

As of 25.12, the libcudf dictionary uses sorted keys. Dictionary operators like `encode`, `merge`, `add_keys` rely on `cudf::distinct` followed by `cudf::sort` to provide keys and then `cudf::lower_bound` to compute the indices.

The [Apache Arrow spec for dictionary encoding](https://arrow.apache.org/docs/format/Intro.html#dictionary-encoded-layout) does not require sorted keys, so `from_arrow` on a dictionary would not be zero-copy as long as cudf requires a sort and reindex. 

**Describe the solution you'd like**
Libcudf dictionaries could use a hash-based implementation instead.

Then the dictionary columns would have null mask, indices child column and unordered keys child column. They could use the column `data` member to point to a contiguous hash table for `key_exists` and `get_index_of_key` checks. 

Adding or removing keys would be faster because we would not need to re-sort the keys and re-map the indices. 

`get_index` could turn from O(log n) to O(1).

If the key count grows too large for the hash map, then we might need to build a new map or simply use a dynamic hash map. Also if we remove a lot of keys, we might need a way to re-build the hash table and key column to purge unused elements.

More scoping is needed.

**Describe alternatives you've considered**
We will want to do more performance analysis of the current sort-based implementation to be sure that key sorting and searching the sorted keys are important bottlenecks.

**Additional context**
Please feel free to edit and refine this proposal as appropriate.

Also see https://github.com/rapidsai/cudf/issues/15199 about expanding dictionary use in libcudf.

The dictionary encoded layout in Arrow does not require sorted keys.
https://arrow.apache.org/docs/format/Intro.html#dictionary-encoded-layout
<img width="664" height="442" alt="Image" src="https://github.com/user-attachments/assets/f3edc8de-62a1-43a1-a59d-4cbab2c925e2" />



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[FEA] Refactor to a hash-based dictionary implementation in libcudf #20234

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[FEA] Refactor to a hash-based dictionary implementation in libcudf #20234

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions