-
Notifications
You must be signed in to change notification settings - Fork 977
Description
Is your feature request related to a problem? Please describe.
Libcudf includes a dictionary type for columns (type_id::DICTIONARY32
), with child columns for keys and indices. For instance the following column:
["A", "B", "A", "A"]
could be encoded as:
keys: ["A", "B"]
indices: [0, 1, 0, 0]
As of 25.12, the libcudf dictionary uses sorted keys. Dictionary operators like encode
, merge
, add_keys
rely on cudf::distinct
followed by cudf::sort
to provide keys and then cudf::lower_bound
to compute the indices.
The Apache Arrow spec for dictionary encoding does not require sorted keys, so from_arrow
on a dictionary would not be zero-copy as long as cudf requires a sort and reindex.
Describe the solution you'd like
Libcudf dictionaries could use a hash-based implementation instead.
Then the dictionary columns would have null mask, indices child column and unordered keys child column. They could use the column data
member to point to a contiguous hash table for key_exists
and get_index_of_key
checks.
Adding or removing keys would be faster because we would not need to re-sort the keys and re-map the indices.
get_index
could turn from O(log n) to O(1).
If the key count grows too large for the hash map, then we might need to build a new map or simply use a dynamic hash map. Also if we remove a lot of keys, we might need a way to re-build the hash table and key column to purge unused elements.
More scoping is needed.
Describe alternatives you've considered
We will want to do more performance analysis of the current sort-based implementation to be sure that key sorting and searching the sorted keys are important bottlenecks.
Additional context
Please feel free to edit and refine this proposal as appropriate.
Also see #15199 about expanding dictionary use in libcudf.
The dictionary encoded layout in Arrow does not require sorted keys.
https://arrow.apache.org/docs/format/Intro.html#dictionary-encoded-layout