Description
We are working on automatic re-encoding for categorical features during inference. This teaches the booster to handle data encoded differently than the training dataset and eliminates the need for a scikit-learn pipeline for data encoding when using DataFrame inputs.
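To illustrate what re-encoding means here, below is a minimal pure-Python sketch. It is not XGBoost's implementation; the helper names `build_recode_map` and `recode` are hypothetical. It assumes categories are matched by name and that a missing value or a category unseen during training maps to `-1`.

```python
def build_recode_map(train_categories, infer_categories):
    """Map each inference-time code to the code used during training.

    Categories are matched by name. A category that was not seen during
    training maps to -1 (an assumption made for this sketch).
    """
    train_index = {name: code for code, name in enumerate(train_categories)}
    return [train_index.get(name, -1) for name in infer_categories]


def recode(codes, mapping):
    """Translate inference codes into training codes; -1 (missing) is kept."""
    return [mapping[c] if c >= 0 else -1 for c in codes]


# The same three categories, but listed (and therefore encoded) in a
# different order at inference time.
train_categories = ["a", "b", "c"]
infer_categories = ["c", "a", "b"]

# Inference rows: c, a, b, a, missing
infer_codes = [0, 1, 2, 1, -1]

mapping = build_recode_map(train_categories, infer_categories)
recoded = recode(infer_codes, mapping)
print(mapping)
print(recoded)
```

With automatic re-encoding, a lookup of this kind would happen inside the booster, so the caller does not have to align the encoding manually before prediction.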
- Python native
- Dask
- R
- PySpark
- Scala/Spark
- Thread safety
- https://github.com/scikit-learn/scikit-learn/blob/5b0ca3939854a3823beee6840b415a32ef16deb2/sklearn/utils/_tags.py#L36
- Demos
- cudf.pandas
- Polars
- Refresh updater.
- Training continuation.
- Validation datasets.
- Documents. Update the existing requirement on the ordinal encoder.
- Plot. Display name instead of code (optional).
Removed the Spark variants, since a Spark DataFrame doesn't carry a categorical encoding. Use the StringIndexer instead.
Related:
Notes:
Looking into the Arrow CPU implementation, its compute module dispatches based on whether a null mask is present. If one is present, it finds consecutive valid values (called a run) and then iterates over each run. This way, it avoids evaluating a validity predicate for every element. The runs are found using compiler builtins with leading nnz counting.
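The run-based iteration described in the note can be sketched as follows. This is a simplified pure-Python illustration and not Arrow's code (which is C++). It assumes the validity bitmap is a Python integer with bit `i` set when element `i` is valid, and it uses integer bit tricks in place of compiler builtins. The names `valid_runs` and `sum_valid` are hypothetical.

```python
def _trailing_zeros(x):
    """Number of zero bits below the lowest set bit of a non-zero integer."""
    return (x & -x).bit_length() - 1


def valid_runs(validity, length):
    """Yield (start, run_length) for every maximal run of valid elements."""
    validity &= (1 << length) - 1  # ignore bits beyond the array length
    pos = 0
    while pos < length:
        remaining = validity >> pos
        if remaining == 0:
            break  # no valid elements are left
        # Skip over the invalid elements in front of the next run.
        pos += _trailing_zeros(remaining)
        remaining = validity >> pos
        # The run length is the number of consecutive set bits, which is
        # the number of trailing zeros of the inverted bitmap.
        run = min(_trailing_zeros(~remaining), length - pos)
        yield pos, run
        pos += run


def sum_valid(values, validity):
    """Sum the valid elements without a per-element validity check."""
    total = 0
    for start, run in valid_runs(validity, len(values)):
        total += sum(values[start:start + run])
    return total


values = [1, 2, 3, 4, 5, 6, 7, 8]
validity = 0b11100111  # elements 3 and 4 are null
print(list(valid_runs(validity, len(values))))
print(sum_valid(values, validity))
```

The inner work over each run contains no branch on validity; the bitmap is only consulted once per run boundary.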
Tracking PRs:
- [breaking] [py] Drop support for datatable. #11070
- Extract array interface handlers. #11089
- Support all integer types in ubjson. #11094
- Various small cleanups. #11097
- [R] Ensure `ProxyDMatrix` creation keeps data until next iteration #11092
- Initial implementation of the ordinal recoder. #11098
- Small cleanups for DMatrix constructor. #11107
- Add datagen for testing string-based categorical data. #11114
- Cleanup CPU predict function. #11139
- Implement the container for categories. #11297
- Store categories from pandas. #11303
- Test dataset for dask dataframe with str columns. #11310
- Store categories from cuDF. #11311
- Store categories from iterator. #11313
- Auto re-coding for the CPU predictor. #11315
- Implement ordinal recoder for the GPU predictor. #11347
- [enc] Add tests for re-coding validation datasets. #11561
- Support categorical features from polars. #11565
- [enc] Add a cat accessor to the booster. #11568