Auto encoding for categorical data during inference.

We are working on automatic re-encoding of categorical features during inference. This lets the booster handle input whose categories are encoded differently from the training dataset, and removes the need for a scikit-learn pipeline for data encoding when using DataFrame inputs.
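
To illustrate what "re-encoding" means here, below is a minimal sketch in plain Python. It is not the booster's implementation. The helper name `recode` and the choice to map unseen categories to a missing code of `-1` are assumptions made for this example only. With a pandas DataFrame, the category lists and codes would correspond to `Series.cat.categories` and `Series.cat.codes`.

```python
def recode(codes, infer_categories, train_categories, missing=-1):
    """Translate codes that index `infer_categories` into codes that index
    `train_categories`. Hypothetical helper, for illustration only."""
    # Position of every category as it was seen at training time.
    train_index = {cat: i for i, cat in enumerate(train_categories)}
    # mapping[j] is the training code for inference code j.
    # Assumption: categories unseen during training become `missing`.
    mapping = [train_index.get(cat, missing) for cat in infer_categories]
    # Negative input codes are treated as already-missing values.
    return [mapping[c] if c >= 0 else missing for c in codes]


# Same feature, but the two datasets enumerate their categories differently.
train_categories = ["apple", "banana", "cherry"]
infer_categories = ["cherry", "apple", "durian"]
infer_codes = [0, 1, 2, -1]  # cherry, apple, durian, missing

recoded = recode(infer_codes, infer_categories, train_categories)
```

Without this step, code `0` in the inference data ("cherry") would be read as "apple" by a model that only stores integer codes.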

- [ ] Python native
- [ ] Dask
- [ ] R
- [x] ~PySpark~
- [x] ~Scala/Spark~
- [x] Thread safety
- [ ] https://github.com/scikit-learn/scikit-learn/blob/5b0ca3939854a3823beee6840b415a32ef16deb2/sklearn/utils/_tags.py#L36
- [ ] Demos
- [x] cudf.pandas
- [x] Polars
- [ ] Refresh updater.
- [ ] Training continuation.
- [x] Validation datasets.
- [ ] Documentation. Update the existing requirement on the ordinal encoder.
- [ ] Plot. Display name instead of code (optional).


The Spark variants were removed from the list because Spark DataFrames don't carry a categorical encoding. Use `StringIndexer` instead.

Related:
- https://github.com/dmlc/xgboost/issues/9676

Notes:
Looking into the Arrow CPU implementation: its compute module dispatches on whether a null mask is present. If one is, it finds consecutive valid values (called a run) and then iterates over that run. This avoids evaluating a validity predicate for every element. The runs are found using compiler builtins with leading nnz counting.
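
The run-based idea can be sketched in Python as follows. This is not Arrow's code. It assumes the validity mask is a single integer with bit `i` set when element `i` is valid, and it swaps in a trailing-zero count written with integer arithmetic in place of the compiler builtins. The names `count_trailing_zeros`, `valid_runs` and `sum_valid` are hypothetical.

```python
def count_trailing_zeros(x: int) -> int:
    """Index of the lowest set bit of a non-zero integer.
    Stands in for a compiler builtin in this sketch."""
    assert x != 0
    return (x & -x).bit_length() - 1


def valid_runs(mask: int, n: int):
    """Yield (start, length) for each run of set bits in the low n bits."""
    mask &= (1 << n) - 1
    pos = 0
    while mask:
        skip = count_trailing_zeros(mask)   # nulls before the next run
        mask >>= skip
        pos += skip
        run = count_trailing_zeros(~mask)   # length of the run of valid bits
        yield pos, run
        mask >>= run
        pos += run


def sum_valid(values, mask=None):
    """Sum the valid elements, dispatching on whether a mask is present."""
    if mask is None:
        return sum(values)                  # no mask: no validity checks at all
    total = 0
    for start, length in valid_runs(mask, len(values)):
        # Inside a run every element is valid, so no per-element predicate.
        total += sum(values[start:start + length])
    return total


values = [1, 2, 3, 4, 5, 6, 7, 8]
mask = 0b11100110                           # elements 1, 2, 5, 6, 7 are valid
runs = list(valid_runs(mask, len(values)))
total = sum_valid(values, mask)
```

The validity test happens once per run instead of once per element, which is where the saving comes from when nulls are sparse.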

Related:
- https://github.com/dmlc/xgboost/issues/10554

Tracking PRs:
- https://github.com/dmlc/xgboost/pull/11070
- https://github.com/dmlc/xgboost/pull/11089
- https://github.com/dmlc/xgboost/pull/11094
- https://github.com/dmlc/xgboost/pull/11097
- https://github.com/dmlc/xgboost/pull/11092
- https://github.com/dmlc/xgboost/pull/11098
- https://github.com/dmlc/xgboost/pull/11107
- https://github.com/dmlc/xgboost/pull/11114
- https://github.com/dmlc/xgboost/pull/11139
- https://github.com/dmlc/xgboost/pull/11297
- https://github.com/dmlc/xgboost/pull/11303
- https://github.com/dmlc/xgboost/pull/11310
- https://github.com/dmlc/xgboost/pull/11311
- https://github.com/dmlc/xgboost/pull/11313
- https://github.com/dmlc/xgboost/pull/11315
- https://github.com/dmlc/xgboost/pull/11347
- https://github.com/dmlc/xgboost/pull/11561
- https://github.com/dmlc/xgboost/pull/11565
- https://github.com/dmlc/xgboost/pull/11568

Auto encoding for categorical data during inference. #11088
