Auto encoding for categorical data during inference. #11088

@trivialfis

Description

We are working on automatic re-encoding of categorical features during inference. This lets the booster handle data whose categories are encoded differently from the training dataset, and it removes the need for a scikit-learn pipeline solely for data encoding when using DataFrame inputs.

Removed the Spark variants; Spark's DataFrame doesn't carry category encoding. Use the StringIndexer instead.

Related:

Notes:
Looking into the Arrow CPU implementation, its compute module dispatches based on whether a null mask is present. If one is present, it finds stretches of consecutive valid values (called runs) and iterates over each run. This way, it avoids evaluating a validity predicate for every element. The runs are located with compiler bit-counting builtins (counting leading/trailing zero and set bits in the validity bitmap words).
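The run-detection idea described above can be sketched as follows. This is a simplified Python model, not Arrow's C++ code: the bitmap is held as a single integer (LSB = element 0), and `(x & -x).bit_length() - 1` stands in for the compiler's trailing-zero-count builtin.

```python
def valid_runs(bitmap: int, n: int):
    """Yield (start, length) for each run of consecutive valid (set) bits."""
    runs = []
    i = 0
    while i < n:
        rest = bitmap >> i
        if rest == 0:
            break  # no valid elements remain
        # Skip invalid elements: trailing-zero count of the shifted mask.
        i += (rest & -rest).bit_length() - 1
        # Run length: trailing-zero count of the inverted shifted mask,
        # i.e. the number of consecutive set bits starting at i.
        inv = ~(bitmap >> i)
        length = (inv & -inv).bit_length() - 1
        runs.append((i, min(length, n - i)))
        i += length
    return runs

# Validity mask 0b11010111 over 8 elements (LSB = element 0):
# elements 0-2 valid, 3 null, 4 valid, 5 null, 6-7 valid.
print(valid_runs(0b11010111, 8))  # → [(0, 3), (4, 1), (6, 2)]
```

A downstream kernel can then loop over each `(start, length)` pair with a tight inner loop and no per-element branch, which is the payoff of the run-based dispatch.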

Related:

Tracking PRs:
