Feature/expected categories #1597

ColdTeapot273K · 2024-08-24T13:34:07Z

Add support for processing only explicitly expected categories in preprocessing.OneHotEncoder and preprocessing.OrdinalEncoder, akin to the sklearn API for the respective encoders.

All doctests pass (I've added some).

Rationale:
sklearn has a neat feature where you can explicitly pass the category values you want to see in the encoder state; other values are filtered out. See the categories parameter of OneHotEncoder and OrdinalEncoder.
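To make the filtering concrete, here is a minimal plain-Python sketch of the behaviour. It is not the code from this PR: the `one_hot` function and the feature-name-to-set layout of `categories` are assumptions made for illustration, with `None` standing for "no restriction".

```python
from __future__ import annotations


def one_hot(
    x: dict[str, str],
    categories: dict[str, set[str]] | None = None,
) -> dict[str, int]:
    """One-hot encode a dict of categorical features.

    `categories` maps a feature name to the values worth encoding.
    None means no restriction. Both this function and the dict layout
    are hypothetical, chosen only to illustrate the idea.
    """
    encoded = {}
    for feature, value in x.items():
        expected = None if categories is None else categories.get(feature)
        # A feature without an entry in `categories` is left unrestricted.
        if expected is not None and value not in expected:
            continue  # unexpected value: emit no column for it
        encoded[f"{feature}_{value}"] = 1
    return encoded


sample = {"city": "Paris", "os": "linux"}
unrestricted = one_hot(sample)
restricted = one_hot(sample, categories={"city": {"London"}})
```

With the restriction in place, "Paris" produces no column because it is not among the expected values for `city`, while `os` is untouched because it has no entry in `categories`.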

This is convenient when you work with high-cardinality category spaces where some values are rare and you want to regularize your model. E.g. I've had a practical problem where restricting the encoder to the pre-selected top 20% most frequent categories in a space of cardinality 1,000,000 gave a 10%+ latency improvement with no significant loss in metrics, and also made the model lighter on RAM.
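One way such a pre-selected list could be built is sketched below. It assumes "top 20%" means the most frequent fifth of the distinct values observed in some historical sample; the helper name `top_fraction` and the toy data are hypothetical.

```python
import math
from collections import Counter


def top_fraction(values, fraction=0.2):
    """Return the most frequent distinct values, keeping `fraction` of them."""
    counts = Counter(values)
    # Keep at least one value so the result is not empty for small inputs.
    keep = max(1, math.ceil(len(counts) * fraction))
    return {value for value, _ in counts.most_common(keep)}


# Toy history with five distinct values of very different frequency.
history = ["a"] * 50 + ["b"] * 30 + ["c"] * 10 + ["d"] * 5 + ["e"] * 5
expected = top_fraction(history, fraction=0.2)
wider = top_fraction(history, fraction=0.4)
```

The resulting set can then be handed to the encoder as the expected categories for that feature.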

This implementation is hackable: if a user wants to modify the lists of expected categories between training steps, they can do so by direct attribute access. E.g. it can be glued to modules like TargetAgg for some cool dynamic re-evaluation of the expected category lists.

P.S. Please bump Ruff; my LSP config complains because of API changes. Also, MyPy complained a lot about the str | dict | defaultdict type hints for the category parameter. I just had to give up on them; maybe someone has better ideas for how to handle them.
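The idea of editing the expected categories between steps can be illustrated with a toy ordinal encoder for a single feature. This is only a sketch under assumptions: the class `TinyOrdinalEncoder`, its `categories` attribute, and its code numbering are hypothetical and do not reproduce the classes changed in this PR.

```python
class TinyOrdinalEncoder:
    """Toy ordinal encoder limited to an editable set of expected categories."""

    def __init__(self, categories, unknown_value=0):
        # Public on purpose, so callers can edit it between steps.
        self.categories = set(categories)
        self.unknown_value = unknown_value
        self.codes = {}

    def learn_one(self, value):
        # Only expected values get a code; codes start right after unknown_value.
        if value in self.categories and value not in self.codes:
            self.codes[value] = self.unknown_value + len(self.codes) + 1

    def transform_one(self, value):
        # Values outside the expected set always map to unknown_value.
        if value not in self.categories:
            return self.unknown_value
        return self.codes.get(value, self.unknown_value)


enc = TinyOrdinalEncoder(categories={"a"})
for v in ["a", "b"]:
    enc.learn_one(v)
before = enc.transform_one("b")  # "b" is not expected yet

enc.categories.add("b")  # direct attribute access between training steps
enc.learn_one("b")
after = enc.transform_one("b")

enc.categories.discard("a")  # dropping a category hides its code again
dropped = enc.transform_one("a")
```

Because the filter is checked on every call, a change to `categories` takes effect at the very next step, with no need to rebuild the encoder.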

MaxHalford · 2024-08-25T20:22:44Z

P.S. Please bump Ruff; my LSP config complains because of API changes. Also, MyPy complained a lot about the str | dict | defaultdict type hints for the category parameter. I just had to give up on them; maybe someone has better ideas for how to handle them.

Duly noted, I'll take a look.

MaxHalford

Thanks for this PR, it's useful.

My preference here would be to not adhere to sklearn, and set categories to None rather than auto.

river/preprocessing/one_hot.py

ColdTeapot273K · 2024-09-10T08:56:16Z

Thanks for this PR, it's useful.

My preference here would be to not adhere to sklearn, and set categories to None rather than auto.

Deal. I shall modify.

MaxHalford · 2024-10-22T04:50:06Z

@ColdTeapot273K sorry for not replying in a while! I like the changes, we can merge. Before that though, could you add an entry to unreleased.md?

ColdTeapot273K · 2024-11-05T16:02:15Z

@MaxHalford no problem, I understand.

Done, please check.

ColdTeapot273K added 4 commits August 24, 2024 03:13

Add support for explicitly expected categories for OHE

2fce4f3

cleanup

537003d

Add expected category support for OrdinalEncoder

3dc5dd3

fix mypy complaints

4ff7bb2

ColdTeapot273K requested review from MaxHalford and smastelini as code owners August 24, 2024 13:34

MaxHalford reviewed Aug 25, 2024

river/preprocessing/one_hot.py (outdated, resolved)

river/preprocessing/one_hot.py (outdated, resolved)

ColdTeapot273K and others added 2 commits September 10, 2024 13:54

Update river/preprocessing/one_hot.py

86aafa0

Code review fixes Co-authored-by: Max Halford <[email protected]>

Update river/preprocessing/one_hot.py

cbb202c

Code review fixes Co-authored-by: Max Halford <[email protected]>

ColdTeapot273K added 2 commits October 2, 2024 02:21

Adjust default params, update respective docs

4ce2ade

Fix pre-commit hook complaints

58f237c

ColdTeapot273K added 2 commits November 5, 2024 20:47

Upd unreleased.md

fcf01f1

Merge branch 'main' into feature/expected-categories

4cd961e
