Fix/deduplicate casting log message #983
Pull request overview
Reduces repeated INFO logging from align_df_categories during prediction workflows by deduplicating “cast/align category” messages across calls, and documents the change for the next patch release.
Changes:
- Add a module-level set to ensure `align_df_categories` emits cast/align INFO logs only once per column per process/session.
- Update the changelog with an unreleased 3.2.1 entry describing the reduced log spam.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `src/glum/_utils.py` | Deduplicates `align_df_categories` INFO logs using a module-level emitted-columns set. |
| `CHANGELOG.rst` | Adds an unreleased entry describing the logging change. |
MarcAntoineSchmidtQC
left a comment
Looks good. Thanks!
We might also want to think about making these messages DEBUG-level, but the current fix is good nevertheless, thank you!
stanmart
left a comment
Oh wait, sorry, I missed that this is a module level global. This might get a bit confusing (log is emitted only once per session, not once per fit).
That's a good point. I think it should be displayed once per fit, but then the question is: Can we make the call
Exactly, using an instance-level set, we still get duplicated warnings, as e.g. for CV we create fresh estimators for each fold/param combo, so each gets its own empty set...
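To illustrate the point about instance-level sets, here is a minimal sketch. `EstimatorSketch` and `run_folds` are hypothetical stand-ins, not glum classes or functions. It only assumes that cross-validation builds a fresh estimator per fold, as described above.

```python
import logging

logger = logging.getLogger("instance_level_sketch")


class EstimatorSketch:
    """Hypothetical estimator that tracks logged columns per instance."""

    def __init__(self) -> None:
        # Each new instance starts with its own empty set.
        self._logged_columns: set[str] = set()

    def predict(self, columns: list[str]) -> None:
        for column in columns:
            if column not in self._logged_columns:
                self._logged_columns.add(column)
                logger.info("Casting column %s to categorical.", column)


def run_folds(n_folds: int, columns: list[str]) -> None:
    """Mimic cross-validation: one fresh estimator per fold, several predicts each."""
    for _ in range(n_folds):
        estimator = EstimatorSketch()  # new instance, so nothing is remembered
        for _ in range(3):
            estimator.predict(columns)
```

Within one estimator the message is deduplicated across the three `predict` calls, but every fold logs it again, so the number of messages still grows with the number of folds and parameter combinations.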
stanmart
left a comment
Thank you, this looks good to me. I don't mind deduplicated debug messages too much. They are usually hidden by default, and whenever the user wants debug logging, extra info is usually not a problem.

Problem:
`align_df_categories` logs at INFO every time a column is cast to Enum or its categories are re-aligned. Since it runs on every `.predict()` call (via `_convert_from_df`), any code path that calls predict in a loop (CV grid search, PD plots, SHAP, H² stats) produces hundreds of identical log lines.

Solution:
Track emitted columns in a module-level set and only log the first occurrence. Casting/alignment behavior is unchanged.
There might be more nuanced fixes for this. I noticed the problem while running some workflows with a categorical feature and found it quite annoying that the logs are spammed with the same message, so I went for this quick fix. Happy to get your opinion on this.