Skip to content

Fix/deduplicate casting log message#983

Merged
DavidEiglspergerQC merged 4 commits intomainfrom
fix/deduplicate-casting-log-message
Mar 12, 2026
Merged

Fix/deduplicate casting log message#983
DavidEiglspergerQC merged 4 commits intomainfrom
fix/deduplicate-casting-log-message

Conversation

@DavidEiglspergerQC
Copy link
Contributor

Problem:
align_df_categories logs at INFO every time a column is cast to Enum or its categories are re-aligned. Since it runs on every .predict() call (via _convert_from_df), any code path that calls predict in a loop (CV grid search, PD plots, SHAP, H² stats produces hundreds of identical log lines)

Solution:
Track emitted columns in a module-level set and only log the first occurrence. Casting/alignment behavior is unchanged.

There might be more nuanced fixes for this, I just realized as I ran some workflows using a categorical feature and found it quite annoying that the entire logs are spammed with the same message, so I went for this quick fix. Happy to get your opinion on this.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Reduces repeated INFO logging from align_df_categories during prediction workflows by deduplicating “cast/align category” messages across calls, and documents the change for the next patch release.

Changes:

  • Add a module-level set to ensure align_df_categories emits cast/align INFO logs only once per column per process/session.
  • Update the changelog with an unreleased 3.2.1 entry describing the reduced log spam.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
src/glum/_utils.py Deduplicates align_df_categories INFO logs using a module-level emitted-columns set.
CHANGELOG.rst Adds an unreleased entry describing the logging change.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Member

@MarcAntoineSchmidtQC MarcAntoineSchmidtQC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks!

@stanmart
Copy link
Collaborator

We might also want to think about making these messages DEBUG-level, but the current fix is good nevertheless, thank you!

Copy link
Collaborator

@stanmart stanmart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait, sorry, I missed that this is a module level global. This might get a bit confusing (log is emitted only once per session, not once per fit).

@MarcAntoineSchmidtQC
Copy link
Member

Oh wait, sorry, I missed that this is a module level global. This might get a bit confusing (log is emitted only once per session, not once per fit).

That's a good point. I think it should be displayed once per fit, but then the question is: Can we make the call _convert_from_df only once in these examples?

@DavidEiglspergerQC
Copy link
Contributor Author

Oh wait, sorry, I missed that this is a module level global. This might get a bit confusing (log is emitted only once per session, not once per fit).

That's a good point. I think it should be displayed once per fit, but then the question is: Can we make the call _convert_from_df only once in these examples?

Exactly, using an instance-level set, we still get duplicated warnings as e.g. for cv we create fresh estimators for each fold/param combo, so each gets its own empty set...

@DavidEiglspergerQC
Copy link
Contributor Author

We now:

  • Use DEBUG level
  • Use instance level deduplication

This solves the whole problem if one runs the workflows at INFO level and "most" of the problem if one runs them at DEBUG level (no tons of duplicate logs e.g. for pd plots).

The only "odd" thing is that at DEBUG level we have duplicate logs for CV as we there create fresh estimators for each fold/param combo and therefore have fold times each log as you can see in this screenshot:

image

Copy link
Collaborator

@stanmart stanmart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, this looks good to me. I don't mind deduplicated debug messages too much. They are usually hidden by default, and whenever the user wants debug logging extra info is usually not a problem.

@DavidEiglspergerQC DavidEiglspergerQC merged commit 88547f7 into main Mar 12, 2026
24 checks passed
@DavidEiglspergerQC DavidEiglspergerQC deleted the fix/deduplicate-casting-log-message branch March 12, 2026 16:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants