Fix/label distribution entropy by michaelellis003 · Pull Request #733 · huggingface/evaluate

michaelellis003 · 2026-02-13T14:32:54Z

Fixes #659

Skewness is statistically inappropriate for categorical label variables: it depends on the arbitrary integer encoding of the classes and measures symmetry rather than uniformity. For example, [0,0,1,1,1,1,1,2,2] and [0,0,1,1,2,2,2,2,2] have the same class proportions up to relabeling (counts 2, 5, 2 versus 2, 2, 5) but different skewness values. Entropy is identical for both, as expected.
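The example above can be checked with a minimal, dependency-free sketch. The helpers `skewness` and `shannon_entropy` are hypothetical names written for this illustration and are not the code in this PR, which uses scipy.stats.entropy. The skewness here is the biased Fisher-Pearson form (third central moment divided by the second central moment raised to 1.5).

```python
import math
from collections import Counter


def skewness(values):
    """Biased sample skewness: m3 / m2**1.5 (illustrative helper)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def shannon_entropy(labels):
    """Shannon entropy in nats, computed from label counts only."""
    n = len(labels)
    return sum((c / n) * math.log(n / c) for c in Counter(labels).values())


a = [0, 0, 1, 1, 1, 1, 1, 2, 2]  # majority class encoded as 1
b = [0, 0, 1, 1, 2, 2, 2, 2, 2]  # majority class encoded as 2

# a is symmetric around its mean, so its skewness is zero;
# b puts the majority class at one end, so its skewness is negative.
print(skewness(a), skewness(b))

# Entropy only sees the counts, so both lists give the same value
# (up to floating-point rounding).
print(shannon_entropy(a), shannon_entropy(b))
```

Only the integer assigned to the majority class differs between the two lists, yet skewness changes sign-wise from zero to negative while entropy stays the same.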

This PR replaces label_skew with:

label_entropy — Shannon entropy in nats (0 = single class, log(k) = uniform)
label_entropy_normalized — entropy / log(k), giving a 0-to-1 balance score
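As a rough sketch of how the two values relate, the following computes both from a list of labels using only the standard library. The function name `label_entropy_scores` is hypothetical, and returning 0.0 for the normalized score when only one class is present is an assumption made here to avoid dividing by log(1) = 0; the PR's actual handling of that edge case is not shown in this description.

```python
import math
from collections import Counter


def label_entropy_scores(labels):
    """Return Shannon entropy (nats) and entropy / log(k) for a label list."""
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)  # number of distinct classes observed
    entropy = sum((c / n) * math.log(n / c) for c in counts.values())
    # Assumption: with a single class, log(k) is 0, so report 0.0
    # instead of dividing by zero.
    normalized = entropy / math.log(k) if k > 1 else 0.0
    return {
        "label_entropy": float(entropy),
        "label_entropy_normalized": float(normalized),
    }


uniform = label_entropy_scores(["a", "b", "c", "a", "b", "c"])
single = label_entropy_scores(["a", "a", "a"])
skewed = label_entropy_scores([0] * 9 + [1])
print(uniform, single, skewed)
```

With three equally frequent classes the entropy is close to log(3) and the normalized score is close to 1; with a single class both values are 0.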

Note: this is a breaking change — label_skew is removed, not deprecated. The old value was statistically meaningless, so keeping it as a deprecated field would mean continuing to return a wrong number.

Changes:

label_distribution.py: replace scipy.stats.skew with scipy.stats.entropy, remove unused pandas import and string-to-integer conversion
README.md: update output docs, examples, interpretation guidance, and references
test_label_distribution.py: new test suite (8 tests) covering uniformity, imbalance, permutation invariance, string labels, and edge cases
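The properties listed for the test suite could be exercised along the following lines. This is an illustrative sketch, not the contents of test_label_distribution.py: `normalized_entropy` is a hypothetical stand-in for the metric's label_entropy_normalized output, and the single-class return value of 0.0 is an assumption.

```python
import math
import random
from collections import Counter


def normalized_entropy(labels):
    """Hypothetical stand-in for label_entropy_normalized."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    if k < 2:
        return 0.0  # assumption: a single class scores 0
    h = sum((c / n) * math.log(n / c) for c in counts.values())
    return h / math.log(k)


def test_permutation_invariance():
    labels = [0, 0, 1, 1, 1, 1, 1, 2, 2]
    # Rename the classes and reorder the samples; the score should not move.
    relabeled = [{0: 2, 1: 0, 2: 1}[x] for x in labels]
    shuffled = random.Random(0).sample(labels, len(labels))
    base = normalized_entropy(labels)
    assert math.isclose(base, normalized_entropy(relabeled))
    assert math.isclose(base, normalized_entropy(shuffled))


def test_string_labels():
    # String labels need no integer conversion.
    assert math.isclose(normalized_entropy(["cat", "dog", "cat", "dog"]), 1.0)


def test_imbalance_lowers_score():
    assert normalized_entropy([0] * 9 + [1]) < normalized_entropy([0] * 5 + [1] * 5)
```

The functions follow pytest naming conventions, so a runner such as pytest would collect them automatically; they can also be called directly.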

Skewness is statistically inappropriate for categorical label variables because it depends on arbitrary integer encoding and measures symmetry rather than uniformity. Replace it with Shannon entropy, which is permutation-invariant and correctly quantifies how balanced a label distribution is.

Changes:
- Replace scipy.stats.skew with scipy.stats.entropy
- Return label_entropy (nats) and label_entropy_normalized (0 to 1)
- Remove unused pandas import and string-to-integer conversion
- Update docstrings, README examples, and references
- Add test suite covering uniformity, imbalance, permutation invariance, string labels, and edge cases

Fixes huggingface#659

michaelellis003 added 2 commits February 12, 2026 19:01

Cast entropy values to float for consistent output across numpy versions

6d8c022

michaelellis003 wants to merge 2 commits into huggingface:main from michaelellis003:fix/label-distribution-entropy
