Fix/label distribution entropy #733

Open
michaelellis003 wants to merge 2 commits into huggingface:main from michaelellis003:fix/label-distribution-entropy

@michaelellis003
Fixes #659

Skewness is statistically inappropriate for categorical label variables: it depends on arbitrary integer encoding and measures symmetry rather than uniformity. For example, [0,0,1,1,1,1,1,2,2] and [0,0,1,1,2,2,2,2,2] have the same class counts up to relabeling (2, 5, 2 versus 2, 2, 5) but different skewness values. Entropy is identical for both, as expected.
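The example above can be checked with a short sketch. It uses only `scipy.stats.skew`, `scipy.stats.entropy`, and `numpy.unique`; the helper name `class_counts` is made up for illustration and is not part of the module.

```python
import numpy as np
from scipy.stats import entropy, skew

# The two label vectors from the example: same class counts, different encoding.
a = [0, 0, 1, 1, 1, 1, 1, 2, 2]  # counts per class: 2, 5, 2
b = [0, 0, 1, 1, 2, 2, 2, 2, 2]  # counts per class: 2, 2, 5


def class_counts(labels):
    """Hypothetical helper: number of occurrences of each distinct label."""
    _, counts = np.unique(labels, return_counts=True)
    return counts


# Skewness treats the integer codes as numeric values, so it changes
# when the majority class is relabeled from 1 to 2.
skew_a, skew_b = skew(a), skew(b)

# Entropy only sees the class proportions, so relabeling has no effect.
# scipy.stats.entropy normalizes the counts and uses the natural log by default.
entropy_a = entropy(class_counts(a))
entropy_b = entropy(class_counts(b))

print(f"skew:    a={skew_a:.4f}  b={skew_b:.4f}")
print(f"entropy: a={entropy_a:.4f}  b={entropy_b:.4f}")
```

The first vector is symmetric around its middle code, so its skewness is zero, while the second is skewed toward the largest code. The two entropy values agree up to floating-point rounding.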

This PR replaces label_skew with:

  • label_entropy — Shannon entropy in nats (0 = single class, log(k) = uniform)
  • label_entropy_normalized — entropy / log(k), giving a 0-to-1 balance score
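A minimal sketch of how the two values could be computed, assuming labels arrive as a flat list of hashable values. The function name `label_entropy_metrics` is hypothetical. Returning 0.0 for the normalized score when only one class is present is also an assumption (log(1) = 0 would otherwise cause a division by zero), since the PR description does not say how that case is handled.

```python
import math
from collections import Counter

from scipy.stats import entropy


def label_entropy_metrics(labels):
    """Hypothetical helper returning both entropy values for a list of labels."""
    # Counter works for strings as well as integers, so no encoding step is needed.
    counts = list(Counter(labels).values())
    k = len(counts)

    # scipy.stats.entropy normalizes the counts to probabilities and uses
    # the natural logarithm by default, so the result is in nats.
    h = float(entropy(counts))

    # Assumption: with a single class, log(k) is 0, so report 0.0 rather than divide.
    h_normalized = h / math.log(k) if k > 1 else 0.0

    return {"label_entropy": h, "label_entropy_normalized": h_normalized}


balanced = label_entropy_metrics(["cat", "dog", "bird", "cat", "dog", "bird"])
imbalanced = label_entropy_metrics(["cat"] * 8 + ["dog", "bird"])
single = label_entropy_metrics(["cat", "cat", "cat"])
```

Here `balanced` should reach the maximum (entropy close to log(3), normalized score close to 1), `imbalanced` falls strictly between the extremes, and `single` yields 0 for both values.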

Note: this is a breaking change — label_skew is removed, not deprecated. The old value was statistically meaningless, so keeping it as a deprecated field would mean continuing to return a wrong number.

Changes:

  • label_distribution.py: replace scipy.stats.skew with scipy.stats.entropy, remove unused pandas import and string-to-integer conversion
  • README.md: update output docs, examples, interpretation guidance, and references
  • test_label_distribution.py: new test suite (8 tests) covering uniformity, imbalance, permutation invariance, string labels, and edge cases
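The test file itself is not shown in this conversation. As a rough illustration of the listed properties, pytest-style checks might look like the sketch below. `entropy_stats` is a hypothetical stand-in for the metric under test, and the single-class behavior (both values 0) is an assumption.

```python
import math
from collections import Counter

from scipy.stats import entropy


def entropy_stats(labels):
    """Hypothetical stand-in for the metric: returns (entropy, normalized entropy)."""
    counts = list(Counter(labels).values())
    h = float(entropy(counts))  # nats
    k = len(counts)
    return h, (h / math.log(k) if k > 1 else 0.0)


def test_uniform_labels_reach_maximum():
    h, h_norm = entropy_stats([0, 1, 2, 0, 1, 2])
    assert math.isclose(h, math.log(3))
    assert math.isclose(h_norm, 1.0)


def test_imbalance_lowers_entropy():
    _, balanced = entropy_stats([0, 0, 1, 1])
    _, imbalanced = entropy_stats([0, 0, 0, 1])
    assert imbalanced < balanced


def test_permutation_invariance():
    # Same class counts, different integer codes for the majority class.
    h1, _ = entropy_stats([0, 0, 1, 1, 1, 1, 1, 2, 2])
    h2, _ = entropy_stats([0, 0, 1, 1, 2, 2, 2, 2, 2])
    assert math.isclose(h1, h2)


def test_string_labels_match_integer_labels():
    h_str, _ = entropy_stats(["cat", "dog", "cat", "dog"])
    h_int, _ = entropy_stats([0, 1, 0, 1])
    assert math.isclose(h_str, h_int)


def test_single_class_is_zero():
    # Assumed edge-case behavior: one class gives zero for both values.
    h, h_norm = entropy_stats([7, 7, 7])
    assert h == 0.0 and h_norm == 0.0
```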

Skewness is statistically inappropriate for categorical label variables
because it depends on arbitrary integer encoding and measures symmetry
rather than uniformity. Replace it with Shannon entropy, which is
permutation-invariant and correctly quantifies how balanced a label
distribution is.

Changes:
- Replace scipy.stats.skew with scipy.stats.entropy
- Return label_entropy (nats) and label_entropy_normalized (0 to 1)
- Remove unused pandas import and string-to-integer conversion
- Update docstrings, README examples, and references
- Add test suite covering uniformity, imbalance, permutation invariance,
  string labels, and edge cases

Fixes huggingface#659
Development

Successfully merging this pull request may close these issues.

Statistically nonsensical to use skewness in label_distribution