Description
Environment details
- SDMetrics version: 0.21.0
Background
The SDMetrics library is set up to produce 1 final score in the range [0, 1], where 0 is the worst and 1 is the best.
For some of the SDMetrics, we are interested in computing whether the synthetic data is helping to improve some kind of task/property (a conceptual sketch follows the list below). For example:
- In BinaryClassifierPrecisionEfficacy, we are interested in knowing whether synthetic data will improve an ML classifier's predictions
- In EqualizedOddsImprovement, we are interested in knowing whether the synthetic data will improve fairness
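
As a conceptual sketch only (not the SDMetrics implementation), the comparison these efficacy-style metrics perform can be thought of as training the same downstream model twice, once on real data only and once on real + synthetic data, and comparing a downstream score. The function name, the use of scikit-learn, and the assumptions of numeric features and a binary 0/1 target are all illustrative, not taken from the library:

```python
# Hypothetical sketch of a baseline-vs-augmented comparison.
# Assumes numeric feature columns and a binary 0/1 target column.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score


def efficacy_comparison(real_train, synthetic, holdout, target_column):
    """Return (baseline_score, augmented_score) for a precision-based efficacy check."""
    # Augment the real training data with the synthetic rows
    augmented = pd.concat([real_train, synthetic], ignore_index=True)

    def fit_and_score(train):
        model = LogisticRegression(max_iter=1000)
        model.fit(train.drop(columns=[target_column]), train[target_column])
        predictions = model.predict(holdout.drop(columns=[target_column]))
        return precision_score(holdout[target_column], predictions)

    return fit_and_score(real_train), fit_and_score(augmented)
```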
The question is: how should these metrics formulate the overall score?
Details
The diagram below shows 2 alternatives for returning a final score.
- Alternative A returns the magnitude of improvement. The score is 0 if the synthetic data does not improve the task.
  - Any score >0 is considered an improvement, even something small like 0.1. This can be misleading, because 0.1 is usually considered a "bad" value for other metrics (such as KSComplement, CategoryCoverage, CorrelationSimilarity, etc.)
  - The score makes no distinction between the synthetic data having no effect vs. the synthetic data having a very bad effect (in both cases, the score is 0)
- Alternative B captures both the magnitude and the direction of improvement. The score is 0.5 if the synthetic data has no effect, <0.5 if the synthetic data has a bad effect, and >0.5 if the synthetic data is making improvements.
  - Now, only higher scores like 0.7 or 0.8 can be considered "good", while lower scores like 0.1 and 0.2 are considered "bad". This is consistent with other SDMetrics (such as KSComplement, CategoryCoverage, CorrelationSimilarity, etc.)
  - This alternative also gives us the magnitude of improvement (or lack thereof). There is now a distinction between the synthetic data having no effect (0.5) and the synthetic data being actively bad for the usage (e.g. 0.1)
  - However, it introduces an arbitrary cutoff at 0.5, which you need to keep in mind when interpreting the score
We've currently implemented Alternative B since it seems to have fewer cons, but I'm leaving this as an open question to consider Alternative A. A small sketch contrasting the two formulations is included below.
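
The following is a minimal sketch of the two scoring formulas, assuming the metric has already computed two raw values in [0, 1]: a baseline score on real data only and a score when synthetic data is added. The function names and inputs are illustrative, not the library's actual API:

```python
def alternative_a(real_score: float, augmented_score: float) -> float:
    """Alternative A: magnitude of improvement only.

    Returns 0 whenever the synthetic data does not help, so 'no effect'
    and 'actively harmful' are indistinguishable.
    """
    return max(augmented_score - real_score, 0.0)


def alternative_b(real_score: float, augmented_score: float) -> float:
    """Alternative B: magnitude and direction of improvement.

    Maps the difference from [-1, 1] into [0, 1], so 0.5 means no effect,
    >0.5 means the synthetic data helps, and <0.5 means it hurts.
    """
    return (augmented_score - real_score + 1.0) / 2.0


if __name__ == "__main__":
    # Example where synthetic data slightly hurts the downstream task:
    print(alternative_a(0.80, 0.75))  # 0.0   (indistinguishable from "no effect")
    print(alternative_b(0.80, 0.75))  # 0.475 (clearly below the 0.5 cutoff)
```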