Description
Environment details
- SDMetrics version: 0.21.0
Background
In certain metrics, such as BinaryClassifierPrecisionEfficacy and EqualizedOddsImprovement, the user is generally interested in augmenting the real data with synthetic data. Ultimately, these metrics are meant to compare the real data against the augmented data (i.e., real + synthetic data).
For these cases, we should decide on a consistent way for the user to input the datasets. Namely, we should decide whether the metric itself should do the augmentation (internally), or whether the user is expected to do it before calling the metric.
Details
Alternative A: The metric does the augmentation internally. The user provides the real data and synthetic data separately. Example:
Metric.compute(
real_data=my_real_dataset,
synthetic_data=my_synthetic_dataset
)
- Pros: The metric will guarantee that the augmentation is done
- Cons: There isn't much flexibility to try out other usages
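To make Alternative A concrete, here is a minimal sketch of what an internally-augmenting metric could look like. The function name `compute`, and the placeholder scoring step are assumptions for illustration; a real metric would fit and evaluate a model instead:

```python
import pandas as pd

def compute(real_data, synthetic_data):
    """Hypothetical Alternative A metric: augmentation happens internally."""
    # The metric guarantees the augmented dataset is exactly real + synthetic.
    augmented_data = pd.concat([real_data, synthetic_data], ignore_index=True)
    # Placeholder score: the fraction of augmented rows that are synthetic.
    # A real metric would train/evaluate a classifier here instead.
    return len(synthetic_data) / len(augmented_data)

real = pd.DataFrame({'a': [1, 2, 3]})
synth = pd.DataFrame({'a': [4, 5, 6]})
score = compute(real, synth)  # 0.5 for equal-sized datasets
```

Because the concatenation lives inside the metric, the user cannot accidentally pass a mismatched or pre-augmented dataset, which is the guarantee the pro above refers to.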
Alternative B: The user does the augmentation themselves. The metric then directly compares the two datasets it receives. Example:
import pandas as pd
my_augmented_dataset = pd.concat([my_real_dataset, my_synthetic_dataset])
Metric.compute(
real_data=my_real_dataset,
augmented_dataset=my_augmented_dataset
)
- Pros: The metric is more straightforward to explain
- Cons: We cannot guarantee that the augmentation was actually done (unless the metric itself checks whether the real dataset is a subset of the augmented one, which adds complex logic)
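The "complex logic" mentioned in the con above could look something like the following sketch. This is a hypothetical helper, not part of SDMetrics; it performs a simplified row-level containment check (it ignores row multiplicity, so duplicated real rows are each matched against a single augmented copy):

```python
import pandas as pd

def real_is_subset(real_data, augmented_data):
    """Hypothetical check for Alternative B: does every row of the real
    dataset also appear in the augmented dataset?"""
    # Left-merge on all shared columns; rows found in both sides are
    # flagged 'both' in the indicator column.
    merged = real_data.merge(
        augmented_data.drop_duplicates(), how='left', indicator=True
    )
    return bool((merged['_merge'] == 'both').all())

real = pd.DataFrame({'a': [1, 2]})
augmented = pd.concat([real, pd.DataFrame({'a': [3]})], ignore_index=True)
real_is_subset(real, augmented)  # True
real_is_subset(pd.DataFrame({'a': [9]}), augmented)  # False
```

Even this simplified check adds an O(n) merge over all columns on every metric call, which illustrates why validating the user's augmentation is a real cost of Alternative B.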