When data augmentation is recommended, should the metrics do the augmentation internally? Or should the user do it beforehand? #779

@npatki

Environment details

  • SDMetrics version: 0.21.0

Background

For certain metrics, such as BinaryClassifierPrecisionEfficacy and EqualizedOddsImprovement, the user is generally interested in augmenting the real data with synthetic data. These metrics are therefore meant to compare the real data against the augmented data (that is, real + synthetic data).

For these cases, we should decide on a consistent way for the user to input the datasets. Namely, we should decide whether the metric itself should do the augmentation (internally), or whether the user is expected to do it before calling the metric.

Details

Alternative A: The metric should do the augmentation internally. This means that the user would provide the real data and synthetic data individually. Example:

Metric.compute(
  real_data=my_real_dataset,
  synthetic_data=my_synthetic_dataset
)
  • Pros: The metric will guarantee that the augmentation is done
  • Cons: There isn't much flexibility to try out other usages
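
To make Alternative A concrete, here is a minimal sketch of a metric that builds the augmented dataset itself. Everything in it is an assumption for illustration: `augment_internally`, `compute`, and the `score_fn` parameter are hypothetical names, not part of the SDMetrics API, and `score_fn` stands in for whatever the real metric would measure.

```python
import pandas as pd


def augment_internally(real_data: pd.DataFrame, synthetic_data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: stack the real and synthetic rows into one dataset."""
    return pd.concat([real_data, synthetic_data], ignore_index=True)


def compute(real_data: pd.DataFrame, synthetic_data: pd.DataFrame, score_fn) -> dict:
    """Hypothetical Alternative A entry point.

    The caller passes the two datasets separately; the augmentation
    happens here, so it is guaranteed to have been done.
    `score_fn` is a placeholder that maps a DataFrame to a number.
    """
    augmented_data = augment_internally(real_data, synthetic_data)
    return {
        'real': score_fn(real_data),
        'augmented': score_fn(augmented_data),
    }
```

With this shape the user never handles the augmented dataset, which is what gives the guarantee but also what removes the flexibility noted in the cons.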

Alternative B: The user should do the augmentation themselves. The metric then directly compares the two datasets it receives. Example:

import pandas as pd

my_augmented_dataset = pd.concat([my_real_dataset, my_synthetic_dataset])

Metric.compute(
  real_data=my_real_dataset,
  augmented_dataset=my_augmented_dataset
)
  • Pros: The metric is more straightforward to explain
  • Cons: We cannot guarantee that the augmentation is done (unless the metric itself checks whether the real dataset is a subset of the augmented dataset, which adds some complex logic)
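
The subset check mentioned in the cons could look roughly like the sketch below. It is only one possible approach, not an existing SDMetrics feature: `contains_all_rows` is a hypothetical helper that counts how often each distinct row appears in both datasets and verifies that the augmented dataset has at least as many copies as the real one. It assumes both datasets share the same column names and compatible dtypes, and it relies on exact value equality, so floating-point columns that were modified or re-cast during augmentation may fail to match.

```python
import pandas as pd


def contains_all_rows(real_data: pd.DataFrame, augmented_data: pd.DataFrame) -> bool:
    """Hypothetical check: is every real row (duplicates included) in the augmented data?"""
    if set(real_data.columns) != set(augmented_data.columns):
        return False

    columns = list(real_data.columns)

    # Count occurrences of each distinct row; dropna=False keeps rows with missing values.
    real_counts = (
        real_data.groupby(columns, dropna=False).size().reset_index(name='n_real')
    )
    augmented_counts = (
        augmented_data[columns].groupby(columns, dropna=False).size().reset_index(name='n_augmented')
    )

    # Left merge: real rows absent from the augmented data get a missing count.
    merged = real_counts.merge(augmented_counts, on=columns, how='left')
    return bool((merged['n_augmented'].fillna(0) >= merged['n_real']).all())
```

Even this simple version has to deal with duplicate rows, missing values, and dtype drift, and it groups over every column of both datasets, which illustrates why the check adds complexity to the metric.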

Labels: question (General question about the software)
