Problem Description
To improve performance, the SDMetrics Quality Report may subsample the data before running some metrics. For example, the report currently subsamples larger datasets to 50K rows before running the ContingencySimilarity metric. Because the subsampling is random, the resulting score is non-deterministic. (Note that with 50K rows, we've verified that the overall score will only be affected by a small percentage.)
Nevertheless, it would be good to expose a control that allows the user to toggle the subsampling on or off, especially when they are willing to wait for the full computation and want the full, deterministic score. Alternatively, if subsampling is on, it would be good to control the number of rows to subsample.
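To illustrate why random subsampling makes a score non-deterministic, here is a minimal sketch using only the Python standard library. The `toy_score` function and the synthetic data are hypothetical stand-ins chosen for illustration; they are not the actual ContingencySimilarity computation.

```python
import random


def toy_score(rows):
    """Hypothetical stand-in for a metric: the fraction of rows equal to 'a'."""
    return sum(1 for value in rows if value == 'a') / len(rows)


# Synthetic column (an assumption for this sketch): 200K values, 30% of them 'a'.
data = ['a'] * 60_000 + ['b'] * 140_000

# Using every row always gives the same score.
full = toy_score(data)

# Each unseeded random subsample of 50K rows can give a slightly different score.
run_1 = toy_score(random.sample(data, 50_000))
run_2 = toy_score(random.sample(data, 50_000))

print(f'full: {full:.4f}, run 1: {run_1:.4f}, run 2: {run_2:.4f}')
```

In this toy setup the subsampled scores stay close to the full score but are not guaranteed to match it or each other, which is the behavior described above.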
Expected behavior
For single- and multi-table quality reports, each instance should have an attribute that can be modified by the user for subsampling. The attribute should be called: `num_rows_subsample`.
- By default, the attribute should be set to `50000` (50K)
- If the user should change the default, the new value should be used when subsampling the data for any metric's computation
- If the user sets `num_rows_subsample=None`, then no subsampling should be done.
The attribute should only affect that particular instance of the quality report.
```python
from sdmetrics.reports.single_table import QualityReport

# set the subsample to 100K rows instead of 50K
report = QualityReport()
report.num_rows_subsample = 100000
report.generate(...)

# alternatively, turn the subsampling off
report2 = QualityReport()
report2.num_rows_subsample = None
report2.generate(...)
```
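As a rough sketch of the requested behavior, the following standalone example shows one way a per-instance attribute could drive the subsampling. `SketchQualityReport` and `_maybe_subsample` are hypothetical names invented for this illustration; they are not part of SDMetrics, and the real report may implement this differently.

```python
import random


class SketchQualityReport:
    """Hypothetical stand-in for a quality report; not the SDMetrics class."""

    def __init__(self):
        # Set per instance, so changing it on one report leaves others untouched.
        self.num_rows_subsample = 50000

    def _maybe_subsample(self, rows):
        """Return all rows, or a random subsample of num_rows_subsample rows."""
        limit = self.num_rows_subsample
        if limit is None or len(rows) <= limit:
            return rows
        return random.sample(rows, limit)


rows = list(range(120_000))

default_report = SketchQualityReport()

larger_report = SketchQualityReport()
larger_report.num_rows_subsample = 100000

full_report = SketchQualityReport()
full_report.num_rows_subsample = None

print(len(default_report._maybe_subsample(rows)))  # 50000
print(len(larger_report._maybe_subsample(rows)))   # 100000
print(len(full_report._maybe_subsample(rows)))     # 120000
```

Storing the value in `__init__` rather than as a class-level constant is what keeps the setting local to each report instance in this sketch.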
Additional context
Currently, the only metric that is subsampled is ContingencySimilarity. However, should we decide to subsample any other metrics in the future, they would use the same value.