Synthetic models and data #14

jack89roberts · 2025-05-16T17:31:21Z

arc_tigers.model.beta_model

Has functions to generate synthetic models. You can run it as a script to get an idea of what the various parameters do and what the simulated model outputs look like.
Main points:
- It uses two Beta distributions - one is used when the target label is 0 (produces scores biased towards 0), and the other when the target label is 1 (produces scores biased towards 1).
- There are two main hyperparameters - positive error rate and negative error rate. These are the proportions of samples in each class that would be incorrectly classified (assuming a threshold of 0.5), and they are used to fit the Beta distributions. For example, a positive error rate of 0.1 means 10% of the positive samples will get a score below 0.5.
- As a starting point I've added a rough way of setting the error rates based on the dataset imbalance and a term I'm calling "model advantage", which is roughly meant to correspond to how many times better (in terms of error rate) the model is than a baseline of just predicting the majority class. The difficulty scales with the imbalance - i.e. the positive error rate will increase as the data imbalance is increased.
- The BetaModel class is my attempt to make the synthetic models compatible with running evaluations/predictions with the transformers trainer. It seems to work, but I'm not sure it's worth it compared with simply bypassing transformers when the model isn't a transformers model.
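
To illustrate the error-rate idea above, here is a minimal sketch using only the standard library. It is not the fitting code in arc_tigers.model.beta_model: as a simplifying assumption it fixes one shape parameter of each Beta distribution to 1, so the CDF at 0.5 has a closed form. The function names are hypothetical.

```python
import math
import random


def beta_params_for_error_rates(positive_error_rate, negative_error_rate):
    """Return ((a_neg, b_neg), (a_pos, b_pos)) matching the error rates at 0.5.

    Simplifying assumption (not the PR's fitting method): one shape
    parameter is fixed to 1 so the Beta CDF has a closed form.
      Beta(a, 1): P(score < 0.5) = 0.5 ** a
      Beta(1, b): P(score > 0.5) = 0.5 ** b
    Both error rates must lie strictly between 0 and 1.
    """
    a_pos = math.log(positive_error_rate) / math.log(0.5)
    b_neg = math.log(negative_error_rate) / math.log(0.5)
    return (1.0, b_neg), (a_pos, 1.0)


def sample_scores(labels, positive_error_rate, negative_error_rate, seed=0):
    """Draw one synthetic score per label from the class-specific Beta."""
    rng = random.Random(seed)
    neg, pos = beta_params_for_error_rates(positive_error_rate, negative_error_rate)
    return [rng.betavariate(*(pos if y == 1 else neg)) for y in labels]


# Check that the empirical error rates land near the requested ones.
n = 10_000
labels = [1] * n + [0] * n
scores = sample_scores(labels, positive_error_rate=0.1, negative_error_rate=0.05)
pos_errors = sum(s < 0.5 for s, y in zip(scores, labels) if y == 1) / n
neg_errors = sum(s >= 0.5 for s, y in zip(scores, labels) if y == 0) / n
print(f"positive error rate: {pos_errors:.3f}, negative error rate: {neg_errors:.3f}")
```

A two-parameter fit would give more control over the shape of each distribution, at the cost of needing a numerical solver.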

arc_tigers.data.synthetic

Generates a HuggingFace Dataset of synthetic labels (0 and 1 with a specified imbalance) and a dummy text column (with 'yes' and 'no' values depending on the label). The text column is just for compatibility with other functions that expect the data/model to be using something other than just a label.
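
A rough sketch of what such a dataset could look like, not the actual arc_tigers.data.synthetic code. The function name, the column names ("label", "text"), and the mapping of 1 to 'yes' are assumptions made for illustration, and imbalance here means the fraction of positive samples.

```python
import random


def make_synthetic_columns(n_samples, positive_fraction, seed=0):
    """Build label and dummy text columns with a fixed share of positives."""
    n_pos = round(n_samples * positive_fraction)
    labels = [1] * n_pos + [0] * (n_samples - n_pos)
    random.Random(seed).shuffle(labels)
    # Assumed mapping: label 1 -> "yes", label 0 -> "no".
    text = ["yes" if y == 1 else "no" for y in labels]
    return {"label": labels, "text": text}


columns = make_synthetic_columns(n_samples=1000, positive_fraction=0.1)
# With the `datasets` package installed, the columns can be wrapped as:
#   from datasets import Dataset
#   dataset = Dataset.from_dict(columns)
```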

scripts/random_sampling

I've made it even messier.
But it should now have the options to use a synthetic model on either a synthetic dataset or a Reddit dataset. I've successfully run it with both, but I don't have proper versions of the Reddit datasets locally.

Other Notes:

This builds on an old version of branch 4, so I've set the base to that branch to make the diff look neater. The base should be switched to main once 4 is merged (and there will probably be conflicts with any recent changes to 4).
I've been using a different definition of imbalance to Jack D. My definition is the percentage of the dataset that belongs to the positive class. Jack D's definition is how many negative samples there are per positive sample. I've converted between the two where needed in the sampling script but we should settle on one.
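
For reference, the two definitions are related by simple arithmetic. A minimal sketch of the conversion, with hypothetical function names (not the ones in the sampling script) and the positive share expressed as a fraction between 0 and 1 rather than a percentage:

```python
def positive_fraction_to_ratio(positive_fraction):
    """Fraction of positives -> number of negative samples per positive."""
    return (1.0 - positive_fraction) / positive_fraction


def ratio_to_positive_fraction(negatives_per_positive):
    """Number of negative samples per positive -> fraction of positives."""
    return 1.0 / (1.0 + negatives_per_positive)


# 20% positives is roughly 4 negatives per positive, and vice versa.
ratio = positive_fraction_to_ratio(0.2)
fraction = ratio_to_positive_fraction(4)
print(ratio, fraction)
```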

jack89roberts added 3 commits May 15, 2025 18:28

wip synthetic model

cc1ca09

Add synthetic model functionality

3f50daa

Delete fit_beta.ipynb

86d6438

jack89roberts changed the base branch from main to 4-random-sampling May 16, 2025 17:36

J-Dymond added 4 commits May 20, 2025 14:21

update gitignore

e622bc8

merge with 4-random-sampling PR

f20ab34

type_checker fix

6ddc961

changes to get_reddit_data for multi-class classification

7679692

klh5 marked this pull request as ready for review May 21, 2025 09:39

klh5 merged commit 18e0253 into 4-random-sampling May 21, 2025
5 checks passed

klh5 deleted the 12-synthetic-data-model branch May 21, 2025 09:45
