
Synthetic models and data #14


Merged 7 commits on May 21, 2025


@jack89roberts (Collaborator) commented May 16, 2025

arc_tigers.model.beta_model

  • Contains functions for generating synthetic models. You can run it as a script to get an idea of what the various parameters do and what the simulated model outputs look like.
  • Main points:
    • It uses two Beta distributions: one for samples whose target label is 0 (produces scores biased towards 0) and one for samples whose target label is 1 (produces scores biased towards 1).
    • There are two main hyperparameters, the positive error rate and the negative error rate. Each is the proportion of samples in that class that would be incorrectly classified (assuming a threshold of 0.5), and they are used to fit the Beta distributions. For example, a positive error rate of 0.1 means 10% of the positive samples will get a score below 0.5.
    • As a starting point I've added a rough way of setting the error rates based on the dataset imbalance and a term I'm calling "model advantage", which is roughly meant to correspond to how many times lower the model's error rate is than that of a baseline that just predicts the majority class. The difficulty scales with the imbalance, i.e. the positive error rate increases as the data imbalance increases.
    • The BetaModel class is my attempt to make the synthetic models compatible with running evaluations/predictions with the transformers trainer. It seems to work, but I'm not sure it's worth it vs. just not using transformers when it's not a transformers model.
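
To illustrate the error-rate idea above, here is a minimal sketch of one way to pick Beta parameters so that a chosen fraction of scores lands on the wrong side of 0.5. It is not the fitting code in `arc_tigers.model.beta_model`: the function names and the one-parameter families `Beta(a, 1)` and `Beta(1, b)` are assumptions made only for this example, chosen because their CDFs have a closed form.

```python
import math
import random


def beta_params_for_error_rate(error_rate: float, positive: bool) -> tuple[float, float]:
    """Return (alpha, beta) putting `error_rate` of the mass on the wrong side of 0.5.

    Hypothetical one-parameter fit (an assumption for this sketch):
    Beta(a, 1) has CDF x**a, so P(score < 0.5) = 0.5**a for the positive class.
    Beta(1, b) mirrors this, so P(score > 0.5) = 0.5**b for the negative class.
    """
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must be strictly between 0 and 1")
    shape = -math.log2(error_rate)  # solves 0.5**shape == error_rate
    return (shape, 1.0) if positive else (1.0, shape)


def sample_scores(labels, positive_error_rate, negative_error_rate, seed=0):
    """Draw one synthetic score in [0, 1] per label from the class-specific Beta."""
    rng = random.Random(seed)
    pos = beta_params_for_error_rate(positive_error_rate, positive=True)
    neg = beta_params_for_error_rate(negative_error_rate, positive=False)
    return [rng.betavariate(*(pos if y == 1 else neg)) for y in labels]
```

With a positive error rate of 0.1, roughly 10% of the scores drawn for label 1 should fall below 0.5, which matches the definition in the bullet above.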

arc_tigers.data.synthetic

  • Generates a HuggingFace Dataset of synthetic labels (0 and 1 with a specified imbalance) and a dummy text column (with 'yes' and 'no' values depending on the label). The text column is only there for compatibility with other functions that expect the data/model to use something other than just a label.
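
As a rough illustration of the data described above, the sketch below builds the two columns in plain Python. The function name, the column names, the mapping of label 1 to 'yes', and the use of an exact positive count (rather than random draws per sample) are all assumptions for this example and may differ from what `arc_tigers.data.synthetic` does.

```python
import random


def make_synthetic_columns(n_samples: int, positive_fraction: float, seed: int = 0) -> dict:
    """Build label and dummy text columns with a fixed share of positive labels.

    Assumptions for this sketch: imbalance is given as the fraction of
    positive samples, and label 1 maps to 'yes' while label 0 maps to 'no'.
    """
    n_pos = round(n_samples * positive_fraction)
    labels = [1] * n_pos + [0] * (n_samples - n_pos)
    random.Random(seed).shuffle(labels)  # mix the classes reproducibly
    text = ["yes" if y == 1 else "no" for y in labels]
    return {"text": text, "label": labels}
```

If the `datasets` package is installed, the returned dictionary can be passed to `datasets.Dataset.from_dict` to obtain a HuggingFace Dataset.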

scripts/random_sampling

  • I've made it even messier.
  • But it should now have options to use a synthetic model on either a synthetic dataset or a Reddit dataset. I've run it successfully with both, although I don't have proper versions of the Reddit datasets locally.

Other Notes:

  • This builds on an old version of branch 4, so I've set the base to that branch to make the diff look neater. The base should be switched to main once 4 is merged (and there will probably be conflicts with any recent changes to 4).
  • I've been using a different definition of imbalance to Jack D. My definition is the percentage of the dataset that belongs to the positive class. Jack D's definition is the number of negative samples per positive sample. I've converted between the two where needed in the sampling script, but we should settle on one.
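
For reference, the two imbalance definitions above are related by a simple formula. The sketch below shows the conversion in both directions; the function names are hypothetical, and it assumes the positive-class share is expressed as a fraction between 0 and 1 rather than as a percentage out of 100.

```python
def positive_fraction_to_neg_per_pos(positive_fraction: float) -> float:
    """Convert 'share of positives' into 'negative samples per positive sample'."""
    if not 0.0 < positive_fraction <= 1.0:
        raise ValueError("positive_fraction must be in (0, 1]")
    return (1.0 - positive_fraction) / positive_fraction


def neg_per_pos_to_positive_fraction(neg_per_pos: float) -> float:
    """Convert 'negative samples per positive sample' into 'share of positives'."""
    if neg_per_pos < 0.0:
        raise ValueError("neg_per_pos must be non-negative")
    return 1.0 / (1.0 + neg_per_pos)
```

For example, a dataset where 10% of samples are positive has about 9 negative samples per positive sample.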

@jack89roberts changed the base branch from main to 4-random-sampling May 16, 2025 17:36
@klh5 marked this pull request as ready for review May 21, 2025 09:39
@klh5 merged commit 18e0253 into 4-random-sampling May 21, 2025
5 checks passed
@klh5 deleted the 12-synthetic-data-model branch May 21, 2025 09:45
3 participants