To downsample imbalanced data or not, with #TidyTuesday bird feeders | Julia Silge #82
Hi Julia, thanks for the video! I have a couple of questions:

Thanks!
Thanks for the great questions @gunnergalactico!
The way this would work is that you could pass a specific seed in as an argument, or set the seed outside the call with `set.seed()`:

```r
set.seed(123)
sample.int(10^4, 1)
#> [1] 2463
```

Created on 2023-01-19 with reprex v2.0.2

If you do that over and over, you'll see that you get the same thing every time.
It was always possible, but you will get errors if you use a model that only produces class predictions and does not produce class probability predictions. An example of a model that can only produce class predictions is the LiblineaR SVM model.
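As an illustration, here is a minimal sketch of how a class-only engine behaves, using the built-in `two_class_dat` data from modeldata and assuming the LiblineaR package is installed:

```r
library(tidymodels)

data(two_class_dat, package = "modeldata")

# LiblineaR SVMs produce only hard class predictions, so metrics that
# need class probabilities (like an ROC curve) are not available:
svm_spec <- svm_linear(mode = "classification") |>
  set_engine("LiblineaR")

svm_fit <- fit(svm_spec, Class ~ ., data = two_class_dat)

predict(svm_fit, two_class_dat, type = "class")   # works
# predict(svm_fit, two_class_dat, type = "prob")  # errors for this engine
```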
There's an issue open in yardstick about making a similar report (tidymodels/yardstick#308), so I encourage you to chime in there with your use case and what you're looking for.
How much of a difference will it make to model performance to replace the missing values with simple mean-based imputation, as opposed to a KNN-driven imputer, given that 37 of the 62 variables are 1/0 and another 10-12 are ordinal with between 3 and 9 levels?
@NatarajanLalgudi That is not something I would ever dare to guess at, since the result depends so much on the relationships in your data. The way to find out is to try both approaches and evaluate the results using careful resampling. In tidymodels, you could do this with a workflow set (use whatever model you intend to implement, plus one recipe that imputes with the mean and one that imputes with KNN).
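For example, a minimal sketch of that comparison, assuming a data frame `df` with an outcome column `y` (all names here are placeholders for your own data and model):

```r
library(tidymodels)

# One recipe per imputation strategy:
rec_mean <- recipe(y ~ ., data = df) |>
  step_impute_mean(all_numeric_predictors())

rec_knn <- recipe(y ~ ., data = df) |>
  step_impute_knn(all_predictors())

# Cross every recipe with the model you intend to use:
wf_set <- workflow_set(
  preproc = list(mean = rec_mean, knn = rec_knn),
  models  = list(lm = linear_reg())
)

folds <- vfold_cv(df, v = 10)

# Fit each workflow to every resample and compare metrics:
imputation_res <- workflow_map(
  wf_set, "fit_resamples", resamples = folds, seed = 123
)
collect_metrics(imputation_res)
```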
Hi Julia,
That's a great point @sweiner123! I am already treating this as a linear regression problem, so I don't think that would change, but dealing with some of the observations being the same bird baths could be a great option. You could either aggregate the data for each bird bath before starting modeling, or you could use a resampling strategy that keeps the same sites together in a resample (to reduce overly optimistic performance estimates). I would lean toward the second, and you can read more about this kind of resampling here.
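For example, a minimal sketch of site-grouped resampling with rsample, assuming a `bird_baths` data frame with a `site` column identifying each bird bath (both names are placeholders):

```r
library(rsample)

# Every row from a given site lands in the same fold, so a model is
# never assessed on a bird bath it also trained on:
folds <- group_vfold_cv(bird_baths, group = site, v = 10)
folds
```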
That's some really cool resampling! Thanks for pointing that out! |
Nice! The tradeoff between sensitivity and specificity could maybe be more naturally explored by varying the decision threshold. I'd be curious how the log-loss looks for the resampled cases after applying the adjustment from King and Zeng, e.g. https://stats.stackexchange.com/a/611132/232706 |
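The threshold part of that idea can be sketched with the probably package; a minimal example, assuming a data frame `preds` with a truth factor `Class` and a predicted probability column `.pred_bird` (all names are placeholders):

```r
library(probably)

# Computes sensitivity, specificity, and related metrics at each
# candidate decision threshold, without refitting the model:
preds |>
  threshold_perf(Class, .pred_bird, thresholds = seq(0.05, 0.95, by = 0.05))
```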
Hi Julia, and thanks for all these great posts. I have a question about setting up folds for tuning hyperparameters using cross-validation. Here, you create your folds using the training data set; does each fold then end up training on fewer observations once downsampling is applied?
@bnagelson I think you are understanding correctly that when you use downsampling, fewer observations are used for training than when you don't use downsampling. If you use something like `step_downsample()` from the themis package in your recipe, the downsampling is applied only when the model is fitted to the analysis set of each resample, not when performance is estimated on the assessment set, so your metrics are computed on data with the original class proportions.
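A minimal sketch of that setup, assuming a `bird_train` data frame with a binary outcome `squirrels` (names echo the blog post but are placeholders here):

```r
library(tidymodels)
library(themis)

# step_downsample() is a skipped step by default (skip = TRUE), so it
# only changes the analysis portion of each resample, never the
# assessment set used to estimate performance:
rec <- recipe(squirrels ~ ., data = bird_train) |>
  step_downsample(squirrels)

folds <- vfold_cv(bird_train, v = 10, strata = squirrels)

res <- workflow(rec, logistic_reg()) |>
  fit_resamples(folds)

collect_metrics(res)
```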
Hi Julia, does tidymodels offer us a way to tweak the cost function for misclassification across the minority and majority classes? I'm working on a series of classification models that survey a range of model engines, as you do in the textbook chapter, and I'm looking for an alternative to explore rather than upsampling or downsampling. Thanks!
@jlecornu3 Yep! You can check out the support for case weights in tidymodels.
Hello Julia, to piggyback on this question, is there a way to apply class weights in tidymodels? The documentation on the website says it's experimental but doesn't have an example of how it would be done to counter class imbalance. For example, in sklearn I would do something like this:

```python
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

target = train.target_column
class_weights = compute_class_weight("balanced", classes=np.unique(target), y=target)
class_weight = dict(zip(np.unique(target), class_weights))
```

I can then pass that into the model. Thanks a bunch!
@gunnergalactico I can't quite tell from a brief look at the scikit-learn documentation whether it is more like case weights or more like subsampling for class imbalance. Take a look at these two articles to see which is what you are looking for:
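For reference, a minimal sketch of the case-weights route in tidymodels, assuming a `train` data frame with a binary outcome `class` whose minority level is "Rare" (the names and the weight value are placeholders):

```r
library(tidymodels)

# Upweight the minority class; importance_weights() marks the column
# as case weights so downstream functions treat it correctly:
train_wt <- train |>
  mutate(wt = importance_weights(ifelse(class == "Rare", 5, 1)))

# The weights column is pulled out before preprocessing, so it is not
# treated as a predictor by the formula:
wf <- workflow() |>
  add_case_weights(wt) |>
  add_formula(class ~ .) |>
  add_model(logistic_reg())

fit(wf, data = train_wt)
```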
To downsample imbalanced data or not, with #TidyTuesday bird feeders | Julia Silge
A data science blog
https://juliasilge.com/blog/project-feederwatch/