High cardinality predictors for #TidyTuesday museums in the UK | Julia Silge #79
Comments
Once again, an awesome episode! Thanks a lot Julia! One question I have: do you have any recommended reading on how to pick a model to start playing with? I guess a lot must come from practicing and becoming familiar with the different models, but do you know a good place to start gaining a sense of whether one should start with an xgboost model or an SVM? Of course, I'm assuming that model selection matters, but I got that impression from the screencasts! Once again, thanks for the videos, I'm always looking forward to the next one!
@cedricbatailler I think a good place to start could be ISLR or Applied Predictive Modeling. I don't think either is really focused on software (the how) but they are great for learning (the what and why).
Again, a fantastic screencast. Thank you, Julia. Since model interpretability is gaining a lot of attention, it would be great if you could showcase in a future episode the capabilities of tidymodels and other packages supporting e.g. SHAP and Shapley values for local and global feature explanations.
@viv-analytics If you'd like to look at how to approach that with tidymodels, you can check this chapter of Tidy Modeling with R.
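A minimal sketch of the approach that chapter takes, using DALEXtra on a fitted tidymodels workflow (`fitted_wf`, `train_data`, `outcome`, and `new_obs` are placeholder names, not from the post):

```r
library(DALEXtra)
library(dplyr)

# Wrap a fitted tidymodels workflow in an explainer;
# `data` holds the predictors and `y` the observed outcome
explainer <- explain_tidymodels(
  fitted_wf,
  data = train_data %>% select(-outcome),
  y = train_data$outcome
)

# Shapley values for a single observation (a local explanation)
shap <- predict_parts(explainer, new_observation = new_obs, type = "shap")
plot(shap)
```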
Agree with @viv-analytics. Model interpretability for black box models is essential. I tried
Hi Julia, and thank you for your helpful and informative posts. I embedded a categorical variable (with cardinality 790) using both step_lencode_mixed and step_lencode_bayes on an unbalanced dataset (98.6%/1.4%). I noticed that besides the "..new" level added, there are also other new levels: for some original levels (e.g., "HRG"), after embedding there is both the original level ("HRG") and a new one ("HRGDisposition"). "Disposition" is the name of the target variable. Is this due to the combination of high cardinality and extremely unbalanced data? The obvious problem is that these new levels (e.g., "HRGDisposition") are not going to be in the new data, so all of them will be assigned to the "..new" level. Am I doing something wrong here?
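For reference, a minimal sketch of the kind of recipe I'm describing (`train_data`, `high_card_var`, and the formula are placeholders for my actual setup, not a reproducible example of the bug):

```r
library(recipes)
library(embed)

# Effect-encode a high-cardinality predictor against the outcome;
# step_lencode_mixed fits a mixed-effects model per level
rec <- recipe(Disposition ~ ., data = train_data) |>
  step_lencode_mixed(high_card_var, outcome = vars(Disposition)) |>
  prep()

# Inspect the learned encodings; this is where the unexpected
# levels like "HRGDisposition" appear alongside "..new"
tidy(rec, number = 1)
```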
@amin0511ss Hmmmm, I wouldn't think so; that doesn't make a ton of sense to me. Can you create a reprex (a minimal reproducible example) for this? It can be tough to create a reprex for something super specific like this, but a reprex can make it easier for us to recreate your problem so that we can understand it and/or fix it. If you've never heard of a reprex before, you may want to start with the tidyverse.org help page. Once you have a reprex, I recommend posting on Posit Community, which is a great forum for getting help with these kinds of modeling questions. Thanks! 🙌
High cardinality predictors for #TidyTuesday museums in the UK | Julia Silge
A data science blog
https://juliasilge.com/blog/uk-museums/