
High cardinality predictors for #TidyTuesday museums in the UK | Julia Silge #79

Open
utterances-bot opened this issue Nov 27, 2022 · 7 comments

Comments

@utterances-bot

High cardinality predictors for #TidyTuesday museums in the UK | Julia Silge

A data science blog

https://juliasilge.com/blog/uk-museums/


Once again, an awesome episode! Thanks a lot Julia!

One question: do you have any recommended reading on how to pick a model to start with? I guess a lot of it comes from practice and becoming familiar with the different models, but do you know a good place to start building expertise on whether one should begin with an xgboost model or an SVM? Of course, I'm assuming that model selection does matter, but that's the impression I got from the screencasts!

Once again, thanks for the videos. I'm always looking forward to the next one!

@juliasilge
Owner

@cedricbatailler I think a good place to start could be ISLR or Applied Predictive Modeling. I don't think either is really focused on software (how) but they are great for learning (what, why).


Again, a fantastic screencast. Thank you, Julia.

Since model interpretability is gaining a lot of attention, it would be great if you could showcase in a future episode the capabilities of tidymodels and other packages that support, e.g., SHAP and Shapley values for local and global feature explanations.

@juliasilge
Owner

@viv-analytics If you'd like to look at how to approach that with tidymodels, you can check this chapter of Tidy Modeling with R.


Agree with @viv-analytics. Model interpretability for black-box models is essential. I tried the SHAPforxgboost, fastshap, and shapviz packages, but I don't know how to combine their functions with tidymodels objects; unexpected errors always occur.


Hi Julia, and thank you for your helpful and informative posts. I embedded a categorical variable (cardinality 790) using both step_lencode_mixed and step_lencode_bayes for an unbalanced dataset (98.6%/1.4%). I noticed that besides the added "..new" level, other new levels are added as well: for some original levels (e.g., "HRG"), after embedding there is both the original level ("HRG") and a new one ("HRGDisposition"). "Disposition" is the name of the target variable. Is this due to the combination of high cardinality and extremely unbalanced data? The obvious problem is that these new levels (e.g., "HRGDisposition") are not going to be in the "new data", so all of them will be assigned to the "..new" level. Am I doing something wrong here?

@juliasilge
Owner

@amin0511ss Hmmmm, I wouldn't think so; that doesn't make a ton of sense to me. Can you create a reprex (a minimal reproducible example) for this? It can be tough to create a reprex for something super specific like this, but a reprex can make it easier for us to recreate your problem so that we can understand it and/or fix it. If you've never heard of a reprex before, you may want to start with the tidyverse.org help page.

Once you have a reprex, I recommend posting on Posit Community, which is a great forum for getting help with these kinds of modeling questions. Thanks! 🙌


6 participants