
Predict the magnitude of #TidyTuesday tornadoes with effect encoding and xgboost | Julia Silge #92

utterances-bot opened this issue Jun 15, 2023 · 7 comments

Comments

@utterances-bot


A data science blog

https://juliasilge.com/blog/tornadoes/


Hello Julia,

Thanks for another informative post. Your method of handling high cardinality categorical variables through likelihood encoding was interesting.

I noticed that the 'st' variable is a top contributor to the model. However, the encoding adds a degree of abstraction, and I am trying to interpret the effects of specific states on tornado magnitude. Can we map the encoded 'st' values back to the original states for a more intuitive interpretation? Or do the encoded values themselves offer a straightforward way to understand these effects?

Moreover, I am wondering whether a partial dependence plot (PDP) could be used to further explore the effect of each state.

Thanks again for your insightful post. Looking forward to more.

@juliasilge
Owner

@msahil515 Yes, you can get the value associated with each level of st by tidying the recipe. Check out how I do that in this similar post -- look for tidy().
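A minimal sketch of that tidying step, assuming the recipe used embed::step_lencode_glm() on st with mag as the outcome (the object names tornado_rec and tornado_train are placeholders, not the post's exact code):

```r
library(tidymodels)
library(embed)

# hypothetical recipe mirroring the post's likelihood encoding of `st`
tornado_rec <-
  recipe(mag ~ st + mo, data = tornado_train) |>
  step_lencode_glm(st, outcome = vars(mag))

# prep() estimates the encoding; tidy() with the step number pulls out
# the learned values, one row per state: `level` (the original state
# abbreviation) and `value` (its numeric encoding)
tidy(prep(tornado_rec), number = 1)
```

Joining that tibble back to your data by state abbreviation recovers the state-to-encoding mapping for interpretation.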

You could also use a partial dependence profile to examine the results further. I like using model_profile() from DALEX, as shown here.
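A hedged sketch of that workflow, where tornado_fit (a fitted workflow) and tornado_train are placeholders for objects from the post:

```r
library(DALEXtra)

# wrap the fitted tidymodels workflow in an explainer
explainer <-
  explain_tidymodels(
    tornado_fit,
    data = dplyr::select(tornado_train, -mag),
    y = tornado_train$mag,
    label = "xgboost"
  )

# partial dependence profile for the state variable
pdp_st <- model_profile(explainer, variables = "st")
plot(pdp_st)
```

For a categorical variable like st, the profile shows the average predicted magnitude per level, which is a natural companion to the encoded values from tidy().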


Hello Julia

I would like to use a different encoding method for categorical variables, similar to the internal PCA ordering method that ranger uses (adapted from Coppersmith et al.). It is target-based, so it needs to be done within each fold rather than before splitting the data. How can I incorporate this into a recipe step, please?

Many thanks!

@juliasilge
Owner

@smithhelen Take a look at this article on how to create your own recipe step.
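A bare-bones skeleton of the pattern that article describes, under stated assumptions: the step name step_rank_encode is made up, the outcome is passed as a column-name string, and the scoring inside prep() is a placeholder (a per-level mean, standing in for the PCA-based ordering):

```r
library(recipes)

# user-facing constructor
step_rank_encode <- function(recipe, ..., role = "predictor",
                             trained = FALSE, outcome = NULL,
                             mapping = NULL, skip = FALSE,
                             id = rand_id("rank_encode")) {
  add_step(
    recipe,
    step_rank_encode_new(
      terms = rlang::enquos(...), role = role, trained = trained,
      outcome = outcome, mapping = mapping, skip = skip, id = id
    )
  )
}

# internal constructor
step_rank_encode_new <- function(terms, role, trained, outcome,
                                 mapping, skip, id) {
  step(
    subclass = "rank_encode",
    terms = terms, role = role, trained = trained,
    outcome = outcome, mapping = mapping, skip = skip, id = id
  )
}

# prep() runs on each fold's analysis set, so the target-based scores
# are learned without leaking into the assessment set
prep.step_rank_encode <- function(x, training, info = NULL, ...) {
  col_names <- recipes_eval_select(x$terms, training, info)
  mapping <- lapply(col_names, function(col) {
    # placeholder scoring: replace with the PCA-based ordering
    tapply(training[[x$outcome]], training[[col]], mean)
  })
  names(mapping) <- col_names
  step_rank_encode_new(
    terms = x$terms, role = x$role, trained = TRUE,
    outcome = x$outcome, mapping = mapping, skip = x$skip, id = x$id
  )
}

# bake() applies the learned mapping to new data
bake.step_rank_encode <- function(object, new_data, ...) {
  for (col in names(object$mapping)) {
    new_data[[col]] <-
      unname(object$mapping[[col]][as.character(new_data[[col]])])
  }
  new_data
}
```

Because the scores are estimated in prep(), tune() and fit_resamples() will re-learn them inside every resample automatically.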


Hello Julia, congratulations on your impressive work.

I have a question about the grid argument in tune_race_anova(). Is the grid the total number of combinations of the levels of trees, min_n, and mtry? Or will 15 levels be considered for each of these hyperparameters, making a total grid of 15^3?

Thank you.

@juliasilge
Owner

@robsonpro Ah no, if you set grid = 15, the way it works is to choose a grid_max_entropy() design with 15 candidates total. You can read more about this kind of behavior in this chapter, and especially this section. Notice where it says:

The default design used by the tune package is the maximum entropy design.

You can provide your own grid in that argument, using any of the kinds of grid specifications outlined in that chapter. If you use the default or do something like grid = 10, it will do a maximum entropy grid with 10 elements.
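A sketch of the two behaviors, with hypothetical objects tornado_wf and tornado_folds standing in for a tuneable workflow and its resamples:

```r
library(finetune)
library(dials)

# grid = 15: one space-filling (maximum entropy) design,
# 15 candidate combinations in total
race_res <- tune_race_anova(
  tornado_wf,
  resamples = tornado_folds,
  grid = 15
)

# or supply your own grid explicitly, e.g. a regular grid with
# 3 levels per parameter, giving 3^3 = 27 candidates
my_grid <- grid_regular(
  trees(),
  min_n(),
  mtry(range = c(2L, 8L)),
  levels = 3
)

race_res <- tune_race_anova(
  tornado_wf,
  resamples = tornado_folds,
  grid = my_grid
)
```

So the 15^3 behavior only happens if you explicitly build a regular grid with 15 levels per parameter and pass it in yourself.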

@robsonpro

Thank you so much for your attention and explanation, @juliasilge. I get it now.
