
Evaluate multiple modeling approaches for #TidyTuesday spam email | Julia Silge #94

utterances-bot opened this issue Nov 24, 2023 · 9 comments

@utterances-bot

Evaluate multiple modeling approaches for #TidyTuesday spam email | Julia Silge

A data science blog

https://juliasilge.com/blog/spam-email/

@alejandrohagan

Hi Julia!
First, thank you so much for your work on this package and this blog. I can't emphasize enough how much your work has helped me grow my confidence in this area and, most importantly, made this fun!

Some of my colleagues use Python, and I can honestly say I am running circles around them because of the work you and the tidymodels team have done here. I'm a huge fan.

Quick question for you: just as workflow_map() fits models to resamples so that we can view the results, is there a similar way to use workflow_map() on the testing data set?

While this may run counter to the overall workflow / pipeline I see in machine learning, where we tune against resamples of the training set, extract the best model, and do a last_fit(), for one reason or another we may want to see how the many models perform against the testing set.

Is there any way to do this with workflow_set() and workflow_map()?

@juliasilge

@alejandrohagan Thank you so much for the kind words! ❤️

There isn't currently an automatic way to use a workflow_set() with the testing set, mainly because we see a workflow_set() as something you use during model development, while the testing set is only used for confirming expected performance after you have chosen a final model. If you have a fitted workflow_set(), then you can use extract_workflow_set_result() to get out the results for a specific workflow and then do whatever you want with them, like finalize that workflow and predict() on the testing set.
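For a concrete starting point, here is a minimal sketch of that idea. `spam_res` and the workflow id `"formula_rf_tune"` are from the blog post, but `spam_split` (the initial train/test split) and the metric name are assumptions here, so adjust to your own objects:

```r
library(tidymodels)

# Pull the tuning results for one workflow from the fitted workflow set
# and pick its best hyperparameters (metric name is an assumption)
best_params <-
  spam_res %>%
  extract_workflow_set_result("formula_rf_tune") %>%
  select_best(metric = "accuracy")

# Finalize that workflow and evaluate it once on the testing set;
# `spam_split` is assumed to be the initial_split() object
final_rf <-
  spam_res %>%
  extract_workflow("formula_rf_tune") %>%
  finalize_workflow(best_params) %>%
  last_fit(spam_split)

collect_metrics(final_rf)      # test-set metrics
collect_predictions(final_rf)  # test-set predictions
```

You could repeat this over every wflow_id in the set to get test-set metrics for all the models, but as noted above that works against the idea of touching the testing set only once.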

@NizePetcharat

Hi Julia,

Thank you for your amazing work on the blog. Your efforts made my learning enjoyable!

I'm curious about the vip() function. When fitting multiple models and wanting to determine variable importance for all of them, should we extract the VI values from fit_resamples() or last_fit()? Additionally, if we need to use it with workflow_map() fitting, how can we extract the workflow or the parsnip fit from the process? Thank you for your help!

@juliasilge

@NizePetcharat If you want to use variable importance as part of your process of comparing and choosing a model, then I would do that with your resamples, yes. You might check out this Stack Overflow answer where I outline how to approach this.
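For reference, the approach outlined there looks roughly like the following. This is a sketch, not the blog post's code: `rf_wf` (a workflow whose random forest spec was created with something like set_engine("ranger", importance = "impurity")) and `spam_folds` (the resamples) are placeholder names:

```r
library(tidymodels)
library(vip)

# Compute variable importance inside each resample via the `extract` argument
ctrl <- control_resamples(
  extract = function(x) {
    x %>%
      extract_fit_parsnip() %>%  # pull the parsnip fit out of the workflow
      vi()                       # variable importance scores
  }
)

rf_res <- fit_resamples(rf_wf, resamples = spam_folds, control = ctrl)

# The importance scores live in the .extracts list column, one set per resample
rf_res %>%
  select(id, .extracts) %>%
  unnest(.extracts) %>%
  unnest(.extracts)
```

With workflow_map() the same idea should work by passing a control object such as control_grid(extract = ...) through its control argument, and from there you can summarize or plot the importance scores across resamples to compare models before ever touching the testing set.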

@NarainritKaruna

Hi,

If it turns out that formula_rf_tune is the best, how can we extract mtry etc. to train and evaluate the final model?

@juliasilge

@NarainritKaruna Take a look at how you can use extract_workflow_set_result(): https://workflowsets.tidymodels.org/reference/extract_workflow_set_result.html


NarainritKaruna commented Jul 8, 2024

Thanks Julia,

I usually use select_best(), which gives me mtry and min_n. However, when I use extract_workflow_set_result(spam_res, "formula_rf_tune"), the parameters (mtry & min_n) are not shown directly.

Edit: I finally got it by pulling from the "results". Thanks!
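Concretely, that looks something like this sketch (`spam_res` and `"formula_rf_tune"` are the objects from the post; the metric name is an assumption):

```r
library(tidymodels)

# The extracted object is the tuning result for that single workflow
rf_tune_res <- extract_workflow_set_result(spam_res, "formula_rf_tune")

# The tuned parameter values live inside those results
show_best(rf_tune_res, metric = "roc_auc")

# A one-row tibble with mtry and min_n, ready for finalize_workflow()
best_rf <- select_best(rf_tune_res, metric = "roc_auc")
best_rf
```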

@gezelle-d

Hi Julia, thank you for this tutorial (and all your other tutorials too)! I have data that originated from four separate studies. Each study examines the effect of a different medication on treatment response. All four studies have the same baseline variables and outcome variables. Beyond classifying treatment success based on baseline variables, I want to determine which baseline variables are most important in classifying success for each of those four medications. I am planning to do this using a workflow set of different algorithms (xgboost, random forest, svm), finalising the best performing workflow, and examining variable importance. I am unsure how best to compare variable importance across the four studies. Would it be best to collate all the data and run one model with interaction effects between each medication and each baseline variable, or to run four separate models (one for each medication) and compare the importance of variables across the four models? I am unsure whether the former approach would allow me to isolate the importance of specific variables by medication. Would really appreciate your thoughts on this!

@juliasilge

@gezelle-d You might want to take a look at https://www.tmwr.org/explain and especially section 18.4. If it makes the most sense to train one single model for all four types of medication, then you can still understand something about variable importance for the four options. In the analogy of the approach shown in Fig 18.6, the four types of homes would be like your four medications.

I also recommend posting this kind of question on Posit Community, which is a great forum for getting help with these kinds of modeling discussions.
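In case a sketch helps, the grouped-profile idea from that chapter could look something like the following with DALEXtra. Every object and column name here (`med_fit`, `med_train`, `outcome`, `medication`, `baseline_age`) is a placeholder for illustration, not from any real analysis:

```r
library(tidymodels)
library(DALEXtra)   # also attaches DALEX

# `med_fit`: a workflow fitted on the pooled data from the four studies
# `med_train`: the training data, with `outcome` (success/failure) and a
# `medication` column identifying which study each row came from
explainer <-
  explain_tidymodels(
    med_fit,
    data  = med_train %>% select(-outcome),
    y     = as.integer(med_train$outcome == "success"),
    label = "pooled model"
  )

# Partial dependence of one baseline variable, computed separately per
# medication: the analogue of grouping by building type in TMWR Fig 18.6
pdp_by_med <- model_profile(
  explainer,
  variables = "baseline_age",
  groups    = "medication",
  N         = 500
)

plot(pdp_by_med)
```

Differences between the medication-specific curves then give a sense of whether a baseline variable matters more for one medication than another.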
