
Multithreaded support for apply_forest_proba (Issue #209) #210

Open
wants to merge 2 commits into base: dev
Conversation

salbert83
Contributor

@salbert83 salbert83 commented Jan 15, 2023

I think regression uses the functions in ../classification/main.jl for applying forests to a set of features, so no new development is required for this.

Fixes #209

@codecov-commenter
Copy link

codecov-commenter commented Jan 15, 2023

Codecov Report

Merging #210 (686a44b) into dev (835f3cd) will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              dev     #210      +/-   ##
==========================================
+ Coverage   87.99%   88.02%   +0.03%     
==========================================
  Files          10       10              
  Lines        1249     1253       +4     
==========================================
+ Hits         1099     1103       +4     
  Misses        150      150              
Impacted Files Coverage Δ
src/classification/main.jl 96.16% <100.00%> (+0.05%) ⬆️


Member

@rikhuijzer rikhuijzer left a comment


This PR seems like a step in the right direction to solve #209.

apply_tree_proba(tree::Root{S, T}, features::AbstractMatrix{S}, labels; use_multithreading = false) where {S, T} =
    apply_tree_proba(tree.node, features, labels, use_multithreading = use_multithreading)
apply_tree_proba(tree::LeafOrNode{S, T}, features::AbstractMatrix{S}, labels; use_multithreading = false) where {S, T} =
    stack_function_results(row->apply_tree_proba(tree, row, labels), features, use_multithreading = use_multithreading)
Member


Suggested change
stack_function_results(row->apply_tree_proba(tree, row, labels), features, use_multithreading = use_multithreading)
stack_function_results(row->apply_tree_proba(tree, row, labels), features; use_multithreading)

Lower bound is set to 1.6 so no need to repeat the keyword name
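
For reference, a minimal sketch of that shorthand (the function `f` and its keyword here are made up for illustration, not part of the package): since Julia 1.5, a variable whose name matches the keyword can be passed after the semicolon without repeating `name = name`.

```julia
# Hypothetical function, for illustration only.
f(x; use_multithreading = false) = use_multithreading ? "threaded" : "serial"

use_multithreading = true
f(1; use_multithreading)                       # shorthand, equivalent to:
f(1; use_multithreading = use_multithreading)  # the explicit form
```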

Comment on lines +32 to +38
    for i in 1:N
        out[i, :] = row_fun(X[i, :])
    end
else
    for i in 1:N
        out[i, :] = row_fun(X[i, :])
    end
Member


Cases are the same?
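
Presumably the first branch was meant to use `Threads.@threads`. A hedged sketch of the likely intent (the signature and `Float64` element type are assumptions, not the PR's actual code):

```julia
# Sketch of the presumed intent: thread the row loop only when requested.
function stack_function_results(row_fun, X::AbstractMatrix; use_multithreading = false)
    N = size(X, 1)
    ncols = length(row_fun(X[1, :]))  # assumes X has at least one row
    out = Array{Float64}(undef, N, ncols)
    if use_multithreading
        Threads.@threads for i in 1:N
            out[i, :] = row_fun(X[i, :])
        end
    else
        for i in 1:N
            out[i, :] = row_fun(X[i, :])
        end
    end
    return out
end
```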

@@ -16,6 +16,8 @@ cm = confusion_matrix(labels, preds)
@test depth(model) == 1
probs = apply_tree_proba(model, features, classes)
@test reshape(sum(probs, dims=2), n) ≈ ones(n)
probs_m = apply_tree_proba(model, features, classes, use_multithreading=true)
Member

@rikhuijzer rikhuijzer Jan 15, 2023


Although there isn't a formatting style guide for this repository, consistent use of spaces around keyword-argument equals signs seems like a good start. Here at line 19 there are no spaces, while at lines 33 and 59 there are. MLJ style is no spaces around the keyword-argument equals sign, I think.

The same holds for using the semicolon to separate positional arguments from keyword arguments: at some places in this PR it is done and at others not. In general, semicolons are advised here because they improve clarity.

Member

@ablaom ablaom left a comment


Thanks for reviewing this @rikhuijzer

@salbert83 Be good to add a docstring, as here: #208

And even better, also add an example in the README.md section on native interface. This should make the new feature more discoverable.

@ablaom
Member

ablaom commented Jan 25, 2023

@salbert83 Would you have some time soon to respond to the review?

@ablaom
Member

ablaom commented Feb 6, 2023

@rikhuijzer I'm not getting a response here. Are you willing and able to finish this?

@rikhuijzer
Member

It looks like this would result in nested @threads calls: once in stack_function_results and once inside the row_fun that is passed into stack_function_results. Nested @threads should be possible with the :dynamic scheduler, which appears to have been added only in Julia 1.8 (https://github.com/JuliaLang/julia/blob/master/HISTORY.md). Also, I don't know how to benchmark whether adding multithreading actually saves time.

So, let's leave this open until the lower bound is set to Julia 1.8 or until someone who really needs this implements and shows benchmarks?
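
For what it's worth, a minimal sketch of nested `@threads` loops with the `:dynamic` scheduler (requires Julia 1.8+; `nested_sums` is a toy function for illustration, not part of this package):

```julia
# Toy example: nested @threads loops. On Julia 1.8+ the :dynamic scheduler
# lets the inner loop share the thread pool instead of running serially
# on the outer iteration's thread.
function nested_sums(n)
    outer = zeros(n)
    Threads.@threads :dynamic for i in 1:n
        inner = zeros(n)
        Threads.@threads :dynamic for j in 1:n
            inner[j] = i * j
        end
        outer[i] = sum(inner)
    end
    return outer
end
```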

@ablaom
Member

ablaom commented Feb 7, 2023

Thanks @rikhuijzer for looking into this.

It looks like this would result in a nested @threads call.

So, does this also apply to the existing implementation added in https://github.com/JuliaAI/DecisionTree.jl/pull/188/files that therefore needs attention?

I also notice that the existing implementation is opt-in (use_multithreading=false is the default), whereas the present addition is opt-out.

@rikhuijzer
Member

So, does this also apply to the existing implementation added in https://github.com/JuliaAI/DecisionTree.jl/pull/188/files that therefore needs attention?

Maybe that explains #188 (comment). I'm afraid I don't know, and since I never need multithreading I'm not the right person to ask, unfortunately.

Maybe figuring out the right multithreading for this package is something you would like, @ExpandingMan?

@ablaom
Member

ablaom commented Feb 8, 2023

@OkonSamuel Do you see obvious issues with the way multithreading is currently implemented in prediction? It's here:

Threads.@threads for i in 1:N

@ablaom
Member

ablaom commented Feb 10, 2023

@rikhuijzer I don't believe nested multithreading is an issue. This has been tested before in MLJTuning where optimization multithreading has within it resampling multithreading. My interpretation of the 1.9 changes cited is only that nested threading will typically be more efficient with new default settings for the scheduler.

I don't see anything obviously wrong about the proposed implementation (or the existing one), provided the user must opt in, but I will wait for the pinged experts to hopefully weigh in.

Successfully merging this pull request may close these issues.

Add multithreading support in RF predictions: probabilistic classification - and regression
4 participants