
Improve GPU performance #288

@jeremiedb

Description

This issue is related to the SciML Small Grants program.

The objective of this project is to improve the GPU backend so that the training benchmarks are in a competitive range with XGBoost.

We want to bring GPU training time to no more than 25% above that of XGBoost on the benchmarks: https://github.com/Evovest/EvoTrees.jl/blob/main/benchmarks/regressor.jl

Current benchmarks

| device | nobs | nfeats | max_depth | train_evo | train_xgb | diff |
|--------|------|--------|-----------|-----------|-----------|------|
| gpu | 1000000 | 10 | 6 | 2.2 | 1.0 | 111% |
| gpu | 1000000 | 10 | 11 | 24.3 | 3.1 | 679% |
| gpu | 1000000 | 100 | 6 | 3.6 | 3.2 | 13% |
| gpu | 1000000 | 100 | 11 | 31.1 | 8.5 | 265% |
| gpu | 10000000 | 10 | 6 | 8.4 | 7.8 | 8% |
| gpu | 10000000 | 10 | 11 | 42.1 | 13.6 | 209% |
| gpu | 10000000 | 100 | 6 | 21.7 | 28.9 | -25% |
| gpu | 10000000 | 100 | 11 | 68.6 | 57.8 | 19% |

Note that the desired performance is already met in some scenarios, e.g. 10M observations, 100 features, and a depth of 6 or 11.

However, the gap is substantial at larger depths and smaller numbers of observations, e.g. 679% slower for depth 11 with 1M observations and 10 features.

A key bottleneck is expected to be the overhead incurred per node at larger depths, since the number of nodes grows as 2^depth: a large number of kernels is launched as the tree gets deeper. Also, only the gradients and histograms are currently computed on the GPU; gains and best node splits could also be computed on the GPU to limit GPU-to-CPU communications.
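As a rough illustration of moving the split search onto the GPU, the sketch below evaluates the gains of all (bin, feature) candidates of a node directly from its histogram and reduces to the best split on the device, so only a scalar gain and an index cross the GPU-to-CPU boundary. The function name, the `(3, nbins, nfeats)` histogram layout, and the regularization handling are assumptions for illustration, not EvoTrees' actual API.

```julia
using CUDA

# Hypothetical sketch: evaluate every candidate split of one node on the GPU and
# reduce to the best one, so only a scalar gain and an index are copied to the CPU.
# Assumes `hist` of shape (3, nbins, nfeats) holding per-bin sums of gradient,
# hessian and weight; `lambda` is the L2 regularization term.
function best_split_gpu(hist::CuArray{Float32,3}, lambda::Float32)
    hL = cumsum(hist, dims=2)        # left-child stats for every candidate split bin
    hP = hL[:, end:end, :]           # parent totals (last cumulative bin)
    hR = hP .- hL                    # right-child stats by complement
    gain(g, h) = g .^ 2 ./ (h .+ lambda)
    gains = gain(hL[1, :, :], hL[2, :, :]) .+
            gain(hR[1, :, :], hR[2, :, :]) .-
            gain(hP[1, :, :], hP[2, :, :])
    best_gain, idx = findmax(gains)  # single device-side reduction
    return best_gain, Tuple(idx)     # (gain, (bin, feature)); empty bins would need masking in practice
end
```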

Existing work

A path for improvement has been developed in https://github.com/Evovest/EvoTrees.jl/tree/gpu-hist

| device | nobs | nfeats | max_depth | train_evo | train_xgb | diff |
|--------|------|--------|-----------|-----------|-----------|------|
| gpu | 1000000 | 10 | 6 | 1.6 | 1.0 | 68% |
| gpu | 1000000 | 10 | 11 | 4.8 | 2.7 | 75% |
| gpu | 1000000 | 100 | 6 | 3.9 | 2.9 | 33% |
| gpu | 1000000 | 100 | 11 | 11.4 | 8.0 | 43% |
| gpu | 10000000 | 10 | 6 | 11.0 | 7.6 | 45% |
| gpu | 10000000 | 10 | 11 | 21.6 | 13.2 | 64% |
| gpu | 10000000 | 100 | 6 | 39.1 | 27.3 | 43% |
| gpu | 10000000 | 100 | 11 | 81.2 | 51.0 | 59% |

Its approach is to build the histograms and gains as single arrays for the full tree, allowing a single kernel call at each depth regardless of the number of nodes.
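A rough sketch of what such a per-depth histogram kernel could look like with CUDA.jl is shown below; the array layout, names, and the atomic-accumulation strategy are assumptions for illustration, not necessarily what the gpu-hist branch does.

```julia
using CUDA

# Illustrative sketch: histograms for all nodes of the current depth live in one
# array `hist` of shape (3, nbins, nfeats, nnodes); a single launch covers all
# observations and `nidx` selects which node slice each observation accumulates into.
# ∇ is (3, nobs) with gradient, hessian and weight; x_bin is (nobs, nfeats).
function hist_kernel!(hist, ∇, x_bin, nidx, is)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(is)
        obs = is[i]
        node = nidx[obs]
        for feat in axes(x_bin, 2)
            bin = x_bin[obs, feat]
            CUDA.@atomic hist[1, bin, feat, node] += ∇[1, obs]
            CUDA.@atomic hist[2, bin, feat, node] += ∇[2, obs]
            CUDA.@atomic hist[3, bin, feat, node] += ∇[3, obs]
        end
    end
    return nothing
end

# one launch per depth, regardless of the number of nodes:
# @cuda threads=256 blocks=cld(length(is), 256) hist_kernel!(hist, ∇, x_bin, nidx, is)
```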

The other important design change is that instead of tracking a vector of observation ids for each node, a single vector of length equal to the number of observations tracks the node id to which each observation is currently assigned:

```julia
function update_nodes_idx_gpu!(nidx, is, x_bin, cond_feats, cond_bins, feattypes)
```
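A possible shape for such a kernel with CUDA.jl is sketched below (the gpu-hist branch implementation may differ). It assumes `cond_bins[n] == 0` marks an unsplit node, and that `feattypes[feat] == true` denotes an ordered feature compared with `<=` while categorical features are compared with `==`.

```julia
using CUDA

# Hedged sketch of the per-observation node-index update: one thread per observation
# reads its current node id, applies that node's split condition, and writes the
# child node id (2n for left, 2n + 1 for right).
function update_nodes_idx_kernel!(nidx, is, x_bin, cond_feats, cond_bins, feattypes)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(is)
        obs = is[i]
        n = nidx[obs]
        bin = cond_bins[n]
        if bin != 0                      # assumption: 0 marks a node that is not split
            feat = cond_feats[n]
            is_left = feattypes[feat] ? x_bin[obs, feat] <= bin : x_bin[obs, feat] == bin
            nidx[obs] = is_left ? 2 * n : 2 * n + 1
        end
    end
    return nothing
end

# @cuda threads=256 blocks=cld(length(is), 256) update_nodes_idx_kernel!(nidx, is, x_bin, cond_feats, cond_bins, feattypes)
```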

However, this work does not leverage the histogram subtraction trick: the histogram of the child with the larger number of observations can be derived by subtracting the other child's histogram from the parent's. It is not clear whether the subtraction trick remains useful with this design, where each observation must be iterated over to know which node it belongs to.
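For reference, with the single-array layout assumed above, the subtraction trick itself reduces to one elementwise broadcast; the open question is whether it saves enough over the per-depth accumulation pass to be worth keeping.

```julia
# Illustration only (layout assumed as above, works for CPU or GPU arrays): recover the
# larger child's histogram from the parent's minus the smaller child's, instead of a
# second pass over the data.
function subtract_hist!(hist, parent::Integer, built_child::Integer, other_child::Integer)
    @views hist[:, :, :, other_child] .= hist[:, :, :, parent] .- hist[:, :, :, built_child]
    return nothing
end
```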

General considerations

There is no hard requirement to keep the current design, where each node is treated as a distinct entity, or to adopt the gpu-hist WIP, where a single kernel can be launched per depth.

Both reducing the overhead associated with the large number of nodes and restructuring the computation so that its cost grows linearly with depth may be sound designs; mixing both ideas may also be possible.

Having the GPU computation performed through KernelAbstractions.jl would be a nice-to-have, notably to allow AMD GPU support.
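For instance, the node-index update sketched earlier could be written once with KernelAbstractions.jl and launched on any supported backend; this is only an illustrative translation, with the same assumed argument semantics.

```julia
using KernelAbstractions

# Vendor-agnostic variant of the node-index update sketch (CUDA, ROCm, ... backends).
@kernel function update_nodes_idx_ka!(nidx, @Const(is), @Const(x_bin), @Const(cond_feats), @Const(cond_bins), @Const(feattypes))
    i = @index(Global)
    obs = is[i]
    n = nidx[obs]
    bin = cond_bins[n]
    if bin != 0
        feat = cond_feats[n]
        is_left = feattypes[feat] ? x_bin[obs, feat] <= bin : x_bin[obs, feat] == bin
        nidx[obs] = is_left ? 2 * n : 2 * n + 1
    end
end

# backend = get_backend(nidx)
# update_nodes_idx_ka!(backend, 256)(nidx, is, x_bin, cond_feats, cond_bins, feattypes; ndrange = length(is))
```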

Success criteria

EvoTrees should be at most 25% slower than XGBoost for both the 1M and 10M observation cases, per the basic benchmark.

This should be reproducible on an RTX 3090, RTX 4090, or RTX A4000.
The solution should be purely Julia based and should not result in a significant increase in code complexity or lines of code.

Below is an example of the target performance.

| device | nobs | nfeats | max_depth | train_evo | train_xgb | diff |
|--------|------|--------|-----------|-----------|-----------|------|
| gpu | 1000000 | 10 | 6 | 1.2 | 1.0 | 25% |
| gpu | 1000000 | 10 | 11 | 3.4 | 2.7 | 25% |
| gpu | 1000000 | 100 | 6 | 3.6 | 2.9 | 25% |
| gpu | 1000000 | 100 | 11 | 10.0 | 8.0 | 25% |
| gpu | 10000000 | 10 | 6 | 9.5 | 7.6 | 25% |
| gpu | 10000000 | 10 | 11 | 16.5 | 13.2 | 25% |
| gpu | 10000000 | 100 | 6 | 34.2 | 27.3 | 25% |
| gpu | 10000000 | 100 | 11 | 63.8 | 51.0 | 25% |
