
Improve GPU performance #288

@jeremiedb

Description

This issue is related to the SciML Small Grants program.

The objective of this project is to improve the GPU backend so that the training benchmarks are in a competitive range with XGBoost.

We want to bring GPU training time to no more than 25% above that of XGBoost on the benchmarks: https://github.com/Evovest/EvoTrees.jl/blob/main/benchmarks/regressor.jl

Current benchmarks

| device | nobs | nfeats | max_depth | train_evo | train_xgb | diff |
|--------|------|--------|-----------|-----------|-----------|------|
| gpu | 1000000 | 10 | 6 | 2.2 | 1.0 | 111% |
| gpu | 1000000 | 10 | 11 | 24.3 | 3.1 | 679% |
| gpu | 1000000 | 100 | 6 | 3.6 | 3.2 | 13% |
| gpu | 1000000 | 100 | 11 | 31.1 | 8.5 | 265% |
| gpu | 10000000 | 10 | 6 | 8.4 | 7.8 | 8% |
| gpu | 10000000 | 10 | 11 | 42.1 | 13.6 | 209% |
| gpu | 10000000 | 100 | 6 | 21.7 | 28.9 | -25% |
| gpu | 10000000 | 100 | 11 | 68.6 | 57.8 | 19% |

Note that the desired performance is already met in some scenarios, e.g. 10M observations, 100 features, and a depth of 6 or 11.

However, the gap is substantial at larger depths and smaller numbers of observations, e.g. 679% slower for depth 11 with 1M observations and 10 features.

A key bottleneck is expected to be the overhead incurred per node at larger depths, since the number of nodes grows as 2^depth: a large number of kernels is launched as the tree gets deeper. Also, only the gradients and histograms are currently computed on the GPU; gains and best node splits could also be computed on the GPU to limit GPU-to-CPU communications.
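As a rough illustration of moving the split search onto the GPU, the sketch below evaluates the gains of all (bin, feature) candidates of a node directly from its histogram and reduces to the best split on the device, so only a scalar gain and an index cross the GPU-to-CPU boundary. The function name, the `(3, nbins, nfeats)` histogram layout, and the regularization handling are assumptions for illustration, not EvoTrees' actual API.

```julia
using CUDA

# Hypothetical sketch: evaluate every candidate split of one node on the GPU and
# reduce to the best one, so only a scalar gain and an index are copied to the CPU.
# Assumes `hist` of shape (3, nbins, nfeats) holding per-bin sums of gradient,
# hessian and weight; `lambda` is the L2 regularization term.
function best_split_gpu(hist::CuArray{Float32,3}, lambda::Float32)
    hL = cumsum(hist, dims=2)        # left-child stats for every candidate split bin
    hP = hL[:, end:end, :]           # parent totals (last cumulative bin)
    hR = hP .- hL                    # right-child stats by complement
    gain(g, h) = g .^ 2 ./ (h .+ lambda)
    gains = gain(hL[1, :, :], hL[2, :, :]) .+
            gain(hR[1, :, :], hR[2, :, :]) .-
            gain(hP[1, :, :], hP[2, :, :])
    best_gain, idx = findmax(gains)  # single device-side reduction
    return best_gain, Tuple(idx)     # (gain, (bin, feature)); empty bins would need masking in practice
end
```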

Existing work

A path for improvement has been developed in https://github.com/Evovest/EvoTrees.jl/tree/gpu-hist

| device | nobs | nfeats | max_depth | train_evo | train_xgb | diff |
|--------|------|--------|-----------|-----------|-----------|------|
| gpu | 1000000 | 10 | 6 | 1.6 | 1.0 | 68% |
| gpu | 1000000 | 10 | 11 | 4.8 | 2.7 | 75% |
| gpu | 1000000 | 100 | 6 | 3.9 | 2.9 | 33% |
| gpu | 1000000 | 100 | 11 | 11.4 | 8.0 | 43% |
| gpu | 10000000 | 10 | 6 | 11.0 | 7.6 | 45% |
| gpu | 10000000 | 10 | 11 | 21.6 | 13.2 | 64% |
| gpu | 10000000 | 100 | 6 | 39.1 | 27.3 | 43% |
| gpu | 10000000 | 100 | 11 | 81.2 | 51.0 | 59% |

Its approach is to build the histograms and gains as single arrays for the full tree, allowing a single kernel call at each depth regardless of the number of nodes.
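A rough sketch of what such a per-depth histogram kernel could look like with CUDA.jl is shown below; the array layout, names, and the atomic-accumulation strategy are assumptions for illustration, not necessarily what the gpu-hist branch does.

```julia
using CUDA

# Illustrative sketch: histograms for all nodes of the current depth live in one
# array `hist` of shape (3, nbins, nfeats, nnodes); a single launch covers all
# observations and `nidx` selects which node slice each observation accumulates into.
# ∇ is (3, nobs) with gradient, hessian and weight; x_bin is (nobs, nfeats).
function hist_kernel!(hist, ∇, x_bin, nidx, is)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(is)
        obs = is[i]
        node = nidx[obs]
        for feat in axes(x_bin, 2)
            bin = x_bin[obs, feat]
            CUDA.@atomic hist[1, bin, feat, node] += ∇[1, obs]
            CUDA.@atomic hist[2, bin, feat, node] += ∇[2, obs]
            CUDA.@atomic hist[3, bin, feat, node] += ∇[3, obs]
        end
    end
    return nothing
end

# one launch per depth, regardless of the number of nodes:
# @cuda threads=256 blocks=cld(length(is), 256) hist_kernel!(hist, ∇, x_bin, nidx, is)
```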

The other important design change is that instead of tracking a vector of observation ids for each node, a single vector of length equal to the number of observations tracks the node id to which each observation is currently assigned:

```julia
function update_nodes_idx_gpu!(nidx, is, x_bin, cond_feats, cond_bins, feattypes)
```
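A possible shape for such a kernel with CUDA.jl is sketched below (the gpu-hist branch implementation may differ). It assumes `cond_bins[n] == 0` marks an unsplit node, and that `feattypes[feat] == true` denotes an ordered feature compared with `<=` while categorical features are compared with `==`.

```julia
using CUDA

# Hedged sketch of the per-observation node-index update: one thread per observation
# reads its current node id, applies that node's split condition, and writes the
# child node id (2n for left, 2n + 1 for right).
function update_nodes_idx_kernel!(nidx, is, x_bin, cond_feats, cond_bins, feattypes)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(is)
        obs = is[i]
        n = nidx[obs]
        bin = cond_bins[n]
        if bin != 0                      # assumption: 0 marks a node that is not split
            feat = cond_feats[n]
            is_left = feattypes[feat] ? x_bin[obs, feat] <= bin : x_bin[obs, feat] == bin
            nidx[obs] = is_left ? 2 * n : 2 * n + 1
        end
    end
    return nothing
end

# @cuda threads=256 blocks=cld(length(is), 256) update_nodes_idx_kernel!(nidx, is, x_bin, cond_feats, cond_bins, feattypes)
```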

However, this work does not leverage the histogram subtraction trick: the histogram of the child with the larger number of observations can be derived by subtracting the other child's histogram from the parent's. It is not clear whether the subtraction trick remains useful with this design, where each observation must be iterated over to know which node it belongs to.
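For reference, with the single-array layout assumed above, the subtraction trick itself reduces to one elementwise broadcast; the open question is whether it saves enough over the per-depth accumulation pass to be worth keeping.

```julia
# Illustration only (layout assumed as above, works for CPU or GPU arrays): recover the
# larger child's histogram from the parent's minus the smaller child's, instead of a
# second pass over the data.
function subtract_hist!(hist, parent::Integer, built_child::Integer, other_child::Integer)
    @views hist[:, :, :, other_child] .= hist[:, :, :, parent] .- hist[:, :, :, built_child]
    return nothing
end
```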

General considerations

There is no hard requirement to keep the current design, where each node is treated as a distinct entity, or to adopt the gpu-hist WIP, where a single kernel can be launched per depth.

Both reducing the overhead associated with the large number of nodes and restructuring the computation so that its cost grows linearly with depth may be sound designs; mixing both ideas may also be possible.

Having the GPU computation performed through KernelAbstractions.jl would be a nice-to-have, notably to allow AMD GPU support.
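For instance, the node-index update sketched earlier could be written once with KernelAbstractions.jl and launched on any supported backend; this is only an illustrative translation, with the same assumed argument semantics.

```julia
using KernelAbstractions

# Vendor-agnostic variant of the node-index update sketch (CUDA, ROCm, ... backends).
@kernel function update_nodes_idx_ka!(nidx, @Const(is), @Const(x_bin), @Const(cond_feats), @Const(cond_bins), @Const(feattypes))
    i = @index(Global)
    obs = is[i]
    n = nidx[obs]
    bin = cond_bins[n]
    if bin != 0
        feat = cond_feats[n]
        is_left = feattypes[feat] ? x_bin[obs, feat] <= bin : x_bin[obs, feat] == bin
        nidx[obs] = is_left ? 2 * n : 2 * n + 1
    end
end

# backend = get_backend(nidx)
# update_nodes_idx_ka!(backend, 256)(nidx, is, x_bin, cond_feats, cond_bins, feattypes; ndrange = length(is))
```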

Success criteria

EvoTrees should be at most 25% slower than XGBoost for both the 1M and 10M observation cases, per the basic benchmark.

This should be reproducible on an RTX 3090, RTX 4090, or RTX A4000.
The solution should be purely Julia based and should not result in a significant increase in code complexity or lines of code.

Below is an example of the target performance.

| device | nobs | nfeats | max_depth | train_evo | train_xgb | diff |
|--------|------|--------|-----------|-----------|-----------|------|
| gpu | 1000000 | 10 | 6 | 1.2 | 1.0 | 25% |
| gpu | 1000000 | 10 | 11 | 3.4 | 2.7 | 25% |
| gpu | 1000000 | 100 | 6 | 3.6 | 2.9 | 25% |
| gpu | 1000000 | 100 | 11 | 10.0 | 8.0 | 25% |
| gpu | 10000000 | 10 | 6 | 9.5 | 7.6 | 25% |
| gpu | 10000000 | 10 | 11 | 16.5 | 13.2 | 25% |
| gpu | 10000000 | 100 | 6 | 34.2 | 27.3 | 25% |
| gpu | 10000000 | 100 | 11 | 63.8 | 51.0 | 25% |
