Prediction via $predict_matrix() (wrapped by .predict_single_component() in the R package) is inefficient: every input row is evaluated individually, so runtime grows with the total number of rows even when only a few distinct values occur.
To illustrate, take the mtcars dataset and the cyl variable, which has 3 possible values (4, 6, 8):
library(randomPlantedForest)
set.seed(23)
rpfit <- rpf(mpg ~ ., data = mtcars, max_interaction = 3, ntrees = 30)
(test_short <- matrix(c(rep(4, 1), 6, 8), dimnames = list(NULL, "cyl")))
#> cyl
#> [1,] 4
#> [2,] 6
#> [3,] 8
(test_long <- matrix(c(rep(4, 100), 6, 8), dimnames = list(NULL, "cyl")))
#> cyl
#> [1,] 4
[...]
#> [100,] 4
#> [101,] 6
#> [102,] 8

As expected, the predicted components are identical for identical input values:
randomPlantedForest:::.predict_single_component(
rpfit,
test_short,
predictors = "cyl"
)
#> cyl
#> [1,] 9.992948
#> [2,] 6.209208
#> [3,] 5.967651
randomPlantedForest:::.predict_single_component(
rpfit,
test_long,
predictors = "cyl"
)
#> cyl
#> [1,] 9.992948
[...]
#> [100,] 9.992948
#> [101,] 6.209208
#> [102,] 5.967651

Now, repeating a single input value (4) repeats times (the repeats parameter in bench::press() below) shows that prediction time scales roughly linearly with the number of rows:
bm <- bench::press(
repeats = c(1, 100, 1e3, 1e4, 1e5, 1e6),
{
testmat <- matrix(c(rep(4, repeats), 6, 8), dimnames = list(NULL, "cyl"))
bench::mark(
predict_single = randomPlantedForest:::.predict_single_component(
rpfit,
testmat,
predictors = "cyl"
)
)
}
)
#> Running with:
#> repeats
#> 1 1
#> 2 100
#> 3 1000
#> 4 10000
#> 5 100000
#> 6 1000000
bm
#> # A tibble: 6 × 7
#> expression repeats min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 predict_single 1 108.73µs 114.3µs 8307. 0B 8.12
#> 2 predict_single 100 2.69ms 2.8ms 356. 2.13KB 2.01
#> 3 predict_single 1000 26.72ms 27.4ms 36.6 19.71KB 0
#> 4 predict_single 10000 269.34ms 269.5ms 3.71 195.49KB 0
#> 5 predict_single 100000 2.71s 2.7s 0.370 1.91MB 0.739
#> 6 predict_single 1000000 27.41s 27.4s 0.0365 19.07MB 0.547

Created on 2025-03-22 with reprex v2.1.1
I'm not sure what the best way to approach this is, as any optimization is likely to carry a little overhead for small inputs while paying off for larger ones. I'm also not sure whether this should be tackled from the C++ side or the R side; if tackled from R, any future Python wrapper would need to come up with its own solution.
Is this where memoization comes in handy?
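One concrete option along those lines is to deduplicate before predicting: compute the component only for the unique rows of the requested predictors and expand the result back via indexing. Below is a minimal R sketch, not part of the package; the helper name predict_component_deduped is made up, and it assumes .predict_single_component keeps its current (object, newdata, predictors) interface and returns one prediction row per input row.

# Hypothetical helper (not in the package): predict on unique rows only,
# then expand the result back to the original row order.
predict_component_deduped <- function(object, newdata, predictors) {
  # Key each row by its values in the requested predictor columns
  cols <- as.data.frame(newdata[, predictors, drop = FALSE])
  key <- do.call(paste, c(cols, sep = "\r"))
  # Predict once per unique row (in order of first appearance) ...
  unique_rows <- newdata[!duplicated(key), , drop = FALSE]
  preds <- randomPlantedForest:::.predict_single_component(
    object, unique_rows, predictors = predictors
  )
  # ... and map every original row back to its unique-row prediction
  preds[match(key, unique(key)), , drop = FALSE]
}

On test_long above this would reduce the expensive call from 102 rows to 3. The paste()/match() bookkeeping is linear and cheap, so the penalty for small inputs should be negligible, and the deduplication could additionally be gated behind a row-count threshold.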
Ah, and we also shouldn't forget that this presumably holds for interactions as well, so for a dataset with a large number of binary features, for example, optimizing prediction here could probably save a lot of time.
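The same reduction would carry over to interaction components, since the deduplication runs over the joint values of all requested predictors. A hypothetical usage with the sketch above, assuming .predict_single_component accepts multiple predictors for an interaction component and using the binary am variable from mtcars:

# cyl and am jointly take at most 3 x 2 = 6 distinct value combinations,
# so at most 6 rows would need actual prediction regardless of input size.
xmat <- as.matrix(mtcars[, c("cyl", "am")])
predict_component_deduped(rpfit, xmat, predictors = c("cyl", "am"))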