
Optimize component-wise prediction for low-cardinality features #59

@jemus42


For a feature $X_1 \in \{1, 2\}$, we might find in the train/test data something like $x_1 = \{1, 1, 1, 1, \ldots, 1, 1, 2, 2, \ldots, 2\}$

Since $\hat{m}_1(x_1)$ is constant for each possible value, repeatedly calling the C++-level predict_matrix() (wrapped by .predict_single_component() in the R package) is inefficient.

To illustrate with the mtcars dataset and the cyl variable, which has 3 possible values (4, 6, 8):

library(randomPlantedForest)
set.seed(23)
rpfit <- rpf(mpg ~ ., data = mtcars, max_interaction = 3, ntrees = 30)

(test_short <- matrix(c(rep(4, 1), 6, 8), dimnames = list(NULL, "cyl")))
#>      cyl
#> [1,]   4
#> [2,]   6
#> [3,]   8
(test_long <- matrix(c(rep(4, 100), 6, 8), dimnames = list(NULL, "cyl")))
#>        cyl
#>   [1,]   4
[...]
#> [100,]   4
#> [101,]   6
#> [102,]   8

As expected, the predicted components are identical for identical input values:

randomPlantedForest:::.predict_single_component(
  rpfit,
  test_short,
  predictors = "cyl"
)
#>           cyl
#> [1,] 9.992948
#> [2,] 6.209208
#> [3,] 5.967651

randomPlantedForest:::.predict_single_component(
  rpfit,
  test_long,
  predictors = "cyl"
)
#>             cyl
#>   [1,] 9.992948
[...]
#> [100,] 9.992948
#> [101,] 6.209208
#> [102,] 5.967651

Now, repeating a single input value (4) a varying number of times (the repeats parameter below) shows that the prediction time scales roughly linearly with the number of rows:

bm <- bench::press(
  repeats = c(1, 100, 1e3, 1e4, 1e5, 1e6),
  {
    testmat <- matrix(c(rep(4, repeats), 6, 8), dimnames = list(NULL, "cyl"))

    bench::mark(
      predict_single = randomPlantedForest:::.predict_single_component(
        rpfit,
        testmat,
        predictors = "cyl"
      )
    )
  }
)
#> Running with:
#>   repeats
#> 1       1
#> 2     100
#> 3    1000
#> 4   10000
#> 5  100000
#> 6 1000000

bm
#> # A tibble: 6 × 7
#>   expression     repeats      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>       <dbl> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 predict_single       1 108.73µs  114.3µs 8307.            0B    8.12 
#> 2 predict_single     100   2.69ms    2.8ms  356.        2.13KB    2.01 
#> 3 predict_single    1000  26.72ms   27.4ms   36.6      19.71KB    0    
#> 4 predict_single   10000 269.34ms  269.5ms    3.71    195.49KB    0    
#> 5 predict_single  100000    2.71s     2.7s    0.370     1.91MB    0.739
#> 6 predict_single 1000000   27.41s    27.4s    0.0365   19.07MB    0.547

Created on 2025-03-22 with reprex v2.1.1

I'm not sure what the best way to approach this is, since any optimization will likely add a little overhead for small inputs while paying off for larger ones. I'm also not sure whether this should be tackled from the C++ side or the R side; if tackled from R, any future Python wrapper would need to come up with its own solution.
Is this where memoization comes in handy?
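
One R-side option, shown here only as a rough sketch (predict_component_dedup is a hypothetical helper, not anything implemented in the package, and it assumes .predict_single_component() keeps its current (object, newdata, predictors) signature), would be to predict only on the unique input rows and then expand the result back to the original order:

# Sketch only: predict once per unique input row, then map results back.
predict_component_dedup <- function(object, newdata, predictors) {
  unique_rows <- unique(newdata)

  # Key each row so original rows can be matched to their unique counterpart
  row_key <- function(m) apply(m, 1, paste, collapse = "\r")
  idx <- match(row_key(newdata), row_key(unique_rows))

  # Single call to the existing predictor on the (small) set of unique rows
  preds <- randomPlantedForest:::.predict_single_component(
    object, unique_rows, predictors = predictors
  )

  # Expand back to one prediction per original row
  preds[idx, , drop = FALSE]
}

# Should reproduce the 102-row result from above while only evaluating 3 rows:
# predict_component_dedup(rpfit, test_long, predictors = "cyl")

Whether the cost of computing the unique rows pays off for small inputs would need benchmarking, which is why a C++-side solution might still be preferable (and would avoid duplicating this logic in a future Python wrapper).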

We also shouldn't forget that this presumably holds for interactions as well, so for a dataset with many binary features, for example, optimizing prediction here could probably save a lot of time.
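
For instance (a sketch, using the binary mtcars columns vs and am as stand-ins for such features), the number of distinct rows an interaction component would actually need to be evaluated on stays tiny no matter how many observations we predict for:

# Sketch: an interaction over two binary features has at most 4 distinct rows,
# so deduplicated prediction cost would be constant in the number of observations.
test_int <- as.matrix(mtcars[rep(seq_len(nrow(mtcars)), 100), c("vs", "am")])
nrow(test_int)          # 3200 rows requested
nrow(unique(test_int))  # only 4 distinct (vs, am) combinations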
