
Optimize component-wise prediction for low-cardinality features #59

@jemus42


For a feature $X_1 \in \{1, 2\}$, we might find in the train/test data something like $x_1 = \{1, 1, 1, 1, \ldots, 1, 1, 2, 2, \ldots, 2\}$

Since $\hat{m}_1(x_1)$ is constant for each possible value, repeatedly calling the C++-level predict_matrix() (wrapped by .predict_single_component() in the R package) is inefficient.

To illustrate with the mtcars dataset and the cyl variable, which has 3 possible values (4, 6, 8):

library(randomPlantedForest)
set.seed(23)
rpfit <- rpf(mpg ~ ., data = mtcars, max_interaction = 3, ntrees = 30)

(test_short <- matrix(c(rep(4, 1), 6, 8), dimnames = list(NULL, "cyl")))
#>      cyl
#> [1,]   4
#> [2,]   6
#> [3,]   8
(test_long <- matrix(c(rep(4, 100), 6, 8), dimnames = list(NULL, "cyl")))
#>        cyl
#>   [1,]   4
[...]
#> [100,]   4
#> [101,]   6
#> [102,]   8

As expected, the predicted components are identical for identical input values:

randomPlantedForest:::.predict_single_component(
  rpfit,
  test_short,
  predictors = "cyl"
)
#>           cyl
#> [1,] 9.992948
#> [2,] 6.209208
#> [3,] 5.967651

randomPlantedForest:::.predict_single_component(
  rpfit,
  test_long,
  predictors = "cyl"
)
#>             cyl
#>   [1,] 9.992948
[...]
#> [100,] 9.992948
#> [101,] 6.209208
#> [102,] 5.967651

Now, repeating a single input value (4) a varying number of times (the repeats parameter below) shows that the prediction time scales roughly linearly with the number of rows:

bm <- bench::press(
  repeats = c(1, 100, 1e3, 1e4, 1e5, 1e6),
  {
    testmat <- matrix(c(rep(4, repeats), 6, 8), dimnames = list(NULL, "cyl"))

    bench::mark(
      predict_single = randomPlantedForest:::.predict_single_component(
        rpfit,
        testmat,
        predictors = "cyl"
      )
    )
  }
)
#> Running with:
#>   repeats
#> 1       1
#> 2     100
#> 3    1000
#> 4   10000
#> 5  100000
#> 6 1000000

bm
#> # A tibble: 6 × 7
#>   expression     repeats      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>       <dbl> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 predict_single       1 108.73µs  114.3µs 8307.            0B    8.12 
#> 2 predict_single     100   2.69ms    2.8ms  356.        2.13KB    2.01 
#> 3 predict_single    1000  26.72ms   27.4ms   36.6      19.71KB    0    
#> 4 predict_single   10000 269.34ms  269.5ms    3.71    195.49KB    0    
#> 5 predict_single  100000    2.71s     2.7s    0.370     1.91MB    0.739
#> 6 predict_single 1000000   27.41s    27.4s    0.0365   19.07MB    0.547

Created on 2025-03-22 with reprex v2.1.1

I'm not sure what the best way to approach this is, since any optimization will likely add a little overhead for small inputs while paying off for larger ones. I'm also not sure whether this should be tackled from the C++ side or the R side; if tackled from R, any future Python wrapper would need to come up with its own solution.
Is this where memoization comes in handy?
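
One R-side option, shown here only as a rough sketch (predict_component_dedup is a hypothetical helper, not anything implemented in the package, and it assumes .predict_single_component() keeps its current (object, newdata, predictors) signature), would be to predict only on the unique input rows and then expand the result back to the original order:

# Sketch only: predict once per unique input row, then map results back.
predict_component_dedup <- function(object, newdata, predictors) {
  unique_rows <- unique(newdata)

  # Key each row so original rows can be matched to their unique counterpart
  row_key <- function(m) apply(m, 1, paste, collapse = "\r")
  idx <- match(row_key(newdata), row_key(unique_rows))

  # Single call to the existing predictor on the (small) set of unique rows
  preds <- randomPlantedForest:::.predict_single_component(
    object, unique_rows, predictors = predictors
  )

  # Expand back to one prediction per original row
  preds[idx, , drop = FALSE]
}

# Should reproduce the 102-row result from above while only evaluating 3 rows:
# predict_component_dedup(rpfit, test_long, predictors = "cyl")

Whether the cost of computing the unique rows pays off for small inputs would need benchmarking, which is why a C++-side solution might still be preferable (and would avoid duplicating this logic in a future Python wrapper).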

We also shouldn't forget that this presumably holds for interactions as well, so for a dataset with many binary features, for example, optimizing prediction here could probably save a lot of time.
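
For instance (a sketch, using the binary mtcars columns vs and am as stand-ins for such features), the number of distinct rows an interaction component would actually need to be evaluated on stays tiny no matter how many observations we predict for:

# Sketch: an interaction over two binary features has at most 4 distinct rows,
# so deduplicated prediction cost would be constant in the number of observations.
test_int <- as.matrix(mtcars[rep(seq_len(nrow(mtcars)), 100), c("vs", "am")])
nrow(test_int)          # 3200 rows requested
nrow(unique(test_int))  # only 4 distinct (vs, am) combinations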
