-
Notifications
You must be signed in to change notification settings - Fork 14
Description
Overview:
The package works by writing data to disk in a format Cubist understands, and the the C Cubist Code reads it in. In this process we sometimes loss a bit of precision on the data. This isn't an issue if you use Cubist by itself. As both training and prediction data is passed through the same process, yielding proper predictions.
However what it means that for example if the split is listed as 6.2259998 it is actually 6.226, and when predicting it knows that.
However this can lead to problems in other packages that tries to recreate this process from the sufficient statistics and model specification.
It is likely that we can't fix this issue, as it would cause models fitted on the old version to not match. This could be dealt with by having model versioning etc etc but it might not be worth it.
What is happening. First take this fitted model:
library(tidypredict)
library(Cubist)
#> Loading required package: lattice
data("BostonHousing", package = "mlbench")
BostonHousing$chas <- as.numeric(BostonHousing$chas) - 1
set.seed(1)
inTrain <- sample(1:nrow(BostonHousing), floor(.8 * nrow(BostonHousing)))
train_pred <- BostonHousing[inTrain, -14]
test_pred <- BostonHousing[-inTrain, -14]
train_resp <- BostonHousing$medv[inTrain]
test_resp <- BostonHousing$medv[-inTrain]
com_model <- cubist(
x = train_pred,
y = train_resp,
committees = 2,
control = cubistControl(rules = 3)
)We can then extract the split values that are used in the rules.
library(tidyverse)
split_values <- as_tibble(com_model$splits) |>
distinct(variable, value)
data_values <- as_tibble(train_pred) |>
pivot_longer(everything(), names_to = "variable") |>
distinct()
options(pillar.sigfig = 8)
split_values
#> # A tibble: 4 × 2
#> variable value
#> <chr> <dbl>
#> 1 lstat 9.5299997
#> 2 rm 6.2259998
#> 3 rm 6.546
#> 4 lstat 5.3899999
We would imagine that these would match up with observed values in the training data set.
Let us see if that is correct.
matches <- inner_join(split_values, data_values)
#> Joining with `by = join_by(variable, value)`
matches
#> # A tibble: 1 × 2
#> variable value
#> <chr> <dbl>
#> 1 rm 6.546
non_matches <- anti_join(split_values, data_values)
#> Joining with `by = join_by(variable, value)`
non_matches
#> # A tibble: 3 × 2
#> variable value
#> <chr> <dbl>
#> 1 lstat 9.5299997
#> 2 rm 6.2259998
#> 3 lstat 5.3899999Hmm, it appears that only one of the of 4 split values are found in the original data.
What happens if we look for nearby values?
inner_join(
data_values,
split_values,
by = "variable",
relationship = "many-to-many",
suffix = c("_data", "_split")
) |>
filter(abs(value_data - value_split) < 0.0001)
#> # A tibble: 4 × 3
#> variable value_data value_split
#> <chr> <dbl> <dbl>
#> 1 rm 6.226 6.2259998
#> 2 rm 6.546 6.546
#> 3 lstat 5.39 5.3899999
#> 4 lstat 9.53 9.5299997
We see that some of the splits are happening VERY close to data values. This is happening in {Cubist} at this line
Line 437 in 0355b94
| Cv = strtod(Name, &EndVal); |
Printing Name and Cv with %.8 we get the following, and we see that the differences match what we are seeing so far.
Name: 0.0837
Cv: 0.08370000
Name: 45
Cv: 45.00000000
Name: 3.44
Cv: 3.44000006
Name: 0
Cv: 0.00000000
Name: 0.437
Cv: 0.43700001
Name: 7.185
Cv: 7.18499994
Name: 38.9
Cv: 38.90000153
Name: 4.5667
Cv: 4.56669998
Name: 5
Cv: 5.00000000
Name: 398
Cv: 398.00000000
Name: 15.2
Cv: 15.19999981
Name: 396.9
Cv: 396.89999390
Name: 5.39
Cv: 5.38999987
Why does this matter? Imagine a committee with the two rules
rule 1: rm > 6.2259998
rule 2: rm <= 6.2259998
If you passed in an observation where rm is 6.226 you would expect that rule 1 would kick in. But what actually happens is that rule 2 kicks in, because of the way {Cubist} reads in the data, turning 6.226 into 6.2259998.
By itself this isn't an issue in {Cubist} besides the confusion it causes people on these edge-cases. However it means that it is VERY hard to replicate in {tidypredict} because of the way these slight offsets are generated.
Notably this means that prediction with data that has values that lie exactly on the splits in the rules, may or may not work.