
Bug: writing and reading data from disk causes small discrepancies in splits #62

@EmilHvitfeldt

Description


Overview:

The package works by writing data to disk in a format Cubist understands; the C Cubist code then reads it back in. In this process we sometimes lose a little precision. This isn't an issue when you use Cubist by itself, as both training and prediction data pass through the same process, yielding consistent predictions.
It does mean, however, that a split listed as, for example, 6.2259998 is actually 6.226, and the prediction code accounts for that.

However, this can lead to problems in other packages that try to recreate this process from the sufficient statistics and the model specification.

It is likely that we can't fix this issue, as doing so would cause models fitted with older versions to no longer match. This could be dealt with by model versioning and the like, but it might not be worth it.

Here is what is happening. First, take this fitted model:

library(tidypredict)
library(Cubist)
#> Loading required package: lattice

data("BostonHousing", package = "mlbench")

BostonHousing$chas <- as.numeric(BostonHousing$chas) - 1

set.seed(1)
inTrain <- sample(1:nrow(BostonHousing), floor(.8 * nrow(BostonHousing)))

train_pred <- BostonHousing[inTrain, -14]
test_pred <- BostonHousing[-inTrain, -14]

train_resp <- BostonHousing$medv[inTrain]
test_resp <- BostonHousing$medv[-inTrain]


com_model <- cubist(
  x = train_pred,
  y = train_resp,
  committees = 2,
  control = cubistControl(rules = 3)
)

We can then extract the split values that are used in the rules.

library(tidyverse)
split_values <- as_tibble(com_model$splits) |>
  distinct(variable, value)

data_values <- as_tibble(train_pred) |>
  pivot_longer(everything(), names_to = "variable") |>
  distinct()

options(pillar.sigfig = 8)

split_values
#> # A tibble: 4 × 2
#>   variable     value
#>   <chr>        <dbl>
#> 1 lstat    9.5299997
#> 2 rm       6.2259998
#> 3 rm       6.546    
#> 4 lstat    5.3899999

We would imagine that these would match up with observed values in the training data set.
Let us see if that is correct.

matches <- inner_join(split_values, data_values)
#> Joining with `by = join_by(variable, value)`
matches
#> # A tibble: 1 × 2
#>   variable value
#>   <chr>    <dbl>
#> 1 rm       6.546

non_matches <- anti_join(split_values, data_values)
#> Joining with `by = join_by(variable, value)`
non_matches
#> # A tibble: 3 × 2
#>   variable     value
#>   <chr>        <dbl>
#> 1 lstat    9.5299997
#> 2 rm       6.2259998
#> 3 lstat    5.3899999

Hmm, it appears that only one of the 4 split values is found in the original data.
What happens if we look for nearby values?

inner_join(
  data_values,
  split_values,
  by = "variable",
  relationship = "many-to-many",
  suffix = c("_data", "_split")
) |>
  filter(abs(value_data - value_split) < 0.0001)
#> # A tibble: 4 × 3
#>   variable value_data value_split
#>   <chr>         <dbl>       <dbl>
#> 1 rm            6.226   6.2259998
#> 2 rm            6.546   6.546    
#> 3 lstat         5.39    5.3899999
#> 4 lstat         9.53    9.5299997

We see that some of the splits happen VERY close to data values. This happens in {Cubist} at this line:

Cv = strtod(Name, &EndVal);

Printing Name and Cv with %.8f gives the following, and the differences match what we have seen so far.

Name: 0.0837
Cv: 0.08370000
Name: 45
Cv: 45.00000000
Name: 3.44
Cv: 3.44000006
Name: 0
Cv: 0.00000000
Name: 0.437
Cv: 0.43700001
Name: 7.185
Cv: 7.18499994
Name: 38.9
Cv: 38.90000153
Name: 4.5667
Cv: 4.56669998
Name: 5
Cv: 5.00000000
Name: 398
Cv: 398.00000000
Name: 15.2
Cv: 15.19999981
Name: 396.9
Cv: 396.89999390
Name: 5.39
Cv: 5.38999987

Why does this matter? Imagine a committee with the two rules

rule 1: rm > 6.2259998

rule 2: rm <= 6.2259998

If you passed in an observation where rm is 6.226, you would expect rule 1 to kick in. But what actually happens is that rule 2 kicks in, because of the way {Cubist} reads in the data, turning 6.226 into 6.2259998.

By itself this isn't an issue in {Cubist}, beyond the confusion it causes in these edge cases. However, it makes the process VERY hard to replicate in {tidypredict}, because of the way these slight offsets are generated.

Notably, this means that predictions for data whose values lie exactly on the splits in the rules may or may not behave as expected.
