
Bug: writing and reading data from disk causes small discrepancies in splits #62

@EmilHvitfeldt

Description


Overview:

The package works by writing data to disk in a format Cubist understands; the C Cubist code then reads it back in. In this process we sometimes lose a little precision. This isn't an issue when you use Cubist by itself, as both training and prediction data pass through the same process, yielding consistent predictions.
It does mean, however, that a split listed as, for example, 6.2259998 is actually 6.226, and the prediction code accounts for that.

However, this can lead to problems in other packages that try to recreate this process from the sufficient statistics and the model specification.

It is likely that we can't fix this issue, as doing so would cause models fitted with older versions to no longer match. This could be dealt with by model versioning and the like, but it might not be worth it.

Here is what is happening. First, take this fitted model:

library(tidypredict)
library(Cubist)
#> Loading required package: lattice

data("BostonHousing", package = "mlbench")

BostonHousing$chas <- as.numeric(BostonHousing$chas) - 1

set.seed(1)
inTrain <- sample(1:nrow(BostonHousing), floor(.8 * nrow(BostonHousing)))

train_pred <- BostonHousing[inTrain, -14]
test_pred <- BostonHousing[-inTrain, -14]

train_resp <- BostonHousing$medv[inTrain]
test_resp <- BostonHousing$medv[-inTrain]


com_model <- cubist(
  x = train_pred,
  y = train_resp,
  committees = 2,
  control = cubistControl(rules = 3)
)

We can then extract the split values that are used in the rules.

library(tidyverse)
split_values <- as_tibble(com_model$splits) |>
  distinct(variable, value)

data_values <- as_tibble(train_pred) |>
  pivot_longer(everything(), names_to = "variable") |>
  distinct()

options(pillar.sigfig = 8)

split_values
#> # A tibble: 4 × 2
#>   variable     value
#>   <chr>        <dbl>
#> 1 lstat    9.5299997
#> 2 rm       6.2259998
#> 3 rm       6.546    
#> 4 lstat    5.3899999

We would imagine that these would match up with observed values in the training data set.
Let us see if that is correct.

matches <- inner_join(split_values, data_values)
#> Joining with `by = join_by(variable, value)`
matches
#> # A tibble: 1 × 2
#>   variable value
#>   <chr>    <dbl>
#> 1 rm       6.546

non_matches <- anti_join(split_values, data_values)
#> Joining with `by = join_by(variable, value)`
non_matches
#> # A tibble: 3 × 2
#>   variable     value
#>   <chr>        <dbl>
#> 1 lstat    9.5299997
#> 2 rm       6.2259998
#> 3 lstat    5.3899999

Hmm, it appears that only one of the 4 split values is found in the original data.
What happens if we look for nearby values?

inner_join(
  data_values,
  split_values,
  by = "variable",
  relationship = "many-to-many",
  suffix = c("_data", "_split")
) |>
  filter(abs(value_data - value_split) < 0.0001)
#> # A tibble: 4 × 3
#>   variable value_data value_split
#>   <chr>         <dbl>       <dbl>
#> 1 rm            6.226   6.2259998
#> 2 rm            6.546   6.546    
#> 3 lstat         5.39    5.3899999
#> 4 lstat         9.53    9.5299997

We see that some of the splits happen VERY close to data values. This happens in {Cubist} at this line:

Cv = strtod(Name, &EndVal);

Printing Name and Cv with %.8f gives the following, and the differences match what we have seen so far.

Name: 0.0837
Cv: 0.08370000
Name: 45
Cv: 45.00000000
Name: 3.44
Cv: 3.44000006
Name: 0
Cv: 0.00000000
Name: 0.437
Cv: 0.43700001
Name: 7.185
Cv: 7.18499994
Name: 38.9
Cv: 38.90000153
Name: 4.5667
Cv: 4.56669998
Name: 5
Cv: 5.00000000
Name: 398
Cv: 398.00000000
Name: 15.2
Cv: 15.19999981
Name: 396.9
Cv: 396.89999390
Name: 5.39
Cv: 5.38999987

Why does this matter? Imagine a committee with the two rules

rule 1: rm > 6.2259998

rule 2: rm <= 6.2259998

If you passed in an observation where rm is 6.226, you would expect rule 1 to kick in. But what actually happens is that rule 2 kicks in, because of the way {Cubist} reads in the data, turning 6.226 into 6.2259998.

By itself this isn't an issue in {Cubist}, beyond the confusion it causes in these edge cases. However, it makes the process VERY hard to replicate in {tidypredict}, because of the way these slight offsets are generated.

Notably, this means that predictions for data whose values lie exactly on the splits in the rules may or may not behave as expected.
