Adam Aker edited this page Feb 17, 2023 · 11 revisions

Basic Aker Relational Score (BARS)

BARS is a dimensionality reduction algorithm whose goal is to maintain interpretability, i.e., we directly eliminate from potential models the variables that don't seem to add any predictive power. Using a decision tree to approximate a function between two variables amounts to constructing a deterministic model, which can be compared to the probabilistic model of always predicting the median of the target. BARS is a modified version of the Predictive Power Score, inspired by Florian Wetschoreck's article [1].

Typically dimensionality reduction by ranking features is done by

  1. Creating a predictive model using all input/output variables
  2. For each input, remove it from the collection of inputs and build another predictive model
  3. Compare the accuracy of the two models
  4. The variable whose removal causes the largest change in accuracy is ranked as the most important predictor, the variable with the next largest change is ranked second, and so on.

However, this ranking depends on comparing models that are constructed using all available data. Furthermore, a traditional importance ranking requires constructing a separate model with each feature removed. This can take a while for a large dataset, especially when constructing different model types and tuning their hyperparameters in order to make a good judgement call about which variables to omit from potential models. A different approach is to

  1. Create a smart predictive model (a decision tree) between a single input and a single output.
  2. Create another (naive) predictive model between the same variables as in step one. Basically, this model will always predict the median of the output independent of the input.
  3. Compare the accuracy of the two models.
  4. For a given input, iterate steps 1-3 for all of the outputs.
  5. Now you can score how well an input does in terms of predicting all of the outputs.
  6. Iterate steps 1-5 for all the inputs.

At this point you'll have a relative metric for deciding which variables to include and which to omit. The idea here is to compare how well a deterministic model does when fitting it to the dataset versus a simple probabilistic model. This provides a quick way to judge which variables are important and which ones are not.
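The six steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `binned_median_predictions` is a hypothetical stand-in for the decision tree (it is not a real tree), accuracy is measured as mean absolute error, and the models are evaluated on the same data they were fitted to. The feature and target names are made up.

```python
from statistics import median


def mae(ys, preds):
    """Mean absolute error between observations and predictions."""
    return sum(abs(y - p) for y, p in zip(ys, preds)) / len(ys)


def binned_median_predictions(xs, ys, n_bins=4):
    """Hypothetical stand-in for the smart model (not a real decision tree).

    Sorts the observations by x, cuts them into roughly n_bins equal-count
    bins, and predicts the median of y inside each bin (in-sample).
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    size = max(1, len(xs) // n_bins)
    preds = [0.0] * len(xs)
    for start in range(0, len(order), size):
        idx = order[start:start + size]
        bin_median = median(ys[i] for i in idx)
        for i in idx:
            preds[i] = bin_median
    return preds


def ratio_table(features, targets, smart=binned_median_predictions):
    """Return {(feature, target): smart error / naive error} for every pair.

    Assumes no target is constant, so the naive error is never zero.
    """
    table = {}
    for f_name, xs in features.items():          # step 6: every input
        for t_name, ys in targets.items():       # step 4: every output
            smart_err = mae(ys, smart(xs, ys))                 # step 1
            naive_err = mae(ys, [median(ys)] * len(ys))        # step 2
            table[(f_name, t_name)] = smart_err / naive_err    # step 3
    return table                                 # step 5: scores per input


# Made-up data: "y" is a step function of "x_useful" and unrelated to "x_noise".
features = {
    "x_useful": [0, 1, 2, 3, 4, 5, 6, 7],
    "x_noise": [3, 6, 0, 5, 2, 7, 1, 4],
}
targets = {"y": [0, 0, 0, 0, 10, 10, 10, 10]}
table = ratio_table(features, targets)
```

On this toy data the ratio for `x_useful` is 0 (the stand-in model reproduces the step exactly), while the ratio for `x_noise` is 1 (the stand-in does no better than always guessing the median).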

The Problem

We'll start with a set of observations which can be further split into a set of features $F$ (things we want to use to predict) and targets $T$ (things we want to predict). The elements $x\in F$ and $y\in T$ are time series of some measurable quantity. The main goal is to minimize the set of features and targets we use to build models, based on how well each feature does at predicting all the targets. How can we choose a good minimal set of observables to build models with? If we can identify that there is potentially a function between $x$ and $y$, then we can say that $x$ has predictive power with respect to $y$. So, how can we identify whether a function potentially exists between $x$ and $y$?

Universal Function Approximators

Decision trees are universal function approximators, which basically means that we can split a two-dimensional subset of our data (one feature and one target) into different bins that are chosen by minimizing a cost function. In this case the boundaries of the bins are chosen so as to minimize the error the tree model makes when making predictions. Splitting the data into different bins constructs a function, but we need to understand how well this function does compared to a more naive model of prediction: taking the median of the target $y$ and always guessing that any $x$ will map to the median.
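To make the binning idea concrete, here is a minimal sketch of a one-dimensional regression tree in plain Python. It is an assumption-laden illustration, not the implementation behind BARS: the names `fit_tree` and `predict`, the `max_depth` and `min_leaf` parameters, and the choice of absolute error with median-valued leaves are all choices made for this example.

```python
from statistics import median


def _leaf_cost(ys):
    """Total absolute error when a bin predicts its own median."""
    m = median(ys)
    return sum(abs(y - m) for y in ys)


def _grow(pairs, depth, min_leaf):
    """Recursively split (x, y) pairs, already sorted by x, into bins."""
    ys = [y for _, y in pairs]
    node = {"value": median(ys)}
    if depth == 0 or len(pairs) < 2 * min_leaf:
        return node
    best = None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot place a boundary between equal x values
        cost = _leaf_cost(ys[:i]) + _leaf_cost(ys[i:])
        if best is None or cost < best[0]:
            best = (cost, i)
    if best is None or best[0] >= _leaf_cost(ys):
        return node  # no boundary reduces the error, so keep one bin
    i = best[1]
    node["threshold"] = (pairs[i - 1][0] + pairs[i][0]) / 2
    node["left"] = _grow(pairs[:i], depth - 1, min_leaf)
    node["right"] = _grow(pairs[i:], depth - 1, min_leaf)
    return node


def fit_tree(xs, ys, max_depth=3, min_leaf=2):
    """Fit the tree; bin boundaries minimize the summed absolute error."""
    return _grow(sorted(zip(xs, ys)), max_depth, min_leaf)


def predict(node, x):
    """Walk down to the bin containing x and return that bin's median."""
    while "threshold" in node:
        node = node["left"] if x <= node["threshold"] else node["right"]
    return node["value"]


# Made-up step-shaped data: y jumps from 1 to 5 between x = 4 and x = 5.
xs = list(range(10))
ys = [1, 1, 1, 1, 1, 5, 5, 5, 5, 5]
tree = fit_tree(xs, ys)
naive_guess = median(ys)  # the naive model ignores x entirely
```

Here the tree places a single boundary at `x = 4.5` and predicts 1 on the left and 5 on the right, whereas the naive model guesses the overall median for every `x`.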

Comparing Model Performance

If we have two different models $g_1$ and $g_2$ mapping feature $x$ to target $y$, then we need a way to choose which model does a better job at predicting $y$ from $x$. One way to do this is to look at the mean absolute error of each model $g$ over the $N$ observations, which is defined as

$$\text{MAE}=\frac{1}{N}\sum\limits_{i=1}^{N}|y_i-g(x_i)|$$

We can compare how well the "smart model" does relative to the "naive model" by looking at the ratio

$$r=\frac{\text{MAE}_{\text{smart}}}{\text{MAE}_{\text{naive}}}$$

As the smart model does better, this ratio becomes smaller; as the smart model starts doing as well as or worse than the naive model, the ratio approaches or exceeds $1$. Up to this point, this is pretty much just the Predictive Power Score. If our smart model is doing better than the naive model, then we have at least established that constructing a function between $x$ and $y$ is useful, which means that we should include $x$ in whatever models we wish to build.
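As a small worked example of the ratio, the sketch below computes the mean absolute error of a set of smart-model predictions and of the median-guessing naive model, then divides the two. The numbers and the function names `mean_absolute_error` and `mae_ratio` are made up for illustration.

```python
from statistics import median


def mean_absolute_error(ys, preds):
    """Average of |y_i - g(x_i)| over all observations."""
    return sum(abs(y - p) for y, p in zip(ys, preds)) / len(ys)


def mae_ratio(ys, smart_preds):
    """Ratio r of the smart model's error to the naive model's error."""
    naive_preds = [median(ys)] * len(ys)  # always guess the median
    mae_naive = mean_absolute_error(ys, naive_preds)
    if mae_naive == 0:
        raise ValueError("constant target: the naive model is already perfect")
    return mean_absolute_error(ys, smart_preds) / mae_naive


# Made-up target values and smart-model predictions.
ys = [1.0, 2.0, 3.0, 4.0, 5.0]
smart_preds = [1.5, 1.5, 3.0, 4.5, 4.5]
r = mae_ratio(ys, smart_preds)
```

In this example the naive model has a mean absolute error of 1.2 and the smart model 0.4, so `r` is about 1/3: the smart model makes roughly a third of the naive model's error.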

Making the BARS

There are a number of properties that would be nice to have when judging how well one variable does at predicting another.

The first thing we can do is use a Gaussian to map $r\ge 0$ into the interval $(0,1]$, so that we now have

$$e^{-r^2}$$

It'd be nice if a variable's score for predicting itself were $1$. We can make this happen by subtracting

$$r_0=\frac{\text{MAE}_{\text{self}}}{\text{MAE}_{\text{naive}}}$$

from $r$, where $\text{MAE}_{\text{self}}$ is the error of the smart model when the variable is used to predict itself. This works because when we compare a variable to itself, $r=r_0$, which means that

$$\text{exp}({-(r-r_0)^2})$$

will become $1$. Finally, it'd be nice to be able to make it harder or easier for a feature-target pair to have a high BARS. This is useful for seeing which variables keep a high score as the criterion is tightened, which would make you feel more confident that there is a function between the feature-target pair. So we'll define the acceptance $\alpha$ as the hyperparameter which dictates how easy or difficult it is to get a high BARS: a larger $\alpha$ makes a high score easier to reach and a smaller $\alpha$ makes it harder. Our final BARS is

$$\text{BARS}(r,r_0;\alpha)=\text{exp}\Big({-\frac{(r-r_0)^2}{\alpha^2}}\Big)$$
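The final formula translates directly into a small function. The sketch below is illustrative only; the function name `bars`, its default arguments, and the sample values of $r$, $r_0$, and $\alpha$ are assumptions made for this example.

```python
import math


def bars(r, r0=0.0, alpha=1.0):
    """BARS(r, r0; alpha) = exp(-(r - r0)^2 / alpha^2)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return math.exp(-((r - r0) ** 2) / alpha ** 2)


# A variable compared with itself has r == r0, so its score is exactly 1.
self_score = bars(0.2, r0=0.2)

# The same gap between r and r0 scored under three acceptance values.
strict = bars(1.0, r0=0.0, alpha=0.5)    # exp(-4), hard to score highly
default = bars(1.0, r0=0.0, alpha=1.0)   # exp(-1)
lenient = bars(1.0, r0=0.0, alpha=2.0)   # exp(-0.25), easy to score highly
```

For a fixed gap $r-r_0$, the score increases with $\alpha$, which is how the acceptance makes a high BARS easier or harder to reach.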

Why does BARS solve the problem?

Probabilistic models are the default when it isn't clear how to proceed with model construction, but in quantitative science deterministic models are preferred when building explanations of how the world works. Since the naive model simply predicts the median of a target, it is inherently a probabilistic model; furthermore, it's the easiest one to construct, since we don't need to construct a pdf or even a random variable in order to find the median.

Now the smart model uses a universal approximator to build a function between a feature and a target. Basically, we're building a deterministic model between the feature and the target.

BARS compares how well the deterministic model does relative to a simple probabilistic model; thus, BARS is a measure of how viable it is to construct a deterministic model using a particular feature-target pair. So, the feature which has the greatest predictability is the one that has the highest BARS scores across the greatest number of targets.
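One possible reading of "highest BARS scores across the greatest number of targets" is to count, for each feature, how many targets it scores above some cutoff, and to break ties by the mean score. The sketch below follows that reading; the cutoff `threshold`, the tie-breaking rule, the function name `rank_features`, and the feature, target, and score values are all assumptions made for illustration and are not part of the BARS definition.

```python
def rank_features(scores, threshold=0.5):
    """Rank features by how many targets they score at or above threshold.

    scores maps (feature, target) pairs to BARS values. Ties on the count
    are broken by the higher mean score. Returns (feature, hits, mean) rows.
    """
    per_feature = {}
    for (feature, _target), score in scores.items():
        per_feature.setdefault(feature, []).append(score)
    summary = []
    for feature, values in per_feature.items():
        hits = sum(1 for v in values if v >= threshold)
        summary.append((feature, hits, sum(values) / len(values)))
    return sorted(summary, key=lambda row: (-row[1], -row[2]))


# Made-up BARS values for three features and two targets.
scores = {
    ("pressure", "flow"): 0.9,
    ("pressure", "temp"): 0.7,
    ("humidity", "flow"): 0.6,
    ("humidity", "temp"): 0.2,
    ("noise", "flow"): 0.1,
    ("noise", "temp"): 0.3,
}
ranking = rank_features(scores)
```

With these made-up scores, `pressure` clears the cutoff for both targets, `humidity` for one, and `noise` for none, so they are ranked in that order.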

Power Basic Aker Relational Score (PBARS)

References

[1] Wetschoreck, Florian. (Apr 23, 2020). RIP correlation. Introducing the Predictive Power Score. https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598

[2] Mathonline. The Simple Function Approximation Theorem. http://mathonline.wikidot.com/the-simple-function-approximation-theorem

[3] kenndanielso Blog. Universal Function Approximation. https://kenndanielso.github.io/mlrefined/blog_posts/12_Nonlinear_intro/12_5_Universal_approximation.html

[4] Wikiwand. Median. https://www.wikiwand.com/en/Median
