Home
BARS is a dimensionality reduction algorithm whose goal is to maintain interpretability, i.e., we eliminate variables directly from potential models that don't seem to add any predictive power. Using a decision tree to approximate a function between two variables amounts to constructing a deterministic model, which can be compared against the probabilistic model of always predicting the median of a target. This is a modified version of the Predictive Power Score, inspired by Florian Wetschoreck's article [1].
Typically, dimensionality reduction by ranking features is done by:
- Creating a predictive model using all input/output variables
- For each input, remove it from the collection of inputs and build another predictive model
- Compare the accuracy of the two models
- The variable with the largest change in accuracy is ranked as the most important predictor, the variable with the next largest change is ranked second, and so on (a sketch of this drop-column procedure is shown below).
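For reference, here is a minimal sketch of that drop-column ranking. The model choice, the metric, and the helper name `drop_column_importance` are illustrative assumptions for a regression problem with scikit-learn, not anything prescribed by BARS.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def drop_column_importance(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Rank features by how much accuracy drops when each one is removed."""
    def score(features):
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        # Negative MAE: higher (closer to zero) is better.
        return cross_val_score(model, X[features], y,
                               scoring="neg_mean_absolute_error", cv=5).mean()

    baseline = score(list(X.columns))
    # A larger drop in accuracy when a column is removed => a more important feature.
    drops = {col: baseline - score([c for c in X.columns if c != col])
             for col in X.columns}
    return pd.Series(drops).sort_values(ascending=False)
```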
However, this ranking depends on comparing models which are constructed using all available data. Furthermore, you need to construct a separate model with each feature removed in order to do a traditional importance ranking. This might take a while for a large dataset, especially if you are constructing different model types and adjusting their hyper-parameters in order to make a good judgement call about which variables to omit from potential models. A different approach is to:
- Create a smart predictive model (a decision tree) between a single input and a single output.
- Create another (naive) predictive model between the same variables as in step one. Basically, this model will always predict the median of the output independent of the input.
- Compare the accuracy of the two models.
- For a given input, repeat steps 1-3 for each of the outputs.
- Now you can score how well an input does in terms of predicting all of the outputs.
- Iterate steps 1-5 for all the inputs.
At this point you'll have a relative metric for deciding which variables to include and which to omit. The idea is to compare how well a deterministic model does when fit to the dataset versus how well a simple probabilistic model does. This provides a quick way to judge which variables are important and which ones are not.
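A rough sketch of steps 1-3 for a single feature-target pair is below. Using scikit-learn's `DecisionTreeRegressor` as the "smart" model and mean absolute error as the accuracy measure are assumptions for illustration; the rescaling discussed later is omitted here.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_predict

def error_ratio(x: pd.Series, y: pd.Series) -> float:
    """Compare a single-feature decision tree against always predicting the median."""
    # Step 1: "smart" model between one input and one output.
    tree = DecisionTreeRegressor(max_depth=4, random_state=0)
    smart_pred = cross_val_predict(tree, x.to_frame(), y, cv=5)
    smart_err = np.mean(np.abs(y - smart_pred))
    # Step 2: "naive" model that always predicts the median of the output.
    naive_err = np.mean(np.abs(y - y.median()))
    # Step 3: compare the two models (a smaller ratio means the smart model wins).
    return float(smart_err / naive_err)
```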
We'll start with a set of observations which can be split into a set of features and a set of targets.
Decision trees are universal function approximators, which basically means we can split a two-dimensional subset of our data into different bins chosen by minimizing a cost function. In this case the boundaries of the bins are chosen so as to minimize the error the tree model makes when making predictions. Splitting the data into different bins constructs a function, but we need to understand how well this function does compared to a more naive model of prediction: taking the median of the target.
If we have two different models, a "smart" model and a "naive" model, we can compare how well the smart model does relative to the naive model by looking at the ratio of their errors. As the smart model does better, this ratio becomes smaller; as the smart model does only as well as or worse than the naive model, the ratio grows. Up to this point, this is pretty much just the Predictive Power Score. If our smart model is doing better than the naive model, then we have at least established that constructing a function between the feature and the target is worthwhile.
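Written out explicitly (using mean absolute error as the accuracy measure, which is an assumption here rather than something fixed above), the comparison is

$$
r = \frac{\mathrm{MAE}_{\text{smart}}}{\mathrm{MAE}_{\text{naive}}}
$$

so $r \to 0$ as the smart model improves, and $r \geq 1$ when it does no better than always predicting the median.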
There are a number of features that would be nice to have when judging how well one variable predicts another.
The first thing we can do is use a gaussian to map the error ratio to a bounded score. It'd be nice if, when comparing how well a variable predicts itself, its score came out at the maximum possible value; with a gaussian mapping, the unbounded range of the ratio will become a bounded score.
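One way such a gaussian rescaling could look is sketched below. The exact functional form BARS uses is not spelled out here, so the choice of `exp(-r**2)` is purely an assumption for illustration.

```python
import numpy as np

def gaussian_rescale(ratio: float) -> float:
    """Map the smart/naive error ratio to a bounded score.

    Assumed form for illustration: a ratio of 0 (a perfect smart model,
    e.g. a variable predicting itself) maps to 1, and the score decays
    toward 0 as the smart model gets worse relative to the naive one.
    """
    return float(np.exp(-ratio ** 2))
```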
Probabilistic models are the default when it isn't clear how to proceed with model construction, but in quantitative science deterministic models are preferred when building explanations of how the world works. Since the naive model simply predicts the median of a target, it is inherently a probabilistic model; furthermore, it's the easiest one to construct since we don't need to construct a pdf or even a random variable in order to find the median.
Now the smart model uses a universal approximator to build a function between a feature and a target. Basically, we're building a deterministic model between the feature and the target.
BARS compares how well the deterministic model does against the simple probabilistic model; thus, BARS is a measure of how viable it is to construct a deterministic model for a particular feature-target pair. So, the feature with the greatest predictability is the one that has the highest BARS scores across the greatest number of targets.
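As a final illustration, here is a small sketch of how per-pair scores might be aggregated into a feature ranking. How BARS actually aggregates across targets isn't specified above, so taking the mean score per feature is an assumption.

```python
import pandas as pd

def rank_features(scores: pd.DataFrame) -> pd.Series:
    """Rank features by their average score across all targets.

    `scores` is a (feature x target) table of BARS-style scores where
    higher means more predictive; averaging each row is one plausible
    way to summarize doing well across the greatest number of targets.
    """
    return scores.mean(axis=1).sort_values(ascending=False)
```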
[1] Wetschoreck, Florian. (Apr 23, 2020). "RIP correlation. Introducing the Predictive Power Score." https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598
[2] Mathonline. "The Simple Function Approximation Theorem." http://mathonline.wikidot.com/the-simple-function-approximation-theorem
[3] kenndanielso. "Universal Function Approximation." https://kenndanielso.github.io/mlrefined/blog_posts/12_Nonlinear_intro/12_5_Universal_approximation.html
[4] "Median." Wikiwand. https://www.wikiwand.com/en/Median