Clément Bénard (Thales cortAIx-Labs)
treehfd is a Python module to compute the Hoeffding functional decomposition
of XGBoost models (Chen and Guestrin, 2016) with dependent input variables, using
the TreeHFD algorithm.
This decomposition breaks down XGBoost models into a sum of functions of one or
two variables (respectively main effects and interactions), which are intrinsically
explainable, while preserving the accuracy of the initial black-box model.
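In symbols, the decomposition described above can be sketched as follows. The notation is an assumption made here for illustration: p input variables, an intercept f_0, main effects f_j, and order-2 interactions f_jk. The approximation sign reflects that only terms of one or two variables are kept.

```latex
f(\mathbf{x}) \approx f_0 + \sum_{j=1}^{p} f_j(x_j) + \sum_{1 \le j < k \le p} f_{jk}(x_j, x_k)
```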
The TreeHFD algorithm is introduced in the following article: Bénard, C. (2025). Tree Ensemble Explainability through the Hoeffding Functional Decomposition and TreeHFD Algorithm. In Advances in Neural Information Processing Systems 38 (NeurIPS 2025), in press.
The documentation is available at Read the Docs.
treehfd currently supports XGBoost models for regression and binary classification with numeric input variables.
Categorical variables should be one-hot encoded beforehand.
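As a minimal sketch of the one-hot encoding step, the snippet below turns a categorical column into indicator columns with numpy only. The column names and values are hypothetical, and any standard encoder (for instance from scikit-learn or pandas) could be used instead.

```python
import numpy as np


def one_hot(column):
    """One-hot encode a 1-D array of category labels."""
    # np.unique returns the sorted categories and, for each entry,
    # the index of its category.
    categories, codes = np.unique(column, return_inverse=True)
    return categories, np.eye(len(categories))[codes]


# Hypothetical data: one numeric and one categorical variable.
income = np.array([3.2, 8.1, 2.5, 5.0])
ocean = np.array(["inland", "near_bay", "inland", "near_ocean"])

categories, ocean_encoded = one_hot(ocean)

# Fully numeric input matrix: one column for income,
# one indicator column per category.
X = np.column_stack([income, ocean_encoded])
```

The resulting matrix contains only numeric columns, which is the input format the text above asks for.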
treehfd mainly relies on xgboost, numpy, scipy, and scikit-learn.
To install treehfd, run: pip install treehfd
To install from source, first clone this repository, create
and activate a Python virtual environment using Python version >= 3.11
(conda or uv can also be used), and run pip at the root directory of the repository:
python -m venv .treehfd-env
source .treehfd-env/bin/activate
pip install .

The output of treehfd is illustrated with a standard dataset, the California Housing dataset,
where housing prices are predicted from various features, such as the house locations, and
characteristics of the house blocks. The code is provided below and in examples.
The following figure shows the main effects for the Longitude and Latitude
variables of a black-box xgboost model, fitted with default parameters.
The treehfd decomposition is displayed in blue, and compared to the decomposition
induced by TreeSHAP with interactions in red, which is directly implemented in the xgboost package.
In particular, treehfd clearly identifies the peak of housing prices in the Longitude main effect,
corresponding to the San Francisco Bay Area (-122°25'), whereas TreeSHAP does not really detect it,
because main effects are entangled with interactions, while they are orthogonal in the Hoeffding decomposition.
treehfd also highlights that house prices are lower in northern and eastern California.
# Load packages.
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from treehfd import XGBTreeHFD
if __name__ == "__main__":

    # Fetch California Housing data.
    california_housing = fetch_california_housing()
    X = california_housing.data
    y = california_housing.target

    # Fit XGBoost model.
    xgb_model = xgb.XGBRegressor(eta=0.3, n_estimators=100, max_depth=5)
    xgb_model = xgb_model.fit(X, y)

    # Fit TreeHFD.
    treehfd_model = XGBTreeHFD(xgb_model)
    treehfd_model.fit(X)

    # Compute TreeHFD predictions.
    y_main, y_order2 = treehfd_model.predict(X)

The Hoeffding decomposition was introduced by Hoeffding (1948) for independent input variables. More recently, Stone (1994) and Hooker (2007) extended the decomposition to the case of dependent input variables, where uniqueness of the decomposition is enforced by hierarchical orthogonality constraints. Estimating the decomposition from a data sample in the dependent case is a notoriously difficult problem, and the Hoeffding decomposition has therefore long remained an abstract theoretical tool. TreeHFD computes the Hoeffding decomposition of tree ensembles, based on a discretization of the hierarchical orthogonality constraints over the tree partitions. Such a decomposition is proved to be piecewise constant over these partitions, and the values in each cell for all components are given by solving a least squares problem for each tree.
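To make the hierarchical orthogonality constraints concrete, here is a toy sketch with two dependent discrete inputs on a small grid. It is not the TreeHFD algorithm: it only illustrates, under assumed notation (`W` for the joint cell probabilities, `F` for the function values in each cell), how a weighted least squares fit of an additive model yields components that satisfy such constraints.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 4, 3

# Joint probabilities of two dependent discrete inputs (all cells positive).
W = rng.uniform(0.5, 1.5, size=(n1, n2))
W /= W.sum()
# Values of the function to decompose, one per cell.
F = rng.normal(size=(n1, n2))

# Design matrix of the additive model g1(x1) + g2(x2):
# one indicator column per level of x1, then one per level of x2.
rows, cols = np.indices((n1, n2))
A = np.zeros((n1 * n2, n1 + n2))
A[np.arange(n1 * n2), rows.ravel()] = 1.0
A[np.arange(n1 * n2), n1 + cols.ravel()] = 1.0

# Weighted least squares: rescale rows by the square root of the weights.
sw = np.sqrt(W.ravel())
coef, *_ = np.linalg.lstsq(A * sw[:, None], F.ravel() * sw, rcond=None)
g1, g2 = coef[:n1], coef[n1:]

# Center the main effects under the marginal distributions.
w1, w2 = W.sum(axis=1), W.sum(axis=0)
m1, m2 = g1 @ w1, g2 @ w2
f0 = m1 + m2   # intercept
f1 = g1 - m1   # main effect of x1, zero mean
f2 = g2 - m2   # main effect of x2, zero mean
# The interaction is the residual of the additive fit.
f12 = F - f0 - f1[:, None] - f2[None, :]
```

By construction, `(W * f12).sum(axis=1)` and `(W * f12).sum(axis=0)` vanish up to numerical error. This means the interaction is orthogonal to every function of a single variable, while the two main effects are centered but not required to be orthogonal to each other.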
The documentation is available at Read the Docs,
and is generated with the sphinx package. To build the documentation from source,
first install sphinx and the relevant extensions.
Then, go to the docs folder and build the documentation.
pip install sphinx sphinx-rtd-theme
cd docs
sphinx-build -M html ./source ./build

Finally, open the html file build/html/index.html with a web browser to display the
documentation.
To run the tests, install and execute pytest with:
pip install pytest
pytest

Contributions are of course very welcome!
If you are interested in contributing to treehfd, start by reading the Contributing guide.
This package is distributed under the Apache 2.0 license. All dependencies have their own license.
In particular, xgboost relies on NVIDIA proprietary modules for the optional use of GPU.
Please use the following citation to refer to treehfd:
@inproceedings{
benard2025tree,
title={Tree Ensemble Explainability through the Hoeffding Functional Decomposition and Tree{HFD} Algorithm},
author={Clement Benard},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
url={https://openreview.net/forum?id=dRLWcpBQxS}
}

Hoeffding, W. (1948). A Class of Statistics with Asymptotically Normal Distribution. The Annals of Mathematical Statistics, 19:293–325.
Stone, C.J. (1994). The use of polynomial splines and their tensor products in multivariate function estimation. The Annals of Statistics, 22:118–171.
Hooker, G. (2007). Generalized functional anova diagnostics for high-dimensional functions of dependent variables. Journal of Computational and Graphical Statistics, 16:709–732.
Chastaing, G., Gamboa, F., and Prieur, C. (2012). Generalized Hoeffding-Sobol decomposition for dependent variables - application to sensitivity analysis. Electronic Journal of Statistics, 6:2420–2448.
Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, New York. ACM.
Lengerich, B., Tan, S., Chang, C.-H., Hooker, G., and Caruana, R. (2020). Purifying interaction effects with the functional anova: An efficient algorithm for recovering identifiable additive models. In International Conference on Artificial Intelligence and Statistics, pages 2402–2412. PMLR.
Bénard, C. (2025). Tree Ensemble Explainability through the Hoeffding Functional Decomposition and TreeHFD Algorithm. In Advances in Neural Information Processing Systems 38 (NeurIPS 2025), in press.
