Stephen Wade
🚧 Under construction 🚧
literanger is an adaptation of the ranger R package for training and predicting from random forest models within multiple imputation algorithms. ranger is a fast implementation of random forests (Breiman, 2001), or recursive partitioning, particularly suited for high-dimensional data (Wright and Ziegler, 2017).
literanger enables random forests to be embedded in the fully conditional specification framework for multiple imputation known as 'Multiple Imputation via Chained Equations' (Van Buuren, 2007).
Implementations of multiple imputation with random forests include:
- mice, which uses random forests to predict in a similar fashion to Doove et al. (2014), i.e. for each observation, a draw is taken from the sample of all values that belong to the terminal node of a randomly drawn tree;
- miceRanger and missRanger, which use predictive mean matching.
This package enables a minor variation on mice's use of random forests: the prediction can be drawn from the in-bag samples in the terminal node for each missing data point. Thus, the computational effort during prediction scales with the number of missing values, rather than with the product of the size of the whole dataset and the number of trees (as in mice).
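To sketch how this fits into a chained-equations imputation (purely illustrative: the missing-data pattern, starting values, and loop below are not part of literanger, which only provides the train and predict steps):

require(literanger)
set.seed(1)
# illustrative data: iris with some values set to missing
dat <- iris
dat$Sepal.Length[sample(nrow(dat), 15)] <- NA
dat$Species[sample(nrow(dat), 10)] <- NA
miss <- lapply(dat, function(x) which(is.na(x)))
imputed <- dat
# crude starting values: mean for numeric columns, the modal level for factors
for (v in names(dat)[lengths(miss) > 0]) {
    imputed[[v]][miss[[v]]] <- if (is.numeric(dat[[v]]))
        mean(dat[[v]], na.rm=TRUE) else names(which.max(table(dat[[v]])))
}
# one chained-equations cycle: for each incomplete variable, train on the rows
# where it was observed, then draw in-bag imputations for the missing rows only
for (v in names(dat)[lengths(miss) > 0]) {
    rows <- miss[[v]]
    rf_v <- train(data=imputed[-rows, ], response_name=v)
    imputed[[v]][rows] <- predict(rf_v, newdata=imputed[rows, ],
                                  prediction_type="inbag")$values
}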
A more general advantage of this package is the re-use of the trained forest object and the separation of the (training) data from the forest; see ranger issue #304.
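For instance (a minimal sketch; the batch data frames are made up for illustration), a single trained forest can be applied repeatedly to new data sets:

require(literanger)
rf <- train(data=iris, response_name="Species")
# the (training) data are kept separate from the forest, so the same trained
# object can be re-used to predict on any number of new data sets
batch_1 <- iris[sample(nrow(iris), 10), ]
batch_2 <- iris[sample(nrow(iris), 10), ]
pred_1 <- predict(rf, newdata=batch_1, prediction_type="bagged")
pred_2 <- predict(rf, newdata=batch_2, prediction_type="inbag")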
A multiple imputation algorithm that uses this package, called mimputest, is under development.
require(literanger)
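# split iris into a training set (two thirds) and a test set (one third)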
train_idx <- sample(nrow(iris), 2/3 * nrow(iris))
iris_train <- iris[ train_idx, ]
iris_test <- iris[-train_idx, ]
# train a random forest with Species as the response
rf_iris <- train(data=iris_train, response_name="Species")
# 'bagged' gives the usual aggregate forest prediction; 'inbag' draws a value
# from the in-bag samples in the terminal node of a randomly drawn tree
pred_iris_bagged <- predict(rf_iris, newdata=iris_test,
                            prediction_type="bagged")
pred_iris_inbag <- predict(rf_iris, newdata=iris_test,
                           prediction_type="inbag")
# compare bagged vs actual test values
table(iris_test$Species, pred_iris_bagged$values)
# compare bagged prediction vs in-bag draw
table(pred_iris_bagged$values, pred_iris_inbag$values)
Installation is easy using devtools:
library(devtools)
install_github('stephematician/literanger')
The cpp11 package is also required, and is available on CRAN:
install.packages('cpp11')
Not exhaustive:
- prediction type: terminal nodes for every tree (e.g. for the mice algorithm);
- finish documentation, e.g. this README;
- prepare CRAN submission;
- implement variable importance measures;
- probability and survival forests.
Breiman, L., 2001. Random forests. Machine Learning, 45, pp. 5-32. doi:10.1023/A:1010933404324.
Doove, L.L., Van Buuren, S. and Dusseldorp, E., 2014. Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis, 72, pp. 92-104. doi:10.1016/j.csda.2013.10.025.
Van Buuren, S., 2007. Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16(3), pp. 219-242. doi:10.1177/0962280206074463.
Wright, M. N. and Ziegler, A., 2017. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(i01), pp. 1-17. doi:10.18637/jss.v077.i01.