High Throughput Light Weight Regularized Regression Modeling for Molecular Data
The package enables for fitting and basic evaluation of regularized regression models via the interface of the glmnet package 1. The pre-processing, modeling, evalaution, and prediction tools were fine-tuned specifically for use with genomic, transcriptomic, proteomic and pharmaco-molecular data.
The package requires one non-CRAN package (microViz, https://github.com/PiotrTymoszuk/microViz), please make sure to install it first:
devtools::install_github('PiotrTymoszuk/microViz') ## dependency microViz
devtools::install_github('PiotrTymoszuk/htGLMNET')
The package is available under a GPL-3 license.
The package maintainer is Piotr Tymoszuk.
Many thanks to the authors of the packages glmnet, sva, tidyverse, rlang, furrr, ggrepel, and caret.
Special thanks to Paul Geeleher, Danielle Maeser, Robert Gruener and Stephanie Huang, the authors of the oncoPredict and pRRophetic packages, for the idea of using regularized linear regression for prediction of anti-cancer drug response 234.
Regularized linear modeling with RIDGE, LASSO and Elastic Net algorithms is a canonical machine learning approach widely used in molecular in silico studies which tackle inherently multi-dimensional data, e.g. expression levels of thousands of genes for few-to-few-hundred samples. Its usefulness comes from the general property of regularization which reduces over-fitting and, in case of so called L1-regularized LASSO and Elastic Net models, enables for selection of modeling features, e.g. association of the requested outcome with expression levels of few genes 156.
Our toolset is optimized for easy handling of high-dimensinal molecular biology numeric data sets in a classical machine learning workflow. The general idea of the package is to bundle three steps of regularized linear modeling in a comprehensive pipeline with basic quality control:
-
pre-processing accomplished by
pre_process()
involves selection of common explanatory variables shared by the training and test data set, and subsequent removal of unwanted batch effects with the ComBat procedure 7 -
model selection, training and evaluation accomplished by
train()
and involving selection of the optimal shrinkage parameter λ by cross-validation, training a regularizedglmnet
/cv.glmnet
class model, and computation of basic performance metrics such as R^2, correlation coefficients, classification accuracy or Cohen's κ -
prediction accomplished by
predict()
allows for calculation of response value, classification probabilities, linear prediction scores, or class assignment for the training data set
Of note, the pre-processing step is not mandatory and the user can provide modeling data sets without pre-processing or following pre-processing done with other tools.
The package is designed to work with four families of generalized linear models: Gaussian and Poisson regression models, and binomial and multinomial classification models.
Under the hood, is makes use of glmnet()
, cv.glmnet()
, predict.glmnet()
, and coef.glmnet()
from the glmnet package 1.
The model selection/training/evaluation step can be easily accelerated by parallelization via the interface of furrr and future.
In the following example, we will predict drug response in individual tumor samples of pancreatic carcinoma given a training data set of drug sensitivity in cancer cell lines by Gaussian-family Elastic Net regression 6. R code used in this example is available here. Data sets with cell line gene expression and anti-cancer drug sensitivity originate from the GDSC2 drug screening project 8 and were downloaded from a repository by Maeser and collearues 4. Bulk-tumor expression data for pancreatic cancer samples originate from the CPTAC project 9.
In the following example, we will use, apart from the htGLMNET package, also tidyverse (pipe and tabular data handling, plots), rstatix (normality check), future (parallelization), and pathchwork (plot panels).
The training and test data sets are provided as numeric matrices gdsc_expression
, gdsc_ic50
, and paad_cptac
within the package.
library(tidyverse)
library(htGLMNET)
library(rstatix)
library(future)
library(patchwork)
Note, the package tools take training and test explanatory variable data in a form that is common in molecular biology, i.e. with features such as genes or proteins in rows and samples/observations in columns. By contrast, rows in the response matrix are always observations and columns represent the variables. In our case, the explanatory variable matrices contain log-transformed gene expression levels with official gene symbols in row names and sample identifiers in column names:
## training data: GDSC experiment:
## whole-genome expression in untreated cancer cell lines,
## log transformed to improve normality
training_x <- gdsc_expression
training_x[1:5, 1:5]
> training_x[1:5, 1:5]
COSMIC_908467 COSMIC_906798 COSMIC_753584 COSMIC_687588 COSMIC_749712
TSPAN6 7.868408 8.366444 7.105637 6.602253 7.435151
TNMD 2.635424 2.524400 2.798847 2.692571 2.832570
DPM1 9.296730 10.819526 10.486486 9.829172 10.344827
SCYL3 3.582473 3.952511 3.696835 3.698360 3.877500
C1orf112 3.997161 3.068264 3.726833 3.472732 3.555658
## test data: whole-tumor (bulk) gene expression in pancreatic cancers
## already log-transformed, to improve normality
test_x <- paad_cptac
> test_x[1:5, 1:5]
C3L-02899 C3N-02585 C3N-00303 C3N-03428 C3L-00599
A1BG 2.022513 1.981244 2.033752 1.995434 1.957314
A1BG-AS1 2.011082 1.997331 2.078784 2.076459 1.951348
A1CF 2.189583 2.234262 1.549379 2.072498 2.285331
A2M 2.788965 2.756788 2.783230 2.752751 2.755728
A2M-AS1 2.104827 2.187389 2.372447 2.319328 2.171888
For Gaussian linear modeling, both the explanatory factors and the responses are assumed to follow normal distributions.
Our response variable, IC50, i.e. a concentration of a drug in µM at which half of tumor cells is dead, varies widely and is badly skewed.
log()
transformation helps to improve normality of the IC50 values.
Of note, during development of the package, we have observed that fitting a Gaussian model for negative response value may fail.
To avoid occurrence of negative drug response values upon log()
transformation, we are changing the units of IC50 to pM (pico-mole).
## IC50 concentrations in µM of 10 anti-cancer drugs
## log transformation greatly improves normality!
## Yet, `glmnet` may have problems with negative values after
## the log transformation, to avoid that, we're transforming the IC50 to pM
training_ic50 <- gdsc_ic50
> training_ic50[1:5, 1:5]
Camptothecin_1003 Vinblastine_1004 Cisplatin_1005 Cytarabine_1006 Docetaxel_1007
COSMIC_908467 0.010520653 0.011794665 8.105104 0.2087598 0.003440711
COSMIC_906798 0.005999987 0.018177077 1.855336 13.8102767 0.004296369
COSMIC_753584 0.053981926 0.062174085 11.444863 23.0481383 0.008578793
COSMIC_687588 0.042151880 0.074801151 16.417421 1.4409030 0.041145941
COSMIC_749712 0.099352813 0.005823254 18.511756 26.1957452 0.011557681
## poor normality of non-transformed IC50
## reflected by low Shapiro-wilk test statistic value
> training_ic50 %>%
+ as.data.frame %>%
+ map_dfr(shapiro_test) %>%
+ mutate(variable = colnames(training_ic50))
# A tibble: 10 × 3
variable statistic p.value
<chr> <dbl> <dbl>
1 Camptothecin_1003 0.348 3.02e-26
2 Vinblastine_1004 0.180 9.71e-29
3 Cisplatin_1005 0.445 1.48e-24
4 Cytarabine_1006 0.431 8.16e-25
5 Docetaxel_1007 0.481 7.01e-24
6 Gefitinib_1010 0.717 3.53e-18
7 Navitoclax_1011 0.322 1.17e-26
8 Vorinostat_1012 0.425 6.40e-25
9 Nilotinib_1013 0.666 1.22e-19
10 Olaparib_1017 0.648 4.04e-20
## great improvement by log-transformation!!!
> training_ic50 %>%
+ log %>%
+ as.data.frame %>%
+ map_dfr(shapiro_test) %>%
+ mutate(variable = colnames(training_ic50))
# A tibble: 10 × 3
variable statistic p.value
<chr> <dbl> <dbl>
1 Camptothecin_1003 0.967 1.18e- 4
2 Vinblastine_1004 0.955 5.43e- 6
3 Cisplatin_1005 0.991 2.53e- 1
4 Cytarabine_1006 0.980 6.45e- 3
5 Docetaxel_1007 0.982 1.15e- 2
6 Gefitinib_1010 0.980 6.44e- 3
7 Navitoclax_1011 0.991 2.41e- 1
8 Vorinostat_1012 0.985 3.78e- 2
9 Nilotinib_1013 0.900 2.35e-10
10 Olaparib_1017 0.989 1.31e- 1
## picomole conversion and log transformation
training_ic50 <- log(1e6 * training_ic50)
In the following portion of code, we are going to select modeling features, i.e. genes with substantial expression and variability, shared by both the training and test data set.
To this end, we are removing 20% of genes with the lowest median expression and 20% of genes with the lowest variability measured by Gini index.
Finally, we are adjusting the gene expression matrices with the selected modeling features for unwanted batch effects with ComBat.
Those two tasks are performed with the pre_process()
function of our package, but may be also done by other tools.
Of convenience, pre_process()
returns an object that bundles the adjusted training and test matrices and which can be passed directly to the modeling and prediction functions from our package.
## pre-processing involves the following steps:
##
## 1) selection of common genes in the GDSC cell lines and pancreatic cancers
## such genes must be fairly abundantly expressed and variable,
## hence using 0.2 quantiles of median expression and Gini coefficients
## as modeling variable selection criteria
##
## 2) batch adjustment via ComBat
pre_pro_data <- pre_process(train = training_x,
test = test_x,
median_quantile = 0.2,
gini_quantile = 0.2)
## the effects of batch adjustment and common gene numbers
## can be inspected by `summary()` and `plot()`
We can easily retrieve numbers of selected modeling features, in our case genes, and their basic distribution statistics with the summary()
method.
The distribution and variability measures can be visualized with plot()
, which returns a list of ggplot
graphs - feel free to customize them!
Selection cutoffs are displayed in the respective plots as dashed lines, whose color corresponds to the color coding of the training and test data sets.
> summary(pre_pro_data)
# A tibble: 2 × 11
data_set n_features n_observations n_missing mean sd median q25 q75 q025 q975
<chr> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 train 7635 200 0 4.00 1.26 3.67 2.95 4.85 2.47 6.87
2 test 7635 140 0 4.00 1.22 3.80 3.13 4.77 2.09 6.66
pre_pro_plots <- plot(pre_pro_data)
## easy customization of the diagnostic plots
pre_pro_plots$before_adjustment$median +
labs(title = 'Median, before ComBat',
x = 'Data set') +
scale_y_continuous(limits = c(0, 15)) +
pre_pro_plots$after_adjustment$median +
labs(title = 'Median, after ComBat',
x = 'Data set') +
scale_y_continuous(limits = c(0, 15))
pre_pro_plots$before_adjustment$gini_coef +
labs(title = 'Gini index, before ComBat',
x = 'Data set') +
scale_y_continuous(limits = c(0, 1)) +
pre_pro_plots$after_adjustment$gini_coef +
labs(title = 'Gini index, after ComBat',
x = 'Data set') +
scale_y_continuous(limits = c(0, 1))
In this section we are going to select the optimal value of the shrinkage paramater λ by cross-validation of the training data set, construct a series of regularized models and evaluate them.
Those tasks are accomplished by the function train()
.
In this case, train()
is supplied with the output of pre_process()
as x
argument; the matrix of explanatory variables for the training data set is picked automatically.
The user is also welcome to provide any numeric matrix as x
which fulfills two criteria: (1) features in rows and observations in columns, and (2) row and column names defined.
Critially, column names of x
and rownames of y
must overlap for matching the explanatory factors and the response.
NA
values are allowed in y
and will be silently skipped.
As we are fitting a series of Elastic Net models of drug sensitivity, we are setting the mixing parameter alpha
to 0.5.
Mean squared error seves as the loss function, i.e. model selection criterion.
By declaring plan('multisession')
, we are allowing for a parallel run:
## Gaussian family Elastic Net models are fitted
## (family = 'gaussian', alpha = 0.5)
##
## the optimal lambda is found by minimizing mean squared error
## in cross-validation (type.measure = 'mse')
##
## no standardization, because the gene expression values are expected
## to be on the same scale.
##
## The function runs in parallel upon defining a multisession backend
plan('multisession')
set.seed(12345)
train_object <- train(x = pre_pro_data,
y = training_ic50,
standardize = FALSE,
family = 'gaussian',
alpha = 0.5,
type.measure = 'mse')
plan('sequential')
Model evaluation statistics can be accessed by calling summary()
and visualized by plot()
.
> summary(train_object)
# A tibble: 10 × 12
response n_observations n_features n_coefficients lambda loss_name loss_…¹ loss_…² varia…³ rsq_oof pearson spear…⁴
<chr> <dbl> <int> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Camptothecin_1003 200 7635 35 0.258 mse min 2.24 3.56 0.370 0.850 0.849
2 Vinblastine_1004 200 7635 44 0.304 mse min 3.27 4.43 0.262 0.805 0.787
3 Cisplatin_1005 200 7635 44 0.259 mse min 2.26 4.03 0.440 0.845 0.854
4 Cytarabine_1006 200 7635 110 0.110 mse min 2.59 4.95 0.478 0.952 0.950
5 Docetaxel_1007 200 7635 75 0.110 mse min 1.60 2.01 0.203 0.900 0.873
6 Gefitinib_1010 200 7635 49 0.123 mse min 1.10 1.52 0.277 0.863 0.859
7 Navitoclax_1011 200 7635 50 0.286 mse min 3.46 4.73 0.269 0.824 0.779
8 Vorinostat_1012 200 7635 25 0.168 mse min 0.697 1.10 0.368 0.787 0.797
9 Nilotinib_1013 200 7635 61 0.154 mse min 1.77 2.92 0.395 0.879 0.862
10 Olaparib_1017 200 7635 48 0.143 mse min 1.02 1.77 0.422 0.856 0.835
# … with abbreviated variable names ¹loss_criterion, ²loss_value, ³variance, ⁴spearman
As before, plot()
generates a list of ggplot
graphics, whose appearance can be easily adjusted by the user.
In this particular case, we are adding dashed lines to one of the graphics, that represent cutoffs for selection of models with at least moderate performance in cross-validation (
train_plots <- plot(train_object)
train_plots$rsq_pearson +
labs(title = 'Molde performance: CV and training') +
geom_hline(yintercept = 0.5,
linetype = 'dashed') +
geom_vline(xintercept = 0.26,
linetype = 'dashed')
In this section, we are going to exploit the trained models to glimpse at genes associated with response towards particular anti-cancer compounds. Our modeling approach bases on the Elastic Net algorithm, which in fact uses just few-to-few-hundred explanatory variables of roughly 7000+ genes available for modeling. Such genes selected by the algorithm are expected to associate strongly with the response, in our case with drug sensitivity, and can be regarded as a kind of 'drug response signauture'. In the following code, we are going to retrieve such genes characterized by non-zero β coefficients and visualize their coefficient values for one popular anti cancer gene, cis-platin.
Model coefficients are computed by calling coef()
, which returns a matrix of model coefficients for all genes used for modeling and the model intercept:
## gene signatures of drug response may be obtained by extracting
## non-zero coefficients of the regularized model.
## By this way, we're selecting only the features that contribute to
## prediction of drug sensitivity
##
## Coefficients of the models are obtained via `coef()`, which returns
## a matrix for all investigated genes.
train_coefficients <- coef(train_object)
> train_coefficients[1:5, 1:5]
Camptothecin_1003 Vinblastine_1004 Cisplatin_1005 Cytarabine_1006 Docetaxel_1007
(Intercept) 16.04874 11.3986 18.14665 24.46561 10.59089
TSPAN6 0.00000 0.0000 0.00000 0.00000 0.00000
SCYL3 0.00000 0.0000 0.00000 0.00000 0.00000
C1orf112 0.00000 0.0000 0.00000 0.00000 0.00000
FGR 0.00000 0.0000 0.00000 0.00000 0.00000
## removal of the intercepts and selection of non-zero model coefficients
non_zero <- train_coefficients[-1, ]
non_zero <- colnames(non_zero) %>%
map(~non_zero[, .x]) %>%
map(as.data.frame) %>%
map(set_names, 'beta')
non_zero <- non_zero %>%
map(filter, beta != 0) %>%
map(rownames_to_column, 'gene_symbol') %>%
map(arrange, -beta) %>%
map(as_tibble) %>%
set_names(colnames(train_coefficients))
For cis-platin, 43 non-zero model coefficients are:
> non_zero$Cisplatin_1005
# A tibble: 43 × 2
gene_symbol beta
<chr> <dbl>
1 IL17RC 0.341
2 TOLLIP 0.208
3 HM13 0.192
4 PPME1 0.143
5 PTTG1IP 0.140
6 TM2D2 0.134
7 PPP2R2C 0.123
8 WTIP 0.118
9 UBTD1 0.0847
10 ABHD2 0.0714
# … with 33 more rows
# ℹ Use `print(n = ...)` to see more rows
## plotting coefficients of the signature members for cis-platin
non_zero$Cisplatin_1005 %>%
ggplot(aes(x = beta,
y = reorder(gene_symbol, beta),
fill = factor(sign(beta)))) +
geom_bar(stat = 'identity') +
scale_fill_manual(values = c('-1' = 'steelblue',
'1' = 'indianred3')) +
theme_bw() +
theme(axis.title.y = element_blank(),
axis.text.y = element_text(face = 'italic'),
legend.position = 'none') +
labs(title = 'Cis-platin response signature',
x = expression(beta * ', association with IC50'))
As evident from the graphic above, at least some genes of the cis-platin response signature make biological sense. For example, IL17RC and TOLLIP are involved in NK-κB signaling, which may orchestrate cell survival. PPME and PPP2R2C are implicated in pro-survival signaling, while RIOK, SLFN11, DOT1L, and CHECK2 are involved in DNA replication, repair, chromatic control, and protein synthesis.
In the final part of this example, we are gong to predict IC50 concentrations for individual bulk-tumor samples of pancreatic carcinoma in the test data set.
Such predictions, but also predictions of e.g. linear predictor scores, probabilities and class assignment are performed by calling the predict()
method.
In our example, we would like to to generate predictions at the same scale as the log-transformed IC50 values used for model training and are hence specifying type = 'response'
.
The numeric matrix of explanatory factors, i.e. the test data set, is provided as newdata
and has to be in the feature (row) pre_process()
as newdata
: the test data set will be picked automatically.
Please note, that in the code below, we are transforming the predictions back to IC50 expressed as µM:
## IC50 samples for individuals cancers may be computed
## by calling `predict(type = 'response')`.
## Of convenience, as `newdata`, the result of `pre_process()` may be
## supplied. The function picks the test data set automatically!
test_ic50 <- predict(train_object,
newdata = pre_pro_data,
type = 'response')
## transforming the predicted sensitivities back to µM
test_ic50 <- exp(test_ic50) * 1e-6
> test_ic50[1:5, 1:5]
Camptothecin_1003 Vinblastine_1004 Cisplatin_1005 Cytarabine_1006 Docetaxel_1007
C3L-02899 0.06114990 0.04003696 22.306809 2.977348 0.008361944
C3N-02585 0.10661910 0.10359832 49.587790 3.172271 0.015345744
C3N-00303 0.06506168 0.00824819 4.778458 1.064913 0.008228247
C3N-03428 0.01745086 0.01321737 2.731877 1.459961 0.005907104
C3L-00599 0.15932344 0.03192528 27.123428 1.126792 0.015345844
Basically, such drug sensitivity measures may be used directly in analyses such as clustering or comparison of cancer sample subsets. Here, we can visualize the predicted IC50 values for the compounds whose models performed with at least moderate reliability as a violin plot:
## let's visualize predictions for
## the models with the best performance
## (cross-validated R^2 >= 0.26, Pearson's r >= 0.5)
best_models <- train_object %>%
summary %>%
filter(rsq_oof >= 0.26,
pearson > 0.5) %>%
.$response
## plotting data in long format
test_plot_data <- test_ic50[, best_models] %>%
as.data.frame %>%
rownames_to_column('sample_id') %>%
as_tibble
test_plot_data <- test_plot_data %>%
pivot_longer(cols = all_of(best_models),
names_to = 'compound',
values_to = 'ic50')
> test_plot_data
# A tibble: 1,260 × 3
sample_id compound ic50
<chr> <chr> <dbl>
1 C3L-02899 Camptothecin_1003 0.0611
2 C3L-02899 Vinblastine_1004 0.0400
3 C3L-02899 Cisplatin_1005 22.3
4 C3L-02899 Cytarabine_1006 2.98
5 C3L-02899 Gefitinib_1010 23.8
6 C3L-02899 Navitoclax_1011 7.73
7 C3L-02899 Vorinostat_1012 7.12
8 C3L-02899 Nilotinib_1013 34.2
9 C3L-02899 Olaparib_1017 49.4
10 C3N-02585 Camptothecin_1003 0.107
# … with 1,250 more rows
# ℹ Use `print(n = ...)` to see more rows
test_plot_data %>%
ggplot(aes(x = reorder(compound, ic50),
y = ic50)) +
geom_violin(fill = 'cornsilk',
scale = 'width') +
geom_point(shape = 16,
size = 1,
position = position_jitter(width = 0.05,
height = 0),
alpha = 0.25) +
scale_y_continuous(trans = 'log10') +
guides(x = guide_axis(angle = 45)) +
theme_bw() +
labs(title = 'Predicted IC50, PAAD CPTAC',
subtitle = 'Bulk cancer samples',
x = 'Compound',
y = 'IC50, µM')
Footnotes
-
Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw (2010) 33:1–22. doi:10.18637/jss.v033.i01 ↩ ↩2 ↩3
-
Geeleher P, Cox NJ, Huang RS. Clinical drug response can be predicted using baseline gene expression levels and in vitro drug sensitivity in cell lines. Genome Biol (2014) 15: doi:10.1186/gb-2014-15-3-r47 ↩
-
Geeleher P, Cox N, Stephanie Huang R. pRRophetic: An R Package for Prediction of Clinical Chemotherapeutic Response from Tumor Gene Expression Levels. PLoS One (2014) 9:e107468. doi:10.1371/JOURNAL.PONE.0107468 ↩
-
Maeser D, Gruener RF, Huang RS. oncoPredict: an R package for predicting in vivo or cancer patient drug response and biomarkers from cell line screening data. Brief Bioinform (2021) 22:1–7. doi:10.1093/BIB/BBAB260 ↩ ↩2
-
Tibshirani R. Regression Shrinkage and Selection via the Lasso. J R Stat Soc Ser B (1996) 58:267–288. doi:10.1111/j.2517-6161.1996.tb02080.x ↩
-
Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B Stat Methodol (2005) 67:301–320. doi:10.1111/j.1467-9868.2005.00503.x ↩ ↩2
-
Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics (2012) 28:882. doi:10.1093/BIOINFORMATICS/BTS034 ↩
-
Yang W, Soares J, Greninger P, Edelman EJ, Lightfoot H, Forbes S, Bindal N, Beare D, Smith JA, Thompson IR, et al. Genomics of Drug Sensitivity in Cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Res (2013) 41:D955–D961. doi:10.1093/NAR/GKS1111 ↩
-
Cao L, Huang C, Cui Zhou D, Hu Y, Lih TM, Savage SR, Krug K, Clark DJ, Schnaubelt M, Chen L, et al. Proteogenomic characterization of pancreatic ductal adenocarcinoma. Cell (2021) 184:5031-5052.e26. doi:10.1016/J.CELL.2021.08.023 ↩