This document reproduces parts of numerical studies presented in the best intratumor heterogeneity (ITH) association study paper. Specifically, we will
- simulate (binary) multiregion genomic data,
- estimate parameters that will be used to find optimal designs, and
- find optimal designs for pre-collection and post-collection scenarios for given estimated parameters.
All of the functions appearing in this document are in
We simulate multiregion genomic data to estimate three parameters, $\tau^2$, $\sigma^2$, and $\rho$.
simulate_tumors() generates a list of matrices, each representing the
multiregion genomic data for one tumor (corresponding to one subject).
Required inputs are
- number of tumors (nTumors)
- number of samples for each tumor (nSamp)
- number of genes (or probes, copy number segments, etc.) (nSeg)
- underlying mutation rate of a patient (mutationRates), which will be used to generate the underlying mutation status vector
- lower bound of the uniform distribution from which $\theta$ will be drawn (theta_lb)
- upper bound of the uniform distribution from which $\theta$ will be drawn (theta_ub)
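nTumors <- 50    # number of tumors (one per subject)
nSamp <- 10      # number of samples per tumor
nSeg <- 40       # number of genes / segments
theta_lb <- 0.5  # lower bound of the uniform distribution for theta
theta_ub <- 0.6  # upper bound of the uniform distribution for theta
# Drawn before set.seed(1), so mutationRates (and the output below) varies between runs.
mutationRates <- rbeta(nSeg,2,3)
set.seed(1)
tumor_mat_list <- simulate_tumors(nTumors=nTumors,nSamp=nSamp,nSeg=nSeg,
                                  mutationRates=mutationRates,
                                  theta_lb=theta_lb,theta_ub=theta_ub)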
tumor_mat_list[[1]]
## s1 s2 s3 s4 s5 s6 s7 s8 s9 s10
## [1,] 0 0 0 0 0 0 0 0 0 0
## [2,] 0 1 0 0 1 1 0 0 1 0
## [3,] 0 1 0 1 0 1 0 1 1 0
## [4,] 1 1 0 1 1 0 1 1 1 0
## [5,] 0 0 0 0 0 0 0 0 0 0
## [6,] 1 1 1 0 0 1 0 1 1 1
## [7,] 1 1 1 0 1 0 1 1 1 1
## [8,] 1 1 1 0 0 1 1 0 1 1
## [9,] 0 0 1 1 0 0 1 1 0 1
## [10,] 0 0 0 0 0 0 0 0 0 0
## [11,] 0 0 0 0 0 0 0 0 0 0
## [12,] 0 0 0 0 0 0 0 0 0 0
## [13,] 0 0 1 0 1 1 0 0 0 1
## [14,] 0 0 0 0 0 0 0 0 0 0
## [15,] 0 1 0 1 0 0 0 1 1 1
## [16,] 1 0 0 0 1 0 1 0 1 1
## [17,] 0 1 0 0 0 1 1 1 1 0
## [18,] 1 1 1 0 1 1 1 0 1 0
## [19,] 0 0 0 0 0 0 0 0 0 0
## [20,] 0 0 1 1 0 0 1 1 1 1
## [21,] 1 1 1 0 0 1 1 0 1 0
## [22,] 0 0 0 0 0 0 0 0 0 0
## [23,] 0 1 1 1 1 0 0 1 1 0
## [24,] 0 0 0 0 0 0 0 0 0 0
## [25,] 0 0 0 0 0 0 0 0 0 0
## [26,] 0 0 0 0 0 0 0 0 0 0
## [27,] 0 0 0 0 0 0 0 0 0 0
## [28,] 0 0 0 0 0 0 0 0 0 0
## [29,] 1 1 0 1 1 1 0 0 1 0
## [30,] 0 0 0 0 0 0 0 0 0 0
## [31,] 0 0 0 0 0 0 0 0 0 0
## [32,] 0 0 0 0 0 0 0 0 0 0
## [33,] 0 0 0 0 0 0 0 0 0 0
## [34,] 1 1 0 1 0 1 1 0 1 0
## [35,] 1 1 1 0 1 0 1 0 0 1
## [36,] 0 0 0 0 0 0 0 0 0 0
## [37,] 1 0 0 0 0 1 1 0 0 1
## [38,] 0 0 0 0 0 0 0 0 0 0
## [39,] 1 0 0 0 0 0 0 0 1 1
## [40,] 0 0 0 0 0 0 0 0 0 0
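
The printed matrix shows a simple pattern: some rows (segments) are zero in every sample, while the remaining rows are mutated in roughly half of the samples. A minimal sketch of one mechanism that produces data of this shape is given below. It is an assumption made for illustration, not the actual implementation of simulate_tumors(): the function name simulate_tumors_sketch() and the two-stage Bernoulli scheme are hypothetical.

```r
# Hypothetical sketch; NOT the package's simulate_tumors().
# Assumed mechanism, per tumor:
#   theta ~ Uniform(theta_lb, theta_ub)
#   z_j   ~ Bernoulli(mutationRates[j])  (underlying status of segment j)
#   a segment with z_j = 1 is observed as mutated in each sample with
#   probability theta; a segment with z_j = 0 is never mutated.
simulate_tumors_sketch <- function(nTumors, nSamp, nSeg, mutationRates,
                                   theta_lb, theta_ub) {
  stopifnot(length(mutationRates) == nSeg, theta_lb <= theta_ub)
  lapply(seq_len(nTumors), function(i) {
    theta <- runif(1, theta_lb, theta_ub)
    z <- rbinom(nSeg, size = 1, prob = mutationRates)
    obs <- matrix(rbinom(nSeg * nSamp, size = 1, prob = theta),
                  nrow = nSeg, ncol = nSamp)
    mat <- obs * z  # z recycles down the columns, so row j is scaled by z[j]
    colnames(mat) <- paste0("s", seq_len(nSamp))
    mat
  })
}

set.seed(1)
rates <- rbeta(40, 2, 3)
sketch_list <- simulate_tumors_sketch(50, 10, 40, rates, 0.5, 0.6)
dim(sketch_list[[1]])
```

Under this assumed scheme, every matrix has nSeg rows and nSamp columns, and any row whose underlying status is 0 is identically zero, which matches the all-zero rows in the output above.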
We use simulated data tumor_mat_list to estimate parameters
and its conditional variance as
where
The function estimate_parameters() takes a list of multiregion genomic
profile matrices and estimates the parameters.
estimate_parameters(tumor_mat_list)
## sigma_square rho tau_square
## 3.8413333 0.1048889 1.4243483
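
The printed estimates appear in the order sigma_square, rho, tau_square, whereas phi1() is called later in this document with the order tau_sq, sigma_sq, rho. Selecting by name rather than by position avoids mixing them up. The sketch below assumes the return value is a named numeric vector, as the printout suggests; the values are copied from the output above rather than recomputed.

```r
# Assumption: estimate_parameters() returns a named numeric vector.
# The values below are copied from the printed output, not recomputed.
est <- c(sigma_square = 3.8413333, rho = 0.1048889, tau_square = 1.4243483)

# Select by name so the order matches (tau_sq, sigma_sq, rho).
tau_sq   <- est[["tau_square"]]
sigma_sq <- est[["sigma_square"]]
rho      <- est[["rho"]]
c(tau_sq = tau_sq, sigma_sq = sigma_sq, rho = rho)
```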
The objective function we want to maximize is
where
Given the budget to profile
Pre-collection scenario assumes no samples have been collected. Thus,
one has freedom to select any
Assume
In the following example, we set
- $(\hat{\tau}^2,\hat{\sigma}^2,\hat{\rho}) = (4.718, 5.831, 1.463)$
- $(\hat{\tau}^2,\hat{\sigma}^2,\hat{\rho}) = (3.123, 6.527, 0.877)$
- $(\hat{\tau}^2,\hat{\sigma}^2,\hat{\rho}) = (2.009, 6.800, 0.147)$
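The function phi1() takes $\tau^2$, $\sigma^2$, $\rho$, the number of samples per tumor $K$, and the budget $M$, and returns the value of the objective function $\varphi$.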
library(tibble)    # tibble(), as_tibble()
library(magrittr)  # %>%

# Create parameter table
parameter_tab <- tibble(
  tau_sq = c(4.718,3.123,2.009),
  sigma_sq = c(5.831, 6.527, 6.800),
  rho = c(1.463, 0.877, 0.147)
)
# Find K_max for each parameter setting
M <- 100         # budget
nSamp_max <- 10  # largest number of samples per tumor considered
res <- apply(parameter_tab,1,function(x){
  tau_sq <- x[1]
  sigma_sq <- x[2]
  rho <- x[3]
  # Evaluate the objective for K = 2, ..., nSamp_max
  phi <- lapply(2:nSamp_max,function(nSamp) phi1(tau_sq,sigma_sq,rho,nSamp,M)) %>% unlist()
  K_max <- which.max(phi) + 1  # + 1 because the grid starts at K = 2
  phi_max <- phi[which.max(phi)]
  c(K_max, phi_max)
}) %>% t()
colnames(res) <- c("K_max","phi_max")
res <- as_tibble(cbind(parameter_tab,res))
res
## # A tibble: 3 × 5
## tau_sq sigma_sq rho K_max phi_max
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4.72 5.83 1.46 2 4.74
## 2 3.12 6.53 0.877 3 5.67
## 3 2.01 6.8 0.147 4 7.72
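
The search above relies on which.max(phi) + 1 to convert a position in phi into a value of $K$, which is correct only because the grid starts at 2. A minimal sketch of the same grid search with an explicit grid is shown below. The helper find_K_max() and the toy objective phi_toy() are hypothetical stand-ins for illustration; phi_toy() is not the objective function computed by phi1().

```r
# Hypothetical helper: evaluate an objective on a grid of K values and
# return the maximiser together with the maximum.
find_K_max <- function(phi_fun, K_grid) {
  phi <- vapply(K_grid, phi_fun, numeric(1))
  best <- which.max(phi)
  # Index into the grid itself instead of adding a fixed offset.
  c(K_max = K_grid[best], phi_max = phi[best])
}

# Toy concave objective with its peak at K = 4 (illustration only).
phi_toy <- function(K) 7 - (K - 4)^2

res_toy <- find_K_max(phi_toy, 2:10)
res_toy
```

Indexing K_grid directly keeps the result correct if the grid is later changed, for example to 3:12. In practice one could pass function(K) phi1(tau_sq, sigma_sq, rho, K, M) as phi_fun.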
Next, we plot $\varphi$ against $K$ for each parameter setting, with the maximum highlighted in red.
library(latex2exp)  # TeX()

par(mfrow=c(1,3))
par(oma = c(3,3,0,0))
par(mar = c(2,2,2,1))
invisible(apply(res,1,function(x){
  tau_sq <- x[1]
  sigma_sq <- x[2]
  rho <- x[3]
  K <- 2:nSamp_max
  phi <- unlist(lapply(K,function(nSamp) phi1(tau_sq,sigma_sq,rho,nSamp,M)))
  # Smooth curve through the points: degree-6 orthogonal polynomial fit
  fit <- lm(phi~poly(K,6,raw=FALSE))
  main_str <- TeX(sprintf("$(\\tau^2,\\sigma^2,\\rho) = (%0.2f,%0.2f,%0.2f)$",tau_sq,sigma_sq,rho))
  plot(K,phi,ylab="", xlab="",main = main_str,col=ifelse(phi==max(phi),"red","black"),
       pch=20,ylim = c(3,8))
  # The newdata column must be named K to match the model formula
  lines(K, predict(fit,data.frame(K=K)), col="blue", lwd = 0.5)
}))
mtext(TeX("$K$"), side = 1, outer = TRUE, line = 1)
mtext(TeX("$\\varphi$"), side = 2, outer = TRUE, line = 0.75,las=1)

For the post-collection scenario, we assume tumor samples have been
collected for
| number of tumor samples collected | number of patients |
|---|---|
| 2 | $n_2$ |
| 3 | $n_3$ |
| 4 | $n_4$ |
where
For illustration purposes, we explore the optimal design for a study with the following estimated parameters:
$(\hat{\tau}^2,\hat{\sigma}^2,\hat{\rho}) = (2.04,5.24,1.68)$
We assume that 372 tumor samples have already been collected from 84 patients, as tabulated below:
| number of tumor samples collected | number of patients |
|---|---|
| 2 | 20 |
| 3 | 16 |
| 4 | 14 |
| 5 | 10 |
| 6 | 8 |
| 7 | 6 |
| 8 | 4 |
| 9 | 4 |
| 10 | 2 |
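
The totals quoted above can be checked directly from the table. The short sketch below encodes the table as a vector whose $k$-th entry is the number of patients with $k$ collected samples (the first entry is 0 because no patient has exactly one sample); the variable names are chosen here for illustration.

```r
# Number of patients with k collected samples, k = 1, ..., 10.
N <- c(0, 20, 16, 14, 10, 8, 6, 4, 4, 2)
k <- seq_along(N)

n_patients <- sum(N)      # total number of patients
n_samples  <- sum(k * N)  # total number of collected tumor samples
c(patients = n_patients, samples = n_samples)
## patients  samples
##       84      372
```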
The function recursive_search() (defined in recursive_search.cpp)
computes the optimal design with the following inputs:
- A: the largest number of tumor samples collected among all subjects
- M: the total number of tumor samples budgeted
- R: the number of available subjects with more than A samples
- Om: a vector of "candidate" numbers of subjects $(\omega_1,\omega_2,...,\omega_A)$
- N: a vector of the numbers of subjects collected for each number of samples $(n_1,n_2,...,n_A)$
- tau_sq: $\tau^2$
- sigma_sq: $\sigma^2$
- rho: $\rho$
Set budget as $M = 200$.
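params <- c(2.04,5.24,1.68)      # (tau_sq, sigma_sq, rho)
A <- 10                          # largest number of samples collected per subject
N <- c(0,20,16,14,10,8,6,4,4,2)  # subjects with k collected samples, k = 1,...,A
Om <- rep(0,10)                  # candidate design, initialised to zero
R <- 0
M <- 200                         # total number of tumor samples budgeted
res <- recursive_search(A = A,M = M,R = R, Om = Om, N = N,
                        tau_sq = params[1], sigma_sq = params[2],rho = params[3])

The function outputs a list containing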
- the solution $(\omega_1,\omega_2,...,\omega_A)$, and
- the optimal value $\varphi_{max}$
res
## [[1]]
## [1] 0 52 32 0 0 0 0 0 0 0
##
## [[2]]
## [1] 13.6646
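
One way to sanity-check the reported solution is to confirm that it uses the budget exactly and is feasible given the collected samples. The sketch below assumes that $\omega_k$ is the number of subjects from whom $k$ samples are profiled; this reading is an assumption made for the check. The vectors are copied from the inputs and output above rather than recomputed, and the helper at_least() is hypothetical.

```r
omega <- c(0, 52, 32, 0, 0, 0, 0, 0, 0, 0)  # res[[1]], copied from the output
N     <- c(0, 20, 16, 14, 10, 8, 6, 4, 4, 2)
M     <- 200
k     <- seq_along(omega)

samples_used  <- sum(k * omega)  # 2 * 52 + 3 * 32
subjects_used <- sum(omega)      # 52 + 32

# Feasibility under the assumed reading: the number of subjects profiled with
# at least k samples cannot exceed the number of subjects from whom at least
# k samples were collected.
at_least <- function(v) rev(cumsum(rev(v)))
feasible <- all(at_least(omega) <= at_least(N))

list(samples_used = samples_used, subjects_used = subjects_used,
     feasible = feasible)
```

Under this reading, the solution spends all 200 budgeted samples and profiles each of the 84 patients, with two samples from 52 of them and three samples from the other 32.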
