-
Notifications
You must be signed in to change notification settings - Fork 26
/
Copy pathappendix.Rmd
1220 lines (906 loc) · 45 KB
/
appendix.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Appendix
## Maximum Likelihood Review
This is a very brief refresher on *maximum likelihood estimation* using a standard regression approach as an example. Given the likelihood's role in Bayesian estimation and statistics in general, and the ties between specific Bayesian results and maximum likelihood estimates one typically comes across, one should be conceptually comfortable with some basic likelihood estimation.
In the standard model setting we attempt to find parameters $\theta$ that will maximize the probability of the data we actually observe. We'll start with an observed random target vector $y$ with $i...N$ independent and identically distributed observations and some data-generating process underlying it $f(\cdot|\theta)$. The key idea is that we are interested in estimating the model parameter(s), $\theta$, that would make the data most likely to have occurred. The probability density function for an observed value of $y$ given some particular estimate for the parameters can be noted as $f(y_i|\theta)$. The joint probability distribution of the (independent) observations given those parameters is the product of the individual densities, and this is our basic *likelihood function*. We can write it out generally as:
$$\mathcal{L}(\theta) = \prod_{i=1}^N f(y_i|\theta)$$
<!-- <img src="img/maxLikeNormalCompareLLforDifferentMeans.svg" style="display:block; margin: 0 auto; width:50%"> -->
So the *likelihood* for one set of parameter estimates given a fixed set of data y, is equal to the probability of the data given those (fixed) estimates. Furthermore, we can compare one set, $\mathcal{L}(\theta_A)$, to that of another, $\mathcal{L}(\theta_B)$, and whichever produces the greater likelihood would be the preferred set of estimates. We can get a sense of this with the following visualization, based on a single parameter, Poisson distributed variable. The data is drawn from a variable with mean $\theta=5$. We note the calculated likelihood increases as we estimate values for $\theta$ closer to $5$. With more and more data, the final ML estimate will converge on the true value.
```{r maxLikeillustration, echo=FALSE, out.width='500px', message=T}
set.seed(1234)
y = rpois(100000, lambda = 5)
mus = seq(3, 8, l = 100)
L = map_dbl(mus, function(mu) sum(dpois(y, lambda = mu, log = TRUE)))
message(glue::glue('Final estimate = ', round(mus[L == max(L)], 2)))
# y = rnorm(100000, mean=50, 10)
# mus = seq(40,60, l=100)
# L = sapply(mus, function(mu) sum(dnorm(y, mean=mu, sd=10, log=T)))
# y = rbinom(100000, size=100, p=.5)
# mus = seq(.3, .7, l=100)
# L = sapply(mus, function(mu) sum(dbinom(y, size=100, p=mu, log=T)))
# mus[L==max(L)]
ggplot(data.frame(mus, L)) +
geom_vline(aes(xintercept = 5), alpha = .5, lty = 2) +
geom_hline(aes(yintercept = max(L)), alpha = .5, lty = 2) +
geom_path(aes(x = mus, y = L), color = '#ff5500', alpha = .9, lwd = 2) +
labs(x = expression(theta), y = 'Likelihood') +
theme_clean() +
theme(
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
axis.text.x = element_text(color = 'gray10', size = 14),
axis.title.x = element_text(color = 'gray10', size = 24),
plot.background = element_rect(fill = "transparent", colour = NA)
)
# ggsave('img/maxLikeNormalCompareLLforDifferentMeans.svg', bg='transparent')
```
For computational reasons, we instead work with the sum of the natural log probabilities[^logs], and thus the *log likelihood*:
$$\ln\mathcal{L}(\theta) = \sum_{i=1}^N \ln[f(y_i|\theta)]$$
Concretely, we calculate a log likelihood for each observation and then sum them for the total likelihood for parameter(s) $\theta$.
The likelihood function incorporates our assumption about the sampling distribution of the data given some estimate for the parameters. It can take on many forms and be notably complex depending on the model in question, but once specified, we can use any number of optimization approaches to find the estimates of the parameter that make the data most likely. As an example, for a normally distributed variable of interest we can write the log likelihood as follows:
$$\ln\mathcal{L}(\theta) = \sum_{i=1}^N \ln[\frac{1}{\sqrt{2\pi\sigma^2}}\exp(-\frac{(y-\mu)^2}{2\sigma^2})]$$
### Example
In the following we will demonstrate the maximum likelihood approach to estimation for a simple setting incorporating a normal distribution, where we estimate the mean and variance/sd for a set of values $y$[^notlazy]. First the data is created, and then we create the function that will compute the log likelihood. Using the built in R distributions[^distributions] makes it fairly straightforward to create our own likelihood function and feed it into an optimization function to find the best parameters. We will set things up to work with the <span class="pack">bbmle</span> package, which has some nice summary functionality and other features. However, one should take a glance at <span class="func">optim</span> and the other underlying functions that do the work.
```{r bblmDemo1}
# for replication
set.seed(1234)
# create the data
y = rnorm(1000, mean = 5, sd = 2)
startvals = c(0, 1)
# the log likelihood function
LL = function(mu = startvals[1], sigma = startvals[2], verbose = TRUE) {
ll = -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
if (verbose) message(paste(mu, sigma, ll))
ll
}
```
The <span class="func">LL</span> function takes starting points for the parameters as arguments, in this case we call them $\mu$ and $\sigma$, which will be set to `r startvals[1]` and `r startvals[2]` respectively. Only the first line (ll = -sum...) is actually necessary, and we use <span class="func">dnorm</span> to get the density for each point[^reasy]. Since this optimizer is by default minimization, we reverse the sign of the sum so as to *minimize the negative log likelihood*, which is the same as maximizing the likelihood. Note that the bit of other code just allows you to see the estimates as the optimization procedure searches for the best values. I do not show that here, but you'll see it in your console if `verbose=TRUE`.
We are now ready to obtain maximum likelihood estimates for the parameters. For the <span class="func">mle2</span> function we will need the function we've created, plus other inputs related to that function or the underlying optimizing function used (by default <span class="func">optim</span>). In this case we will use an optimization procedure that will allow us to set a lower bound for $\sigma$. This isn't strictly necessary, but otherwise you would get warnings and possibly lack of convergence if negative estimates for $\sigma$ were allowed[^logsigma].
```{r bblmDemo1B}
library(bbmle)
# using optim, and L-BFGS-B so as to constrain sigma to be positive by setting
# the lower bound at zero
mlnorm = mle2(
LL,
start = list(mu = 2, sigma = 1),
method = "L-BFGS-B",
lower = c(sigma = 0)
)
mlnorm
# compare to an intercept only regression model
broom::tidy(lm(y ~ 1))
```
We can see that the ML estimates are the same as the intercept only model estimates[^MLvOLS], which given the sample size are close to the true values.
In terms of the parameters we estimate, in the typical case of two or more parameters we can think of a *likelihood surface* that represents the possible likelihood values given any particular set of estimates. Given some starting point, the optimization procedure then travels along the surface looking for a minimum/maximum point[^tangent]. For simpler settings such as this, we can visualize the likelihood surface and its minimum point. The optimizer travels along this surface until it finds a minimum (the surface plot is interactive- feel free to adjust). I also plot the path of the optimizer from a top down view. The large blue dot noted represents the minimum negative log likelihood.
```{r plotSurface, echo=FALSE}
# RSTUDIO VIEWER WILL NOT CORRECTLY DISPLAY EVEN IF PLOT IS CORRECT
# PLOTLY STILL BREAKS MATHJAX FOR TUFTE, CAN ONLY USE SCREENSHOT
mu = seq(4, 6, length = 50)
sigma = seq(1.5, 3, length = 50)
llsurf = matrix(NA, length(mu), length(sigma))
for (i in 1:length(mu)){
for (j in 1:length(sigma)){
llsurf[i,j] = -sum(dnorm(y, mean=mu[i], sd=sigma[j], log=T))
}
}
rownames(llsurf) = mu
colnames(llsurf) = sigma
plotdat = read.csv('data//mleNormEstimates.csv')
colnames(plotdat) = c('mu', 'sigma', 'll')
pointdat = subset(plotdat, mu <=6 & mu >=4 & sigma<= 3 & sigma >=1.5)
# because plotly legend doesn't work correctly with surface, color scale dropped, not needed anyway https://community.plot.ly/t/how-to-name-axis-and-show-legend-in-mesh3d-and-surface-3d-plots/1819
library(plotly)
llsurf_trans = t(llsurf)
plot_ly(
x = ~ mu,
y = ~ sigma,
z = ~ llsurf_trans,
colors = viridis::plasma(500),
showscale = FALSE,
width = 750
) %>%
add_surface() %>%
add_trace(
x = ~ mu,
y = ~ sigma,
z = ~ ll,
mode = 'markers',
marker = list(color = palettes$orange, size = 3),
type = "scatter3d",
data = pointdat,
name = NULL,
showlegend = FALSE
) %>%
add_trace(
x = coef(mlnorm)[1],
y = coef(mlnorm)[2],
z = mlnorm@min,
marker = list(color = palettes$orange$complementary[2], size = 10),
type = "scatter3d",
showlegend = FALSE
) %>%
# config(displayModeBar = FALSE) %>%
theme_plotly() %>%
layout(
scene = list(
xaxis = list(
title = 'mu',
titlefont = list(size = 12),
tickfont = list(size = 10),
dtick = .25
),
yaxis = list(
title = 'sigma',
titlefont = list(size = 12),
tickfont = list(size = 10),
dtick = .25
),
zaxis = list(
title = 'Neg. Log Lik ',
titlefont = list(size = 8),
tickfont = list(size = 10),
position = .1
)
),
paper_bgcolor = 'rgba(0,0,0,0)',
plot_bgcolor = 'rgba(0,0,0,0)'
)
detach(package:plotly)
```
```{r plotNormMLEpath, echo=FALSE, out.width='500px'}
# plotdat = read.csv('data//mleNormEstimates.csv'); colnames(plotdat) = c('mu','sigma', 'll')
plotdat$nll = -plotdat$ll
# car::scatter3d(ll~mu + sigma, data=plotdat, fill=F, grid=F, residuals=F, point.col='#FF5500')
# library(ggplot2); library(reshape2)
ggplot(aes(x = mu, y = sigma), data = plotdat) +
geom_point(
aes(),
color = '#ff5500',
size = 4,
alpha = .25,
show.legend = FALSE,
position = position_jitter(w = 0.2, h = 0.2)
) +
scale_size_continuous(range = c(.1, 5)) +
geom_point(
aes(),
col = palettes$orange$complementary[2],
data = data.frame(t(coef(mlnorm))),
size = 10,
alpha = .9
) +
# geom_path(aes(), col='navy', alpha=.5) +
labs(x = expression(mu), y = expression(sigma)) +
theme_clean() +
theme(
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
axis.text.x = element_text(color = 'gray10', size = 14),
axis.title.x = element_text(color = 'gray10', size = 24),
axis.title.y = element_text(color = 'gray10', size = 24),
plot.background = element_rect(fill = "transparent", colour = NA)
)
# ggsave('img/mlnorm.svg', bg='transparent')
```
<br> <p class="" style='font-size:.75em; text-align:center'>A bit of jitter was added to the points to better see what's going on.</p>
Please note that there are many other considerations in optimization completely ignored here, but for our purposes and the audience for which this is intended, we do not want to lose sight of the forest for the trees. We now move next to a slightly more complicated regression example.
## Linear Model
In the standard regression context, the expected value for the target variable comes from our linear predictor, i.e. the weighted combination of our explanatory variables, and we estimate the regression weights/coefficients and possibly other relevant parameters. We can expand our previous example to the standard linear model without too much change. In this case we estimate a mean for each observation, but otherwise assume the variance is constant across observations. Again, we first construct some data so that we know exactly what to expect, then write out the likelihood function with starting parameters. As we need to estimate our intercept and coefficient for the X predictor (collectively referred to as $\beta$), we can think of our likelihood explicitly as before:
$$\ln\mathcal{L}(\beta, \sigma^2) = \sum_{i=1}^N \ln[\frac{1}{\sqrt{2\pi\sigma^2}}\exp(-\frac{(y-X\beta)^2}{2\sigma^2})]$$
```{r bblmeReg}
# for replication
set.seed(1234)
# predictor
X = rnorm(1000)
# coefficients for intercept and predictor
beta = c(5, 2)
# add intercept to X and create y with some noise
y = cbind(1, X) %*% beta + rnorm(1000, sd = 2.5)
regLL = function(sigma = 1, Int = 0, b1 = 0){
coefs = c(Int, b1)
mu = cbind(1, X) %*% coefs
ll = -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
message(paste(sigma, Int, b1, ll))
ll
}
library(bbmle)
mlopt = mle2(regLL, method = "L-BFGS-B", lower = c(sigma = 0))
summary(mlopt)
# plot(profile(mlopt), absVal = FALSE)
modlm = lm(y ~ X)
broom::tidy(modlm)
-2 * logLik(modlm)
```
As before, our estimates and final log likelihood value are about where they should be, and reflect the `lm` output, as the OLS estimates are the maximum likelihood estimates. The visualization becomes more difficult beyond two parameters, but we can examine slices similar to the previous plot.
```{r mlmisc, echo=FALSE}
detach(package:bbmle) # why are people still using MASS?!
# detach(package:MASS) # why are people still using MASS?!
plotdat = read_csv('data//mleEstimates.csv')
colnames(plotdat) = c('sigma', 'Intercept', 'β', 'll')
# plotdat = read.csv('data//mleEstimates.csv')
# colnames(plotdat) = c('sigma', 'Int', 'b1', 'll')
plotdat$nll = -plotdat$ll
plotdat %>%
select(-ll) %>%
pivot_longer(-c(sigma, nll), names_to = 'variable') %>%
ggplot() +
geom_point(
aes(x = sigma, y = value, size = nll),
color = palettes$orange$orange,
alpha = .15,
show.legend = FALSE,
position = position_jitter(w = 0.2, h = 0.2)
) +
facet_wrap(~ variable) + #, scales='free_y'
scale_size_continuous(range = c(.5, 5)) +
labs(x = expression(sigma)) +
theme_clean(center_axis_labels = TRUE) +
theme(
axis.ticks.y = element_blank(),
axis.text.x = element_text(color = 'gray25', size = 14),
axis.title.x = element_text(color = 'gray25', size = 24),
axis.title.y = element_text(color = 'gray25', vjust = 1),
plot.background = element_rect(fill = "transparent", colour = NA)
)
# ggsave('img/mlreg.svg', bg='transparent')
```
To move to generalized linear models, very little changes of the process. We will have a potentially different distribution assumed, and we are typically modeling a function of the target variable (e.g. $\log(y)=X\beta; \mu = e^{X\beta}$).
## Binomial Likelihood Example
This regards the example seen in the early part of the document with the hands-on example.
```{r binomLL, echo=-1}
set.seed(1234)
x1 = rbinom(1000, size = 10, p = .5)
x2 = rbinom(1000, size = 10, p = .85)
binomLL = function(theta, x) {
-sum(dbinom(x, size = 10, p = theta, log = TRUE))
}
optimize(binomLL,
x = x1,
lower = 0,
upper = 1)
mean(x1)
optimize(binomLL,
x = x2,
lower = 0,
upper = 1)
mean(x2)
```
## Modeling Languages
I will talk only briefly about a couple of the modeling language options available, as you will have to make your own choice among many.
### Bugs
When I first wrote this document, BUGS [@bugsbook] (*B*ayesian inference *U*sing *G*ibbs *S*ampling) was perhaps the most widely known and used Bayesian modeling language, and it has been around for over 30 years at this point. It is implemented via [OpenBUGS](http://www.openbugs.net/) and freely available for download[^winbugs]. It even has a GUI interface if such a thing is desired.
### JAGS
[JAGS](http://mcmc-jags.sourceforge.net/) (*J*ust *A*nother *G*ibbs *S*ampler) is a more recent dialect of the BUGS language, and is also free to use. It offers some technical and modeling advantages to OpenBUGs, but much of the code translates directly from one to the other.
### Nimble
[Nimble](https://r-nimble.org/) allows one to implement BUGS style models within R, but also is highly extensible and provides additional options.
### Stan
[Stan](http://mc-stan.org/) is a relative newcomer to Bayesian modeling languages, and as I write this it is about to have its 10 year anniversary. It uses a different estimation procedure than the BUGS language, and this makes it more flexible and perhaps better behaved for many types of models. It actually compiles Stan code to C++, and so can be very fast as well. In addition, it has interfaces for Python, Julia and more. I personally prefer it as I find it more clear in its expression, and there are many associated tools for model exploration.
### R
R has many modeling packages devoted to Bayesian analysis such that there is a [Task View](http://cran.r-project.org/web/views/Bayesian.html) specific to the topic. Most of them are even specific to the implementation of a certain type of analysis[^rbayes]. So not only can you do everything within R and take advantage of the power of those languages, you can then use Bayesian specific R packages on the results. For standard and even some complex models I would suggest using the <span class="pack">rstanarm</span> or <span class="pack">brms</span> packages as a way to stick to the usual R modeling style, unless you have a notably complicated model, at which point you can use <span class="pack">rstan</span>.
### General Statistical Packages
The general statistical languages such as SAS, SPSS, and Stata were generally *very* late to the Bayesian game, even for implementations of Bayesian versions of commonly used models. SAS and Stata are fairly capable from what I can tell (but there is [StataStan](http://mc-stan.org/interfaces/stata-stan)). Others still seem to be lacking as well. In general, I wouldn't recommend these packages for Bayesian analysis, except as an interface to one of the Bayesian specific languages, assuming they have the capability and you are already familiar with the program.
### Other Programming Languages
Python has functionality via modules such as PyMC, and Stan has a Python implementation, PyStan. Julia and Matlab both have Stan ports as well. And with any programming language that you might use for statistical analysis, you could certainly do a lot of it by hand if you have the time, though you should exhaust tested implementations first.
### Summary
In short, you have plenty of options. I would suggest starting with a Bayesian programming language or using that language within your chosen statistical environment or package. This gives you the most modeling flexibility, choice, and opportunity to learn.
## BUGS Example
The following provides a BUGS example[^bugswarning] of the primary model used in this document. The applicable code for the data set up is in the [Example: Linear Regression Model][Example: Linear Regression Model] section of the document. The model matrix X must be a <span class="objclass">matrix</span> class object. Next we setup a BUGS data list as we did with Stan, and create a text file that contains the model code. Note that the data list comprises simple characters which are used to look for objects of those names that are in the environment. Also, I use <span class="func">cat</span> with <span class="func">sink</span> so that I don't have to go to a separate text editor to create the file.
One of the big differences between BUGS and other languages is its use of the precision parameter $\frac{1}{\sigma^2}$, the inverse variance, usually denoted as $\tau$. While there were some computational niceties to be had in doing so, even the authors admit this was not a good decision in retrospect. Prepare to have that issue come up from time to time when you inevitably forget. Comments and assignments are the same as R, and distributions noted with $\sim$.
```{r bugsExampleDataSetup, eval=FALSE, echo=FALSE}
# set seed for replicability
set.seed(8675309)
# create a Nxk matrix of covariates
N = 250
K = 4
covariates = replicate(K-1, rnorm(n=N))
# create the model matrix with intercept
X = data.frame(Intercept=1, covariates)
# create a normally distributed target variable that is a function of the covariates
mu = with(X, 5 + .2*X1 - 1.5*X2 + .9*X3)
sigma = 2
y <- rnorm(N, mu, sigma)
# for comparison
modlm = lm(y~., X[,-1])
# Convert X to matrix for use with R2OpenBUGS
X = as.matrix(X)
```
```{r bugsBugsCodeSetup, eval=FALSE, echo=-(1:4)}
##################
### BUGS setup ###
##################
bugsdat = list('y', 'X', 'N', 'K')
# This will create a file, lmbugs.txt that will subsequently be called
sink('data/lmbugs.txt')
cat(
'model {
for (n in 1:N){
mu[n] <- beta[1]*X[n,1] + beta[2]*X[n,2] + beta[3]*X[n,3] + beta[4]*X[n,4]
y[n] ~ dnorm(mu[n], inv.sigma.sq)
}
for (k in 1:K){
beta[k] ~ dnorm(0, .001) # prior for reg coefs
}
# Half-cauchy as in Gelman 2006
# Scale parameter is 5, so precision of z = 1/5^2 = 0.04
sigma.y <- abs(z)/sqrt(chSq) # prior for sigma; cauchy = normal/sqrt(chi^2)
z ~ dnorm(0, .04)I(0,)
chSq ~ dgamma(0.5, 0.5) # chi^2 with 1 d.f.
inv.sigma.sq <- pow(sigma.y, -2) # precision
# sigma.y ~ dgamma(.001, .001) # prior for sigma; a typical approach used.
}'
)
sink()
# explicitly provided initial values not necessary, but one can specify them
# as follows, and you may have problems with variance parameters if you don't.
# Note also that sigma.y is unnecesary if using the half-cauchy approach as
# it is defined based on other values.
# inits <- list(list(beta=rep(0,4), sigma.y=runif(1,0,10)),
# list(beta=rep(0,4), sigma.y=runif(1,0,10)),
# list(beta=rep(0,4), sigma.y=runif(1,0,10)))
# parameters <- c('beta', 'sigma.y')
```
Now we are ready to run the model. You'll want to examine the help file for the <span class="func">bugs</span> function for more information. In addition, depending on your setup you may need to set the working directory and other options. Note that `n.thin` argument is used differently than other packages. One specifies the n posterior draws (per chain) you to keep want as n.iter-n.burnin. The thinned samples aren't stored. Compare this to other packages where n.iter is the total before thinning and including burn-in, and n.keep is (n.iter-n.burnin)/n.thin. With the function used here, n.keep is the same, but as far as arguments your you'll want to think of n.iter as the number of posterior draws *after* thinning. So the following all produce 1000 posterior draws in <span class="pack">R2OpenBUGS</span>:
```{r bugsIter, echo=FALSE, results='asis'}
tab = rbind(
c("n.iter = 3000", " n.thin = 1", " n.burnin = 2000"),
c("n.iter = 3000", " n.thin = 10", " n.burnin = 2000"),
c("n.iter = 3000", " n.thin = 100", " n.burnin = 2000")
)
kable(tab) %>%
kable_styling(full_width = FALSE)
```
In other packages, with those arguments you'd end up with 1000, 100, and 10 posterior draws.
```{r bugsRunModel, eval=FALSE, echo=5:15}
#####################
### Run the model ###
#####################
library(R2OpenBUGS)
lmbugs <- bugs(
bugsdat,
inits = NULL,
parameters = c('beta', 'sigma.y'),
model.file = 'lmbugs.txt',
n.chains = 3,
n.iter = 3000,
n.thin = 10,
n.burnin = 2000
)
# save.image('data/lmbugsResults.RData')
```
Now we are ready for the results, which will be the same as what we saw with Stan. In addition to the usual output, you get the *deviance information criterion* as a potential means for model comparison.
```{r bugsModelResults, eval=c(1, 5), echo=7, results='hold'}
load('data/lmbugsResults.RData')
print(lmbugs, digits = 3)
lmbugs$summary[, c(1:3, 5, 7:9)] %>%
kable(digits = 2) %>%
kable_styling(full_width = FALSE)
lmbugs$summary
```
The usual model diagnostics are available with conversion of the results to a general MCMC object, here shown via the the <span class="pack">posterior</span> package. You can then use <span class="pack">bayesplot</span> or other visualization tools. Figures are not shown, but they are the typical traceplots and density plots.
```{r bugsModelSamples, eval=FALSE}
bugs_samples = posterior::as_draws(lmbugs$sims.array)
posterior::summarise_draws(bugs_samples)
posterior::rstar(bugs_samples)
bayesplot::mcmc_trace(bugs_samples)
bayesplot::mcmc_dens(bugs_samples)
```
## JAGS Example
The following shows how to run the regression model presented earlier in the document via JAGS. Once you have the data set up as before, the data list is done in the same fashion as with BUGS. The code itself is mostly identical, save for the use of T instead of I for truncation. JAGS, being a BUGS dialect, also uses the precision parameter in lieu of the variance, which isn't annoying at all.
```{r jagsExample, eval=FALSE, echo=-1}
load('data/mainModels.RData')
jagsdat = list(
'y' = y,
'X' = X,
'N' = N,
'K' = K
)
sink('data/lmjags.txt')
cat(
'model {
for (n in 1:N){
mu[n] <- beta[1] * X[n, 1] + beta[2] * X[n, 2] + beta[3] * X[n, 3] + beta[4] * X[n, 4]
y[n] ~ dnorm(mu[n], inv.sigma.sq)
}
for (k in 1:K){
beta[k] ~ dnorm(0, .001)
}
# Half-cauchy as in Gelman 2006
# Scale parameter is 5, so precision of z = 1/5^2 = 0.04
sigma.y <- z/sqrt(chSq)
z ~ dnorm(0, .04)T(0,)
chSq ~ dgamma(0.5, 0.5)
inv.sigma.sq <- pow(sigma.y, -2)
}'
)
sink()
# explicitly provided initial values not necessary, but can specify as follows
# inits <- function() {
# list(beta = rep(0, 4), sigma.y = runif(1, 0, 10))
# }
parameters = c('beta', 'sigma.y')
```
With everything set, we can now run the model. With JAGS, we have what might be called an initialization stage that sets the model up and runs through the warm-up period, after which we can then flexibly sample from the posterior via the <span class="func">coda.samples</span> function.
```{r jagsRunModel, eval=FALSE, echo=-11}
library(rjags)
# library(coda)
lmjagsmod = jags.model(
file = 'data/lmjags.txt',
data = jagsdat,
n.chains = 3,
n.adapt = 2000
# inits = inits # if you use them
)
# update(lmjagsmod, 10000) # alternative approach
lmjags = coda.samples(
model = lmjagsmod,
n.iter = 10000,
thin = 10,
n.chains = 3,
variable.names = parameters
)
```
Now we have a model identical to the others, and can summarize the posterior distribution in similar fashion. As before, you can use the <span class="pack">posterior</span> and other packages as well.
```{r jagsModelResults, echo=-1, eval=FALSE}
load('data/jagsModel.RData')
summary(lmjags)
coda::effectiveSize(lmjags)
```
```{r jagsModelResults_pretty, echo=FALSE}
load('data/jagsModel.RData')
broom.mixed::tidyMCMC(lmjags, conf.int = TRUE) %>%
bind_cols(tibble(ESS = coda::effectiveSize(lmjags))) %>%
kable(align = 'lrrrr', digits = 2) %>%
kable_styling(full_width = FALSE)
```
```{r jagsModelDiag, eval=FALSE, echo=F}
# visualize
library(coda)
library(scales)
library(ggthemes)
traceplot(lmjags, col = alpha(gg_color_hue(3), .5))
densityplot(lmjags, col = alpha(gg_color_hue(3), .5))
plot(lmjags, col = alpha(gg_color_hue(3), .25))
corrplot:::corrplot(cor(lmjags[[2]])) # noticeably better than levelplot
par(mar = c(5, 4, 4, 2) + 0.1) # reset margins
```
## Metropolis Hastings Example
Next depicted is a random walk Metropolis-Hastings algorithm using the data and model from prior sections of the document. I had several texts open while cobbling together this code such as @gelman_bda, and some oriented towards the social sciences by @jeff_gill_bayesian_2008, @simon_jackman_bayesian_2009, and @scott_lynch_2007 etc. Some parts of the code reflect information and code examples found therein, and follows Lynch's code a bit more[^bettercode].
[^bettercode]: Some of these texts may have more recent editions, so consult those if interested.
The primary functions that we need to specify regard the posterior distribution[^MHexample], an update step for beta coefficients, and an update step for the variance estimate.<br><br>
```{r MHexample}
# posterior function
post = function(x, y, b, s2) {
# Args: X is the model matrix; y the target vector; b and s2 the parameters
# to be estimated
beta = b
sigma = sqrt(s2)
sigma2 = s2
mu = X %*% beta
# priors are b0 ~ N(0, sd = 10), sigma2 ~ invGamma(.001, .001)
priorbvarinv = diag(1/100, 4)
prioralpha = priorbeta = .001
if (is.nan(sigma) | sigma <= 0) { # scale parameter must be positive
return(-Inf)
}
# Note that you will not find the exact same presentation across texts and
# other media for the log posterior in this conjugate setting. In the end
# they are conceptually still (log) prior + (log) likelihood (See commented 'else')
else {
-.5*nrow(X)*log(sigma2) - (.5*(1/sigma2) * (crossprod(y-mu))) +
-.5*ncol(X)*log(sigma2) - (.5*(1/sigma2) * (t(beta)%*%priorbvarinv%*%beta)) +
-(prioralpha+1)*log(sigma2) + log(sigma2) - priorbeta/sigma2
}
# else {
# ll = mvtnorm::dmvnorm(
# y,
# mean = mu,
# sigma = diag(sigma2, length(y)),
# log = TRUE
# )
#
# priorb = mvtnorm::dmvnorm(
# beta,
# mean = rep(0, length(beta)),
# sigma = diag(100, length(beta)),
# log = TRUE
# )
#
# priors2 = dgamma(1 / sigma2, prioralpha, priorbeta, log = TRUE)
#
# logposterior = ll + priorb + priors2
# logposterior
# }
}
# update step for regression coefficients
updatereg = function(i, x, y, b, s2) {
# Args are the same as above but with additional i iterator argument.
b[i,] = MASS::mvrnorm(1, mu = b[i-1, ], Sigma = bvarscale) # proposal/jumping distribution
# Compare to past- does it increase the posterior probability?
postdiff =
post(x = x, y = y, b = b[i, ], s2 = s2[i-1]) -
post(x = x, y = y, b = b[i-1, ], s2 = s2[i-1])
# Acceptance phase
unidraw = runif(1)
accept = unidraw < min(exp(postdiff), 1) # accept if so
if (accept) b[i,]
else b[i-1,]
}
# update step for sigma2
updates2 = function(i, x, y, b, s2) {
s2candidate = rnorm(1, s2[i-1], sd = sigmascale)
if (s2candidate < 0) {
accept = FALSE
}
else {
s2diff =
post(x = x, y = y, b = b[i, ], s2 = s2candidate) -
post(x = x, y = y, b = b[i, ], s2 = s2[i-1])
unidraw = runif(1)
accept = unidraw < min(exp(s2diff), 1)
}
ifelse(accept, s2candidate, s2[i-1])
}
```
Now we can set things up for the MCMC chain[^MHmcmc]. Aside from the typical MCMC setup and initializing the parameter matrices to hold the draws from the posterior, we also require scale parameters to use for the jumping/proposal distribution.
```{r MHexampleSetup}
# Setup, starting values etc.
nsim = 5000
burnin = 1000
thin = 10
b = matrix(0, nsim, ncol(X)) # initialize beta update matrix
s2 = rep(1, nsim) # initialize sigma vector
# For the following this c term comes from BDA3 12.2 and will produce an
# acceptance rate of .44 in 1 dimension and declining from there to about
# .23 in high dimensions. For the sigmascale, the magic number comes from
# starting with a value of one and fiddling from there to get around .44.
c = 2.4/sqrt(ncol(b))
bvar = vcov(lm(y ~ ., data.frame(X[, -1])))
bvarscale = bvar * c^2
sigmascale = .9
```
We can now run and summarize the model with tools from the <span class="pack">coda</span> package.
```{r MHexampleRun, eval=F}
# Run
for (i in 2:nsim) {
b[i, ] = updatereg(
i = i,
y = y,
x = X,
b = b,
s2 = s2
)
s2[i] = updates2(
i = i,
y = y,
x = X,
b = b,
s2 = s2
)
}
# calculate acceptance rates
b_acc_rate = mean(diff(b[(burnin + 1):nsim,]) != 0)
s2_acc_rate = mean(diff(s2[(burnin + 1):nsim]) != 0)
tibble(b_acc_rate, s2_acc_rate)
# get final chain
library(posterior)
b_final = as_draws(b[seq(burnin + 1, nsim, by = thin), ])
s2_final = as_draws(matrix(s2[seq(burnin + 1, nsim, by = thin)]))
# get summaries; compare to lm and stan
summarise_draws(b_final)
summarise_draws(s2_final)
round(c(coef(modlm), summary(modlm)$sigma ^ 2), 3)
```
```{r MHexampleRun_pretty, echo=FALSE}
# Run
for (i in 2:nsim) {
b[i,] = updatereg(i = i, y = y, x = X, b = b, s2 = s2)
s2[i] = updates2(i = i, y = y, x = X, b = b, s2 = s2)
}
# calculate acceptance rates
b_acc_rate = mean(diff(b[(burnin + 1):nsim,]) != 0)
s2_acc_rate = mean(diff(s2[(burnin + 1):nsim]) != 0)
# tibble(b_acc_rate, s2_acc_rate)
# get final chain
library(posterior)
b_final = as_draws(
data.frame(
beta = b[seq(burnin + 1, nsim, by = thin), ],
s2 = s2[seq(burnin + 1, nsim, by = thin)]
)
)
```
```{r MHexampleRun_pretty_table, echo=FALSE}
posterior::summarise_draws(
b_final,
mean = mean,
sd = sd,
mcse = mcse_mean,
quantile = \(x, ...) quantile2(x, probs = c(.025, .975), ...),
default_convergence_measures()
# .args = list(probs = c(.025, .975)) # problem passing arg to quantile without any explicit call of quantile
) %>%
kable(align = 'lrrrrrrr', digits = 2) %>%
kable_styling(full_width = FALSE)
```
Here is the previous Stan fit for comparison.
```{r MHexampleCompare, echo=FALSE}
# print(fit, digits=3, prob=c(.025,.5,.975))
broom::tidy(fit, conf.int = TRUE) %>%
kable(align = 'lrrrr', digits = 2) %>%
kable_styling(full_width = FALSE)
```
## Hamiltonian Monte Carlo Example
The following demonstrates Hamiltonian Monte Carlo, the technique that Stan uses, and which is a different estimation approach than the Gibbs sampler in BUGS/JAGS. If you are interested in the details enough to be reading this, I highly recommend Betancourt's [conceptual introduction to HMC](https://arxiv.org/pdf/1701.02434.pdf). This example still assumes the data we used in this document, and is largely based on the code in the appendix of @gelman_bda.
First we start with the functions.
```{r HMCexample}
# Log posterior function
log_p_th = function(X, y, theta) {
# Args: X is the model matrix; y the target vector; theta are the current
# parameter estimates.
beta = theta[-length(theta)] # reg coefs to be estimated
sigma = theta[length(theta)] # sigma to be estimated
sigma2 = sigma^2
mu = X %*% beta
# priors are b0 ~ N(0, sd=10), sigma2 ~ invGamma(.001, .001)
priorbvarinv = diag(1/100, 4)
prioralpha = priorbeta = .001
if (is.nan(sigma) | sigma<=0) { # scale parameter must be positive, so posterior
return(-Inf) # density is zero if it jumps below zero
}
# log posterior in this conjugate setting. conceptually it's
# (log) prior + (log) likelihood. (See commented 'else')
else {
-.5*nrow(X)*log(sigma2) - (.5*(1/sigma2) * (crossprod(y-mu))) +
-.5*ncol(X)*log(sigma2) - (.5*(1/sigma2) * (t(beta)%*%priorbvarinv%*%beta)) +
-(prioralpha+1)*log(sigma2) + log(sigma2) - priorbeta/sigma2
}
# else {
# ll = mvtnorm::dmvnorm(
# y,
# mean = mu,
# sigma = diag(sigma2, length(y)),
# log = TRUE
# )
# priorb = mvtnorm::dmvnorm(
# beta,
# mean = rep(0, length(beta)),
# sigma = diag(100, length(beta)),
# log = TRUE
# )
# priors2 = dgamma(1/sigma2, prioralpha, priorbeta, log = TRUE)
# logposterior = ll + priorb + priors2
# logposterior
# }
}
# numerical gradient function as given in BDA3 p. 602; same args as posterior
gradient_theta_numerical = function(X, y, theta) {
d = length(theta)
e = .0001
diffs = numeric(5)
for (k in 1:d) {
th_hi = theta
th_lo = theta
th_hi[k] = theta[k] + e
th_lo[k] = theta[k] - e
diffs[k] = (log_p_th(X, y, th_hi) - log_p_th(X, y, th_lo)) / (2*e)
}
return(diffs)
}
# single HMC iteration function
hmc_iteration = function(
X,
y,
theta,
epsilon,
L,
M
) {
# Args: epsilon is the stepsize; L is the number of leapfrog steps; epsilon
# and L are drawn randomly at each iteration to explore other areas of the
# posterior (starting with epsilon0 and L0); M is a diagonal mass matrix
# (expressed as a vector), a bit of a magic number in this setting. It regards
# the mass of a particle whose position is represented by theta, and momentum
# by phi. See the sampling section of chapter 1 in the Stan manual for more
# detail.
M_inv = 1/M
d = length(theta)
phi = rnorm(d, 0, sqrt(M))
th_old = theta
log_p_old = log_p_th(X, y, theta) - .5*sum(M_inv*phi^2)
phi = phi + .5 * epsilon * gradient_theta_numerical(X, y, theta)
for (l in 1:L) {
theta = theta + epsilon*M_inv*phi
phi = phi + ifelse(l == L, .5, 1) *
epsilon * gradient_theta_numerical(X, y, theta)
}
# here we get into standard MCMC stuff, jump or not based on a draw from a
# proposal distribution
phi = -phi
log_p_star = log_p_th(X, y, theta) - .5 * sum(M_inv * phi^2)
r = exp(log_p_star - log_p_old)
if (is.nan(r)) r = 0
p_jump = min(r, 1)
if (runif(1) < p_jump) {
th_new = theta
}
else {
th_new = th_old
}
return(list(theta = th_new, p_jump = p_jump)) # returns estimates and acceptance rate
}
# main HMC function
hmc_run = function(
starts, # starting values for theta
iter, # total number of simulations for each chain; chain is based on the dimension of starts
warmup, # determines initial iterations that will be ignored for inference purposes
epsilon_0, # the baseline stepsize
L_0, # the baseline number of leapfrog steps
M, # the mass vector
X,
y
) {
# Args:
chains = nrow(starts)
d = ncol(starts)
sims = array(NA, c(iter, chains, d), dimnames = list(NULL, NULL, colnames(starts)))
p_jump = matrix(NA, iter, chains)
for (j in 1:chains) {
theta = starts[j, ]
for (t in 1:iter) {
epsilon = runif(1, 0, 2 * epsilon_0)
L = ceiling(2 * L_0 * runif(1))