---
title: "Prediction evaluation metric: loss functions"
author: "Zhang Jinxiong [email protected]"
date: "2020/5/3"
output:
html_document:
toc: yes
pdf_document:
toc: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Prediction evaluation metric: loss functions
In general, [learning consists of three components](https://dl.acm.org/doi/pdf/10.1145/2347736.2347755):
$$\text{Learning=Representation + Evaluation + Optimization}.$$
[The prediction evaluation metric should be selected to reflect domain-specific considerations, such as the types of errors that are more costly.](https://www.stat.berkeley.edu/~binyu/ps/papers2020/VDS20-YuKumbier.pdf)
[In fact, there is an entire area of research devoted to evaluating the quality of probabilistic forecasts through “scoring rules”.](https://www.stat.washington.edu/raftery/Research/PDF/Gneiting2007jasa.pdf)
Even though Zongben Xu proposes [the independence assumption of the loss function on the dataset](https://dl.acm.org/doi/pdf/10.1145/3397271.3402428),
we should keep in mind that loss function design depends on the `errors`.
- https://rohanvarma.me/Loss-Functions/
- [Probabilistic Setup and Empirical Risk Minimization](https://ttic.uchicago.edu/~tewari/lectures/lecture7.pdf)
- [10: Empirical Risk Minimization](https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote10.html)
- [Principles of risk minimization for learning theory](http://papers.nips.cc/paper/506-principles-of-risk-minimization-for-learning-theory.pdf)
- https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html
************
In the absence of prior knowledge about the data, we can only use general-purpose metrics such as Euclidean distance, cosine similarity or Manhattan distance,
but these metrics often fail to capture the correct behavior of the data, which directly affects the performance of the learning algorithm.
The solution to this problem is to tune the metric according to the data and the problem.
Manually deriving a metric for high-dimensional data, which is often difficult to even visualize, is not only tedious but extremely difficult,
which motivates the effort to learn a metric that respects the data geometry.
[The goal of a `metric learning` algorithm is to learn a metric that assigns a small distance to similar points and a relatively large distance to dissimilar points.](https://parajain.github.io/metric_learning_tutorial/)
- https://parajain.github.io/metric_learning_tutorial/
- https://en.wikipedia.org/wiki/Similarity_learning
Loss function| Cost function | Empirical risk| Structure risk
-----|----|---|---
$\ell(y, f(x;\theta))$ between the response $y$ of the supervisor to a given input $x$ and the response $f(x;\theta)$ provided by the learning machine|the objective function to minimize|$\frac{1}{N}\sum_{i=1}^N\ell(y_i , f(x_i;\theta))$ approximation to the expected value of the loss | $\frac{1}{N}\sum_{i=1}^N\ell(y_i , f(x_i;\theta))+\mathcal{R}(f)$ regularized empirical risk
We use the formula $f(x;\theta)$ to predict the response to the input $x$, where the model is parameterized by $\theta\in\Theta$.
The formula $f(\cdot;\cdot)$ is the representation of the machine learning algorithm.
We use the `structural risk minimization (SRM)` principle to find the optimal parameters $\theta^{\ast}$, i.e., to select
the best model over the representation space $\mathcal{F}=\{f(x;\theta)\mid \theta\in \Theta\}$:
$$\theta^{\ast}=\arg\min_{\theta\in\Theta} \frac{1}{N}\sum_{i=1}^N\ell(y_i , f(x_i;\theta))+\mathcal{R}(f)$$
where $\ell(y_i , f(x_i;\theta))$ is the predictive metric with respect to the input $x_i$ and its ground truth $y_i$.
Here we will focus on the design of the predictive metric for different tasks.
After optimization, we obtain the (sub)optimal model $f(x;\theta^{\ast})$.
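As a minimal sketch (an assumed example, not from the original text), the regularized empirical risk of a one-parameter linear model $f(x;\theta)=\theta x$ with squared loss and penalty $\mathcal{R}(f)=\lambda\theta^2$ can be minimized numerically:
```{r srm-sketch}
# Regularized empirical risk for f(x; theta) = theta * x with
# squared loss and R(f) = lambda * theta^2, minimized with optim()
set.seed(0)
x <- runif(50)
y <- 2 * x + rnorm(50, sd = 0.2)   # simulated data, true theta = 2
lambda <- 0.1                      # hypothetical regularization constant
risk <- function(theta) mean((y - theta * x)^2) + lambda * theta^2
optim(par = 0, fn = risk, method = "BFGS")$par
```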
Now we turn to the error decomposition of the trained model $f(x;\theta^{\ast})$ and the oracle model $\mathcal{o}$:
$$\|f(x;\theta^{\ast}) - \mathcal{o}\|=\|f(x;\theta^{\ast}) -f^{\ast}+ f^{\ast}-f_{*}+f_{*}-\mathcal{o}\|\\$$
$$\leq \underbrace{\|f(x;\theta^{\ast}) -f^{\ast}\|}_{\text{Computational error}} + \underbrace{\|f^{\ast}-f_{*}\|}_{\text{Approximation error}}+\underbrace{\|f_{*}-\mathcal{o}\|}_{\text{Representation error}}$$
where $f^{\ast}$ is the best model based on the training data set; $f_*$ is the best model constrainted in the hypothesis space $\mathcal{F}=\{f(x;\theta)\mid \theta\in \Theta\}$.
The predictive metric, i.e. the loss function $\ell(\cdot, \cdot)$, directly determines the difficulty and complexity of the optimization problem as well as the computational error.
***
Beyond supervised learning, we also need evaluation metrics that reflect how faithful the algorithms are, such as the goodness of fit.
If all models are approximations of the truth, what is the criterion or principle for choosing the best one?
As [Andrew R. Barron](http://www.stat.yale.edu/~arb4/publications.html) quotes
> [Thus learning is not possible without inductive bias, and now the question is how to choose the right bias. This is called `model selection`.](http://www.modelselection.org/)
- http://www.modelselection.org/
- https://www.ssc.wisc.edu/~bhansen/
- http://www.stat.yale.edu/~arb4/
- http://sp.cs.tut.fi/WITMSE08/
- [Modern MDL meets Data Mining — Insights, Theory, and Practice](http://eda.mmci.uni-saarland.de/events/mdldm19)
- https://cispa.de/en
- [Distance Geometry in data science](https://www.lix.polytechnique.fr/~liberti/dgds.pdf)
- http://iphome.hhi.de/samek/pdf/LapNCOMM19.pdf
## Regression loss
### Maximum likelihood estimation
Let us begin with the maximum likelihood estimate (mle).
In statistics, the **likelihood principle** is the proposition that,
given a statistical model, all the evidence in a sample relevant to model parameters is contained in the likelihood function.
In maximum likelihood estimation (mle), the samples $\{x_i\}_{i=1}^{N}$ are drawn independently and identically from a probability distribution $\mathcal{P}(\theta)$,
so that we can obtain their joint probability
$$P(x_1,\cdots,x_n)=\prod_{i=1}^N P(\theta;x_i)$$
where $P(\theta;x_i)\in [0, 1]$ is the probability of $x_i$.
It is also called the likelihood of the parameter $\theta$.
Here $P(\theta;x_i)$ is the value of the probability (distribution) function at the point $x_i$,
and $P(\theta;x)$ is parameterized by $\theta\in \Theta$.
In other words, we know the distribution family $\mathcal P$ parameterized by $\theta\in \Theta$.
Because we have observed samples $x_i$ generated from the probability (distribution) family $P(\theta)$,
the joint probability of $\{x_i\}_{i=1}^{N}$ cannot be zero.
If $P(x_1,\cdots,x_n)$ is too small, the event is too rare to observe;
thus it should be common enough for us to observe.
The maximum likelihood estimate (mle) of $\theta$ is the one that maximize the joint probability
\begin{equation}\label{mle}
\theta^{\ast}=\arg\max_{\theta\in \Theta} P(x_1,\cdots,x_n)=\arg\max_{\theta\in \Theta} \prod_{i=1}^N P(\theta;x_i).\end{equation}
```{r likelihood, echo=FALSE, fig.height=4, fig.width=4, paged.print=TRUE}
# Binomial likelihood (up to a constant) for 30 successes in 100 trials
q <- seq(0, 1, length = 100)
L <- function(q) { q^30 * (1 - q)^70 }
plot(q, L(q), ylab = "L(q)", xlab = "q", type = "l")
```
Because likelihoods may be very small numbers,
the product of the likelihoods can get very small very fast and can exceed the precision of the computer.
Because we only care about the relative likelihood,
the convention is to work with the log-likelihoods instead,
because `the log of a product is equal to the sum of the logs`.
Therefore, we can take the logarithm of each individual likelihood and add them together
and get the same end result, i.e.,
the parameter value that maximizes the likelihood ($L$) is equal to the parameter value that maximizes the log-likelihood $\ell$:
\begin{equation}\label{log-likelihood}\ell=\log(\prod_{i=1}^N P(\theta;x_i))=\sum_{i=1}^{N}\log P(\theta;x_i)\end{equation}
where $\log P(\theta;x_i)\leq 0$.
Because the logarithm function is monotonically increasing,
$$\theta^{\ast}=\arg\max_{\theta\in \Theta} P(x_1,\cdots,x_n)=\arg\max_{\theta\in \Theta} \sum_{i=1}^{N}\log P(\theta;x_i).$$
```{r log-likelihood, echo=FALSE, fig.height=4, fig.width=4, paged.print=TRUE}
# Relative likelihood L(q)/L(qhat), normalized by its maximum at qhat = 0.3
q <- seq(0, 1, length = 100)
L <- function(q) { q^30 * (1 - q)^70 }
plot(q, L(q)/L(0.3), ylab = "L(q)/L(qhat)", xlab = "q", type = "l")
```
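As a minimal numerical check (a sketch, not part of the original text), the log-likelihood plotted above can be maximized with `optimize()`; the maximizer should be close to the analytic value $0.3$.
```{r mle-numeric}
# Numerical maximization of the log-likelihood for 30 successes
# out of 100 trials; the analytic maximizer is 30/100 = 0.3
loglik <- function(q) 30 * log(q) + 70 * log(1 - q)
optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum
```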
Given our goodness(or badness)-of-fit measure, our next step is to find the values of the parameters
that give us the best fit – the so-called `maximum likelihood estimators`.
In short, we convert the parameter inference into the numerical optimization.
* [Likelihood Ratios, Likelihoodism, and the Law of Likelihood](https://plato.stanford.edu/entries/logic-inductive/sup-likelihood.html)
* <https://algorithmia.com/blog/introduction-to-loss-functions>
* [Analysis of Environmental Data Conceptual Foundations: Maximum Likelihood Inference](https://www.umass.edu/landeco/teaching/ecodata/schedule/likelihood.pdf)
Note the following relation
$$\theta^{\ast}=\arg\max_{\theta\in \Theta} \sum_{i=1}^{N}\log P(\theta;x_i) = \arg\max_{\theta\in \Theta}\frac{1}{N} \sum_{i=1}^{N}\log P(\theta;x_i)$$
where $N$ is a constant positive number.
Now we regard the term $\log P(\theta;x_i)$ as a transformation of the observation,
$$\theta^{\ast}
=\arg\max_{\theta\in \Theta}\mathbb{E}_{x\sim \hat{p}(x)} \log P(\theta;x)\\
=\arg\min_{\theta\in \Theta} -\mathbb{E}_{x\sim \hat{p}(x)} \log P(\theta;x)\\
=\arg\min_{\theta\in \Theta} \mathbb{E}_{x\sim \hat{p}(x)} -\log P(\theta;x)\\
=\arg\min_{\theta\in \Theta} \mathbb{E}_{x\sim \hat{p}(x)} \log\frac{1}{P(\theta;x)}.
$$
As $N\to\infty$, this average log-likelihood tends, with probability 1, to
$$\mathbb{E}_{x\sim \hat{p}(x)} \log\frac{1}{P(\theta;x)}=\mathbb{E}_{x\sim {p}^{true}(x)} \log\frac{1}{P(\theta;x)}\\
=\int_{x}p^{true}(x)\log\frac{1}{p(\theta;x)}\mathrm dx\\
=\int_{x}p^{true}(x)\log\frac{1}{p(\theta;x)}\mathrm dx+\underbrace{\int_{x}p^{true}(x)\log{p^{true}(x)}\mathrm dx}_{constant}-\underbrace{\int_{x}p^{true}(x)\log{p^{true}(x)}\mathrm dx}_{constant}\\
=\int_{x}p^{true}(x)\log\frac{p^{true}(x)}{p(\theta;x)}\mathrm dx- \underbrace{\int_{x}p^{true}(x)\log{p^{true}(x)}\mathrm dx}_{constant}\\
=KL(p^{true}(x)\mid p(\theta;x))-\underbrace{\int_{x}p^{true}(x)\log{p^{true}(x)}\mathrm dx}_{constant}$$
where ${p}^{true}(x)$ is the true probability distribution function of $x$.
So we conclude that
$$\theta^{\ast}=\arg\min_{\theta\in \Theta} KL(p^{true}(x)\mid p(\theta;x))=\arg\min_{\theta\in \Theta}\mathbb{E}_{x\sim {p}^{true}(x)}\log\frac{p^{true}(x)}{p(\theta;x)}$$
as $N\to\infty$.
In this sense, mle is to find the optimal approximation to the true probability distribution function constrained in the probability distribution function family $p(\theta;\cdot)$ given some observation.
- [Alternative form of max likelihood ](https://cedar.buffalo.edu/~srihari/CSE676/5.5%20MLBasics-MaxLikelihood.pdf)
- [Maximum Likelihood as minimizing KL Divergence](https://www.jessicayung.com/maximum-likelihood-as-minimising-kl-divergence/)
- [Why Minimize Negative Log Likelihood?](https://quantivity.wordpress.com/2011/05/23/why-minimize-negative-log-likelihood/)
------
The loss function of mle is based on the probability distribution family.
For example, $x_i\in \{0, 1\}\sim Bernoulli(p)$, the log-likelihood of $\{x_i\}_{i=1}^N$ is
$$\ell(p)=\log(\prod_{i=1}^{N}p^{x_i}(1-p)^{1-x_i})=\sum_{i=1}^N (x_i\log p+(1-x_i)\log (1-p))\\=\sum_{i=1}^N(x_i\log\frac{p}{1-p}+\log(1-p))\\=(\sum_{i=1}^Nx_i)\log\frac{p}{1-p}+N\log(1-p).$$
To find the mle, we set the derivative of $\ell(p)$ to zero
$$\frac{\mathrm d \ell(p)}{\mathrm d p}=(\sum_{i=1}^Nx_i)(\frac{1}{p}+\frac{1}{1-p})-\frac{N}{1-p}=0$$
and we get $p^{\ast}=\arg\max_{p}\ell(p)=\frac{\sum_{i=1}^Nx_i}{N}$.
For example, $x_i\sim Poisson(\lambda)$, the log-likelihood of $\{x_i\}_{i=1}^N$ is
$$\ell(\lambda)=\log(\prod_{i=1}^{N}\frac{\lambda^{x_i}\exp(-\lambda)}{x_i!})=\sum_{i=1}^N [x_i\log(\lambda)-\log x_i!]-N\lambda\propto \sum_{i=1}^N x_i\log(\lambda)-N\lambda$$
To find the mle, we set the derivative of $\ell(\lambda)$ to zero
$$\frac{\mathrm d \ell(\lambda)}{\mathrm d \lambda}=\sum_{i=1}^N\frac{x_i}{\lambda}-N=0$$
and we get $\lambda=\frac{\sum_{i=1}^N x_i}{N}$.
Let $x_1, \cdots , x_N$ be i.i.d. samples with Laplace density
$$P(\theta\mid \cdot)=\frac{1}{2}\exp(-\|x-\theta\|_1)$$
where $\theta\in\mathbb{R}$.
Their log-likelihood is
$$\ell(\theta)=\log(\prod_{i=1}^NP(\theta\mid x_i))=\log(\prod_{i=1}^N\frac{1}{2}\exp(-\|x_i-\theta\|_1))=-N\log(2)-\sum_{i=1}^N\|x_i-\theta\|_1$$
Observe that
$$\hat{\theta}=\arg\max_{\theta}\ell(\theta)=\arg\max_{\theta}-\sum_{i=1}^N\|x_i-\theta\|_1\\=\arg\min_{\theta}\underbrace{\sum_{i=1}^N\|x_i-\theta\|_1}_{\text{convex}}.$$
According to convex optimization, $0\in\partial \sum_{i=1}^N\|x_i-\hat\theta\|_1$
thus $\sum_{i=1}^N\operatorname{sgn}(x_i-\hat\theta)=0\implies \hat\theta=\operatorname{median}\{x_1,\cdots,x_N\}$.
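We can verify this numerically with a minimal sketch (simulated data, not from the original text); Laplace noise is simulated as the difference of two exponential variables.
```{r laplace-median}
# The minimizer of sum |x_i - theta| should coincide with the median
set.seed(42)
x <- rexp(200) - rexp(200)     # Laplace(0, 1) samples
l1loss <- function(theta) sum(abs(x - theta))
opt <- optimize(l1loss, interval = range(x))
c(numerical = opt$minimum, median = median(x))
```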
- https://www.colorado.edu/amath/sites/default/files/attached-files/ch4.pdf
- https://rpubs.com/FJRubio/logisMLE
- https://shodhganga.inflibnet.ac.in/bitstream/10603/21266/12/12_chapter%206.pdf
### Bayesian mle
Sometimes, domain-specific experience will tell us how to select appropriate parameters of the probability distribution functions.
[For example, adult male heights are on average 70 inches (5'10) with a standard deviation of 4 inches. Adult women are on average a bit shorter and less variable in height with a mean height of 65 inches (5'5) and standard deviation of 3.5 inches. ](https://www.usablestats.com/lessons/normal)
However, we know that human heights cannot be negative.
It is better to use the truncated normal distribution for adult heights.
The parameters must be compatible with the domain-specific experience.
[There's one key difference between frequentist statisticians and Bayesian statisticians that we first need to acknowledge before we can even begin to talk about how a Bayesian might estimate a population parameter $\theta$.](https://online.stat.psu.edu/stat414/node/241/)
Bayesian describes the mapping from prior beliefs about $\theta$, summarized in so-called prior density of $P(\theta)$, to new posterior beliefs in the light of observing the data.
[Maximum A Posteriori (MAP) estimator](https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/ppt/22-MAP.pdf) of $\theta$ is to maximize the posteriori of $\theta$
$$\theta^{\ast}=\arg\max_{\theta\in \Theta}P(\theta\mid x)\\=\arg\max_{\theta\in \Theta}\frac{P(\theta, x)}{P(x)}\\=\arg\max_{\theta\in \Theta}\frac{P(x\mid \theta)P(\theta)}{P(x)}\\=\arg\max_{\theta\in \Theta} P(x\mid \theta)P(\theta)$$
where $P(x)$ does not depend on the parameter $\theta$.
`Bayesian maximum likelihood estimate` is to maximize the likelihood with a penalty function
\begin{equation}\label{Bayesian mle}\theta^{\ast}=\arg\max_{\theta\in \Theta}\sum_{i=1}^{N}\log P(\theta;x_i)+\log P(\theta)\end{equation}
where $\sum_{i=1}^{N}\log P(\theta;x_i)$ is the likelihood of the parameter $\theta$ and $P(\theta)$ is the prior probability density function of $\theta$.
```{r Bayesian-likelihood, echo=FALSE, fig.height=4, fig.width=6, paged.print=TRUE}
# Log-likelihood of 30 successes in 100 trials plus the log of a
# normal prior density on q
q <- seq(0, 1, length = 100)
l <- function(q) { 30 * log(q) + 70 * log(1 - q) + log(dnorm(q)) }
plot(q, l(q), ylab = "Bayesian log-likelihood", xlab = "q", type = "b")
```
MAP estimate is the `mode of the posterior distribution`.
Another option of Bayesian estimate would be to choose the `posterior mean`
\begin{equation}\label{MMSE}\hat{\theta}=\mathbb E[\theta|X=x]=\int_{\theta\in\Theta}\theta P(\theta\mid x)\mathrm d \theta.\end{equation}
It is called minimum mean squared error (MMSE) estimate of the random variable $\theta$ given $x$.
- [Bayesian Maximum Likelihood](http://faculty.wcas.northwestern.edu/~lchrist/course/CIED_2012/bayesian.pdf)
- [Maximum Likelihood vs.Bayesian Estimation](http://www-edlab.cs.umass.edu/cs689/lectures/ml-vs-bayes-estimation.pdf)
- [9.1.2 Maximum A Posteriori (MAP) Estimation](https://www.probabilitycourse.com/chapter9/9_1_2_MAP_estimation.php)
- [9.1.4 Conditional Expectation (MMSE)](https://www.probabilitycourse.com/chapter9/9_1_4_conditional_expectation_MMSE.php)
- [ML, MAP, and Bayesian — The Holy Trinity of Parameter Estimation and Data Prediction](https://engineering.purdue.edu/kak/Trinity.pdf)
-----
Suppose that $Y$ follows a binomial distribution with parameters $n$ and $p = \theta$, so that the p.m.f. of $Y$ given $\theta$ is:
$$g(y|\theta) = \binom{n}{y}\theta^y(1-\theta)^{n-y}$$
for $y = 0, 1, 2, \cdots, n$.
Suppose that the prior p.d.f. of the parameter $\theta$ is the beta p.d.f., that is:
$$h(\theta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}$$
for $0 < \theta < 1$.
Find the posterior p.d.f of $\theta$, given that $Y = y$.
First we obtain the joint density function by multiplying the prior and conditional probability
$$P(y,\theta)=g(y|\theta)h(\theta)=\binom{n}{y}\theta^y(1-\theta)^{n-y}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}\\=\underbrace{\binom{n}{y}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}}_{\text{normalization factor}}\theta^{y+\alpha-1}(1-\theta)^{n-y+\beta-1}$$
and
$$\theta^{\ast}=\arg\max_{\theta\in (0, 1)}\log(P(y,\theta))=\arg\max_{\theta\in (0, 1)}(y+\alpha-1)\log(\theta)+(n-y+\beta-1)\log(1-\theta).$$
By setting the derivative of this objective to 0, we obtain the following equation
$$(y+\alpha-1)(1-\theta)-\theta(n-y+\beta-1)=y+\alpha-1-\theta(n-y+\beta-1+y+\alpha-1)=0$$
so we obtain $\theta^{\ast}=\frac{y+\alpha-1}{n+\alpha+\beta-2}$.
And $P(\theta\mid y)=\frac{P(y,\theta)}{\int_{\theta}P(y,\theta)\mathrm d \theta}\propto \theta^{y+\alpha-1}(1-\theta)^{n-y+\beta-1}$.
In other words, $\theta\mid y$ follows a $Beta(y+\alpha,\ n-y+\beta)$ distribution, and
$$\mathbb E(\theta\mid y)=\frac{y+\alpha}{n+\alpha+\beta}.$$
- [Notes for 6.864, ML vs. MAP (Sept. 24, 2009)](http://people.csail.mit.edu/regina/6864/mlvsmap.pdf)
- [Bayesian Estimation](https://online.stat.psu.edu/stat414/node/241/)
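The Beta-binomial example above can be checked numerically; the following sketch (with hypothetical values of $n$, $y$, $\alpha$, $\beta$) plots the posterior density and marks the MAP estimate and the posterior mean.
```{r beta-binomial-check, fig.height=4, fig.width=6}
# Posterior Beta(y + alpha, n - y + beta); the MAP estimate is
# (y + alpha - 1)/(n + alpha + beta - 2) and the posterior mean is
# (y + alpha)/(n + alpha + beta)
n <- 100; y <- 30; alpha <- 2; beta <- 2
map_est   <- (y + alpha - 1) / (n + alpha + beta - 2)
post_mean <- (y + alpha) / (n + alpha + beta)
theta <- seq(0.001, 0.999, length = 200)
plot(theta, dbeta(theta, y + alpha, n - y + beta), type = "l",
     xlab = "theta", ylab = "posterior density")
abline(v = map_est, col = 2)    # MAP (red)
abline(v = post_mean, col = 4)  # posterior mean (blue)
```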
------
The mle only takes the samples $\{x_i\}_{i=1}^N$ into consideration
while Bayesian mle also considers the parameters.
mle|Bayesian mle
---|----
point estimation of density function|Bayesian inference
$\arg\max_{\theta\in\Theta}P(x\mid\theta)$|$\arg\max_{\theta\in\Theta}P(x\mid\theta)P(\theta)$
$\theta$ is as the unknown parameter of the probability $P(x\mid \theta)$. | $\theta$ is the latent random factor to measure the rareness of the model $P(\cdot\mid \theta)$.
Bayesian mle is to find the frequent pattern of the model.
MAP answers the question: what is the most probable parameter, under the belief $P(\theta)$,
given that we have observed a sequence of samples generated by a model parameterized by $\theta$?
Bayesian estimation constrains the probability distribution of the parameter $\theta$,
ruling out some possibilities.
The prior $P(\theta)$ penalizes rare values of the parameters.
The core of Bayesian statistics is to use the prior probability of the parameters to decrease their uncertainty.
According to the product rule of probability, we can find that
$$P(x,\theta)=P(x\mid \theta)P(\theta)=P( \theta\mid x)P(x)\implies P( \theta\mid x)=\frac{P(x\mid \theta)P(\theta)}{P(x)}\propto P(x\mid \theta)P(\theta).$$
It is summarized as follows:
`posterior is proportional to likelihood times prior`.
In standard Bayesian notation, we use $\pi(\theta)$ to denote the `prior probability density function (pdf)` of parameter $\theta$ with support $S(\theta)$,
$L(y|\theta)$ the `likelihood function` (i.e. the pdf of data given the parameter) with support $S(Y |\theta)$,
$p(\theta|y)$ the `posterior pdf` with support $S(\theta\mid Y)$ of parameter given the data,
and $f(y)$ the unconditional pdf for the data with support $S(Y)$. Both $\theta$ and $y$ can be vectors.
From the joint pdf identity, $L(y|\theta)\pi(\theta) = p(\theta|y)f(y)$, the `Bayes formula`
$$ p(\theta|y)=\frac{L(y|\theta)\pi(\theta)}{f(y)}=\frac{L(y|\theta)\pi(\theta)}{\int_{S(\theta)}L(y|\theta)\pi(\theta)\mathrm d\theta}.$$
We can re-write the above joint pdf identity as $f(y)=\frac{L(y|\theta)\pi(\theta)}{p(\theta|y)}$.
Now for any fixed $\theta$, we can integrate both sides of the re-expressed joint pdf identity with respect to $y$
over $S(Y\mid \theta)$ and obtain the prior pdf at $\theta$
$$\int_{y\in S(Y\mid \theta)}f(y)\mathrm d y=\pi(\theta)\int_{y\in S(Y\mid \theta)}\frac{L(y|\theta)}{p(\theta|y)}\mathrm d y \\ \Downarrow \\ \pi(\theta)=\frac{\int_{y\in S(Y\mid \theta)}f(y)\mathrm d y}{\int_{y\in S(Y\mid \theta)}\frac{L(y|\theta)}{p(\theta|y)}\mathrm d y}=\int_{y\in S(Y\mid \theta)}f(y)\mathrm d y(\int_{y\in S(Y\mid \theta)}\frac{L(y|\theta)}{p(\theta|y)}\mathrm d y)^{-1}.$$
Under so-called positivity assumption, we will obtain the [Inversion of Bayes formula](https://www.intlpress.com/site/pub/files/_fulltext/journals/sii/2011/0004/0001/SII-2011-0004-0001-a010.pdf):
\begin{equation}\label{IBF}\pi(\theta)=(\int_{y\in S(Y\mid \theta)}\frac{L(y|\theta)}{p(\theta|y)}\mathrm d y)^{-1}\end{equation}
for $\theta \in S(\theta)$.
If we prefer some posterior pdf, we can use Inversion of Bayes formula to find a proper $\pi(\theta)$.
* [Inversion of Bayes formula](https://www.intlpress.com/site/pub/files/_fulltext/journals/sii/2011/0004/0001/SII-2011-0004-0001-a010.pdf)
* [Inverse Bayes Formulae (IBF)](http://web.hku.hk/~kaing/Section1_3.pdf)
* [Unexpected Journey to the Converse of Bayes’ Theorem](http://web.hku.hk/~kaing/HKSSinterview.pdf)
### Linear Regression
Now we turn to regression problem.
Simply speaking, regression is to find the relationship of variables.
The common setting of regression problem is training data set $(x_i, y_i)$ for $i=1,\cdots, N$
and the model space $\mathcal{M}=\{m(\theta)\mid \theta\in\Theta\}$, where $x_i\in\mathbb{R}^d$ and $y_i\in\mathbb{R}$.
And there is an oracle $m_o\in \mathcal{M}$ so that
$$y_i=m_o(x_i)+\varepsilon_i\tag{regression}$$
for $i=1,\cdots, N$, where $\{\varepsilon_i\}_{i=1}^N$ are random variables independently and identically distributed in some probability family.
```{r regression, echo=FALSE, fig.height=4, fig.width=6, paged.print=TRUE}
x <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120)
y <- c(10, 18, 25, 29, 30, 28, 25, 22, 18, 15, 11, 8)
fit <- lm(y ~ poly(x, 3)) ## polynomial of degree 3
plot(x, y) ## scatter plot (colour: black)
x0 <- seq(min(x), max(x), length = 20) ## prediction grid
y0 <- predict.lm(fit, newdata = list(x = x0)) ## predicted values
lines(x0, y0, col = 2) ## add regression curve (colour: red)
#https://stackoverflow.com/questions/39736847/plot-regression-line-in-r
```
[`The method of least squares` is about estimating parameters by minimizing the `squared discrepancies` between observed data, on the one hand, and their expected values on the other.](https://stat.ethz.ch/~geer/bsa199_o.pdf)
The least squares estimator, denoted by $\hat{\beta}$, is the value of $\theta$ that minimizes
$$\sum_{i=1}^{N}(y_i-m(x_i;\theta))^2.\tag{least squares method}$$
In another word,
$$\hat{\beta}=\arg\min_{\theta\in\Theta}\sum_{i=1}^{N}(y_i-m(x_i;\theta))^2=\arg\min_{\theta\in\Theta}\frac{1}{N}\sum_{i=1}^{N}(y_i-m(x_i;\theta))^2\\=\arg\max_{\theta\in\Theta}-\sum_{i=1}^{N}(y_i-m(x_i;\theta))^2=\arg\max_{\theta\in\Theta}\sum_{i=1}^{N}-(y_i-m(x_i;\theta))^2\\=\arg\max_{\theta\in\Theta}\prod_{i=1}^{N}\exp(-(\underbrace{y_i-m(x_i;\theta)}_{\varepsilon_i})^2)\\=\arg\max_{\theta\in\Theta}\prod_{i=1}^{N}\exp(-(\varepsilon_i)^2).$$
In some sense, least squares estimation is equivalent to maximum likelihood estimation with respect to the residual $\varepsilon_i$.
The basic model is the linear model, i.e., $m(x_i)=w\cdot x_i + b=\left<(w,\, b), (x_i,\, 1)\right>$.
Without loss of generality, we set $m(x)=\left<\theta, x\right>$.
The `ordinary least squares` estimate minimizes $\sum_{i=1}^N (y_i-\left<\theta, x_i\right>)^2$, i.e.,
$$\hat{\theta}=\arg\min_{\theta}\sum_{i=1}^N (y_i-\left<\theta, x_i\right>)^2=\arg\min_{\theta}\|y-X\theta\|_2^2\tag{ordinary least squares}$$
where $y=(y_1,\cdots, y_N)^T$ and $X=(x_1,\cdots,x_N)^T$.
According to the [first-order optimality conditions], we can obtain that
$$X^T(y-X\hat{\theta})=0\iff X^T X\hat{\theta}= X^T y$$
so we can obtain $\hat{\theta}=( X^T X)^{-1}X^Ty$ when $X^T X$ is invertible.
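As a quick sketch (simulated data, not from the original text), the normal-equation solution can be checked against R's built-in `lm()`:
```{r ols-check}
# Closed-form OLS estimate (X'X)^{-1} X'y versus lm()
set.seed(7)
X <- cbind(1, rnorm(40))                  # design matrix with intercept
y <- as.vector(X %*% c(1, 2) + rnorm(40))
theta_hat <- solve(t(X) %*% X, t(X) %*% y)
cbind(normal_eq = theta_hat, lm = coef(lm(y ~ X[, 2])))
```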
[Ordinary least squares method](https://www.albert.io/blog/key-assumptions-of-ols-econometrics-review/) holds some following assumptions:
1. The regression model is linear in the coefficients and the error term.
2. The error term has a population mean of zero.
3. All independent variables are uncorrelated with the error term.
4. Observations of the error term are uncorrelated with each other.
5. The error term has a constant variance (no heteroscedasticity).
6. No independent variable is a perfect linear function of other explanatory variables.
7. The error term is normally distributed (optional).
- https://www.itl.nist.gov/div898/handbook/pmd/section2/pmd21.htm
- https://statisticsbyjim.com/regression/ols-linear-regression-assumptions/
- https://www.econometrics-with-r.org/4-4-tlsa.html
- [Least squares and maximum likelihood by M.R.Osborne](https://maths-people.anu.edu.au/~mike/lsnml.pdf)
- [Least squares and maximum likelihood estimation](https://bookdown.org/egarpor/PM-UC3M/app-ext-mle.html)
From the computational perspective, we add a regularization term to deal with ill-posed problems.
For example, [Ridge Regression is a technique for analyzing multiple regression data that suffer from multicollinearity](https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Ridge_Regression.pdf) by
$$\hat{\theta}=\arg\min_{\theta}\sum_{i=1}^N (y_i-\left<\theta, x_i\right>)^2+\lambda\sum_{i=1}^p\theta_i^2=\arg\min_{\theta}\|y-X\theta\|_2^2+\lambda\|\theta\|_2^2\tag{Ridge Regression}.$$
According to the [first-order optimality conditions](https://web.stanford.edu/class/msande312/restricted/OPTconditions.pdf), we can obtain that
$$-X^T(y-X\hat{\theta})+\lambda\hat\theta=0\iff (X^T X+\lambda I)\hat{\theta}= X^T y$$
so that $\hat\theta=(X^T X+\lambda I)^{-1}X^T y$.
Note that $\theta=\frac{\partial m(x)}{\partial x}=\frac{\partial }{\partial x} x^T\theta$, so we can apply the regularization technique to nonlinear models as well.
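A minimal sketch (simulated data and a hypothetical $\lambda$, not from the original text) of the closed form $(X^TX+\lambda I)^{-1}X^Ty$; setting $\lambda=0$ recovers ordinary least squares.
```{r ridge-check}
# Ridge estimate via its closed form; lambda = 0 gives OLS
set.seed(1)
n <- 50; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(1, -2, 0.5) + rnorm(n, sd = 0.5)
ridge <- function(lambda) solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
cbind(ols = ridge(0), ridge = ridge(10))   # shrinkage toward zero
```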
- http://statweb.stanford.edu/~owen/courses/305-1314/Rudyregularization.pdf
- http://www.few.vu.nl/~wvanwie/Courses/HighdimensionalDataAnalysis/WNvanWieringen_HDDA_Lecture234_RidgeRegression_20182019.pdf
- https://en.wikipedia.org/wiki/Regularized_least_squares
[first-order optimality conditions]: https://web.stanford.edu/class/msande312/restricted/OPTconditions.pdf
----
The method of `weighted least squares` can be used when the ordinary least squares assumption of constant variance in the errors is violated (which is called heteroscedasticity).
Simply, it is not necessary to think all the errors(residuals) are equally weighted:
$$\hat{\theta}=\arg\min_{\theta}\sum_{i=1}^N w_i^2(y_i-\left<\theta, x_i\right>)^2=\arg\min_{\theta}\|W(y-X\theta)\|_2^2\tag{weighted least squares}.$$
According to [first-order optimality conditions], we can obtain that $(WX)^T(Wy-WX\hat{\theta})=0\iff (WX)^T WX\hat{\theta}= (WX)^T Wy$ and
$$\hat\theta=[(WX)^T WX]^{-1}(WX)^T Wy$$
where $W=\operatorname{diag}(w_1,\cdots,w_n)$.
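A short sketch (simulated heteroscedastic data and hypothetical weights) comparing the closed form above with `lm(..., weights = )`; the `lm()` weights multiply the squared residuals and therefore correspond to $w_i^2$.
```{r wls-check}
# Weighted least squares: [(WX)'WX]^{-1} (WX)'Wy versus lm()
set.seed(3)
x <- runif(60)
y <- 1 + 2 * x + rnorm(60, sd = 0.1 + x)   # heteroscedastic noise
w <- 1 / (0.1 + x)                         # down-weight noisy points
X <- cbind(1, x); W <- diag(w)
theta_cf <- solve(t(W %*% X) %*% (W %*% X), t(W %*% X) %*% (W %*% y))
cbind(closed_form = theta_cf, lm = coef(lm(y ~ x, weights = w^2)))
```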
<img src="https://online.stat.psu.edu/onlinecourses/sites/stat508/files/lesson04/ridge_regression_geomteric.png" width="40%"/>
- [13.1 - Weighted Least Squares](https://online.stat.psu.edu/stat501/lesson/13/13.1)
- https://online.stat.psu.edu/stat501/lesson/13/13.3
- [5.1 - Ridge Regression](https://online.stat.psu.edu/stat508/lesson/5/5.1)
- https://www.itl.nist.gov/div898/handbook/pmd/section1/pmd143.htm
- https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/24/lecture-24--25.pdf
----
Regularization is a popular approach to reducing a model’s predisposition to overfit on the training data
[and thus hopefully increasing the generalization ability of the model.](https://rohanvarma.me/Regularization/)
The Lasso is a `shrinkage and selection` method for linear regression. It minimizes the usual sum of squared errors,
with a bound on the sum of the absolute values of the coefficients.
[It has connections to soft-thresholding of wavelet coefficients, forward stagewise regression, and boosting methods.](http://statweb.stanford.edu/~tibs/lasso.html)
The `lasso` estimate is defined as
$$\hat{\theta}=\arg\min_{\theta\in\mathbb{R}^p}\sum_{i=1}^N (y_i-\left<\theta, x_i\right>)^2+\lambda\sum_{i=1}^{p}|\theta_i|=\arg\min_{\theta\in\mathbb{R}^p}\|y-X\theta\|_2^2+\lambda\|\theta\|_1$$
The optimality condition is
$$-X^T(y-X\theta)+\lambda S(\theta)=0$$
where $S(\theta)$ is a subgradient of $\|\theta\|_1$.
Lasso penalty corresponds to Double-exponential prior:
$\arg\min_{\theta\in\mathbb{R}^p}\|y-X\theta\|_2^2+\lambda\|\theta\|_1=\arg\max_{\theta\in\Theta}\exp(-\|y-X\theta\|_2^2)\underbrace{\exp(-\lambda \|\theta\|_1)}_{\text{Double-exponential prior}}$.
Ridge Regression penalty corresponds to Gaussian prior:
$\arg\min_{\theta\in\mathbb{R}^p}\|y-X\theta\|_2^2+\lambda\|\theta\|_2^2=\arg\max_{\theta\in\Theta}\exp(-\|y-X\theta\|_2^2)\underbrace{\exp(-\lambda \|\theta\|_2^2)}_{\text{Gaussian prior}}$.
Elastic Net penalty corresponds to a new prior:
$\arg\min_{\theta\in\mathbb{R}^p}\|y-X\theta\|_2^2+\alpha\|\theta\|_2^2+(1-\alpha)\|\theta\|_1=\arg\max_{\theta\in\Theta}\exp(-\|y-X\theta\|_2^2)\exp(-\alpha\|\theta\|_2^2-(1-\alpha)\|\theta\|_1)$
The lasso does `variable selection and shrinkage`,
whereas ridge regression, in contrast, `only shrinks`.
Different priors have different effects.
- https://online.stat.psu.edu/stat508/lesson/5
- https://www.cvxpy.org/examples/machine_learning/lasso_regression.html
- http://statweb.stanford.edu/~tibs/lasso.html
- https://trevorhastie.github.io/
- [The Bayesian Lasso slide](https://www2.stat.duke.edu/courses/Fall17/sta521/knitr/Lec-13-Bayes-VarSel/bayes-varsel.pdf)
- [The Bayesian Lasso](https://people.eecs.berkeley.edu/~jordan/courses/260-spring09/other-readings/park-casella.pdf)
- [Lecture 3: More on regularization. Bayesian vs maximum likelihood learning](https://www.cs.mcgill.ca/~dprecup/courses/ML/Lectures/ml-lecture03.pdf)
- [Bayesian Interpretations of Regularization](https://www.mit.edu/~9.520/spring09/Classes/class15-bayes.pdf)
- https://rohanvarma.me/Regularization/
- [The Learning Problem and Regularization](https://www.mit.edu/~9.520/spring11/slides/class02.pdf)
- [New and Evolving Roles of Shrinkage in Large-Scale Prediction and Inference](https://www.birs.ca/events/2019/5-day-workshops/19w5188/schedule#)
```{r regularization, echo=FALSE, fig.height=10, fig.width=6}
library(MASS) # Package needed to generate correlated predictors
library(glmnet) # Package to fit ridge/lasso/elastic net models
library(Matrix)
# Generate data
set.seed(19875) # Set seed for reproducibility
n <- 1000 # Number of observations
p <- 5000 # Number of predictors included in model
real_p <- 15 # Number of true predictors
x <- matrix(rnorm(n*p), nrow=n, ncol=p)
y <- apply(x[,1:real_p], 1, sum) + rnorm(n)
# Split data into train (2/3) and test (1/3) sets
train_rows <- sample(1:n, .66*n)
x.train <- x[train_rows, ]
x.test <- x[-train_rows, ]
y.train <- y[train_rows]
y.test <- y[-train_rows]
fit.lasso <- glmnet(x.train, y.train, family="gaussian", alpha=1)
fit.ridge <- glmnet(x.train, y.train, family="gaussian", alpha=0)
fit.elnet <- glmnet(x.train, y.train, family="gaussian", alpha=.5)
# 10-fold Cross validation for each alpha = 0, 0.1, ... , 0.9, 1.0
# (For plots on Right)
for (i in 0:10) {
  assign(paste("fit", i, sep=""),
         cv.glmnet(x.train, y.train, type.measure="mse",
                   alpha=i/10, family="gaussian"))
}
# Plot solution paths:
par(mfrow=c(3,2))
# For plotting options, type '?plot.glmnet' in R console
plot(fit.lasso, xvar="lambda")
plot(fit10, main="LASSO")
plot(fit.ridge, xvar="lambda")
plot(fit0, main="Ridge")
plot(fit.elnet, xvar="lambda")
plot(fit5, main="Elastic Net")
#https://www4.stat.ncsu.edu/~post/josh/LASSO_Ridge_Elastic_Net_-_Examples.html
```
When the number of predictors is large compared to the number of observations,
$X$ is likely to be singular and the regression approach is no longer feasible.
`Partial Least Squares`(pls) finds components from $X$ that are also relevant for $Y$.
Specifically, pls regression searches for a set of components (called latent vectors)
that performs a simultaneous decomposition of $X$ and $Y$ with the
constraint that these components explain as much as possible of the covariance between $X$ and $Y$.
- https://stats.idre.ucla.edu/wp-content/uploads/2016/02/pls.pdf
- https://personal.utdallas.edu/~herve/Abdi-PLS-pretty.pdf
- https://en.wikipedia.org/wiki/Partial_least_squares_regression
- https://www.camo.com/resources/pls-regression.html
- http://virtuallaboratory.org/lab/pls/
### Bayesian Regularization
`Bayesian Regularization` applies Bayesian statistics to regularization techniques.
The inverse Bayes formula tells us that we can derive the prior from the posterior.
The `sparsity inducing priors` used in the Bayesian approach can broadly be classified as
* A single continuous shrinkage prior such as the Double Exponential prior and the Horseshoe prior;
* Two-group spike-and-slab prior, such as the spike-and-slab Normal prior and spike-and-slab Lasso prior.
The `horseshoe prior` is a member of the family of multivariate scale mixtures of normals,
and is therefore closely related to widely used approaches for sparse Bayesian learning, including, among others, Laplacian priors (e.g. the LASSO) and Student-t priors (e.g. the relevance vector machine).
The probability density function of Cauchy distribution is defined as
$$f(y;\mu,\sigma) = \frac{1}{\pi \sigma}\,\frac{1}{1 + (y-\mu)^2/\sigma^2}.$$
The Half-Cauchy distribution is supported on the set of all real numbers that are greater than or equal to $\mu$, that is on $[\mu, \infty)$ with the following pdf:
$$f(y;\mu, \sigma) = \begin{cases} \frac{2}{\pi \sigma}\,\frac{1}{1 + (y-\mu)^2/\sigma^2}, & y \ge \mu, \\ 0, & \text{otherwise}. \end{cases}$$
By a spike and slab model we mean a Bayesian model specified by the following prior hierarchy:
$$\tag{The spike and slab model}
(Y_i\mid x_i,\theta,\sigma^2)\sim \mathcal{N}(\theta^T x_i, \sigma^2) \\
(\theta\mid\Gamma)\sim \mathcal{N}(0, \Gamma)\\
\Gamma\sim \pi(d\gamma)\\
\sigma^2\sim \mu(d\sigma^2).
$$
- https://github.com/AsaCooperStickland/Spike_And_Slab
- http://proceedings.mlr.press/v5/carvalho09a.html
- https://projecteuclid.org/euclid.aos/1117114335
During the modeling stage a critical design decision concerns choosing between model complexity and model expressiveness.
Although one wishes to use a model that can explain as closely as possible the particular application’s generative process,
one has to restrict oneself to models that are computationally tractable.
This leads to models that are generally an oversimplification of the inherent process.
These models, due to their simplifying assumptions, fail to learn the parameters settings that produce the desired predictions.
One solution to this problem is to add more complexity to the model to better reflect the underlying process.
The posterior regularization framework incorporates side-information into parameter estimation in the form of linear constraints on posterior expectations,
which allows tractable learning and inference even when the constraints would be intractable to encode directly in the model parameters.
By defining a flexible language for specifying diverse types of problem-specific prior knowledge,
we make the framework applicable to a wide variety of probabilistic models, both generative and discriminative.
- [Bayesian Regularization](http://hedibert.org/wp-content/uploads/2015/12/BayesianRegularization.pdf)
- [Mixtures of g Priors for Bayesian Variable Selection](https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/readings/liang-etal.pdf)
- [29 : Posterior Regularization](https://www.cs.cmu.edu/~epxing/Class/10708-14/scribe_notes/scribe_note_lecture29.pdf)
- [New and Evolving Roles of Shrinkage in Large-Scale Prediction and Inference](https://www.birs.ca/workshops/2019/19w5188/report19w5188.pdf)
- [Shrinkage priors for Bayesian prediction](https://projecteuclid.org/euclid.aos/1151418241)
- [Posterior Regularization for Structured Latent Variable Models](http://jmlr.csail.mit.edu/papers/volume11/ganchev10a/ganchev10a.pdf)
```{r LaplacesDemon, echo=FALSE, fig.height=4, fig.width=6}
library(LaplacesDemon)
x <- rnorm(100)
lambda <- rhalfcauchy(100, 5)            # local shrinkage scales
tau <- 5                                 # global shrinkage scale
dens <- dhs(x, lambda, tau, log=TRUE)    # horseshoe log-density at x
x <- rhs(100, lambda=lambda, tau=tau)    # draws from the horseshoe prior
plot(density(x))
#https://www.rdocumentation.org/packages/LaplacesDemon/versions/16.1.1/topics/dist.Horseshoe
```
----
Last but not least, we introduce more on Bayesian statistics.
- https://bayesian.org/isba2020-home/
- https://www.stats.ox.ac.uk/bnp12/
- [Objections to Bayesian statistics](http://www.stat.columbia.edu/~gelman/research/published/badbayesmain.pdf)
- [The Limitation of Bayesianism](https://www.cc.gatech.edu/~isbell/classes/reading/papers/wang.bayesianism.pdf)
- [Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric?](http://gpss.cc/bark08/slides/1%20zoubin.pdf)
- http://www.math.leidenuniv.nl/~avdvaart/talks/ICM.pdf
- http://www2.stat.duke.edu/~rcs46/lectures_2015/14-bayes1/14-bayes3.pdf
- http://courses.ieor.berkeley.edu/ieor165/lecture_notes/ieor165_lec8.pdf
### Robust regression
[Robust regression can be used in any situation where OLS regression can be applied. It generally gives better accuracies over OLS because it uses a weighting mechanism to weigh down the influential observations. It is particularly resourceful when there are no compelling reasons to exclude outliers in your data.](http://r-statistics.co/Robust-Regression-With-R.html)
- https://stats.idre.ucla.edu/r/dae/robust-regression/
- [13.3 - Robust Regression Methods](https://online.stat.psu.edu/stat501/lesson/13/13.3)
- http://r-statistics.co/Robust-Regression-With-R.html
The least absolute deviation (LAD) method, which is also known as the $L_1$ method,
provides a useful and plausible alternative to least squares (LS) method.
The least absolute deviation (LAD) method is to minimize
$$\sum_{i=1}^{N}|\underbrace{y_i-m(x_i;\theta)}_{\varepsilon_i}|\tag{least absolute deviation}$$
which in turn minimizes the absolute value of the residuals.
In other words, $\varepsilon_i$ is generated from a `Laplacian distribution`.
- [Analysis of least absolute deviation](https://www.math.ust.hk/~makchen/papers/LAD.pdf)
This achieves robustness, but is hard to work with in practice
because the absolute value function is `not differentiable`.
[Peter Huber](https://statmodeling.stat.columbia.edu/2011/05/01/peter_hubers_th/) defines a loss function
$$\rho(\varepsilon_i)=\begin{cases}\varepsilon_i^2, &\text{if } |\varepsilon_i|<c,\\ 2c|\varepsilon_i| -c^2, &\text{otherwise}.\end{cases}\tag{Huber}$$
`Log Hyperbolic Cosine Loss` is defined as
$$\rho(\varepsilon_i)=\log\frac{\exp(\varepsilon_i)+\exp(-\varepsilon_i)}{2}\tag{log-cosh}=\log\operatorname{cosh}(\varepsilon_i).$$
Log-cosh is a smooth approximation to the Huber loss:
$$\lim_{\varepsilon_i\to 0}\frac{\log\operatorname{cosh}(\varepsilon_i)}{\varepsilon_i^2}=\frac{1}{2},\qquad \lim_{|\varepsilon_i|\to \infty}\frac{\log\operatorname{cosh}(\varepsilon_i)}{|\varepsilon_i|}=1.$$
- [Log Hyperbolic Cosine Loss Improves Variational Auto-Encoder](https://openreview.net/forum?id=rkglvsC9Ym)
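The following sketch (not from the original text) plots the squared, absolute, Huber (with $c=1$) and log-cosh losses against the residual, showing that the robust losses grow only linearly in the tails.
```{r robust-losses, fig.height=4, fig.width=6}
# Squared, absolute, Huber (c = 1) and log-cosh losses
r <- seq(-3, 3, length = 200)
huber <- function(r, c = 1) ifelse(abs(r) < c, r^2, 2 * c * abs(r) - c^2)
plot(r, r^2, type = "l", xlab = "residual", ylab = "loss")
lines(r, abs(r), col = 2)
lines(r, huber(r), col = 3)
lines(r, log(cosh(r)), col = 4)
legend("top", c("squared", "absolute", "Huber", "log-cosh"),
       col = 1:4, lty = 1, cex = 0.8)
```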
If an observation has a response value that is very different from the predicted value based on a model,
then that observation is called an outlier.
On the other hand, if an observation has a particularly unusual combination of predictor values (e.g., one predictor has a very different value for that observation compared with all the other data observations),
then that observation is said to have high leverage.
Thus, there is a distinction between outliers and high leverage observations, and each can impact our regression analyses differently.
It is also possible for an observation to be both an outlier and have high leverage.
Thus, it is important to know how to detect outliers and high leverage data points.
- https://online.stat.psu.edu/stat501/lesson/11
### Penalized mle
The penalty term in Bayesian mle requires that $P(\theta)$ is a probability density (distribution) function, i.e., $\int_{\theta}P(\theta)\mathrm d\theta=1$.
However, it is not easy to find appropriate prior.
Penalized least squares are to minimize the
\begin{equation}\label{ Penalized mle}\arg\min_{\theta\in\mathbb{R}^p}\|y-X\theta\|_2^2+\sum_{i=1}^{p}\lambda_ip_i(|\theta_i|)\end{equation}
where the penalty functions $p_i$ are not necessarily the same for all $i$.
[One distinguishing feature of the nonconcave penalized likelihood approach is that it can simultaneously select variables and estimate coefficients of variables.](http://www.math.hkbu.edu.hk/~hpeng/Paper/modsel.pdf)
- [Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties](https://orfe.princeton.edu/~jqfan/papers/01/penlike.pdf)
- [Nonconcave penalized likelihood with a diverging number of parameters](https://arxiv.org/abs/math/0406466)
- https://www.sis.uta.fi/tilasto/mttapu/runze.pdf
- https://ned.ipac.caltech.edu/level5/March02/Silverman/Silver2_8.html
- https://www.ccg.unam.mx/~vinuesa/tlem/pdfs/Sanderson_PL_2002.pdf
- http://people.vcu.edu/~dbandyop/BIOS625/Penalized.pdf
### Regression Discontinuity Design
We can deal with the categorical independent variable.
`Indicator variable, instrumental variable`
- [Endogenous Regressors and Instrumental Variables](https://eml.berkeley.edu/~powell/e240b_sp10/ivnotes.pdf)
- https://www.mailman.columbia.edu/research/population-health-methods/instrumental-variables
- [8.2 - The Basics of Indicator Variables](https://online.stat.psu.edu/stat462/node/161/)
- https://idss.mit.edu/calendar/idss-distinguished-speaker-seminar-rocio-titiunik/
## Classification Loss
Simply speaking, classification is to learn a mapping
$$\mathbb{R}^d\to \mathbb{D}$$
where $\mathbb{D}$ is the finite categorical set.
Classification is about categorical variable.
[A categorical variable (sometimes called a nominal variable) is one that has two or more categories, but there is no intrinsic ordering to the categories. For example, gender is a categorical variable having two categories (male and female) and there is no intrinsic ordering to the categories. Hair color is also a categorical variable having a number of categories (blonde, brown, brunette, red, etc.) and again, there is no agreed way to order these from highest to lowest. A purely categorical variable is one that simply allows you to assign categories but you cannot clearly order the variables.](https://stats.idre.ucla.edu/other/mult-pkg/whatstat/what-is-the-difference-between-categorical-ordinal-and-numerical-variables/)
[If the variable has a clear ordering, then that variable would be an ordinal variable, as described below.](https://stats.idre.ucla.edu/other/mult-pkg/whatstat/what-is-the-difference-between-categorical-ordinal-and-numerical-variables/)
There is no arithmetic and magnitude concept in the world of categorical variables.
Purely categorical variables are not comparable.
In a computer, they are usually represented as strings or chars.
Like other common objects in a computer, we can test whether two categorical variables share the same literal value.
- https://online.stat.psu.edu/stat504/node/6/
- [How good is your classifier? Revisiting the role of evaluation metrics in machine learning](https://talks.cam.ac.uk/talk/index/128293)
----
[Category Encoders](http://contrib.scikit-learn.org/category_encoders/index.html) is a set of scikit-learn-style transformers for encoding categorical variables into numeric with different techniques.
- http://contrib.scikit-learn.org/category_encoders/
- https://brendanhasz.github.io/2019/03/04/target-encoding.html
- https://www.kaggle.com/waydeherman/tutorial-categorical-encoding
- https://kiwidamien.github.io/encoding-categorical-variables.html
- https://kiwidamien.github.io/james-stein-encoder.html
- https://kiwidamien.github.io/are-you-getting-burned-by-one-hot-encoding.html
----
- https://rohanvarma.me/Loss-Functions/
- https://people.eecs.berkeley.edu/~wainwrig/stat241b/lec11.pdf
- https://people.cs.umass.edu/~akshay/courses/cs690m/files/lec12.pdf
- https://www.cse.huji.ac.il/~daphna/theses/Alon_Cohen_2014.pdf
- [About loss functions, regularization and joint losses : multinomial logistic, cross entropy, square errors, euclidian, hinge, Crammer and Singer, one versus all, squared hinge, absolute value, infogain, L1 / L2 - Frobenius / L2,1 norms, connectionist temporal classification loss](http://christopher5106.github.io/deep/learning/2016/09/16/about-loss-functions-multinomial-logistic-logarithm-cross-entropy-square-errors-euclidian-absolute-frobenius-hinge.html)
### 0-1 loss and its surrogates
It is natural to apply the 0-1 loss function in classification problems given the lack of arithmetic operations on categories.
The 0-1 loss is defined as following
\begin{equation}\label{0-1 loss}
L(\hat{y}, y)=\begin{cases}1,&\text{if $\hat{y}\not= y$}\\
0,&\text{if $\hat{y}= y$}\end{cases}=\mathbb{I}(\hat{y}\not= y).
\end{equation}
where $\mathbb{I}(\cdot)$ is the indicator function and $\hat y, y$ are categorical variables.
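As a minimal sketch (hypothetical labels, not from the original text), the empirical average of the 0-1 loss is simply the misclassification rate:
```{r zero-one-loss}
# The 0-1 loss is the indicator of a wrong prediction;
# averaging it over a sample gives the error rate
y    <- c("a", "b", "a", "a")
yhat <- c("a", "a", "a", "b")
mean(yhat != y)
```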
And [this loss will induce the discrete topology space.](https://proofwiki.org/wiki/Definition:Discrete_Topology)
[This metric "shatters" the points, isolating each one within its own unit ball.](https://math.stackexchange.com/questions/2614268/why-is-a-discrete-topology-called-a-discrete-topology)
- [Why is a discrete topology called a discrete topology?](https://math.stackexchange.com/questions/2614268/why-is-a-discrete-topology-called-a-discrete-topology)
- https://proofwiki.org/wiki/Category:Discrete_Topology
- https://stattrek.com/multiple-regression/dummy-variables.aspx
---
For example, it is used in linear regression of an indicator matrix.
- https://online.stat.psu.edu/stat508/lesson/8/8.5
- http://users.umiacs.umd.edu/~hcorrada/PracticalML/pdf/lectures/classification.pdf
### Binary classification
An n-dimensional pattern (object) $\mathbb x$ has $n$ coordinates, $x=(x_1, x_2, \cdots, x_n)$,
where each $x_i$ is a real number, $x_i\in\mathbb{R}$ for $i = 1, 2,\cdots, n$.
Each pattern $\mathbb x_j$ belongs to a class $y_j\in\{-1, +1\}$.
Consider a training set $T$ of $m$ patterns together with their classes, $T=\{(\mathbb{x}_1, y_1), (\mathbb{x}_2, y_2), \cdots, (\mathbb{x}_m, y_m)\}$.
Consider a dot product space $S$, in which the patterns $x$ are embedded, $x_1, x_2,\cdots, x_m\in S$.
Any hyperplane in the space $S$ can be written as
$$\{\mathbb x\in S\mid \mathbb w\cdot\mathbb x+b=0\}, \mathbb w\in S, b\in\mathbb{R}.$$
The dot product $w\cdot\mathbb x$ is defined by:
$$\mathbb w\cdot\mathbb x=\sum_{i}^n w_ix_i$$
A training set of patterns is `linearly separable` if there exists at least one linear classifier defined by the pair $(\mathbb w, b)$
which correctly classifies all training patterns.
This linear classifier is represented by the hyperplane $H$ $(\mathbb{w\cdot x}+b=0)$ and
defines a region for class $+1$ patterns $(\mathbb{w\cdot x}+b>0)$ and another region for class $-1$ patterns $(\mathbb{w\cdot x}+b<0)$.
The linear classifiers are in the following form
\begin{equation}\hat{y}=\operatorname{sgn}(\mathbb{w\cdot x}+b)
=\begin{cases}1, &\text{if $\mathbb{w\cdot x}+b> 0$;}\\
-1, &\text{if $\mathbb{w\cdot x}+b< 0$.}\end{cases}
\end{equation}
In this case, we can re-express the loss \ref{0-1 loss} as follows
\begin{equation}\tag{binary loss}\label{binary loss}
L(\hat{y}, y)=\begin{cases}1,&\text{if $\hat{y}\not= y$}\\
0,&\text{if $\hat{y}= y$}\end{cases}
=\frac{1}{2}(1-\operatorname{sgn}(\mathbb{w\cdot x}+b)y).
\end{equation}
The optimization procedure will minimize the loss over the training data set $T$
$$\sum_{i=1}^{m} \frac{1}{2}(1-\operatorname{sgn}(\mathbb{w\cdot x_i}+b)\mathbb{y}_i)$$
Since $\operatorname{sgn}(t)=\operatorname{sgn}(t/\alpha)$ for any positive $\alpha$, rescaling the parameters leaves the minimizer unchanged:
$$(w^*, b^*)=
\arg\min_{\mathbb{w}, b} \sum_{i=1}^{m} \frac{1}{2}(1-\operatorname{sgn}(\mathbb{w\cdot x_i}+b)\mathbb{y}_i)\\=
\arg\min_{\mathbb{w}, b} \sum_{i=1}^{m} \frac{1}{2}(1-\operatorname{sgn}(\frac{1}{\alpha}[\mathbb{w\cdot x_i}+b])\mathbb{y}_i)
$$
In particular, we may take $\alpha\leq \min\{|\mathbb{w^*\cdot x_i}+b^*|\mid i=1,\cdots, m\}$ so that $\frac{1}{\alpha}|\mathbb{w^*\cdot x_i}+b^*|\geq 1$ for all $i$.
The above objective is `non-differentiable and non-convex`.
If the training set $T$ is `linearly separable`, we can transfer the linear separability condition into constraints
\begin{equation}\label{SVM}\tag{SVM}
\begin{split}
(w^*, b^*) =&\arg\min_{\mathbb{w}, b}\frac{1}{2}\|\mathbb{w}\|_2^2 \\
&\text{subject to } \underbrace{(\mathbb{w\cdot x_i}+b)\mathbb{y}_i\geq 0}_{\text{linear separability}} \,\, i=1,\cdots,m.
\end{split}
\end{equation}
We prefer `the maximal margin hyperplane`
\begin{equation}\label{hard margins}\tag{hard margins}
\begin{split}
(w^*, b^*) =&\arg\min_{\mathbb{w}, b}\frac{1}{2}\|\mathbb{w}\|_2^2 \\
&\text{subject to } \underbrace{(\mathbb{w\cdot x_i}+b)\mathbb{y}_i\geq 1}_{\text{linear separability}} \,\, i=1,\cdots,m.
\end{split}
\end{equation}
where the `support vectors` are defined as $\{(\mathbb{x}_i , y_i)\mid (\mathbb{w\cdot x_i}+b)\mathbb{y}_i=1\}$.
It is the `quadratic optimization problem with linear constraints`.
If the training set $T$ is not `linearly separable`, we prefer `the maximal soft margin hyperplane`
\begin{equation}\tag{soft margin}\label{soft margin}
\begin{split}
(w^*, b^*) =&\arg\min_{\mathbb{w}, b}\frac{1}{2}\|\mathbb{w}\|_2^2+\lambda\sum_{i=1}^m\varepsilon_i \\
&\text{subject to } (\mathbb{w\cdot x_i}+b)\mathbb{y}_i\geq 1- \varepsilon_i \\
& \varepsilon_i \geq 0\,\, i=1,\cdots,m.
\end{split}
\end{equation}
Here $\lambda > 0$ is the regularization constant.
A surrogate loss is a loss that is used as a proxy for the 0-1 loss, and usually has better computational properties such as [Hinge loss](https://en.wikipedia.org/wiki/Hinge_loss), [Ramp loss](https://www.jstor.org/stable/23013182).
- http://www.svms.org/history.html
- https://cs231n.github.io/linear-classify/
- https://www.jianshu.com/p/4a40f90f0d98
- https://www.cs.otago.ac.nz/research/student-publications/Haitao_Xu_Tbldm_for_ajcai2018.pdf
---
If $\varepsilon_i> 1$, it follows that $\operatorname{sgn}(\mathbb{w\cdot x_i}+b)\not=\mathbb{y}_i$,
i.e., the prediction is wrong with respect to the sample $(\mathbb{x}_i , y_i)$.
And $(\mathbb{w\cdot x_i}+b)\mathbb{y}_i\geq 1- \varepsilon_i\implies \varepsilon_i\geq 1- (\mathbb{w\cdot x_i}+b)\mathbb{y}_i$, so the prediction is right if $\varepsilon_i\leq 0$.
The [Hinge loss](https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es1999-461.pdf) is defined as
$$\operatorname{H}(y_i,\hat{y}_i)=\max\{0, 1- (\underbrace{\mathbb{w\cdot x_i}+b}_{\hat y_i})y_i\}\tag{Hinge loss}$$
so `hinge loss` is $0$ when the prediction is right.
The `ramp loss` is defined as following
$$\operatorname{R}(y_i,\hat{y}_i)=\min\{1,\max\{0, 1- (\underbrace{\mathbb{w\cdot x_i}+b}_{\hat y_i})y_i\}\}\tag{Ramp loss}$$
which is closer to the 0-1 loss \eqref{binary loss} than the hinge loss.
Note that $\lim_{t\to 0}\frac{1- t}{\exp(-t)}=1$, `Exponential Loss` is defined that
$$E(y_i,\hat y_i)=\exp(-(\mathbb{w\cdot x_i}+b)y_i).\tag{Exponential loss}$$
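A short sketch (not from the original text) plots the 0-1 loss and its surrogates as functions of the margin $t=(\mathbb{w\cdot x}+b)y$:
```{r surrogate-losses, fig.height=4, fig.width=6}
# 0-1, hinge, ramp and exponential losses versus the margin t
t <- seq(-2, 2, length = 200)
plot(t, as.numeric(t < 0), type = "s", ylim = c(0, 3),
     xlab = "margin t", ylab = "loss")          # 0-1 loss
lines(t, pmax(0, 1 - t), col = 2)               # hinge
lines(t, pmin(1, pmax(0, 1 - t)), col = 3)      # ramp
lines(t, exp(-t), col = 4)                      # exponential
legend("topright", c("0-1", "hinge", "ramp", "exponential"),
       col = 1:4, lty = 1, cex = 0.8)
```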
Another approach to nonlinear classification is to apply a feature transformation and then a linear classifier.
Kernel tricks are among the most popular methods:
the kernel trick maps the data points into a higher-dimensional space.
- [SVM for Pattern recognition](http://support-vector-machines.org/SVM_pr.html)
- http://www.ece.utep.edu/research/webfuzzy/docs/kk-thesis/kk-thesis-html/node19.html
- http://web.mit.edu/6.034/wwwbob/svm-notes-long-08.pdf
- https://en.wikipedia.org/wiki/Support-vector_machine
- https://en.wikipedia.org/wiki/Hinge_loss
- [Nonconvex online support vector machines](https://www.ncbi.nlm.nih.gov/pubmed/20513924)
- [The Support Vector Machine and Mixed Integer Linear Programming: Ramp Loss SVM with L1-Norm Regularization](https://scholarscompass.vcu.edu/cgi/viewcontent.cgi?article=1007&context=ssor_pubs)
Without modifying the loss functions, we can extend binary classifiers to multiclass classifiers.
The basic belief is that a single binary classifier learns one specific pattern,
so we can use multiple binary classifiers to learn multiple patterns.
- https://en.wikipedia.org/wiki/Multiclass_classification
- [Multiclass Learning with ECOC](http://blog.pluskid.org/?p=870)
```{r svm, echo=FALSE, fig.height=4, fig.width=6}
data(cats, package = "MASS")
library(e1071)
m <- svm(Sex~., data = cats)
plot(m, cats)
## more than two variables: fix 2 dimensions
data(iris)
m2 <- svm(Species~., data = iris)
plot(m2, iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 3, Sepal.Length = 4))
## plot with custom symbols and colors
plot(m, cats, svSymbol = 1, dataSymbol = 2, symbolPalette = rainbow(4),
     color.palette = terrain.colors)
```
- https://arxiv.org/abs/1808.02435
- [Mixed-Integer Support Vector Machine](http://opt.kyb.tuebingen.mpg.de/papers/OPT2009-Guan.pdf)
- [OPTIMIZING $\Psi$-LEARNING VIA MIXED INTEGER PROGRAMMING](https://stat-or.unc.edu/files/2016/04/05-20.pdf)
### Adaptive margin learning
The margin $\Delta$ below plays the same role as the margin in support vector machines.
The multiclass SVM loss, summed over the training examples, is formalized as follows:
$$\sum_{i}\sum_{y\not=y_i}[\Delta+(W x_i+b)^Ty-(W x_i+b)^Ty_i]_{+}$$
where $y_i, y$ are one-hot class vectors (so $(W x_i+b)^Ty$ picks out the score of class $y$) and $\Delta>0$ is a constant.
<img src="https://cs231n.github.io/assets/margin.jpg" width="80%" />
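As a quick numeric illustration (hypothetical scores, not from the text), the inner sum for one example with three classes:

``` {r multiclass-hinge, eval=FALSE}
## Multiclass SVM loss for one example (hypothetical scores)
scores <- c(3.2, 5.1, -1.7)   # (W x_i + b) evaluated on classes 1..3
yi     <- 1                   # index of the true class
Delta  <- 1
sum(pmax(0, Delta + scores[-yi] - scores[yi]))
## = max(0, 1 + 5.1 - 3.2) + max(0, 1 - 1.7 - 3.2) = 2.9
```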
Why is the margin $\Delta$ constant for all categories?
To make the size of the margin at each training point a controlling variable we propose the following learning algorithm [AM-SVM](http://pdfs.semanticscholar.org/a3db/a5e48d2e57941efe7b4cab7e299fd2cd9de7.pdf):
$$\min_{\mathbb{\alpha},\ \xi} \sum_{i=1}^m\xi_i\\ \text{s.t. }\quad y_if(x_i)\geq 1-\xi_i+\alpha_i k(x_i, x_i)\\
\xi_i\geq 0,\ \alpha_i \geq 0,\quad i=1,\cdots,m$$
- http://yining-wang.com/mdactive-slides.pdf
- [Adaptive Margin Support Vector Machines for Classification](http://pdfs.semanticscholar.org/a3db/a5e48d2e57941efe7b4cab7e299fd2cd9de7.pdf)
- [Adaptive Large Margin Training for Multilabel Classification](https://www.aaai.org/ocs/index.php/AAAI/AAAI11/paper/viewFile/3455/3878)
- [AdaptiveFace: Adaptive Margin and Sampling for Face Recognition](http://www.cbsr.ia.ac.cn/users/xiangyuzhu/papers/2019adaptiveface.pdf)
- [Boosting Few-Shot Learning With Adaptive Margin Loss](https://arxiv.org/abs/2005.13826)
- [Margin-adaptive model selection in statistical learning](https://projecteuclid.org/download/pdfview_1/euclid.bj/1302009243)
- https://github.com/cvqluu/Angular-Penalty-Softmax-Losses-Pytorch
- [Deep Ranking Model by Large Adaptive Margin Learning for Person Re-identification](https://arxiv.org/abs/1707.00409)
- [Boosting Few-Shot Visual Learning with Self-Supervision](https://github.com/valeoai/BF3S)
- https://abursuc.github.io/
- http://imagine.enpc.fr/~komodakn/
- https://webia.lip6.fr/~cord/
- https://www.weiranhuang.com/
### Optimal margin distribution machine learning
Optimal margin Distribution Machine (ODM) can achieve a better generalization performance by optimizing the margin distribution explicitly, i.e., its first- and second-order statistics (the margin mean and variance), rather than only the minimum margin.
- [Optimal Margin Distribution Machine](https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/tkde19odm.pdf)
- [Semi-Supervised Optimal Margin Distribution Machines](https://www.ijcai.org/Proceedings/2018/0431.pdf)
- [Multi-Label Optimal Margin Distribution Machine](http://www.acml-conf.org/2019/conference/accepted-papers/175/)
- [Large Margin Distribution Learning with Cost Interval and Unlabeled Data](https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/tkde16cisldm.pdf)
- [Optimal Margin Distribution Clustering](https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16895)
- https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/publication_toc.htm#Cost-Sensitive%20and%20Class-Imbalance%20Learning
### Logistic regression and beyond
We can explain the linear classifier through a probabilistic interpretation
$$\hat{y}=\operatorname{sgn}(\mathbb{w\cdot x}+b)=\operatorname{sgn}[\ln(\frac{P(Y=1\mid x)}{1-P(Y=1\mid x)})]$$
so that $P(Y=1\mid x)=\frac{\exp(\mathbb{w\cdot x}+b)}{1+\exp(\mathbb{w\cdot x}+b)}=\frac{1}{1+\exp(-[\mathbb{w\cdot x}+b])}$.
And $P(Y\not=1\mid x)=1-\frac{\exp(\mathbb{w\cdot x}+b)}{1+\exp(\mathbb{w\cdot x}+b)}=\frac{1}{1+\exp(\mathbb{w\cdot x}+b)}$.
In our setting $y_i\in\{+1,-1\}$, we can get that
$$P(y\mid x)=\frac{1}{1+\exp(-y[\mathbb{w\cdot x}+b])}$$
where $y\in\{+1,-1\}$.
We apply the maximum likelihood principle to estimate the parameters
$$\arg\max_{\mathbb w, b}\prod_{i=1}^m P(y_i\mid x_i)=\arg\min_{\mathbb w, b}\sum_{i=1}^m\ln(\frac{1}{P(y_i\mid x_i)})\\=\arg\min_{\mathbb w, b}\sum_{i=1}^m\ln(1+\exp(-y_i[\mathbb{w\cdot x_i}+b])).\tag{negative log-likelihood}$$
Note that $\lim_{z\to\infty}\frac{\ln(1+\exp(z))}{1+z}=1$; taking $z=-y_i[\mathbb{w\cdot x_i}+b]$, the per-sample term $\ln(1+\exp(z))$ behaves like the hinge penalty $1+z=1-y_i[\mathbb{w\cdot x_i}+b]$ when the margin is badly violated,
so the negative log-likelihood can be seen as a smooth approximation of the hinge loss.
In our context, we always assume that $y\in\{+1,-1\}$ while another option is
$y\in\{0,1\}$.
If $y\in\{0,1\}$, we still suppose that
$$P(Y=1\mid x)=\frac{1}{1+\exp(-[\mathbb{w\cdot x}+b])}$$
and
$$P(Y=0\mid x)=P(Y\not=1\mid x)=\frac{1}{1+\exp([\mathbb{w\cdot x}+b])}$$
so we can express the probability as
$$P(Y=y\mid x)=[\frac{1}{1+\exp(-[\mathbb{w\cdot x}+b])}]^{y} [\frac{1}{1+\exp([\mathbb{w\cdot x}+b])}]^{1-y}$$
where $y=0$ or $y=1$.
We apply the maximum likelihood principle to estimate the parameters
$$\arg\max_{\mathbb w, b}\prod_{i=1}^m P(y_i\mid x_i)
\\=\arg\min_{\mathbb w, b}\sum_{i=1}^m-\ln([\frac{1}{1+\exp(-[\mathbb{w\cdot x_i}+b])}]^{y_i} [\frac{1}{1+\exp([\mathbb{w\cdot x_i}+b])}]^{1-y_i})\\
=\arg\min_{\mathbb w, b} \sum_{i=1}^m y_i\ln(1+\exp(-[\mathbb{w\cdot x_i}+b]))+(1-y_i)\ln([1+\exp([\mathbb{w\cdot x_i}+b])]) \\
=\arg\min_{\mathbb w, b} \sum_{i=1}^m
y_i\ln(\frac{1+\exp(-[\mathbb{w\cdot x_i}+b])}{1+\exp([\mathbb{w\cdot x_i}+b])})
+\ln([1+\exp([\mathbb{w\cdot x_i}+b])]) \\
=\arg\min_{\mathbb w, b} \sum_{i=1}^m -y_i[\mathbb{w\cdot x_i}+b]+\ln([1+\exp([\mathbb{w\cdot x_i}+b])])
\tag{negative log-likelihood}.$$
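To make the estimation concrete, here is a minimal R sketch of ours (simulated data; all names are illustrative) that minimizes the final negative log-likelihood above with `optim` and checks it against `glm`:

``` {r logistic-mle, eval=FALSE}
## Fit logistic regression by minimizing the negative log-likelihood above
set.seed(1)
n <- 200
x <- matrix(rnorm(2 * n), ncol = 2)
y <- rbinom(n, 1, plogis(x %*% c(1, -2) + 0.5))      # labels in {0, 1}

nll <- function(theta) {                             # theta = (w1, w2, b)
  z <- x %*% theta[1:2] + theta[3]
  sum(-y * z + log(1 + exp(z)))                      # the last line of the derivation
}
optim(c(0, 0, 0), nll, method = "BFGS")$par          # (w1, w2, b)
coef(glm(y ~ x, family = binomial()))                # (b, w1, w2): should closely agree
```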
- [Which loss function is correct for logistic regression?](https://stats.stackexchange.com/questions/250937/which-loss-function-is-correct-for-logistic-regression)
- [Why there are two different logistic loss formulation / notations?](https://stats.stackexchange.com/questions/229645/why-there-are-two-different-logistic-loss-formulation-notations)
- [Logistic Regression](https://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf)
- [Logistic classification model - Maximum likelihood estimation](https://www.statlect.com/fundamentals-of-statistics/logistic-model-maximum-likelihood)
- [Robustness and Regularization of Support Vector Machines](http://www.jmlr.org/papers/volume10/xu09b/xu09b.pdf)
- [The Entire Regularization Path for the Support Vector Machine](https://web.stanford.edu/~hastie/Papers/svmpath_jmlr.pdf)
Note that
$$\hat{y}=\operatorname{sgn}[\ln(\frac{P(Y=1\mid x)}{1-P(Y=1\mid x)})]=\begin{cases}+1, & \text{if } P(Y=1\mid x)> P(Y\not=1\mid x);\\ -1, &\text{if } P(Y=1\mid x)< P(Y\not=1\mid x).\end{cases}$$
If we set
$$P(Y=1\mid x)=\frac{1}{2}+\frac{\operatorname{sgn}(\mathbb{w\cdot x}+b)}{2}(1-\exp(-\|\mathbb{w\cdot x}+b\|_1))$$
(the factor $\frac{1}{2}$ on the second term keeps the value in $[0,1]$),
then $P(Y\not=1\mid x)=\frac{1}{2}-\frac{\operatorname{sgn}(\mathbb{w\cdot x}+b)}{2}(1-\exp(-\|\mathbb{w\cdot x}+b\|_1))$ and the following condition also holds
$$\hat{y}=\operatorname{sgn}(\mathbb{w\cdot x}+b)=\operatorname{sgn}[\ln(\frac{P(Y=1\mid x)}{1-P(Y=1\mid x)})].$$
In our setting $y_i\in\{+1,-1\}$, we can get that
$$P(y\mid x)=\frac{1}{2}+\frac{y\operatorname{sgn}(\mathbb{w\cdot x}+b)}{2}(1-\exp(-\|\mathbb{w\cdot x}+b\|_1))$$
where $y\in\{+1,-1\}$.
We apply the maximum likelihood principle to estimate the parameters
$$\arg\max_{\mathbb w, b}\prod_{i=1}^m P(y_i\mid x_i)=\arg\min_{\mathbb w, b}\sum_{i=1}^m -\ln [\frac{1}{2}+\frac{y_i\operatorname{sgn}(\mathbb{w\cdot x_i}+b)}{2}(1-\exp(-\|\mathbb{w\cdot x_i}+b\|_1))]$$
where $\mathbb{w\cdot x_i}+b\not= 0$.
Note that $\lim_{x\to 0}\frac{-\ln(1+x)}{x}=-1$; writing $u_i=y_i\operatorname{sgn}(\mathbb{w\cdot x_i}+b)(1-\exp(-\|\mathbb{w\cdot x_i}+b\|_1))$, we get
$$-\ln[\frac{1+u_i}{2}]=\ln 2-\ln(1+u_i)\approx \ln 2-u_i\\=(\ln 2-1)+\underbrace{[1-y_i\operatorname{sgn}(\mathbb{w\cdot x_i}+b)]}_{\text{twice the 0-1 loss}}+y_i\operatorname{sgn}(\mathbb{w\cdot x_i}+b)\exp(-\|\mathbb{w\cdot x_i}+b\|_1)$$
so, up to an additive constant, this loss behaves like the 0-1 loss plus a term that decays with the margin $\|\mathbb{w\cdot x_i}+b\|_1$ when the prediction is right and grows with it when the prediction is wrong.
And we can induce more loss functions based on this approach.
A more general result states that Bayes consistent loss functions can be generated using the following formulation:
$$\phi (v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)]$$
where $f(\eta ),\ (0\leq \eta \leq 1)$ is any invertible function such that $f^{-1}(-v)=1-f^{-1}(v)$
and $C(\eta )$ is any differentiable strictly concave function such that $C(\eta )=C(1-\eta )$.
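For instance (a check of ours), taking the binary entropy $C(\eta)=-\eta\ln\eta-(1-\eta)\ln(1-\eta)$ and $f^{-1}(v)=\frac{1}{1+\exp(-v)}$ gives $C'(\eta)=\ln\frac{1-\eta}{\eta}=-v$, and the formula yields
$$\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))(-v)=\ln(1+\exp(-v)),$$
which is exactly the logistic loss.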
See [Loss functions for classification](https://en.wikipedia.org/wiki/Loss_functions_for_classification).
- https://en.wikipedia.org/wiki/Loss_functions_for_classification
- http://jmlr.org/papers/volume16/masnadi15a/masnadi15a.pdf
- [on the design of loss functions for classification theory robustness to outliers and savageboost](https://papers.nips.cc/paper/3591-on-the-design-of-loss-functions-for-classification-theory-robustness-to-outliers-and-savageboost.pdf)
---
If the training set $T$ is `linearly separable`, we can also train a decision tree (here a one-node tree with an oblique split, i.e., a decision stump) so that
it assigns the region $\mathbb{w\cdot x}+b>0$ to class $+1$ patterns
and the region $\mathbb{w\cdot x}+b<0$ to class $-1$ patterns.
In analytic form, this decision tree is defined as follows
$$\mathbb{I}(\mathbb{w\cdot x}+b>0)-(1-\mathbb{I}(\mathbb{w\cdot x}+b>0))\tag{Decision tree}$$
where $\mathbb{I}$ is the indicator function.
In general, a decision tree is grown by greedily minimizing a split criterion (such as the Gini index or an entropy) at each node; see the sketch below.
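A minimal sketch, assuming the `rpart` package (whose classification criteria are the Gini index and the information/entropy criterion):

``` {r split-criteria, eval=FALSE}
## Growing trees under two split criteria with rpart
library(rpart)
data(iris)
tree_gini <- rpart(Species ~ ., data = iris, parms = list(split = "gini"))
tree_info <- rpart(Species ~ ., data = iris, parms = list(split = "information"))
tree_gini
tree_info
```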
Tsallis entropy is defined by
\begin{equation}\label{Tsallis entropy}\tag{Tsallis entropy}
S_q(X) =\frac{1}{1-q}(\sum_{i=1}^n p(x_i)^q-1)
\end{equation}
which converges to Shannon entropy when $q \to 1$.
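A quick numeric check of this limit (our own sketch):

``` {r tsallis-limit, eval=FALSE}
## Tsallis entropy approaches Shannon entropy as q -> 1
tsallis <- function(p, q) (sum(p^q) - 1) / (1 - q)
shannon <- function(p) -sum(p * log(p))

p <- c(0.5, 0.3, 0.2)
sapply(c(2, 1.5, 1.1, 1.01, 1.001), function(q) tsallis(p, q))
shannon(p)   # the values above converge to this
```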
- https://www.benkuhn.net/tree-imp
- [How does a Decision Tree decide where to split?](http://www.ashukumar27.io/Decision-Trees-splitting/)
- [Selecting Multiway Splits in Decision Trees](https://www.cs.waikato.ac.nz/~ml/publications/1996/Frank-Witten96.pdf)
- [Unifying the Split Criteria of Decision Trees Using Tsallis Entropy](https://arxiv.org/pdf/1511.08136.pdf)
- https://daviddalpiaz.github.io/r4sl/trees.html
- https://en.wikipedia.org/wiki/R%C3%A9nyi_entropy
- https://online.stat.psu.edu/stat508/lesson/11/11.2
### Multi-class classification
A categorical variable is a discrete variable that can take on one of finitely many mutually exclusive states.
We follow the 1-of-K scheme proposed by [Christopher Bishop](https://www.microsoft.com/en-us/research/people/cmbishop/) in [Pattern Recognition and Machine Learning](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf).
The categorical variable is represented by a K-dimensional vector $x$
in which one of the elements $x_k$ equals $1$, and all remaining elements equal $0$,
i.e., one-hot vector.
For example, we use the vector $x$ to represent gender (male and female), so $x\in\{(0, 1)^T, (1, 0)^T\}$,
where $(0, 1)^T$ corresponds to male and $(1, 0)^T$ to female.
If we denote the probability of $x_k = 1$ by the parameter $\mu_k$,
then the distribution of $x$ is given
$$P(x\mid \mu)=\prod_{i=1}^K\mu_i^{x_i}$$
where $\sum_{i=1}^K\mu_i=\sum_{i=1}^Kx_i=1$ and $x_i\in\{0,1\}, \mu_i>0$ for $i=1,\cdots,K$.
And the expectation of the categorical random variable is computed by
$$\mathbb{E}(x)=\sum_{x}xP(x)=(\mu_1,\cdots,\mu_K)^T=\mu.$$
The marginal distribution for $x_k$ is obtained by summing the joint distribution over the remaining components:
$$P(x_k)=[\sum_{i\not=k}\mu_i]^{1-x_k}\mu_k^{x_k}=(1-\mu_k)^{1-x_k}\mu_k^{x_k}$$
and the marginal expectation of $x_k$ is $\mu_k$.
- [The Multinomial Distribution](https://dipmat.univpm.it/~demeio/Alabama_PDF/11.%20Bernoulli_Trials/Multinomial.pdf)
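A quick simulation (our own sketch) confirming $\mathbb{E}(x)=\mu$ for one-hot draws:

``` {r categorical-mean, eval=FALSE}
## The empirical mean of one-hot draws estimates mu
mu <- c(0.2, 0.5, 0.3)
draws <- t(rmultinom(10000, size = 1, prob = mu))   # 10000 x 3 one-hot matrix
colMeans(draws)                                     # approximately mu
```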
Given two distributions $p$ and $q$ over a given variable $X$, the cross entropy is defined as