---
title: "Prediction evaluation metric: loss functions"
author: "Zhang Jinxiong [email protected]"
date: "2020/5/3"
output:
html_document:
toc: yes
pdf_document:
toc: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Prediction evaluation metric: loss functions
In general, [learning consists of three components](https://dl.acm.org/doi/pdf/10.1145/2347736.2347755):
$$\text{Learning=Representation + Evaluation + Optimization}.$$
[The prediction evaluation metric should be selected to reflect domain-specific considerations, such as the types of errors that are more costly.](https://www.stat.berkeley.edu/~binyu/ps/papers2020/VDS20-YuKumbier.pdf)
[In fact, there is an entire area of research devoted to evaluating the quality of probabilistic forecasts through “scoring rules”.](https://www.stat.washington.edu/raftery/Research/PDF/Gneiting2007jasa.pdf)
Even though Zongben Xu proposes [the independence assumption of the loss function on the dataset](https://dl.acm.org/doi/pdf/10.1145/3397271.3402428),
we should keep in mind that loss function design depends on the `errors`.
- https://rohanvarma.me/Loss-Functions/
- [Probabilistic Setup and Empirical Risk Minimization](https://ttic.uchicago.edu/~tewari/lectures/lecture7.pdf)
- [10: Empirical Risk Minimization](https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote10.html)
- [Principles of risk minimization for learning theory](http://papers.nips.cc/paper/506-principles-of-risk-minimization-for-learning-theory.pdf)
- https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html
************
In the absence of prior knowledge about the data, we can only use general-purpose metrics such as Euclidean distance, cosine similarity or Manhattan distance,
but these metrics often fail to capture the correct behavior of the data, which directly affects the performance of the learning algorithm.
The solution to this problem is to tune the metric according to the data and the problem.
Manually deriving a metric for high-dimensional data, which is often difficult to even visualize, is not only tedious but extremely difficult,
which motivates the effort to learn a metric that respects the data geometry.
[The goal of a `metric learning` algorithm is to learn a metric that assigns a small distance to similar points and a relatively large distance to dissimilar points.](https://parajain.github.io/metric_learning_tutorial/)
- https://parajain.github.io/metric_learning_tutorial/
- https://en.wikipedia.org/wiki/Similarity_learning
Loss function| Cost function | Empirical risk| Structure risk
-----|----|---|---
$\ell(y, f(x;\theta))$ between the response $y$ of the supervisor to a given input $x$ and the response $f(x;\theta)$ provided by the learning machine|the objective function to minimize|$\frac{1}{N}\sum_{i=1}^N\ell(y_i , f(x_i;\theta))$ approximation to the expected value of the loss | $\frac{1}{N}\sum_{i=1}^N\ell(y_i , f(x_i;\theta))+\mathcal{R}(f)$ regularized empirical risk
We use the formula $f(x;\theta)$ to predict the response to the input $x$, where the model is parameterized by $\theta\in\Theta$.
The formula $f(\cdot;\cdot)$ is the representation of the machine learning algorithm.
We use the `structural risk minimization (SRM)` principle to find the optimal parameters $\theta^{\ast}$, i.e., to select
the best model over the representation space $\mathcal{F}=\{f(x;\theta)\mid \theta\in \Theta\}$:
$$\theta^{\ast}=\arg\min_{\theta\in\Theta} \frac{1}{N}\sum_{i=1}^N\ell(y_i , f(x_i;\theta))+\mathcal{R}(f)$$
where $\ell(y_i , f(x_i;\theta))$ is the predictive metric with respect to the input $x_i$ and its ground truth $y_i$.
Here we will focus on the design of the predictive metric for different tasks.
After optimization, we obtain the (sub)optimal model $f(x;\theta^{\ast})$.
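As a minimal sketch (an assumed example, not from the original text), the regularized empirical risk of a one-parameter linear model $f(x;\theta)=\theta x$ with squared loss and penalty $\mathcal{R}(f)=\lambda\theta^2$ can be minimized numerically:
```{r srm-sketch}
# Regularized empirical risk for f(x; theta) = theta * x with
# squared loss and R(f) = lambda * theta^2, minimized with optim()
set.seed(0)
x <- runif(50)
y <- 2 * x + rnorm(50, sd = 0.2)   # simulated data, true theta = 2
lambda <- 0.1                      # hypothetical regularization constant
risk <- function(theta) mean((y - theta * x)^2) + lambda * theta^2
optim(par = 0, fn = risk, method = "BFGS")$par
```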
Now we turn to the error decomposition of the trained model $f(x;\theta^{\ast})$ and the oracle model $\mathcal{o}$:
$$\|f(x;\theta^{\ast}) - \mathcal{o}\|=\|f(x;\theta^{\ast}) -f^{\ast}+ f^{\ast}-f_{*}+f_{*}-\mathcal{o}\|\\$$
$$\leq \underbrace{\|f(x;\theta^{\ast}) -f^{\ast}\|}_{\text{Computational error}} + \underbrace{\|f^{\ast}-f_{*}\|}_{\text{Approximation error}}+\underbrace{\|f_{*}-\mathcal{o}\|}_{\text{Representation error}}$$
where $f^{\ast}$ is the best model based on the training data set; $f_*$ is the best model constrainted in the hypothesis space $\mathcal{F}=\{f(x;\theta)\mid \theta\in \Theta\}$.
The predictive metric, i.e. the loss function $\ell(\cdot, \cdot)$, directly determines the difficulty and complexity of the optimization problem as well as the computational error.
***
Beyond supervised learning, we also need evaluation metrics that reflect how faithful the algorithms are, such as the goodness of fit.
If all models are approximations of the truth, what is the criterion or principle for choosing the best one?
As [Andrew R. Barron](http://www.stat.yale.edu/~arb4/publications.html) quotes
> [Thus learning is not possible without inductive bias, and now the question is how to choose the right bias. This is called `model selection`.](http://www.modelselection.org/)
- http://www.modelselection.org/
- https://www.ssc.wisc.edu/~bhansen/
- http://www.stat.yale.edu/~arb4/
- http://sp.cs.tut.fi/WITMSE08/
- [Modern MDL meets Data Mining — Insights, Theory, and Practice](http://eda.mmci.uni-saarland.de/events/mdldm19)
- https://cispa.de/en
- [Distance Geometry in data science](https://www.lix.polytechnique.fr/~liberti/dgds.pdf)
- http://iphome.hhi.de/samek/pdf/LapNCOMM19.pdf
## Regression loss
### Maximum likelihood estimation
Let us begin with the maximum likelihood estimate (mle).
In statistics, the **likelihood principle** is the proposition that,
given a statistical model, all the evidence in a sample relevant to model parameters is contained in the likelihood function.
In maximum likelihood estimation (mle), the samples $\{x_i\}_{i=1}^{N}$ are drawn independently and identically from a probability distribution $\mathcal{P}(\theta)$,
so that we can obtain their joint probability
$$P(x_1,\cdots,x_n)=\prod_{i=1}^N P(\theta;x_i)$$
where $P(\theta;x_i)\in [0, 1]$ is the probability of $x_i$.
It is also called the likelihood of the parameter $\theta$.
Here $P(\theta;x_i)$ is the value of the probability (distribution) function at the point $x_i$,
and $P(\theta;x)$ is parameterized by $\theta\in \Theta$.
In other words, we know the distribution family $\mathcal P$ parameterized by $\theta\in \Theta$.
Because we have observed samples $x_i$ generated from the probability (distribution) family $P(\theta)$,
the joint probability of $\{x_i\}_{i=1}^{N}$ cannot be zero.
If $P(x_1,\cdots,x_n)$ is too small, the event is too rare to observe;
thus it should be common enough for us to observe.
The maximum likelihood estimate (mle) of $\theta$ is the one that maximize the joint probability
\begin{equation}\label{mle}
\theta^{\ast}=\arg\max_{\theta\in \Theta} P(x_1,\cdots,x_n)=\arg\max_{\theta\in \Theta} \prod_{i=1}^N P(\theta;x_i).\end{equation}
```{r likelihood, echo=FALSE, fig.height=4, fig.width=4, paged.print=TRUE}
# Binomial likelihood (up to a constant) for 30 successes in 100 trials
q <- seq(0, 1, length = 100)
L <- function(q) { q^30 * (1 - q)^70 }
plot(q, L(q), ylab = "L(q)", xlab = "q", type = "l")
```
Because likelihoods may be very small numbers,
the product of the likelihoods can get very small very fast and can exceed the precision of the computer.
Because we only care about the relative likelihood,
the convention is to work with the log-likelihoods instead,
because `the log of a product is equal to the sum of the logs`.
Therefore, we can take the logarithm of each individual likelihood and add them together
and get the same end result, i.e.,
the parameter value that maximizes the likelihood ($L$) is equal to the parameter value that maximizes the log-likelihood $\ell$:
\begin{equation}\label{log-likelihood}\ell=\log(\prod_{i=1}^N P(\theta;x_i))=\sum_{i=1}^{N}\log P(\theta;x_i)\end{equation}
where $\log P(\theta;x_i)\leq 0$.
Because the logarithm function is monotonically increasing,
$$\theta^{\ast}=\arg\max_{\theta\in \Theta} P(x_1,\cdots,x_n)=\arg\max_{\theta\in \Theta} \sum_{i=1}^{N}\log P(\theta;x_i).$$
```{r log-likelihood, echo=FALSE, fig.height=4, fig.width=4, paged.print=TRUE}
# Relative likelihood L(q)/L(qhat), normalized by its maximum at qhat = 0.3
q <- seq(0, 1, length = 100)
L <- function(q) { q^30 * (1 - q)^70 }
plot(q, L(q)/L(0.3), ylab = "L(q)/L(qhat)", xlab = "q", type = "l")
```
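As a minimal numerical check (a sketch, not part of the original text), the log-likelihood plotted above can be maximized with `optimize()`; the maximizer should be close to the analytic value $0.3$.
```{r mle-numeric}
# Numerical maximization of the log-likelihood for 30 successes
# out of 100 trials; the analytic maximizer is 30/100 = 0.3
loglik <- function(q) 30 * log(q) + 70 * log(1 - q)
optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum
```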
Given our goodness(or badness)-of-fit measure, our next step is to find the values of the parameters
that give us the best fit – the so-called `maximum likelihood estimators`.
In short, we convert the parameter inference into the numerical optimization.
* [Likelihood Ratios, Likelihoodism, and the Law of Likelihood](https://plato.stanford.edu/entries/logic-inductive/sup-likelihood.html)
* <https://algorithmia.com/blog/introduction-to-loss-functions>
* [Analysis of Environmental Data Conceptual Foundations: Maximum Likelihood Inference](https://www.umass.edu/landeco/teaching/ecodata/schedule/likelihood.pdf)
Note the following relation
$$\theta^{\ast}=\arg\max_{\theta\in \Theta} \sum_{i=1}^{N}\log P(\theta;x_i) = \arg\max_{\theta\in \Theta}\frac{1}{N} \sum_{i=1}^{N}\log P(\theta;x_i)$$
where $N$ is a constant positive number.
Now we regard the term $\log P(\theta;x_i)$ as a transformation of the observation,
$$\theta^{\ast}
=\arg\max_{\theta\in \Theta}\mathbb{E}_{x\sim \hat{p}(x)} \log P(\theta;x)\\
=\arg\min_{\theta\in \Theta} -\mathbb{E}_{x\sim \hat{p}(x)} \log P(\theta;x)\\
=\arg\min_{\theta\in \Theta} \mathbb{E}_{x\sim \hat{p}(x)} -\log P(\theta;x)\\
=\arg\min_{\theta\in \Theta} \mathbb{E}_{x\sim \hat{p}(x)} \log\frac{1}{P(\theta;x)}.
$$
As $N\to\infty$, this average log-likelihood tends, with probability 1, to
$$\mathbb{E}_{x\sim \hat{p}(x)} \log\frac{1}{P(\theta;x)}=\mathbb{E}_{x\sim {p}^{true}(x)} \log\frac{1}{P(\theta;x)}\\
=\int_{x}p^{true}(x)\log\frac{1}{p(\theta;x)}\mathrm dx\\
=\int_{x}p^{true}(x)\log\frac{1}{p(\theta;x)}\mathrm dx+\underbrace{\int_{x}p^{true}(x)\log{p^{true}(x)}\mathrm dx}_{constant}-\underbrace{\int_{x}p^{true}(x)\log{p^{true}(x)}\mathrm dx}_{constant}\\
=\int_{x}p^{true}(x)\log\frac{p^{true}(x)}{p(\theta;x)}\mathrm dx- \underbrace{\int_{x}p^{true}(x)\log{p^{true}(x)}\mathrm dx}_{constant}\\
=KL(p^{true}(x)\mid p(\theta;x))-\underbrace{\int_{x}p^{true}(x)\log{p^{true}(x)}\mathrm dx}_{constant}$$
where ${p}^{true}(x)$ is the true probability distribution function of $x$.
So we conclude that
$$\theta^{\ast}=\arg\min_{\theta\in \Theta} KL(p^{true}(x)\mid p(\theta;x))=\arg\min_{\theta\in \Theta}\mathbb{E}_{x\sim {p}^{true}(x)}\log\frac{p^{true}(x)}{p(\theta;x)}$$
as $N\to\infty$.
In this sense, mle is to find the optimal approximation to the true probability distribution function constrained in the probability distribution function family $p(\theta;\cdot)$ given some observation.
- [Alternative form of max likelihood ](https://cedar.buffalo.edu/~srihari/CSE676/5.5%20MLBasics-MaxLikelihood.pdf)
- [Maximum Likelihood as minimizing KL Divergence](https://www.jessicayung.com/maximum-likelihood-as-minimising-kl-divergence/)
- [Why Minimize Negative Log Likelihood?](https://quantivity.wordpress.com/2011/05/23/why-minimize-negative-log-likelihood/)
------
The loss function of mle is based on the probability distribution family.
For example, $x_i\in \{0, 1\}\sim Bernoulli(p)$, the log-likelihood of $\{x_i\}_{i=1}^N$ is
$$\ell(p)=\log(\prod_{i=1}^{N}p^{x_i}(1-p)^{1-x_i})=\sum_{i=1}^N (x_i\log p+(1-x_i)\log (1-p))\\=\sum_{i=1}^N(x_i\log\frac{p}{1-p}+\log(1-p))\\=(\sum_{i=1}^Nx_i)\log\frac{p}{1-p}+N\log(1-p).$$
To find the mle, we set the derivative of $\ell(p)$ to zero
$$\frac{\mathrm d \ell(p)}{\mathrm d p}=(\sum_{i=1}^Nx_i)(\frac{1}{p}+\frac{1}{1-p})-\frac{N}{1-p}=0$$
and we get $p^{\ast}=\arg\max_{p}\ell(p)=\frac{\sum_{i=1}^Nx_i}{N}$.
For example, $x_i\sim Poisson(\lambda)$, the log-likelihood of $\{x_i\}_{i=1}^N$ is
$$\ell(\lambda)=\log(\prod_{i=1}^{N}\frac{\lambda^{x_i}\exp(-\lambda)}{x_i!})=\sum_{i=1}^N [x_i\log(\lambda)-\log x_i!]-N\lambda\propto \sum_{i=1}^N x_i\log(\lambda)-N\lambda$$
To find the mle, we set the derivative of $\ell(\lambda)$ to zero
$$\frac{\mathrm d \ell(\lambda)}{\mathrm d \lambda}=\sum_{i=1}^N\frac{x_i}{\lambda}-N=0$$
and we get $\lambda=\frac{\sum_{i=1}^N x_i}{N}$.
Let $x_1, \cdots , x_N$ be i.i.d. samples with Laplace density
$$P(\theta\mid \cdot)=\frac{1}{2}\exp(-\|x-\theta\|_1)$$
where $\theta\in\mathbb{R}$.
Their log-likelihood is
$$\ell(\theta)=\log(\prod_{i=1}^NP(\theta\mid x_i))=\log(\prod_{i=1}^N\frac{1}{2}\exp(-\|x_i-\theta\|_1))=-N\log(2)-\sum_{i=1}^N\|x_i-\theta\|_1$$
Observe that
$$\hat{\theta}=\arg\max_{\theta}\ell(\theta)=\arg\max_{\theta}-\sum_{i=1}^N\|x_i-\theta\|_1\\=\arg\min_{\theta}\underbrace{\sum_{i=1}^N\|x_i-\theta\|_1}_{\text{convex}}.$$
According to convex optimization, $0\in\partial \sum_{i=1}^N\|x_i-\hat\theta\|_1$
thus $\sum_{i=1}^N\operatorname{sgn}(x_i-\hat\theta)=0\implies \hat\theta=\operatorname{median}\{x_1,\cdots,x_N\}$.
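We can verify this numerically with a minimal sketch (simulated data, not from the original text); Laplace noise is simulated as the difference of two exponential variables.
```{r laplace-median}
# The minimizer of sum |x_i - theta| should coincide with the median
set.seed(42)
x <- rexp(200) - rexp(200)     # Laplace(0, 1) samples
l1loss <- function(theta) sum(abs(x - theta))
opt <- optimize(l1loss, interval = range(x))
c(numerical = opt$minimum, median = median(x))
```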
- https://www.colorado.edu/amath/sites/default/files/attached-files/ch4.pdf
- https://rpubs.com/FJRubio/logisMLE
- https://shodhganga.inflibnet.ac.in/bitstream/10603/21266/12/12_chapter%206.pdf
### Bayesian mle
Sometimes, domain-specific experience will tell us how to select appropriate parameters of the probability distribution functions.
[For example, adult male heights are on average 70 inches (5'10) with a standard deviation of 4 inches. Adult women are on average a bit shorter and less variable in height with a mean height of 65 inches (5'5) and standard deviation of 3.5 inches. ](https://www.usablestats.com/lessons/normal)
However, we know that human heights cannot be negative.
It is better to use the truncated normal distribution for adult heights.
The parameters must be compatible with the domain-specific experience.
[There's one key difference between frequentist statisticians and Bayesian statisticians that we first need to acknowledge before we can even begin to talk about how a Bayesian might estimate a population parameter $\theta$.](https://online.stat.psu.edu/stat414/node/241/)
Bayesian describes the mapping from prior beliefs about $\theta$, summarized in so-called prior density of $P(\theta)$, to new posterior beliefs in the light of observing the data.
[Maximum A Posteriori (MAP) estimator](https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/ppt/22-MAP.pdf) of $\theta$ is to maximize the posteriori of $\theta$
$$\theta^{\ast}=\arg\max_{\theta\in \Theta}P(\theta\mid x)\\=\arg\max_{\theta\in \Theta}\frac{P(\theta, x)}{P(x)}\\=\arg\max_{\theta\in \Theta}\frac{P(x\mid \theta)P(\theta)}{P(x)}\\=\arg\max_{\theta\in \Theta} P(x\mid \theta)P(\theta)$$
where $P(x)$ does not depend on the parameter $\theta$.
`Bayesian maximum likelihood estimate` is to maximize the likelihood with a penalty function
\begin{equation}\label{Bayesian mle}\theta^{\ast}=\arg\max_{\theta\in \Theta}\sum_{i=1}^{N}\log P(\theta;x_i)+\log P(\theta)\end{equation}
where $\sum_{i=1}^{N}\log P(\theta;x_i)$ is the likelihood of the parameter $\theta$ and $P(\theta)$ is the prior probability density function of $\theta$.
```{r Bayesian-likelihood, echo=FALSE, fig.height=4, fig.width=6, paged.print=TRUE}
# Log-likelihood of 30 successes in 100 trials plus the log of a
# normal prior density on q
q <- seq(0, 1, length = 100)
l <- function(q) { 30 * log(q) + 70 * log(1 - q) + log(dnorm(q)) }
plot(q, l(q), ylab = "Bayesian log-likelihood", xlab = "q", type = "b")
```
MAP estimate is the `mode of the posterior distribution`.
Another option of Bayesian estimate would be to choose the `posterior mean`
\begin{equation}\label{MMSE}\hat{\theta}=\mathbb E[\theta|X=x]=\int_{\theta\in\Theta}\theta P(\theta\mid x)\mathrm d \theta.\end{equation}
It is called minimum mean squared error (MMSE) estimate of the random variable $\theta$ given $x$.
- [Bayesian Maximum Likelihood](http://faculty.wcas.northwestern.edu/~lchrist/course/CIED_2012/bayesian.pdf)
- [Maximum Likelihood vs.Bayesian Estimation](http://www-edlab.cs.umass.edu/cs689/lectures/ml-vs-bayes-estimation.pdf)
- [9.1.2 Maximum A Posteriori (MAP) Estimation](https://www.probabilitycourse.com/chapter9/9_1_2_MAP_estimation.php)
- [9.1.4 Conditional Expectation (MMSE)](https://www.probabilitycourse.com/chapter9/9_1_4_conditional_expectation_MMSE.php)
- [ML, MAP, and Bayesian — The Holy Trinity of Parameter Estimation and Data Prediction](https://engineering.purdue.edu/kak/Trinity.pdf)
-----
Suppose that $Y$ follows a binomial distribution with parameters $n$ and $p = \theta$, so that the p.m.f. of $Y$ given $\theta$ is:
$$g(y|\theta) = \binom{n}{y}\theta^y(1-\theta)^{n-y}$$
for $y = 0, 1, 2, \cdots, n$.
Suppose that the prior p.d.f. of the parameter $\theta$ is the beta p.d.f., that is:
$$h(\theta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}$$
for $0 < \theta < 1$.
Find the posterior p.d.f of $\theta$, given that $Y = y$.
First we obtain the joint density function by multiplying the prior and conditional probability
$$P(y,\theta)=g(y|\theta)h(\theta)=\binom{n}{y}\theta^y(1-\theta)^{n-y}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}\\=\underbrace{\binom{n}{y}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}}_{\text{normalization factor}}\theta^{y+\alpha-1}(1-\theta)^{n-y+\beta-1}$$
and
$$\theta^{\ast}=\arg\max_{\theta\in (0, 1)}\log(P(y,\theta))=\arg\max_{\theta\in (0, 1)}(y+\alpha-1)\log(\theta)+(n-y+\beta-1)\log(1-\theta).$$
By setting the derivative of this objective to 0, we obtain the following equation
$$(y+\alpha-1)(1-\theta)-\theta(n-y+\beta-1)=y+\alpha-1-\theta(n-y+\beta-1+y+\alpha-1)=0$$
so we obtain $\theta^{\ast}=\frac{y+\alpha-1}{n+\alpha+\beta-2}$.
And $P(\theta\mid y)=\frac{P(y,\theta)}{\int_{\theta}P(y,\theta)\mathrm d \theta}\propto \theta^{y+\alpha-1}(1-\theta)^{n-y+\beta-1}$.
In other words, $\theta\mid y$ follows a $Beta(y+\alpha,\ n-y+\beta)$ distribution, and
$$\mathbb E(\theta\mid y)=\frac{y+\alpha}{n+\alpha+\beta}.$$
- [Notes for 6.864, ML vs. MAP (Sept. 24, 2009)](http://people.csail.mit.edu/regina/6864/mlvsmap.pdf)
- [Bayesian Estimation](https://online.stat.psu.edu/stat414/node/241/)
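The Beta-binomial example above can be checked numerically; the following sketch (with hypothetical values of $n$, $y$, $\alpha$, $\beta$) plots the posterior density and marks the MAP estimate and the posterior mean.
```{r beta-binomial-check, fig.height=4, fig.width=6}
# Posterior Beta(y + alpha, n - y + beta); the MAP estimate is
# (y + alpha - 1)/(n + alpha + beta - 2) and the posterior mean is
# (y + alpha)/(n + alpha + beta)
n <- 100; y <- 30; alpha <- 2; beta <- 2
map_est   <- (y + alpha - 1) / (n + alpha + beta - 2)
post_mean <- (y + alpha) / (n + alpha + beta)
theta <- seq(0.001, 0.999, length = 200)
plot(theta, dbeta(theta, y + alpha, n - y + beta), type = "l",
     xlab = "theta", ylab = "posterior density")
abline(v = map_est, col = 2)    # MAP (red)
abline(v = post_mean, col = 4)  # posterior mean (blue)
```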
------
The mle only takes the samples $\{x_i\}_{i=1}^N$ into consideration
while Bayesian mle also considers the parameters.
mle|Bayesian mle
---|----
point estimation of density function|Bayesian inference
$\arg\max_{\theta\in\Theta}P(x\mid\theta)$|$\arg\max_{\theta\in\Theta}P(x\mid\theta)P(\theta)$
$\theta$ is as the unknown parameter of the probability $P(x\mid \theta)$. | $\theta$ is the latent random factor to measure the rareness of the model $P(\cdot\mid \theta)$.
Bayesian mle is to find the frequent pattern of the model.
MAP answers the question: what is the most probable parameter, under the belief $P(\theta)$,
given that we have observed a sequence of samples generated by a model parameterized by $\theta$?
Bayesian estimation constrains the probability distribution of the parameter $\theta$,
ruling out some possibilities.
The prior $P(\theta)$ penalizes rare values of the parameters.
The core of Bayesian statistics is to use the prior probability of the parameters to decrease their uncertainty.
According to the product rule of probability, we can find that
$$P(x,\theta)=P(x\mid \theta)P(\theta)=P( \theta\mid x)P(x)\implies P( \theta\mid x)=\frac{P(x\mid \theta)P(\theta)}{P(x)}\propto P(x\mid \theta)P(\theta).$$
It is summarized as follows:
`posterior is proportional to likelihood times prior`.
In standard Bayesian notation, we use $\pi(\theta)$ to denote the `prior probability density function (pdf)` of parameter $\theta$ with support $S(\theta)$,
$L(y|\theta)$ the `likelihood function` (i.e. the pdf of data given the parameter) with support $S(Y |\theta)$,
$p(\theta|y)$ the `posterior pdf` with support $S(\theta\mid Y)$ of parameter given the data,
and $f(y)$ the unconditional pdf for the data with support $S(Y)$. Both $\theta$ and $y$ can be vectors.
From the joint pdf identity, $L(y|\theta)\pi(\theta) = p(\theta|y)f(y)$, the `Bayes formula`
$$ p(\theta|y)=\frac{L(y|\theta)\pi(\theta)}{f(y)}=\frac{L(y|\theta)\pi(\theta)}{\int_{S(\theta)}L(y|\theta)\pi(\theta)\mathrm d\theta}.$$
We can re-write the above joint pdf identity as $f(y)=\frac{L(y|\theta)\pi(\theta)}{p(\theta|y)}$.
Now for any fixed $\theta$, we can integrate both sides of the re-expressed joint pdf identity with respect to $y$
over $S(Y\mid \theta)$ and obtain the prior pdf at $\theta$
$$\int_{y\in S(Y\mid \theta)}f(y)\mathrm d y=\pi(\theta)\int_{y\in S(Y\mid \theta)}\frac{L(y|\theta)}{p(\theta|y)}\mathrm d y \\ \Downarrow \\ \pi(\theta)=\frac{\int_{y\in S(Y\mid \theta)}f(y)\mathrm d y}{\int_{y\in S(Y\mid \theta)}\frac{L(y|\theta)}{p(\theta|y)}\mathrm d y}=\int_{y\in S(Y\mid \theta)}f(y)\mathrm d y(\int_{y\in S(Y\mid \theta)}\frac{L(y|\theta)}{p(\theta|y)}\mathrm d y)^{-1}.$$
Under so-called positivity assumption, we will obtain the [Inversion of Bayes formula](https://www.intlpress.com/site/pub/files/_fulltext/journals/sii/2011/0004/0001/SII-2011-0004-0001-a010.pdf):
\begin{equation}\label{IBF}\pi(\theta)=(\int_{y\in S(Y\mid \theta)}\frac{L(y|\theta)}{p(\theta|y)}\mathrm d y)^{-1}\end{equation}
for $\theta \in S(\theta)$.
If we prefer some posterior pdf, we can use Inversion of Bayes formula to find a proper $\pi(\theta)$.
* [Inversion of Bayes formula](https://www.intlpress.com/site/pub/files/_fulltext/journals/sii/2011/0004/0001/SII-2011-0004-0001-a010.pdf)
* [Inverse Bayes Formulae (IBF)](http://web.hku.hk/~kaing/Section1_3.pdf)
* [Unexpected Journey to the Converse of Bayes’ Theorem](http://web.hku.hk/~kaing/HKSSinterview.pdf)
### Linear Regression
Now we turn to regression problem.
Simply speaking, regression is to find the relationship of variables.
The common setting of regression problem is training data set $(x_i, y_i)$ for $i=1,\cdots, N$
and the model space $\mathcal{M}=\{m(\theta)\mid \theta\in\Theta\}$, where $x_i\in\mathbb{R}^d$ and $y_i\in\mathbb{R}$.
And there is an oracle $m_o\in \mathcal{M}$ so that
$$y_i=m_o(x_i)+\varepsilon_i\tag{regression}$$
for $i=1,\cdots, N$, where $\{\varepsilon_i\}_{i=1}^N$ are random variables independently and identically distributed in some probability family.
```{r regression, echo=FALSE, fig.height=4, fig.width=6, paged.print=TRUE}
x <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120)
y <- c(10, 18, 25, 29, 30, 28, 25, 22, 18, 15, 11, 8)
fit <- lm(y ~ poly(x, 3)) ## polynomial of degree 3
plot(x, y) ## scatter plot (colour: black)
x0 <- seq(min(x), max(x), length = 20) ## prediction grid
y0 <- predict.lm(fit, newdata = list(x = x0)) ## predicted values
lines(x0, y0, col = 2) ## add regression curve (colour: red)
#https://stackoverflow.com/questions/39736847/plot-regression-line-in-r
```
[`The method of least squares` is about estimating parameters by minimizing the `squared discrepancies` between observed data, on the one hand, and their expected values on the other.](https://stat.ethz.ch/~geer/bsa199_o.pdf)
The least squares estimator, denoted by $\hat{\beta}$, is the value of $\theta$ that minimizes
$$\sum_{i=1}^{N}(y_i-m(x_i;\theta))^2.\tag{least squares method}$$
In another word,
$$\hat{\beta}=\arg\min_{\theta\in\Theta}\sum_{i=1}^{N}(y_i-m(x_i;\theta))^2=\arg\min_{\theta\in\Theta}\frac{1}{N}\sum_{i=1}^{N}(y_i-m(x_i;\theta))^2\\=\arg\max_{\theta\in\Theta}-\sum_{i=1}^{N}(y_i-m(x_i;\theta))^2=\arg\max_{\theta\in\Theta}\sum_{i=1}^{N}-(y_i-m(x_i;\theta))^2\\=\arg\max_{\theta\in\Theta}\prod_{i=1}^{N}\exp(-(\underbrace{y_i-m(x_i;\theta)}_{\varepsilon_i})^2)\\=\arg\max_{\theta\in\Theta}\prod_{i=1}^{N}\exp(-(\varepsilon_i)^2).$$
In some sense, least squares estimation is equivalent to maximum likelihood estimation with respect to the residual $\varepsilon_i$.
The basic model is the linear model, i.e., $m(x_i)=w\cdot x_i + b=\left<(w,\, b), (x_i,\, 1)\right>$.
Without loss of generality, we set $m(x)=\left<\theta, x\right>$.
The `ordinary least squares` estimate minimizes $\sum_{i=1}^N (y_i-\left<\theta, x_i\right>)^2$, i.e.,
$$\hat{\theta}=\arg\min_{\theta}\sum_{i=1}^N (y_i-\left<\theta, x_i\right>)^2=\arg\min_{\theta}\|y-X\theta\|_2^2\tag{ordinary least squares}$$
where $y=(y_1,\cdots, y_N)^T$ and $X=(x_1,\cdots,x_N)^T$.
According to the [first-order optimality conditions], we can obtain that
$$X^T(y-X\hat{\theta})=0\iff X^T X\hat{\theta}= X^T y$$
so we can obtain $\hat{\theta}=( X^T X)^{-1}X^Ty$ when $X^T X$ is invertible.
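As a quick sketch (simulated data, not from the original text), the normal-equation solution can be checked against R's built-in `lm()`:
```{r ols-check}
# Closed-form OLS estimate (X'X)^{-1} X'y versus lm()
set.seed(7)
X <- cbind(1, rnorm(40))                  # design matrix with intercept
y <- as.vector(X %*% c(1, 2) + rnorm(40))
theta_hat <- solve(t(X) %*% X, t(X) %*% y)
cbind(normal_eq = theta_hat, lm = coef(lm(y ~ X[, 2])))
```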
[Ordinary least squares method](https://www.albert.io/blog/key-assumptions-of-ols-econometrics-review/) holds some following assumptions:
1. The regression model is linear in the coefficients and the error term.
2. The error term has a population mean of zero.
3. All independent variables are uncorrelated with the error term.
4. Observations of the error term are uncorrelated with each other.
5. The error term has a constant variance (no heteroscedasticity).
6. No independent variable is a perfect linear function of other explanatory variables.
7. The error term is normally distributed (optional).
- https://www.itl.nist.gov/div898/handbook/pmd/section2/pmd21.htm
- https://statisticsbyjim.com/regression/ols-linear-regression-assumptions/
- https://www.econometrics-with-r.org/4-4-tlsa.html
- [Least squares and maximum likelihood by M.R.Osborne](https://maths-people.anu.edu.au/~mike/lsnml.pdf)
- [Least squares and maximum likelihood estimation](https://bookdown.org/egarpor/PM-UC3M/app-ext-mle.html)
From the computational perspective, we add a regularization term to deal with ill-posed problems.
For example, [Ridge Regression is a technique for analyzing multiple regression data that suffer from multicollinearity](https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Ridge_Regression.pdf) by
$$\hat{\theta}=\arg\min_{\theta}\sum_{i=1}^N (y_i-\left<\theta, x_i\right>)^2+\lambda\sum_{i=1}^p\theta_i^2=\arg\min_{\theta}\|y-X\theta\|_2^2+\lambda\|\theta\|_2^2\tag{Ridge Regression}.$$
According to the [first-order optimality conditions](https://web.stanford.edu/class/msande312/restricted/OPTconditions.pdf), we can obtain that
$$-X^T(y-X\hat{\theta})+\lambda\hat\theta=0\iff (X^T X+\lambda I)\hat{\theta}= X^T y$$
so that $\hat\theta=(X^T X+\lambda I)^{-1}X^T y$.
Note that $\theta=\frac{\partial m(x)}{\partial x}=\frac{\partial }{\partial x} x^T\theta$, so we can apply the regularization technique to nonlinear models as well.
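A minimal sketch (simulated data and a hypothetical $\lambda$, not from the original text) of the closed form $(X^TX+\lambda I)^{-1}X^Ty$; setting $\lambda=0$ recovers ordinary least squares.
```{r ridge-check}
# Ridge estimate via its closed form; lambda = 0 gives OLS
set.seed(1)
n <- 50; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(1, -2, 0.5) + rnorm(n, sd = 0.5)
ridge <- function(lambda) solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
cbind(ols = ridge(0), ridge = ridge(10))   # shrinkage toward zero
```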
- http://statweb.stanford.edu/~owen/courses/305-1314/Rudyregularization.pdf
- http://www.few.vu.nl/~wvanwie/Courses/HighdimensionalDataAnalysis/WNvanWieringen_HDDA_Lecture234_RidgeRegression_20182019.pdf
- https://en.wikipedia.org/wiki/Regularized_least_squares
[first-order optimality conditions]: https://web.stanford.edu/class/msande312/restricted/OPTconditions.pdf
----
The method of `weighted least squares` can be used when the ordinary least squares assumption of constant variance in the errors is violated (which is called heteroscedasticity).
Simply, it is not necessary to think all the errors(residuals) are equally weighted:
$$\hat{\theta}=\arg\min_{\theta}\sum_{i=1}^N w_i^2(y_i-\left<\theta, x_i\right>)^2=\arg\min_{\theta}\|W(y-X\theta)\|_2^2\tag{weighted least squares}.$$
According to [first-order optimality conditions], we can obtain that $(WX)^T(Wy-WX\hat{\theta})=0\iff (WX)^T WX\hat{\theta}= (WX)^T Wy$ and
$$\hat\theta=[(WX)^T WX]^{-1}(WX)^T Wy$$
where $W=\operatorname{diag}(w_1,\cdots,w_n)$.
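A short sketch (simulated heteroscedastic data and hypothetical weights) comparing the closed form above with `lm(..., weights = )`; the `lm()` weights multiply the squared residuals and therefore correspond to $w_i^2$.
```{r wls-check}
# Weighted least squares: [(WX)'WX]^{-1} (WX)'Wy versus lm()
set.seed(3)
x <- runif(60)
y <- 1 + 2 * x + rnorm(60, sd = 0.1 + x)   # heteroscedastic noise
w <- 1 / (0.1 + x)                         # down-weight noisy points
X <- cbind(1, x); W <- diag(w)
theta_cf <- solve(t(W %*% X) %*% (W %*% X), t(W %*% X) %*% (W %*% y))
cbind(closed_form = theta_cf, lm = coef(lm(y ~ x, weights = w^2)))
```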
<img src="https://online.stat.psu.edu/onlinecourses/sites/stat508/files/lesson04/ridge_regression_geomteric.png" width="40%"/>
- [13.1 - Weighted Least Squares](https://online.stat.psu.edu/stat501/lesson/13/13.1)
- https://online.stat.psu.edu/stat501/lesson/13/13.3
- [5.1 - Ridge Regression](https://online.stat.psu.edu/stat508/lesson/5/5.1)
- https://www.itl.nist.gov/div898/handbook/pmd/section1/pmd143.htm
- https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/24/lecture-24--25.pdf
----
Regularization is a popular approach to reducing a model’s predisposition to overfit on the training data
[and thus hopefully increasing the generalization ability of the model.](https://rohanvarma.me/Regularization/)
The Lasso is a `shrinkage and selection` method for linear regression. It minimizes the usual sum of squared errors,
with a bound on the sum of the absolute values of the coefficients.
[It has connections to soft-thresholding of wavelet coefficients, forward stagewise regression, and boosting methods.](http://statweb.stanford.edu/~tibs/lasso.html)
The `lasso` estimate is defined as
$$\hat{\theta}=\arg\min_{\theta\in\mathbb{R}^p}\sum_{i=1}^N (y_i-\left<\theta, x_i\right>)^2+\lambda\sum_{i=1}^{p}|\theta_i|=\arg\min_{\theta\in\mathbb{R}^p}\|y-X\theta\|_2^2+\lambda\|\theta\|_1$$
The optimality condition is
$$-X^T(y-X\theta)+\lambda S(\theta)=0$$
where $S(\theta)$ is a subgradient of $\|\theta\|_1$.
Lasso penalty corresponds to Double-exponential prior:
$\arg\min_{\theta\in\mathbb{R}^p}\|y-X\theta\|_2^2+\lambda\|\theta\|_1=\arg\max_{\theta\in\Theta}\exp(-\|y-X\theta\|_2^2)\underbrace{\exp(-\lambda \|\theta\|_1)}_{\text{Double-exponential prior}}$.
Ridge Regression penalty corresponds to Gaussian prior:
$\arg\min_{\theta\in\mathbb{R}^p}\|y-X\theta\|_2^2+\lambda\|\theta\|_2^2=\arg\max_{\theta\in\Theta}\exp(-\|y-X\theta\|_2^2)\underbrace{\exp(-\lambda \|\theta\|_2^2)}_{\text{Gaussian prior}}$.
Elastic Net penalty corresponds to a new prior:
$\arg\min_{\theta\in\mathbb{R}^p}\|y-X\theta\|_2^2+\alpha\|\theta\|_2^2+(1-\alpha)\|\theta\|_1=\arg\max_{\theta\in\Theta}\exp(-\|y-X\theta\|_2^2)\exp(-\alpha\|\theta\|_2^2-(1-\alpha)\|\theta\|_1)$
The lasso does `variable selection and shrinkage`,
whereas ridge regression, in contrast, `only shrinks`.
Different priors have different effects.
- https://online.stat.psu.edu/stat508/lesson/5
- https://www.cvxpy.org/examples/machine_learning/lasso_regression.html
- http://statweb.stanford.edu/~tibs/lasso.html
- https://trevorhastie.github.io/
- [The Bayesian Lasso slide](https://www2.stat.duke.edu/courses/Fall17/sta521/knitr/Lec-13-Bayes-VarSel/bayes-varsel.pdf)
- [The Bayesian Lasso](https://people.eecs.berkeley.edu/~jordan/courses/260-spring09/other-readings/park-casella.pdf)
- [Lecture 3: More on regularization. Bayesian vs maximum likelihood learning](https://www.cs.mcgill.ca/~dprecup/courses/ML/Lectures/ml-lecture03.pdf)
- [Bayesian Interpretations of Regularization](https://www.mit.edu/~9.520/spring09/Classes/class15-bayes.pdf)
- https://rohanvarma.me/Regularization/
- [The Learning Problem and Regularization](https://www.mit.edu/~9.520/spring11/slides/class02.pdf)
- [New and Evolving Roles of Shrinkage in Large-Scale Prediction and Inference](https://www.birs.ca/events/2019/5-day-workshops/19w5188/schedule#)
```{r regularization, echo=FALSE, fig.height=10, fig.width=6}
library(MASS) # Package needed to generate correlated predictors
library(glmnet) # Package to fit ridge/lasso/elastic net models
library(Matrix)
# Generate data
set.seed(19875) # Set seed for reproducibility
n <- 1000 # Number of observations
p <- 5000 # Number of predictors included in model
real_p <- 15 # Number of true predictors
x <- matrix(rnorm(n*p), nrow=n, ncol=p)
y <- apply(x[,1:real_p], 1, sum) + rnorm(n)
# Split data into train (2/3) and test (1/3) sets
train_rows <- sample(1:n, .66*n)
x.train <- x[train_rows, ]
x.test <- x[-train_rows, ]
y.train <- y[train_rows]
y.test <- y[-train_rows]
fit.lasso <- glmnet(x.train, y.train, family="gaussian", alpha=1)
fit.ridge <- glmnet(x.train, y.train, family="gaussian", alpha=0)
fit.elnet <- glmnet(x.train, y.train, family="gaussian", alpha=.5)
# 10-fold Cross validation for each alpha = 0, 0.1, ... , 0.9, 1.0
# (For plots on Right)
for (i in 0:10) {
  assign(paste("fit", i, sep=""),
         cv.glmnet(x.train, y.train, type.measure="mse",
                   alpha=i/10, family="gaussian"))
}
# Plot solution paths:
par(mfrow=c(3,2))
# For plotting options, type '?plot.glmnet' in R console
plot(fit.lasso, xvar="lambda")
plot(fit10, main="LASSO")
plot(fit.ridge, xvar="lambda")
plot(fit0, main="Ridge")
plot(fit.elnet, xvar="lambda")
plot(fit5, main="Elastic Net")
#https://www4.stat.ncsu.edu/~post/josh/LASSO_Ridge_Elastic_Net_-_Examples.html
```
When the number of predictors is large compared to the number of observations,
$X$ is likely to be singular and the regression approach is no longer feasible.
`Partial Least Squares`(pls) finds components from $X$ that are also relevant for $Y$.
Specifically, pls regression searches for a set of components (called latent vectors)
that performs a simultaneous decomposition of $X$ and $Y$ with the
constraint that these components explain as much as possible of the covariance between $X$ and $Y$.
- https://stats.idre.ucla.edu/wp-content/uploads/2016/02/pls.pdf
- https://personal.utdallas.edu/~herve/Abdi-PLS-pretty.pdf
- https://en.wikipedia.org/wiki/Partial_least_squares_regression
- https://www.camo.com/resources/pls-regression.html
- http://virtuallaboratory.org/lab/pls/
### Bayesian Regularization
`Bayesian Regularization` applies Bayesian statistics to regularization techniques.
The inverse Bayes formula tells us that we can derive the prior from the posterior.
The `sparsity inducing priors` used in the Bayesian approach can broadly be classified as
* A single continuous shrinkage prior such as the Double Exponential prior and the Horseshoe prior;
* Two-group spike-and-slab prior, such as the spike-and-slab Normal prior and spike-and-slab Lasso prior.
The `horseshoe prior` is a member of the family of multivariate scale mixtures of normals,
and is therefore closely related to widely used approaches for sparse Bayesian learning, including, among others, Laplacian priors (e.g. the LASSO) and Student-t priors (e.g. the relevance vector machine).
The probability density function of Cauchy distribution is defined as
$$f(y;\mu,\sigma) = \frac{1}{\pi \sigma}\,\frac{1}{1 + (y-\mu)^2/\sigma^2}.$$
The Half-Cauchy distribution is supported on the set of all real numbers that are greater than or equal to $\mu$, that is on $[\mu, \infty)$ with the following pdf:
$$f(y;\mu, \sigma) = \begin{cases} \frac{2}{\pi \sigma}\,\frac{1}{1 + (y-\mu)^2/\sigma^2}, & y \ge \mu, \\ 0, & \text{otherwise}. \end{cases}$$
By a spike and slab model we mean a Bayesian model specified by the following prior hierarchy:
$$\tag{The spike and slab model}
(Y_i\mid x_i,\theta,\sigma^2)\sim \mathcal{N}(\theta^T x_i, \sigma^2) \\
(\theta\mid\Gamma)\sim \mathcal{N}(0, \Gamma)\\
\Gamma\sim \pi(d\gamma)\\
\sigma^2\sim \mu(d\sigma^2).
$$
- https://github.com/AsaCooperStickland/Spike_And_Slab
- http://proceedings.mlr.press/v5/carvalho09a.html
- https://projecteuclid.org/euclid.aos/1117114335
During the modeling stage a critical design decision concerns choosing between model complexity and model expressiveness.
Although one wishes to use a model that can explain as closely as possible the particular application’s generative process,
one has to restrict oneself to models that are computationally tractable.
This leads to models that are generally an oversimplification of the inherent process.
These models, due to their simplifying assumptions, fail to learn the parameters settings that produce the desired predictions.
One solution to this problem is to add more complexity to the model to better reflect the underlying process.
The posterior regularization framework incorporates side-information into parameter estimation in the form of linear constraints on posterior expectations,
which allows tractable learning and inference even when the constraints would be intractable to encode directly in the model parameters.
By defining a flexible language for specifying diverse types of problem-specific prior knowledge,
we make the framework applicable to a wide variety of probabilistic models, both generative and discriminative.
- [Bayesian Regularization](http://hedibert.org/wp-content/uploads/2015/12/BayesianRegularization.pdf)
- [Mixtures of g Priors for Bayesian Variable Selection](https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/readings/liang-etal.pdf)
- [29 : Posterior Regularization](https://www.cs.cmu.edu/~epxing/Class/10708-14/scribe_notes/scribe_note_lecture29.pdf)
- [New and Evolving Roles of Shrinkage in Large-Scale Prediction and Inference](https://www.birs.ca/workshops/2019/19w5188/report19w5188.pdf)
- [Shrinkage priors for Bayesian prediction](https://projecteuclid.org/euclid.aos/1151418241)
- [Posterior Regularization for Structured Latent Variable Models](http://jmlr.csail.mit.edu/papers/volume11/ganchev10a/ganchev10a.pdf)
```{r LaplacesDemon, echo=FALSE, fig.height=4, fig.width=6}
library(LaplacesDemon)
x <- rnorm(100)
lambda <- rhalfcauchy(100, 5)            # local shrinkage scales
tau <- 5                                 # global shrinkage scale
dens <- dhs(x, lambda, tau, log=TRUE)    # horseshoe log-density at x
x <- rhs(100, lambda=lambda, tau=tau)    # draws from the horseshoe prior
plot(density(x))
#https://www.rdocumentation.org/packages/LaplacesDemon/versions/16.1.1/topics/dist.Horseshoe
```
----
Last but not least, we introduce more on Bayesian statistics.
- https://bayesian.org/isba2020-home/
- https://www.stats.ox.ac.uk/bnp12/
- [Objections to Bayesian statistics](http://www.stat.columbia.edu/~gelman/research/published/badbayesmain.pdf)
- [The Limitation of Bayesianism](https://www.cc.gatech.edu/~isbell/classes/reading/papers/wang.bayesianism.pdf)
- [Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric?](http://gpss.cc/bark08/slides/1%20zoubin.pdf)
- http://www.math.leidenuniv.nl/~avdvaart/talks/ICM.pdf
- http://www2.stat.duke.edu/~rcs46/lectures_2015/14-bayes1/14-bayes3.pdf
- http://courses.ieor.berkeley.edu/ieor165/lecture_notes/ieor165_lec8.pdf
### Robust regression
[Robust regression can be used in any situation where OLS regression can be applied. It generally gives better accuracies over OLS because it uses a weighting mechanism to weigh down the influential observations. It is particularly resourceful when there are no compelling reasons to exclude outliers in your data.](http://r-statistics.co/Robust-Regression-With-R.html)
- https://stats.idre.ucla.edu/r/dae/robust-regression/
- [13.3 - Robust Regression Methods](https://online.stat.psu.edu/stat501/lesson/13/13.3)
- http://r-statistics.co/Robust-Regression-With-R.html
The least absolute deviation (LAD) method, which is also known as the $L_1$ method,
provides a useful and plausible alternative to least squares (LS) method.
The least absolute deviation (LAD) method is to minimize
$$\sum_{i=1}^{N}|\underbrace{y_i-m(x_i;\theta)}_{\varepsilon_i}|\tag{least absolute deviation}$$
which in turn minimizes the absolute value of the residuals.
In other words, $\varepsilon_i$ is generated from a `Laplacian distribution`.
- [Analysis of least absolute deviation](https://www.math.ust.hk/~makchen/papers/LAD.pdf)
This achieves robustness, but is hard to work with in practice
because the absolute value function is `not differentiable`.
[Peter Huber](https://statmodeling.stat.columbia.edu/2011/05/01/peter_hubers_th/) defines a loss function
$$\rho(\varepsilon_i)=\begin{cases}\varepsilon_i^2, &\text{if } |\varepsilon_i|<c,\\ 2c|\varepsilon_i| -c^2, &\text{otherwise}.\end{cases}\tag{Huber}$$
`Log Hyperbolic Cosine Loss` is defined as
$$\rho(\varepsilon_i)=\log\frac{\exp(\varepsilon_i)+\exp(-\varepsilon_i)}{2}\tag{log-cosh}=\log\operatorname{cosh}(\varepsilon_i).$$
Log-cosh is a smooth approximation to the Huber loss:
$$\lim_{\varepsilon_i\to 0}\frac{\log\operatorname{cosh}(\varepsilon_i)}{\varepsilon_i^2}=\frac{1}{2},\qquad \lim_{|\varepsilon_i|\to \infty}\frac{\log\operatorname{cosh}(\varepsilon_i)}{|\varepsilon_i|}=1.$$
- [Log Hyperbolic Cosine Loss Improves Variational Auto-Encoder](https://openreview.net/forum?id=rkglvsC9Ym)
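The following sketch (not from the original text) plots the squared, absolute, Huber (with $c=1$) and log-cosh losses against the residual, showing that the robust losses grow only linearly in the tails.
```{r robust-losses, fig.height=4, fig.width=6}
# Squared, absolute, Huber (c = 1) and log-cosh losses
r <- seq(-3, 3, length = 200)
huber <- function(r, c = 1) ifelse(abs(r) < c, r^2, 2 * c * abs(r) - c^2)
plot(r, r^2, type = "l", xlab = "residual", ylab = "loss")
lines(r, abs(r), col = 2)
lines(r, huber(r), col = 3)
lines(r, log(cosh(r)), col = 4)
legend("top", c("squared", "absolute", "Huber", "log-cosh"),
       col = 1:4, lty = 1, cex = 0.8)
```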
If an observation has a response value that is very different from the predicted value based on a model,
then that observation is called an outlier.
On the other hand, if an observation has a particularly unusual combination of predictor values (e.g., one predictor has a very different value for that observation compared with all the other data observations),
then that observation is said to have high leverage.
Thus, there is a distinction between outliers and high leverage observations, and each can impact our regression analyses differently.
It is also possible for an observation to be both an outlier and have high leverage.
Thus, it is important to know how to detect outliers and high leverage data points.
- https://online.stat.psu.edu/stat501/lesson/11
### Penalized mle
The penalty term in Bayesian mle requires that $P(\theta)$ is a probability density (distribution) function, i.e., $\int_{\theta}P(\theta)\mathrm d\theta=1$.
However, it is not easy to find appropriate prior.
Penalized least squares are to minimize the
\begin{equation}\label{ Penalized mle}\arg\min_{\theta\in\mathbb{R}^p}\|y-X\theta\|_2^2+\sum_{i=1}^{p}\lambda_ip_i(|\theta_i|)\end{equation}
where the penalty functions $p_i$ are not necessarily the same for all $i$.
[One distinguishing feature of the nonconcave penalized likelihood approach is that it can simultaneously select variables and estimate coefficients of variables.](http://www.math.hkbu.edu.hk/~hpeng/Paper/modsel.pdf)
- [Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties](https://orfe.princeton.edu/~jqfan/papers/01/penlike.pdf)
- [Nonconcave penalized likelihood with a diverging number of parameters](https://arxiv.org/abs/math/0406466)
- https://www.sis.uta.fi/tilasto/mttapu/runze.pdf
- https://ned.ipac.caltech.edu/level5/March02/Silverman/Silver2_8.html
- https://www.ccg.unam.mx/~vinuesa/tlem/pdfs/Sanderson_PL_2002.pdf
- http://people.vcu.edu/~dbandyop/BIOS625/Penalized.pdf
### Regression Discontinuity Design
We can deal with the categorical independent variable.
`Indicator variable, instrumental variable`
- [Endogenous Regressors and Instrumental Variables](https://eml.berkeley.edu/~powell/e240b_sp10/ivnotes.pdf)
- https://www.mailman.columbia.edu/research/population-health-methods/instrumental-variables
- [8.2 - The Basics of Indicator Variables](https://online.stat.psu.edu/stat462/node/161/)
- https://idss.mit.edu/calendar/idss-distinguished-speaker-seminar-rocio-titiunik/
## Classification Loss
Simply speaking, classification is to learn a mapping
$$\mathbb{R}^d\to \mathbb{D}$$
where $\mathbb{D}$ is the finite categorical set.
Classification is about categorical variable.
[A categorical variable (sometimes called a nominal variable) is one that has two or more categories, but there is no intrinsic ordering to the categories. For example, gender is a categorical variable having two categories (male and female) and there is no intrinsic ordering to the categories. Hair color is also a categorical variable having a number of categories (blonde, brown, brunette, red, etc.) and again, there is no agreed way to order these from highest to lowest. A purely categorical variable is one that simply allows you to assign categories but you cannot clearly order the variables.](https://stats.idre.ucla.edu/other/mult-pkg/whatstat/what-is-the-difference-between-categorical-ordinal-and-numerical-variables/)
[If the variable has a clear ordering, then that variable would be an ordinal variable, as described below.](https://stats.idre.ucla.edu/other/mult-pkg/whatstat/what-is-the-difference-between-categorical-ordinal-and-numerical-variables/)
There is no arithmetic and magnitude concept in the world of categorical variables.
Purely categorical variables are not comparable.
In a computer, they are usually represented as strings or chars.
Like other common objects in a computer, we can test whether two categorical variables share the same literal value.
- https://online.stat.psu.edu/stat504/node/6/
- [How good is your classifier? Revisiting the role of evaluation metrics in machine learning](https://talks.cam.ac.uk/talk/index/128293)
----
[Category Encoders](http://contrib.scikit-learn.org/category_encoders/index.html) is a set of scikit-learn-style transformers for encoding categorical variables into numeric with different techniques.
- http://contrib.scikit-learn.org/category_encoders/
- https://brendanhasz.github.io/2019/03/04/target-encoding.html
- https://www.kaggle.com/waydeherman/tutorial-categorical-encoding
- https://kiwidamien.github.io/encoding-categorical-variables.html
- https://kiwidamien.github.io/james-stein-encoder.html
- https://kiwidamien.github.io/are-you-getting-burned-by-one-hot-encoding.html
----
- https://rohanvarma.me/Loss-Functions/
- https://people.eecs.berkeley.edu/~wainwrig/stat241b/lec11.pdf
- https://people.cs.umass.edu/~akshay/courses/cs690m/files/lec12.pdf
- https://www.cse.huji.ac.il/~daphna/theses/Alon_Cohen_2014.pdf
- [About loss functions, regularization and joint losses : multinomial logistic, cross entropy, square errors, euclidian, hinge, Crammer and Singer, one versus all, squared hinge, absolute value, infogain, L1 / L2 - Frobenius / L2,1 norms, connectionist temporal classification loss](http://christopher5106.github.io/deep/learning/2016/09/16/about-loss-functions-multinomial-logistic-logarithm-cross-entropy-square-errors-euclidian-absolute-frobenius-hinge.html)
### 0-1 loss and its surrogates
It is natural to apply the 0-1 loss function in classification problems given the lack of arithmetic operations on categories.
The 0-1 loss is defined as following
\begin{equation}\label{0-1 loss}
L(\hat{y}, y)=\begin{cases}1,&\text{if $\hat{y}\not= y$}\\
0,&\text{if $\hat{y}= y$}\end{cases}=\mathbb{I}(\hat{y}\not= y).
\end{equation}
where $\mathbb{I}(\cdot)$ is the indicator function and $\hat y, y$ are categorical variables.
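As a minimal sketch (hypothetical labels, not from the original text), the empirical average of the 0-1 loss is simply the misclassification rate:
```{r zero-one-loss}
# The 0-1 loss is the indicator of a wrong prediction;
# averaging it over a sample gives the error rate
y    <- c("a", "b", "a", "a")
yhat <- c("a", "a", "a", "b")
mean(yhat != y)
```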
And [this loss will induce the discrete topology space.](https://proofwiki.org/wiki/Definition:Discrete_Topology)
[This metric "shatters" the points, isolating each one within its own unit ball.](https://math.stackexchange.com/questions/2614268/why-is-a-discrete-topology-called-a-discrete-topology)
- [Why is a discrete topology called a discrete topology?](https://math.stackexchange.com/questions/2614268/why-is-a-discrete-topology-called-a-discrete-topology)
- https://proofwiki.org/wiki/Category:Discrete_Topology
- https://stattrek.com/multiple-regression/dummy-variables.aspx
---
For example, it is used in linear regression of an indicator matrix.
- https://online.stat.psu.edu/stat508/lesson/8/8.5
- http://users.umiacs.umd.edu/~hcorrada/PracticalML/pdf/lectures/classification.pdf
### Binary classification
An n-dimensional pattern (object) $\mathbb x$ has $n$ coordinates, $x=(x_1, x_2, \cdots, x_n)$,
where each $x_i$ is a real number, $x_i\in\mathbb{R}$ for $i = 1, 2,\cdots, n$.
Each pattern $\mathbb x_j$ belongs to a class $y_j\in\{-1, +1\}$.
Consider a training set $T$ of $m$ patterns together with their classes, $T=\{(\mathbb{x}_1, y_1), (\mathbb{x}_2, y_2), \cdots, (\mathbb{x}_m, y_m)\}$.
Consider a dot product space $S$, in which the patterns $x$ are embedded, $x_1, x_2,\cdots, x_m\in S$.
Any hyperplane in the space $S$ can be written as
$$\{\mathbb x\in S\mid \mathbb w\cdot\mathbb x+b=0\}, \mathbb w\in S, b\in\mathbb{R}.$$
The dot product $w\cdot\mathbb x$ is defined by:
$$\mathbb w\cdot\mathbb x=\sum_{i}^n w_ix_i$$
A training set of patterns is `linearly separable` if there exists at least one linear classifier defined by the pair $(\mathbb w, b)$
which correctly classifies all training patterns.
This linear classifier is represented by the hyperplane $H$ $(\mathbb{w\cdot x}+b=0)$ and
defines a region for class $+1$ patterns $(\mathbb{w\cdot x}+b>0)$ and another region for class $-1$ patterns $(\mathbb{w\cdot x}+b<0)$.
The linear classifiers are in the following form
\begin{equation}\hat{y}=\operatorname{sgn}(\mathbb{w\cdot x}+b)
=\begin{cases}1, &\text{if $\mathbb{w\cdot x}+b> 0$;}\\
-1, &\text{if $\mathbb{w\cdot x}+b< 0$.}\end{cases}
\end{equation}
In this case, we can re-express the loss \ref{0-1 loss} as follows
\begin{equation}\tag{binary loss}\label{binary loss}
L(\hat{y}, y)=\begin{cases}1,&\text{if $\hat{y}\not= y$}\\
0,&\text{if $\hat{y}= y$}\end{cases}
=\frac{1}{2}(1-\operatorname{sgn}(\mathbb{w\cdot x}+b)y).
\end{equation}
The optimization procedure will minimize the loss over the training data set $T$
$$\sum_{i=1}^{m} \frac{1}{2}(1-\operatorname{sgn}(\mathbb{w\cdot x_i}+b)\mathbb{y}_i)$$
Since $\operatorname{sgn}(t)=\operatorname{sgn}(t/\alpha)$ for any positive $\alpha$, rescaling the parameters leaves the minimizer unchanged:
$$(w^*, b^*)=
\arg\min_{\mathbb{w}, b} \sum_{i=1}^{m} \frac{1}{2}(1-\operatorname{sgn}(\mathbb{w\cdot x_i}+b)\mathbb{y}_i)\\=
\arg\min_{\mathbb{w}, b} \sum_{i=1}^{m} \frac{1}{2}(1-\operatorname{sgn}(\frac{1}{\alpha}[\mathbb{w\cdot x_i}+b])\mathbb{y}_i)
$$
In particular, we may take $\alpha\leq \min\{|\mathbb{w^*\cdot x_i}+b^*|\mid i=1,\cdots, m\}$ so that $\frac{1}{\alpha}|\mathbb{w^*\cdot x_i}+b^*|\geq 1$ for all $i$.
The above objective is `non-differentiable and non-convex`.
If the training set $T$ is `linearly separable`, we can transfer the linear separability condition into constraints
\begin{equation}\label{SVM}\tag{SVM}
\begin{split}
(w^*, b^*) =&\arg\min_{\mathbb{w}, b}\frac{1}{2}\|\mathbb{w}\|_2^2 \\
&\text{subject to } \underbrace{(\mathbb{w\cdot x_i}+b)\mathbb{y}_i\geq 0}_{\text{linear separability}} \,\, i=1,\cdots,m.
\end{split}
\end{equation}
We prefer `the maximal margin hyperplane`
\begin{equation}\label{hard margins}\tag{hard margins}
\begin{split}
(w^*, b^*) =&\arg\min_{\mathbb{w}, b}\frac{1}{2}\|\mathbb{w}\|_2^2 \\
&\text{subject to } \underbrace{(\mathbb{w\cdot x_i}+b)\mathbb{y}_i\geq 1}_{\text{linear separability}} \,\, i=1,\cdots,m.
\end{split}
\end{equation}
where the `support vectors` are defined as $\{(\mathbb{x}_i , y_i)\mid (\mathbb{w\cdot x_i}+b)\mathbb{y}_i=1\}$.
It is the `quadratic optimization problem with linear constraints`.
If the training set $T$ is not `linearly separable`, we prefer `the maximal soft margin hyperplane`
\begin{equation}\tag{soft margin}\label{soft margin}
\begin{split}
(w^*, b^*) =&\arg\min_{\mathbb{w}, b}\frac{1}{2}\|\mathbb{w}\|_2^2+\lambda\sum_{i=1}^m\varepsilon_i \\
&\text{subject to } (\mathbb{w\cdot x_i}+b)\mathbb{y}_i\geq 1- \varepsilon_i \\
& \varepsilon_i \geq 0\,\, i=1,\cdots,m.
\end{split}
\end{equation}
Here $\lambda > 0$ is the regularization constant.
A surrogate loss is a loss that is used as a proxy for the 0-1 loss, and usually has better computational properties such as [Hinge loss](https://en.wikipedia.org/wiki/Hinge_loss), [Ramp loss](https://www.jstor.org/stable/23013182).
- http://www.svms.org/history.html
- https://cs231n.github.io/linear-classify/
- https://www.jianshu.com/p/4a40f90f0d98
- https://www.cs.otago.ac.nz/research/student-publications/Haitao_Xu_Tbldm_for_ajcai2018.pdf
---
If $\varepsilon_i> 1$, it follows that $\operatorname{sgn}(\mathbb{w\cdot x_i}+b)\not=\mathbb{y}_i$,
i.e., the prediction is wrong with respect to the sample $(\mathbb{x}_i , y_i)$.
And $(\mathbb{w\cdot x_i}+b)\mathbb{y}_i\geq 1- \varepsilon_i\implies \varepsilon_i\geq 1- (\mathbb{w\cdot x_i}+b)\mathbb{y}_i$, so the prediction is right if $\varepsilon_i\leq 0$.
The [Hinge loss](https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es1999-461.pdf) is defined as
$$\operatorname{H}(y_i,\hat{y}_i)=\max\{0, 1- (\underbrace{\mathbb{w\cdot x_i}+b}_{\hat y_i})y_i\}\tag{Hinge loss}$$
so `hinge loss` is $0$ when the prediction is right.
The `ramp loss` is defined as following
$$\operatorname{R}(y_i,\hat{y}_i)=\min\{1,\max\{0, 1- (\underbrace{\mathbb{w\cdot x_i}+b}_{\hat y_i})y_i\}\}\tag{Ramp loss}$$
which is closer to the 0-1 loss \eqref{binary loss} than the hinge loss.
Note that $\lim_{t\to 0}\frac{1- t}{\exp(-t)}=1$, `Exponential Loss` is defined that
$$E(y_i,\hat y_i)=\exp(-(\mathbb{w\cdot x_i}+b)y_i).\tag{Exponential loss}$$
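A short sketch (not from the original text) plots the 0-1 loss and its surrogates as functions of the margin $t=(\mathbb{w\cdot x}+b)y$:
```{r surrogate-losses, fig.height=4, fig.width=6}
# 0-1, hinge, ramp and exponential losses versus the margin t
t <- seq(-2, 2, length = 200)
plot(t, as.numeric(t < 0), type = "s", ylim = c(0, 3),
     xlab = "margin t", ylab = "loss")          # 0-1 loss
lines(t, pmax(0, 1 - t), col = 2)               # hinge
lines(t, pmin(1, pmax(0, 1 - t)), col = 3)      # ramp
lines(t, exp(-t), col = 4)                      # exponential
legend("topright", c("0-1", "hinge", "ramp", "exponential"),
       col = 1:4, lty = 1, cex = 0.8)
```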
Another approach to nonlinear classification is to apply a feature transformation and then a linear classifier.
Kernel tricks are among the most popular methods:
the kernel trick maps the data points into a higher-dimensional space.
- [SVM for Pattern recognition](http://support-vector-machines.org/SVM_pr.html)
- http://www.ece.utep.edu/research/webfuzzy/docs/kk-thesis/kk-thesis-html/node19.html
- http://web.mit.edu/6.034/wwwbob/svm-notes-long-08.pdf
- https://en.wikipedia.org/wiki/Support-vector_machine
- https://en.wikipedia.org/wiki/Hinge_loss
- [Nonconvex online support vector machines](https://www.ncbi.nlm.nih.gov/pubmed/20513924)
- [The Support Vector Machine and Mixed Integer Linear Programming: Ramp Loss SVM with L1-Norm Regularization](https://scholarscompass.vcu.edu/cgi/viewcontent.cgi?article=1007&context=ssor_pubs)
Without modifying the loss functions, we can extend binary classifiers to multiclass classifiers.
The basic belief is that a single binary classifier learns one specific pattern,
so we can use multiple binary classifiers to learn multiple patterns.
- https://en.wikipedia.org/wiki/Multiclass_classification
- [Multiclass Learning with ECOC](http://blog.pluskid.org/?p=870)
```{r svm, echo=FALSE, fig.height=4, fig.width=6}
data(cats, package = "MASS")
library(e1071)
m <- svm(Sex~., data = cats)
plot(m, cats)
## more than two variables: fix 2 dimensions
data(iris)
m2 <- svm(Species~., data = iris)
plot(m2, iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 3, Sepal.Length = 4))
## plot with custom symbols and colors
plot(m, cats, svSymbol = 1, dataSymbol = 2, symbolPalette = rainbow(4),
     color.palette = terrain.colors)
```
- https://arxiv.org/abs/1808.02435
- [Mixed-Integer Support Vector Machine](http://opt.kyb.tuebingen.mpg.de/papers/OPT2009-Guan.pdf)
- [OPTIMIZING $\Psi$-LEARNING VIA MIXED INTEGER PROGRAMMING](https://stat-or.unc.edu/files/2016/04/05-20.pdf)
### Adaptive margin learning
The margin $\Delta$ below plays the same role as the margin in support vector machines.
The multiclass SVM loss, summed over the training examples, is formalized as follows:
$$\sum_{i}\sum_{y\not=y_i}[\Delta+(W x_i+b)^Ty-(W x_i+b)^Ty_i]_{+}$$
where $y_i, y$ are one-hot class vectors (so $(W x_i+b)^Ty$ picks out the score of class $y$) and $\Delta>0$ is a constant.
<img src="https://cs231n.github.io/assets/margin.jpg" width="80%" />
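As a quick numeric illustration (hypothetical scores, not from the text), the inner sum for one example with three classes:

``` {r multiclass-hinge, eval=FALSE}
## Multiclass SVM loss for one example (hypothetical scores)
scores <- c(3.2, 5.1, -1.7)   # (W x_i + b) evaluated on classes 1..3
yi     <- 1                   # index of the true class
Delta  <- 1
sum(pmax(0, Delta + scores[-yi] - scores[yi]))
## = max(0, 1 + 5.1 - 3.2) + max(0, 1 - 1.7 - 3.2) = 2.9
```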
Why is the margin $\Delta$ constant for all categories?
To make the size of the margin at each training point a controlling variable we propose the following learning algorithm [AM-SVM](http://pdfs.semanticscholar.org/a3db/a5e48d2e57941efe7b4cab7e299fd2cd9de7.pdf):
$$\min_{\mathbb{\alpha},\ \xi} \sum_{i=1}^m\xi_i\\ \text{s.t. }\quad y_if(x_i)\geq 1-\xi_i+\alpha_i k(x_i, x_i)\\
\xi_i\geq 0,\ \alpha_i \geq 0,\quad i=1,\cdots,m$$
- http://yining-wang.com/mdactive-slides.pdf
- [Adaptive Margin Support Vector Machines for Classification](http://pdfs.semanticscholar.org/a3db/a5e48d2e57941efe7b4cab7e299fd2cd9de7.pdf)
- [Adaptive Large Margin Training for Multilabel Classification](https://www.aaai.org/ocs/index.php/AAAI/AAAI11/paper/viewFile/3455/3878)
- [AdaptiveFace: Adaptive Margin and Sampling for Face Recognition](http://www.cbsr.ia.ac.cn/users/xiangyuzhu/papers/2019adaptiveface.pdf)
- [Boosting Few-Shot Learning With Adaptive Margin Loss](https://arxiv.org/abs/2005.13826)
- [Margin-adaptive model selection in statistical learning](https://projecteuclid.org/download/pdfview_1/euclid.bj/1302009243)
- https://github.com/cvqluu/Angular-Penalty-Softmax-Losses-Pytorch
- [Deep Ranking Model by Large Adaptive Margin Learning for Person Re-identification](https://arxiv.org/abs/1707.00409)
- [Boosting Few-Shot Visual Learning with Self-Supervision](https://github.com/valeoai/BF3S)
- https://abursuc.github.io/
- http://imagine.enpc.fr/~komodakn/
- https://webia.lip6.fr/~cord/
- https://www.weiranhuang.com/
### Optimal margin distribution machine learning
Optimal margin Distribution Machine (ODM) can achieve a better generalization performance by optimizing the margin distribution explicitly, i.e., its first- and second-order statistics (the margin mean and variance), rather than only the minimum margin.
- [Optimal Margin Distribution Machine](https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/tkde19odm.pdf)
- [Semi-Supervised Optimal Margin Distribution Machines](https://www.ijcai.org/Proceedings/2018/0431.pdf)
- [Multi-Label Optimal Margin Distribution Machine](http://www.acml-conf.org/2019/conference/accepted-papers/175/)
- [Large Margin Distribution Learning with Cost Interval and Unlabeled Data](https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/tkde16cisldm.pdf)
- [Optimal Margin Distribution Clustering](https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16895)
- https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/publication_toc.htm#Cost-Sensitive%20and%20Class-Imbalance%20Learning
### Logistic regression and beyond
We can explain the linear classifier through a probabilistic interpretation
$$\hat{y}=\operatorname{sgn}(\mathbb{w\cdot x}+b)=\operatorname{sgn}[\ln(\frac{P(Y=1\mid x)}{1-P(Y=1\mid x)})]$$
so that $P(Y=1\mid x)=\frac{\exp(\mathbb{w\cdot x}+b)}{1+\exp(\mathbb{w\cdot x}+b)}=\frac{1}{1+\exp(-[\mathbb{w\cdot x}+b])}$.
And $P(Y\not=1\mid x)=1-\frac{\exp(\mathbb{w\cdot x}+b)}{1+\exp(\mathbb{w\cdot x}+b)}=\frac{1}{1+\exp(\mathbb{w\cdot x}+b)}$.
In our setting $y_i\in\{+1,-1\}$, we can get that
$$P(y\mid x)=\frac{1}{1+\exp(-y[\mathbb{w\cdot x}+b])}$$
where $y\in\{+1,-1\}$.
We apply the maximum likelihood principle to estimate the parameters
$$\arg\max_{\mathbb w, b}\prod_{i=1}^m P(y_i\mid x_i)=\arg\min_{\mathbb w, b}\sum_{i=1}^m\ln(\frac{1}{P(y_i\mid x_i)})\\=\arg\min_{\mathbb w, b}\sum_{i=1}^m\ln(1+\exp(-y_i[\mathbb{w\cdot x_i}+b])).\tag{negative log-likelihood}$$
Note that $\lim_{z\to\infty}\frac{\ln(1+\exp(z))}{1+z}=1$; taking $z=-y_i[\mathbb{w\cdot x_i}+b]$, the per-sample term $\ln(1+\exp(z))$ behaves like the hinge penalty $1+z=1-y_i[\mathbb{w\cdot x_i}+b]$ when the margin is badly violated,
so the negative log-likelihood can be seen as a smooth approximation of the hinge loss.
In our context, we always assume that $y\in\{+1,-1\}$ while another option is
$y\in\{0,1\}$.
If $y\in\{0,1\}$, we still suppose that
$$P(Y=1\mid x)=\frac{1}{1+\exp(-[\mathbb{w\cdot x}+b])}$$
and
$$P(Y=0\mid x)=P(Y\not=1\mid x)=\frac{1}{1+\exp([\mathbb{w\cdot x}+b])}$$
so we can express the probability as
$$P(Y=y\mid x)=[\frac{1}{1+\exp(-[\mathbb{w\cdot x}+b])}]^{y} [\frac{1}{1+\exp([\mathbb{w\cdot x}+b])}]^{1-y}$$
where $y=0$ or $y=1$.
We apply the maximum likelihood principle to estimate the parameters
$$\arg\max_{\mathbb w, b}\prod_{i=1}^m P(y_i\mid x_i)
\\=\arg\min_{\mathbb w, b}\sum_{i=1}^m-\ln([\frac{1}{1+\exp(-[\mathbb{w\cdot x_i}+b])}]^{y_i} [\frac{1}{1+\exp([\mathbb{w\cdot x_i}+b])}]^{1-y_i})\\
=\arg\min_{\mathbb w, b} \sum_{i=1}^m y_i\ln(1+\exp(-[\mathbb{w\cdot x_i}+b]))+(1-y_i)\ln([1+\exp([\mathbb{w\cdot x_i}+b])]) \\
=\arg\min_{\mathbb w, b} \sum_{i=1}^m
y_i\ln(\frac{1+\exp(-[\mathbb{w\cdot x_i}+b])}{1+\exp([\mathbb{w\cdot x_i}+b])})
+\ln([1+\exp([\mathbb{w\cdot x_i}+b])]) \\
=\arg\min_{\mathbb w, b} \sum_{i=1}^m -y_i[\mathbb{w\cdot x_i}+b]+\ln([1+\exp([\mathbb{w\cdot x_i}+b])])
\tag{negative log-likelihood}.$$
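To make the estimation concrete, here is a minimal R sketch of ours (simulated data; all names are illustrative) that minimizes the final negative log-likelihood above with `optim` and checks it against `glm`:

``` {r logistic-mle, eval=FALSE}
## Fit logistic regression by minimizing the negative log-likelihood above
set.seed(1)
n <- 200
x <- matrix(rnorm(2 * n), ncol = 2)
y <- rbinom(n, 1, plogis(x %*% c(1, -2) + 0.5))      # labels in {0, 1}

nll <- function(theta) {                             # theta = (w1, w2, b)
  z <- x %*% theta[1:2] + theta[3]
  sum(-y * z + log(1 + exp(z)))                      # the last line of the derivation
}
optim(c(0, 0, 0), nll, method = "BFGS")$par          # (w1, w2, b)
coef(glm(y ~ x, family = binomial()))                # (b, w1, w2): should closely agree
```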
- [Which loss function is correct for logistic regression?](https://stats.stackexchange.com/questions/250937/which-loss-function-is-correct-for-logistic-regression)
- [Why there are two different logistic loss formulation / notations?](https://stats.stackexchange.com/questions/229645/why-there-are-two-different-logistic-loss-formulation-notations)
- [Logistic Regression](https://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf)
- [Logistic classification model - Maximum likelihood estimation](https://www.statlect.com/fundamentals-of-statistics/logistic-model-maximum-likelihood)
- [Robustness and Regularization of Support Vector Machines](http://www.jmlr.org/papers/volume10/xu09b/xu09b.pdf)
- [The Entire Regularization Path for the Support Vector Machine](https://web.stanford.edu/~hastie/Papers/svmpath_jmlr.pdf)
Note that
$$\hat{y}=\operatorname{sgn}[\ln(\frac{P(Y=1\mid x)}{1-P(Y=1\mid x)})]=\begin{cases}+1, & \text{if } P(Y=1\mid x)> P(Y\not=1\mid x);\\ -1, &\text{if } P(Y=1\mid x)< P(Y\not=1\mid x).\end{cases}$$
If we set
$$P(Y=1\mid x)=\frac{1}{2}+\frac{\operatorname{sgn}(\mathbb{w\cdot x}+b)}{2}(1-\exp(-\|\mathbb{w\cdot x}+b\|_1))$$
(the factor $\frac{1}{2}$ on the second term keeps the value in $[0,1]$),
then $P(Y\not=1\mid x)=\frac{1}{2}-\frac{\operatorname{sgn}(\mathbb{w\cdot x}+b)}{2}(1-\exp(-\|\mathbb{w\cdot x}+b\|_1))$ and the following condition also holds
$$\hat{y}=\operatorname{sgn}(\mathbb{w\cdot x}+b)=\operatorname{sgn}[\ln(\frac{P(Y=1\mid x)}{1-P(Y=1\mid x)})].$$
In our setting $y_i\in\{+1,-1\}$, we can get that
$$P(y\mid x)=\frac{1}{2}+\frac{y\operatorname{sgn}(\mathbb{w\cdot x}+b)}{2}(1-\exp(-\|\mathbb{w\cdot x}+b\|_1))$$
where $y\in\{+1,-1\}$.
We apply the maximum likelihood principle to estimate the parameters
$$\arg\max_{\mathbb w, b}\prod_{i=1}^m P(y_i\mid x_i)=\arg\min_{\mathbb w, b}\sum_{i=1}^m -\ln [\frac{1}{2}+\frac{y_i\operatorname{sgn}(\mathbb{w\cdot x_i}+b)}{2}(1-\exp(-\|\mathbb{w\cdot x_i}+b\|_1))]$$
where $\mathbb{w\cdot x_i}+b\not= 0$.
Note that $\lim_{x\to 0}\frac{-\ln(1+x)}{x}=-1$; writing $u_i=y_i\operatorname{sgn}(\mathbb{w\cdot x_i}+b)(1-\exp(-\|\mathbb{w\cdot x_i}+b\|_1))$, we get
$$-\ln[\frac{1+u_i}{2}]=\ln 2-\ln(1+u_i)\approx \ln 2-u_i\\=(\ln 2-1)+\underbrace{[1-y_i\operatorname{sgn}(\mathbb{w\cdot x_i}+b)]}_{\text{twice the 0-1 loss}}+y_i\operatorname{sgn}(\mathbb{w\cdot x_i}+b)\exp(-\|\mathbb{w\cdot x_i}+b\|_1)$$
so, up to an additive constant, this loss behaves like the 0-1 loss plus a term that decays with the margin $\|\mathbb{w\cdot x_i}+b\|_1$ when the prediction is right and grows with it when the prediction is wrong.
And we can induce more loss functions based on this approach.
A more general result states that Bayes consistent loss functions can be generated using the following formulation:
$$\phi (v)=C[f^{-1}(v)]+(1-f^{-1}(v))C'[f^{-1}(v)]$$
where $f(\eta ),\ (0\leq \eta \leq 1)$ is any invertible function such that $f^{-1}(-v)=1-f^{-1}(v)$
and $C(\eta )$ is any differentiable strictly concave function such that $C(\eta )=C(1-\eta )$.
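For instance (a check of ours), taking the binary entropy $C(\eta)=-\eta\ln\eta-(1-\eta)\ln(1-\eta)$ and $f^{-1}(v)=\frac{1}{1+\exp(-v)}$ gives $C'(\eta)=\ln\frac{1-\eta}{\eta}=-v$, and the formula yields
$$\phi(v)=C[f^{-1}(v)]+(1-f^{-1}(v))(-v)=\ln(1+\exp(-v)),$$
which is exactly the logistic loss.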
See [Loss functions for classification](https://en.wikipedia.org/wiki/Loss_functions_for_classification).
- https://en.wikipedia.org/wiki/Loss_functions_for_classification
- http://jmlr.org/papers/volume16/masnadi15a/masnadi15a.pdf
- [on the design of loss functions for classification theory robustness to outliers and savageboost](https://papers.nips.cc/paper/3591-on-the-design-of-loss-functions-for-classification-theory-robustness-to-outliers-and-savageboost.pdf)
---
If the training set $T$ is `linearly separable`, we can also train a decision tree (here a one-node tree with an oblique split, i.e., a decision stump) so that
it assigns the region $\mathbb{w\cdot x}+b>0$ to class $+1$ patterns
and the region $\mathbb{w\cdot x}+b<0$ to class $-1$ patterns.
In analytic form, this decision tree is defined as follows
$$\mathbb{I}(\mathbb{w\cdot x}+b>0)-(1-\mathbb{I}(\mathbb{w\cdot x}+b>0))\tag{Decision tree}$$
where $\mathbb{I}$ is the indicator function.
In general, a decision tree is grown by greedily minimizing a split criterion (such as the Gini index or an entropy) at each node; see the sketch below.
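A minimal sketch, assuming the `rpart` package (whose classification criteria are the Gini index and the information/entropy criterion):

``` {r split-criteria, eval=FALSE}
## Growing trees under two split criteria with rpart
library(rpart)
data(iris)
tree_gini <- rpart(Species ~ ., data = iris, parms = list(split = "gini"))
tree_info <- rpart(Species ~ ., data = iris, parms = list(split = "information"))
tree_gini
tree_info
```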
Tsallis entropy is defined by
\begin{equation}\label{Tsallis entropy}\tag{Tsallis entropy}
S_q(X) =\frac{1}{1-q}(\sum_{i=1}^n p(x_i)^q-1)
\end{equation}
which converges to Shannon entropy when $q \to 1$.
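A quick numeric check of this limit (our own sketch):

``` {r tsallis-limit, eval=FALSE}
## Tsallis entropy approaches Shannon entropy as q -> 1
tsallis <- function(p, q) (sum(p^q) - 1) / (1 - q)
shannon <- function(p) -sum(p * log(p))

p <- c(0.5, 0.3, 0.2)
sapply(c(2, 1.5, 1.1, 1.01, 1.001), function(q) tsallis(p, q))
shannon(p)   # the values above converge to this
```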
- https://www.benkuhn.net/tree-imp
- [How does a Decision Tree decide where to split?](http://www.ashukumar27.io/Decision-Trees-splitting/)
- [Selecting Multiway Splits in Decision Trees](https://www.cs.waikato.ac.nz/~ml/publications/1996/Frank-Witten96.pdf)
- [Unifying the Split Criteria of Decision Trees Using Tsallis Entropy](https://arxiv.org/pdf/1511.08136.pdf)
- https://daviddalpiaz.github.io/r4sl/trees.html
- https://en.wikipedia.org/wiki/R%C3%A9nyi_entropy
- https://online.stat.psu.edu/stat508/lesson/11/11.2
### Multi-class classification
A categorical variable is a discrete variable that can take on one of finitely many mutually exclusive states.
We follow the 1-of-K scheme proposed by [Christopher Bishop](https://www.microsoft.com/en-us/research/people/cmbishop/) in [Pattern Recognition and Machine Learning](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf).
The categorical variable is represented by a K-dimensional vector $x$
in which one of the elements $x_k$ equals $1$, and all remaining elements equal $0$,
i.e., one-hot vector.
For example, we use the vector $x$ to represent gender (male and female), so $x\in\{(0, 1)^T, (1, 0)^T\}$,
where $(0, 1)^T$ corresponds to male and $(1, 0)^T$ to female.
If we denote the probability of $x_k = 1$ by the parameter $\mu_k$,
then the distribution of $x$ is given
$$P(x\mid \mu)=\prod_{i=1}^K\mu_i^{x_i}$$
where $\sum_{i=1}^K\mu_i=\sum_{i=1}^Kx_i=1$ and $x_i\in\{0,1\}, \mu_i>0$ for $i=1,\cdots,K$.
And the expectation of the categorical random variable is computed by
$$\mathbb{E}(x)=\sum_{x}xP(x)=(\mu_1,\cdots,\mu_K)^T=\mu.$$
The marginal distribution for $x_k$ is obtained by summing the joint distribution over the remaining components:
$$P(x_k)=[\sum_{i\not=k}\mu_i]^{1-x_k}\mu_k^{x_k}=(1-\mu_k)^{1-x_k}\mu_k^{x_k}$$
and the marginal expectation of $x_k$ is $\mu_k$.
- [The Multinomial Distribution](https://dipmat.univpm.it/~demeio/Alabama_PDF/11.%20Bernoulli_Trials/Multinomial.pdf)
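A quick simulation (our own sketch) confirming $\mathbb{E}(x)=\mu$ for one-hot draws:

``` {r categorical-mean, eval=FALSE}
## The empirical mean of one-hot draws estimates mu
mu <- c(0.2, 0.5, 0.3)
draws <- t(rmultinom(10000, size = 1, prob = mu))   # 10000 x 3 one-hot matrix
colMeans(draws)                                     # approximately mu
```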
Given two distributions $p$ and $q$ over a given variable $X$, the cross entropy is defined as