
This note provides the details of the implementation of Gaussian process regression (GPR) for preferential learning using expectation propagation (EP).

We assume that $f$ is a sample path of a GP with zero mean and some stationary kernel $k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$. Following [2, 3], we assume that each duel is determined as follows:

$$ \boldsymbol{x}_{i, w} \succ \boldsymbol{x}_{i, l} \Leftrightarrow f(\boldsymbol{x}_{i, w}) + \epsilon_w > f(\boldsymbol{x}_{i, l}) + \epsilon_l, $$

where $\boldsymbol{x}_{i, w}$ and $\boldsymbol{x}_{i, l}$ are the winner and the loser of the $i$-th duel, respectively, and the i.i.d. additive noise terms $\epsilon_w$ and $\epsilon_l$ follow the normal distribution $\mathcal{N}(0, \sigma^2_{\rm noise})$. Therefore, the training data $\mathcal{D}_t$ can be written as

$$ \mathcal{D}_t = \{ \boldsymbol{x}_{i, w} \succ \boldsymbol{x}_{i, l} \}_{i=1}^t \equiv \{v_i < 0\}_{i=1}^t, $$

where $v_i = f(\boldsymbol{x}_{i, l}) + \epsilon_l - f(\boldsymbol{x}_{i, w}) - \epsilon_w$. For brevity, we denote $\{v_i < 0\}_{i=1}^t$ as $\boldsymbol{v}_t < \boldsymbol{0}$, where $\boldsymbol{v}_t = (v_1, \dots, v_t)^\top$. Then, we approximate the posterior

$$ p\left(\boldsymbol{f}_{\rm tes} \mid \boldsymbol{v}_t < \boldsymbol{0} \right) = \frac{\Pr\left(\boldsymbol{v}_t < \boldsymbol{0} \mid \boldsymbol{f}_{\rm tes}\right) p\left(\boldsymbol{f}_{\rm tes}\right)}{\Pr(\boldsymbol{v}_t < \boldsymbol{0})}, $$

where $\boldsymbol{f}_{\rm tes} = \bigl( f(\boldsymbol{x}_{1, {\rm tes}}), \dots, f(\boldsymbol{x}_{m, {\rm tes}}) \bigr)^\top \in \mathbb{R}^m$ is the vector of function values at arbitrary test inputs.
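For concreteness, the following is a minimal NumPy sketch of the duel model above (Python here and in the snippets below, since the note does not fix a language); `f`, `sigma_noise`, and the uniform input sampling are illustrative stand-ins, not part of [1]:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_noise = 0.1  # std of the additive comparison noise

def duel(f, x_a, x_b, rng):
    """Return (x_winner, x_loser): the duel is decided by noisy function values."""
    y_a = f(x_a) + rng.normal(0.0, sigma_noise)
    y_b = f(x_b) + rng.normal(0.0, sigma_noise)
    return (x_a, x_b) if y_a > y_b else (x_b, x_a)

f = lambda x: np.sin(3.0 * x)  # toy stand-in for a GP sample path
duels = [duel(f, rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0), rng)
         for _ in range(5)]
X_w = np.array([d[0] for d in duels])  # winners x_{i,w}
X_l = np.array([d[1] for d in duels])  # losers  x_{i,l}
```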

Prior work [2] applies EP to the function values on the duels, i.e., $\bigl( f(\boldsymbol{x}_{1, w}), f(\boldsymbol{x}_{1, l}), \dots, f(\boldsymbol{x}_{t, w}), f(\boldsymbol{x}_{t, l}) \bigr)$. However, this formulation requires two-dimensional local moment matching for $\bigl( f(\boldsymbol{x}_{i, w}), f(\boldsymbol{x}_{i, l}) \bigr)$, which makes the implementation highly cumbersome (see Appendix A in [2]).

Alternatively, we apply EP to $\boldsymbol{v}_t | \boldsymbol{v}_t < \boldsymbol{0}$ directly. Since $\boldsymbol{v}_t | \boldsymbol{v}_t < \boldsymbol{0}$ follows a truncated multivariate normal distribution with a simple truncation, we can implement EP using one-dimensional local moment matching, which is simpler than the approach of [2]. We referred to Appendix B.2 of [4] for the subsequent derivation; note that [4] contains some mathematical typos in the update of the mean parameters, which are fixed in the derivation below.

Let $\boldsymbol{v} = (v_1, \dots, v_t)^\top \sim \mathcal{N}(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)$, and denote the probability density function (PDF) of the normal distribution by $\mathcal{N}(\boldsymbol{v} | \boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)$. Then, we consider the EP approximation of the truncated normal distribution $p(\boldsymbol{v} | \boldsymbol{v} < \boldsymbol{0})$ as follows:

$$\begin{align} p(\boldsymbol{v} | \boldsymbol{v} < \boldsymbol{0}) &\propto \mathcal{N}(\boldsymbol{v} | \boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0) \prod_{i=1}^t \mathbb{I}\{ v_i < 0 \} \\ &\approx \mathcal{N}(\boldsymbol{v} | \boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0) \prod_{i=1}^t \mathcal{N}( v_i | \tilde{\mu}_i, \tilde{\sigma}_{i}^2 ) \\ &= \mathcal{N}(\boldsymbol{v} | \boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0) \, \mathcal{N}( \boldsymbol{v} | \tilde{\boldsymbol{\mu}}, \tilde{\boldsymbol{\Sigma}} ) \\ &\propto \mathcal{N}(\boldsymbol{v} | \boldsymbol{\mu}, \boldsymbol{\Sigma}), \end{align}$$

where $\mathbb{I}\{ v_i < 0 \}$ is the indicator function, which is $1$ if $v_i < 0$ and $0$ otherwise, $\tilde{\boldsymbol{\mu}} = (\tilde{\mu}_1, \dots, \tilde{\mu}_t)^\top \in \mathbb{R}^t$, $\tilde{\boldsymbol{\Sigma}} \in \mathbb{R}^{t \times t}$ is a diagonal matrix whose $(i,i)$-th element is $\tilde{\sigma}_i^2$, and the resulting mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$ are computed as follows:

$$ \begin{align} \boldsymbol{\mu} &= \boldsymbol{\Sigma} \left( \tilde{\boldsymbol{\Sigma}}^{-1} \tilde{\boldsymbol{\mu}} + \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu}_0 \right), \\ \boldsymbol{\Sigma} &= \left(\boldsymbol{\Sigma}_0^{-1} + \tilde{\boldsymbol{\Sigma}}^{-1} \right)^{-1}. \end{align} $$
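In code, reconstructing $(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ from the site parameters can look as follows. This is a sketch with our own naming; the sites are parameterized by precisions `tau_tilde` $= 1/\tilde{\sigma}_i^2$ so that the $\tilde{\sigma}_i^2 = \infty$ initialization discussed next is simply `tau_tilde = 0`:

```python
import numpy as np

def ep_posterior(mu0, Sigma0, mu_tilde, tau_tilde):
    """Recover N(mu, Sigma) from the prior and the diagonal site approximation.

    mu_tilde, tau_tilde: site means and site precisions (1 / sigma_tilde^2)."""
    Sigma0_inv = np.linalg.inv(Sigma0)
    Sigma = np.linalg.inv(Sigma0_inv + np.diag(tau_tilde))
    # Sigma_tilde^{-1} mu_tilde is the elementwise product tau_tilde * mu_tilde
    mu = Sigma @ (tau_tilde * mu_tilde + Sigma0_inv @ mu0)
    return mu, Sigma
```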

Then, $\tilde{\boldsymbol{\mu}}$ and $\tilde{\boldsymbol{\Sigma}}$ are iteratively refined. These parameters are initialized as $\tilde{\mu}_i = 0$ and $\tilde{\sigma}_{i}^2 = \infty$ for all $i$, which corresponds to $\boldsymbol{\mu} = \boldsymbol{\mu}_0$ and $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}_0$.

Let us consider the update of $\tilde{\mu}_j$ and $\tilde{\sigma}_{j}^2$. First, we consider the cavity distribution

$$\begin{align} q_{\backslash j}(\boldsymbol{v}) &= \mathcal{N}(\boldsymbol{v} | \bar{\boldsymbol{\mu}}_{\backslash j}, \bar{\boldsymbol{\Sigma}}_{\backslash j}) \\ &\propto \mathcal{N}(\boldsymbol{v} | \boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0) \prod_{i=1, i \neq j}^t \mathcal{N}( v_i | \tilde{\mu}_i, \tilde{\sigma}_{i}^2 ), \end{align}$$

where

$$\begin{align} \bar{\mu}_{\backslash j, j} &= \mathbb{E}_{q_{\backslash j}}[v_j] = \bar{\sigma}_{\backslash j, j}^2 \left( \mu_j / \sigma_{j}^2 - \tilde{\mu}_j / \tilde{\sigma}_{j}^2 \right), \\ \bar{\sigma}_{\backslash j, j}^2 &= \mathbb{V}_{q_{\backslash j}}[v_j] = \left( 1 / \sigma_{j}^2 - 1 / \tilde{\sigma}_{j}^2 \right)^{-1}, \end{align}$$

where $\mathbb{E}$ and $\mathbb{V}$ denote the expectation and variance with respect to the distribution in the subscript, and $\mu_j$ and $\sigma_j^2$ are the $j$-th element of $\boldsymbol{\mu}$ and the $(j,j)$-th element of $\boldsymbol{\Sigma}$, respectively. Then, $\tilde{\mu}_j$ and $\tilde{\sigma}_{j}^2$ are updated by moment matching between the truncated normal distribution

$$ r_j(v_j) = \frac{\mathcal{N} \left( v_j | \bar{\mu}_{\backslash j, j}, \bar{\sigma}_{\backslash j, j}^2 \right) \mathbb{I}\{ v_j < 0 \}}{\Phi (\beta)}, $$

where $\beta = - \bar{\mu}_{\backslash j, j} / \bar{\sigma}_{\backslash j, j}$ and $\Phi ( \cdot )$ is the cumulative distribution function of the standard normal distribution, and the approximating normal distribution

$$ \begin{align} \hat{r}_j(v_j) &= \mathcal{N} \left( v_j | \mathbb{E}_{\hat{r}_j} \bigl[ v_j \bigr], \mathbb{V}_{\hat{r}_j} \bigl[ v_j \bigr] \right) \\ &\propto \mathcal{N} \left( v_j | \bar{\mu}_{\backslash j, j}, \bar{\sigma}_{\backslash j, j}^2 \right) \mathcal{N} \left( v_j | \tilde{\mu}_j, \tilde{\sigma}_{j}^2 \right). \end{align} $$

Specifically, we set $\tilde{\mu}_j$ and $\tilde{\sigma}_{j}^2$ using the following equations:

$$ \begin{align} \bar{\mu}_{\backslash j, j} - \frac{\phi(\beta)}{\Phi(\beta)} \bar{\sigma}_{\backslash j, j} = \mathbb{E}_{r_j}\bigl[v_j\bigr] &= \mathbb{E}_{\hat{r}_j} \bigl[ v_j \bigr] = \mathbb{V}_{\hat{r}_j}\bigl[v_j\bigr] \left( \frac{\bar{\mu}_{\backslash j, j}}{\bar{\sigma}_{\backslash j, j}^2} + \frac{\tilde{\mu}_j}{\tilde{\sigma}_{j}^2} \right), \\ \left( 1 - \frac{\beta \phi(\beta)}{\Phi(\beta)} - \left( \frac{ \phi(\beta)}{\Phi(\beta)} \right)^2 \right) \bar{\sigma}_{\backslash j, j}^2 = \mathbb{V}_{r_j}\bigl[v_j\bigr] &= \mathbb{V}_{\hat{r}_j} \bigl[ v_j \bigr] = \left( \frac{1}{\bar{\sigma}_{\backslash j, j}^2} + \frac{1}{\tilde{\sigma}_{j}^2} \right)^{-1}, \end{align} $$

where $\phi$ is the PDF of the standard normal distribution. Note that the expectation and variance of both the truncated normal $r_j(v_j)$ and the normal $\hat{r}_j(v_j)$ can be obtained analytically. Consequently, we obtain the update rule of $\tilde{\mu}_j$ and $\tilde{\sigma}_{j}^2$ as follows:

$$ \begin{align} \tilde{\mu}_j &\leftarrow \bar{\mu}_{\backslash j, j} - \left( \beta + \frac{ \phi(\beta)}{\Phi(\beta)} \right)^{-1} \bar{\sigma}_{\backslash j, j}, \\ \tilde{\sigma}_{j}^2 &\leftarrow \left( \left( \frac{\beta \phi(\beta)}{\Phi(\beta)} + \left( \frac{ \phi(\beta)}{\Phi(\beta)} \right)^2 \right)^{-1} - 1 \right) \bar{\sigma}_{\backslash j, j}^2 . \end{align} $$
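A sketch of the cavity computation and this moment-matching update, under the same naming conventions as the earlier snippet (the function names are ours, not from [1]):

```python
import numpy as np
from scipy.stats import norm

def cavity(mu_j, sigma2_j, mu_tilde_j, tau_tilde_j):
    """Cavity moments from the current marginal (mu_j, sigma2_j) of q and the
    site parameters of factor j (tau_tilde_j = 1 / sigma_tilde_j^2)."""
    sigma2_cav = 1.0 / (1.0 / sigma2_j - tau_tilde_j)
    mu_cav = sigma2_cav * (mu_j / sigma2_j - tau_tilde_j * mu_tilde_j)
    return mu_cav, sigma2_cav

def site_update(mu_cav, sigma2_cav):
    """Moment matching against the truncated normal r_j; returns
    (mu_tilde_j, tau_tilde_j). norm.pdf / norm.cdf are used directly; for
    extreme beta a log-space version (e.g. scipy.special.log_ndtr) is safer."""
    sigma_cav = np.sqrt(sigma2_cav)
    beta = -mu_cav / sigma_cav
    lam = norm.pdf(beta) / norm.cdf(beta)  # phi(beta) / Phi(beta)
    c = beta * lam + lam ** 2              # 1 - c scales the tilted variance
    sigma2_tilde = (1.0 / c - 1.0) * sigma2_cav
    mu_tilde = mu_cav - sigma_cav / (beta + lam)
    return mu_tilde, 1.0 / sigma2_tilde
```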

The update rule above is swept over all $j = 1, \dots, t$ and repeated until convergence.
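Tying the snippets above together, a full EP loop could look like the following sketch. The fixed number of sweeps and the full posterior recomputation after every site update are simplifications for clarity; standard EP implementations use a rank-one update of $\boldsymbol{\Sigma}$ instead:

```python
import numpy as np

def ep_truncated_normal(mu0, Sigma0, n_sweeps=50):
    """EP approximation of v ~ N(mu0, Sigma0) conditioned on v < 0.

    Uses ep_posterior, cavity, and site_update from the earlier snippets."""
    t = len(mu0)
    mu_tilde = np.zeros(t)
    tau_tilde = np.zeros(t)  # zero precision <=> sigma_tilde^2 = inf
    mu, Sigma = mu0.copy(), Sigma0.copy()
    for _ in range(n_sweeps):  # a tolerance-based stopping rule would also work
        for j in range(t):
            mu_cav, s2_cav = cavity(mu[j], Sigma[j, j], mu_tilde[j], tau_tilde[j])
            mu_tilde[j], tau_tilde[j] = site_update(mu_cav, s2_cav)
            mu, Sigma = ep_posterior(mu0, Sigma0, mu_tilde, tau_tilde)
    return mu_tilde, tau_tilde, mu, Sigma
```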

The prediction of $\boldsymbol{f}_{\rm tes}$ can be performed as follows:

$$ \begin{align} \boldsymbol{f}_{\rm tes} | \mathcal{D}_t &\sim \mathcal{N}(\boldsymbol{\mu}_{\rm tes}, \boldsymbol{\Sigma}_{\rm tes}), \\ \boldsymbol{\mu}_{\rm tes} &= \boldsymbol{\Sigma}_{{\rm tes}, \boldsymbol{v}} \left( \boldsymbol{\Sigma}_0 + \tilde{\boldsymbol{\Sigma}} \right)^{-1} \tilde{\boldsymbol{\mu}},\\ \boldsymbol{\Sigma}_{\rm tes} &= \boldsymbol{\Sigma}_{{\rm tes}, {\rm tes}} - \boldsymbol{\Sigma}_{{\rm tes}, \boldsymbol{v}} \left( \boldsymbol{\Sigma}_0 + \tilde{\boldsymbol{\Sigma}} \right)^{-1} \boldsymbol{\Sigma}_{{\rm tes}, \boldsymbol{v}}^\top, \end{align} $$

where $\boldsymbol{\Sigma}_{{\rm tes}, {\rm tes}}$ and $\boldsymbol{\Sigma}_{{\rm tes}, \boldsymbol{v}}$ are the covariance matrices whose definitions are given in Proposition 3.1 of our paper [1]. Note that $\boldsymbol{\Sigma}_0$ corresponds to $\boldsymbol{\Sigma}_{\boldsymbol{v}, \boldsymbol{v}}$ in [1].
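A prediction sketch under the stated correspondence. The constructions of $\boldsymbol{\Sigma}_0$ and $\boldsymbol{\Sigma}_{{\rm tes}, \boldsymbol{v}}$ below are our assumptions (independent noise per duel, cross-covariances reduced to kernel differences since the noise is independent of $f$); the authoritative definitions are in Proposition 3.1 of [1]:

```python
import numpy as np

def predict(X_tes, X_w, X_l, k, sigma_noise, mu_tilde, tau_tilde):
    """Predictive mean/covariance of f at test inputs X_tes.

    k(A, B) returns the kernel matrix between input sets A and B."""
    t = len(mu_tilde)
    # Assumed Sigma_0 = Cov(v, v), with v_i = f(x_{i,l}) - f(x_{i,w}) + noise
    Sigma0 = (k(X_l, X_l) - k(X_l, X_w) - k(X_w, X_l) + k(X_w, X_w)
              + 2.0 * sigma_noise ** 2 * np.eye(t))
    # Assumed Cov(f_tes, v): only kernel terms remain
    Sigma_tes_v = k(X_tes, X_l) - k(X_tes, X_w)
    B = Sigma0 + np.diag(1.0 / tau_tilde)  # Sigma_0 + Sigma_tilde
    mu_tes = Sigma_tes_v @ np.linalg.solve(B, mu_tilde)
    Sigma_tes = k(X_tes, X_tes) - Sigma_tes_v @ np.linalg.solve(B, Sigma_tes_v.T)
    return mu_tes, Sigma_tes
```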

## Other formulation

Above, we described EP for the truncated multivariate normal distribution. Alternatively, using $z_i = f(\boldsymbol{x}_{i, l}) - f(\boldsymbol{x}_{i, w})$, we can apply EP to the distribution of $\boldsymbol{z}_t = (z_1, \dots, z_t)^\top$:

$$\begin{align} p(\boldsymbol{z}_t | \boldsymbol{v}_t < \boldsymbol{0}) &\propto \mathcal{N}(\boldsymbol{z}_t | \boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0 - 2\sigma_{\rm noise}^2 \boldsymbol{I}) \prod_{i=1}^t \Pr( z_i + \epsilon_l - \epsilon_w < 0 | z_i) \\ &= \mathcal{N}(\boldsymbol{z}_t | \boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0 - 2\sigma_{\rm noise}^2 \boldsymbol{I}) \prod_{i=1}^t \Phi\left( -\frac{z_i}{\sqrt{2}\sigma_{\rm noise}} \right), \end{align}$$

where $\boldsymbol{I}$ is the identity matrix. The derivation of the EP procedure then changes slightly; however, it is essentially the same as for the GP classification model [5]. See [5] for the detailed derivation.
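For reference, the per-duel likelihood factor in this formulation is just a probit term; a minimal sketch:

```python
import numpy as np
from scipy.stats import norm

def duel_likelihood(z, sigma_noise):
    """Pr(v_i < 0 | z_i) = Phi(-z_i / (sqrt(2) sigma_noise)): the probit factor
    that makes this formulation match binary GP classification [5]."""
    return norm.cdf(-z / (np.sqrt(2.0) * sigma_noise))
```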

## References