Summarizing and rewriting *A tutorial on the free-energy framework for modelling perception and learning* by Rafal Bogacz (2015).
Any computational model, in order to be biologically plausible, needs to satisfy two constraints:
- Local computation: each neuron performs computations only based on the activity of its input neurons and the synaptic weights associated with those neurons.
- Local plasticity: the “strength” of a synapse (how much influence a presynaptic neuron has on a postsynaptic one) can change over time (synaptic plasticity) only based on the activity of the two neurons it connects (its pre-synaptic and post-synaptic neurons).
The basic computations a neuron performs are:
- A sum of its inputs, weighted by the strengths of the corresponding synaptic connections
- A transformation of that sum through a function that describes the relationship between the neuron’s total input and its output firing rate
A simple organism tries to infer the size (diameter) of food based on the light intensity it observes.
In this example, we consider a problem in which the value of a single variable $v$ (the size of the food item) has to be inferred from a single observed value $u$ (the light intensity). We assume the observed light intensity is a noisy reading of a function of the size: $p(u \mid v)$ is a normal distribution with mean $g(v)$ and variance $\Sigma_u$,
where $g(v) = v^2$ describes the non-linear relationship between the size of a food item and the light intensity it reflects.
This means that when the food size is $v$, the light intensity the animal measures fluctuates around $g(v) = v^2$, with a spread determined by $\Sigma_u$.
The animal can refine its guess for the size $v$ by combining the noisy observation $u$ with its prior knowledge of how large food items tend to be: we assume the prior $p(v)$ is a normal distribution with mean $v_p$ and variance $\Sigma_p$.
To compute how likely different sizes of food are given the observed light intensity, Bayes’ theorem gives the posterior $p(v \mid u) = \frac{p(v)\,p(u \mid v)}{p(u)}$, where the normalizing term $p(u) = \int p(v)\,p(u \mid v)\,dv$ integrates over all possible sizes.
Such a Bayesian approach integrates the information brought by the stimulus with the prior knowledge ($p(v)$), but computing the normalizing integral is challenging for a simple biological system.
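To make concrete what the exact computation involves, here is a minimal sketch in Python that evaluates the posterior numerically on a grid of candidate sizes. The specific numbers ($v_p = 3$, $\Sigma_p = 1$, $u = 2$, $\Sigma_u = 1$) are illustrative assumptions, not values fixed by the text above.

```python
import numpy as np

# Illustrative values (assumed here): prior over food size, observation and noise.
v_p, Sigma_p = 3.0, 1.0   # prior mean and variance of the size
u, Sigma_u = 2.0, 1.0     # observed light intensity and sensory noise variance

def normal_density(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Evaluate prior * likelihood on a dense grid of candidate sizes, then
# normalize numerically: p(u) is the integral of p(v) p(u|v) over v.
dv = 0.01
v = np.arange(0.01, 5.0, dv)
numerator = normal_density(v, v_p, Sigma_p) * normal_density(u, v ** 2, Sigma_u)
p_u = numerator.sum() * dv          # approximate normalizing integral
posterior = numerator / p_u

print("most likely size on the grid:", v[np.argmax(posterior)])
```

The expensive step is the normalization: it touches every candidate size, which is exactly the kind of global computation a small network of neurons cannot easily perform.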
Since computing the normalizing term $p(u)$ requires integrating over all possible sizes, we will avoid computing the full posterior altogether.
Instead of finding the whole posterior distribution $p(v \mid u)$, we will only look for the single most likely size, which we denote by $\phi$.
According to Bayes’ theorem, $p(\phi \mid u) \propto p(\phi)\,p(u \mid \phi)$, so the value of $\phi$ that maximizes the posterior also maximizes the product of prior and likelihood (the denominator $p(u)$ does not depend on $\phi$).
Maximizing this product is equivalent to maximizing its logarithm, $F = \ln p(\phi) + \ln p(u \mid \phi)$.
Incorporating the constant terms (those that depend neither on $\phi$ nor on the parameters) into a constant $C$, and writing out the two normal densities, we get $F = \frac{1}{2}\left[-\ln\Sigma_p - \frac{(\phi - v_p)^2}{\Sigma_p} - \ln\Sigma_u - \frac{(u - g(\phi))^2}{\Sigma_u}\right] + C$.
To find the value of $\phi$ that maximizes $F$, we can compute the derivative of $F$ with respect to $\phi$ and follow it uphill (gradient ascent).
The derivative of $F$ with respect to $\phi$ is $\frac{\partial F}{\partial \phi} = \frac{v_p - \phi}{\Sigma_p} + \frac{u - g(\phi)}{\Sigma_u}\,g'(\phi)$.
In our example, since $g(\phi) = \phi^2$, the derivative of the generative function is simply $g'(\phi) = 2\phi$.
To find our best guess $\phi$, we start from the prior expectation, $\phi = v_p$, and keep nudging it in the direction of the gradient: $\dot\phi = \frac{\partial F}{\partial \phi}$.
This method of gradient ascent is computationally much simpler than exact Bayesian inference, and quickly converges to the desired value.
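A minimal sketch of this gradient ascent, under the same illustrative values as above and with $g(v) = v^2$:

```python
v_p, Sigma_p = 3.0, 1.0        # prior mean and variance of the size (assumed)
u, Sigma_u = 2.0, 1.0          # observed light intensity and its noise variance
g = lambda phi: phi ** 2       # generative function: size -> expected light
dg = lambda phi: 2 * phi       # its derivative

phi = v_p                      # start the guess at the prior expectation
dt = 0.01                      # step size of the ascent
for _ in range(1000):
    dF_dphi = (v_p - phi) / Sigma_p + (u - g(phi)) / Sigma_u * dg(phi)
    phi += dt * dF_dphi        # move phi uphill on F

print("inferred size:", round(phi, 2))
```

With these numbers $\phi$ settles at a compromise between the prior expectation ($v_p = 3$) and the value suggested by the observation alone ($\sqrt{u} \approx 1.41$).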
Let’s denote the two terms in the gradient by $\epsilon_p = \frac{\phi - v_p}{\Sigma_p}$ and $\epsilon_u = \frac{u - g(\phi)}{\Sigma_u}$.
The above terms are the prediction errors:
- $\epsilon_p$ denotes how the inferred size differs from the prior expectation.
- $\epsilon_u$ represents how much the measured light intensity differs from the one expected given $\phi$ as the food item size.
Rewriting the equation for updating $\phi$ in terms of these prediction errors gives $\dot\phi = \epsilon_u\,g'(\phi) - \epsilon_p$.
The model parameters $v_p$, $\Sigma_p$ and $\Sigma_u$ will be encoded in the strengths of synaptic connections, while the prediction errors $\epsilon_p$ and $\epsilon_u$ will be represented by the activity of dedicated nodes.
We’ll consider simple neural nodes which change their activity proportionally to the input they receive; for example, $\dot\epsilon_p = \phi - v_p - \Sigma_p\,\epsilon_p$ and, analogously, $\dot\epsilon_u = u - g(\phi) - \Sigma_u\,\epsilon_u$.
Once these equations converge ($\dot\epsilon_p = \dot\epsilon_u = 0$), the activities of the two nodes are exactly the prediction errors defined above: $\epsilon_p = \frac{\phi - v_p}{\Sigma_p}$ and $\epsilon_u = \frac{u - g(\phi)}{\Sigma_u}$.
In this architecture, the computations are performed as follows.
The node $\phi$ receives an excitatory input from $\epsilon_u$ (weighted by $g'(\phi)$) and an inhibitory input from $\epsilon_p$, so its activity changes as $\dot\phi = \epsilon_u\,g'(\phi) - \epsilon_p$; the error nodes $\epsilon_p$ and $\epsilon_u$ in turn receive inputs from $\phi$, from the constant nodes, and from their own inhibitory self-connections.
Terminology:
- an excitatory input is an input that adds to a neuron’s activity—a positive term.
- an inhibitory input is an input that subtracts from a neuron’s activity—a negative term.
- a tonically active node is a node that has a constant output—a constant in the system.
To understand what the above graph represents, and how it can be used to simulate calculations, let’s start by laying down its components.
The above architecture contains, first of all, nodes. Each node can represent either a variable or a constant (tonically active nodes). Every variable in the graph is updated in each unit of time to a new value, by an amount (rate of change) defined by the equations described by the graph itself.
In the above architecture we have:
Variable Nodes | Constant Nodes |
---|---|
$\phi$, $\epsilon_p$, $\epsilon_u$ | $v_p$, $u$ |
The lines connecting nodes represent the inputs and outputs of each node. The direction in which the information flows is indicated by the end of the line that carries the arrowhead or dot; e.g. the connection between nodes $\phi$ and $\epsilon_p$ ends on $\epsilon_p$, so the activity of $\phi$ is one of the inputs of $\epsilon_p$.
There are two types of inputs, excitatory and inhibitory, denoted by an arrowhead and a black dot respectively. The type of the input determines its sign in the equation described by each node: positive if the input is excitatory, and negative if it is inhibitory.
Furthermore, each connection has a weight, denoted by the label near each line. The weight is a scalar value multiplying the corresponding input in the equation of the recipient node. When the label is surrounded by a box, it signifies that the expression, rather than being the weight, is the entire final term of the equation.
Summarizing the connections present in the above architecture, we have:
Sender | Recipient | Sign | Weight | Term |
---|---|---|---|---|
$\epsilon_u$ | $\phi$ | excitatory | $g'(\phi)$ | $+\epsilon_u\,g'(\phi)$ |
$\epsilon_p$ | $\phi$ | inhibitory | $1$ | $-\epsilon_p$ |
$\phi$ | $\epsilon_p$ | excitatory | $1$ | $+\phi$ |
$v_p$ | $\epsilon_p$ | inhibitory | $1$ | $-v_p$ |
$\epsilon_p$ | $\epsilon_p$ | inhibitory | $\Sigma_p$ | $-\Sigma_p\,\epsilon_p$ |
$u$ | $\epsilon_u$ | excitatory | $1$ | $+u$ |
$\phi$ | $\epsilon_u$ | inhibitory | $g(\phi)$ (boxed) | $-g(\phi)$ |
$\epsilon_u$ | $\epsilon_u$ | inhibitory | $\Sigma_u$ | $-\Sigma_u\,\epsilon_u$ |
Ultimately, we can derive the equations for the rates of change of each variable node, by summing all the terms described by their input connections with other nodes.
For example, the rate of change of node $\epsilon_u$ is the sum of its excitatory input $u$, its inhibitory input $g(\phi)$ and its inhibitory self-connection $\Sigma_u\,\epsilon_u$, i.e. $\dot\epsilon_u = u - g(\phi) - \Sigma_u\,\epsilon_u$.
This is exactly equivalent to the dynamics we wrote down for the prediction-error node in the previous section.
To simulate the model described by this graph, then, simply means to repeatedly update each variable node by its rate of change: $x(t + \Delta t) = x(t) + \Delta t\,\dot x$,
where $\Delta t$ is a small integration step (Euler integration).
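A minimal sketch of such a simulation (same illustrative parameter values as before):

```python
# Euler simulation of the three variable nodes phi, eps_p and eps_u.
v_p, Sigma_p = 3.0, 1.0        # assumed prior parameters
u, Sigma_u = 2.0, 1.0          # assumed observation and noise variance
g = lambda x: x ** 2
dg = lambda x: 2 * x

phi, eps_p, eps_u = v_p, 0.0, 0.0   # initial activities of the variable nodes
dt, steps = 0.01, 2000
for _ in range(steps):
    # each rate of change is the sum of that node's weighted inputs
    dphi = eps_u * dg(phi) - eps_p
    deps_p = phi - v_p - Sigma_p * eps_p
    deps_u = u - g(phi) - Sigma_u * eps_u
    # Euler step: new value = old value + dt * rate of change
    phi, eps_p, eps_u = phi + dt * dphi, eps_p + dt * deps_p, eps_u + dt * deps_u

print(round(phi, 2), round(eps_p, 2), round(eps_u, 2))
```

It converges to the same $\phi$ as the direct gradient ascent, but every update now uses only quantities that are locally available to the corresponding node.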
Our imaginary animal might wish to refine its expectation about the typical food size and the error it makes when observing light after each stimulus. In practice, we want to update the parameters $v_p$, $\Sigma_p$ and $\Sigma_u$ after each observation.
Maximizing $F$ with respect to these parameters increases the probability the model assigns to the sensory input it actually receives.
The intuition is: maximizing $F$ approximately maximizes $\ln p(u)$, the probability of the observed light intensity averaged
over all possible values of the food size, $p(u) = \int p(v)\,p(u \mid v)\,dv$.
After inferring the most likely value of the size, $\phi$, adjusting the parameters to increase $F$ amounts to asking:
“Given my best guess of $\phi$, how likely is it that my internal parameters would expect its light intensity to be the measured one, $u$?”
Although a single observation carries little information on its own, small parameter updates accumulated over many stimuli let the model learn the statistics of its environment.
In the same way we adjusted our guess of $\phi$, we can adjust each parameter by gradient ascent on $F$, i.e. in proportion to $\frac{\partial F}{\partial v_p}$, $\frac{\partial F}{\partial \Sigma_p}$ and $\frac{\partial F}{\partial \Sigma_u}$.
Although the environment is constantly variable (food items all have different sizes, so there’s no single ground truth the model could possibly converge to), it’s nevertheless useful to consider the values of the “ideal” parameters, for which the corresponding expected rates of change are equal to zero.
For example, the expected rate of change of $v_p$ is zero when $v_p$ equals the average of the inferred sizes $\phi$ across stimuli.
Analogously, the expected rate of change of $\Sigma_p$ (and of $\Sigma_u$) is zero when it equals the average squared deviation it is weighting, $\langle(\phi - v_p)^2\rangle$ for $\Sigma_p$ and $\langle(u - g(\phi))^2\rangle$ for $\Sigma_u$, i.e. the variance of the corresponding quantity.
The equations for the rates of change simplify significantly when rewritten in terms of the prediction errors: $\dot v_p = \epsilon_p$, $\dot\Sigma_p = \frac{1}{2}\left(\epsilon_p^2 - \frac{1}{\Sigma_p}\right)$ and $\dot\Sigma_u = \frac{1}{2}\left(\epsilon_u^2 - \frac{1}{\Sigma_u}\right)$.
These parameter update rules correspond to very simple synaptic plasticity mechanisms, since all the rules include only values that can be known by the synapse (the activity of the pre-synaptic and post-synaptic neurons, and the strength of the synapse itself) and are also Hebbian, since they depend on the products of those activities.
For example, the update rule for $\Sigma_u$ (the weight of the self-connection of the $\epsilon_u$ node) depends only on the product of the pre-synaptic and post-synaptic activities of that connection, $\epsilon_u \cdot \epsilon_u$, and on the current strength $\Sigma_u$ itself.
It’s important to note that the minimum value of
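Putting inference and learning together, here is a sketch of what learning across many stimuli could look like. Everything below (the environment statistics, the learning rate, the decision to keep $\Sigma_u$ fixed and only learn the prior parameters) is an illustrative assumption, not something prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed statistics of the environment, used only to generate stimuli.
true_mean_size, true_size_var = 3.0, 1.0
noise_var = 0.5                      # sensory noise, treated as known here
g = lambda x: x ** 2
dg = lambda x: 2 * x

# Parameters to be learned, deliberately initialized away from useful values.
v_p, Sigma_p = 1.0, 2.0
Sigma_u = noise_var                  # kept fixed in this sketch
alpha = 0.05                         # learning rate for the parameters

for trial in range(2000):
    # sample one food item and the light intensity it produces
    v = rng.normal(true_mean_size, np.sqrt(true_size_var))
    u = rng.normal(g(v), np.sqrt(noise_var))

    # inference: gradient ascent on F with the parameters held fixed
    phi = v_p
    for _ in range(1000):
        eps_p = (phi - v_p) / Sigma_p
        eps_u = (u - g(phi)) / Sigma_u
        phi += 0.002 * (eps_u * dg(phi) - eps_p)

    # learning: one small, local, Hebbian step for each prior parameter
    v_p += alpha * eps_p
    Sigma_p += alpha * 0.5 * (eps_p ** 2 - 1.0 / Sigma_p)

print("learned prior:", round(v_p, 2), round(Sigma_p, 2))
```

After enough trials $v_p$ drifts towards the average inferred size and $\Sigma_p$ towards the variability of the inferred sizes around it, which matches the behaviour of the “ideal” parameters discussed above.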
One assumption we’ve made so far is that the function $g$, relating the food size to the light intensity, is already known to the animal; in general its parameters also have to be learned.
From now on, we will consider a function of the form $g(\phi, \theta) = \theta\,h(\phi)$, where $\theta$ is a parameter encoded in the strength of a synaptic connection and $h$ is a fixed function.
First, we’ll consider a simple linear function $h(\phi) = \phi$, so that $g(\phi, \theta) = \theta\phi$.
This allows us to further simplify the graph representing our model:
The excitatory and inhibitory connections between $\phi$ and $\epsilon_u$ now simply carry the weight $\theta$, since $g(\phi, \theta) = \theta\phi$ and $\frac{\partial g}{\partial \phi} = \theta$.
After rewriting $F$ with $g(\phi, \theta) = \theta\phi$, we can find the update rule for $\theta$ by gradient ascent, $\dot\theta = \frac{\partial F}{\partial \theta} = \frac{u - \theta\phi}{\Sigma_u}\,\phi$,
which, when written in terms of the prediction error, becomes $\dot\theta = \epsilon_u\,\phi$.
It’s important to note that this rule is Hebbian as well, as the synaptic weights encoding $\theta$ change proportionally to the product of the activities of the two nodes they connect ($\phi$ and $\epsilon_u$).
Second, we’ll consider a nonlinear function $h$, so that $g(\phi, \theta) = \theta\,h(\phi)$.
Accordingly, the graph representation of the model changes as follows:
In this new neural implementation, the activities sent between nodes $\phi$ and $\epsilon_u$ are first passed through the nonlinearity: the connection from $\phi$ to $\epsilon_u$ transmits $h(\phi)$ (weighted by $\theta$), and the connection from $\epsilon_u$ back to $\phi$ is modulated by the derivative $h'(\phi)$.
Analogously to the previous case, we can derive the rate of change of $\theta$: $\dot\theta = \frac{\partial F}{\partial \theta} = \epsilon_u\,h(\phi)$.
This rule too is Hebbian for the connection between $\phi$ and $\epsilon_u$, since the weight changes proportionally to the product of the activity actually transmitted by the pre-synaptic node, $h(\phi)$, and the activity of the post-synaptic node, $\epsilon_u$.
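As a small sketch of this last rule (the choice of $h$ below, a tanh, is just an illustrative assumption):

```python
import numpy as np

h = np.tanh   # assumed nonlinearity, for illustration only

def learn_theta(u, phi, theta, Sigma_u, lr=0.01):
    """One Hebbian update of the weight theta, where g(phi) = theta * h(phi)."""
    eps_u = (u - theta * h(phi)) / Sigma_u    # sensory prediction error
    # pre-synaptic activity transmitted is h(phi), post-synaptic activity is eps_u
    return theta + lr * eps_u * h(phi)

print(learn_theta(u=2.0, phi=1.5, theta=0.5, Sigma_u=1.0))
```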
In this section, we’ll discuss the relationship between the computation in the model and a technique of statistical inference involving the minimization of free energy.
As mentioned in the previous section, the posterior distribution $p(v \mid u)$ is difficult to compute exactly, because the normalizing term $p(u)$ requires integrating over all possible values of $v$.
We want the distribution we actually work with, call it $q(v)$, to be as close as possible to the true posterior $p(v \mid u)$.
If the two distributions are identical, the Kullback-Leibler divergence $KL\big(q(v)\,\|\,p(v \mid u)\big)$ between them is zero; otherwise it is positive, so we can measure (and try to minimize) the mismatch with this divergence.
Since we choose our approximate distribution to be a delta function, we’ll simply look for the value of its centre parameter $\phi$.
By substituting $p(v \mid u) = \frac{p(v, u)}{p(u)}$ into the definition of the divergence, we can split it into $KL\big(q(v)\,\|\,p(v \mid u)\big) = \int q(v)\,\ln\frac{q(v)}{p(v, u)}\,dv + \ln p(u)$.
Then, since $\ln p(u)$ does not depend on $q$ (or on $\phi$), minimizing the divergence is equivalent to minimizing the remaining integral alone.
The integral in the last line of the above equation is called free-energy. We will denote its negative by $F$: $F = \int q(v)\,\ln\frac{p(v, u)}{q(v)}\,dv$.
We note that the negative free-energy $F$ only involves the joint distribution $p(v, u) = p(v)\,p(u \mid v)$, which the model can evaluate, and not the intractable term $p(u)$.
Assuming $q(v) = \delta(v - \phi)$, the integral collapses onto the single point $\phi$ and $F$ reduces (up to a term that does not depend on $\phi$) to $\ln p(v{=}\phi,\, u) = \ln p(\phi) + \ln p(u \mid \phi)$.
Using this expression, maximizing $F$ with respect to $\phi$ is exactly the computation we performed earlier when inferring the most likely food size.
In Section 2.4 we discussed how we wish to find the parameters of the model for which the sensory observation $u$ is most likely, i.e. parameters that maximize $p(u)$.
Since the Kullback-Leibler divergence is non-negative, the negative free-energy $F$ is a lower bound on $\ln p(u)$; therefore, increasing $F$ with respect to the parameters (as our learning rules do) also tends to increase the probability of the observed sensory input.
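The relationship between the three quantities can be written compactly as a restatement of the decomposition above:

$$
\ln p(u) \;=\; \underbrace{\int q(v)\,\ln\frac{p(v,u)}{q(v)}\,dv}_{F} \;+\; \underbrace{\int q(v)\,\ln\frac{q(v)}{p(v \mid u)}\,dv}_{KL\left(q(v)\,\|\,p(v \mid u)\right)} \;\geq\; F
$$

so $F$ can only increase by reducing the mismatch $KL$, by increasing $\ln p(u)$, or both.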
As the dimensionality of the inputs and features increases, the nodes’ dynamics and the synaptic plasticity rules remain the same as described in the previous sections, just generalized to multiple dimensions.
With the necessary introduction of matrices and vectors, to clarify the notation, we’ll denote single numbers or variables in italic ($u$, $\phi$), vectors with a bar ($\bar u$, $\bar\phi$), and matrices in bold (${\bf\Sigma}_u$, ${\bf\Theta}$).
We assume the animal has observed sensory input $\bar u$, a vector of light intensities coming from several receptors, from which it infers a vector of features $\bar v$ (for example the sizes of several food items), and that the expected input is given by a function $g(\bar v, {\bf\Theta})$,
where ${\bf\Theta}$ is a matrix of parameters that, as before, will be encoded in synaptic weights.
The probability of observing sensory input $\bar u$ given the features $\bar v$ is a multivariate normal distribution with mean $g(\bar v, {\bf\Theta})$ and covariance matrix ${\bf\Sigma}_u$,
where ${\bf\Sigma}_u$ describes the variability of the sensory noise and the correlations between its components; analogously, the prior over the features is a multivariate normal distribution with mean $\bar v_p$ and covariance matrix ${\bf\Sigma}_p$.
The negative free-energy $F$, now written in terms of our best guess $\bar\phi$, becomes (up to a constant) $F = \frac{1}{2}\Big[-\ln\lvert{\bf\Sigma}_p\rvert - (\bar\phi - \bar v_p)^T{\bf\Sigma}_p^{-1}(\bar\phi - \bar v_p) - \ln\lvert{\bf\Sigma}_u\rvert - \big(\bar u - g(\bar\phi, {\bf\Theta})\big)^T{\bf\Sigma}_u^{-1}\big(\bar u - g(\bar\phi, {\bf\Theta})\big)\Big]$.
To calculate the vector of most likely values $\bar\phi$, we again perform gradient ascent on $F$, which requires generalizing a couple of differentiation rules to vectors and matrices:
Original rule (scalars) | Generalization (vectors and matrices) |
---|---|
if $f(x) = ax^2$, then $\frac{\partial f}{\partial x} = 2ax$ | if $f(\bar x) = \bar x^T{\bf A}\bar x$ with ${\bf A}$ symmetric, then $\frac{\partial f}{\partial \bar x} = 2{\bf A}\bar x$ |
if $f(x) = \ln x$, then $\frac{\partial f}{\partial x} = \frac{1}{x}$ | if $f({\bf X}) = \ln\lvert{\bf X}\rvert$, then $\frac{\partial f}{\partial {\bf X}} = {\bf X}^{-1}$ (for symmetric ${\bf X}$) |
Since the covariance matrices are symmetric, applying these rules to $F$ gives $\frac{\partial F}{\partial \bar\phi} = -{\bf\Sigma}_p^{-1}(\bar\phi - \bar v_p) + \left(\frac{\partial g}{\partial \bar\phi}\right)^T{\bf\Sigma}_u^{-1}\big(\bar u - g(\bar\phi, {\bf\Theta})\big)$.
Analogously to the one-dimensional case, we define the (vector-valued) prediction errors as $\bar\epsilon_p = {\bf\Sigma}_p^{-1}(\bar\phi - \bar v_p)$ and $\bar\epsilon_u = {\bf\Sigma}_u^{-1}\big(\bar u - g(\bar\phi, {\bf\Theta})\big)$.
To briefly recap what those variables mean semantically:
- $\bar\epsilon_p$ is the prior prediction error: how much the inferred variable $\bar\phi$ deviates from the prior knowledge $\bar v_p$, weighted by how confident we are in our prior knowledge.
- $\bar\epsilon_u$ is the sensory prediction error: the difference between the actual sensory input $\bar u$ and the predicted sensory input $g(\bar\phi, {\bf\Theta})$, weighted by how confident we are in the measurement.
It is useful to recall that we multiply (effectively weight) these errors by the inverse covariance matrices ${\bf\Sigma}_p^{-1}$ and ${\bf\Sigma}_u^{-1}$, so that the more reliable a source of information is, the more strongly its prediction error counts.
Furthermore, it’s useful to recall that the function $g(\bar\phi, {\bf\Theta})$ describes how the inferred features map onto the expected sensory input, with the parameters ${\bf\Theta}$ encoded in the synaptic connections between the corresponding nodes.
With the prediction errors now defined as above, the update rule for $\bar\phi$ becomes $\dot{\bar\phi} = \left(\frac{\partial g}{\partial \bar\phi}\right)^T\bar\epsilon_u - \bar\epsilon_p$.
The partial derivative $\frac{\partial g}{\partial \bar\phi}$ is a matrix whose entry $(i, j)$ is the derivative of the $i$-th component of $g$ with respect to the $j$-th component of $\bar\phi$.
It’s useful to consider a simple example where $g(\bar\phi, {\bf\Theta}) = {\bf\Theta}\,h(\bar\phi)$, with the function $h$ applied element-wise to the vector $\bar\phi$.
The corresponding partial derivative matrix is $\frac{\partial g}{\partial \bar\phi} = {\bf\Theta}\,h'(\bar\phi)$, where $h'(\bar\phi)$ here denotes the diagonal matrix with the element-wise derivatives of $h$ on its diagonal.
The update rate for $\bar\phi$ can then be written as $\dot{\bar\phi} = h'(\bar\phi)\odot\big({\bf\Theta}^T\bar\epsilon_u\big) - \bar\epsilon_p$,
where the operator $\odot$ denotes element-wise multiplication (and $h'(\bar\phi)$ is now simply the vector of element-wise derivatives).
Then, the nodes can compute the prediction errors with the following dynamics: $\dot{\bar\epsilon}_p = \bar\phi - \bar v_p - {\bf\Sigma}_p\bar\epsilon_p$ and $\dot{\bar\epsilon}_u = \bar u - g(\bar\phi, {\bf\Theta}) - {\bf\Sigma}_u\bar\epsilon_u$.
Analogously to what we did in Section 2.4, we can also find the rules to update the model parameters encoded in the synaptic connections, generalized to higher dimensions: $\dot{\bar v}_p = \bar\epsilon_p$, $\dot{\bf\Sigma}_p = \frac{1}{2}\big(\bar\epsilon_p\bar\epsilon_p^T - {\bf\Sigma}_p^{-1}\big)$ and $\dot{\bf\Sigma}_u = \frac{1}{2}\big(\bar\epsilon_u\bar\epsilon_u^T - {\bf\Sigma}_u^{-1}\big)$.
Lastly, the update rate for the parameters ${\bf\Theta}$ is $\dot{\bf\Theta} = \bar\epsilon_u\,h(\bar\phi)^T$.
The above rules, like their one-dimensional counterparts, are local and Hebbian: each synaptic weight changes according to the product of the activities of the two nodes it connects (and, for the variance weights, the current strength of the synapse itself).
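To close, a sketch of the generalized inference dynamics in vector form; the dimensions, weights, covariances and the nonlinearity $h$ below are all placeholders chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

n_u, n_v = 3, 2                        # dimensions of sensory input and features
Theta = rng.normal(size=(n_u, n_v))    # generative weights (a matrix of synapses)
Sigma_u = np.eye(n_u)                  # sensory noise covariance
Sigma_p = np.eye(n_v)                  # prior covariance
v_p = np.zeros(n_v)                    # prior mean
h = np.tanh                            # example nonlinearity, applied element-wise
dh = lambda x: 1.0 - np.tanh(x) ** 2   # its element-wise derivative

u = rng.normal(size=n_u)               # a made-up sensory observation

phi, eps_p, eps_u = v_p.copy(), np.zeros(n_v), np.zeros(n_u)
dt = 0.01
for _ in range(5000):
    dphi = dh(phi) * (Theta.T @ eps_u) - eps_p      # element-wise product with h'
    deps_p = phi - v_p - Sigma_p @ eps_p
    deps_u = u - Theta @ h(phi) - Sigma_u @ eps_u
    phi, eps_p, eps_u = phi + dt * dphi, eps_p + dt * deps_p, eps_u + dt * deps_u

# at convergence the error nodes satisfy their defining equations
print(np.allclose(eps_p, np.linalg.solve(Sigma_p, phi - v_p), atol=1e-3))
print(np.allclose(eps_u, np.linalg.solve(Sigma_u, u - Theta @ h(phi)), atol=1e-3))
```

Each line of the loop only combines the activity of a node with the activities of the nodes connected to it, which is the multi-dimensional version of the locality constraint stated at the beginning.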