
Geometry of transforms #9

Open · 3 tasks
mjhajharia opened this issue Jul 6, 2022 · 9 comments

Comments

@mjhajharia (Owner)

Understanding the geometry of the transforms better:

  • tail behavior
  • convexity
  • (needs more thought and discussion)
mjhajharia changed the title from "Discussion" to "Geometry of transforms" on Jul 6, 2022
@adamhaber (Collaborator)

One thing I thought about while comparing Stan's and TFP's Cholesky bijectors was to visualize this geometry directly. Something along these lines:

For a 3x3 correlation matrix, the unconstrained Cholesky factor is 3 numbers - let's call them x, y, z. We could compute, for example, lkj_corr_cholesky_lpdf(f(x,y,z) | eta) and lkj_corr_cholesky_lpdf(g(x,y,z) | eta), where f and g are different transforms, and we can play with different values of eta. If we evaluate these on a 3-dimensional grid (say, x, y, z each going from -10 to 10 in steps of 0.1), we'll get a cube that might capture interesting properties of these different geometries. It might be interesting to visualize different 2d projections of this cube, as well as the ratio between the cube for g and the cube for f.

Hope the explanation makes sense! What do you think @mjhajharia @bob-carpenter ?
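For concreteness, here's a minimal NumPy sketch of the grid idea, using Stan's tanh-based Cholesky-correlation transform as f. The function names are just illustrative (not an existing API), and the lpdf drops its normalizing constant:

```python
import numpy as np

def cholesky_corr_transform(y, K=3):
    """Stan-style map from K*(K-1)/2 unconstrained values to the Cholesky
    factor of a KxK correlation matrix (tanh -> canonical partial correlations)."""
    z = np.tanh(np.asarray(y, dtype=float))
    L = np.zeros((K, K))
    L[0, 0] = 1.0
    idx = 0
    for i in range(1, K):
        norm_sq = 0.0
        for j in range(i):
            L[i, j] = z[idx] * np.sqrt(max(0.0, 1.0 - norm_sq))
            norm_sq += L[i, j] ** 2
            idx += 1
        L[i, i] = np.sqrt(max(0.0, 1.0 - norm_sq))
    return L

def lkj_corr_cholesky_lpdf(L, eta):
    """Unnormalized LKJ log density on the Cholesky factor L."""
    K = L.shape[0]
    k = np.arange(2, K + 1)
    return float(np.sum((K - k + 2.0 * eta - 2.0) * np.log(np.diag(L)[1:])))

# For a 3x3 correlation matrix the unconstrained space is 3-dimensional;
# a coarse grid keeps the example fast (extreme corners can give -inf).
grid = np.arange(-10.0, 10.0, 0.5)
cube = np.array([[[lkj_corr_cholesky_lpdf(cholesky_corr_transform([x, y, z]), eta=2.0)
                   for z in grid] for y in grid] for x in grid])

# One possible 2d projection: a slice of the cube at the middle z value.
mid_slice = cube[:, :, len(grid) // 2]
```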

@bob-carpenter (Collaborator)

@adamhaber: I think anything we can do to help illustrate these transforms would be great.

My thinking is that we want to evaluate geometry in the tail, body, and head of the density. Maybe the geometry is well-behaved around the mode, but not in the tail.

Tail: number of leapfrog steps until we hit the body (a draw whose log density falls in the central 99% interval of the posterior log density); this measures how well the transform removes transient bias.

Body: ESS per leapfrog step, to see how well it samples after adaptation.

Mode: we can evaluate the number of leapfrogs from the mode to the body.
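
A rough sketch of the body criterion, assuming the draws have been converted to an ArviZ InferenceData (where Stan's n_leapfrog__ is renamed to sample_stats["n_steps"]); the function name is just illustrative:

```python
import arviz as az

def ess_per_leapfrog(idata, var_name):
    """Bulk ESS of one scalar parameter divided by total leapfrog steps."""
    ess = float(az.ess(idata, var_names=[var_name])[var_name])
    n_leapfrog = int(idata.sample_stats["n_steps"].sum())
    return ess / n_leapfrog
```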

I think we also want to look at a couple of other things in all of these places. One is the norm of the gradient

$$ f(x) = \big|\big| \nabla_x \log \pi(x) \big|\big| $$

This isn't interesting at the mode, where the gradient is zero. In the body, we can get a distribution over it. What about in the tail?

I think it'd also be interesting to test positive definiteness of the Hessian and, if it's positive definite, compute its condition number. We can do this at the mode, and in the body we can again get a distribution.
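
A quick finite-difference sketch of both diagnostics; log_pi stands in for whichever unconstrained log density is being probed, and nothing here is tied to a particular transform:

```python
import numpy as np

def grad_norm(log_pi, x, h=1e-5):
    """Central-difference approximation of ||grad_x log pi(x)||."""
    x = np.asarray(x, dtype=float)
    g = np.array([(log_pi(x + h * e) - log_pi(x - h * e)) / (2 * h)
                  for e in np.eye(len(x))])
    return np.linalg.norm(g)

def hessian_eigenvalues(log_pi, x, h=1e-4):
    """Eigenvalues of a central-difference Hessian of log pi at x."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    I = np.eye(n)
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            H[i, j] = (log_pi(x + h * I[i] + h * I[j])
                       - log_pi(x + h * I[i] - h * I[j])
                       - log_pi(x - h * I[i] + h * I[j])
                       + log_pi(x - h * I[i] - h * I[j])) / (4 * h * h)
    return np.linalg.eigvalsh(0.5 * (H + H.T))  # symmetrize before eigendecomposition

# Example usage at some point x0 on the unconstrained scale:
# eig = hessian_eigenvalues(log_pi, x0)
# definite = np.all(eig < 0) or np.all(eig > 0)
# cond = np.abs(eig).max() / np.abs(eig).min() if definite else np.inf
```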

@adamhaber (Collaborator)

I've pushed a notebook with some examples of the kind of stuff I think we can do, here:

https://github.com/mjhajharia/transforms/blob/feature/corr-cholesky-geometry/transforms/cholesky/visualize%20geometry.ipynb

This is just for specific values of K and eta, but it can easily be generalized... let me know what you think!

@bob-carpenter (Collaborator)

Thanks, @adamhaber. Is there a rendered form somewhere?

@adamhaber (Collaborator)

adamhaber commented Jul 19, 2022 via email

@bob-carpenter (Collaborator)

:-). I was expecting graphics given the notebook title included "visualization". This time I actually read the text.

Negative definite is good. The negative of the inverse Hessian of the log density is what we want to be positive definite, right? I always get tripped up with negations and inversions and log/exp---anything that can go either way.

Isotropy is an issue for simplex transforms, too. There's an isometric log-ratio transform that I still don't understand.

How much does the K = 5 value affect the geometry?

@adamhaber (Collaborator)

> The negative of the inverse Hessian of the log density is what we want to be positive definite, right? I always get tripped up with negations and inversions and log/exp---anything that can go either way.

Why the inverse Hessian? My intuition is that since the mode is the maximum of the log prob function, the Hessian there should be negative definite (so tiny movements from the mode always decrease the log prob at second order, since the first-order term is zero).

> Isotropy is an issue for simplex transforms, too. There's an isometric log-ratio transform that I still don't understand.

Interesting! Do you have any intuition regarding how this might affect the sampler?

> How much does the K = 5 value affect the geometry?

What do you mean?

@sethaxen (Collaborator)

> > The negative of the inverse Hessian of the log density is what we want to be positive definite, right? I always get tripped up with negations and inversions and log/exp---anything that can go either way.
>
> Why the inverse Hessian? My intuition is that since the mode is the maximum of the log prob function, the Hessian there should be negative definite (so tiny movements from the mode always decrease the log prob at second order, since the first-order term is zero).

You're both right! The Hessian of the log density is negative definite at the mode (i.e. the log density is locally concave), and the inverse of a negative/positive definite matrix is also negative/positive definite. I think what we want is for the Hessian to be negative definite everywhere (global log-concavity), which implies unimodality.

I suspect, more strongly, that we would prefer negative definite Hessian < negative diagonal Hessian < negative scalar Hessian, each of which takes us closer to simple multivariate normal geometry, for which metric adaptation is ideal.

Here's a thought. For augmented/expanded distributions like the augmented softmax for the simplex, we have an extra degree of freedom (e.g. $r$) and a transform $f: y \mapsto (x, r)$. The unconstrained density corresponding to a uniform density on the constrained space is $p_Y(y) = |J_f(y)|\, p_R(r(y) \mid x(y))$, and we need to pick a proper prior $p_R(r(y) \mid x(y))$ according to some heuristics. The Hessian of the log density is the sum of the Hessians of its two components:
$$H_y [\log p_Y(\cdot)] = H_y [\log |J_f(\cdot)|] + H_y [\log p_R(r(\cdot) | x(\cdot))].$$

If this is analytically tractable, would it make sense to pick a $p_R$ form that guarantees negative definiteness of $H_y [\log p_Y(\cdot)]$ and also tries to bring its off-diagonal terms to zero (either at every point, the mode, or the prior mean of $H_y [\log p_Y(\cdot)]$)?
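
As a purely illustrative numerical check of this decomposition, here's a NumPy/SciPy sketch for the augmented softmax with the hypothetical choice $p_R(r \mid x) = \textrm{normal}(r \mid 0, 1)$; the log-Jacobian formula used, $\log|J_f(y)| = \sum_i \log \operatorname{softmax}(y)_i$, is an assumption of the sketch:

```python
import numpy as np
from scipy.special import logsumexp

def log_abs_det_jacobian(y):
    # Assumed log |J_f(y)| for the augmented softmax y -> (softmax(y), logsumexp(y)):
    # sum_i log softmax(y)_i.
    return float(np.sum(y - logsumexp(y)))

def log_p_R(y):
    # Hypothetical proper prior on the extra degree of freedom: r ~ normal(0, 1).
    r = logsumexp(y)
    return float(-0.5 * r**2 - 0.5 * np.log(2 * np.pi))

def log_p_Y(y):
    # Unconstrained density for a uniform distribution on the simplex.
    return log_abs_det_jacobian(y) + log_p_R(y)

def hessian(f, y, h=1e-4):
    """Central-difference Hessian of a scalar function f at y."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    I = np.eye(n)
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(y + h*I[i] + h*I[j]) - f(y + h*I[i] - h*I[j])
                       - f(y - h*I[i] + h*I[j]) + f(y - h*I[i] - h*I[j])) / (4*h*h)
    return H

y0 = np.array([0.3, -0.5, 1.2])
H_total = hessian(log_p_Y, y0)
H_jac, H_prior = hessian(log_abs_det_jacobian, y0), hessian(log_p_R, y0)
assert np.allclose(H_total, H_jac + H_prior, atol=1e-6)  # the decomposition above

eigs = np.linalg.eigvalsh(0.5 * (H_total + H_total.T))
print("negative definite:", bool(np.all(eigs < 0)))
```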

@bob-carpenter (Collaborator)

Thanks, @sethaxen. I think we're all on the same page up to some of us (me!) being sloppy with signs. By "negative scalar Hessian" do you mean a negative scalar multiple of the identity matrix? That would be nice, but even a simple log transform on a vector of positive values doesn't match this.

If we could pick a $p_R$ that guarantees negative definiteness of the Hessian, that'd be great. I have no idea how to do that.

> How much does the K = 5 value affect the geometry?

What I mean is how much does pulling the probability mass toward the unit matrix help condition the Hessian? Empirically, how much easier is it to sample K = 0.1 (pushes mass to corners) vs. K = 1 (uniform) vs. K = 10 (pushes mass toward unit matrix)?
