Implement specialized Hurdle distribution #7810
base: main
Conversation
Force-pushed from 77ed668 to 9c65ca1
Force-pushed from 3d5772c to bdb3f12
Force-pushed from bdb3f12 to ac73b55
This does not require the arbitrary truncation of continuous distribution in the logp/logcdf
Force-pushed from ac73b55 to 9a65487
Force-pushed from 9a65487 to 0917343
    dist
    for dist in dists
    if (
        getattr(dist, "rv_type", None) is not None
This was too restrictive: a subclass also inherits the dispatch function and need not be in the registry explicitly
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
##             main    #7810      +/-   ##
==========================================
+ Coverage   91.77%   92.86%    +1.08%
==========================================
  Files         107      107
  Lines       18381    18407       +26
==========================================
+ Hits        16870    17093      +223
+ Misses       1511     1314      -197
pymc/distributions/truncated.py (Outdated)
# Creating a hurdle with the adjusted weights and the truncated distribution
# Should be equivalent to truncating the original hurdle distribution
return op.rv_op(adjusted_weights, zero_dist, truncated_dist, size=size)
Actually this is not quite correct: one must also adjust psi by the amount of mass truncated in the other dist...
Also, users may be more interested in a Hurdle with a truncated dist instead of a truncated hurdle, as it won't distort the meaning of the original psi
This is something that could be added at the bambi level perhaps @tomicapretto?
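The psi adjustment mentioned above can be sketched numerically. This is my own illustration with made-up Gamma parameters and bounds, not code from this PR: truncating Hurdle(psi, dist) keeps all of the zeros but only the in-bounds mass of the nonzero part, so the effective probability of a nonzero draw shrinks.

```python
# Sketch (not PR code): how truncation reweights psi in a truncated hurdle.
# All zeros survive truncation, but only P(dist <= upper) of the nonzero mass.
import math

psi, upper = 0.7, 2.0  # illustrative values

# CDF of Gamma(shape=2, scale=2) at `upper`: 1 - exp(-x/2) * (1 + x/2)
p_in_bounds = 1 - math.exp(-upper / 2) * (1 + upper / 2)

# Effective P(nonzero) after truncating the whole hurdle at `upper`
psi_adjusted = psi * p_in_bounds / ((1 - psi) + psi * p_in_bounds)
print(psi_adjusted < psi)  # True: zeros are inflated relative to 1 - psi
```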
@ricardoV94 In my use case I am interested in constraining the gamma distribution of the hurdle-gamma. The reason is that there are natural upper limits as they relate to body parts (i.e. the injury area cannot exceed the size of the body part or some proportion thereof). Therefore, any predictions above that threshold would not make clinical sense.
It's a subtle difference: a Truncated(Hurdle(psi, Gamma))
vs Hurdle(psi, Truncated(Gamma)).
The generative process of the first is that you resample the whole hurdle process every time you get a draw outside the truncation bounds, which necessarily inflates the zeros above (1 - psi).
The second is the usual hurdle process: you either get zeros with probability (1 - psi) or you get non-zeros from a truncated gamma.
I guess yours is more like the second?
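The difference can be checked with a quick simulation. This is a hypothetical numpy sketch with rejection-sampling stand-ins and made-up Gamma parameters, not the PyMC implementation:

```python
# Hypothetical numpy simulation of the two processes described above.
import numpy as np

rng = np.random.default_rng(0)
psi, upper, n = 0.7, 2.0, 100_000  # psi = P(nonzero); illustrative values

def hurdle_draw(size):
    # Hurdle(psi, Gamma(2, scale=2)): zeros with prob 1 - psi, else a gamma draw
    nonzero = rng.random(size) < psi
    return np.where(nonzero, rng.gamma(shape=2.0, scale=2.0, size=size), 0.0)

# Truncated(Hurdle): resample the *whole* hurdle process while out of bounds
draws = hurdle_draw(n)
while (bad := draws > upper).any():
    draws[bad] = hurdle_draw(int(bad.sum()))
trunc_hurdle_zeros = (draws == 0).mean()

# Hurdle(Truncated): zeros with prob 1 - psi, else a truncated gamma draw
gamma = rng.gamma(shape=2.0, scale=2.0, size=n)
while (bad := gamma > upper).any():
    gamma[bad] = rng.gamma(shape=2.0, scale=2.0, size=int(bad.sum()))
hurdle_trunc_zeros = (np.where(rng.random(n) < psi, gamma, 0.0) == 0).mean()

print(trunc_hurdle_zeros, hurdle_trunc_zeros)  # first well above 1 - psi = 0.3
```

The first zero fraction lands noticeably above 1 - psi, the second stays at roughly 1 - psi, matching the description above.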
@ricardoV94 It is the Hurdle(psi, Truncated(Gamma)).
Then I suspect some functionality needs to go into Bambi. You can create that distribution already in PyMC, using the same strategy with which HurdleGamma is defined, but using a TruncatedGamma as the dist.
Does Bambi allow using arbitrary PyMC distributions, or only a subset that it already understands?
Because otherwise the constrained thing in Bambi would do the Truncated(Hurdle(Gamma)), not the one you want.
Bambi does allow arbitrary non-mixture continuous PyMC distributions (the way @tomicapretto implemented it is very intuitive and flexible). Ah, so constrained would do Truncated(Hurdle(psi, Gamma)) and inflate 0 counts. Good to know, especially since that is very important in what I study. Bambi allows you to apply truncated priors, but erroneous data could overwhelm that prior... I should actually test that.
I removed the commit to truncate hurdles. Instead what you want seems to be this:
import pymc as pm
from pymc.distributions import Gamma, Truncated
from pymc.distributions.mixture import _Hurdle


class HurdleTruncatedGamma:
    def __new__(cls, name, psi, alpha=None, beta=None, mu=None, sigma=None, upper=None, **kwargs):
        return _Hurdle._create(
            name=name,
            nonzero_p=psi,
            nonzero_dist=Truncated.dist(
                Gamma.dist(alpha=alpha, beta=beta, mu=mu, sigma=sigma), upper=upper
            ),
            **kwargs,
        )

    @classmethod
    def dist(cls, psi, alpha=None, beta=None, mu=None, sigma=None, upper=None, **kwargs):
        return _Hurdle._create(
            name=None,
            nonzero_p=psi,
            nonzero_dist=Truncated.dist(
                Gamma.dist(alpha=alpha, beta=beta, mu=mu, sigma=sigma), upper=upper
            ),
            **kwargs,
        )


dist = HurdleTruncatedGamma.dist(psi=0.5, mu=20, sigma=10, upper=30, size=(1000,))
draws = pm.draw(dist, random_seed=48)
print((draws == 0).mean(), draws.max())  # 0.5 29.921184342002615
That means we may still need the ScalarLoop implemented in numba, since I suspect the special operation will show up in the gradient of the Truncated Gamma. Anyway, this PR already makes hurdles less expensive by default, just not this special Hurdle of a Truncated variable.
@ricardoV94 This is amazing. Thank you so much for this!
Force-pushed from 0917343 to 3f7efa1
LGTM! We should probably add a test similar to the one that motivated this PR in the first place
We have the pre-existing hurdle tests; in a sense this is just a refactor/optimization. Can't think of anything
What caused the error originally reported here? pymc-devs/nutpie#163 Does that have a test already?
That was fixed some time ago in PyTensor: pymc-devs/pytensor#1137. The performance question when running in numba is addressed by pymc-devs/pytensor#1445. Neither is PyMC specific.
Feel free to merge whenever you feel comfortable! I think it's good to go
Looks good to me. I just left two very minor comments
@@ -56,8 +56,8 @@
]

class MarginalMixtureRV(SymbolicRandomVariable):
    """A placeholder used to specify a distribution for a mixture sub-graph."""
class _BaseMixtureRV(SymbolicRandomVariable):
Should you also change the _print_name?
)

return mix_logp
mix_support_point = pt.sum(weights * support_point_components, axis=mix_axis)
Why not use the logsumexp and have log scale weights here? Is it because the weights are already in the 0-1 range and taking the log won’t help with precision?
We're not computing any log quantities nor starting with any log quantities, so I don't think it would help. Also, the initial point is not so critical?
It indirectly addresses the issue reported in pymc-devs/nutpie#163
The new objects have a logp that handles the discrete + continuous process correctly, without requiring the arbitrary truncation of the latter at epsilon. This provides a cheaper and more stable logp / logcdf.
For discrete variables we keep using a truncation.
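As a rough sketch of what handling the discrete + continuous process correctly means (my reading of the description above, not the PR's actual implementation): the hurdle logp is a point mass at zero plus the scaled density of the nonzero part, with no epsilon-truncation of the continuous distribution.

```python
# Sketch (not the PR's code): logp of Hurdle(psi, dist) for a continuous dist.
import math

def hurdle_logp(x, psi, logp_nonzero):
    """Zeros occur with probability 1 - psi; otherwise x ~ dist (P(dist = 0) = 0)."""
    if x == 0:
        return math.log(1 - psi)
    return math.log(psi) + logp_nonzero(x)

# Example with Exponential(1) as the nonzero part: log pdf(x) = -x for x > 0
logp_expon = lambda x: -x
print(hurdle_logp(0.0, 0.4, logp_expon))  # log(0.6)
print(hurdle_logp(1.5, 0.4, logp_expon))  # log(0.4) - 1.5
```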
Also added special logic to truncate a Hurdle distribution which solves bambinos/bambi#768 (this was not the desired behavior; reverted it). CC @zwelitunyiswa
📚 Documentation preview 📚: https://pymc--7810.org.readthedocs.build/en/7810/