Add MultivariateGaussian #1301
Conversation
Here's a first review! Overall, very good job. There are just some details to iron out :)
river/proba/gaussian.py (outdated)

0.0
>>> for x in X.to_dict(orient="records"):
...     p = p.update(x)
>>> p._var
It's not a good idea to expose private attributes in documentation. There should be a sigma parameter to be consistent with proba.Gaussian (if it's too annoying, you can also replace sigma with var in proba.Gaussian).
I had mixed feelings about this. I liked the representation, though. Nevertheless, using pd.DataFrame results in a nice representation for the docs.
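For illustration, a minimal sketch of how the doctest access could read with public accessors instead of the private _var attribute; the feature names are made up, and the returned types follow the API converged on later in this thread:

    from river import proba

    p = proba.MultivariateGaussian()
    for x in ({"red": 0.3, "green": 0.2}, {"red": 0.1, "green": 0.4}):
        p = p.update(x)   # same update pattern as in the original doctest
    p.mu                  # public per-feature means
    p.var                 # public covariance estimate, instead of p._var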
river/proba/gaussian.py (outdated)

𝒩(μ=(0.385, 0.376, 0.501),
  σ^2=([0.069 0.019 -0.004]
       [0.019 0.100 -0.044]
       [-0.004 -0.044 0.078]))
It would be ideal if you could get it to be like this:

𝒩(
    μ=(0.385, 0.376, 0.501),
    σ^2=(
        [0.069 0.019 -0.004]
        [0.019 0.100 -0.044]
        [-0.004 -0.044 0.078]
    )
)
This improves readability a lot. I modified the representation to match the idea.
river/proba/gaussian.py (outdated)

>>> p.sigma[0][0] == p_.sigma
True
Ideally, we want to be able to access values by feature name, and not by position. See what I mean?
Definitely! Along with var, sigma now returns a pd.DataFrame, which allows indexing by feature name.
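For illustration, a short sketch of what that indexing enables, assuming sigma and var are pandas DataFrames indexed by feature name (the feature names and setup here are hypothetical):

    import random
    from river import proba

    p = proba.MultivariateGaussian()
    for _ in range(10):
        p = p.update({"red": random.random(), "green": random.random()})

    p.var.loc["red", "green"]    # covariance entry selected by feature name
    p.sigma.loc["red", "red"]    # diagonal entry for "red", no positional indexing needed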
river/proba/gaussian.py (outdated)

return list(self._var.matrix.values())[-1].mean.n

@property
def mu(self):
This should return a dictionary, not a list
Thank you for your valuable suggestions. I made the properties "mu" and "mode" return dictionaries.
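As a rough illustration of the dictionary-valued properties (the feature names and values are made up; the API is the one discussed in this thread):

    from river import proba

    p = proba.MultivariateGaussian()
    p = p.update({"red": 0.3, "blue": 0.7})
    p = p.update({"red": 0.5, "blue": 0.1})

    p.mu    # e.g. {"blue": 0.4, "red": 0.4}, keyed by feature name
    p.mode  # for a Gaussian the mode coincides with the mean, so also a dict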
river/proba/gaussian.py (outdated)

)

@property
def var(self):
This should return a pandas DataFrame
I made var and sigma return DataFrames. That seems to serve the representations well; they are now aligned on the decimal point.
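For reference, the pandas formatting behind that alignment, as a minimal standalone sketch with made-up values (the actual __repr__ code appears in a diff further down):

    import pandas as pd

    var = pd.DataFrame(
        [[0.069, 0.019], [0.019, 0.100]],
        index=["red", "green"],
        columns=["red", "green"],
    )
    # A fixed-precision float format keeps every column aligned on the decimal point.
    print(var.to_string(float_format="{:0.3f}".format, header=False, index=False))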
Co-authored-by: Max Halford <[email protected]>
Implementing suggestions from Max Halford.
Looking good! Still a few comments :). But you're on the right track, keep it up
river/proba/gaussian.py (outdated)

mu_str = ", ".join(f"{m:.3f}" for m in self.mu.values())
var_str = self.var.to_string(float_format="{:0.3f}".format, header=False, index=False)
var_str = " [" + var_str.replace("\n", "]\n [") + "]"
return f"𝒩(\n μ=({mu_str}),\n σ^2=(\n{var_str}\n )\n)"

def update(self, x, w=1.0):
Suggested change:
- def update(self, x, w=1.0):
+ def update(self, x):
Fixed. :)
river/proba/gaussian.py (outdated)

@@ -274,7 +273,7 @@ def __call__(self, x):
var = self.var
if var is not None:
    try:
-        return multivariate_normal(self.mu, var).pdf(x)
+        return multivariate_normal([*self.mu.values()], var).pdf(x)
There might be a bug here. You have to align self.mu with x. For instance:

return multivariate_normal([self.mu[i] for i in x], var).pdf(x)

See what I mean? The same goes for .cdf(x).
Thank you for pointing this out. I realized that mu and var were not sorted the same way, so I made both sort their keys alphabetically. Now I let x adapt to that order of keys, which is consistent between mu and var.
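For illustration, a rough sketch of the alignment idea, assuming mu is a dict and var a pandas DataFrame keyed by feature name (the helper name and signature below are hypothetical, not the PR code):

    from scipy.stats import multivariate_normal

    def aligned_pdf(mu, var, x):
        # Put the mean vector, the covariance matrix and the observation
        # in one consistent (alphabetical) feature order before calling scipy.
        keys = sorted(x)
        mean = [mu[k] for k in keys]
        cov = var.loc[keys, keys].values
        return multivariate_normal(mean, cov, allow_singular=True).pdf([x[k] for k in keys])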
river/proba/gaussian.py (outdated)

@@ -286,13 +285,17 @@ def __call__(self, x):

def cdf(self, x):
    x = list(x.values())
-   return multivariate_normal(self.mu, self.var, allow_singular=True).cdf(x)
+   return multivariate_normal([*self.mu.values()], self.var, allow_singular=True).cdf(x)

def sample(self):
This method should return a dictionary :)
Right, consistency is the key. :) Done.
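For illustration, a minimal sketch of a dictionary-returning sample, again assuming a dict-valued mu and a DataFrame-valued var (this helper is hypothetical, not the PR code):

    from scipy.stats import multivariate_normal

    def sample_as_dict(mu, var, seed=None):
        # Draw one observation and key it by feature name, so the output
        # has the same shape as the dicts passed to update().
        keys = sorted(mu)
        values = multivariate_normal(
            [mu[k] for k in keys], var.loc[keys, keys].values, allow_singular=True
        ).rvs(random_state=seed)
        return dict(zip(keys, values))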
Implemented suggestions from Max Halford.
I introduced a new base class as well; I realized that using proba.ContinuousDistribution comes with type hinting that is not consistent with proba.MultivariateGaussian. @MaxHalford, you introduced this idea in the discussion, so could you please review whether it is consistent? :)
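As an illustration of the type-hinting point, a minimal sketch of what a dict-based multivariate base class could look like (the class and method names here are hypothetical, not necessarily what the PR ended up with):

    import abc

    class MultivariateContinuousDistribution(abc.ABC):
        """Sketch: signatures take dicts of feature name to value, unlike the
        univariate ContinuousDistribution, which is hinted for single floats."""

        @abc.abstractmethod
        def update(self, x: dict):
            """Update the distribution with one observation."""

        @abc.abstractmethod
        def __call__(self, x: dict) -> float:
            """Probability density at x."""

        @abc.abstractmethod
        def cdf(self, x: dict) -> float:
            """Cumulative distribution function at x."""

        @abc.abstractmethod
        def sample(self) -> dict:
            """Draw one observation, keyed by feature name."""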
I'm merging because it's very good! Congrats @MarekWadinger :)
Thank you, I'm very happy to hear that! :) I see a lot of potential in this project and I'd be happy to contribute further. Currently, I'm playing around with conversion to sklearn so I can use my model in a validation scheme with sklearn's hyperparameter tuning. I might have some questions/suggestions related to getting/setting parameters in River2SKLClassifier, which is troubling me a bit.
We have some basic hyperparameter optimization capabilities in River. Have you checked them out? I'm not sure they're well documented 😅
I did; I actually tried to use it with an ensemble of weak unsupervised learners to optimize their parameters, taking the majority vote as ground truth. It turned out that protecting the learners was the key challenge, and I had to postpone further work. Nevertheless, here I need to fit into the validation framework of this repo to validate my results against the algorithms in a recent comparison study of streaming anomaly detectors. There they use sklearn's random grid search to avoid a single point of failure, namely wrong parameter selection.
Implemented MultivariateGaussian as part of the continuous distributions, taking advantage of the EmpiricalCovariance class and extending its doctest example to show basic usage of the class.
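For context, a minimal end-to-end usage sketch of the new class, assuming the API converged on in the review above (dict-valued mu and sample, DataFrame-valued var and sigma); the data is made up:

    import random
    from river import proba

    random.seed(42)
    p = proba.MultivariateGaussian()
    for _ in range(100):
        x = {"red": random.random(), "green": random.random(), "blue": random.random()}
        p = p.update(x)                    # running estimate of the mean vector and covariance

    p.mu                                             # dict of per-feature means
    p.var                                            # pandas DataFrame: empirical covariance matrix
    p({"red": 0.5, "green": 0.5, "blue": 0.5})       # probability density at a point
    p.cdf({"red": 0.5, "green": 0.5, "blue": 0.5})   # cumulative density at the same point
    p.sample()                                       # one draw, returned as a dict keyed by feature name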