
Add MultivariateGaussian #1301

Merged: 17 commits, Aug 2, 2023
Conversation

MarekWadinger (Contributor)

Implemented MultivariateGaussian as part of ContinuousDistributions, taking advantage of the EmpiricalCovariance class, and extended its doctest example to show basic usage of the class.
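The incremental usage described here can be sketched outside river with a Welford-style running mean and covariance. The class name `RunningMVGaussian` and its layout are illustrative, not the PR's implementation:

```python
class RunningMVGaussian:
    """Online estimate of a multivariate mean and covariance (Welford-style)."""

    def __init__(self, keys):
        self.keys = sorted(keys)  # fixed, sorted order keeps mu and var aligned
        k = len(self.keys)
        self.n = 0
        self.mean = [0.0] * k
        # Running sum of outer-product deviations; covariance = M / (n - 1)
        self.M = [[0.0] * k for _ in range(k)]

    def update(self, x):
        v = [x[key] for key in self.keys]
        self.n += 1
        delta = [v[i] - self.mean[i] for i in range(len(v))]
        for i in range(len(v)):
            self.mean[i] += delta[i] / self.n
        delta2 = [v[i] - self.mean[i] for i in range(len(v))]
        for i in range(len(v)):
            for j in range(len(v)):
                self.M[i][j] += delta[i] * delta2[j]
        return self  # allows the `p = p.update(x)` pattern from the doctest

    @property
    def mu(self):
        return dict(zip(self.keys, self.mean))

    @property
    def var(self):
        if self.n < 2:
            return None
        return [[m / (self.n - 1) for m in row] for row in self.M]
```

Feeding it one dict of feature values at a time mirrors the `X.to_dict(orient="records")` loop from the doctest.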

MaxHalford (Member)

Here's a first review! Overall, very good job. There are just a few details to iron out :)

0.0
>>> for x in X.to_dict(orient="records"):
...     p = p.update(x)
>>> p._var
MaxHalford (Member)

It's not a good idea to expose private attributes in documentation. There should be a sigma parameter, to be consistent with proba.Gaussian (if that's too annoying, you can also replace sigma with var in proba.Gaussian).

MarekWadinger (Contributor)

Had mixed feelings about this. Liked the representation, though. Nevertheless, using pd.DataFrame results in a nice representation for the docs.

Comment on lines 143 to 146
𝒩(μ=(0.385, 0.376, 0.501),
σ^2=([0.069 0.019 -0.004]
[0.019 0.100 -0.044]
[-0.004 -0.044 0.078]))
MaxHalford (Member)

It would be ideal if you could get it to be like this:

Suggested change
𝒩(μ=(0.385, 0.376, 0.501),
σ^2=([0.069 0.019 -0.004]
[0.019 0.100 -0.044]
[-0.004 -0.044 0.078]))
𝒩(
μ=(0.385, 0.376, 0.501),
σ^2=(
[0.069 0.019 -0.004]
[0.019 0.100 -0.044]
[-0.004 -0.044 0.078]
)
)

MarekWadinger (Contributor)

Improves readability a lot. I modified the representation to match the idea.

Comment on lines 188 to 189
>>> p.sigma[0][0] == p_.sigma
True
MaxHalford (Member)

Ideally, we want to be able to access values by feature name, and not by position. See what I mean?

MarekWadinger (Contributor)

Definitely! Along with var, sigma now returns a pd.DataFrame, which allows indexing by feature name.
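The feature-name access adopted here via pd.DataFrame can be mimicked with a plain dict-of-dicts; a minimal sketch, with `cov_by_name` as a hypothetical helper:

```python
def cov_by_name(keys, matrix):
    """Wrap a positional covariance matrix so entries are addressed by name."""
    return {
        ki: {kj: matrix[i][j] for j, kj in enumerate(keys)}
        for i, ki in enumerate(keys)
    }

# sigma["red"]["green"] instead of sigma[0][1]:
sigma = cov_by_name(["red", "green"], [[0.069, 0.019], [0.019, 0.100]])
```

With a DataFrame, the equivalent lookup is label-based indexing rather than positional `sigma[0][0]`.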

return list(self._var.matrix.values())[-1].mean.n

@property
def mu(self):
MaxHalford (Member)

This should return a dictionary, not a list.

MarekWadinger (Contributor)

Thank you for your valuable suggestions. I made the properties mu and mode return dictionaries.

)

@property
def var(self):
MaxHalford (Member)

This should return a pandas DataFrame.

MarekWadinger (Contributor)

I made var and sigma return DataFrames. They seem to serve well for the representations, which are now aligned on the dot.

MarekWadinger (Contributor)

Implementing suggestions from Max Halford.


MaxHalford (Member)

Looking good! Still a few comments :) But you're on the right track, keep it up.

mu_str = ", ".join(f"{m:.3f}" for m in self.mu.values())
var_str = self.var.to_string(float_format="{:0.3f}".format, header=False, index=False)
var_str = " [" + var_str.replace("\n", "]\n [") + "]"
return f"𝒩(\n μ=({mu_str}),\n σ^2=(\n{var_str}\n )\n)"

def update(self, x, w=1.0):
MaxHalford (Member)

Suggested change
def update(self, x, w=1.0):
def update(self, x):

MarekWadinger (Contributor)

Fixed. :)
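The multi-line 𝒩 representation built above with pandas' `to_string` can also be produced with plain string formatting; a sketch, where `format_gaussian` is a hypothetical stand-in for the PR's `__repr__`:

```python
def format_gaussian(mu, var):
    """Render a mean dict and covariance rows in the agreed multi-line style."""
    mu_str = ", ".join(f"{m:.3f}" for m in mu.values())
    rows = "\n".join(
        "        [" + " ".join(f"{v:.3f}" for v in row) + "]" for row in var
    )
    return f"𝒩(\n    μ=({mu_str}),\n    σ^2=(\n{rows}\n    )\n)"
```

The actual implementation leans on `DataFrame.to_string(float_format=..., header=False, index=False)` to get consistent decimal alignment for free.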

@@ -274,7 +273,7 @@ def __call__(self, x):
var = self.var
if var is not None:
try:
return multivariate_normal(self.mu, var).pdf(x)
return multivariate_normal([*self.mu.values()], var).pdf(x)
MaxHalford (Member)

There might be a bug here. You have to align self.mu with x. For instance:

return multivariate_normal([self.mu[i] for i in x], var).pdf(x)

See what I mean? The same goes for .cdf(x).

MarekWadinger (Contributor)

Thank you for pointing this out. I realized that mu and var were not sorted the same way, so I made both sort their keys alphabetically. Now I let x adapt to the order of the keys, which is consistent between mu and var.
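The alignment fix can be illustrated with a small helper that orders the stored mean by x's keys before handing both vectors to scipy (`align` is a hypothetical name; the scipy call itself is elided):

```python
def align(mu, x):
    """Return (mean_vector, x_vector) ordered by x's keys, so both line up."""
    keys = list(x)
    return [mu[k] for k in keys], [x[k] for k in keys]

# mu is stored alphabetically, but x may arrive in a different order:
mu = {"a": 0.1, "b": 0.2}
x = {"b": 1.0, "a": 2.0}
mean_vec, x_vec = align(mu, x)
```

Passing `mean_vec` and `x_vec` to `multivariate_normal(...).pdf(...)` then keeps each coordinate of the mean paired with the right feature of x.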

@@ -286,13 +285,17 @@ def __call__(self, x):

def cdf(self, x):
x = list(x.values())
return multivariate_normal(self.mu, self.var, allow_singular=True).cdf(x)
return multivariate_normal([*self.mu.values()], self.var, allow_singular=True).cdf(x)

def sample(self):
MaxHalford (Member)

This method should return a dictionary :)

MarekWadinger (Contributor)

Right, consistency is key. :) Done.
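Returning a sample keyed by feature name can be sketched like this. Note the independent per-feature draws are a simplification of the real multivariate sampler, and `sample_as_dict` is a hypothetical name:

```python
import random


def sample_as_dict(mu, var_diag, rng=random):
    """Draw one value per feature and key it by feature name.

    Uses only the diagonal of the covariance (independent draws), which is a
    simplification; the actual sampler draws from the full covariance.
    """
    return {k: rng.gauss(mu[k], var_diag[k] ** 0.5) for k in mu}
```

The point is only the return shape: a dict keyed by feature names, consistent with mu and with what update consumes.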

MarekWadinger (Contributor)

Implemented suggestions from Max Halford.


MarekWadinger (Contributor)

I introduced a new base class as well; I realized that using proba.ContinuousDistribution comes with type hints that are not consistent with proba.MultivariateGaussian. @MaxHalford, you introduced this idea in the discussion, so could you please review whether it is consistent? :)
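The type-hint mismatch motivating the new base class can be sketched as two abstract bases, one whose pdf takes a scalar and one whose pdf takes a mapping of feature names to values (names are illustrative, not river's actual hierarchy):

```python
from __future__ import annotations

import abc


class ContinuousDistribution(abc.ABC):
    """Univariate: the pdf takes a single scalar."""

    @abc.abstractmethod
    def __call__(self, x: float) -> float: ...


class MultivariateContinuousDistribution(abc.ABC):
    """Multivariate: the pdf takes a mapping of feature name to value."""

    @abc.abstractmethod
    def __call__(self, x: dict[str, float]) -> float: ...


class Uniform01(ContinuousDistribution):
    """Toy univariate distribution, just to show the scalar signature."""

    def __call__(self, x: float) -> float:
        return 1.0 if 0.0 <= x <= 1.0 else 0.0
```

Splitting the bases lets each subclass advertise the argument type it actually accepts instead of sharing one inaccurate hint.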

@MaxHalford MaxHalford merged commit b13be28 into online-ml:main Aug 2, 2023
7 of 11 checks passed
MaxHalford (Member)

I'm merging because it's very good! Congrats @MarekWadinger :)

MarekWadinger (Contributor)

Thank you, I'm very happy to hear that! :) I see a lot of potential in this project and I'd be happy to contribute further.

Currently, I'm playing around with conversion to sklearn to employ my model in a validation scheme using sklearn's hyperparameter tuning. I might have some questions/suggestions related to getting/setting parameters in River2SKLClassifier, which is troubling me a bit.

MaxHalford (Member)

We have some basic hyperparameter optimization capabilities in River. Have you checked them out? I'm not sure they're well documented 😅

MarekWadinger (Contributor)

I did. I actually tried to use it with an ensemble of weak unsupervised learners to optimize their parameters, taking a majority vote as ground truth. It turned out that protection of the learners was the key challenge, and I had to postpone further work.

Nevertheless, here I need to fit into the validation framework of this repo, to validate my results against the algorithms in a recent comparison study of streaming anomaly detectors.

There, they use sklearn's randomized grid search to avoid a single point of failure, namely wrong parameter selection.
