
Add MultivariateGaussian #1301

Merged: 17 commits, Aug 2, 2023
Conversation

MarekWadinger (Contributor)

Implemented MultivariateGaussian as part of ContinuousDistributions, taking advantage of the EmpiricalCovariance class, and extended its doctest example to show basic usage of the class.
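The incremental usage described here can be sketched outside river with a Welford-style running mean and covariance. The class name `RunningMVGaussian` and its layout are illustrative, not the PR's implementation:

```python
class RunningMVGaussian:
    """Online estimate of a multivariate mean and covariance (Welford-style)."""

    def __init__(self, keys):
        self.keys = sorted(keys)  # fixed, sorted order keeps mu and var aligned
        k = len(self.keys)
        self.n = 0
        self.mean = [0.0] * k
        # Running sum of outer-product deviations; covariance = M / (n - 1)
        self.M = [[0.0] * k for _ in range(k)]

    def update(self, x):
        v = [x[key] for key in self.keys]
        self.n += 1
        delta = [v[i] - self.mean[i] for i in range(len(v))]
        for i in range(len(v)):
            self.mean[i] += delta[i] / self.n
        delta2 = [v[i] - self.mean[i] for i in range(len(v))]
        for i in range(len(v)):
            for j in range(len(v)):
                self.M[i][j] += delta[i] * delta2[j]
        return self  # allows the `p = p.update(x)` pattern from the doctest

    @property
    def mu(self):
        return dict(zip(self.keys, self.mean))

    @property
    def var(self):
        if self.n < 2:
            return None
        return [[m / (self.n - 1) for m in row] for row in self.M]
```

Feeding it one dict of feature values at a time mirrors the `X.to_dict(orient="records")` loop from the doctest.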

MaxHalford (Member)

Here's a first review! Overall, very good job. There are just a few details to iron out :)

0.0
>>> for x in X.to_dict(orient="records"):
...     p = p.update(x)
>>> p._var
MaxHalford (Member)

It's not a good idea to expose private attributes in documentation. There should be a sigma parameter, to be consistent with proba.Gaussian (if that's too annoying, you can also replace sigma with var in proba.Gaussian).

MarekWadinger (Contributor)

Had mixed feelings about this. Liked the representation, though. Nevertheless, using pd.DataFrame results in a nice representation for the docs.

Comment on lines 143 to 146
𝒩(μ=(0.385, 0.376, 0.501),
σ^2=([0.069 0.019 -0.004]
[0.019 0.100 -0.044]
[-0.004 -0.044 0.078]))
MaxHalford (Member)

It would be ideal if you could get it to be like this:

Suggested change
𝒩(μ=(0.385, 0.376, 0.501),
σ^2=([0.069 0.019 -0.004]
[0.019 0.100 -0.044]
[-0.004 -0.044 0.078]))
𝒩(
μ=(0.385, 0.376, 0.501),
σ^2=(
[0.069 0.019 -0.004]
[0.019 0.100 -0.044]
[-0.004 -0.044 0.078]
)
)

MarekWadinger (Contributor)

Improves readability a lot. I modified the representation to match the idea.

Comment on lines 188 to 189
>>> p.sigma[0][0] == p_.sigma
True
MaxHalford (Member)

Ideally, we want to be able to access values by feature name, and not by position. See what I mean?

MarekWadinger (Contributor)

Definitely! Along with var, sigma now returns a pd.DataFrame, which allows indexing by feature name.
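The feature-name access adopted here via pd.DataFrame can be mimicked with a plain dict-of-dicts; a minimal sketch, with `cov_by_name` as a hypothetical helper:

```python
def cov_by_name(keys, matrix):
    """Wrap a positional covariance matrix so entries are addressed by name."""
    return {
        ki: {kj: matrix[i][j] for j, kj in enumerate(keys)}
        for i, ki in enumerate(keys)
    }

# sigma["red"]["green"] instead of sigma[0][1]:
sigma = cov_by_name(["red", "green"], [[0.069, 0.019], [0.019, 0.100]])
```

With a DataFrame, the equivalent lookup is label-based indexing rather than positional `sigma[0][0]`.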

return list(self._var.matrix.values())[-1].mean.n

@property
def mu(self):
MaxHalford (Member)

This should return a dictionary, not a list.

MarekWadinger (Contributor)

Thank you for your valuable suggestions. I made the properties mu and mode return dictionaries.

)

@property
def var(self):
MaxHalford (Member)

This should return a pandas DataFrame.

MarekWadinger (Contributor)

I made var and sigma return DataFrames. They seem to serve well for the representations, which are now aligned on the dot.

MarekWadinger (Contributor)

Implementing suggestions from Max Halford.


MaxHalford (Member)

Looking good! Still a few comments :) But you're on the right track, keep it up.

mu_str = ", ".join(f"{m:.3f}" for m in self.mu.values())
var_str = self.var.to_string(float_format="{:0.3f}".format, header=False, index=False)
var_str = " [" + var_str.replace("\n", "]\n [") + "]"
return f"𝒩(\n μ=({mu_str}),\n σ^2=(\n{var_str}\n )\n)"

def update(self, x, w=1.0):
MaxHalford (Member)

Suggested change
def update(self, x, w=1.0):
def update(self, x):

MarekWadinger (Contributor)

Fixed. :)
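The multi-line 𝒩 representation built above with pandas' `to_string` can also be produced with plain string formatting; a sketch, where `format_gaussian` is a hypothetical stand-in for the PR's `__repr__`:

```python
def format_gaussian(mu, var):
    """Render a mean dict and covariance rows in the agreed multi-line style."""
    mu_str = ", ".join(f"{m:.3f}" for m in mu.values())
    rows = "\n".join(
        "        [" + " ".join(f"{v:.3f}" for v in row) + "]" for row in var
    )
    return f"𝒩(\n    μ=({mu_str}),\n    σ^2=(\n{rows}\n    )\n)"
```

The actual implementation leans on `DataFrame.to_string(float_format=..., header=False, index=False)` to get consistent decimal alignment for free.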

@@ -274,7 +273,7 @@ def __call__(self, x):
var = self.var
if var is not None:
try:
return multivariate_normal(self.mu, var).pdf(x)
return multivariate_normal([*self.mu.values()], var).pdf(x)
MaxHalford (Member)

There might be a bug here. You have to align self.mu with x. For instance:

return multivariate_normal([self.mu[i] for i in x], var).pdf(x)

See what I mean? The same goes for .cdf(x).

MarekWadinger (Contributor)

Thank you for pointing this out. I realized that mu and var were not sorted the same way, so I made both sort their keys alphabetically. Now I let x adapt to the order of the keys, which is consistent between mu and var.
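The alignment fix can be illustrated with a small helper that orders the stored mean by x's keys before handing both vectors to scipy (`align` is a hypothetical name; the scipy call itself is elided):

```python
def align(mu, x):
    """Return (mean_vector, x_vector) ordered by x's keys, so both line up."""
    keys = list(x)
    return [mu[k] for k in keys], [x[k] for k in keys]

# mu is stored alphabetically, but x may arrive in a different order:
mu = {"a": 0.1, "b": 0.2}
x = {"b": 1.0, "a": 2.0}
mean_vec, x_vec = align(mu, x)
```

Passing `mean_vec` and `x_vec` to `multivariate_normal(...).pdf(...)` then keeps each coordinate of the mean paired with the right feature of x.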

@@ -286,13 +285,17 @@ def __call__(self, x):

def cdf(self, x):
x = list(x.values())
return multivariate_normal(self.mu, self.var, allow_singular=True).cdf(x)
return multivariate_normal([*self.mu.values()], self.var, allow_singular=True).cdf(x)

def sample(self):
MaxHalford (Member)

This method should return a dictionary :)

MarekWadinger (Contributor)

Right, consistency is key. :) Done.
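Returning a sample keyed by feature name can be sketched like this. Note the independent per-feature draws are a simplification of the real multivariate sampler, and `sample_as_dict` is a hypothetical name:

```python
import random


def sample_as_dict(mu, var_diag, rng=random):
    """Draw one value per feature and key it by feature name.

    Uses only the diagonal of the covariance (independent draws), which is a
    simplification; the actual sampler draws from the full covariance.
    """
    return {k: rng.gauss(mu[k], var_diag[k] ** 0.5) for k in mu}
```

The point is only the return shape: a dict keyed by feature names, consistent with mu and with what update consumes.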

MarekWadinger (Contributor)

Implemented suggestions from Max Halford.


MarekWadinger (Contributor)

I introduced a new base class as well; I realized that using proba.ContinuousDistribution comes with type hints that are not consistent with proba.MultivariateGaussian. @MaxHalford, you introduced this idea in the discussion, so could you please review whether it is consistent? :)
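The type-hint mismatch motivating the new base class can be sketched as two abstract bases, one whose pdf takes a scalar and one whose pdf takes a mapping of feature names to values (names are illustrative, not river's actual hierarchy):

```python
from __future__ import annotations

import abc


class ContinuousDistribution(abc.ABC):
    """Univariate: the pdf takes a single scalar."""

    @abc.abstractmethod
    def __call__(self, x: float) -> float: ...


class MultivariateContinuousDistribution(abc.ABC):
    """Multivariate: the pdf takes a mapping of feature name to value."""

    @abc.abstractmethod
    def __call__(self, x: dict[str, float]) -> float: ...


class Uniform01(ContinuousDistribution):
    """Toy univariate distribution, just to show the scalar signature."""

    def __call__(self, x: float) -> float:
        return 1.0 if 0.0 <= x <= 1.0 else 0.0
```

Splitting the bases lets each subclass advertise the argument type it actually accepts instead of sharing one inaccurate hint.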

@MaxHalford MaxHalford merged commit b13be28 into online-ml:main Aug 2, 2023
7 of 11 checks passed
MaxHalford (Member)

I'm merging because it's very good! Congrats @MarekWadinger :)

MarekWadinger (Contributor)

Thank you, I'm very happy to hear that! :) I see a lot of potential in this project and I'd be happy to contribute further.

Currently, I'm playing around with conversion to sklearn to employ my model in a validation scheme using sklearn's hyperparameter tuning. I might have some questions/suggestions related to getting/setting parameters in River2SKLClassifier, which is troubling me a bit.

MaxHalford (Member)

We have some basic hyperparameter optimization capabilities in River. Have you checked them out? I'm not sure they're well documented 😅

MarekWadinger (Contributor)

I did. I actually tried to use it with an ensemble of weak unsupervised learners to optimize their parameters, taking a majority vote as ground truth. It turned out that protection of the learners was the key challenge, and I had to postpone further work.

Nevertheless, here I need to fit into the validation framework of this repo, to validate my results against the algorithms in a recent comparison study of streaming anomaly detectors.

There, they use sklearn's randomized grid search to avoid a single point of failure, namely wrong parameter selection.
