Handle NaNs #360

Open
jo-mueller opened this issue Jan 7, 2025 · 6 comments
Labels
bug Something isn't working

Comments

@jo-mueller
Collaborator

When using the algorithm widgets, there is no check for NaNs in the processed dataframes. To handle these correctly, rows containing NaNs should be removed before the data is passed on to the respective algorithm. Likewise, the NaN rows should be added back after processing; otherwise napari will complain that the number of features in the inputs and outputs doesn't match.
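
To illustrate the drop-and-restore idea described above, here is a minimal sketch. The helper name `run_on_complete_rows` and the toy "algorithm" are hypothetical and not part of the plugin; the sketch assumes the algorithm accepts a 2D array and returns one output row per input row.

```python
import numpy as np
import pandas as pd


def run_on_complete_rows(features, algorithm):
    """Run `algorithm` on NaN-free rows, then restore the original row count."""
    # Keep only rows without any NaN; the original index is preserved.
    complete = features.dropna(axis=0, how="any")
    # Assumption: `algorithm` returns one output row per input row.
    result = np.asarray(algorithm(complete.to_numpy(dtype=float)))
    if result.ndim == 1:
        result = result[:, None]
    out = pd.DataFrame(result, index=complete.index)
    # Reindexing re-adds the dropped rows, filled with NaN.
    return out.reindex(features.index)


features = pd.DataFrame({
    "feature1": [0.1, 0.2, 0.3, 0.4],
    "feature2": [0.4, np.nan, 0.6, 0.7],
})
# Toy stand-in for a real algorithm: centre each column.
embedding = run_on_complete_rows(features, lambda x: x - x.mean(axis=0))
```

Because the result is reindexed against the original index, its length matches the input table, which is what the layer's feature table needs.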

@jo-mueller jo-mueller added the bug Something isn't working label Jan 7, 2025
@jo-mueller jo-mueller added this to the v0.9.0 milestone Jan 7, 2025
@zoccoler
Collaborator

zoccoler commented Mar 3, 2025

How about we implement this in a minor version prior to v0.9.0? I think this would already be useful for the next workshop; I sometimes get errors with the dimensionality reduction algorithms because of this.

@jo-mueller
Collaborator Author

It's a bit strange, because this problem was encountered before and fixed by #70, so I'm not quite sure why the algorithms sometimes fail... 🤔

@zoccoler
Collaborator

zoccoler commented Mar 5, 2025

I am using version 0.8.1.

I managed to make a MWE:

Draw a label with a single pixel (and a couple others larger if you like) and measure all features with napari-skimage-regionprops. You will end up with a table with aspect_ratio = np.nan, and roundness and circularity = np.inf.

Then run UMAP with default parameters.

I got the errors below and napari shuts itself down after a few seconds.

napari_clusters_plotter\_dimensionality_reduction.py:489: UserWarning: These features contain inf values: ['roundness', 'circularity']. They will be excluded from the analysis.!
C:\Users\mazo260d\miniforge3\envs\tim25\lib\site-packages\umap\umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.!
C:\Users\mazo260d\miniforge3\envs\tim25\lib\site-packages\umap\umap_.py:2462: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1!
C:\Users\mazo260d\miniforge3\envs\tim25\lib\site-packages\umap\spectral.py:519: RuntimeWarning: k >= N for N * N square matrix. Attempting to use scipy.linalg.eigh instead.!

The inf part seems OK, so I am guessing the problem is with the NaN?
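
To narrow down which columns are affected, a quick check of the feature table can help. This is a minimal sketch; the function name `find_non_finite_columns` is hypothetical, and the numbers in the example table are invented to mimic the single-pixel case described above.

```python
import numpy as np
import pandas as pd


def find_non_finite_columns(features):
    """Return the names of numeric columns that contain NaN or inf values."""
    numeric = features.select_dtypes(include="number")
    nan_cols = [c for c in numeric.columns if numeric[c].isna().any()]
    inf_cols = [c for c in numeric.columns if np.isinf(numeric[c]).any()]
    return {"nan": nan_cols, "inf": inf_cols}


# Invented values: label 1 stands for the single-pixel object.
table = pd.DataFrame({
    "label": [1, 2, 3],
    "area": [1, 25, 25],
    "aspect_ratio": [np.nan, 1.0, 1.2],
    "roundness": [np.inf, 0.9, 0.8],
    "circularity": [np.inf, 0.95, 0.85],
})
report = find_non_finite_columns(table)
```

Here `report["nan"]` lists `aspect_ratio` and `report["inf"]` lists `roundness` and `circularity`.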

@jo-mueller
Collaborator Author

jo-mueller commented Mar 5, 2025

@zoccoler Unfortunately, I cannot reproduce this. This is my test code:

import numpy as np
import napari
import pandas as pd

import napari_clusters_plotter as ncp
print(ncp.__version__)

labels = np.zeros((100, 100), dtype=int)
labels[:1, :1] = 1
labels[10:15, 10:15] = 2
labels[20:25, 20:25] = 3
labels[30:35, 30:35] = 4
labels[40:45, 40:45] = 5
labels[50:55, 50:55] = 6
labels[60:65, 60:65] = 7

features = pd.DataFrame({
    'label': [1, 2, 3, 4, 5, 6, 7],
    'feature1': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
    'feature2': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'feature3': [0.7, np.nan, 0.9, 1.0, 1.1, 1.2, 1.3]
})

viewer = napari.Viewer()
viewer.add_labels(labels, name='labels', features=features)

When I run a UMAP on this data, the result looks just fine. If I add np.inf to the data, the workflow still works, which it shouldn't. My suspicion is that the problem arises because np.nan and np.inf values are handled in two different places, namely here and here. If I combine infs and NaNs in the features dataframe, it still works, though. I'm not sure what causes the error.

Edit: Version is also 0.8.1

@zoccoler zoccoler removed this from the v0.9.0 milestone Mar 6, 2025
@zoccoler
Collaborator

zoccoler commented Mar 6, 2025

I ran your code and it works, but if I get the measurements using napari-skimage-regionprops (no intensity and moments in this case) and then run UMAP on all features, I get some warnings and the UMAP columns never show up.

c:\Users\mazo260d\Documents\GitHub\napari-clusters-plotter\napari_clusters_plotter\_dimensionality_reduction.py:489: UserWarning: These features contain inf values: ['roundness', 'circularity']. They will be excluded from the analysis.!
C:\Users\mazo260d\miniforge3\envs\tim25\lib\site-packages\sklearn\utils\extmath.py:1101: RuntimeWarning: invalid value encountered in divide!
C:\Users\mazo260d\miniforge3\envs\tim25\lib\site-packages\sklearn\utils\extmath.py:1106: RuntimeWarning: invalid value encountered in divide!
C:\Users\mazo260d\miniforge3\envs\tim25\lib\site-packages\sklearn\utils\extmath.py:1126: RuntimeWarning: invalid value encountered in divide!

Or this:

c:\Users\mazo260d\Documents\GitHub\napari-clusters-plotter\napari_clusters_plotter\_dimensionality_reduction.py:489: UserWarning: These features contain inf values: ['roundness', 'circularity']. They will be excluded from the analysis.!
C:\Users\mazo260d\miniforge3\envs\tim25\lib\site-packages\umap\umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.!
C:\Users\mazo260d\miniforge3\envs\tim25\lib\site-packages\umap\umap_.py:2462: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1!
C:\Users\mazo260d\miniforge3\envs\tim25\lib\site-packages\umap\umap_.py:134: UserWarning: A large number of your vertices were disconnected from the manifold.
Disconnection_distance = inf has removed 0 edges.
It has fully disconnected 2 vertices.
You might consider using find_disconnected_points() to find and remove these points from your data.
Use umap.utils.disconnected_vertices() to identify them.!

The behavior is somewhat inconsistent, though: it does not always fail. One problem could be running UMAP with far more columns (features) than rows.

I can't fully identify the problem, so let's not change anything for now and see whether this really becomes an issue in the near future.

I like the catching NaN decorator approach, maybe we could blend inf handling the same way.
One problem I notice is the case where a whole column consists of NaNs: dropping every row that contains a NaN would then remove all rows.
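
The all-NaN column case can be reproduced in plain pandas. The sketch below is only an illustration with made-up values; dropping fully empty columns before dropping rows is one possible way around it, not necessarily what the plugin does.

```python
import numpy as np
import pandas as pd

features = pd.DataFrame({
    "feature1": [0.1, 0.2, 0.3],
    "feature2": [0.4, np.nan, 0.6],
    "all_nan": [np.nan, np.nan, np.nan],
})

# Dropping rows directly removes everything, because every row has a NaN.
naive = features.dropna(axis=0, how="any")

# Dropping columns that are entirely NaN first keeps the usable rows.
cleaned = features.dropna(axis=1, how="all").dropna(axis=0, how="any")
```

With this ordering, `naive` is empty while `cleaned` keeps the two rows that are complete in the remaining columns.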

@jo-mueller
Collaborator Author

I like the catching NaN decorator approach, maybe we could blend inf handling the same way.

I tried that, and I think we run into problems if we try to handle np.inf twice. If we put it into the decorator, then the decorator should also cover the StandardScaler step, and the if-clause that currently does the checking should be removed.
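
A rough sketch of what handling NaN and inf in a single decorator could look like is shown below. The decorator name `skip_non_finite_rows` is hypothetical and this is not the plugin's actual decorator; the plain z-score stands in for the scaling step, and the wrapped function is assumed to return a DataFrame indexed like its input.

```python
import functools

import numpy as np
import pandas as pd


def skip_non_finite_rows(func):
    """Hypothetical decorator: hide rows with NaN or inf from `func`."""

    @functools.wraps(func)
    def wrapper(features, *args, **kwargs):
        # Treat inf like NaN so both are handled in this one place.
        finite = features.replace([np.inf, -np.inf], np.nan)
        finite = finite.dropna(axis=0, how="any")
        result = func(finite, *args, **kwargs)
        # Assumption: `func` returns a DataFrame indexed like its input.
        # Reindexing restores the removed rows as NaN.
        return result.reindex(features.index)

    return wrapper


@skip_non_finite_rows
def standardize(features):
    # Plain z-score as a stand-in for a scaling step.
    return (features - features.mean()) / features.std(ddof=0)


features = pd.DataFrame({
    "feature1": [0.1, 0.2, 0.3, 0.4],
    "feature2": [0.4, np.inf, 0.6, 0.7],
    "feature3": [0.7, 0.8, np.nan, 1.0],
})
scaled = standardize(features)
```

Since the wrapped function only ever sees finite rows, it would not need its own inf check, which avoids handling np.inf twice.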

2 participants