What is the expected repeatability of cugraph.leiden? #4072
Comments
@naimnv - can you check this out? I haven't looked at how we are using the random numbers in Leiden and how deterministic we should expect things to be. I think it's a reasonable expectation (in the abstract) that if you provide the same random number state to begin an algorithm and run it on the same data that you would get the same result. If there is something that we're doing that makes the algorithm non-deterministic (even when including the state of the random number generator as one of the controlled inputs), we should probably understand and document this. @jpintar - Is there a reproducer you could share to jump start our analysis? Ideally something that loads a graph, runs your 10 runs and demonstrates the variations you are seeing. |
Here is a zipped
When I run this on an NVIDIA A100, a typical output looks like this:
So even after setting |
I will take a look at it tomorrow |
@jpintar Could you please provide us with the smallest adjacency matrix (perhaps sub-sampled) for which the cluster assignments differ, i.e. the adjusted Rand index is less than 1.0? |
For the adjacency matrix that you provided, locally I get similar modularity (differs by 1e-03 from run to run) |
I've been testing adjacency matrices of different sizes and I'm getting some Rand index values less than 1.0 at 1100 vertices (but not at 1000 vertices). A representative run:
By the way, these are all KNN-graphs, generated from subsamples of real scRNA-seq data. The zipped |
Have you been able to make any progress on this? |
I am running into the same problem while running Leiden clustering on a KNN graph constructed with exact neighbors. |
@jpintar @mbruhns |
Thank you for looking into this more! With version
Like you, I see improvement for the 100,000 node graph, especially for
For my purposes, the new |
Thank you for checking out the latest version. We're still investigating it to find out the remaining cause of the variation. We'll get back to you as soon as we know more. |
Is there any progress on this? I find myself having to run a lot of Leiden clustering these days, and it would be great to be able to reliably do this on GPU, but as I said, without runs being consistent to at least RI around 0.95, preferably more, that's just not an option... |
Hi, |
The Leiden dendrogram was being populated with both the Louvain cluster assignment and the mapping between Leiden and Louvain clustering. The flattening of the dendrogram was being accomplished by applying the Louvain clustering, then mapping between the Leiden clusters at each level. Unfortunately, the Leiden to Louvain mapping allows a many-to-one relationship, so in certain cases the flattening was non-deterministic. There wasn't enough information in the dendrogram to perform the mapping deterministically.
This PR modifies the dendrogram to be similar to Louvain, just keeping the cluster assignments at each level. The mapping between Louvain and Leiden clusters is done when creating the dendrogram, where there is sufficient information to perform this translation deterministically.
Closes #4072
Authors:
- Chuck Hastings (https://github.com/ChuckHastings)
- Naim (https://github.com/naimnv)
Approvers:
- Seunghwa Kang (https://github.com/seunghwak)
- Naim (https://github.com/naimnv)
- Rick Ratzel (https://github.com/rlratzel)
URL: #4347
We were able to identify and eliminate the variability. It turns out the Leiden implementation was fine, but the collapsing of the dendrogram into the final result was non-deterministic. The information stored in the dendrogram was ambiguous: it was missing some information, known during algorithm execution, that would allow the dendrogram to be flattened deterministically. We modified Leiden to store enough detail in the dendrogram to return a deterministic flattening, which gives us consistent results. Please retest with the latest 24.06 code (either from source or the nightly builds). If you have any further issues, open a new issue and let us know. |
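To make the flattening idea concrete, here is a toy Python sketch. It is not cuGraph's C++ code, and the data layout is an assumption made purely for illustration: the dendrogram is modeled as a list of per-level assignment lists, where `levels[0][v]` is the cluster of vertex `v` and `levels[k][c]` is the level-k cluster that contains level-(k-1) cluster `c`. The function name `flatten_dendrogram` is hypothetical.

```python
def flatten_dendrogram(levels):
    """Collapse per-level cluster assignments into one label per vertex.

    Assumed (illustrative) layout:
      levels[0][v] -> cluster of vertex v at the finest level
      levels[k][c] -> level-k cluster containing level-(k-1) cluster c
    """
    if not levels:
        raise ValueError("dendrogram must contain at least one level")

    # Start from the finest-level assignment of every vertex.
    labels = list(levels[0])

    # Each coarser level is a plain function from the previous level's
    # cluster ids to new cluster ids, so composing them has exactly one
    # possible outcome regardless of execution order.
    for level in levels[1:]:
        labels = [level[cluster] for cluster in labels]
    return labels


# Five vertices in three fine clusters, which then merge into two.
example = [[0, 0, 1, 2, 2], [0, 1, 1]]
print(flatten_dendrogram(example))
```

Because every level stores a single assignment per cluster, the composition is a function and repeated calls always agree. Ambiguity of the kind described above would only arise if a level recorded a relation that had to be inverted, such as one coarse cluster pointing at several finer ones with no rule for choosing among them.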
@ChuckHastings, thank you for working on this! I've just tested with version
Since this is probably a different problem than the dendrogram flattening that you fixed, should I open a new issue? Or can you reopen this one? |
I created a new issue for you with your comment above. We will investigate there, since - as you suggest - it's likely a different problem with similar symptoms. |
What is your question?
I know that one cannot expect exact repeatability over multiple runs of cugraph.leiden (even setting random_state to a constant). But how similar can we rely on the results being?
I've repeated 10 runs on a 15,000-vertex graph on 13 different GPUs, and got an average adjusted Rand index of 0.82 across the runs on a given GPU (min: 0.69, max: 0.95). That corresponds to a difference of ±1 cluster detected on average (and up to a 6-cluster difference in the worst case) for this graph. Is that as consistent as we can hope for?
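For anyone wanting to quantify this kind of run-to-run agreement, below is a minimal pure-Python sketch of one plausible way to do it: compute the adjusted Rand index for every pair of runs and summarize. It is not necessarily how the numbers above were obtained. `run_clustering` is a hypothetical zero-argument callable, which you would have to supply, that returns one cluster label per vertex in a fixed vertex order (for example, a wrapper around a Leiden call with a fixed random state).

```python
from collections import Counter
from itertools import combinations
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same vertices."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two vertices: trivially identical

    # Vertex pairs grouped together in both, in A only, and in B only.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = in_a * in_b / total_pairs  # agreement expected by chance
    maximum = (in_a + in_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings have one cluster
    return (both - expected) / (maximum - expected)


def repeatability(run_clustering, n_runs=10):
    """Summarize pairwise ARI over repeated calls of run_clustering()."""
    if n_runs < 2:
        raise ValueError("need at least two runs to compare")
    runs = [list(run_clustering()) for _ in range(n_runs)]
    scores = [adjusted_rand_index(a, b) for a, b in combinations(runs, 2)]
    return {
        "mean": sum(scores) / len(scores),
        "min": min(scores),
        "max": max(scores),
    }
```

A fully deterministic `run_clustering` gives a mean, min and max of 1.0. The index is invariant to relabeling, so runs that find the same partition with different cluster ids still score 1.0.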