What is the expected repeatability of cugraph.leiden? #4072

Closed
jpintar opened this issue Dec 29, 2023 · 16 comments · Fixed by #4347
Labels: question (Further information is requested)

jpintar commented Dec 29, 2023

What is your question?

I know that one cannot expect exact repeatability over multiple runs of cugraph.leiden (even with random_state set to a constant). But how similar can we rely on the results being?

I've repeated 10 runs on a 15,000-vertex graph on 13 different GPUs, and got an average adjusted Rand index of 0.82 across the runs on a given GPU (min: 0.69, max: 0.95). That corresponds to a difference of ±1 cluster detected on average (and up to a 6-cluster difference in the worst case) for this graph. Is that as consistent as we can hope for?
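For reference, the agreement metric here is sklearn's adjusted Rand index (the same one used in the reproducer later in this thread). A minimal sketch showing why it is the right measure: ARI is invariant to relabeling, so scores below 1.0 reflect genuinely different partitions, not just permuted cluster ids.

from sklearn.metrics import adjusted_rand_score

run_a = [0, 0, 1, 1, 2, 2]
run_b = [1, 1, 0, 0, 2, 2]  # same partition, different label ids
run_c = [0, 0, 1, 2, 2, 2]  # one vertex moved to another cluster

print(adjusted_rand_score(run_a, run_b))  # 1.0 -- relabeling is ignored
print(adjusted_rand_score(run_a, run_c))  # < 1.0 -- a real difference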

Code of Conduct

  • I agree to follow cuGraph's Code of Conduct
  • I have searched the open issues and have found no duplicates for this question
@jpintar jpintar added the question Further information is requested label Dec 29, 2023
ChuckHastings (Collaborator) commented:
@naimnv - can you check this out? I haven't looked at how we are using the random numbers in Leiden and how deterministic we should expect things to be. I think it's a reasonable expectation (in the abstract) that if you provide the same random number state to begin an algorithm and run it on the same data, you should get the same result. If there is something we're doing that makes the algorithm non-deterministic (even when the state of the random number generator is one of the controlled inputs), we should understand and document it.

@jpintar - Is there a reproducer you could share to jump start our analysis? Ideally something that loads a graph, runs your 10 runs and demonstrates the variations you are seeing.
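A minimal sketch of the premise above, using CuPy purely for illustration (cuGraph's actual RNG state lives in the C++ layer): a fixed seed makes a single random stream exactly reproducible, so any run-to-run variation must come from somewhere else, e.g. the ordering of parallel floating-point operations.

import cupy as cp

cp.random.seed(1234)
a = cp.random.rand(5)
cp.random.seed(1234)
b = cp.random.rand(5)
assert bool(cp.all(a == b))  # same seed, identical stream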

jpintar commented Jan 11, 2024

Here is a zipped npz file containing the adjacency matrix of a 100,000-vertex graph. The quick-and-dirty code below illustrates the issue:

import pandas as pd
import cudf as cd
import numpy as np
import cupy as cp
import scipy as sp
import scipy.sparse  # explicit import: older SciPy versions don't auto-load submodules
import cugraph as cg
import itertools
import random
from sklearn.metrics import adjusted_rand_score

def set_seeds(seed):
    random.seed(seed)
    np.random.seed(seed)
    cp.random.seed(seed)

# Function to repeatedly perform cugraph.leiden and return
# data frames with stats and partition assignments for each run
def repeat_leiden(graph, iterations, random_state, max_iter):
    leiden_stats = []
    leiden_partitions = []
    for i in range(iterations):
        # Set all seeds as a precaution
        set_seeds(random_state)
        # Run cugraph.leiden with fixed random_state
        parts, mod = cg.leiden(graph, random_state=random_state, max_iter=max_iter)
        
        leiden_stats.append(
            {
               'modularity': mod,
               'partition_count': parts['partition'].nunique()
            }
        )
        
        leiden_partitions.append(
            parts.sort_values('vertex').to_pandas()['partition']
        )
    
    leiden_stats = pd.DataFrame(leiden_stats)
    leiden_stats.index.name = 'run'
    
    leiden_partitions = pd.concat(
        leiden_partitions, 
        axis=1, 
        keys=[f"partition_{i}" for i in range(iterations)]
    ).reset_index(drop=True)
    leiden_partitions.index.name = 'vertex'

    return leiden_stats, leiden_partitions

# Function to compute adjusted Rand index between pairs of runs
def compute_rand_scores(partitions_df):
    rand_scores = np.full((partitions_df.shape[1] - 1, partitions_df.shape[1]), np.nan)
    for i,j in itertools.combinations(range(partitions_df.shape[1]), 2):
        rand_scores[i, j] = adjusted_rand_score(
            partitions_df[f"partition_{i}"],
            partitions_df[f"partition_{j}"]
        )
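    # Column 0 is never written (j > i always), so drop it to keep only the upper triangle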
    rand_scores = np.delete(rand_scores, 0, axis=1)
    return rand_scores

# Load adjacency matrix and generate graph
adjacency = sp.sparse.load_npz('./adjacency.npz')

offsets = cd.Series(adjacency.indptr)
indices = cd.Series(adjacency.indices)
weights = cd.Series(adjacency.data)

g = cg.Graph()
g.from_cudf_adjlist(offsets, indices, weights)

# Perform clustering and compute Rand scores
leiden_stats, leiden_partitions = repeat_leiden(g, iterations=10, random_state=1234, max_iter=10000)
rand_scores = compute_rand_scores(leiden_partitions)

# Convert adjacency matrix to float64 and repeat
adjacency_64 = adjacency.astype('float64')

offsets_64 = cd.Series(adjacency_64.indptr)
indices_64 = cd.Series(adjacency_64.indices)
weights_64 = cd.Series(adjacency_64.data)

g_64 = cg.Graph()
g_64.from_cudf_adjlist(offsets_64, indices_64, weights_64)

leiden_stats_64, leiden_partitions_64 = repeat_leiden(g_64, iterations=10, random_state=1234, max_iter=10000)
rand_scores_64 = compute_rand_scores(leiden_partitions_64)

# Print results
np.set_printoptions(linewidth=150)
print(f"Min. modularity (float32): {leiden_stats['modularity'].min()}")
print(f"Mean modularity (float32): {leiden_stats['modularity'].mean()}")
print(f"Max. modularity (float32): {leiden_stats['modularity'].max()}\n")

print(f"Min. partition count (float32): {leiden_stats['partition_count'].min()}")
print(f"Max. partition count (float32): {leiden_stats['partition_count'].max()}\n")

print('Adjusted Rand index matrix (float32):')
print(rand_scores)
print(f"\nMin. adjusted Rand index (float32): {np.nanmin(rand_scores)}")
print(f"Mean adjusted Rand index (float32): {np.nanmean(rand_scores)}")
print(f"Max. adjusted Rand index (float32): {np.nanmax(rand_scores)}\n\n")

print(f"Min. modularity (float64): {leiden_stats_64['modularity'].min()}")
print(f"Mean modularity (float64): {leiden_stats_64['modularity'].mean()}")
print(f"Max. modularity (float64): {leiden_stats_64['modularity'].max()}\n")

print(f"Min. partition count (float64): {leiden_stats_64['partition_count'].min()}")
print(f"Max. partition count (float64): {leiden_stats_64['partition_count'].max()}\n")

print('Adjusted Rand index matrix (float64):')
print(rand_scores_64)
print(f"\nMin. adjusted Rand index (float64): {np.nanmin(rand_scores_64)}")
print(f"Mean adjusted Rand index (float64): {np.nanmean(rand_scores_64)}")
print(f"Max. adjusted Rand index (float64): {np.nanmax(rand_scores_64)}")

When I run this on an NVIDIA A100, a typical output looks like this:

Min. modularity (float32): 0.8951270580291748
Mean modularity (float32): 0.8961624324321746
Max. modularity (float32): 0.8971302509307861

Min. partition count (float32): 40
Max. partition count (float32): 44

Adjusted Rand index matrix (float32):
[[0.75088924 0.73472888 0.70018278 0.75595923 0.71810147 0.70763418 0.72208604 0.69235962 0.71466045]
 [       nan 0.72100122 0.65890278 0.71082352 0.70139466 0.67741928 0.7483544  0.6896195  0.70494497]
 [       nan        nan 0.69125541 0.75972938 0.75688551 0.72175652 0.72950286 0.73229129 0.7107452 ]
 [       nan        nan        nan 0.7182471  0.71213015 0.73381263 0.61834689 0.75539244 0.71371697]
 [       nan        nan        nan        nan 0.75071688 0.6976863  0.72713385 0.70082697 0.72347295]
 [       nan        nan        nan        nan        nan 0.74508279 0.68753365 0.75040769 0.70460173]
 [       nan        nan        nan        nan        nan        nan 0.63538147 0.79620864 0.72844381]
 [       nan        nan        nan        nan        nan        nan        nan 0.63022071 0.72055443]
 [       nan        nan        nan        nan        nan        nan        nan        nan 0.73431252]]

Min. adjusted Rand index (float32): 0.6183468875962401
Mean adjusted Rand index (float32): 0.7154546432734051
Max. adjusted Rand index (float32): 0.7962086419196115


Min. modularity (float64): 0.8953406848341683
Mean modularity (float64): 0.8962776434572689
Max. modularity (float64): 0.8976905676365944

Min. partition count (float64): 41
Max. partition count (float64): 44

Adjusted Rand index matrix (float64):
[[0.75088924 0.73472888 0.70018278 0.75595923 0.71810147 0.70763418 0.72208604 0.69235962 0.71466045]
 [       nan 0.72100122 0.65890278 0.71082352 0.70139466 0.67741928 0.7483544  0.6896195  0.70494497]
 [       nan        nan 0.69125541 0.75972938 0.75688551 0.72175652 0.72950286 0.73229129 0.7107452 ]
 [       nan        nan        nan 0.7182471  0.71213015 0.73381263 0.61834689 0.75539244 0.71371697]
 [       nan        nan        nan        nan 0.75071688 0.6976863  0.72713385 0.70082697 0.72347295]
 [       nan        nan        nan        nan        nan 0.74508279 0.68753365 0.75040769 0.70460173]
 [       nan        nan        nan        nan        nan        nan 0.63538147 0.79620864 0.72844381]
 [       nan        nan        nan        nan        nan        nan        nan 0.63022071 0.72055443]
 [       nan        nan        nan        nan        nan        nan        nan        nan 0.73431252]]

Min. adjusted Rand index (float64): 0.6581237251290206
Mean adjusted Rand index (float64): 0.7217165835712174
Max. adjusted Rand index (float64): 0.8139315511881964

So even after setting random_state to a constant (and other random seeds, too, just in case), there is considerable variation between runs. As can be seen, using float64 doesn't improve things, either. All this is with cuGraph 23.12.00.

naimnv commented Jan 12, 2024

I will take a look at it tomorrow

naimnv commented Jan 12, 2024

@jpintar Could you please provide us with the smallest adjacency matrix [perhaps sub-sampled] where the cluster assignments differ, i.e. the adjusted Rand index is less than 1.0?

naimnv commented Jan 12, 2024

For the adjacency matrix that you provided, I get similar modularity locally (it differs by about 1e-03 from run to run).

jpintar commented Jan 12, 2024

I've been testing adjacency matrices of different sizes, and I'm getting some adjusted Rand index values below 1.0 at 1100 vertices (but not at 1000 vertices). A representative run:

Min. modularity (float32): 0.8092981576919556
Mean modularity (float32): 0.809576016664505
Max. modularity (float32): 0.8100026249885559

Min. partition count (float32): 15
Max. partition count (float32): 16

Adjusted Rand index matrix (float32):
[[1.         0.97365048 0.92694001 1.         1.         0.92694001 1.         0.98764321 0.98764321]
 [       nan 0.97365048 0.92694001 1.         1.         0.92694001 1.         0.98764321 0.98764321]
 [       nan        nan 0.90061508 0.97365048 0.97365048 0.90061508 0.97365048 0.98588849 0.98588849]
 [       nan        nan        nan 0.92694001 0.92694001 1.         0.92694001 0.91428736 0.91428736]
 [       nan        nan        nan        nan 1.         0.92694001 1.         0.98764321 0.98764321]
 [       nan        nan        nan        nan        nan 0.92694001 1.         0.98764321 0.98764321]
 [       nan        nan        nan        nan        nan        nan 0.92694001 0.91428736 0.91428736]
 [       nan        nan        nan        nan        nan        nan        nan 0.98764321 0.98764321]
 [       nan        nan        nan        nan        nan        nan        nan        nan 1.        ]]

Min. adjusted Rand index (float32): 0.9006150827613733
Mean adjusted Rand index (float32): 0.9654275826626596
Max. adjusted Rand index (float32): 1.0


Min. modularity (float64): 0.8092981003204439
Mean modularity (float64): 0.8093171918721854
Max. modularity (float64): 0.8093935580791511

Min. partition count (float64): 15
Max. partition count (float64): 16

Adjusted Rand index matrix (float64):
[[0.98764321 1.         0.98764321 1.         1.         1.         1.         1.         1.        ]
 [       nan 0.98764321 1.         0.98764321 0.98764321 0.98764321 0.98764321 0.98764321 0.98764321]
 [       nan        nan 0.98764321 1.         1.         1.         1.         1.         1.        ]
 [       nan        nan        nan 0.98764321 0.98764321 0.98764321 0.98764321 0.98764321 0.98764321]
 [       nan        nan        nan        nan 1.         1.         1.         1.         1.        ]
 [       nan        nan        nan        nan        nan 1.         1.         1.         1.        ]
 [       nan        nan        nan        nan        nan        nan 1.         1.         1.        ]
 [       nan        nan        nan        nan        nan        nan        nan 1.         1.        ]
 [       nan        nan        nan        nan        nan        nan        nan        nan 1.        ]]

Min. adjusted Rand index (float64): 0.987643214839121
Mean adjusted Rand index (float64): 0.995606476387243
Max. adjusted Rand index (float64): 1.0

By the way, these are all KNN-graphs, generated from subsamples of real scRNA-seq data. The zipped npz file is attached.
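The subsampling procedure itself isn't shown in the thread. As a rough stand-in for readers without the raw data (an assumption, not what was actually done here, since these graphs were rebuilt by re-running KNN on subsamples), one can slice a principal submatrix out of the posted adjacency matrix:

import scipy.sparse

adjacency = scipy.sparse.load_npz('./adjacency.npz')
n = 1100  # target vertex count
sub = scipy.sparse.csr_matrix(adjacency[:n, :n])  # leading principal submatrix
scipy.sparse.save_npz(f'./adjacency_{n}.npz', sub)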

jpintar commented Jan 30, 2024

Have you been able to make any progress on this?

mbruhns commented Feb 2, 2024

I am running into the same problem while running Leiden clustering on a KNN graph constructed with exact neighbors.

naimnv commented Feb 13, 2024

@jpintar @mbruhns
Thank you for reporting the issue. We have a fix that was merged into branch-24.04; you can use the branch-24.04 nightly to run the latest version of the Leiden algorithm.
With the latest code, we no longer see any randomness for 1100 nodes. However, we still see some randomness (significantly less than before) for the input graph with 100,000 nodes that @jpintar shared with us. We are still working to identify the remaining sources of randomness.
Meanwhile, could you please check whether the adjusted Rand index is in an acceptable range for your use cases?
@jpintar Could you please find the smallest graph where the adjusted Rand index is less than 1.0?
Thank you.

jpintar commented Feb 14, 2024

Thank you for looking into this more! With version 24.04.00a49, the smallest graph where I see randomness has 3200 nodes (zipped npz file). Here is a typical set of runs for that graph:

Min. modularity (float32): 0.8323758244514465
Mean modularity (float32): 0.8323758363723754
Max. modularity (float32): 0.8323759436607361

Min. partition count (float32): 21
Max. partition count (float32): 21

[[0.9803510505 0.9898394315 0.9803510505 0.9803510505 0.9803510505 0.9803510505 0.9803510505 0.9803510505 0.9803510505]
 [         nan 0.9904819326 1.0000000000 1.0000000000 1.0000000000 1.0000000000 1.0000000000 1.0000000000 1.0000000000]
 [         nan          nan 0.9904819326 0.9904819326 0.9904819326 0.9904819326 0.9904819326 0.9904819326 0.9904819326]
 [         nan          nan          nan 1.0000000000 1.0000000000 1.0000000000 1.0000000000 1.0000000000 1.0000000000]
 [         nan          nan          nan          nan 1.0000000000 1.0000000000 1.0000000000 1.0000000000 1.0000000000]
 [         nan          nan          nan          nan          nan 1.0000000000 1.0000000000 1.0000000000 1.0000000000]
 [         nan          nan          nan          nan          nan          nan 1.0000000000 1.0000000000 1.0000000000]
 [         nan          nan          nan          nan          nan          nan          nan 1.0000000000 1.0000000000]
 [         nan          nan          nan          nan          nan          nan          nan          nan 1.0000000000]]

Min. adjusted Rand index (float32): 0.9803510505416672
Mean adjusted Rand index (float32): 0.9945889621392852
Max. adjusted Rand index (float32): 1.0


Min. modularity (float64): 0.8323758339478304
Mean modularity (float64): 0.8323758339478304
Max. modularity (float64): 0.8323758339478304

Min. partition count (float64): 21
Max. partition count (float64): 21

[[ 1.  1.  1.  1.  1.  1.  1.  1.  1.]
 [nan  1.  1.  1.  1.  1.  1.  1.  1.]
 [nan nan  1.  1.  1.  1.  1.  1.  1.]
 [nan nan nan  1.  1.  1.  1.  1.  1.]
 [nan nan nan nan  1.  1.  1.  1.  1.]
 [nan nan nan nan nan  1.  1.  1.  1.]
 [nan nan nan nan nan nan  1.  1.  1.]
 [nan nan nan nan nan nan nan  1.  1.]
 [nan nan nan nan nan nan nan nan  1.]]

Min. adjusted Rand index (float64): 1.0
Mean adjusted Rand index (float64): 1.0
Max. adjusted Rand index (float64): 1.0

Like you, I see improvement for the 100,000 node graph, especially for float64:

Min. modularity (float32): 0.8949086666107178
Mean modularity (float32): 0.896155321598053
Max. modularity (float32): 0.8966082334518433

Min. partition count (float32): 42
Max. partition count (float32): 43

[[0.6723109469 0.9709272313 0.6737672732 0.9620348230 0.8459497711 0.9620285832 0.7872396727 0.9631560364 0.8458706805]
 [         nan 0.6747463127 0.9738234400 0.6784998752 0.6399577699 0.6729823392 0.7119539592 0.6746788762 0.6365549299]
 [         nan          nan 0.6733504544 0.9786646319 0.8622556460 0.9776790525 0.7935470305 0.9825837654 0.8562050057]
 [         nan          nan          nan 0.6790135984 0.6397949990 0.6740246784 0.7162443494 0.6736307983 0.6351064191]
 [         nan          nan          nan          nan 0.8578959644 0.9802567737 0.7912772208 0.9728043958 0.8475066575]
 [         nan          nan          nan          nan          nan 0.8554273786 0.7674377436 0.8518021370 0.9817595718]
 [         nan          nan          nan          nan          nan          nan 0.7939334290 0.9803393581 0.8487181300]
 [         nan          nan          nan          nan          nan          nan          nan 0.7918065473 0.7654208202]
 [         nan          nan          nan          nan          nan          nan          nan          nan 0.8470709990]]

Min. adjusted Rand index (float32): 0.635106419080863
Mean adjusted Rand index (float32): 0.8093786683616847
Max. adjusted Rand index (float32): 0.9825837653527872


Min. modularity (float64): 0.8965980895995945
Mean modularity (float64): 0.8965980895995951
Max. modularity (float64): 0.8965980895995959

Min. partition count (float64): 42
Max. partition count (float64): 42

[[0.9519281421 0.9421814386 0.9544243928 0.9513851152 0.9505608662 0.9509022129 0.9458599876 0.9556839480 0.9462716129]
 [         nan 0.9566963678 0.9751989656 0.9695654552 0.9698496606 0.9639583974 0.9701558107 0.9658491024 0.9681307955]
 [         nan          nan 0.9592236242 0.9541186566 0.9647041779 0.9648726289 0.9588673504 0.9519356888 0.9538477944]
 [         nan          nan          nan 0.9788656461 0.9820126071 0.9732194225 0.9715958099 0.9801581983 0.9723216578]
 [         nan          nan          nan          nan 0.9728891017 0.9736846287 0.9690093318 0.9835770019 0.9685012892]
 [         nan          nan          nan          nan          nan 0.9737014577 0.9797423804 0.9745889712 0.9718622543]
 [         nan          nan          nan          nan          nan          nan 0.9688776565 0.9706534031 0.9655177203]
 [         nan          nan          nan          nan          nan          nan          nan 0.9709590313 0.9851334883]
 [         nan          nan          nan          nan          nan          nan          nan          nan 0.9725410395]]

Min. adjusted Rand index (float64): 0.9421814386490993
Mean adjusted Rand index (float64): 0.9656796509007762
Max. adjusted Rand index (float64): 0.9851334883268393

For my purposes, the new float64 performance (worst-case adjusted Rand index ~0.95 and no variation in the partition count) is just about tolerable, though not ideal. But the float32 performance is still not acceptable (and in general, nothing short of zero variation in the number of partitions will be acceptable).

naimnv commented Feb 14, 2024

Thank you for checking out the latest version. We're still investigating to find the remaining cause of the variation. We'll get back to you as soon as we know more.

jpintar commented Mar 18, 2024

Is there any progress on this? I find myself having to run a lot of Leiden clustering these days, and it would be great to be able to do this reliably on GPU. But as I said, unless runs are consistent to an adjusted Rand index of at least ~0.95 (preferably higher), that's just not an option...

naimnv commented Mar 18, 2024

Hi,
We'll review the code again, aiming to eliminate any sources of randomness beyond numerical precision. I hope to get back to you soon.

rapids-bot bot pushed a commit that referenced this issue Apr 22, 2024
The Leiden dendrogram was being populated with both the Louvain cluster assignment and the mapping between Leiden and Louvain clustering. The flattening of the dendrogram was accomplished by applying the Louvain clustering, then mapping between the Leiden clusters at each level.

Unfortunately, the Leiden to Louvain mapping allows a many-to-one relationship, so in certain cases the flattening was non-deterministic.  There wasn't enough information in the dendrogram to perform the mapping deterministically.

This PR modifies the dendrogram to be similar to Louvain, just keeping the cluster assignments at each level.  The mapping between Louvain and Leiden clusters is done when creating the dendrogram, where there is sufficient information to perform this translation deterministically.

Closes #4072

Authors:
  - Chuck Hastings (https://github.com/ChuckHastings)
  - Naim (https://github.com/naimnv)

Approvers:
  - Seunghwa Kang (https://github.com/seunghwak)
  - Naim (https://github.com/naimnv)
  - Rick Ratzel (https://github.com/rlratzel)

URL: #4347
ChuckHastings (Collaborator) commented:
We were able to identify and eliminate the variability. It turns out the Leiden implementation was fine, but the collapsing of the dendrogram into the final result was non-deterministic. The information stored in the dendrogram was ambiguous: it was missing information, known during algorithm execution, that would have allowed the dendrogram to be flattened deterministically.

We modified Leiden to store results in the dendrogram with sufficient detail to produce a deterministic flattening, giving us consistent results.

Please retest with the latest 24.06 code (either from source or the nightly builds). If you have any further issues, open a new issue and let us know.
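As a toy sketch of the ambiguity described above (hypothetical labels, not cuGraph's actual dendrogram structures): once two Leiden clusters collapse onto one Louvain cluster, the inverse map needed for flattening is no longer well defined.

# Many-to-one: Leiden clusters 10 and 11 both map onto Louvain cluster 1
leiden_to_louvain = {10: 1, 11: 1, 20: 0, 30: 2}

# Inverting the map yields multiple candidates for Louvain cluster 1, so a
# flattening that must pick one is non-deterministic without extra information
louvain_to_leiden = {}
for leiden_id, louvain_id in leiden_to_louvain.items():
    louvain_to_leiden.setdefault(louvain_id, []).append(leiden_id)

print(louvain_to_leiden[1])  # [10, 11] -- two equally valid choices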

jpintar commented Apr 24, 2024

@ChuckHastings, thank you for working on this! I've just tested with version 24.06.00a43. The issue is resolved completely when the adjacency matrix is float64, but there is still some variability, although much less, when it's float32 and the graph is large (75,000 nodes). I'm attaching the zipped npz file of the adjacency matrix. Here are the results I'm getting from 10 runs, as before:

Min. modularity (float32): 0.8989160060882568
Mean modularity (float32): 0.8994227409362793
Max. modularity (float32): 0.9014488458633423

Min. partition count (float32): 67
Max. partition count (float32): 68

[[1.0000000000 1.0000000000 0.8200713715 1.0000000000 1.0000000000 1.0000000000 0.8200713715 1.0000000000 1.0000000000]
 [         nan 1.0000000000 0.8200713715 1.0000000000 1.0000000000 1.0000000000 0.8200713715 1.0000000000 1.0000000000]
 [         nan          nan 0.8200713715 1.0000000000 1.0000000000 1.0000000000 0.8200713715 1.0000000000 1.0000000000]
 [         nan          nan          nan 0.8200713715 0.8200713715 0.8200713715 1.0000000000 0.8200713715 0.8200713715]
 [         nan          nan          nan          nan 1.0000000000 1.0000000000 0.8200713715 1.0000000000 1.0000000000]
 [         nan          nan          nan          nan          nan 1.0000000000 0.8200713715 1.0000000000 1.0000000000]
 [         nan          nan          nan          nan          nan          nan 0.8200713715 1.0000000000 1.0000000000]
 [         nan          nan          nan          nan          nan          nan          nan 0.8200713715 0.8200713715]
 [         nan          nan          nan          nan          nan          nan          nan          nan 1.0000000000]]

Min. adjusted Rand index (float32): 0.8200713715385887
Mean adjusted Rand index (float32): 0.9360253765470538
Max. adjusted Rand index (float32): 1.0

Since this is probably a different problem than the dendrogram flattening that you fixed, should I open a new issue? Or can you reopen this one?

ChuckHastings (Collaborator) commented:
I created a new issue for you with your comment above. We will investigate there since, as you suggest, it's likely a different problem with similar symptoms.
