
[WIP] : improve current timing script #114


Open · wants to merge 21 commits into main

Conversation

akshitasure12
Contributor

@akshitasure12 akshitasure12 commented May 31, 2025

Improves #51

@akshitasure12 akshitasure12 changed the title [WIP] : update makefile for running timing scripts [WIP] : improve current timing script May 31, 2025
Member

@dschult dschult left a comment


Any way to show text for both the time speedup and the time itself?
I often care more about speed-ups when the length of time is large -- the current values shown make it hard to see how the length of time changes between 50 and 300 nodes. Should I care that parallel is slower for 50 nodes?

Any ideas about why it would be slower for 50 nodes? It is using all processors (n_jobs=-1) -- what is making it slower?

@akshitasure12
Contributor Author

akshitasure12 commented Jun 2, 2025

Any way to show text for both the time speedup and the time itself?

We could display both of them by making an annotation dataframe that stores formatted strings. Something like this:

display.at[number_of_nodes[num], p] = f"{parallelTime:.2g}s\n{timesFaster:.2g}x"

Another way could be to create two dataframes (one for the parallel run times and another for the speed-ups) and overlay one heatmap on the other. For our use case, I think the annotation dataframe would be much simpler.
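For illustration, here's a minimal sketch of the annotation-dataframe approach (the numbers are made up, and pandas/seaborn/matplotlib are assumed, as in the timing script):

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up example data: parallel run times (seconds) and speedups over the
# sequential run, indexed by node count x edge probability.
nodes = [50, 100, 300]
probs = [1, 0.8, 0.6]
times = pd.DataFrame(
    [[0.5, 0.4, 0.3], [1.2, 1.0, 0.8], [9.5, 7.2, 5.1]],
    index=nodes, columns=probs,
)
speedups = pd.DataFrame(
    [[0.9, 1.0, 1.0], [1.5, 1.4, 1.3], [2.8, 2.6, 2.4]],
    index=nodes, columns=probs,
)

# Annotation dataframe of formatted strings so each cell shows both values.
annot = pd.DataFrame(index=nodes, columns=probs, dtype=object)
for n in nodes:
    for p in probs:
        annot.at[n, p] = f"{times.at[n, p]:.2g}s\n{speedups.at[n, p]:.2g}x"

# Color by speedup but print the combined string in each cell; fmt="" is
# required because the annotations are pre-formatted strings.
ax = sns.heatmap(speedups, annot=annot, fmt="", cmap="Greens")
ax.set_xlabel("Edge Probability")
ax.set_ylabel("Number of Vertices")
plt.savefig("heatmap_annotated.png")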

Should I care that parallel is slower for 50 nodes?

Although the parallel implementation appears to be slower for 50 nodes, the difference in their runtimes is minimal. This is likely due to parallel overhead, which tends to be more noticeable on smaller graphs. Still, the implementation remains efficient overall -- so it seems reasonable to me to overlook this small dip on small-scale graphs. LMKWYT

Contributor Author

@akshitasure12 akshitasure12 Jun 2, 2025


Not sure why the speed-up with 500 nodes is less than with 300 nodes. Could you please help me understand this?

Member


How long does it take to run and create this timing picture? If it isn't very long, try running it a few times and see if the dip in performance between 300 and 500 goes away.

If the results change from run to run, we should try to pick "better" parameters that don't show those slowdowns. Most computers these days have background processes running that can impact timing scripts. To see through the haze of such noise, we could try multiple runs and take the minimum time from those runs. Another approach is to choose larger problems so that the small differences caused by background processes look more like noise.

It's also nice to repeatedly double the parameter (or multiply by a constant factor other than 2, like 10). That allows you to see whether the time is increasing like n^2 or n^3. Doubling the nodes will take ~8 times longer for O(n^3). Of course, the big-O is only valid in the limit as n -> infinity, so take it all with a grain of salt. But if it is consistently a factor of 8 longer when doubling, you can be fairly confident it is running O(n^3) code.
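For instance, a rough sketch combining both ideas (the sizes mirror the log further down this thread, and the timed function is just an illustrative stand-in):

import time

import networkx as nx

def best_of(fn, repeats=5):
    # The minimum over several runs is the least affected by background noise.
    runs = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        runs.append(time.perf_counter() - t0)
    return min(runs)

prev = None
for n in [200, 400, 800]:  # doubling makes the growth rate visible
    G = nx.fast_gnp_random_graph(n, 0.5, seed=42)
    t = best_of(lambda: nx.betweenness_centrality(G))
    note = f" ({t / prev:.1f}x the previous size)" if prev else ""
    print(f"n={n}: {t:.3g}s{note}")  # a steady ~8x per doubling suggests O(n^3)
    prev = t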

In this case, since we don't care so much about graphs with 50 nodes and we do care about large graphs, we should probably increase the number of nodes. Maybe start with 250 nodes and double up to 4000 for this picture. Lots of problems should really be run on 10_000-100_000 nodes. But that would take a long time to do the timing... :)

I guess the point of these plots is to show the impact of parallel code. So the best results would show a factor near 2 for n_jobs=2 and near 5 for n_jobs=5. And the title or caption should say what n_jobs was for the experiment.

:)

Contributor Author


@dschult Thank you for your insights!

How long does it take to run and create this timing picture? If it isn't very long, try running it a few times and see if the dip in performance between 300 and 500 goes away.

It didn't take long to run, so I executed it a few more times to check if the dip was just a one-off. What I observed was that the speedup would initially peak, but then fall away for larger numbers of nodes.
[image: previous timing results]
After some debugging, I found the issue was in the is_reachable function, where num_in_chunk was being passed instead of n_jobs:

num_in_chunk = max(len(G) // n_jobs, 1)
node_chunks = nxp.chunks(G, num_in_chunk)

needs to be replaced with:

node_chunks = nxp.chunks(G, n_jobs)

This is being addressed in PR #112.

After applying that fix, I ran it 5 times and took the minimum time across those runs for both is_reachable (ref. heatmap) and betweenness_centrality (ref. heatmap). However, I still observe a decline in speedup at higher node counts. Could this be due to system limitations?

System details:

MacBook Pro 13-inch M2
macOS Sequoia 15.5
Apple M2 Chip
Memory 8GB

Member


Yay!! So good to hear you found one fix. Too bad there are still some mysteries.
Let's see -- are we running into memory issues? Probably not: 1600 nodes -> ~2.5M edges for the complete graph (prob=1), and each edge takes about 32-128 bytes, so I think the 8GB of memory would be OK unless the data has to be copied many times in the parallel version. I think the workers share memory, but I'm not sure.
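(Rough arithmetic for that estimate: a complete directed graph on 1600 nodes has 1600 × 1599 ≈ 2.56M edges; at 32-128 bytes each that is roughly 80-330 MB, comfortably under 8GB unless every worker gets its own copy.)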

Let's hope it is another bug -- one that involves code affecting both of these functions. I guess start with the utility functions that are called in both functions.

@dschult dschult added the type: Enhancement New feature or request label Jun 5, 2025
@dschult dschult force-pushed the improve-timing branch 2 times, most recently from d12a986 to 6174eb3 Compare June 5, 2025 17:33
@akshitasure12
Contributor Author

akshitasure12 commented Jun 5, 2025

Hi @dschult, I had previously added another commit to move numpy to the test dependencies, but I think it might not have been pulled. Would you like me to reapply that?

@dschult
Member

dschult commented Jun 6, 2025

Yes -- I'm very sorry to have lost that commit to the PR. I didn't pull the latest changes when I rebased on the main branch.

It all looks good now that you reapplied that. 🚀

@akshitasure12
Contributor Author

akshitasure12 commented Jun 8, 2025

In the recent commit, I tried to implement memmapping in nx.tournament.is_reachable.

I replaced NetworkX's graph object with a NumPy-based adjacency matrix, because the problem involved making multiple copies of the input graph for each worker. joblib.Parallel has a special way of handling large NumPy arrays: it passes them as memory-mapped references to the worker processes. The file is then accessed in read-only mode (mmap_mode="r") by each parallel worker, using joblib's dump and load.

With this change, the heatmap shows speedups ranging from 11x to 25x -- view heatmap.

Note: the current implementation assumes that the graph nodes are integers -- I plan to make this more dynamic soon. One potential drawback is the memory usage in the case of sparse graphs.
LMKWYT
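A minimal sketch of this memmapping pattern (the graph size, file name, and per-chunk work below are illustrative stand-ins, not the actual PR code):

import os
import tempfile

import networkx as nx
import numpy as np
from joblib import Parallel, delayed, dump, load

# Dump the adjacency matrix to disk once, then re-open it as a read-only
# memory map that all workers share instead of copying.
G = nx.tournament.random_tournament(500, seed=42)
A = nx.to_numpy_array(G, dtype=np.bool_)  # dense adjacency matrix

path = os.path.join(tempfile.mkdtemp(), "adj.joblib")
dump(A, path)
A_mm = load(path, mmap_mode="r")  # read-only memmap, shared by the workers

def process_chunk(adj, chunk):
    # Placeholder per-chunk computation on the shared adjacency matrix.
    return [int(adj[v].sum()) for v in chunk]

node_chunks = np.array_split(np.arange(A.shape[0]), 4)
results = Parallel(n_jobs=-1)(
    delayed(process_chunk)(A_mm, chunk) for chunk in node_chunks
)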

@dschult
Member

dschult commented Jun 9, 2025

Interesting that they have special tools for numpy arrays, but not for python objects.
For the nodes, you can map nodes to integers and back with tools like:

nodelist = list(G)  # map of integers to nodes
nodemap = {n: i for i, n in enumerate(nodelist)}

Those are good speeds -- so it looks like memory might be an issue more generally too.

It might be possible to use a numpy matrix in a more memory-efficient way than the adjacency matrix.
For example, if the max degree is M and the number of nodes is N, you could make an NxM numpy array and use it as an adjacency list. Make all entries -1 (or another "sentinel" value) after the last valid neighbor. So that is NxM instead of NxN.
Then A[n, A[n] != -1] gets the neighbors (there might be a more efficient way to get the neighbors, I'm not sure).
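A minimal sketch of that padded-array idea (illustrative only):

import networkx as nx
import numpy as np

G = nx.gnp_random_graph(6, 0.4, seed=1)
N = G.number_of_nodes()
M = max(deg for _, deg in G.degree()) or 1  # max degree

# NxM instead of NxN; -1 marks "no more neighbors" after the last valid entry.
A = np.full((N, M), -1, dtype=np.int64)
for n in G:
    nbrs = list(G.neighbors(n))
    A[n, : len(nbrs)] = nbrs

# Neighbors of node n are the entries before the sentinel values.
print(A[0, A[0] != -1])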

Another approach would be to use a scipy sparse array. So, there are lighter memory ways to handle adjacencies. But let's not worry about that yet.

Do I understand the memmap model? The array is written to a file and then each process has read access to that file. So it uses hard disk space instead of memory to share the data. It looks like there is a Python standard library version of this called mmap. I wonder if that works well with dicts. But numpy is very fast so it might be better to just stick with memmapping.

Very cool!

@akshitasure12
Contributor Author

akshitasure12 commented Jun 9, 2025

It might be possible to use a numpy matrix in a more memory-efficient way than the adjacency matrix.
For example, if the max degree is M and the number of nodes is N, you could make an NxM numpy array and use it as an adjacency list. Make all entries -1 (or another "sentinel" value) after the last valid neighbor. So that is NxM instead of NxN.

I like this idea of using an NxM numpy array as an adjacency list -- it would be an efficient way of expressing the graph. I'll work on this approach. Thanks for the insights :)
Also, I thought it would be a good idea to move this to a separate PR to keep things aligned with the original focus (ref PR#119).

Do I understand the memmap model? The array is written to a file, and then each process has read access to that file. So it uses hard disk space instead of memory to share the data.

Yep, that’s pretty much it!

@dschult
Member

dschult commented Jun 9, 2025

+1 for moving the adjacency stuff to a separate PR. :)

@akshitasure12 akshitasure12 marked this pull request as ready for review June 15, 2025 07:26
@akshitasure12
Contributor Author

Do we still need timing_comparison.md, or can we delete that as well?

@dschult
Member

dschult commented Jul 1, 2025

I think we can delete that file too.
(It is the output of the timing_all_functions.py script, right? I can't find any mention of it in the repo.)


I'm not sure if we should remove this file -- we do wish to someday have a heatmap (or some other visual) to showcase all (or most) of the algorithms' speedups. LMKWYT. Also, we still have the heatmap generated by this script: https://github.com/networkx/nx-parallel/blob/main/timing/heatmap_all_functions.png

Contributor Author


How would we manage the heatmap as the number of algorithms grows? Would we do something like regenerating this heatmap every time a new algorithm is added? I'm just wondering what the long-term plan might be!

Member

@Schefflera-Arboricola Schefflera-Arboricola Jul 4, 2025


I meant something like this: https://github.com/python-graphblas/graphblas-algorithms/blob/main/README.md -- it doesn't have to be a heatmap.

We don't have to showcase all algorithms -- only 1-2 algorithms of each kind (like shortest paths, centrality measures, etc.), and on real-world networks.


I think we should keep this file (and populate it with all the other heatmaps) because it contains important info like the machine's specifications and the heatmap results. I think we could also mention other software specifications here, like backend="loky", etc.

Maybe we should rename it to README.md (in the timing folder) so that it's visible as soon as someone opens the folder.

Also, currently not all heatmaps were generated on the same machine, and these are very old machine specifications. LMKWYT.

Comment on lines +19 to +23
# Default Config
joblib.parallel_config(n_jobs=-1)
# To use NetworkX's parallel backend, set the following configuration.
# nx.config.backends.parallel.active = True
# nx.config.backends.parallel.n_jobs = -1


No need for these comments -- we expect that the reader of the timing script knows how to use/configure nx-parallel. LMKWYT.

Suggested change:
- # Default Config
- joblib.parallel_config(n_jobs=-1)
- # To use NetworkX's parallel backend, set the following configuration.
- # nx.config.backends.parallel.active = True
- # nx.config.backends.parallel.n_jobs = -1
+ # Setting configs: Using all CPU cores
+ joblib.parallel_config(n_jobs=-1)

Contributor Author


We wouldn’t need this anymore since we’re already updating the NetworkX configs to -1, right?


Yes, but that PR is not merged yet -- what merge order for these PRs would you prefer? Should we merge this PR second?

stdTime = t2 - t1
tournament_funcs = ["is_reachable", "tournament_is_strongly_connected"]
bipartite_funcs = ["node_redundancy"]
random.seed(42)


Do we need to set the seed here?

Contributor Author


I had included it for reproducibility, but happy to remove it if there's no need.


Yes, but why do we need a global seed? Is it being used by any function(s) below?


Using "line" (top line, bottom line) is a bit confusing -- because usually in the context of graphs (not networks) a line might mean a line graph. So "value" or "number" might be more apt. Or we could also color-code the values.

Member

@Schefflera-Arboricola Schefflera-Arboricola Jul 3, 2025


I tried running this script and I got a lot of "findfont: Font family 'Arial Rounded MT Bold' not found." messages at the end.

Also, for betweenness_centrality it ran once and then it started running again --

(nxp-dev) @Schefflera-Arboricola ➜ /nx-parallel (improve-timing) $ python timing/timing_individual_function.py
200
0.24375876500016602
0.4948986529998365
Finished <function betweenness_centrality at 0x7173ebb4d440>
400
2.0439088319999428
3.804219898000156
Finished <function betweenness_centrality at 0x7173ebb4d440>
800
17.221894170999803
29.955433169000116
Finished <function betweenness_centrality at 0x7173ebb4d440>
1600
139.0610084220002
237.87486764899995
Finished <function betweenness_centrality at 0x7173ebb4d440>
200
0.2752337769998121
0.5823823759992592
Finished <function betweenness_centrality at 0x7173ebb4d440>
400
2.3135327339996365
4.701841838999826
Finished <function betweenness_centrality at 0x7173ebb4d440>
800
19.94758787899991
...

Please make sure the script runs with all kinds of algorithms, i.e., that all code blocks run fine.

Contributor Author


I think this is because for algorithms like betweenness_centrality, the script times them for multiple edge probabilities, i.e., p = [1, 0.8, 0.6, 0.4, 0.2], which isn't the case for tournament graphs (where the edge probability is always 1). I'm assuming this is what you meant when you said it started running again! Please let me know if I've misunderstood anything.


OK, thank you for clarifying -- it would be helpful to print out the edge probability in the output.

ax.set_xlabel("Number of Vertices", fontweight="bold", fontsize=12)
ax.set_ylabel("Edge Probability", fontweight="bold", fontsize=12)

n_jobs = nxp.get_n_jobs()


I'd recommend using get_active_backend()[1] instead of get_n_jobs here -- because the former gives the n_jobs that joblib has stored and used while running the code (and we are using joblib's config system in this script).
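For reference, a minimal sketch of the difference (assuming the config is set the way the timing script sets it):

import joblib
from joblib.parallel import get_active_backend

joblib.parallel_config(n_jobs=-1)
# get_active_backend() returns (backend, n_jobs); the second element is the
# effective n_jobs stored by joblib's own config system.
backend, n_jobs = get_active_backend()
print(backend, n_jobs)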
