memory usage bottleneck #49
Sounds good to me! To clarify, does it currently take 200 GB for 40k seqlets even with this modification in place?
For 40k seqlets per metacluster, the peak memory usage with this modification is currently ~120 GB, according to slurm seff. Thanks!
Thanks Han.
Av - we still need to bring this down by a lot; it must fit in a Google Colab instance, i.e. around 12 GB max usage.
Maybe using a different implementation of Leiden/Louvain might help? Is the high memory usage a problem with the phenograph implementation? The Louvain/Leiden implementation that Laksshman and Akshay use for single-cell clustering (a much larger number of entities) seems to be very efficient with memory and speed. Maybe take a look at that.
On Sat, Nov 23, 2019, 12:23 AM Han Yuan wrote:
> For 40k seqlets per metacluster, currently the peak memory usage is ~120 GB with this modification put in, according to slurm seff. Thanks!
Yes, agreed. I don't recall seeing any evidence that the Louvain/Leiden implementation itself is causing the issue after Han's fix. Han's fix specifically addressed an issue in the borrowed-from-phenograph code that wrote the binary file subsequently read by Louvain.
Hi Avanti,
I've been trying to figure out the memory bottleneck when using tfmodisco. It turns out the initial dense matrix created by seqlets2patterns doesn't take that much memory with 40k seqlets per metacluster (~40 GB). I then narrowed it down to graph2binary() in modisco/cluster/phenograph/core.py. graph2binary() creates a really large list before writing it out to a binary file:
f.writelines([e for t in zip(ij, s) for e in t])
For a 4 GB sparse matrix, this list can be ~60 GB. By avoiding creating this list, I can run tfmodisco with 40k seqlets per metacluster within 200 GB of memory.
I've submitted a PR with this revision. I'm not too familiar with the codebase yet, so let me know if I missed anything.
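The core idea of the fix is to stream each packed edge record to the file as it is produced instead of materializing the entire list of records in memory first. A minimal sketch of that pattern, assuming a hypothetical record layout of two uint32 node ids plus a float64 weight per edge (this is illustrative, not the exact TF-MoDISco/phenograph binary format):

```python
import struct

def write_edges_streaming(f, edges, weights):
    """Write (i, j, weight) edge records to a binary file-like object.

    Unlike f.writelines([e for t in zip(ij, s) for e in t]), which first
    builds one giant in-memory list of every packed element, this loop
    packs and writes one record at a time, keeping peak memory usage
    independent of the number of edges.

    Record layout assumed here: little-endian uint32, uint32, float64.
    """
    pack = struct.Struct("<IId").pack
    for (i, j), w in zip(edges, weights):
        f.write(pack(i, j, w))
```

Buffered file I/O makes the per-record write calls cheap, so the streaming version trades a large transient allocation for essentially no speed penalty.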