Simple parallelization framework #4501

Draft
wants to merge 67 commits into
base: master

Conversation

HechtiDerLachs
Collaborator

@HechtiDerLachs HechtiDerLachs commented Jan 25, 2025

Based on #4162 by @antonydellavecchia.

At the moment this is WIP and still failing.

@HechtiDerLachs HechtiDerLachs marked this pull request as draft January 25, 2025 10:57
@antonydellavecchia antonydellavecchia mentioned this pull request Jan 25, 2025
4 tasks
@thofma
Collaborator

thofma commented Jan 25, 2025

Out of curiosity, what is the overhead of spawning a process and moving the data around? I know it depends on the application, but timings for a simple example, such as the one in the tests, would be interesting.
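
One way to get a rough baseline for this, independent of the PR, is to time a plain `Distributed` round trip. The following is only a sketch using Julia's standard library: the payload is a stand-in array rather than an Oscar object, and it does not go through the serialization used in this PR, so it gives at best a lower bound for the transport cost.

```julia
using Distributed

# Time the spawn of a single worker process.
spawn = @timed addprocs(1)
pid = only(spawn.value)

# Stand-in payload (assumption: a plain array, not an Oscar object).
payload = collect(1:1000)

# Warm-up call so that compilation on both sides is not measured.
remotecall_fetch(identity, pid, payload)

# Round trip: ship the array to the worker and fetch it back unchanged.
trip = @timed remotecall_fetch(identity, pid, payload)

println("spawn: ", spawn.time, " s; round trip: ", trip.time, " s")

# Shut the worker down again.
rmprocs(pid)
```

The spawn time is a one-off cost, while the round-trip time is paid on every task, so the second number is the one to compare against the `@time` results of a parallel call.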

@HechtiDerLachs
Collaborator Author

Unfortunately, it seems difficult to reproduce a working version of this right now. We had it running at some point yesterday, but I can't figure out what's going wrong now. Once we have it running again, I'll post some communication timings here. These will also be interesting when it comes to comparing with the native Singular serialization.

@HechtiDerLachs
Collaborator Author

HechtiDerLachs commented Jan 29, 2025

The tests added here are running again, thanks to yesterday's work by @antonydellavecchia !

So here are some timings. As far as I understand, without starting a new process, parallel tasks are automatically executed on the parent process. I get these timings:

julia> @time success, res1 = Oscar.parallel_all(a)
  0.000118 seconds (220 allocations: 11.141 KiB)
(true, QQMPolyRingElem[y, 1, x])

julia> @time success, res1 = Oscar.parallel_all(a)
  0.000114 seconds (220 allocations: 11.141 KiB)
(true, QQMPolyRingElem[y, 1, x])

julia> @time success, res1 = Oscar.parallel_all(a)
  0.000117 seconds (223 allocations: 12.016 KiB)
(true, QQMPolyRingElem[y, 1, x])

julia> @time success, res1 = Oscar.parallel_all(a)
  0.000114 seconds (220 allocations: 11.141 KiB)
(true, QQMPolyRingElem[y, 1, x])

When I start one other process, as indicated in the tests, I get the following:

julia> @time success, res1 = Oscar.parallel_all(a)
  0.002743 seconds (1.69 k allocations: 87.689 KiB, 2 lock conflicts)
(true, QQMPolyRingElem[y, 1, x])

julia> @time success, res1 = Oscar.parallel_all(a)
  0.001832 seconds (1.69 k allocations: 87.689 KiB, 2 lock conflicts)
(true, QQMPolyRingElem[y, 1, x])

julia> @time success, res1 = Oscar.parallel_all(a)
  0.002679 seconds (1.69 k allocations: 87.689 KiB, 2 lock conflicts)
(true, QQMPolyRingElem[y, 1, x])

julia> @time success, res1 = Oscar.parallel_all(a)
  0.002704 seconds (1.74 k allocations: 90.705 KiB, 2 lock conflicts)
(true, QQMPolyRingElem[y, 1, x])

julia> @time success, res1 = Oscar.parallel_all(a)
  0.001806 seconds (1.69 k allocations: 87.689 KiB, 2 lock conflicts)
(true, QQMPolyRingElem[y, 1, x])

julia> @time success, res1 = Oscar.parallel_all(a)
  0.001831 seconds (1.69 k allocations: 87.689 KiB, 2 lock conflicts)
(true, QQMPolyRingElem[y, 1, x])

julia> @time success, res1 = Oscar.parallel_all(a)
  0.001891 seconds (1.69 k allocations: 93.752 KiB, 2 lock conflicts)
(true, QQMPolyRingElem[y, 1, x])

julia> @time success, res1 = Oscar.parallel_all(a)
  0.003475 seconds (1.69 k allocations: 88.564 KiB, 2 lock conflicts)
(true, QQMPolyRingElem[y, 1, x])

@antonydellavecchia: Do you happen to know whether serialization and deserialization also take place in the first case? Or do we benefit from some caching there? That would be interesting to know in order to estimate how much time goes into deserialization and how much is actually spent sending things around.
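
To separate the two contributions, one could time serialization and deserialization on their own, without any process in between. The sketch below uses Julia's `Serialization` standard library and a stand-in payload; both are assumptions for illustration, since the serializer used in this PR and the actual Oscar objects will behave differently.

```julia
using Serialization

# Stand-in payload (assumption: nested integer arrays, not an Oscar object).
payload = [collect(i:i+99) for i in 1:100]

# Warm up once so that compilation is not part of the measurement.
warm = IOBuffer()
serialize(warm, payload)
deserialize(seekstart(warm))

# Time writing the payload into an in-memory buffer.
io = IOBuffer()
ser = @timed serialize(io, payload)
nbytes = position(io)

# Time reading it back from the same buffer.
seekstart(io)
deser = @timed deserialize(io)

println(nbytes, " bytes; serialize: ", ser.time, " s; deserialize: ", deser.time, " s")
```

Subtracting these two numbers from a measured round trip gives a rough estimate of what is left for sending the data between processes.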

Edit: Interestingly, the timings seem to go up when using more workers. With four processes spawned I get:

julia> @time Oscar.parallel_all(a)
  0.007401 seconds (1.89 k allocations: 98.064 KiB)
(true, QQMPolyRingElem[y, 1, x])

julia> @time Oscar.parallel_all(a)
  0.007863 seconds (1.89 k allocations: 98.064 KiB)
(true, QQMPolyRingElem[y, 1, x])

julia> @time Oscar.parallel_all(a)
  0.007877 seconds (1.89 k allocations: 98.064 KiB)
(true, QQMPolyRingElem[y, 1, x])

julia> @time Oscar.parallel_all(a)
  0.007536 seconds (1.89 k allocations: 98.064 KiB)
(true, QQMPolyRingElem[y, 1, x])

julia> @time Oscar.parallel_all(a)
  0.008121 seconds (1.90 k allocations: 104.127 KiB)
(true, QQMPolyRingElem[y, 1, x])

julia> @time Oscar.parallel_all(a)
  0.007908 seconds (1.89 k allocations: 98.064 KiB)
(true, QQMPolyRingElem[y, 1, x])
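
For a quick comparison of the three configurations, the wall-clock times reported above can be averaged. The snippet below only copies the numbers from the `@time` outputs; the variable names are made up for illustration. With these values, the mean for one worker comes out at about 20 times the mean without an extra process, and the mean for four workers at about 67 times.

```julia
using Statistics

# Wall-clock times in seconds, copied from the @time outputs above.
local_only   = [0.000118, 0.000114, 0.000117, 0.000114]
one_worker   = [0.002743, 0.001832, 0.002679, 0.002704,
                0.001806, 0.001831, 0.001891, 0.003475]
four_workers = [0.007401, 0.007863, 0.007877, 0.007536, 0.008121, 0.007908]

# Print the mean in milliseconds and the slowdown relative to the local run.
for (label, t) in [("no extra process", local_only),
                   ("one worker", one_worker),
                   ("four workers", four_workers)]
    println(rpad(label, 18), "mean = ", round(mean(t) * 1e3; digits=3), " ms, ",
            "slowdown = ", round(mean(t) / mean(local_only); digits=1), "x")
end
```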

@antonydellavecchia
Collaborator

antonydellavecchia commented Jan 29, 2025

If you repeated the experiment with the same ring, then yes, it will be cached on all processes.
That means the messages are still sent, but they aren't unpacked on either side.
We don't yet have a way to send messages only when we know they are necessary.
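
As a rough illustration of "sent but not unpacked", here is a minimal sketch of an id-based cache on the receiving side. It is not the implementation in this PR: `SEEN`, `pack` and `receive` are hypothetical names, and Julia's `Serialization` standard library stands in for the actual serializer.

```julia
using Serialization, UUIDs

# Hypothetical receiver-side cache: objects already seen are looked up by id.
const SEEN = Dict{UUID,Any}()

# Pack an id and an object into one byte vector (the "message").
function pack(id::UUID, obj)
    io = IOBuffer()
    serialize(io, id)   # small header, always read by the receiver
    serialize(io, obj)  # body, only unpacked on a cache miss
    return take!(io)
end

# Read the id first; skip deserializing the body if the id is cached.
function receive(msg::Vector{UInt8})
    io = IOBuffer(msg)
    id = deserialize(io)::UUID
    haskey(SEEN, id) && return (SEEN[id], :cached)
    obj = deserialize(io)
    SEEN[id] = obj
    return (obj, :unpacked)
end

id = uuid4()
msg = pack(id, Dict("ring" => "QQ[x, y]"))   # stand-in for a parent ring
first_obj, first_status = receive(msg)       # cache miss: body is unpacked
second_obj, second_status = receive(msg)     # cache hit: body is skipped
println(first_status, " then ", second_status)
```

In this sketch the full message still travels every time, which matches the behaviour described above. Avoiding the send altogether would need the sender to know which ids the receiver already holds.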

4 participants