Do new pools "own" the current domain? #77
Thanks for bringing this up. The documentation attached to … The confusion arises from the fact that during … CC @Sudha247. |
I came here because I was bitten by the new API which I found confusing, and I was about to open an issue saying "hey maybe we should rename this parameter into …". The code I wrote would do

```ocaml
let num_domains = Domain.recommended_domain_count () in
let pool = Domainslib.setup_pool ~num_domains () in
(* ... *)
```

and @Sudha247 sharp-eyedly remarked that this was wrong. It may be wrong, but it is also very easy to write, which suggests that the current API is problematic. Can we do another round of brainstorming on how to propose a clear API here?

Remark 1: there is a fundamental difference between the status of pool domains ("worker domains") and the status of the "caller domain". Worker domains loop doing pool work forever. The caller domain wants to do the smallest amount of pool computation to get its result. It is correct to say that the tasks run on [num_domains + 1] domains, but the status of both sides of the addition is very different.

Remark 2: currently the documentation and implementation of get_num_domains are as follows:

```ocaml
val get_num_domains : pool -> int
(** [get_num_domains pool] returns the total number of domains in [pool]
    including the parent domain. *)
```

```ocaml
let get_num_domains pool =
  let pd = get_pool_data pool in
  Array.length pd.domains + 1
```

Short-term suggestions: …
Medium-term suggestion. Taking a step back, I think that the problem comes from the fact that Domainslib.Task mixes two things together: an abstraction for a pool of worker domains and an abstraction for asynchronous tasks. I think that the API could be reworked to separate the two: on one side, an asynchronous-task abstraction with explicit schedulers; on the other, a domain-pool abstraction with an explicit (concurrent) work-channel (the interface for other domains to send work to the domain pool). The connection between the two is that the library offers a task scheduler that expects a pool work-channel as a parameter. |
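For concreteness, a rough interface sketch of the separation described in this comment: a worker-pool abstraction with an explicit work-channel on one side, and a task scheduler parameterized by a submission function on the other. All names here are hypothetical and not part of Domainslib.

```ocaml
(* Hypothetical split of Domainslib.Task into two abstractions; illustrative only. *)
module type POOL = sig
  type t
  val setup : num_workers:int -> t
  (* The pool's work-channel: any domain can hand a thunk to the worker domains. *)
  val submit : t -> (unit -> unit) -> unit
  val teardown : t -> unit
end

module type SCHEDULER = sig
  type 'a promise
  val async : (unit -> 'a) -> 'a promise
  val await : 'a promise -> 'a
  (* [run ~submit f] evaluates [f] under a task handler that offloads ready
     tasks through [submit], e.g. a pool's work-channel. *)
  val run : submit:((unit -> unit) -> unit) -> (unit -> 'a) -> 'a
end
```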
(If others agree that the current state is not really satisfying, I think the present issue could be reopened.) |
I would just add a small nit here: IMHO it's easier to only expose an API that gives you a "maximum number of workers", and to not write code that expects to be a participating worker. As in: … |
I agree that it's a simpler model to think about. If we wanted to go this way (present this as the expected usage mode), we could probably also simplify the implementation of Task (which currently I find a bit hairy) to follow this model. On the other hand: I can see how to use Domainslib to build "parallel building blocks" ( |
And I think this is fine: if you can't afford to pay one IPC or whatever to do a parallel computation on N domains (where N is normally close to 8 in 2022), I'd say something is already wrong. If that's too expensive, then the computation should not be parallelized (it wouldn't pay off anyway). (Full disclaimer: I'm not using Domainslib, so I might be talking out of line here.) |
On an unrelated note: one benefit I would expect from rethinking the API a bit is that the Task operations ( |
As the "author" of this issue, I support rethinking the API. My original concern is really about the obvious inconsistencies (which are now fixed). Maybe |
There are also implementation issues arising from this API confusion. For example, currently if you share a pool between two independent domains and they both call |
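The call in question is elided above; a minimal sketch of the sharing scenario, assuming it is Task.run on the current Domainslib.Task API (everything else here is illustrative):

```ocaml
(* Two independent domains share one pool and both run computations on it.
   Because a caller inside Task.run participates in pool work while it waits,
   domain A may end up executing tasks that were submitted by domain B. *)
let () =
  let pool = Domainslib.Task.setup_pool ~num_domains:2 () in
  let job tag () =
    Domainslib.Task.run pool (fun () ->
      let promises =
        List.init 4 (fun i ->
          Domainslib.Task.async pool (fun () ->
            Printf.printf "task %s.%d ran on domain %d\n%!"
              tag i (Domain.self () :> int)))
      in
      List.iter (Domainslib.Task.await pool) promises)
  in
  let a = Domain.spawn (job "A") in
  let b = Domain.spawn (job "B") in
  Domain.join a;
  Domain.join b;
  Domainslib.Task.teardown_pool pool
```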
Well, I think "previous documentation" here might be misleading. It was not clear (to me) before #78 whether the calling domain (which can be different from the domain setting up the pool) participates at all, and thus I made the PR to document the existing behaviors more precisely. Perhaps you meant the name |
Thanks everyone for the comments! Reopening the issue to continue discussion. @favonia is right in that the previous documentation was inconsistent with the implementation. I can see how the current one could be confusing too. |
I found … Given the above, we should fix the buggy … I envision the task abstraction to be separate from the pool. You can submit tasks to the pool to be executed. I also agree with @gasche that, now that we have a top-level handler in …, I'm in support of revising the API. The types … |
What I understand is that this could lead to the somewhat surprising behavior @gasche observed: if a pool is assigned two tasks by two different domains, say Domains 1 and 2, then Domain 1 could work on the task assigned by Domain 2. As mentioned above by others, I think it's conceptually easier to let the waiting domain idle (at least after ocaml/ocaml#11589), at some cost of efficiency, treating the pool as some cloud service. This would also rule out the arguably confusing "0-sized pools" that could not run any task in this model and thus would be rejected during setup. By the way, I believe there are many other alternative designs for the argument …
|
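For illustration, a hypothetical "pool as a service" interface in the spirit of this comment, where the caller submits work and then blocks (idles) until the result is ready instead of participating in pool work. None of these names exist in Domainslib; the pool's submission function is assumed to hand thunks to worker domains.

```ocaml
(* Purely illustrative sketch of a service-style pool interface. *)
module Service = struct
  (* A pool is abstracted here as just its work-channel. *)
  type t = { submit_work : (unit -> unit) -> unit }

  type 'a ticket = {
    m : Mutex.t;
    c : Condition.t;
    mutable result : 'a option;
  }

  (* Hand the task to the pool and return a ticket for the result. *)
  let submit pool f =
    let t = { m = Mutex.create (); c = Condition.create (); result = None } in
    pool.submit_work (fun () ->
      let v = f () in
      Mutex.lock t.m;
      t.result <- Some v;
      Condition.signal t.c;
      Mutex.unlock t.m);
    t

  (* The calling domain idles here; it does not steal or run pool tasks. *)
  let wait t =
    Mutex.lock t.m;
    let rec loop () =
      match t.result with
      | Some v -> Mutex.unlock t.m; v
      | None -> Condition.wait t.c t.m; loop ()
    in
    loop ()
end
```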
I propose an API sketch in #92. It's not magically solving the problems, but I tried to separate the various concerns, and I believe that it would make things easier to think about. Note: it's interesting to wonder whether it is possible, in the middle of a Task computation, to run certain tasks on a dedicated pool. (This is the Tezos use-case of using separate pools for IO vs. crypto, when the two can be found in the same computation.) I'm not sure it's worth bending over backwards to support this use-case, but I realized after the fact that my API may support it by taking the not-so-intuitive step of nesting several Task handlers. |
This is sub-optimal and I am worried about this choice. Remember that idle domains don't just sit idle: they must participate in GC. So if I have 4 cores and I create a pool with 4 domains, with the initial domain sitting idle, then during a GC all 5 domains are expected to work on the GC. Given that the minor GC is a stop-the-world parallel GC, the cores will be overcommitted with 5 domains on 4 cores. This would lead to a slower minor GC. OTOH, if you only create the pool with 3 domains, then while the cores are not overcommitted during a GC, they will be under-committed during actual work. |
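The 4-core arithmetic above, expressed as pool sizing code; a sketch assuming the current API where ~num_domains counts only the extra worker domains.

```ocaml
let cores = 4

(* Idle-caller model: 4 workers + 1 idle caller = 5 domains on 4 cores,
   overcommitted during a stop-the-world minor GC. With only 3 workers the GC
   fits, but actual work is under-committed. *)
let idle_caller_pool = Domainslib.Task.setup_pool ~num_domains:cores ()

(* Participating-caller model (the current one): 3 workers + the calling
   domain = 4 domains on 4 cores for both GC and work. *)
let participating_caller_pool =
  Domainslib.Task.setup_pool ~num_domains:(cores - 1) ()
```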
This is somewhat of a tangent, but do you have a rough idea of the performance cost of this "over-committing during the minor GC" issue that you described? It sounds like something interesting to keep in mind when writing multi-domain OCaml programs, specific to the current runtime choice. What are the ballpark figures to keep in mind? (In general, can we quantify the cost of having more domains than cores for typical workloads?) (Should I open an issue somewhere else than here to track this question?) Note that I would expect the issue to matter only for pooled computations that are short relative to the time between two minor GCs. If the pool computation takes several minor GC cycles, then the forced-idle calling domain will not have allocated anything on the GC cycles after the first one, so its minor GC should return immediately and overcommitting should not have a cost, right? (Or maybe the domain would do "idle work" rather than doing nothing, and this is currently not overcommitting-friendly either?) |
When all the cores actively perform "work", we've observed a precipitous drop in performance when the cores are overcommitted. With an idle domain, I don't have a good sense. We may try to run some of the parallel benchmarks from Sandmark with an extra idle domain to quantify the costs. My expectation is that the hit will not be insignificant; the idle domain at least needs to scan its roots. Relatedly, I would also like any newly proposed API to not compromise parallel performance. I would like domainslib to be fast even if the API is hard to use. |
We could shortcut this part whenever the minor heap is completely empty. (But there is still other shared work to do, on the remembered set etc.; I don't have a sense of whether root-scanning is really the dominant cost for compute-intensive multi-domain programs.) I don't regret asking you this here, as I realize that I don't have any intuition for these aspects of the multicore runtime's performance characteristics. My own intuition would rather have been that overcommitting is generally fine, unless your workflow and hardware make you care a lot about memory/core affinity (say on manycore systems with a complex memory hierarchy). Will someone from the Multicore team write a blog/Discuss post to give ballpark figures on this question by the time of the 5.0 release? :-) :-) |
I just came to say I'm glad you asked; I had the same bias/intuition, and reading KC's response I realize I'm very likely wrong. |
I've also observed the slowdown introduced by idle domains, presumably due to the shared GC work, as noted by @kayceesrk. I tried to quantify it by benchmarking with idle domains present during the entire course of the program, with the idle domains doing no mutator work, as would be the case if the waiting domain were kept idle. I picked three examples for this: Game of Life, Minilight, and Binary Trees. For each program I measured GC activity and running time with no idle domain, one idle domain, and two idle domains (the measurement tables are elided here).

Idle domains introduce more overhead in programs with high GC activity; the effect is practically negligible when allocation is low, as observed in …
This can probably work. It will be interesting to see results with this if someone decides to hack on it. I recall @ctk21 also suggested something similar when we were discussing the idle-domain-introduced GC overheads. But would the waiting domain's minor heap be entirely empty in this case? IIUC, it may not be the case when something else was going on before the waiting domain reached the domainslib operations. |
Thanks! Be careful in interpreting the GC statistics above: I think the GC-statistics code is currently buggy. In fact, many of the stats you show contain the bug that … When you say "idle domains", what does the idle domain actually do? (Is it blocked on a lock, or looping with cpu_relax, or something else?) Do you have your test code somewhere? Also, what OS are you using to run these experiments? (I'll upload what I was playing with shortly, in case it helps. My idle domains were blocked on a lock, and I didn't observe a notable slowdown. On the other hand, I measured around a 2x slowdown when moving from … |
I pushed my little benchmark at https://gitlab.com/gasche/idle-domains-bench. On my Linux machine with "4 cores but 8 threads" (…), I get a table of results (elided here) where each line represents a series of runs with the same number of worker domains doing the same computational work (the Knuth-Bendix benchmark of the compiler testsuite). For example, the … The "x1.37" number reported corresponds to the ratio between the actual runtime of the run and the "expected time" for the run. The "expected time" is the same time as the 1-worker 0-idle run (the "reference run") for all tests up to 8 worker domains, and twice that time for all runs with strictly more than 8 worker domains (because at least one hardware thread needs to run two domains to completion). What I see in the benchmarks is a stark performance cliff when we move from 8 worker domains (my recommended-domain-count) to 9 worker domains, as predicted (from 8s to 26s). On the other hand, idle domains appear to have very little cost; even having 10 idle domains is barely noticeable. My guess is that the Linux thread scheduler does an excellent job with idle domains (really their backup thread), scheduling them often enough that they do not make the STW barrier much longer. I don't know why the results are so different from @Sudha247's; it could be a mistake in the benchmarking code, or a difference due to the workload of the worker domains, or a difference in the schedulers of our operating systems. |
My reasoning is that if the minor heap is not too large and minor collection runs often enough (which is not necessarily how your OCaml application is configured), and the task itself takes a long time to compute, then a blocked domain would see one non-empty minor collection and then many empty minor collections, so it would benefit from an empty-minor-heap optimization. |
Thanks for the numbers. They are interesting to read. @Sudha247 did you run them on the "tuned" machines? If so, these machines change the scheduling policy in order to schedule the programs on isolated cores. This may be the reason for the increased costs. We use
I assume you are running with the rest of the isolated cores (24 in total) available for the idle domain. Hence, it is odd that a 1-core program with 1 idle domain would run so much slower. @gasche, the numbers look good. We haven't clearly measured or optimised for idle domains while working on the multicore GC. Also, Simon Marlow mentioned at ICFP that they had to put in a bunch of work in GHC to address the problem that the program may have more HECs (GHC's equivalent of domains) than available cores. This can happen when, as you have mentioned, Firefox is running alongside the program. There are probably some lessons that we can borrow from GHC on this. With idle domains, what I am uneasy about is that we've recommended to our users that the number of domains spawned must be equal to the number of available cores. This recommendation has influenced how we think about the runtime system design as well as the design of domainslib. I don't have a specific example in mind here, but I'm sort of thinking that this assumption will have influenced the design. We're proposing to break this recommendation by having idle domains. I'm happy to be proved wrong and shown that there is nothing to worry about with idle domains. But it feels odd that we're choosing to take a (possibly only small) runtime overhead because the API is nicer with idle domains. |
Just to be clear, I don't want to insist on having idle domains, and I think that your idea of having the API that maximizes performance, possibly at the cost of some complication, is sound. (I also wouldn't assume that other OSes have schedulers that are also good with idle domains; the assumption that they don't cost too much may not be portable.) But I was curious to quantify the cost to understand what is going on -- and it's very interesting to see two different benchmarks showing such different results. |
We should confirm that @Sudha247's results are sound. With our benchmarking machines, we're doing what one may expect a parallel scientific job to do: isolate cores, allocate 1 domain per isolated core, and use round-robin scheduling. Under the assumption that the cores are not overcommitted, the scheduler does not have to do any work. This method was able to get 5-20% more performance than running on non-isolated cores on an otherwise idle machine. The performance also remains robust. In past experiments, I've struggled to get robust results with overcommitted and non-isolated cores when some domains are doing a little work (they're not idle). I should also say that this experiment was not done systematically, and hence, if there is interest, we should do it properly. |
I am thinking that I could post my benchmark sources on Discuss and encourage interested people to run their own experiments. (We would get 50 results of 50 shoddy experiments, instead of one well-designed experiment, but it's better than my one shoddy experiment at least :-) What do you think?
I'm interested in understanding how to write code that does "a little work"; it's not obvious. I have tried replacing my waiting-on-a-lock idle domains with looping-on-cpu_relax idle domains, and the results are extremely similar. I think that what you have in mind is closer to programs that don't max out the CPU (they spend enough time blocked in syscalls) and don't allocate too heavily, but still do a bit of both. I might try to write something like that. |
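For reference, sketches of the two kinds of idle domain mentioned here; these are reconstructions for illustration, not the actual code from the linked benchmark repository.

```ocaml
(* Shared state for both variants; [idle_run] mirrors the flag used in the
   snippet in the next comment. *)
let idle_run = Atomic.make true
let m = Mutex.create ()
let c = Condition.create ()

(* Variant 1: the idle domain blocks on a lock, waking up only on (spurious or
   shutdown) signals to re-check the flag. *)
let idle_blocked () =
  Mutex.lock m;
  while Atomic.get idle_run do
    Condition.wait c m
  done;
  Mutex.unlock m

(* Variant 2: the idle domain loops on Domain.cpu_relax. *)
let idle_spinning () =
  while Atomic.get idle_run do
    Domain.cpu_relax ()
  done

(* Either variant would be started with [Domain.spawn idle_blocked] (or
   [idle_spinning]) and stopped by setting [idle_run] to false, signalling [c]
   for the blocked variant. *)
```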
Here are the results of my benchmarks with an implementation of "idle domains" that still does a bit of work. Idle domain code:

```ocaml
(* Mostly-idle loop: allocate a small short-lived list (kept alive through
   Sys.opaque_identity so the allocation is not optimised away), then sleep
   for 10µs. *)
while Atomic.get idle_run do
  List.init 10 (fun _ -> ())
  |> Sys.opaque_identity
  |> ignore;
  Unix.sleepf 1E-5
done
```

The results are worse than with waiting-on-a-lock idle domains -- we typically see around a 20% slowdown caused by the presence of a mostly-idle domain -- but still not that bad. In particular, having more domains than hardware threads is much less costly when some of those domains are idle than when they are heavy-work domains. |
It may be better to do rigorous benchmarking in a controlled setting than to let the community run their own experiments. At the end of the day, one has to collect the results and interpret them, which would be much easier if the experiments were run in a controlled setting. Taking a step back, I believe that moving the API forward and doing the performance evaluation in parallel may not be ideal. In the other issue (#92), there are several useful improvements over the current API that do not impact performance. I would be interested to get those merged first before getting too deep into the performance results. Moreover, there are known issues with GC scheduling (ocaml/ocaml#11589) which will affect any of the experiments we do with idle domains. I would prefer to wait until at least 11589 is addressed before we spend time on the experiments. |
IIUC the question was about when we were spawning more domains than …
Right.. It was just to illustrate the difference in allocation/GC activity. I've run into some of them myself - not sure if it's the expected behaviour, but I think every domain used to increment minor GC count separately (this might be fixed now, or the expected behaviour, I haven't checked yet).
Okay, I'll admit I did the easiest possible thing and created a new pool to act as the extra domain(s). It waits on a condition variable to receive tasks, and I believe it doesn't do much work itself (it only sets up a worker). This was run on an Intel Xeon Gold machine with 28 cores, on Ubuntu 20.04 (5.4.0-125-generic kernel). I just wanted to clarify some things for reference; overall I agree with KC that spending time on this makes sense after the GC issues are addressed. |
I could not tell whether a newly created pool "owns" the current domain. In one of the test files, there are consecutive calls to setup_pool:

domainslib/test/test_task.ml, lines 40 to 41 in 15f04f3

Therefore, it seems okay to have multiple pools sharing the same domain (that is, the current domain). However, the documentation gave me the impression that I have to manually spawn new domains before setting up pools to avoid overlapping:

domainslib/lib/task.mli, lines 10 to 13 in 15f04f3

I wonder what the intended semantics is. What is the ownership relation between pools and domains? This might be related to #55.
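For concreteness, a minimal sketch of the pattern in question, assuming the current Domainslib.Task API: two pools set up from the same (current) domain. This is an illustration, not the elided snippet from test_task.ml.

```ocaml
let () =
  let open Domainslib in
  (* Both pools are set up from the current domain; the question is whether
     either of them "owns" it in the sense of reserving it exclusively. *)
  let pool1 = Task.setup_pool ~num_domains:2 () in
  let pool2 = Task.setup_pool ~num_domains:2 () in
  let r1 = Task.run pool1 (fun () -> Task.await pool1 (Task.async pool1 (fun () -> 1 + 1))) in
  let r2 = Task.run pool2 (fun () -> Task.await pool2 (Task.async pool2 (fun () -> 2 * 2))) in
  Printf.printf "%d %d\n" r1 r2;
  Task.teardown_pool pool1;
  Task.teardown_pool pool2
```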