wip: use distributed.Queue as storage for metadata poc #305
base: master
Conversation
cc @marco-neumann-jdas @pacman82 in case you want to take a look at this
What kind of resilience do you get here? Against which exact failure scenario does this change help?
c = get_client()  # noqa
metadata_store = metadata_storage_factory()
for _mp in mp.metapartitions:
    metadata_store.put(_mp)
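For context, a minimal sketch of what a Queue-backed store behind metadata_storage_factory could look like; the class, the queue name and the factory body are illustrative assumptions, not code from this PR:

from distributed import Queue, get_client

class QueueMetadataStore:
    """Hypothetical store that pushes metapartition metadata into a
    distributed.Queue living on the scheduler."""

    def __init__(self, name="ktk-metadata"):
        # All tasks that construct this store with the same name talk
        # to the same scheduler-side queue.
        self._queue = Queue(name=name, client=get_client())

    def put(self, metapartition):
        # Ship only the (small) metadata object, not the payload data.
        self._queue.put(metapartition)

def metadata_storage_factory():
    return QueueMetadataStore()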
this task might get retried. Do you do any de-duplication of the results in the end?
good point
- prevent duplicates
This would help when a worker that has stored the data dies while still holding the metadata, which would otherwise imply re-computing all dependencies for that data. Since the metadata is quite small, we can just move all metadata objects to a central place (i.e. the scheduler) to avoid having to re-compute the graph after the data has been stored.
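Under that assumption, the collection side could look roughly like this; the scheduler address and queue name are made up. Because the queue lives on the scheduler, the client can drain the metadata after the store tasks have finished, even if the worker that produced an item has died in the meantime:

from distributed import Client, Queue

client = Client("tcp://scheduler:8786")  # address is illustrative

# ... submit and run the store tasks here ...

queue = Queue(name="ktk-metadata", client=client)

# Drain everything the workers pushed; the items are held on the
# scheduler, so no payload task needs to be recomputed to recover them.
metapartitions = [queue.get() for _ in range(queue.qsize())]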
How does that work then? Dask doesn't know that you're bypassing its dependency system, so it will re-trigger the computation. And you can't safely abort that and run some kind of manual collect+store operation without looking very carefully at the scheduler dashboard, because you cannot know that all dependencies were executed at least once. If the cluster starts "flapping", you don't know which parts of the dependency chain are being recomputed because they got lost and which parts were never computed in the first place. And every manual guess makes very strong assumptions about the scheduling logic (which IIRC is an implementation detail of dask/distributed) and risks data loss.
I remember this coming up when I started experimenting with this, but I guess I didn't address it properly here. I'll have a look when I have some spare time. Appreciate the feedback.
This could only be addressed by letting every task check whether its key/result was already logged on the scheduler. That way the task would still be rescheduled, but it would be a "no-op". We may also raise this as an issue in distributed; if this were nicely integrated into the scheduler, that would be much better, of course. I can see various arguments pro/con, so this will probably trigger a small debate. There are tasks in dask/distributed itself which would benefit, though; a prime example is dask's own parquet storage system. OTOH, this topic pops up occasionally, and the "dask way" of dealing with it would probably be replicating results across the cluster, but that doesn't work properly at the moment.
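As a rough illustration of such a no-op guard, one could use the scheduler-side key/value store that distributed already exposes via Client.set_metadata / Client.get_metadata; the function and key layout are made up, and there is still a small race window between check and write:

from distributed import get_client

def store_partition(mp, metadata_store):
    client = get_client()
    marker = ["ktk-stored", mp.label]  # hypothetical key layout

    # A retry of this task finds the marker on the scheduler and
    # becomes a no-op instead of producing a duplicate entry.
    if client.get_metadata(marker, default=None) is not None:
        return mp.label

    metadata_store.put(mp)
    client.set_metadata(marker, True)
    return mp.label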
Instead of adding more complexity to the payload code with store interactions, retries, no-op checks, etc., could we rather implement this as some kind of graph rewrite?
I guess that may be the "cleanest" approach given the current state of distributed, and it probably shouldn't be too difficult. Ideally, though, I would like to see this kind of functionality properly supported in distributed, as @fjetter mentions.
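For illustration, a graph rewrite in that spirit could wrap the store tasks at graph-construction time and leave the payload functions untouched; the helper names and the queue name below are assumptions, not an existing API:

from distributed import Queue, get_client

def _with_metadata_sideput(func, queue_name="ktk-metadata"):
    # The wrapper behaves like the original store function but also
    # pushes the resulting metapartition to the scheduler-side queue.
    def wrapper(*args, **kwargs):
        mp = func(*args, **kwargs)
        Queue(name=queue_name, client=get_client()).put(mp)
        return mp

    return wrapper

def rewrite_store_tasks(dsk, store_task_keys):
    # Return a copy of a plain dask graph ({key: (func, *args)}) in
    # which every store task also logs its metadata.
    new_dsk = dict(dsk)
    for key in store_task_keys:
        func, *args = dsk[key]
        new_dsk[key] = (_with_metadata_sideput(func), *args)
    return new_dsk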
With the intention of improving resilience, this PoC uses a distributed.Queue as storage for the metadata, keeping it in a central place instead of on the workers.
cc @fjetter