Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Asyncify the Tree Building #975

Open
wants to merge 13 commits into
base: master
Choose a base branch
from
Open

Asyncify the Tree Building #975

wants to merge 13 commits into from

Conversation

benbierens
Copy link
Contributor

Investigation into node stability shows that tree-building is blocking the main execution loop for long periods of time in some use-cases. This PR breaks up the execution time by asyncifying the tree-building call chain.

@benbierens benbierens added the Client See https://miro.com/app/board/uXjVNZ03E-c=/ for details label Oct 28, 2024
@benbierens benbierens self-assigned this Oct 28, 2024
@benbierens
Copy link
Contributor Author

Image from this branch passes upload/download dist-tests.
This branch + async-profile with defect-assert on segmentTimes larger than 100 milliseconds, passes upload/download dist-tests: AKA: There are no operations in the basic upload/download flow with blocking execution times longer than 100 milliseconds.

This means that the cause of failing downloads cannot be explained by long-blocking operations.
Marketplace-flow with async-profile for detection of long-blocking operations is still pending.

@benbierens benbierens marked this pull request as ready for review October 30, 2024 14:11
Copy link
Member

@gmega gmega left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this looks fine, with the caveats that:

  1. it's hackish, and IMHO we should revert it once we can run those easily on separate threads, which would be the "proper" solution;
  2. I am not so sure about the new/init API. It doesn't seem to comply with the expected semantics;
  3. we're doing some behavior-changing transformations by catching the exceptions and returning results.

Would be wise to get another pair of eyes on this as we've discussed earlier today, I'm nevertheless approving it.

let
mhash = ? mcodec.mhash()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What the heck did this = ? even do?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 OK this raises an exception, so this change is actually modifying behavior. I suppose this is OK if it doesn't break expectations at callers, but we could simply preserve this behavior and propagate the exception using a Chronos raises clause. I see you've done this change in several places, and I'm assuming you did the due diligence to ensure this doesn't cause surprises, so I won't block your PR for this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could simply preserve this behavior and propagate the exception using a Chronos raises clause

That would actually change the behaviour. = ? will either assign a value to mhash or it will return a Result with an error to the closure. By raising an exception, and specifying that exception type in {.async: (raises: [...]).}, we're not returning a Result, but raising an exception.

Imo, the proposed change here does not alter the behaviour.

@@ -305,6 +304,29 @@ proc buildManifest*[T, H](self: SlotsBuilder[T, H]): Future[?!Manifest] {.async.
self.cellSize,
self.strategy.strategyType)

proc init*[T, H](self: SlotsBuilder[T, H]): Future[?!void] {.async.} =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Hm... I think the fact that you need to call new and then init is not expected. According to the Status style guide and to some of our own code (e.g. here), init is supposed to return a constructed object (and not void), so it'd actually replace new in this case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that an init call can be confusing when not used for construction. In general, Codex follows the convention of using init for constructing object types and new for constructing ref types, as per @dryajov's comment: https://discord.com/channels/895609329053474826/895622391349272637/1086322836001534072

Maybe an easy fix is to call this initialise instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of this seems like it was replacing let x = ? someResultFunc()... with without x =? someResultFunc(), which doesn't seem necessary to me. In other words, it feels like we should make the switch to async without changing any other semantics if possible. It's easy to get into trouble modifying lines "while we're here" as there can be unintended consequences, especially in nim and especially with this many changes in a PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't have changed them if I didn't have to. = ? doesn't compile in async procs.

@@ -91,6 +91,8 @@ asyncchecksuite "Test Node - Host contracts":

protected = (await erasure.encode(manifest, 3, 2)).tryGet()
builder = Poseidon2Builder.new(localStore, protected).tryGet()
(await builder.init()).tryGet()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... this looks quite weird (doing a get on a void) but OK, I don't have better ideas. :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's odd... but it's a way to force an exception in case there is no value.

Another way (not sure if it's "better"):

if (await builder.init()).isErr: fail()

But this doesn't give much output if it does fail, so an even more explicit way is:

if error =? (await builder.init()).errorOption:
  echo error.msg
  fail()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(await ...).tryGet() calls are all over the testing code. It fails the test when the result is an error. How would you like to see this changed?

Copy link
Contributor

@emizzle emizzle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've hit some significant issues using waitFor in the past, and we need to avoid it like the plague (cc @gmega). I outlined a pattern that might help avoid the usage.

@@ -145,16 +145,18 @@ func compress*(
mhash.coder(@x & @y & @[ key.byte ], digest)
success digest

func init*(
proc init*(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might need to avoid making this init async so we can avoid using waitFor later on... instead this is a good pattern to follow:
https://github.com/codex-storage/nim-ethers/blob/80b2ead97ce32cca74cef4a4501ab97106cb578b/ethers/providers/jsonrpc.nim#L50-L86

Then later, we can use CodexTree.init synchronously, and when needed, we can let layers = await tree.layers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this pattern. If I applied it, SlotBuilder verifyTree and verifyRoot funcs would have to become async procs to await for the internal field. This would propagate the asycing through the codebase further. Is this desirable? Or is there a way around this?

Comment on lines +196 to +203
without mhash =? leaves[0].mhash.mapFailure, error:
return failure error

var hashes = newSeq[seq[byte]]()
for leaf in leaves:
without hash =? leaf.mhash.mapFailure, error:
return failure error
hashes.add(hash.digestBytes)
Copy link
Contributor

@emizzle emizzle Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this really need a rewrite? In other words, was there a compilation issue that forced this re-write?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't figure out a cleaner way to unroll the results. If you've got one, I'm eager to learn.

Copy link
Contributor

@emizzle emizzle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove my requested changes as I realise the call sites using waitFor are in the tests only. However, that is not to say that this pattern of using async init won't bite us later on. I just don't want to stand in the way of progress.

My overall comment for these changes is that because we are making so many changes here and it feels like we're trying to rush this in, we really should limit the scope of changes by changing as little as possible. If i understood correctly, the main idea of this PR was the make some things asynchronous. However, we've also replaced let x = ? someResultFunc()... with without x =? someResultFunc() in quite a few places. This is one of those things that seems innocuous, but can also bite us later on (especially in nim).

I am also not a big fan of "let's do this for now and revert later" kind of changes, mostly because there may be patches affecting this code in the meantime which are hard to merge, and also sometimes reversion are forgotten or partially done.

@emizzle emizzle self-requested a review November 4, 2024 06:48
@emizzle emizzle self-requested a review November 4, 2024 06:49
@emizzle
Copy link
Contributor

emizzle commented Nov 4, 2024

Ok, so I didn't approve these changes. I merely wanted to remove my changes requested 🤦

@benbierens
Copy link
Contributor Author

All right. Let's put a break on this one. Here's why:
My plan with this PR was never to create a quick-and-dirt solution intended to be a temporary fix. I realized that threading might come along and make these change unnecessary. (I'm not entirely convinced of this.) But I was never counting on that. If you guys find these changes hacky and only suitable for temporarily hiding an issue, then let's NOT merge this. I want to fix this correctly, as if it's going to be in production like this indefinitely.

The change I made is to make the tree-building async by adding a small async-sleep in the loop so the process can be interrupted for, for example, the handling of network traffic. That's it. All other changes I made are required downstream changes to make stuff compile and work again. Yes, there are a lot of those. There's not much we can do about that unless we are considering restructuring a lot more code, which is the opposite of making small changes.

The new init method is required because we don't have async new methods. The replacing of = ? with without is needed in methods that have become async.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Client See https://miro.com/app/board/uXjVNZ03E-c=/ for details
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants