Remote entity reservation v9 #18670
Conversation
free and pending are now more distinct
cart left a comment
I'm on board for this! I think this approach is as close as we'll get to an ideal solve, given the constraints:
- keep world-access allocations as fast as possible while also allowing parallel allocations
- reuse entities across use cases
- don't require each parallel case to have its own reserved blocks of entities
- ideally no async
The implementation is clever! I'm going to run the benchmarks locally as one final sanity check. It seems like the entity_allocator_batched_and_freed / entity_allocator_fresh benches haven't been upstreamed. Can you add those to this PR so I can run them?
The bit layout is now explained and defended.
Heads up that we might be able to improve this with batching free calls without much trouble, but that's something to try later.
That's as expected. This doesn't really test the
Yeah, the new benchmarks would shed some light on the situation. My original benchmarks were just quick and dirty tests, and I don't have those anymore. I just kept them around long enough to test the theory. If you want to double check the perf change when the
Which route would you like me to take here? I'm assuming I should do the benchmark changes in a different PR first.

I prefer the second, more realistic and robust option. Do merge the benchmarks in a separate PR first though to give us a baseline!

Yeah, I think benching similar scenarios for spawning (including some components) is also useful, but it would make it harder to zoom in on what we care about here, so I'm content with skipping them until later (unless you are feeling inclined to do them now too).

Also we should really rename
…22639)

# Objective

As per [this](#18670 (comment)) comment on #18670, this PR attempts to make entity-related benchmarks more realistic by warming up the entity allocator. This helps test the freelist in the entity allocator.

## Solution

This PR introduces a new `WorldBuilder` type that starts with `World::new`, allows configuration options for warming up the world via the builder pattern, and then builds the warmed-up, realistic world. For now, the only available "warm up" is for entities, but we could also add functionality in the future to cache bundle info, pre-create tables, etc., to make our benchmarks more realistic; that is, to more closely match the performance of a running app rather than an app at startup.

The current implementation for entity warm-ups allocates some entities and frees them in a random order. It also spawns the highest allocated entity index to prepare `Entities`'s location storage, etc. This involves using `rng` (deterministically); without it, the entities would be allocated in linear index order (0, 1, 2, ...), which is unrealistic and extremely cache friendly (so it would likely skew performance in a way we don't want in a benchmark).

The major downsides are that the benches take a little longer to run now and that startup/caching time is no longer benchmarked. For example, benchmarking despawning only one entity used to tell us something about the performance of allocating the free list (amongst other one-time actions). Now, that information is lost since the world is already warmed up. In practice, for N values of entities, a higher N used to show the performance of the operation, and a lower N showed the performance of the operation plus any registration/caching costs. Now, the different N values only tell us how well the CPU accommodates a batch of the operation.

Currently in Bevy, a change might make the `...1_entity` benches much worse but the `...1000_entities` benches much, much better because the change added some new caching. The inverse is also common. With this PR, that will no longer be the case, at least for entities and whatever else we add to the `WorldBuilder` in the future. That change may or may not be desirable.

## Testing

Ran a sampling of the benchmarks.
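The warm-up described above can be sketched in miniature. Everything below is illustrative: `ToyAllocator`, `warm_up`, and the tiny LCG-driven shuffle are stand-ins for the real `WorldBuilder`, allocator, and seeded rng, not Bevy's actual API.

```rust
// Toy sketch of the entity warm-up idea; `ToyAllocator` stands in for
// Bevy's real allocator, and a small LCG stands in for the seeded rng.
struct ToyAllocator {
    next: u64,
    free: Vec<u64>,
}

impl ToyAllocator {
    fn new() -> Self {
        Self { next: 0, free: Vec::new() }
    }

    /// Reuse a freed id if one exists, otherwise hand out a fresh index.
    fn alloc(&mut self) -> u64 {
        self.free.pop().unwrap_or_else(|| {
            let id = self.next;
            self.next += 1;
            id
        })
    }

    fn free(&mut self, id: u64) {
        self.free.push(id);
    }
}

/// Warm up: allocate `n` ids, then free them in a shuffled (but
/// deterministic) order so reuse is not perfectly linear/cache friendly.
fn warm_up(alloc: &mut ToyAllocator, n: u64, seed: u64) {
    let mut ids: Vec<u64> = (0..n).map(|_| alloc.alloc()).collect();
    // Fisher-Yates shuffle driven by a tiny LCG for determinism.
    let mut state = seed;
    let mut rand = move || {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        state
    };
    for i in (1..ids.len()).rev() {
        let j = (rand() % (i as u64 + 1)) as usize;
        ids.swap(i, j);
    }
    for id in ids {
        alloc.free(id);
    }
}

fn main() {
    let mut a = ToyAllocator::new();
    warm_up(&mut a, 8, 42);
    // Reused ids now come back in a non-linear order.
    let first = a.alloc();
    println!("first reused id: {first}");
}
```

The same seed always produces the same free-list order, which keeps benchmark runs comparable while avoiding the unrealistically linear 0, 1, 2, ... reuse pattern.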
Alrighty, here are the results on my machine (filtered to benches with "spawn" or "entity_allocator" in the name)! The "free" op (and therefore despawn) has definitely taken the biggest hit, but not to a worrying degree relative to what we gain. And in the context of a "full" entity despawn (not just the free), the regression is relatively tame.
# Objective

As per [this](#18670 (comment)) comment on #18670, this adds benchmarks for direct access to the entity allocator.

## Solution

Add 5 groups of benchmarks:

- allocating fresh entities
- allocating fresh entities in bulk
- freeing entities
- allocating reused entities
- allocating reused entities in bulk

## Testing

- CI and benches
# Objective

#18670 introduced a small but fatal issue with batch spawning. Previous tests did not catch this bug because they tested `alloc` and `alloc_many` separately.

## Solution

- Fix an off-by-one math mistake in `FreeBufferIterator::next`.
- Fix the part where you need to `ptr.add(index)` if you want the element at `index`. Whoops.
- Add a test to catch these issues next time.

## Testing

- CI
- One new test
- This was originally found [here](https://discord.com/channels/691052431525675048/749335865876021248/1464359124950319114); the reproduction crashes on main but is fine on this branch.
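The `ptr.add(index)` mistake described here is the classic raw-pointer indexing pitfall. A minimal illustration (not the actual `FreeBufferIterator` code):

```rust
fn main() {
    let data: [u64; 4] = [10, 20, 30, 40];
    let base = data.as_ptr();
    let index = 2;

    // Wrong: reading through `base` alone always yields element 0,
    // regardless of which index you meant to read.
    let wrong = unsafe { *base };

    // Right: offset the pointer by `index` elements first.
    let right = unsafe { *base.add(index) };

    assert_eq!(wrong, 10);
    assert_eq!(right, 30);
    println!("wrong = {wrong}, right = {right}");
}
```

In safe code the slice indexing operator does this arithmetic (plus a bounds check) for you; with raw pointers both the offset and the bounds check are on you, which is how bugs like this slip past tests that only exercise one code path.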

fixes #18003
# Objective
It's version 9 of the same objective lol. For assets as entities, we need entities to be able to be reserved from any thread, ideally without depending on an async context, blocking, or waiting. Any of those compromises could hurt asset performance or break the completely non-blocking nature of the asset system.
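The "reserve from any thread, no blocking" requirement boils down to allocating through a shared reference. A minimal sketch of that idea (illustrative only; `ToyRemoteAllocator` is not Bevy's actual type, and the real allocator also handles a free list) uses an atomic cursor:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Illustrative stand-in: fresh-index allocation via `&self`, so any
// thread can reserve an entity id without locks, async, or blocking.
struct ToyRemoteAllocator {
    next_index: AtomicU64,
}

impl ToyRemoteAllocator {
    fn new() -> Self {
        Self { next_index: AtomicU64::new(0) }
    }

    /// Callable from any thread at any time; never blocks.
    fn remote_alloc(&self) -> u64 {
        self.next_index.fetch_add(1, Ordering::Relaxed)
    }
}

fn main() {
    let alloc = Arc::new(ToyRemoteAllocator::new());
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let a = Arc::clone(&alloc);
            thread::spawn(move || (0..100).map(|_| a.remote_alloc()).collect::<Vec<_>>())
        })
        .collect();
    let mut all: Vec<u64> = handles
        .into_iter()
        .flat_map(|h| h.join().unwrap())
        .collect();
    all.sort();
    all.dedup();
    // `fetch_add` hands out each index exactly once across all threads.
    assert_eq!(all.len(), 400);
    println!("unique ids handed out: {}", all.len());
}
```

The hard part the PR actually solves is not this fresh-index path but doing the same trick for the *reuse* path, i.e. popping from a free list through `&self`.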
As a bonus, this PR makes allocating entities only need `&Entities` instead of `&mut`. `Entities::flush` is now completely optional, meaning none of the `Entities` methods depends on the flush at all, and there is protection against flushing an entity twice.

(If you're curious, v9 actually branched from v8. v8 was focused on #18577 (never flush entities), but this version still includes flushing.)
## Solution
Organizationally, I split off the underlying `EntityAllocator` from `Entities`. This makes it easier to read, etc., now that it's more involved.

The basic problem is that we need to be able to allocate an entity from any thread at any time. We also need to be able to free an entity. So at the allocator level, that's 3 operations: `free`, `alloc` (for when you know `free` isn't being called), and `remote_alloc` (can be called any time). None of these can require mutable access.

The biggest challenge is having a list of entities that are free and waiting to be re-used. The list needs to be fully functional without mutable access, needs to be resizable, and needs to be pinned in memory. I ended up using a strategy similar to `SplitVec`. That dependency requires `std`, and knowing the max capacity ahead of time lets us simplify the implementation, so I made my own implementation here.
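The `SplitVec`-style layout can be sketched from scratch. This toy (not the PR's actual buffer, which also has to be lock-free) shows the core property: the buffer grows by appending doubling-size chunks, so existing elements never move in memory and the index math stays cheap.

```rust
// Toy pinned growable buffer: elements live in power-of-two sized
// chunks; growing allocates a new chunk and never relocates old
// elements, so pointers/references into the buffer stay valid.
struct ChunkedBuffer<T> {
    chunks: Vec<Box<[Option<T>]>>,
    len: usize,
}

impl<T> ChunkedBuffer<T> {
    fn new() -> Self {
        Self { chunks: Vec::new(), len: 0 }
    }

    // Chunk sizes are 1, 2, 4, 8, ...; global index i lives in chunk
    // floor(log2(i + 1)) at offset i + 1 - 2^chunk.
    fn locate(i: usize) -> (usize, usize) {
        let chunk = (usize::BITS - 1 - (i + 1).leading_zeros()) as usize;
        let offset = i + 1 - (1 << chunk);
        (chunk, offset)
    }

    fn push(&mut self, value: T) {
        let (chunk, offset) = Self::locate(self.len);
        if chunk == self.chunks.len() {
            // Grow by one chunk of size 2^chunk; nothing is moved.
            self.chunks.push((0..1usize << chunk).map(|_| None).collect());
        }
        self.chunks[chunk][offset] = Some(value);
        self.len += 1;
    }

    fn get(&self, i: usize) -> Option<&T> {
        if i >= self.len {
            return None;
        }
        let (chunk, offset) = Self::locate(i);
        self.chunks[chunk][offset].as_ref()
    }
}

fn main() {
    let mut buf = ChunkedBuffer::new();
    for i in 0..10u64 {
        buf.push(i * i);
    }
    assert_eq!(buf.get(7), Some(&49));
    println!("buf[7] = {:?}", buf.get(7));
}
```

Because old chunks are never reallocated, a concurrent version of this layout can let readers hold references while another thread appends a new chunk; and since the PR knows the max capacity up front, the chunk table itself can be a fixed-size array rather than a `Vec`.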
## Testing

No new tests right now. It might be worth using loom at some point, but that requires an additional dependency, test-specific loom feature flags, and giving this treatment to multiple crates, especially bevy_platform.
## Future work
#18577 is still a good end game here IMO. Ultimately (just like @maniwani said would happen), I decided that doing this all at once would be both too challenging and add too much complexity. However, v9 makes "never flush" much, much more approachable for the future. The biggest issues I ran into were that lots of places hold a reference to an entity's `Archetype` (but the entity now might not have an archetype) and that checking archetypes everywhere might actually be less performant than flushing. Maybe.

We can also potentially speed up a lot of different processes now that `alloc` can be called without mutable access and `free` (etc.) can be called without needing to `flush` first.

## Costs
### Benchmarks
Interpreting benches:
In most places, v9 is on par with or even faster than main. Some notable exceptions are the "sized_commands" and "fake_commands" sections; the regression there is purely due to `Entities::flush` being slower, and we make up for that elsewhere. These commands don't actually do anything, so this is not relevant to actual use cases; the benchmarks just exist to stress test `CommandQueue`.

The only place where v9 causes a significant and real-world applicable regression is "spawn_commands", where v9 is roughly 15% slower than main. This is something that can be changed later now that `alloc` doesn't need mutable access. I expect we can turn this 15% regression into a 15% improvement, given that "spawn_world" is roughly 20% faster on v9 than on main. Users who need really fast spawn commands are already using some form of batch spawning or direct world access anyway.

Other regressions seem to be either minimal, unrealistic, easily corrected in the future, or wrong. I feel confident saying "wrong" since running them back to back can sometimes yield different results. I'm on an M2 Max, so work might be jumping between performance and efficiency cores or something. (I look forward to standardized benchmarking hardware.)
Wins: I was worried that, without "never flush", this would be an overall regression, but I am relieved that that is not the case. Some very common operations, "insert_simple/unbatched" for example, are way faster on this branch than on main. Basically, on main, `alloc` also adds `EntityMeta` for the entity immediately, but on this branch, we only do so in `set`. That seems to improve temporal cache locality and leads to a roughly 220% improvement here. "added_archetype" sees 20%-80% improvements too, etc., and "iter_simple/foreach_wide" sees a 270% improvement.

I think in practice, v9 will outperform main for real-world schedules. And I think moving towards "never flush" (even for only a few operations, like `Commands::spawn`) will improve performance even more.