Conversation


@ghost ghost commented Sep 8, 2024

Reworking the recompiler's JIT block accesses to be more flexible so that we can bake more runtime info into them. This way runtime checks can be avoided in the generated code, which makes inlining the implementations more attractive since we don't need to generate calls to e.g. kernel mode checks for every 64-bit op. This is expected to eventually increase the recompiled code's performance.

I expect these changes to have a small negative performance impact but haven't measured it.

Commit messages:

We'd like the recompiler to take execution context such as kernel
mode into account when compiling blocks. That's why it's necessary to
identify blocks not just by address but by all the information used at
compile time. This is done by computing a 32-bit key and using that as
a block's identifier instead of the last six physical address bits, as
was done before.

Since we now have 32-bit instead of 6-bit keys, the block() function
hashes the keys before mapping them to one of the 64 pool rows. The hash
function was chosen arbitrarily to be better than a simple multiplicative
hash and is likely not the best choice for this exact task.
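
As a rough illustration (not part of the commit message, and not the patch's actual hash function), mapping a 32-bit block key to one of the 64 pool rows could look like this:

    // Sketch only: reduce a 32-bit block key to a row index. The mixing
    // constants are placeholders; the patch uses a different hash.
    u32 rowIndex(u32 key) {
      u32 h = key;
      h ^= h >> 16;
      h *= 0x7feb352d;  // arbitrary odd multiplier to spread the bits
      h ^= h >> 15;
      return h & 63;    // 64 rows, matching the old Pool size
    }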

  • Pass JITContext down to leaf emit functions.
  • Emit inline implementations of basic 64-bit operations.
  • Use block compile-time information to elide kernel mode checks of
    the now inlined operations.

@ghost ghost marked this pull request as draft September 8, 2024 15:08
@LukeUsher LukeUsher requested review from invertego and rasky September 9, 2024 09:00
.cop1Enabled = scc.status.enable.coprocessor1 > 0,
.floatingPointMode = scc.status.floatingPointMode > 0,
.is64bit = context.bits == 64,
});
Collaborator

I'm a bit concerned that we compose the context structure for every block we run. I believe it might be better if we did this on changes instead. I understand it's harder to get right and can bring more bugs, but I believe it's going to be faster.

So let's say we have a function that recalculates the current context and its hash key and stores them somewhere. At runtime, we just use the precalculated context and the precalculated hash key.

Then you need to call that function to recalculate the context in any code path that can change one of the variables that affect it; for instance, mtc0 of the status register is a prime suspect. There will be others of course, but maybe not dozens of them.

Author

Thank you for the quick review. I moved JITContext inside the existing Context struct; it's now Context::JIT, and its contents plus their bit vector representation are computed only when Context::setMode() is called. This is more often than strictly needed (for example, exceptions change the mcc vector state and call setMode()), but it should still be better than before.

I added separate update and toBits member functions to Context::JIT for later debugging; they can be used to check for staleness bugs in debug builds.
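
A minimal sketch of the shape this takes (field names and bit positions here are my illustration, not the exact patch):

    // Sketch only: JIT-relevant CPU state cached inside Context, refreshed
    // from setMode() instead of being recomposed for every executed block.
    struct JIT {
      bool kernelMode = false;
      bool is64bit = false;
      bool cop1Enabled = false;
      u32 bits = 0;  // packed form, consumed by the block lookup

      u32 toBits() const {
        return (kernelMode  ? 1u << 0 : 0)
             | (is64bit     ? 1u << 1 : 0)
             | (cop1Enabled ? 1u << 2 : 0);
      }

      void update(bool kernel, bool mode64, bool cop1) {
        kernelMode = kernel; is64bit = mode64; cop1Enabled = cop1;
        bits = toBits();  // debug builds can recompute toBits() to catch staleness
      }
    };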

@ghost ghost force-pushed the add-block-hashes branch from 382c633 to ba4504a Compare September 10, 2024 19:02

ghost commented Sep 10, 2024

The GDB::server.hasBreakpoints() check is a bit problematic because right now it gets updated only when Context::setMode is called, even though breakpoints might be added or removed at other times as well. I suppose it could be polled for every block? I considered it part of the execution context, but before it wasn't taken into account at all when looking up blocks. To me it seems like breakpoints only worked after all Pool instances were flushed.

@invertego

Sorry for the delay. I will try to review this in the next few days.

@invertego invertego left a comment

Looks good aside from the issues mentioned. How much more are you planning to do with this before taking it out of draft status?

Block* block;
u32 tag;
};
Row rows[1 << 6];
Contributor

How did you arrive at 64 rows? Were other numbers tried?

@pekkavaa pekkavaa May 5, 2025

It's the same number of blocks the old Pool had:

    struct Pool {
      Block* blocks[1 << 6];
    };

I didn't try any other sizes yet. Seems like it's too small though since I see a CPU allocator flush every few seconds when idling in SM64.

Contributor

Well, originally it was directly indexing the instructions in the block. Now that it's indexing by hash, collisions are an issue. Two potential solutions:

  1. Increase the row count. This will reduce the likelihood of collisions at the expense of memory (and data cache efficiency).
  2. In the event of a mismatch, check if the block was previously compiled and if so, reuse it. The RSP and SH2 recompilers do this, so you can check those out for inspiration.

Contributor

Yeah 64 blocks is way too tiny so it must be constantly thrashing.

Looking at rsp/recompiler.cpp, I can see it has a two-level cache: a direct lookup from context via the address (a direct-mapped cache) and then a second hash table lookup from blocks, keyed by (address, size, self.pipeline). Apparently the first one guarantees pipeline correctness through the saved block.pipeline object. It's unclear to me how the second, hashset<BlockHashPair> blocks, guarantees that hash collisions don't result in invalid blocks being executed. Also, the set size doesn't seem to be capped.

Since RDRAM is much larger than 1024 words (the RSP's IMEM size), the direct-mapped cache wouldn't obviously carry over. But perhaps it could be extended into a simple hash-table-like N-way cache, like in CPUs? Like this:

    struct Pool {
      struct Row {
        struct Way {
          Block* block;
          u32 tag;
        } ways[2]; // <-- tune this size
      };
      Row rows[1 << 6]; // <-- tune this size
    };

And then construct the row index from the low bits of the address, and put the high bits and the state bit vector in the tag. It wouldn't need hashing and is somewhat resistant to conflicts. Checking both ways should be efficient.

I'm not sure if this is superior to a large hash table though. I'll test them both :)
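
For reference, a lookup over that structure might go roughly like this (the index/tag split is only an example; the real bit widths would need tuning):

    // Sketch: N-way lookup. Low address bits choose the row; the remaining
    // address bits plus the state bit vector form the tag.
    Block* lookup(Pool& pool, u32 address, u32 stateBits) {
      u32 index = (address >> 2) & 63;                 // 64 rows
      u32 tag   = (address >> 8) | (stateBits << 24);  // widen the tag if these overlap
      for (auto& way : pool.rows[index].ways) {
        if (way.block && way.tag == tag) return way.block;
      }
      return nullptr;  // miss: compile the block and evict one of the ways
    }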

@pekkavaa pekkavaa May 8, 2025

I now realize the "cache" idea I suggested is how it works right now with the "pool" of "blocks", just with a single cache "way".


ghost commented Jan 19, 2025

Noting here that I've transferred to a new GitHub account: @pekkavaa
I'd like to continue ares JIT performance work at some point, but maybe with a solid benchmarking setup first so that even small changes can be measured reliably.

Edit: And thanks for the review, invertego!

@pekkavaa pekkavaa force-pushed the add-block-hashes branch 3 times, most recently from 64224b2 to 041811c Compare May 5, 2025 14:38

pekkavaa commented May 5, 2025

Hello again 👋 I rebased the branch and addressed all the review comments. I also squashed the two commits together since it got annoying to move changes between them. Things still seem to work and it passes n64-systemtest. The test ROM output looks different now from what I recall, though:
[screenshot]

How much more are you planning to do with this before taking it out of draft status?

I'd like to get an idea of the performance impact and then adjust at least the block pool size accordingly. Any ideas on how to benchmark ares are welcome :)

@invertego invertego left a comment

Looking better. Regarding performance tuning, there is currently no formal process in place. I usually boot a few titles (with audio/video sync disabled) and look for a noticeable impact on framerate, which is a pretty superficial way to analyze things. The starting area outside the castle in SM64 gets a lot of attention, but I can make no claim that it is representative of the performance of the catalog as a whole.


@pekkavaa pekkavaa force-pushed the add-block-hashes branch from 041811c to 0bbd077 Compare May 8, 2025 20:51

pekkavaa commented May 8, 2025

I simplified the implementation. The cached CPU state has only the bits enabled that are actually used by the recompiler. I also got rid of the hash and store the tags in their own array.

Performance is slightly degraded in the SM64 starting area: 139 VPS vs. 144 VPS on master. I'd like to get it to even or better before merging.

One idea I had was to reference blocks with a 32-bit offset instead of a full Block*. I got it working on top of master (commit), and while it didn't improve the frame timings, perhaps it would still alleviate the now-bloated Pool struct.


rasky commented Sep 14, 2025

This needs rebasing after the calling convention PR went in.

I'd still like to merge this even if it seems to regress on performance, because it does two things that seem to be on the right path:
1) it creates the JIT context thing, which is mandatory for good performance, and
2) it inlines many ops.

I also notice that there's always a performance downgrade when you start inlining stuff rather than calling external functions. So either we assume all the JITs in other emulators are wrong, or this trend has to tip at some point and somehow turn into an improvement. It might be that we still bloat blocks too much because of per-instruction overhead; fixing that might eventually recover what now seems like lost performance.


pekkavaa commented Sep 15, 2025

Rebased against master and confirmed that n64-systemtest still passes.

I looked at the implementation again and see that the computed tag doesn't affect the index of a compiled block. In practice this means the recompiler's cache will get thrashed if the tag bits change constantly. I recall commenting out the unused bits so that this doesn't happen anymore, but I recommend coming up with a smarter scheme in the future.


rasky commented Sep 16, 2025

OK, that might explain why there is a speed pessimization rather than a speedup. The context key should participate, together with the address, in the cache index, so that you basically LRU among blocks whose keys are (context, address).

BTW: another rebase now should green the CI.

@pekkavaa pekkavaa force-pushed the add-block-hashes branch 2 times, most recently from 0bbd077 to fa96b99 Compare September 20, 2025 18:20
@pekkavaa

I updated the logic, see the commit message. Basically there's one more layer of indirection for Blocks, plus a cache. The old Pool objects store the latest instance. Similar to what was discussed on Discord.

Performance is now almost on par with master. I'm happy with it.

@pekkavaa

Also I took some "traces" of the block behavior. Here are some quick numbers (data in the same repo): https://github.com/pekkavaa/2025-09-20-recompiler-block-tracing/blob/master/docs/1_trace_stats.md

For example, in the SM64 castle grounds there are 33.21k compiled block lookups per frame but only 4093 unique addresses, and 36% of block cache hits go to the same address (an idle loop?). So the keys are sparse and the accesses are unevenly distributed. This validates the old design (sparse array, very fast lookups), so I tried to stay close to it even with the state bits added.

One optimization idea: the PoolRow struct could be shrunk to 64 bits if the Block* pointer were replaced with a u32 offset.
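
A sketch of what that could look like (assuming, as in the earlier experiment, that all blocks live in one bump-allocated arena so a byte offset from its base can stand in for the pointer):

    // Sketch only: a 64-bit PoolRow with a u32 arena offset instead of Block*.
    struct PoolRow {
      u32 blockOffset;  // 0 means empty; otherwise an offset into the code arena
      u32 stateBits;
    };

    inline Block* resolve(u8* arenaBase, const PoolRow& row) {
      if (!row.blockOffset) return nullptr;
      return reinterpret_cast<Block*>(arenaBase + row.blockOffset);
    }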

bits |= singleInstruction ? 1 << 1 : 0;
bits |= endian ? 1 << 2 : 0;
bits |= (mode & 0x03) << 3;
bits |= cop1Enabled ? 1 << 4 : 0;
Contributor

Including cop1Enabled is causing a lot of code to be compiled twice because the code that runs with cop1 disabled is effectively random. I believe this is because it gets disabled during thread context switches (to allow for lazy switching of FPU state).

A lot of this duplication will occur in blocks that use no FPU instructions, making it an entirely pointless expansion of the code cache. This could be addressed by ignoring the cop1 bit for such blocks.

This is effectively the only thing that blockCache helps with at the moment, and even then only partially: it prevents code from being recompiled more than twice.
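
One possible shape for that (a sketch, not the patch's code; it assumes the compiler records whether any COP1 instruction was emitted for the block):

    // Sketch: a block that uses no FPU instructions matches regardless of the
    // current cop1Enabled state, so toggling COP1 on thread switches does not
    // force a duplicate compilation of FPU-free code.
    constexpr u32 cop1Bit = 1u << 4;  // illustrative bit position

    bool stateMatches(u32 blockState, u32 currentState, bool blockUsesCop1) {
      u32 mask = blockUsesCop1 ? ~0u : ~cop1Bit;
      return (blockState & mask) == (currentState & mask);
    }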

Contributor

Tracking COP1 changes in the hash is indeed not useful right now. I've also understood that libultra flip-flops the cop1Enabled flag on and off on thread switches.

I'll remove the cop1Enabled bit from the hash for now. It can be added back once it's used during code generation, and perhaps the design can be revisited at that point too.

The block cache is not really meant to be hit often; it functions as a convenience to avoid manually writing "if kernel mode gets disabled, purge all blocks" logic. Right now it makes it possible to elide kernel and "dual mode enabled" checks in CPU::Recompiler::emit.

I actually measured that in SM64 the hit rate of the cache is about 99.185% (with the COP1 bit included in the context), which works out to (1 - 0.99185) × 33210 ≈ 271 unordered_map lookups per frame (based on the 33.21k/frame lookups measurement from earlier). So performance-wise it's actually fine, but the point about blocks being compiled twice still stands.

Contributor

I also tested SM64, and the hit rate is 0% when the COP1 bit is excluded. That's what I meant when I said that it's the only thing it helps with.

assert(row->block);
return row->block;
} else {
auto blockIt = blockCache.find(blockCacheKey(address, state));
Contributor

This is reusing any previously compiled block with the same start address and the same CPU context. However, the instructions could be entirely different! This is why the RSP recompiler incorporates a hash of the instructions in the lookup.
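
For comparison, folding a hash of the block's instruction words into the cache key, in the spirit of the RSP recompiler's approach, could look roughly like this (the hash choice and key layout are only an illustration, not ares' actual code):

    // Sketch: key = f(address, CPU state, instruction contents), so a block
    // whose memory has been overwritten can no longer be matched by accident.
    u64 blockCacheKey(u32 address, u32 stateBits, const u32* code, u32 wordCount) {
      u64 h = 0xcbf29ce484222325ull;          // FNV-1a over the instruction words
      for (u32 i = 0; i < wordCount; i++) {
        h ^= code[i];
        h *= 0x100000001b3ull;
      }
      return h ^ (u64(address) << 32) ^ stateBits;
    }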

Contributor

Yikes. Yeah, I forgot to clean up the block cache in invalidatePool(address). I added an inverse mapping from a pool index to a list of cache keys, but it's kind of cheesy and I'm not really happy with it. I wonder if there are other possible solutions besides a block content hash and this.

Contributor

You could maintain a singly linked list of all blocks allocated for a given pool. The head pointer goes in Pool and next pointer goes in Block. You could walk the list in invalidatePool() instead of using poolBlockKeys. Honestly though, you could probably remove both blockCache and poolBlockKeys and just walk the list for lookup. The list shouldn't ever get that long.

Contributor

Adding to my last comment, you could actually keep the lists separate for each pool row to make the lookups even faster.
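
A sketch of that suggestion (field names assumed), with one list head per pool row:

    // Sketch: each row heads a short singly linked list of blocks compiled for
    // that address; lookup walks the list, and invalidatePool() can walk the
    // same lists to discard blocks without a separate blockCache/poolBlockKeys.
    struct Block {
      Block* next = nullptr;   // next block compiled for the same row
      u32 stateBits = 0;
      // ...generated code follows
    };

    struct PoolRow {
      Block* head = nullptr;
    };

    Block* find(const PoolRow& row, u32 stateBits) {
      for (Block* b = row.head; b; b = b->next)
        if (b->stateBits == stateBits) return b;
      return nullptr;          // not compiled yet for this state
    }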

@invertego invertego left a comment

At a minimum, the issue of blockCache lookup must be addressed.

We'd like the recompiler to take execution context such as kernel
mode into account when compiling blocks. That's why it's necessary to
identify blocks not just by address but by all the information used at
compile time. This is done by packing the relevant CPU state into a u32
and storing it alongside the block pointer.

The execution state and its representation as a bit vector are recomputed
only when needed; at the moment that means:

- each time Context::setMode() is called, which happens on powerup,
- on both MTC0 and MFC0 instructions,
- on exceptions,
- when the GDB server requests a block cleanup, and
- when system state is deserialized ("load state").

The recompiler's block pool is now extended with the state bits and a new
compiled block cache. The logic is similar to the old implementation:
a physical address is split into high and low bits, the high bits are
used to index a Pool, and the low bits an individual row inside the pool.

The difference: each row represents the most recently used pair of
(Block*, state) instead of just a Block*. If the state bits don't match,
the recompiler updates the row by looking up the corresponding block in
the new compiled block cache, which is just an unordered_map<u64, Block*>.
This is not supposed to happen often, so even a relatively slow STL
container works fine for the purpose.

Also, there's a new optimization: inlined dual mode instructions.
The recompiler now emits inline implementations of basic 64-bit ops
without runtime kernel mode checks. This is made possible by the block
specialization logic.
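
Roughly, the lookup described there behaves like the following sketch (the types and the key packing are assumptions; blockCache stands for the unordered_map mentioned in the commit message, and the Pool is still selected from the high address bits as before):

    // Sketch of the fast path (most recently used (Block*, state) pair per row)
    // with a fallback into the compiled-block cache on a state mismatch.
    struct Block { /* ...generated code... */ };
    struct PoolRow { Block* block = nullptr; u32 stateBits = 0; };
    struct Pool { PoolRow rows[1 << 6]; };

    Block* lookupBlock(Pool& pool, u32 physAddress, u32 stateBits,
                       std::unordered_map<u64, Block*>& blockCache) {
      auto& row = pool.rows[(physAddress >> 2) & 63];   // low bits pick the row
      if (row.block && row.stateBits == stateBits) return row.block;  // fast path

      u64 key = u64(physAddress) << 32 | stateBits;     // slow path: block cache
      if (auto it = blockCache.find(key); it != blockCache.end()) {
        row.block = it->second;        // refresh the row's most-recently-used pair
        row.stateBits = stateBits;
        return row.block;
      }
      return nullptr;  // not compiled yet for this state: compile and register it
    }
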
bits |= endian ? 1 << 2 : 0;
bits |= (mode & 0x03) << 3;
bits |= cop1Enabled ? 1 << 5 : 0;
// bits |= floatingPointMode ? 1 << 6 : 0;
Contributor

In a comment you mentioned you would exclude the cop1Enabled bit, but you removed the floatingPointMode bit instead. Was this a mistake?

Contributor

Sigh, yes. I'll fix it.


struct PoolRow {
Block* block;
u32 stateBits;
Contributor

Consider moving stateBits to Block and getting rid of PoolRow to save an extra pointer indirection.

Contributor

Well, it could be done, but both the PoolRow and the Block are allocated from the same bump allocator and are usually even neighboring, so it's not like the pointer indirection is a random access to the heap.
