Conversation


@ghost ghost commented Sep 8, 2024

Reworking the recompiler's JIT block accesses to be more flexible so that we can bake more runtime info into them. This way runtime checks can be avoided in the generated code, which makes inlining the implementations more attractive since we don't need to generate calls to e.g. kernel mode checks for every 64-bit op. This is expected to eventually increase the recompiled code's performance.

I expect these changes to have a small negative performance impact but haven't measured it.

Commit messages:

We'd like the recompiler to take execution context such as kernel
mode into account when compiling blocks. That's why it's necessary to
identify blocks not just by address but by all the information used at
compile time. This is done by computing a 32-bit key and using that as
a block's identifier instead of the last six physical address bits, as
was done before.

Since we now have 32-bit instead of 6-bit keys, the block() function
hashes the keys before mapping them to one of the 64 pool rows. The hash
function was chosen arbitrarily to be better than a simple multiplicative
hash and is likely not the best choice for this exact task.
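
As a rough illustration (not part of the commit message, and not the patch's actual hash function), mapping a 32-bit block key to one of the 64 pool rows could look like this:

    // Sketch only: reduce a 32-bit block key to a row index. The mixing
    // constants are placeholders; the patch uses a different hash.
    u32 rowIndex(u32 key) {
      u32 h = key;
      h ^= h >> 16;
      h *= 0x7feb352d;  // arbitrary odd multiplier to spread the bits
      h ^= h >> 15;
      return h & 63;    // 64 rows, matching the old Pool size
    }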

  • Pass JITContext down to leaf emit functions.
  • Emit inline implementations of basic 64-bit operations.
  • Use block compile-time information to elide kernel mode checks of
    the now inlined operations.

@ghost ghost marked this pull request as draft September 8, 2024 15:08
@LukeUsher LukeUsher requested review from invertego and rasky September 9, 2024 09:00
.cop1Enabled = scc.status.enable.coprocessor1 > 0,
.floatingPointMode = scc.status.floatingPointMode > 0,
.is64bit = context.bits == 64,
});
Collaborator

I'm a bit concerned that we compose the context structure for every block we run. I believe it might be better if we did this on changes instead. I understand it's harder to get right and can bring more bugs, but I believe it's going to be faster.

So let's say we have a function that recalculates the current context and its hash key and stores them somewhere. At runtime, we just use the precalculated context and the precalculated hash key.

Then you need to call that function to recalculate the context in any code path that can change one of the variables that affect it; for instance, mtc0 of the status register is a prime suspect. There will be others of course, but maybe not dozens of them.

Author

Thank you for the quick review. I moved JITContext inside the existing Context struct; it's now Context::JIT, and its contents plus their bit vector representation are computed only when Context::setMode() is called. This is more often than strictly needed (for example, exceptions change the mcc vector state and call setMode()), but it should still be better than before.

I added separate update and toBits member functions to Context::JIT for later debugging; they can be used to check for staleness bugs in debug builds.
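
A minimal sketch of the shape this takes (field names and bit positions here are my illustration, not the exact patch):

    // Sketch only: JIT-relevant CPU state cached inside Context, refreshed
    // from setMode() instead of being recomposed for every executed block.
    struct JIT {
      bool kernelMode = false;
      bool is64bit = false;
      bool cop1Enabled = false;
      u32 bits = 0;  // packed form, consumed by the block lookup

      u32 toBits() const {
        return (kernelMode  ? 1u << 0 : 0)
             | (is64bit     ? 1u << 1 : 0)
             | (cop1Enabled ? 1u << 2 : 0);
      }

      void update(bool kernel, bool mode64, bool cop1) {
        kernelMode = kernel; is64bit = mode64; cop1Enabled = cop1;
        bits = toBits();  // debug builds can recompute toBits() to catch staleness
      }
    };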

@ghost ghost force-pushed the add-block-hashes branch from 382c633 to ba4504a Compare September 10, 2024 19:02

ghost commented Sep 10, 2024

The GDB::server.hasBreakpoints() check is a bit problematic because right now it gets updated only when Context::setMode is called, even though breakpoints might be added or removed at other times as well. I suppose it could be polled for every block? I considered it part of the execution context, but before it wasn't taken into account at all when looking up blocks. To me it seems like breakpoints only worked after all Pool instances were flushed.

@invertego

Sorry for the delay. I will try to review this in the next few days.

@invertego invertego left a comment

Looks good aside from the issues mentioned. How much more are you planning to do with this before taking it out of draft status?

Block* block;
u32 tag;
};
Row rows[1 << 6];
Contributor

How did you arrive at 64 rows? Were other numbers tried?

@pekkavaa pekkavaa May 5, 2025

It's the same number of blocks the old Pool had:

    struct Pool {
      Block* blocks[1 << 6];
    };

I didn't try any other sizes yet. Seems like it's too small though since I see a CPU allocator flush every few seconds when idling in SM64.

Contributor

Well, originally it was directly indexing the instructions in the block. Now that it's indexing by hash, collisions are an issue. Two potential solutions:

  1. Increase the row count. This will reduce the likelihood of collisions at the expense of memory (and data cache efficiency).
  2. In the event of a mismatch, check if the block was previously compiled and if so, reuse it. The RSP and SH2 recompilers do this, so you can check those out for inspiration.

Contributor

Yeah 64 blocks is way too tiny so it must be constantly thrashing.

Looking at rsp/recompiler.cpp, I can see it has a two-level cache: a direct lookup from context via the address (a direct-mapped cache) and then a second hash table lookup from blocks, keyed by (address, size, self.pipeline). Apparently the first one guarantees pipeline correctness through the saved block.pipeline object. It's unclear to me how the second, hashset<BlockHashPair> blocks, guarantees that hash collisions don't result in invalid blocks being executed. Also, the set size doesn't seem to be capped.

Since RDRAM is much larger than 1024 words (the RSP's IMEM size), the direct-mapped cache wouldn't obviously carry over. But perhaps it could be extended into a simple hash-table-like N-way cache, like in CPUs? Like this:

    struct Pool {
      struct Row {
        struct Way {
          Block* block;
          u32 tag;
        } ways[2]; // <-- tune this size
      };
      Row rows[1 << 6]; // <-- tune this size
    };

And then construct the row index from the low bits of the address, and put the high bits and the state bit vector in the tag. It wouldn't need hashing and is somewhat resistant to conflicts. Checking both ways should be efficient.

I'm not sure if this is superior to a large hash table though. I'll test them both :)
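
For reference, a lookup over that structure might go roughly like this (the index/tag split is only an example; the real bit widths would need tuning):

    // Sketch: N-way lookup. Low address bits choose the row; the remaining
    // address bits plus the state bit vector form the tag.
    Block* lookup(Pool& pool, u32 address, u32 stateBits) {
      u32 index = (address >> 2) & 63;                 // 64 rows
      u32 tag   = (address >> 8) | (stateBits << 24);  // widen the tag if these overlap
      for (auto& way : pool.rows[index].ways) {
        if (way.block && way.tag == tag) return way.block;
      }
      return nullptr;  // miss: compile the block and evict one of the ways
    }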

@pekkavaa pekkavaa May 8, 2025

I now realize the "cache" idea I suggested is how it works right now with the "pool" of "blocks", just with a single cache "way".


ghost commented Jan 19, 2025

Noting here that I've transferred to a new GitHub account: @pekkavaa
I'd like to continue ares JIT performance work at some point, but maybe with a solid benchmarking setup first so that even small changes can be measured reliably.

Edit: And thanks for the review, invertego!

@pekkavaa pekkavaa force-pushed the add-block-hashes branch 3 times, most recently from 64224b2 to 041811c Compare May 5, 2025 14:38

pekkavaa commented May 5, 2025

Hello again 👋 I rebased the branch and addressed all the review comments. I also squashed the two commits together since it got annoying to move changes between them. Things still seem to work and it passes n64-systemtest. The test ROM output looks different now from what I recall, though:
[screenshot]

How much more are you planning to do with this before taking it out of draft status?

I'd like to get an idea of the performance impact and then adjust at least the block pool size accordingly. Any ideas on how to benchmark ares are welcome :)

@invertego invertego left a comment

Looking better. Regarding performance tuning, there is currently no formal process in place. I usually boot a few titles (with audio/video sync disabled) and look for a noticeable impact on framerate, which is a pretty superficial way to analyze things. The starting area outside the castle in SM64 gets a lot of attention, but I can make no claim that it is representative of the performance of the catalog as a whole.


@pekkavaa pekkavaa force-pushed the add-block-hashes branch from 041811c to 0bbd077 Compare May 8, 2025 20:51

pekkavaa commented May 8, 2025

I simplified the implementation. The cached CPU state has only the bits enabled that are actually used by the recompiler. I also got rid of the hash and store the tags in their own array.

Performance is slightly degraded in the SM64 starting area: 139 VPS vs. 144 VPS on master. I'd like to get it to even or better before merging.

One idea I had was to reference blocks with a 32-bit offset instead of a full Block*. I got it working on top of master (commit), and while it didn't improve the frame timings, perhaps it would still alleviate the now-bloated Pool struct.


rasky commented Sep 14, 2025

This needs rebasing after the calling convention PR went in.

I'd still like to merge this even if it seems to regress on performance, because it does two things that seem to be on the right path:
1) it creates the JIT context thing, which is mandatory for good performance, and
2) it inlines many ops.

I also notice that there's always a performance downgrade when you start inlining stuff rather than calling external functions. So either we assume all the JITs in other emulators are wrong, or this trend has to tip at some point and somehow turn into an improvement. It might be that we still bloat blocks too much because of per-instruction overhead; fixing that might eventually recover what now seems like lost performance.


pekkavaa commented Sep 15, 2025

Rebased against master and confirmed that n64-systemtest still passes.

I looked at the implementation again and see that the computed tag doesn't affect the index of a compiled block. In practice this means the recompiler's cache will get thrashed if the tag bits change constantly. I recall commenting out the unused bits so that this doesn't happen anymore, but I recommend coming up with a smarter scheme in the future.


rasky commented Sep 16, 2025

OK, that might explain why there is a speed pessimization rather than a speedup. The context key should participate, together with the address, in the cache index, so that you basically LRU among blocks whose keys are (context, address).

BTW: another rebase now should green the CI.

@pekkavaa pekkavaa force-pushed the add-block-hashes branch 2 times, most recently from 0bbd077 to fa96b99 Compare September 20, 2025 18:20
@pekkavaa

I updated the logic, see the commit message. Basically there's one more layer of indirection for Blocks, plus a cache. The old Pool objects store the latest instance. Similar to what was discussed on Discord.

Performance is now almost on par with master. I'm happy with it.

@pekkavaa

Also I took some "traces" of the block behavior. Here are some quick numbers (data in the same repo): https://github.com/pekkavaa/2025-09-20-recompiler-block-tracing/blob/master/docs/1_trace_stats.md

For example, in the SM64 castle grounds there are 33.21k compiled block lookups per frame but only 4093 unique addresses, and 36% of block cache hits go to the same address (an idle loop?). So the keys are sparse and the accesses are unevenly distributed. This validates the old design (sparse array, very fast lookups), so I tried to stay close to it even with the state bits added.

One optimization idea: the PoolRow struct could be shrunk to 64 bits if the Block* pointer were replaced with a u32 offset.
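
A sketch of what that could look like (assuming, as in the earlier experiment, that all blocks live in one bump-allocated arena so a byte offset from its base can stand in for the pointer):

    // Sketch only: a 64-bit PoolRow with a u32 arena offset instead of Block*.
    struct PoolRow {
      u32 blockOffset;  // 0 means empty; otherwise an offset into the code arena
      u32 stateBits;
    };

    inline Block* resolve(u8* arenaBase, const PoolRow& row) {
      if (!row.blockOffset) return nullptr;
      return reinterpret_cast<Block*>(arenaBase + row.blockOffset);
    }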

bits |= singleInstruction ? 1 << 1 : 0;
bits |= endian ? 1 << 2 : 0;
bits |= (mode & 0x03) << 3;
bits |= cop1Enabled ? 1 << 4 : 0;
Contributor

Including cop1Enabled is causing a lot of code to be compiled twice because the code that runs with cop1 disabled is effectively random. I believe this is because it gets disabled during thread context switches (to allow for lazy switching of FPU state).

A lot of this duplication will occur in blocks that use no FPU instructions, making it an entirely pointless expansion of the code cache. This could be addressed by ignoring the cop1 bit for such blocks.

This is effectively the only thing that blockCache helps with at the moment, and even then only partially: it prevents code from being recompiled more than twice.
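
One possible shape for that (a sketch, not the patch's code; it assumes the compiler records whether any COP1 instruction was emitted for the block):

    // Sketch: a block that uses no FPU instructions matches regardless of the
    // current cop1Enabled state, so toggling COP1 on thread switches does not
    // force a duplicate compilation of FPU-free code.
    constexpr u32 cop1Bit = 1u << 4;  // illustrative bit position

    bool stateMatches(u32 blockState, u32 currentState, bool blockUsesCop1) {
      u32 mask = blockUsesCop1 ? ~0u : ~cop1Bit;
      return (blockState & mask) == (currentState & mask);
    }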

Contributor

Tracking COP1 changes in the hash is indeed not useful right now. I've also understood that libultra flip-flops the cop1Enabled flag on and off on thread switches.

I'll remove the cop1Enabled bit from the hash for now. It can be added back once it's used during code generation, and perhaps the design can be revisited at that point too.

The block cache is not really meant to be hit often; it functions as a convenience to avoid manually writing "if kernel mode gets disabled, purge all blocks" logic. Right now it makes it possible to elide kernel and "dual mode enabled" checks in CPU::Recompiler::emit.

I actually measured that in SM64 the hit rate of the cache is about 99.185% (with the COP1 bit included in the context), which works out to (1 - 0.99185) × 33210 ≈ 271 unordered_map lookups per frame (based on the 33.21k/frame lookups measurement from earlier). So performance-wise it's actually fine, but the point about blocks being compiled twice still stands.

Contributor

I also tested SM64, and the hit rate is 0% when the COP1 bit is excluded. That's what I meant when I said that it's the only thing it helps with.

assert(row->block);
return row->block;
} else {
auto blockIt = blockCache.find(blockCacheKey(address, state));
Contributor

This is reusing any previously compiled block with the same start address and the same CPU context. However, the instructions could be entirely different! This is why the RSP recompiler incorporates a hash of the instructions in the lookup.
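
For comparison, folding a hash of the block's instruction words into the cache key, in the spirit of the RSP recompiler's approach, could look roughly like this (the hash choice and key layout are only an illustration, not ares' actual code):

    // Sketch: key = f(address, CPU state, instruction contents), so a block
    // whose memory has been overwritten can no longer be matched by accident.
    u64 blockCacheKey(u32 address, u32 stateBits, const u32* code, u32 wordCount) {
      u64 h = 0xcbf29ce484222325ull;          // FNV-1a over the instruction words
      for (u32 i = 0; i < wordCount; i++) {
        h ^= code[i];
        h *= 0x100000001b3ull;
      }
      return h ^ (u64(address) << 32) ^ stateBits;
    }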

Contributor

Yikes. Yeah, I forgot to clean up the block cache in invalidatePool(address). I added an inverse mapping from a pool index to a list of cache keys, but it's kind of cheesy and I'm not really happy with it. I wonder if there are other possible solutions besides a block content hash and this.

Contributor

You could maintain a singly linked list of all blocks allocated for a given pool. The head pointer goes in Pool and next pointer goes in Block. You could walk the list in invalidatePool() instead of using poolBlockKeys. Honestly though, you could probably remove both blockCache and poolBlockKeys and just walk the list for lookup. The list shouldn't ever get that long.

Contributor

Adding to my last comment, you could actually keep the lists separate for each pool row to make the lookups even faster.
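
A sketch of that suggestion (field names assumed), with one list head per pool row:

    // Sketch: each row heads a short singly linked list of blocks compiled for
    // that address; lookup walks the list, and invalidatePool() can walk the
    // same lists to discard blocks without a separate blockCache/poolBlockKeys.
    struct Block {
      Block* next = nullptr;   // next block compiled for the same row
      u32 stateBits = 0;
      // ...generated code follows
    };

    struct PoolRow {
      Block* head = nullptr;
    };

    Block* find(const PoolRow& row, u32 stateBits) {
      for (Block* b = row.head; b; b = b->next)
        if (b->stateBits == stateBits) return b;
      return nullptr;          // not compiled yet for this state
    }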

@invertego invertego left a comment

At a minimum, the issue of blockCache lookup must be addressed.

We'd like the recompiler to take execution context such as kernel
mode into account when compiling blocks. That's why it's necessary to
identify blocks not just by address but by all the information used at
compile time. This is done by packing the relevant CPU state into a u32
and storing it alongside the block pointer.

The execution state and its representation as a bit vector are recomputed
only when needed; at the moment that means:

- each time Context::setMode() is called, which happens on powerup,
- on both MTC0 and MFC0 instructions,
- on exceptions,
- when the GDB server requests a block cleanup, and
- when system state is deserialized ("load state").

The recompiler's block pool is now extended with the state bits and a new
compiled block cache. The logic is similar to the old implementation:
a physical address is split into high and low bits, the high bits are
used to index a Pool, and the low bits an individual row inside the pool.

The difference: each row represents the most recently used pair of
(Block*, state) instead of just a Block*. If the state bits don't match,
the recompiler updates the row by looking up the corresponding block in
the new compiled block cache, which is just an unordered_map<u64, Block*>.
This is not supposed to happen often, so even a relatively slow STL
container works fine for the purpose.

Also, there's a new optimization: inlined dual mode instructions.
The recompiler now emits inline implementations of basic 64-bit ops
without runtime kernel mode checks. This is made possible by the block
specialization logic.
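
Roughly, the lookup described there behaves like the following sketch (the types and the key packing are assumptions; blockCache stands for the unordered_map mentioned in the commit message, and the Pool is still selected from the high address bits as before):

    // Sketch of the fast path (most recently used (Block*, state) pair per row)
    // with a fallback into the compiled-block cache on a state mismatch.
    struct Block { /* ...generated code... */ };
    struct PoolRow { Block* block = nullptr; u32 stateBits = 0; };
    struct Pool { PoolRow rows[1 << 6]; };

    Block* lookupBlock(Pool& pool, u32 physAddress, u32 stateBits,
                       std::unordered_map<u64, Block*>& blockCache) {
      auto& row = pool.rows[(physAddress >> 2) & 63];   // low bits pick the row
      if (row.block && row.stateBits == stateBits) return row.block;  // fast path

      u64 key = u64(physAddress) << 32 | stateBits;     // slow path: block cache
      if (auto it = blockCache.find(key); it != blockCache.end()) {
        row.block = it->second;        // refresh the row's most-recently-used pair
        row.stateBits = stateBits;
        return row.block;
      }
      return nullptr;  // not compiled yet for this state: compile and register it
    }
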
bits |= endian ? 1 << 2 : 0;
bits |= (mode & 0x03) << 3;
bits |= cop1Enabled ? 1 << 5 : 0;
// bits |= floatingPointMode ? 1 << 6 : 0;
Contributor

In a comment you mentioned you would exclude the cop1Enabled bit, but you removed the floatingPointMode bit instead. Was this a mistake?

Contributor

Sigh, yes. I'll fix it.


struct PoolRow {
Block* block;
u32 stateBits;
Contributor

Consider moving stateBits to Block and getting rid of PoolRow to save an extra pointer indirection.

Contributor

Well, it could be done, but both the PoolRow and the Block are allocated from the same bump allocator and are usually even neighboring, so it's not like the pointer indirection is a random access to the heap.
