- 
                Notifications
    You must be signed in to change notification settings 
- Fork 160
Add N64 recompiler block hashes & inline 64-bit ops #1640
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
        
          
                ares/n64/cpu/cpu.cpp
              
                Outdated
          
        
      | .cop1Enabled = scc.status.enable.coprocessor1 > 0, | ||
| .floatingPointMode = scc.status.floatingPointMode > 0, | ||
| .is64bit = context.bits == 64, | ||
| }); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit concerned that we compose the context structure every block we run. I believe it might be better if we do this on changes instead. I understand it's harder to get right and can bring more bugs, but I believe it's going to be faster.
So let's say, have a function that recalculates the current context and its hash key and store them somewhere. At runtime, we just use the precalulcated context and the precalculated hash key.
Then, you need to call the function to recalculate the context in any codepath that can change one of the variables that affect it, os for instance mtc0 of the status register is a prime suspect, and there will be others of course, but maybe not dozen of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the quick review. I moved JITContext inside the existing Context struct, now it's Context::JIT and its contents + bit vector representation are computed only when Context::setMode() is called. This is more often than strictly needed, for example exceptions change the mcc vector state and call setMode() but should be better than before.
I added separate update and toBits member functions to Context::JIT for later debugging; they can be used to check for staleness bugs in debug builds.
382c633    to
    ba4504a      
    Compare
  
    | The  | 
| Sorry for the delay. I will try to review this in the next few days. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good aside from the issues mentioned. How much more are you planning to do with this before taking it out of draft status?
        
          
                ares/n64/cpu/cpu.hpp
              
                Outdated
          
        
      | Block* block; | ||
| u32 tag; | ||
| }; | ||
| Row rows[1 << 6]; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did you arrive at 64 rows? Were other numbers tried?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the same number of blocks the old Pool had:
struct Pool {
      Block* blocks[1 << 6];
    };I didn't try any other sizes yet. Seems like it's too small though since I see a CPU allocator flush every few seconds when idling in SM64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, originally it was directly indexing the instructions in the block. Now that it's indexing by hash, collisions are an issue. Two potential solutions:
- Increase the row count. This will reduce the likelihood of collisions at the expense of memory (and data cache efficiency).
- In the event of a mismatch, check if the block was previously compiled and if so, reuse it. The RSP and SH2 recompilers do this, so you can check those out for inspiration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah 64 blocks is way too tiny so it must be constantly thrashing.
Looking at rsp/recompiler.cpp, I can see it has a two-level cache: a direct lookup from context via address (a direct-mapped cache) and then a second hash table lookup from blocks, with the key (address, size, self.pipeline). Apparently the first one guarantees pipeline correctness through the saved block.pipeline object. It's unclear to me how the second, hashset<BlockHashPair> blocks, guarantees that hash collisions don't result in invalid blocks being executed. Also, the set size doesn't seem to be capped.
Since RDRAM is much larger than 1024 words (RSP's IMEM size), the direct-mapped cache wouldn't obviously work. But perhaps it could be extended a simple hash table like N-way cache, like in CPUs? Like this:
    struct Pool {
      struct Row {
        struct Way {
          Block* block;
          u32 tag;
        } ways[2]; // <-- tune this size
      };
      Row rows[1 << 6]; // <-- tune this size
    };And then construct the row index from low bits of the address, put high bits and the state bitvector to tag. Wouldn't need hashing and is somewhat resistant to conflicts. Checking both ways should be efficient.
I'm not sure if this is superior to a large hash table though. I'll test them both :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I now realize the "cache" idea I suggested is how it works right now with the "pool" of "blocks", just with a single cache "way".
| Noting here that I've transferred to a new GitHub account: @pekkavaa Edit: And thanks for the review, invertego! | 
64224b2    to
    041811c      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking better. Regarding performance tuning, there is no formal process in place, currently. I usually boot a few titles (with audio/video sync disabled) to look for noticeable impact on framerate, which is a pretty superficial way to analyze things. The starting area outside the castle in SM64 gets a lot of attention, but I can make no claim to it being representative of the performance of the catalog as a whole.
        
          
                ares/n64/cpu/cpu.hpp
              
                Outdated
          
        
      | Block* block; | ||
| u32 tag; | ||
| }; | ||
| Row rows[1 << 6]; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, originally it was directly indexing the instructions in the block. Now that it's indexing by hash, collisions are an issue. Two potential solutions:
- Increase the row count. This will reduce the likelihood of collisions at the expense of memory (and data cache efficiency).
- In the event of a mismatch, check if the block was previously compiled and if so, reuse it. The RSP and SH2 recompilers do this, so you can check those out for inspiration.
| I simplified the implementation. The cached CPU state has only the bits enabled that are actually used by the recompiler. I also got rid of the hash and store the tags in their own array. Performance is slightly degraded in SM64 starting area: 139 VPS vs 144 VPS of master. I'd like to get it to be even or better before merging. One idea I got was to reference blocks with an 32-bit offset instead of a full  | 
| This needs rebasing after the calling convention PR went in. I'd still like to merge this even if it seems to regress on performance because it does two things that seem on the good path: 1) create the jitcontext thing which is mandatory for good performance I also notice that there's always a performance downgrade when you start inlining stuff rather than calling external function. So we either assume all jits in other emulators are wrong, or this trend has to tip at some point and then somehow switch into an improvement. It might be that we still bloat blocks too much because of per-instruction overhead. Fixing that might eventually recover what seems lost performance now. | 
0bbd077    to
    d1dacb8      
    Compare
  
    | Rebased against master and confirmed that n64-systemtest still passes. I looked at the implementation again now and see that the computed tag doesn't affect the index of a computed block. In practice this means the recompiler's cache will get thrashed if the tag bits change constantly. I recall commenting out the unused bits so it doesn't happen anymore, but recommend coming up with a smarter scheme in the future. | 
| Ok that might explain the reason why there is a speed pessimization rather than speedup. The context key should partecipare with the address as index for the cache, so that you basically LRU among blocks whose keys are (context,address). BTW: another rebase now should green the CI. | 
0bbd077    to
    fa96b99      
    Compare
  
    | I updated the logic, see the commit message. Basically one layer of indirection more for Blocks and a cache. The old Pool objects store the latest instance. Similar to what was discussed on Discord. Performance is now almost on par with master. I'm happy with it. | 
| Also I took some "traces" of the block behavior. Here are some quick numbers (data in the same repo): https://github.com/pekkavaa/2025-09-20-recompiler-block-tracing/blob/master/docs/1_trace_stats.md For example in SM64 castle grounds, there are 33.21k compiled block lookups per frame but only 4093 unique addresses. 36% of block cache hits are to the same address (idle loop?). So the keys are sparse and accesses unevenly distributed. This validates the old design (sparse array, very fast lookups) so I tried to stay close to it even with the state bits added. One optimization idea: the  | 
        
          
                ares/n64/cpu/recompiler.cpp
              
                Outdated
          
        
      | bits |= singleInstruction ? 1 << 1 : 0; | ||
| bits |= endian ? 1 << 2 : 0; | ||
| bits |= (mode & 0x03) << 3; | ||
| bits |= cop1Enabled ? 1 << 4 : 0; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Including cop1Enabled is causing a lot of code to be compiled twice because the code that runs with cop1 disabled is effectively random. I believe this is because it gets disabled during thread context switches (to allow for lazy switching of FPU state).
A lot of this duplication will occur in blocks that use no FPU instructions, making it an entirely pointless expansion of the code cache. This could be addressed by ignoring the cop1 bit for such blocks.
This is effectively the only thing that blockCache helps with at the moment, and there only partially. It prevents code from being recompiled more than twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracking COP1 changes in the hash is indeed not useful right now. I've also understood that libultra flip flops the cop1Enabled flag on and off on thread switches.
I'll remove the COP1Enabled bit from the hash for now. It can be added back once its used during code generation. And perhaps the design can be revisited at that point too.
The block cache is not really meant to be hit often and it's functioning as a convenience to avoid manually writing "if kernel mode gets disabled, purge all blocks" logic. Right now it makes it possible to elide kernel and "dual mode enabled" checks in CPU::Recompiler::emit.
I actually measured that in SM64 the hit rate of the cache is about 99.185% (with COP1 bit included in context), which works out to
(1-(99.185/100))*33210
= 270,6615
unordered_map lookups per frame (based on the 33.21k/frame lookups measurement from earlier). So performance-wise it's actually fine but the point about blocks being compiled twice still stands.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also tested SM64, and the hit rate is 0% when the COP1 bit excluded. That's what I meant when I said that it's the only thing that it helps with.
| assert(row->block); | ||
| return row->block; | ||
| } else { | ||
| auto blockIt = blockCache.find(blockCacheKey(address, state)); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is reusing any previously compiled block with the same start address and the same CPU context. However, the instructions could be entirely different! This is why the RSP recompiler incorporates a hash of the instructions in the lookup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yikes. Yeah I forgot to clean up the block cache in invalidatePool(address). I added an inverse mapping from a pool index to a list of cache keys but it's kind of cheesy. I'm not really happy with it. I wonder if there are other possible solutions other than a block content hash and this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yikes. Yeah I forgot to clean up the block cache in invalidatePool(address). I added an inverse mapping from a pool index to a list of cache keys but it's kind of cheesy. I'm not really happy with it. I wonder if there are other possible solutions other than a block content hash and this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could maintain a singly linked list of all blocks allocated for a given pool. The head pointer goes in Pool and next pointer goes in Block. You could walk the list in invalidatePool() instead of using poolBlockKeys. Honestly though, you could probably remove both blockCache and poolBlockKeys and just walk the list for lookup. The list shouldn't ever get that long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding to my last comment, you could actually keep the lists separate for each pool row to make the lookups even faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At a minimum, the issue of blockCache lookup must be addressed.
fa96b99    to
    d77faf2      
    Compare
  
    We'd like the recompiler to take the execution context such as kernel
mode into account when compiling blocks. That's why it's necessary to
identify blocks not just by address but all the information used at
compile time. This is done by packing relevant CPU state to an u32 and
storing it alongside the block pointer.
The execution state and its representation as bit vector are recomputed
only when needed, at the moment that means
- each time Context::setMode() is called, happens on powerup,
- on both MTC0 and MFC0 instructions,
- on exceptions,
- when the GDB server requests a block cleanup, and
- when system state is deserialized ("load state").
Recompiler's block pool is now extended with the state bits and a new
compiled block cache. The logic is similar to the old implementation:
a physical address is split into high and low bits, high bits are used
to index a Pool, and the low bits an individual row inside the pool.
The difference: each row represents the most recently used pair of
(Block*, state) instead of just a Block*. The recompiler updates a row
if the state bits don't match by looking up the corresponding block in
the new compiled block cache, which is just unordered_map<u64, Block*>.
This is not supposed to happen often, so even a relatively slow STL
container works fine for the purpose.
Also, there's a new optimization: inlined dual mode instructions.
The recompiler now emits inline implementations of basic 64-bit ops
without runtime kernel mode checks. This is made possible by the block
specialization logic.
    d77faf2    to
    414fa22      
    Compare
  
    | bits |= endian ? 1 << 2 : 0; | ||
| bits |= (mode & 0x03) << 3; | ||
| bits |= cop1Enabled ? 1 << 5 : 0; | ||
| // bits |= floatingPointMode ? 1 << 6 : 0; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a comment you mentioned you would exclude the cop1Enabled bit, but you removed the floatingPointMode bit instead. Was this a mistake?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sigh, yes. I'll fix it.
|  | ||
| struct PoolRow { | ||
| Block* block; | ||
| u32 stateBits; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider moving stateBits to Block and getting rid of PoolRow to save an extra pointer indirection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well it could be done but both the PoolRow and Block are allocated from the same bump allocator and usually are even neighboring, so it's not like the pointer indirection is random access to to heap.

Reworking recompiler's JIT block accesses to be more flexible so that we can bake more runtime info to them. This way runtime checks can be avoided in the generated code, making inlining the implementations more attractive since we don't need to generate calls to e.g. kernel mode checks for every 64-bit op. This is expected to increase recompiled code's performance eventually.
I expect these changes to have a small negative performance impact but haven't measured it.
Commit messages: