Description
CHERIoT has two types of tags: capability tags and revocation tags. There is one capability tag per 64-bit region of memory to which a capability can be stored, and one revocation tag per 64-bit region of memory that can be revoked. Unlike capability tags, revocation tags are memory mapped so that the memory allocator can manipulate them. Revocation works as follows: once a region of memory has been revoked, any capability that is loaded from memory is checked against the revocation bit corresponding to that capability's base address, and if the bit is set the loaded capability is invalidated. Because the check uses only the capability's base, revoking a whole piece of memory requires setting the revocation bits for every granule from its start to its end. This check is called a load-side barrier and is enforced by the hardware. It gives the system time to sweep the capability-tagged memory for revoked capabilities; once a sweep is complete, that memory can be reused in the knowledge that no old capabilities pointing to this region remain anywhere in the system. The purpose of this document is to identify potential techniques that could be used to accelerate the search for stale capabilities in hardware.
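As a rough behavioural sketch of this load-side check (assuming one revocation bit per 8-byte granule; RAM_BASE, revocation_bits and cap_t are illustrative names, not the actual CHERIoT hardware interface):

    #include <stdbool.h>
    #include <stdint.h>

    #define RAM_BASE 0x80000000u

    extern uint32_t revocation_bits[]; /* memory-mapped revocation bitmap, one bit per 8-byte granule */

    typedef struct {
        bool     tag;  /* capability tag bit */
        uint32_t base; /* base address of the capability; bounds, permissions, address omitted */
    } cap_t;

    static bool granule_revoked(uint32_t addr)
    {
        uint32_t granule = (addr - RAM_BASE) / 8;
        return (revocation_bits[granule / 32] >> (granule % 32)) & 1;
    }

    /* Applied by the hardware to every capability loaded from memory. */
    static cap_t load_filter(cap_t c)
    {
        if (c.tag && granule_revoked(c.base)) {
            c.tag = false; /* the stale capability becomes unusable */
        }
        return c;
    }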
Sweeping
Sweeping means scanning memory between a start and an end address for capabilities that have been revoked. The initial approach is to load each 64-bit region and check whether it holds a capability whose base has been revoked. A number of improvements are described in the following sections.
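A minimal software sketch of this naive sweep, reusing the illustrative cap_t and granule_revoked() from the sketch above and standing in for the bus reads and write-backs a hardware accelerator would perform itself:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Same illustrative helpers as in the load-filter sketch above. */
    typedef struct { bool tag; uint32_t base; } cap_t;
    extern bool granule_revoked(uint32_t addr);

    /* Visit every 8-byte granule in [first, last) and clear the tag of any
     * capability whose base lies in revoked memory. */
    void sweep(cap_t *mem, size_t first, size_t last)
    {
        for (size_t i = first; i < last; i++) {
            cap_t c = mem[i];
            if (c.tag && granule_revoked(c.base)) {
                c.tag = false; /* stale capability found */
                mem[i] = c;    /* write back only when something changed */
            }
        }
    }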
Epoch
Sweeping can be combined with an epoch counter, where regions revoked before a given epoch are guaranteed to have been swept by the time the next epoch is reached. This allows multiple revocation runs to be interleaved.
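One way to picture the epoch bookkeeping (the counters used by the real CHERIoT allocator may differ; current_epoch, quarantined_region_t and safe_to_reuse are illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    /* Bumped by the revoker each time a complete sweep of memory finishes. */
    static uint32_t current_epoch;

    typedef struct {
        uint32_t base, size;
        uint32_t revoked_in_epoch; /* value of current_epoch when the revocation bits were set */
    } quarantined_region_t;

    /*
     * A region can be reused once at least one sweep has both started and
     * finished after its revocation bits were set: the sweep in progress at
     * that moment (ending at revoked_in_epoch + 1) may already have passed
     * some granules, so wait for the one after it.
     */
    static bool safe_to_reuse(const quarantined_region_t *r)
    {
        return current_epoch >= r->revoked_in_epoch + 2;
    }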
Hierarchical tagging
Instead of looking at each individual 64-bit region, regions can be grouped together and complete swathes of memory marked as holding no capabilities, which helps accelerate the search for stale capabilities. In systems with virtual memory this can exclude complete pages that are known to contain no capabilities, but it also applies to CHERIoT systems where there is no virtual memory mapping. For example, in a system with 128 KiB of RAM you can use an additional 32 bits, one per 4 KiB region, to record whether that region contains at least one capability. You can also build a complete binary tree over the tags at the cost of doubling the capability tag region from 2 KiB to 4 KiB. Any granularity in between works as well.
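A sketch of such a hierarchical sweep for the 128 KiB example, where region_summary and sweep_region() are illustrative names:

    #include <stdint.h>

    #define RAM_SIZE    (128u * 1024u)
    #define REGION_SIZE (4u * 1024u)
    #define NUM_REGIONS (RAM_SIZE / REGION_SIZE) /* 32 regions, one summary bit each */

    extern uint32_t region_summary;            /* bit r set => region r holds at least one capability */
    extern void sweep_region(uint32_t region); /* per-granule sweep, as in the earlier sketch */

    void hierarchical_sweep(void)
    {
        for (uint32_t r = 0; r < NUM_REGIONS; r++) {
            if (region_summary & (1u << r)) {
                sweep_region(r); /* regions with no capabilities are skipped entirely */
            }
        }
    }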
Integration with the LSU
In the open-source CHERIoT Ibex repository, the hardware revocation sweep accelerator is integrated into the LSU so that it only injects reads and writes onto the bus when the LSU is idle. The downside is that the accelerator's progress depends on the LSU's request output.
Compartmentalize the RAM
Instead of interleaving requests, the RAM can be split into multiple blocks. While the LSU is addressing one block, the hardware revocation sweep accelerator can access the others. The simplest solution is to split the RAM in half, but a two-way split suffers more contention than a finer one. The downside is that the accelerator's progress depends on which address the LSU outputs.
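A behavioural sketch of the resulting access rule, assuming a 128 KiB RAM split into two 64 KiB banks (bank_of and accelerator_may_access are illustrative names; the real decision is made in hardware):

    #include <stdbool.h>
    #include <stdint.h>

    /* Two 64 KiB banks of a 128 KiB RAM: bit 16 of the offset selects the bank. */
    static inline uint32_t bank_of(uint32_t addr)
    {
        return (addr >> 16) & 1u;
    }

    /* The accelerator may issue a request this cycle only if the LSU is idle
     * or is addressing the other bank. */
    static inline bool accelerator_may_access(uint32_t accel_addr, bool lsu_valid, uint32_t lsu_addr)
    {
        return !lsu_valid || bank_of(accel_addr) != bank_of(lsu_addr);
    }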
Dual port RAM
Use a dual-port RAM, with CPU requests routed to one port and the accelerator's requests routed to the other.
Bus arbitration
Instead of integrating tightly with the LSU or creating a separate bus, a simple bus arbiter can decide whether the LSU or the accelerator has priority; such an arbiter is already necessary on a multi-host bus.
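A behavioural sketch of a fixed-priority decision of this kind (illustrative only; the real arbiter is hardware and must also handle back-pressure and response routing):

    #include <stdbool.h>

    typedef enum { GRANT_NONE, GRANT_LSU, GRANT_ACCEL } grant_t;

    /* The LSU always wins; the accelerator only gets the cycles the LSU
     * leaves idle. */
    static grant_t arbitrate(bool lsu_req, bool accel_req)
    {
        if (lsu_req) {
            return GRANT_LSU;
        }
        if (accel_req) {
            return GRANT_ACCEL;
        }
        return GRANT_NONE;
    }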
Caching
Adding caches on the instruction channel, the data channel, or both can help with optimizations, since revocation checks can be done at the interface between the cache and the RAM instead of between the CPU and the cache. Caches inherently have the hierarchical-tagging property mentioned above, so it is much easier to revoke all memory held in the cache while deferring the revocation of RAM to the hardware accelerator on the bus.
Stack zeroing
A different area where hardware acceleration is possible is stack zeroing. To avoid capabilities leaking through the stack, the used portion at the top of the stack must be zeroed before it is reused. This zeroing can be done by a hardware accelerator, which can use similar techniques to the revoker for performance improvements.
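A software sketch of the zeroing such an accelerator would perform (zero_stack and its parameters are illustrative):

    #include <stdint.h>

    /*
     * Zero the used portion of a descending stack, from the lowest address it
     * reached up to the current stack pointer.  Plain zero stores also clear
     * the capability tags of the granules they overwrite, so no capability can
     * leak through this memory.
     */
    void zero_stack(uint64_t *lowest_used, uint64_t *stack_pointer)
    {
        for (uint64_t *p = lowest_used; p < stack_pointer; p++) {
            *p = 0;
        }
    }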
Separate capability tag memory for focussed sweeps
Rather than arranging the system SRAM as a 64+1-bit-wide memory, have a normal 64-bit or 32-bit wide arrangement and store the capability tags in an entirely separate memory arranged similarly to the revocation bitmap, i.e. one from which 32- or 64-bit chunks of tags can be read at a time. Over the main memory interface this tag memory is accessed in parallel with the main SRAM to provide the single capability tag for each access; however, it also has a backdoor interface that the revoker can use directly (e.g. a second port, or arbitrated access to a single port). The revoker can then identify which memory locations actually contain capabilities without doing a full memory sweep, which cuts down on the number of main-bus transactions it requires. It could look at the capability tag bits in groups of 2/4/8/16 and so on and, in a single cycle, queue up an appropriate number of requests in an internal buffer to be sent later. Where capabilities are sparse in memory this greatly reduces the time taken for a single pass.
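A sketch of how the revoker could consume the separate tag memory, where read_tag_word() and queue_granule_check() are illustrative stand-ins for the backdoor port and the accelerator's internal request buffer:

    #include <stdint.h>

    extern uint32_t read_tag_word(uint32_t index);     /* tags of granules [32*index, 32*index + 31] via the backdoor port */
    extern void queue_granule_check(uint32_t granule); /* buffered, later becomes a main-bus access */

    void sweep_with_tag_memory(uint32_t first_tag_word, uint32_t last_tag_word)
    {
        for (uint32_t w = first_tag_word; w < last_tag_word; w++) {
            uint32_t tags = read_tag_word(w);
            while (tags != 0) {
                uint32_t bit = (uint32_t)__builtin_ctz(tags); /* lowest set tag bit */
                queue_granule_check(w * 32u + bit);
                tags &= tags - 1u; /* clear that bit and continue */
            }
        }
    }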