-
Notifications
You must be signed in to change notification settings - Fork 6.9k
Fix Xtensa big-endian instruction decoding #8544
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
@cyrozap Your changes appear inconsistent with the bit-reversal mapping that the Xtensa uses and we document within Your changes violate this mapping. It is easy to fall prey to the |
Yes, my changes do violate this mapping, but the mapping is not how Xtensa big-endian instructions are encoded. This mapping is just a mapping of the big-endian bit numbering (used in the manual to describe the big-endian instruction formats) to the little-endian bit numbering used by Ghidra. But this can be misleading when decoding individual bits in a subset of an instruction field, because Xtensa big-endian instructions are not simply bit-reversed versions of little-endian instructions. Rather, to swap between big-endian and little-endian instruction encodings, it's the order of the fields in the instruction word that are reversed, and then the order of the bytes of the instruction word in memory are also reversed. This line takes care of the byte order swap, but we need to handle the field order swap in the token definitions. And while that code correctly handles the field order swap for instructions that use the whole fields, it does not correctly handle it when an instruction uses subsets of a field, because unlike what you might expect from the way the ISA manual is written, the order of the bits in each field in the instruction word is always the same regardless of the endianness of the processor. For example, take a look at the encoding for SLLI. On page 610 of the ISA manual, the little-endian encoding of the instruction is shown. SLLI uses the RRR instruction format, described on page 656. Based on this information, you can see that in little-endian encoding,
As I stated previously, this is not entirely correct--Xtensa does not use bit-reversal, it uses field-reversal and byte-reversal. I think the confusion stems from the fact that, aside from using big-endian bit numbering, the ISA manual doesn't really make it obvious that the bit order within fields in an instruction word is always preserved. But when you realize that bit 0 of the big-endian instruction word is the most-significant bit and bit 23 is the least-significant bit, the bit ordering within each field becomes obvious--the lower the bit number, the more-significant the bit, and the higher the bit number, the less-significant the bit. Also, I think I should note that I discovered this instruction decoding bug because I was trying to analyze a big-endian Xtensa binary in Ghidra, and while Ghidra was able to disassemble most instructions correctly, many instructions could not be disassembled at all. I knew the binary itself wasn't the problem because the binary ran 100% correctly on real hardware, so the problem had to be in Ghidra. And without the changes I've made in this PR, the binary cannot be disassembled correctly by Ghidra. |
|
@cyrozap Could you please indicate what compiler toolchain and version was used to produce your test binary. |
|
The documented formats within the Xtensa manual certainlydo not reflect conventional endianess byte-swap. |
|
I do agree that when defining a token field which corresponds to a specific bit within a format-defined field such as your sa4 example, we cannot apply the bit mapping which only applies to a fully intact format-defined field range. |
The binary I'm analyzing is not something I built myself and there are no hints in the binary as to what toolchain was used to build it. If you're asking about the test binaries I included in #8537 that could not be disassembled without this change, those instructions were all copied byte-for-byte out of the binary I'm analyzing. Here's some screenshots of code from the binary, disassembled by Ghidra (after being patched with the changes in this PR): SLLI example: MOVI.N example:
Yeah, it's certainly very strange. I can only assume they made it this way to ensure that the major opcode is always in the first byte when read sequentially, since this is an ISA with mixed 16-bit and 24-bit instruction words. But yes it's not just a simple byte swap or bit reversal.
Yes, I think we might be able to make it more clear in the token definition by renaming the fields based on their field names (e.g., Currently, the only commit that could affect LE decoding at all is this one, which adds |
|
I think we may need to do an overhaul on the token field naming to reflect |
Is that a blocker to getting this PR merged? Or can that renaming be done in another PR? |
|
The PR is great as it is. We might have a separate task to update the token fields, but for now getting everything working correctly is a great addition. |




Fixes #8537
The Xtensa processor module was unable to correctly disassemble several instructions in big-endian mode due to bit-ordering issues. Essentially, because Xtensa reverses the order of the fields in big-endian instructions, not just the bytes, simply doing a 1:1 mapping of big-endian bits to little-endian bits as suggested in the comments is not enough. And this is because for certain instructions some of their fields are split into subfields, and the order of those subfields is always the same in the instruction word regardless of whether the instruction is big-endian or little-endian.
To fix this issue, first I went through all the instructions in the ISA manual one by one to identify instructions where subsets of bits in one or more instruction fields were used. Then I went through each of those instructions and corrected each of their field offsets, where necessary. Now there should be no instructions supported by this processor module that won't be decoded correctly in big-endian mode.
I've patched my locally-built version of Ghidra with these changes and have confirmed that the decoding issues I reported are now fixed.