Add a lexer for RISC-V assembly #2145


Open: Timmmm wants to merge 2 commits into base: master

Timmmm commented Jul 15, 2025

This adds a basic lexer for RISC-V assembly. Like all assembly (as far as I know), it doesn't really have a formal grammar, and compilers just kind of do whatever, so this is a best effort. There may be valid assembly that it does not highlight properly.

I have tested it on several random samples from the internet and it seems to be ok though.

The included demo is from the RISC-V ISA manual: https://riscv-specs.timhutt.co.uk/spec/20240411/unpriv-isa-asciidoc.html#_sgemm_example

Timmmm (Author) commented Jul 15, 2025

One thing I wasn't quite sure about: there's armasm, which claims .s files, but those can also be RISC-V assembly (or any kind of assembly). Is it OK to have multiple lexers claiming the same extension? I saw one part of the docs saying this is for when the extension unambiguously identifies the language, which isn't the case here.

Also more testing & suggestions very welcome!

#
# Then it will silently ignore it!
#
rule %r/^[ \t]*#[ \t]*(?:#{RiscvAsm.preproc_directive.join('|')})\b/, Comment::Preproc, :preprocessor_directive
Member

We tend to try and avoid constructing huge regexes with string manipulation like this. Instead, you could use the block syntax perhaps:

rule %r/^\s*#\s*(\w+)\b/ do |m|
  if self.class.preproc_directive.include?(m[1])
    token Comment::Preproc
    push :preprocessor_directive
  else
    token Comment
  end
end
  

Author

Ah interesting. I think I copied that technique from the ARM asm lexer. I can add a comment there so nobody else does that...

Should I change all of the .join('|')s?

Author

Also, any reason for this preference? I think the inline regex would be faster (at least in well-designed languages; not sure about Ruby).

Member

Yeah, there are still a ton of lexers that do it, but changing the old ones would be a huge lift.

In terms of perf, it depends on how long it is. I did some benchmarks a while back, and a simplified regex with one Set inclusion check by hash was faster in at least one version of Ruby (to be fair, that was joining like 200 keywords). Beyond that, though, there's just a lot more that can go wrong in the general case: if someone adds a keyword that includes a * or a parenthesis, etc., the whole thing will break horribly.
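To make the failure mode concrete, here is a minimal plain-Ruby sketch of the two approaches, outside of any lexer. The constant and method names are hypothetical and the mnemonic list is made up for illustration; it is not the code from this PR.

```ruby
require 'set'

# Hypothetical mnemonic list; `fence.i` is included only because it contains a dot.
MNEMONICS = %w[add sub fence.i].freeze

# Approach 1: interpolate the joined list into one alternation.
# Nothing is escaped, so the `.` in `fence.i` acts as a wildcard.
JOINED = /\A(?:#{MNEMONICS.join('|')})\z/

# Approach 2: match any word-like token, then check membership in a Set.
MNEMONIC_SET = Set.new(MNEMONICS).freeze
WORD = /\A[\w.]+\z/

def mnemonic?(text)
  WORD.match?(text) && MNEMONIC_SET.include?(text)
end

# Returns true if the joined alternation still compiles after adding `extra`.
def joined_regex_compiles?(extra)
  Regexp.new((MNEMONICS + [extra]).join('|'))
  true
rescue RegexpError
  false
end

puts JOINED.match?('fencexi')        # true: the unescaped dot matched "x"
puts mnemonic?('fencexi')            # false: exact lookup only
puts joined_regex_compiles?('foo(')  # false: unbalanced parenthesis
```

The Set lookup compares whole strings, so special characters in a keyword cannot change what the pattern matches.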

Author

I updated it to use sets. One significant downside though is that they are less specific than using .join('|'). With the join version the regex only matches for those specific words, whereas with (\w+) it matches for all words, and there's no way to discard a match as far as I could see.

So, for example, your suggested change above was not quite correct: given the input # hello world, it would lex # hello as a Comment but then world as an instruction.
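The problem can be reproduced without Rouge. The sketch below uses StringScanner from the standard library as a stand-in for rule matching; `PREPROC`, `lex_hash_line` and the token symbols are hypothetical names, and consuming the rest of the line is just one possible workaround, not necessarily the one used in this PR.

```ruby
require 'set'
require 'strscan'

# Hypothetical directive list, for illustration only.
PREPROC = Set.new(%w[define include ifdef endif]).freeze

# Lexes one line starting with `#`. With consume_rest: false this mimics the
# suggested rule: only `# word` is consumed, whatever the word turns out to be.
def lex_hash_line(line, consume_rest:)
  ss = StringScanner.new(line)
  tokens = []
  if ss.scan(/^\s*#\s*(\w+)\b/)
    if PREPROC.include?(ss[1])
      tokens << [:preproc, ss.matched]
    else
      text = ss.matched
      # Possible workaround: an ordinary comment swallows the rest of the line.
      text += ss.scan(/[^\n]*/) if consume_rest
      tokens << [:comment, text]
    end
  end
  # Anything left over would be handed to the other rules (e.g. instructions).
  tokens << [:unlexed, ss.rest] unless ss.eos?
  tokens
end

p lex_hash_line('# hello world', consume_rest: false)
# => [[:comment, "# hello"], [:unlexed, " world"]]
p lex_hash_line('# hello world', consume_rest: true)
# => [[:comment, "# hello world"]]
```

In a real lexer the same effect could presumably be achieved with an extra state for the remainder of the line, which is closer to what the comment below describes.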

I was able to work around that by adding some more states and things though, and I took the opportunity to improve it a bit so e.g. registers are highlighted in preprocessor definitions.

Also, your point about special characters is a good one: I did actually have v0.t in there and didn't notice that its unescaped . matches any character! Fairly benign, but still. Probably all the .join('|')s should be changed to use something that escapes special characters.
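For the remaining .join('|') call sites, one option in core Ruby is Regexp.union, which escapes each string before joining the alternatives. This is a small sketch with a hypothetical operand list, not a proposal for any specific lexer:

```ruby
# Hypothetical operand list containing a regex metacharacter.
OPERANDS = %w[v0 v0.t].freeze

# Unescaped join: the `.` in `v0.t` matches any character.
NAIVE = /\A(?:#{OPERANDS.join('|')})\z/

# Regexp.union escapes every string first, so `.` only matches a literal dot.
ESCAPED = /\A#{Regexp.union(OPERANDS)}\z/

puts NAIVE.match?('v0xt')            # true (unintended)
puts ESCAPED.match?('v0xt')          # false
puts ESCAPED.match?('v0.t')          # true
puts Regexp.union(OPERANDS).source   # v0|v0\.t
```

Regexp.escape does the same for a single string if the pattern has to be assembled by hand.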

Timmmm added 2 commits July 28, 2025 11:00
Sets are the preferred method. I also reorganised the states a bit to make things work slightly more nicely (e.g. it highlights registers in preprocessor definitions).