Add a lexer for RISC-V assembly #2145

Timmmm · 2025-07-15T17:44:33Z

This adds a basic lexer for RISC-V assembly. As with all assembly (as far as I know), there isn't really a formal grammar and compilers just kind of do whatever, so this is a best effort. There may be valid assembly that it does not highlight properly.

I have tested it on several random samples from the internet and it seems to be ok though.

The included demo is from the RISC-V ISA manual: https://riscv-specs.timhutt.co.uk/spec/20240411/unpriv-isa-asciidoc.html#_sgemm_example

Timmmm · 2025-07-15T17:45:50Z

One thing I wasn't quite sure about: armasm already claims .s files, but those can also be RISC-V assembly (or any other kind of assembly). Is it OK to have multiple lexers claiming the same extension? One part of the docs says this is for when the extension unambiguously identifies the language, which isn't the case here.

Also more testing & suggestions very welcome!

jneen · 2025-07-26T15:52:34Z

lib/rouge/lexers/riscvasm.rb

+        #
+        # Then it will silently ignore it!
+        #
+        rule %r/^[ \t]*#[ \t]*(:?#{RiscvAsm.preproc_directive.join('|')})\b/, Comment::Preproc, :preprocessor_directive


We tend to try and avoid constructing huge regexes with string manipulation like this. Instead, you could use the block syntax perhaps:

    rule %r/^\s*#\s*(\w+)\b/ do |m|
      if self.class.preproc_directive.include?(m[1])
        token Comment::Preproc
        push :preprocessor_directive
      else
        token Comment
      end
    end

Timmmm · 2025-07-26

Ah interesting. I think I copied that technique from the ARM asm lexer. I can add a comment there so nobody else does that...

Should I change all of the .join('|')s?

Also, any reason for this preference? I think the inline regex would be faster (at least in well designed languages; not sure about Ruby).

jneen · 2025-07-27

Yeah, there are still a ton of lexers that do it, but changing the old ones would be a huge lift.

In terms of perf, it depends how long it is. I did some benchmarks a while back and a simplified regex with one Set inclusion check by hash was better in at least one version of Ruby (to be fair, that was joining like 200 keywords). Beyond that though there's just a lot more that can go wrong in the general case - if someone adds a keyword that includes a * or a parenthesis, etc, the whole thing will break horribly.
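As a rough illustration of the "simplified regex plus one Set inclusion check" idea, here is a minimal standalone sketch in plain Ruby. It does not use Rouge at all, and the directive list, the `DIRECTIVE_LINE` constant and the `preproc_directive?` helper are hypothetical names made up for this example. The point is only that the keyword list never becomes part of the pattern.

```ruby
require 'set'

# Hypothetical directive list, for illustration only.
PREPROC_DIRECTIVES = Set.new(%w[define include ifdef ifndef endif])

# One small, fixed regex: it captures whatever word follows the '#'.
DIRECTIVE_LINE = /^[ \t]*#[ \t]*(\w+)\b/

# True when the line starts with '#' followed by a known directive.
# The keyword check is a hash lookup in the Set, not regex alternation.
def preproc_directive?(line)
  m = DIRECTIVE_LINE.match(line)
  !m.nil? && PREPROC_DIRECTIVES.include?(m[1])
end
```

Because the words are looked up rather than compiled into the regex, adding a keyword that contains `*` or a parenthesis cannot change or break the pattern.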

Timmmm · 2025-07-28

I updated it to use sets. One significant downside, though, is that they are less specific than using .join('|'). With the join version the regex only matches those specific words, whereas with (\w+) it matches all words, and as far as I could see there's no way to discard a match.

So for example your change above was actually not quite correct: if you're parsing # hello world, it would lex # hello as a Comment but then world as an instruction.

I was able to work around that by adding some more states and things though, and I took the opportunity to improve it a bit so e.g. registers are highlighted in preprocessor definitions.
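To make the `# hello world` problem concrete, here is a small standalone sketch using Ruby's `StringScanner` rather than Rouge's rule engine. It is not the fix that went into the PR (that used extra lexer states, which are not shown in this thread); the `DIRECTIVES` list, the `lex_hash_line` helper and the token names are all hypothetical. It only shows one way to avoid leaving the rest of a non-directive line for the instruction rules.

```ruby
require 'set'
require 'strscan'

# Hypothetical directive list, for illustration only.
DIRECTIVES = Set.new(%w[define include ifdef endif])

# Returns [token_name, text] pairs for a single line.
def lex_hash_line(line)
  ss = StringScanner.new(line)
  tokens = []
  # Match '#' plus an optional following word, anchored at the scan position.
  if ss.scan(/[ \t]*#[ \t]*(\w*)/)
    if DIRECTIVES.include?(ss[1])
      tokens << [:preproc, ss.matched]
      tokens << [:preproc_body, ss.rest] unless ss.eos?
    else
      # Not a directive: take the remainder of the line as well, so that
      # "world" in "# hello world" is not handed to the instruction rules.
      tokens << [:comment, ss.matched + ss.rest]
    end
  else
    tokens << [:other, line]
  end
  tokens
end
```

With this shape, `# hello world` comes out as a single comment token, while `#define N 4` is split into the directive and its body.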

Also your point about adding special characters is a good one - I did actually have v0.t and didn't notice! Fairly benign but still. Probably all the .join('|')s should be changed to use something that escapes special characters.
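For any lexers that keep the joined-regex style, one option for "something that escapes special characters" is Ruby's built-in `Regexp.union`, which escapes each string before joining with `|`. The sketch below is only an illustration; the `registers` list and the `compiles?` helper are made up for this example and are not taken from the PR.

```ruby
# Hypothetical register list containing a regex metacharacter (".").
registers = %w[v0.t v0 a0]

# Unescaped join: the "." in "v0.t" acts as a wildcard.
joined = /\A(?:#{registers.join('|')})\z/

# Regexp.union escapes each string, so "." only matches a literal dot.
escaped = /\A#{Regexp.union(registers)}\z/

# Reports whether an unescaped join of the given words is a valid regex.
def compiles?(words)
  Regexp.new(words.join('|'))
  true
rescue RegexpError
  false
end

joined_overmatches  = joined.match?('v0xt')    # wildcard matches the "x"
escaped_overmatches = escaped.match?('v0xt')   # literal dot does not
escaped_matches_dot = escaped.match?('v0.t')
paren_word_compiles = compiles?(%w[add (bad])  # unbalanced parenthesis
```

The `v0.t` case is the benign failure mode (the pattern silently matches too much), while an unbalanced parenthesis makes the whole regex fail to compile.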

Timmmm force-pushed the riscv branch from 4ec0926 to ed7d88d · July 16, 2025 08:14

jneen reviewed Jul 26, 2025

Timmmm added 2 commits July 28, 2025 11:00

Switch to using sets for RISC-V assembly & improve organisation

2fa4ab3

Sets are the preferred method. I also reorganised the states a bit to make things work slightly more nicely (e.g. it highlights registers in preprocessor definitions).

Timmmm force-pushed the riscv branch from ed7d88d to 2fa4ab3 · July 28, 2025 10:42
