Add a lexer for RISC-V assembly #2145
One thing I wasn't quite sure about - there's Also more testing & suggestions very welcome!
lib/rouge/lexers/riscvasm.rb
Outdated
#
# Then it will silently ignore it!
#
rule %r/^[ \t]*#[ \t]*(:?#{RiscvAsm.preproc_directive.join('|')})\b/, Comment::Preproc, :preprocessor_directive
We tend to try and avoid constructing huge regexes with string manipulation like this. Instead, you could use the block syntax perhaps:
rule %r/^\s*#\s*(\w+)\b/ do |m|
  if self.class.preproc_directive.include?(m[1])
    token Comment::Preproc
    push :preprocessor_directive
  else
    token Comment
  end
end
Ah interesting. I think I copied that technique from the ARM asm lexer. I can add a comment there so nobody else does that...
Should I change all of the .join('|')s?
Also, any reason for this preference? I think the inline regex would be faster (at least in well designed languages; not sure about Ruby).
Yeah, there's still a ton of lexers that do it, but changing the old ones would be a huge lift.
In terms of perf, it depends how long it is. I did some benchmarks a while back and a simplified regex with one Set inclusion check by hash was better in at least one version of Ruby (to be fair, that was joining like 200 keywords). Beyond that, though, there's just a lot more that can go wrong in the general case: if someone adds a keyword that includes a * or a parenthesis, etc., the whole thing will break horribly.
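As a rough illustration of the Set-based approach described above, here is a minimal sketch in plain Ruby, outside of Rouge. The directive list and the classify helper are hypothetical names invented for this example, not the lexer's actual contents.

```ruby
require 'set'

# Hypothetical directive list, for illustration only.
DIRECTIVES = Set.new(%w[define include ifdef ifndef endif]).freeze

# Small generic pattern: capture whatever word follows the '#'.
DIRECTIVE_LINE = /^\s*#\s*(\w+)\b/

# Classify a line with one regex match plus one hash-based Set lookup,
# instead of one large alternation built by joining every keyword.
def classify(line)
  m = DIRECTIVE_LINE.match(line)
  return :other unless m

  DIRECTIVES.include?(m[1]) ? :preproc : :comment
end
```

Because the keywords never become part of the pattern, adding one that contains a regex metacharacter cannot change or break the regex itself.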
I updated it to use sets. One significant downside though is that they are less specific than using .join('|'). With the join version the regex only matches for those specific words, whereas with (\w+) it matches for all words, and there's no way to discard a match as far as I could see.
So for example your above change was actually not quite correct: if you pass # hello world, it would parse # hello as a Comment but then world as an instruction.
I was able to work around that by adding some more states and things though, and I took the opportunity to improve it a bit so e.g. registers are highlighted in preprocessor definitions.
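The partial-match problem can be reproduced with Ruby's StringScanner, without Rouge. This is only a sketch: the helper names are made up for the example, and the second helper shows one possible workaround (consuming the whole line), which is an assumption and not necessarily how the extra states in this PR handle it.

```ruby
require 'strscan'

# The generic pattern from the suggested block rule.
WORD_AFTER_HASH = /^\s*#\s*(\w+)\b/

# Returns [text consumed by the rule, text left over for later rules].
def split_at_rule(line)
  scanner = StringScanner.new(line)
  consumed = scanner.scan(WORD_AFTER_HASH)
  [consumed, scanner.rest]
end

# Hypothetical workaround: treat everything up to the newline as the comment.
def split_whole_comment(line)
  scanner = StringScanner.new(line)
  consumed = scanner.scan(/^\s*#[^\n]*/)
  [consumed, scanner.rest]
end
```

With the input "# hello world\n", split_at_rule stops after "# hello" and leaves " world\n" for whichever rule runs next, which is how world ends up lexed as an instruction.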
Also your point about adding special characters is a good one - I did actually have v0.t and didn't notice! Fairly benign but still. Probably all the .join('|')s should be changed to use something that escapes special characters.
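For the lexers that keep a joined alternation, Ruby's built-in Regexp.union escapes each string before joining. A small sketch of the difference, using v0.t from above plus a0 as an illustrative second entry; the constant and helper names are invented for this example:

```ruby
# Illustrative word list; 'v0.t' contains a regex metacharacter.
REGISTERS = ['v0.t', 'a0'].freeze

# Unescaped join: the '.' in 'v0.t' matches any character.
NAIVE = /\A(?:#{REGISTERS.join('|')})\z/

# Regexp.union escapes each string, so '.' only matches a literal dot.
SAFE = /\A#{Regexp.union(REGISTERS)}\z/

# Reports whether an unescaped join of the given words compiles at all.
def compiles?(words)
  Regexp.new(words.join('|'))
  true
rescue RegexpError
  false
end
```

Here NAIVE accepts the misspelling v0xt while SAFE rejects it, and compiles?(['a*', '(']) returns false because the unescaped parenthesis makes the pattern invalid.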
This adds a basic lexer for RISC-V assembly. Like all assembly (as far as I know) there isn't really a formal grammar, and compilers just kind of do whatever, so this is a best effort. There may be valid assembly it does not highlight properly. I have tested it on several random samples from the internet and it seems to be ok though. The included demo is from the RISC-V ISA manual: https://riscv-specs.timhutt.co.uk/spec/20240411/unpriv-isa-asciidoc.html#_sgemm_example
Sets are the preferred method. I also reorganised the states a bit to make things work slightly more nicely (e.g. it highlights registers in preprocessor definitions).