
Best way of handling languages that have unicode identifiers? #72

@ethindp

Description


I'm considering writing a PEG grammar for Ada, which requires Unicode identifiers. (The grammar is going to be another huge one, so we'll see if npeg can handle it; I saw the issue about Nim choking on an SQL grammar, so...)

I see a couple ways to get this to work:

  • I could take all the Unicode codepoints making up each required general category, per the Ada specification (section 2.3), generate rules matching them, and then split each multi-byte codepoint into its component chars.
  • Write a lexer that generates tokens, and parse those instead.

The first would be doable but very, very ugly. Granted, if I order the grammar top-down instead of bottom-up, as the reference manual does, those rules would sit at the very bottom where probably nobody would ever have to read them. The second is a bit more challenging: one of the closed issues mentioned being able to keep state in the peg macro, but I didn't see anything about that in the README. Is that still a thing?
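To give a feel for what the first option's rule generation would involve, here's a rough sketch in Python (using Python's `unicodedata` as a stand-in for nim-unicodedb; the `*` in the emitted rules is npeg's sequence operator, but the exact byte-literal syntax is an assumption on my part):

```python
import sys
import unicodedata

def codepoints_in_categories(categories):
    """Enumerate every codepoint whose Unicode general category is in
    the given set, e.g. {"Lu", "Ll", "Lt"} for the uppercase/lowercase/
    titlecase letter classes the Ada spec refers to."""
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) in categories:
            yield cp

def utf8_rule(cp):
    """Render one codepoint as a rule matching its UTF-8 bytes,
    splitting multi-byte ("compound") codepoints into single chars."""
    bs = chr(cp).encode("utf-8")
    return " * ".join(f"\\x{b:02x}" for b in bs)

# e.g. the uppercase letters beyond ASCII; the first is À (U+00C0)
rules = [utf8_rule(cp) for cp in codepoints_in_categories({"Lu"}) if cp > 127]
print(rules[0])  # \xc3 * \x80
```

The sheer number of rules this spits out (thousands of alternatives per category) is exactly why it would be so ugly.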

I suppose a third option would be some kind of hack where I match identifiers loosely in the grammar but hijack the parsing in the code blocks to do the real check (this might actually be a good feature to add, if it isn't already possible, for issues just like this one). If I could do that, I could just import nim-unicodedb and check against its version of the UCD.
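The category check that code block would perform is itself simple. A sketch in Python (again with `unicodedata` standing in for nim-unicodedb; the START/EXTEND category sets are my reading of the identifier_start and identifier_extend classes in RM 2.3, and this ignores the extra rules about consecutive or trailing underlines):

```python
import unicodedata

# identifier_start: letters plus number_letter; identifier_extend:
# combining marks, decimal digits, connector punctuation (per my
# reading of the Ada RM, section 2.3).
START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
EXTEND = {"Mn", "Mc", "Nd", "Pc"}

def is_ada_identifier(s: str) -> bool:
    """Category-check an already-captured lexeme, the way a code-block
    hijack in the parser might, instead of baking every codepoint into
    the grammar itself."""
    if not s:
        return False
    if unicodedata.category(s[0]) not in START:
        return False
    return all(unicodedata.category(c) in START | EXTEND for c in s[1:])

print(is_ada_identifier("Über_Größe"))  # True
print(is_ada_identifier("2fast"))       # False
```

The grammar rule would then only need to match "one non-ASCII-symbol run" cheaply and hand the captured text to a check like this.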
