Description
I'm considering writing a PEG grammar for Ada, which requires Unicode identifiers. (The grammar is going to be another huge one, so we'll see if npeg can handle it; I saw the issue about Nim choking on an SQL grammar, so...)
I see a couple ways to get this to work:
- I could take all the Unicode codepoints making up each general category, per the Ada specification (Sec. 2.3), generate rules matching them, and then split each multi-byte codepoint into its individual chars.
- Write a lexer to generate tokens, and parse on those.
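To make the first option concrete, here is a rough sketch (in Python rather than Nim, purely to illustrate the idea) of what the generated rules would have to encode: every codepoint in a general category, expanded into its UTF-8 byte sequence. The helper name `category_byte_sequences` is made up for this example, and it assumes the grammar matches on UTF-8 encoded input one byte at a time.

```python
import unicodedata


def category_byte_sequences(category, start, stop):
    """Map each codepoint in [start, stop) that belongs to the given
    Unicode general category (e.g. "Lu") to its UTF-8 byte sequence.

    Hypothetical helper: each value is what a byte-level grammar rule
    would have to match in order to accept that one codepoint.
    """
    result = {}
    for cp in range(start, stop):
        # Surrogates cannot be encoded as UTF-8 and are never letters.
        if 0xD800 <= cp <= 0xDFFF:
            continue
        if unicodedata.category(chr(cp)) == category:
            result[cp] = list(chr(cp).encode("utf-8"))
    return result


if __name__ == "__main__":
    # Uppercase letters in Latin-1 Supplement: each needs a two-byte rule.
    for cp, seq in category_byte_sequences("Lu", 0xC0, 0xE0).items():
        print(f"U+{cp:04X}", " ".join(f"{b:02X}" for b in seq))
```

Run over the whole codepoint range, this produces thousands of alternatives per category, which is where the ugliness comes from.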
The first one would be very, very ugly: doable, but ugly. Granted, if I go top-down instead of the order the reference manual goes in (bottom-up), those rules would sit way at the bottom, so nobody would probably ever have to read them. The second one is a bit challenging. One of the closed issues mentioned being able to keep state in the peg macro, but I didn't see anything about that in the README, so is that still a thing?
I suppose a third option would be some kind of hack where I match identifiers by hijacking the parsing inside the code blocks (this might actually be a good feature to add if it isn't already possible, for issues just like this one). If I could do that, I could just import nim-unicodedb and check against its version of the UCD.
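For illustration, the check that such a hook would perform could look roughly like this. It is a Python sketch using the standard `unicodedata` module in place of nim-unicodedb, and the function name `identifier_length` is invented here. The category sets reflect my reading of the identifier rules in Sec. 2.3 (start: letters and letter numbers; extend: marks, decimal digits, connector punctuation), so treat them as an assumption to verify against the manual. The additional restrictions on connector punctuation (no two in a row, none at the end) are deliberately left out.

```python
import unicodedata

# Assumed from Ada RM Sec. 2.3; verify against the edition you target.
IDENTIFIER_START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
IDENTIFIER_EXTEND = {"Mn", "Mc", "Nd", "Pc"}


def identifier_length(text, pos=0):
    """Return how many characters starting at `pos` form an identifier,
    or 0 if no identifier starts there.

    Simplification: the rules about consecutive or trailing connector
    punctuation are not enforced.
    """
    if pos >= len(text):
        return 0
    if unicodedata.category(text[pos]) not in IDENTIFIER_START:
        return 0
    end = pos + 1
    while end < len(text):
        cat = unicodedata.category(text[end])
        if cat not in IDENTIFIER_START and cat not in IDENTIFIER_EXTEND:
            break
        end += 1
    return end - pos


if __name__ == "__main__":
    source = "Größe := 1;"
    n = identifier_length(source)
    print(repr(source[:n]))
```

The parser would call something like this at the current position and, on a non-zero result, consume that many characters as a single identifier match. Whether npeg lets a code block take over matching this way is exactly the open question above.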