
Best way of handling languages that have unicode identifiers? #72

@ethindp

Description


I'm considering writing a PEG grammar for Ada, which requires Unicode identifiers. (The grammar is going to be another huge one, so we'll see if npeg can handle it; I saw the issue about Nim choking on an SQL grammar, so...)

I see a couple ways to get this to work:

  • I could take all the Unicode codepoints making up each required general category, per the Ada specification (section 2.3), generate rules matching them, and then split each multi-byte codepoint into its component chars.
  • Write a lexer that generates tokens, and parse those instead.

The first would be doable but very, very ugly. Granted, if I order the grammar top-down instead of bottom-up, as the reference manual does, those rules would sit at the very bottom where probably nobody would ever have to read them. The second is a bit more challenging: one of the closed issues mentioned being able to keep state in the peg macro, but I didn't see anything about that in the README. Is that still a thing?
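To give a feel for what the first option's rule generation would involve, here's a rough sketch in Python (using Python's `unicodedata` as a stand-in for nim-unicodedb; the `*` in the emitted rules is npeg's sequence operator, but the exact byte-literal syntax is an assumption on my part):

```python
import sys
import unicodedata

def codepoints_in_categories(categories):
    """Enumerate every codepoint whose Unicode general category is in
    the given set, e.g. {"Lu", "Ll", "Lt"} for the uppercase/lowercase/
    titlecase letter classes the Ada spec refers to."""
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) in categories:
            yield cp

def utf8_rule(cp):
    """Render one codepoint as a rule matching its UTF-8 bytes,
    splitting multi-byte ("compound") codepoints into single chars."""
    bs = chr(cp).encode("utf-8")
    return " * ".join(f"\\x{b:02x}" for b in bs)

# e.g. the uppercase letters beyond ASCII; the first is À (U+00C0)
rules = [utf8_rule(cp) for cp in codepoints_in_categories({"Lu"}) if cp > 127]
print(rules[0])  # \xc3 * \x80
```

The sheer number of rules this spits out (thousands of alternatives per category) is exactly why it would be so ugly.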

I suppose a third option would be some kind of hack where I match identifiers loosely in the grammar but hijack the parsing in the code blocks to do the real check (this might actually be a good feature to add, if it isn't already possible, for issues just like this one). If I could do that, I could just import nim-unicodedb and check against its version of the UCD.
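The category check that code block would perform is itself simple. A sketch in Python (again with `unicodedata` standing in for nim-unicodedb; the START/EXTEND category sets are my reading of the identifier_start and identifier_extend classes in RM 2.3, and this ignores the extra rules about consecutive or trailing underlines):

```python
import unicodedata

# identifier_start: letters plus number_letter; identifier_extend:
# combining marks, decimal digits, connector punctuation (per my
# reading of the Ada RM, section 2.3).
START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
EXTEND = {"Mn", "Mc", "Nd", "Pc"}

def is_ada_identifier(s: str) -> bool:
    """Category-check an already-captured lexeme, the way a code-block
    hijack in the parser might, instead of baking every codepoint into
    the grammar itself."""
    if not s:
        return False
    if unicodedata.category(s[0]) not in START:
        return False
    return all(unicodedata.category(c) in START | EXTEND for c in s[1:])

print(is_ada_identifier("Über_Größe"))  # True
print(is_ada_identifier("2fast"))       # False
```

The grammar rule would then only need to match "one non-ASCII-symbol run" cheaply and hand the captured text to a check like this.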
