
Panic Mode Recovery at End of File #38

Open
alanbarr opened this issue Aug 4, 2019 · 1 comment

Comments


alanbarr commented Aug 4, 2019

Background

Ideally I want to be able to parse out some specially formatted C++ comments and
the function which they are documenting. (Think a bespoke form of Doxygen).

After some reading, it sounded like a lexer/parser combination had already
solved the hard part of this.

The likely problem is that I'm trying to be lazy and ignore all the surrounding C++ code.
So, outside of my golden comment blocks (and, later, the function being documented),
there's a sea of syntax errors.

I was hoping I could easily pull out the interesting parts and ignore everything
else. I'm starting to think this might be outside the intended operating conditions
of such a parser, though...

Sly

I've been testing out Sly, which I've confirmed will easily do what I want when there is
no unexpected text.

However, I can't quite seem to get the error handling to cope with this rather extreme case.
Currently the problem appears to be when the unexpected text sits between a valid
statement and the EOF.

Looking at the state debug file, it looks like I need either a
COMMENT_OPEN or an $end to reduce what should be a complete expression on
the stack. However, I'm entering error() handling before hitting the end of the
file, and I wonder if I need to be signalling this somehow?
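One generic way to make end-of-input visible to a hand-rolled recovery loop (this is not sly API; it's just a sketch of the idea, and the `Tok`/`with_eof` names are my own) is to chain a synthetic EOF token onto the end of the token stream before handing it to the parser:

```python
from collections import namedtuple
from itertools import chain

# Minimal stand-in for a lexer token; a real sly token has more fields.
Tok = namedtuple("Tok", "type value")

def with_eof(tokens):
    # Append a synthetic EOF token so a downstream recovery loop can
    # distinguish "reached end of file" from "no resync token found".
    return chain(tokens, [Tok("EOF", None)])

toks = [Tok("WORD", "hello"), Tok("SEMI", ";")]
print([t.type for t in with_eof(toks)])
```

The grammar would then need an explicit EOF rule to consume the sentinel, which may or may not be nicer than special-casing `None` in error().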

I've got some simplified test code below.

Test Code

#! /usr/bin/env python3

from sly import Parser
from sly import Lexer
from pprint import pprint


class CommentLexer(Lexer):
    tokens = {COMMENT_OPEN, COMMENT_CLOSE, WORD, SEMI}

    COMMENT_OPEN = r"/\* COMMENT:"
    COMMENT_CLOSE = r"\*/"
    WORD = r"[^; \*\t\n\r\f\v]+"
    SEMI = r";"

    ignore_asterisk = r"\*"
    ignore_newline = r"\n"
    ignore_space = r" "

    def ignore_newline(self, t):
        self.lineno += t.value.count("\n")

    def error(self, t):
        print("Line %d: Bad character %r" % (self.lineno, t.value[0]))
        self.index += 1


class CommentParser(Parser):
    tokens = CommentLexer.tokens
    debugfile = "comment_parser.out"

    def __init__(self):
        self.comments = []

    @_("comment_doc comment_doc")
    def comment_doc(self, p):
        pass

    @_("COMMENT_OPEN string COMMENT_CLOSE")
    def comment_doc(self, p):
        print("#########")
        print(f"Got: {p.string}")
        print("#########")
        self.comments.append(p.string)
        return p.string

    @_("string string")
    def string(self, p):
        return p[0] + " " + p[1]

    @_("WORD")
    def string(self, p):
        return p.WORD

    def error(self, p):
        pprint(p)

        if not p:
            print("Hit the end of the file!")
            return

        print(f"Syntax error at type: {p.type} value: {p.value} line: {p.lineno}")
        while True:
            tok = next(self.tokens, None)

            if tok is None:
                print("Error Tok: Hit None")
                return tok

            if tok.type == "COMMENT_OPEN":
                print("Error Tok: Found new comment")
                return tok

            print(f"Ignoring: {tok.type}")


def test_one_comment_recovery_after():
    lexer = CommentLexer()

    test_data = """
    /* COMMENT: This is the
       only comment string I'd
       like to parse out
    */

    /* I don't care about this one. */

    """

    parser = CommentParser()
    parser.parse(lexer.tokenize(test_data))
    assert len(parser.comments) == 1


def test_one_comment_recovery_before():
    lexer = CommentLexer()

    test_data = """
    /* I don't care about this one. */

    /* COMMENT: This is the
       only comment string I'd
       like to parse out
    */

    """

    parser = CommentParser()
    parser.parse(lexer.tokenize(test_data))
    assert len(parser.comments) == 1

alberth commented Feb 11, 2020

Trying to continue parsing after a syntax error is going to be messy; your better bet is to tokenize everything and discard what you don't need.
