
Python backend #485

Open: wants to merge 8 commits into master
Conversation


@AiStudent AiStudent commented Aug 27, 2024

Python backend using Python Lex-Yacc (PLY).

1 - Requires Python 3.10+ and ply.

2 - Getting-started documentation added to docs/user_guide.rst.

3 - I'll not pledge to be an active maintainer for 3 years, but who really knows.

4 - Questions (especially tricky or critical ones) are welcome.

Testsuite results:
The example grammars work (layout is not supported).
The regression tests yield:

  • Failures:
    • Parameterized tests:Python:Python:100_coercion_lists
    • Parameterized tests:Python:Python:249_unicode
    • Parameterized tests:Python:Python:70_WhiteSpaceSeparator
    • Parameterized tests:Python:Python:235_SymbolsOverlapTokens
  • Errors:
    • Parameterized tests:Python:Python:289_LexerKeywords
    • Parameterized tests:Python:Python:222_IntegerList
    • Parameterized tests:Python:Python:256_Regex
    • Parameterized tests:Python:Python:278_Keywords
    • Parameterized tests:Python:Python:358_MixFixLists
    • Parameterized tests:Python:Python:266_define

@andreasabel
Member

andreasabel commented Oct 30, 2024

@AiStudent Thanks for this PR. My apologies for not evaluating it earlier.
I checked it out now and tried to experiment with it.
But the dependency ply seems hard to install and already looks poorly maintained.

  • Homebrew discontinued support for its formula: https://formulae.brew.sh/formula/python-ply

    Error: python-ply has been disabled because it does not meet homebrew/core's requirements for Python library formulae! It was disabled on 2024-08-15.

  • In general, it looks abandoned. Its last release (3.11) was in 2018, but Python is now at 3.13. https://pypi.org/project/ply/#history
  • It might not be fully abandoned, but seems to be superseded by SLY, according to this comment by the author of PLY:
    SLY and PLY dabeaz/ply#228 (comment)

    SLY is the more modern project and has more interesting features (such as recently added support for some EBNF grammar extensions).
    PLY was originally written in 2001 ... there will be no future pip-installable version of PLY. So, if people want to use it, they'd need to copy the code here.

It does not seem wise to add a new backend for a parser generator that is already phased out.

Could you instead target a parser generator that has a more promising future?

SLY, by the same author, also does not seem well-maintained: there is no release on the Python package index: https://pypi.org/project/sly/

CC @aarneranta

Member

@andreasabel andreasabel left a comment


PLY seems outdated. Please target a well-maintained python parser generator instead.

@andreasabel
Member

Alternatives?

@AiStudent
Author

AiStudent commented Nov 5, 2024

Retargeted to Lark with LALR. @andreasabel

Testsuite results:
The example grammars work (layout is not supported and internal has no effect).
The regression tests yield:

Failures:

  • Parameterized tests:Python:Python:100_coercion_lists
  • Parameterized tests:Python:Python:70_WhiteSpaceSeparator
  • Parameterized tests:Python:Python:358_MixFixLists
  • Parameterized tests:Python:Python:479_LabelsCaseSensitive
  • Current parameterized test:Python:479_LabelsCaseSensitive

Errors:

  • Parameterized tests:Python:Python:289_LexerKeywords
  • Parameterized tests:Python:Python:222_IntegerList
  • Parameterized tests:Python:Python:256_Regex
  • Parameterized tests:Python:Python:278_Keywords
  • Parameterized tests:Python:Python:235_SymbolsOverlapTokens
  • Parameterized tests:Python:Python:266_define

@andreasabel
Member

Great!
So the failure list is unchanged over the PLY target.
I'll have another look.

@AiStudent
Author

So the failure list is unchanged over the PLY target.

Note it is slightly different: 249_unicode -> 358_MixFixLists for example.


@andreasabel
Member

Testsuite results: The example grammars work (no layout supported and internal has no effect). The regression tests yield:

I went through all of the failed tests:

Parameterized tests:Python:Python:100_coercion_lists

* expected: bar 42 . 42 . (41 + 1)

* but got:  bar 42 . 42 . 41 + 1

Since bar takes an Exp2 list, an Exp1 like 41 + 1 needs to be printed in parentheses.

Parameterized tests:Python:Python:70_WhiteSpaceSeparator

The printer seems not to honor whitespace in separators:

* expected: 1  2  3  4  5  6  7  8  9

* but got:  1 2 3 4 5 6 7 8 9

Parameterized tests:Python:Python:358_MixFixLists

The generated printer seems to emit too many spaces:

-"A" . "B" . "QED" . "line1\n\tline2" . "\"%s\" is not an int" . semi-exprs: 1 + 1;
+"A"  . "B"  . "QED"  . "line1\n\tline2"  . "\"%s\" is not an int"  . semi-exprs: 1 + 1;

Parameterized tests:Python:Python:479_LabelsCaseSensitive

* expected: 2 ^ (2 ** 2)

* but got:  2 ** (2 ** 2)

Grammar is:

Eexp.  Exp  ::= Exp "^"  Exp1 ;
EExp.  Exp  ::= Exp "**" Exp1 ;

The reason for the failure is that the generated method names in ParsingDefs.py lose the case distinction:

grammar = r"""
...
  ?exp: exp "^" exp1 -> r_eexp
  | exp "**" exp1 -> r_eexp
  | exp1
...
"""
...
#transformer
class TreeTransformer(Transformer):
  @v_args(inline=True)
  def r_eexp(self, exp_1, exp_2):
    return Eexp(exp_1, exp_2)

  @v_args(inline=True)
  def r_eexp(self, exp_1, exp_2):
    return EExp(exp_1, exp_2)
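One possible fix (a hypothetical sketch, not the PR's actual code) is a case-preserving name mangling when deriving the Lark rule names, so that labels differing only in case map to distinct method names; the mangle helper below is purely illustrative:

```python
# Hypothetical sketch: case-preserving name mangling for generated
# transformer methods, so labels differing only in case (Eexp vs EExp)
# no longer collide once the Lark grammar is lowercased.
def mangle(label):
    # Prefix each uppercase letter with '_' before lowercasing it.
    return "r_" + "".join("_" + c.lower() if c.isupper() else c
                          for c in label)

print(mangle("Eexp"))  # r__eexp
print(mangle("EExp"))  # r__e_exp
```

Since the encoding is injective on label names, the two transformer methods no longer shadow each other.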

Parameterized tests:Python:Python:289_LexerKeywords

Same problem as in 278_Keywords (see below).
Clashes with Python's True and False are not avoided.

Parameterized tests:Python:Python:222_IntegerList

Missing python module name sanitization:

from bnfcPyGen222_Integer-List-1.ParsingDefs import *
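A minimal sketch of the kind of sanitisation needed here (sanitize_module_name is an illustrative helper, not part of the PR):

```python
import re

# Hypothetical sketch: map a grammar file name to a legal Python module
# name by replacing every non-identifier character with an underscore.
def sanitize_module_name(name):
    name = re.sub(r"\W", "_", name)
    if name and name[0].isdigit():
        name = "_" + name  # identifiers must not start with a digit
    return name

print(sanitize_module_name("222_Integer-List-1"))  # _222_Integer_List_1
```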

Parameterized tests:Python:Python:256_Regex

The grammar has:

token SpaceInAlts '\'' ["01x "]* '\'' ;
token Triple upper letter ["]-\'"] '\'';
token Name (char - [ "(){};.@\"" ] - [ " \n\t" ]) + ;

This is translated to:

  SPACEINALTS: /\'(\ |0|1|x)*\'/
  TRIPLE: /[A-Z][A-Za-z](\'|\-|\])\'/
  NAME: /[^\t\n\ \"\(\)\.\;\@\{\}]+/

Start symbol is:

Main ::= SpaceInAlts Name "/\\" Name "and" Triple;

Input is:

' x10 0' [x:4] /\ 'funky+identifier-umlaute:äöüß//\%#FOO and Un]'

Error is:

Unexpected token Token('NAME', "'") at line 1, column 1.
Expected one of: 
	* SPACEINALTS

I can get the test accepted if I give SPACEINALTS and TRIPLE a higher prio than NAME:

  SPACEINALTS.1: /\'(\ |0|1|x)*\'/
  TRIPLE.1: /[A-Z][A-Za-z](\'|\-|\])\'/
  NAME: /[^\t\n\ \"\(\)\.\;\@\{\}]+/

That is weird, because the docs say that the order is
https://lark-parser.readthedocs.io/en/stable/grammar.html#notes-for-when-using-a-lexer

Literals are matched according to the following precedence:
1. Highest priority first (priority is specified as: TERM.number: …)
2. Length of match (for regexps, the longest theoretical match is used)
3. Length of literal / pattern definition
4. Name

So if there is just one prio, the length of the match should win. Clearly, SPACEINALTS can match a longer string (' x10 0') than NAME (').

Is this a bug in LARK or its docs?

Parameterized tests:Python:Python:278_Keywords

This testcase aims at name clashes with the host language:

bnfcPyGentest/Absyn.py", line 143
    class False:
          ^^^^^
SyntaxError: invalid syntax

You need to make sure you sanitise Python keywords and reserved identifiers.
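A minimal sketch of such sanitisation, assuming the common convention of appending an underscore (sanitize_ident is illustrative, not the PR's code):

```python
import builtins
import keyword

# Hypothetical sketch: rename generated identifiers that clash with
# Python keywords (class, False, ...) or builtins (list, type, ...).
def sanitize_ident(name):
    if keyword.iskeyword(name) or hasattr(builtins, name):
        return name + "_"
    return name

print(sanitize_ident("False"))  # False_
print(sanitize_ident("Main"))   # Main
```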

Parameterized tests:Python:Python:235_SymbolsOverlapTokens

Running this test reports:

Unexpected token Token('MYPOSID', 'Foo') at line 1, column 4160.
Expected one of: 
	* IDENT

The grammar has start symbol:

Init.        Main ::= ... Ident MyId MyPosId ...

and the input is:

... Foo_ Bar1 Bar! ...

The generated lexer lexes Foo_ as MYPOSID, probably because the regex generated for IDENT is not correct: IDENT should match valid Haskell identifiers, which include underscores.

  MYID: /[A-Za-z](\d|[A-Za-z])*/
  MYPOSID: /[A-Za-z](\!|(\d|[A-Za-z]))*/
  IDENT: /[A-Za-z]\w*/
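The overlap can be reproduced with Python's re module (same regexes as above): on the input Foo_, IDENT actually has the longer match, yet the lexer picked MYPOSID.

```python
import re

# The terminal regexes from the generated lexer above.
MYPOSID = r"[A-Za-z](\!|(\d|[A-Za-z]))*"
IDENT = r"[A-Za-z]\w*"

print(re.match(MYPOSID, "Foo_").group())  # Foo   ('_' not in the set)
print(re.match(IDENT, "Foo_").group())    # Foo_  (\w includes '_')
```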

Parameterized tests:Python:Python:266_define

This produces some ill-formed python:

ParsingDefs.py", line 104, in d2_r_esqrt4
    return where(where(exp_))
           ^^^^^
NameError: name 'where' is not defined

Member

@andreasabel andreasabel left a comment


Thanks for the port to LARK!

Looking through the failed tests, I guess some more refinements are needed concerning

  • name sanitisation (to avoid python syntax errors and name clashes),
  • case sensitivity of rule names,
  • whitespace handling in printing and
  • precedence handling in printing lists.

There is one problem with token precedence where I don't know whether it is a bug in LARK (see test case 256_Regex). Also, LARK claims not to respect the order of token definitions when resolving ambiguities; usually, the earlier definition has higher priority when the same length of input is matched. LARK allows assigning priorities manually, but these seem to override the "length" priority, according to the definition.
Not sure how this mismatch is resolved.

What is your take on this?

Frankly, I am a bit confused by the LARK docs; e.g., I don't know what the Name in the priority list means. Probably it means alphabetical order, but that just doesn't make sense to me.

makefile pkgName optMakefileName basename = vcat
  [ Makefile.mkRule "all" []
      [ " " ]
Member


Since no compilation is needed for Python, maybe the all goal could simply explain this, e.g.

@echo "Doing nothing: No compilation of the parser needed."

-- | Entrypoint for BNFC to use the Python backend.
makePython :: SharedOptions -> CF -> MkFiles ()
makePython opts cf = do
let pkgName = "bnfcPyGen" ++ name
Member


The failure of test 222_IntegerList is caused by the lack of filename sanitisation: the grammar file here is named 222_Integer-List-1, which yields an illegal Python module name:

from bnfcPyGen222_Integer-List-1.ParsingDefs import *

@AiStudent
Author

AiStudent commented Nov 11, 2024

Thanks for the extensive feedback. Currently at least two issues remain.

Testsuite results:

  • Errors:
    • 222_IntegerList - Recursion error: a list with 10k elements breaks the recursive list concatenation.
    • 256_Regex - Token precedence
    • 235_SymbolsOverlapTokens - Token precedence

It seems that the length of the regex decides the precedence, not the lengths of the matched tokens. In the below test program I try to tokenize "ab" with the literals A or AB:

from lark import Lark
grammar = r"""
    start: A
     | AB
      
    A:  /a(c)*/
    AB: /ab/
"""
parser = Lark(grammar, start='start', parser='lalr', lexer='basic', priority='auto')
print(list(parser.lex("ab")))

Which yields:

lark.exceptions.UnexpectedCharacters: No terminal matches 'b' in the current parser context, at line 1 col 2

ab
 ^
Expected one of: 
	* A
	* AB

Previous tokens: Token('A', 'a')

In order to give AB higher precedence, one may add a priority level number (such as AB.1), make the regex of AB longer than that of A (like AB: /ab(c)*/), or remove (c)* from A. I have not found a way to make the lexer use the longest matched expression; one alternative, for the one who dares, seems to be Lark's custom lexer option.

@andreasabel
Member

@AiStudent wrote:

It seems that the length of the regex decides the precedence, not the lengths of the matched tokens.

But this seems then to be a bug in LARK, at least according to the specification:
https://lark-parser.readthedocs.io/en/stable/grammar.html#notes-for-when-using-a-lexer

Literals are matched according to the following precedence:

  1. Highest priority first (priority is specified as: TERM.number: …)
  2. Length of match (for regexps, the longest theoretical match is used)

Or maybe, this is what is meant by

(for regexps, the longest theoretical match is used)

In any case, not prioritising the actual longest match seems awkward and not in line with standard lexing tools.
I wonder whether others have bumped into this issue. Is there any discussion of this issue on bug trackers etc.?

If you have really nailed the issue, it might be worthwhile raising it to the developers of LARK.

@AiStudent wrote:

Recursion error, a list with 10k elements breaks the recursive list concatenation.

Interesting, is this because Python has a limit on recursion depth?
Anyway, lots of concatenations are usually not done directly but via concepts like StringBuilder/StringBuffer, for performance reasons. (In Haskell, one can use difference lists; that's what the Haskell backend does.)
Is there a blessed way in Python for assembling strings from many pieces?
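To the last question: the standard stdlib idioms are to collect the pieces in a list and join them once, or to write into an io.StringIO buffer. Both are linear time overall and need no recursion, so they sidestep both quadratic concatenation and Python's recursion limit (1000 frames by default). A sketch:

```python
import io

pieces = [str(i) for i in range(10000)]

# Idiomatic: build the list iteratively, then one join at the end.
joined = " ".join(pieces)

# Builder style (like StringBuilder), useful when emitting incrementally.
buf = io.StringIO()
for i, p in enumerate(pieces):
    if i:
        buf.write(" ")
    buf.write(p)
built = buf.getvalue()

print(joined == built)  # True
```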
