Commit dbd950a

more explanation in parsing
1 parent a7f4b56 commit dbd950a

3 files changed: 89 additions, 95 deletions

Makefile

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
 LATEXMK= latexmk -pdf
 
 all:
-$(LATEXMK) book
+$(LATEXMK) -f book
 
 cont: continuous
 continuous:

book.bib

Lines changed: 4 additions & 21 deletions
@@ -1,52 +1,35 @@
 @book{Tomita:1985qr,
-address = {Norwell, MA, USA},
 author = {Masaru Tomita},
-date-added = {2008-12-02 14:16:33 -0700},
-date-modified = {2008-12-02 14:16:39 -0700},
-isbn = {0898382025},
 publisher = {Kluwer Academic Publishers},
 title = {Efficient Parsing for Natural Language: A Fast Algorithm for Practical Systems},
 year = {1985}}
 
 @article{Earley:1970ly,
-acmid = {362035},
-address = {New York, NY, USA},
 author = {Earley, Jay},
-date-added = {2011-05-28 11:31:46 -0600},
-date-modified = {2011-05-28 11:31:48 -0600},
-doi = {http://doi.acm.org/10.1145/362007.362035},
-issn = {0001-0782},
 issue = {2},
 journal = {Commun. ACM},
-keywords = {compilers, computational complexity, context-free grammar, parsing, syntax analysis},
 month = {February},
 numpages = {9},
 pages = {94--102},
 publisher = {ACM},
 title = {An efficient context-free parsing algorithm},
-url = {http://doi.acm.org/10.1145/362007.362035},
 volume = {13},
-year = {1970},
-Bdsk-File-1 = {YnBsaXN0MDDRAQJccmVsYXRpdmVQYXRoXnA5NC1lYXJsZXkucGRmCAsYAAAAAAAAAQEAAAAAAAAAAwAAAAAAAAAAAAAAAAAAACc=},
-Bdsk-Url-1 = {http://doi.acm.org/10.1145/362007.362035}}
+year = {1970}}
 
-@Book{Hopcroft06:_automata,
+@book{Hopcroft06:_automata,
 author = {John Hopcroft and Rajeev Motwani and Jeffrey Ullman},
 title = {Introduction to Automata Theory, Languages, and Computation},
 publisher = {Pearson},
 year = 2006}
 
 @techreport{Lesk:1975uq,
 author = {M. E. Lesk and E. Schmidt},
-date-added = {2007-08-27 13:37:27 -0600},
-date-modified = {2009-08-25 22:28:17 -0600},
 institution = {Bell Laboratories},
 month = {July},
 title = {Lex - A Lexical Analyzer Generator},
-year = {1975},
-Bdsk-File-1 = {YnBsaXN0MDDRAQJccmVsYXRpdmVQYXRoV2xleC5wZGYICxgAAAAAAAABAQAAAAAAAAADAAAAAAAAAAAAAAAAAAAAIA==}}
+year = {1975}}
 
-@Misc{shinan20:_lark_docs,
+@misc{shinan20:_lark_docs,
 author = {Erez Shinan},
 title = {Lark Documentation},
 url = {https://lark-parser.readthedocs.io/en/latest/index.html},

book.tex

Lines changed: 84 additions & 73 deletions
@@ -196,6 +196,7 @@
196196

197197
%\listoftables
198198

199+
199200
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
200201
\chapter*{Preface}
201202
\addcontentsline{toc}{fmbm}{Preface}
@@ -247,7 +248,7 @@ \chapter*{Preface}
 the fundamental tools of compiler construction: \emph{abstract
 syntax trees} and \emph{recursive functions}.
 {\if\edition\pythonEd
-\item In Chapter~\ref{ch:parsing-Lvar} we learn how to use the Lark
+\item In Chapter~\ref{ch:parsing} we learn how to use the Lark
 parser generator to create a parser for the language of integer
 arithmetic and local variables. We learn about the parsing
 algorithms inside Lark, including Earley and LALR(1).
@@ -307,14 +308,13 @@ \chapter*{Preface}
 mathematics.
 %
 At the beginning of the course, students form groups of two to four
-people. The groups complete one chapter every two weeks, starting
-with chapter~\ref{ch:Lvar} and finishing with
-chapter~\ref{ch:Llambda}. Many chapters include a challenge problem
-that we assign to the graduate students. The last two weeks of the
+people. The groups complete approximately one chapter every two
+weeks, starting with chapter~\ref{ch:Lvar}. The last two weeks of the
 course involve a final project in which students design and implement
 a compiler extension of their choosing. The last few chapters can be
-used in support of these projects. For compiler courses at
-universities on the quarter system (about ten weeks in length), we
+used in support of these projects. Many chapters include a challenge
+problem that we assign to the graduate students. For compiler courses
+at universities on the quarter system (about ten weeks in length), we
 recommend completing the course through chapter~\ref{ch:Lvec} or
 chapter~\ref{ch:Lfun} and providing some scaffolding code to the
 students for each compiler pass.
@@ -337,7 +337,6 @@ \chapter*{Preface}
 Technology, University of Freiburg, University of Massachusetts
 Lowell, and the University of Vermont.
 
-
 \begin{figure}[tp]
 \begin{tcolorbox}[colback=white]
 {\if\edition\racketEd
@@ -370,32 +369,35 @@ \chapter*{Preface}
 \fi}
 {\if\edition\pythonEd
 \begin{tikzpicture}[baseline=(current bounding box.center)]
-\node (C1) at (0,1.5) {\small Ch.~\ref{ch:trees-recur} Preliminaries};
-\node (C2) at (4,1.5) {\small Ch.~\ref{ch:Lvar} Variables};
-\node (C3) at (8,1.5) {\small Ch.~\ref{ch:register-allocation-Lvar} Registers};
-\node (C4) at (0,0) {\small Ch.~\ref{ch:Lif} Conditionals};
-\node (C5) at (4,0) {\small Ch.~\ref{ch:Lvec} Tuples};
-\node (C6) at (8,0) {\small Ch.~\ref{ch:Lfun} Functions};
-\node (C9) at (0,-1.5) {\small Ch.~\ref{ch:Lwhile} Loops};
-\node (C8) at (4,-1.5) {\small Ch.~\ref{ch:Ldyn} Dynamic};
+\node (Prelim) at (0,1.5) {\small Ch.~\ref{ch:trees-recur} Preliminaries};
+\node (Var) at (4,1.5) {\small Ch.~\ref{ch:Lvar} Variables};
+\node (Parse) at (8,1.5) {\small Ch.~\ref{ch:parsing} Parsing};
+\node (Reg) at (0,0) {\small Ch.~\ref{ch:register-allocation-Lvar} Registers};
+\node (Cond) at (4,0) {\small Ch.~\ref{ch:Lif} Conditionals};
+\node (Loop) at (8,0) {\small Ch.~\ref{ch:Lwhile} Loops};
+\node (Fun) at (0,-1.5) {\small Ch.~\ref{ch:Lfun} Functions};
+\node (Tuple) at (4,-1.5) {\small Ch.~\ref{ch:Lvec} Tuples};
+\node (Dyn) at (8,-1.5) {\small Ch.~\ref{ch:Ldyn} Dynamic};
 % \node (CO) at (0,-3) {\small Ch.~\ref{ch:Lobject} Objects};
-\node (C7) at (8,-1.5) {\small Ch.~\ref{ch:Llambda} Lambda};
-\node (C10) at (4,-3) {\small Ch.~\ref{ch:Lgrad} Gradual Typing};
-\node (C11) at (8,-3) {\small Ch.~\ref{ch:Lpoly} Generics};
-
-\path[->] (C1) edge [above] node {} (C2);
-\path[->] (C2) edge [above] node {} (C3);
-\path[->] (C3) edge [above] node {} (C4);
-\path[->] (C4) edge [above] node {} (C5);
-\path[->,style=dotted] (C5) edge [above] node {} (C6);
-\path[->] (C5) edge [above] node {} (C7);
-\path[->] (C6) edge [above] node {} (C7);
-\path[->] (C4) edge [above] node {} (C8);
-\path[->] (C4) edge [above] node {} (C9);
-\path[->] (C7) edge [above] node {} (C10);
-\path[->] (C8) edge [above] node {} (C10);
-% \path[->] (C8) edge [above] node {} (CO);
-\path[->] (C10) edge [above] node {} (C11);
+\node (Lam) at (0,-3) {\small Ch.~\ref{ch:Llambda} Lambda};
+\node (Gradual) at (4,-3) {\small Ch.~\ref{ch:Lgrad} Gradual Typing};
+\node (Generic) at (8,-3) {\small Ch.~\ref{ch:Lpoly} Generics};
+
+\path[->] (Prelim) edge [above] node {} (Var);
+\path[->] (Var) edge [above] node {} (Reg);
+\path[->] (Var) edge [above] node {} (Parse);
+\path[->] (Reg) edge [above] node {} (Cond);
+\path[->] (Cond) edge [above] node {} (Tuple);
+\path[->,style=dotted] (Tuple) edge [above] node {} (Fun);
+\path[->] (Cond) edge [above] node {} (Fun);
+\path[->] (Tuple) edge [above] node {} (Lam);
+\path[->] (Fun) edge [above] node {} (Lam);
+\path[->] (Cond) edge [above] node {} (Dyn);
+\path[->] (Cond) edge [above] node {} (Loop);
+\path[->] (Lam) edge [above] node {} (Gradual);
+\path[->] (Dyn) edge [above] node {} (Gradual);
+% \path[->] (Dyn) edge [above] node {} (CO);
+\path[->] (Gradual) edge [above] node {} (Generic);
 \end{tikzpicture}
 \fi}
 \end{tcolorbox}
@@ -506,9 +508,11 @@ \chapter{Preliminaries}
 syntax}\index{subject}{abstract syntax
 tree}\index{subject}{AST}\index{subject}{program}\index{subject}{parse}
 The process of translating from concrete syntax to abstract syntax is
-called \emph{parsing}~\citep{Aho:2006wb}\python{ and is studied in
-chapter~\ref{ch:parsing-Lvar}}.
-\racket{This book does not cover the theory and implementation of parsing.}%
+called \emph{parsing}\python{ and is studied in
+chapter~\ref{ch:parsing}}.
+\racket{This book does not cover the theory and implementation of parsing.
+We refer the readers interested in parsing to the thorough treatment
+of parsing by \citet{Aho:2006wb}.}%
 %
 \racket{A parser is provided in the support code for translating from
 concrete to abstract syntax.}%
@@ -4090,23 +4094,23 @@ \section{Challenge: Partial Evaluator for \LangVar{}}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 {\if\edition\pythonEd
 \chapter{Parsing}
-\label{ch:parsing-Lvar}
+\label{ch:parsing}
 \setcounter{footnote}{0}
 \index{subject}{parsing}
 
 In this chapter we learn how to use the Lark parser
-generator~\citep{shinan20:_lark_docs} to translate the concrete syntax
+framework~\citep{shinan20:_lark_docs} to translate the concrete syntax
 of \LangInt{} (a sequence of characters) into an abstract syntax tree.
 You will then be asked to use Lark to create a parser for \LangVar{}.
-We then learn about the parsing algorithms used inside Lark, studying
-the \citet{Earley:1970ly} and LALR algorithms.
+We also describe the parsing algorithms used inside Lark, studying the
+\citet{Earley:1970ly} and LALR(1) algorithms.
 
-A parser generator takes in a specification of the concrete syntax and
-produces a parser. Even though a parser generator does most of the
-work for us, using one properly requires some knowledge. In
-particular, we must learn about the specification languages used by
-parser generators and we must learn how to deal with ambiguity in our
-language specifications.
+A parser framework such as Lark takes in a specification of the
+concrete syntax and the input program and produces a parse tree. Even
+though a parser framework does most of the work for us, using one
+properly requires some knowledge. In particular, we must learn about
+its specification languages and we must learn how to deal with
+ambiguity in our language specifications.
 
 The process of parsing is traditionally subdivided into two phases:
 \emph{lexical analysis} (also called scanning) and \emph{syntax
@@ -4119,16 +4123,16 @@ \chapter{Parsing}
 the use of a faster but less powerful algorithm for lexical analysis
 and the use of a slower but more powerful algorithm for parsing.
 %
-Likewise, parser generators typical come in pairs, with separate
-generators for the lexical analyzer (or lexer for short) and for the
-parser. A paricularly influential pair of generators were
-\texttt{lex} and \texttt{yacc}. The \texttt{lex} generator was written
-by \citet{Lesk:1975uq} at Bell Labs. The \texttt{yacc} generator was
-written by \citet{Johnson:1979qy} at AT\&T and stands for Yet Another
-Compiler Compiler.
-
-The Lark parse generator that we use in this chapter includes both a
-lexical analyzer and a parser. The next section discusses lexical
+%% Likewise, parser generators typical come in pairs, with separate
+%% generators for the lexical analyzer (or lexer for short) and for the
+%% parser. A paricularly influential pair of generators were
+%% \texttt{lex} and \texttt{yacc}. The \texttt{lex} generator was written
+%% by \citet{Lesk:1975uq} at Bell Labs. The \texttt{yacc} generator was
+%% written by \citet{Johnson:1979qy} at AT\&T and stands for Yet Another
+%% Compiler Compiler.
+%
+The Lark parse framework that we use in this chapter includes both
+lexical analyzers and parsers. The next section discusses lexical
 analysis and the remainder of the chapter discusses parsing.
 
 
41334137

41344138

@@ -4522,10 +4526,13 @@ \section{The Earley Algorithm}
 more efficient but can only handle a subset of the context-free
 grammars.
 
-The Earley algorithm uses a data structure called a
-\emph{chart}\index{subject}{chart} to keep track of its progress. The
-chart is an array with one slot for each position in the input string,
-where position $0$ is before the first character and position $n$ is
+The Earley algorithm can be viewed as an interpreter; it treats the
+grammar as the program being interpreted and it treats the concrete
+syntax of the program-to-be-parsed as its input. The Earley algorithm
+uses a data structure called a \emph{chart}\index{subject}{chart} to
+keep track of its progress and to memoize its results. The chart is an
+array with one slot for each position in the input string, where
+position $0$ is before the first character and position $n$ is
 immediately after the last character. So the array has length $n+1$
 for an input string of length $n$. Each slot in the chart contains a
 set of \emph{dotted rules}. A dotted rule is simply a grammar rule
@@ -4553,8 +4560,8 @@ \section{The Earley Algorithm}
 \begin{lstlisting}
 lang_int: . stmt_list (0)
 \end{lstlisting}
-in slot $0$ of the chart. The algorithm then proceeds to its
-\emph{prediction} phase in which it adds more dotted rules to the
+in slot $0$ of the chart. The algorithm then proceeds with
+\emph{prediction} actions in which it adds more dotted rules to the
 chart based on which nonterminals come after a period. In the above,
 the nonterminal \code{stmt\_list} appears after a period, so we add all
 the rules for \code{stmt\_list} to slot $0$, with a period at the
47674774
\section{The LALR(1) Algorithm}
47684775
\label{sec:lalr}
47694776

4770-
The LALR(1) algorithm consists of a finite automata and a stack to
4771-
record its progress in parsing the input string. Each element of the
4772-
stack is a pair: a state number and a grammar symbol (a terminal or
4773-
nonterminal). The symbol characterizes the input that has been parsed
4774-
so-far and the state number is used to remember how to proceed once
4775-
the next symbol-worth of input has been parsed. Each state in the
4776-
finite automata represents where the parser stands in the parsing
4777+
The LALR(1) algorithm can be viewed as a two phase approach in which
4778+
it first compiles the grammar into a state machine and then runs the
4779+
state machine to parse the input string. The state machine also uses
4780+
a stack to record its progress in parsing the input string. Each
4781+
element of the stack is a pair: a state number and a grammar symbol (a
4782+
terminal or nonterminal). The symbol characterizes the input that has
4783+
been parsed so-far and the state number is used to remember how to
4784+
proceed once the next symbol-worth of input has been parsed. Each
4785+
state in the machine represents where the parser stands in the parsing
47774786
process with respect to certain grammar rules. In particular, each
47784787
state is associated with a set of dotted rules.
47794788
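
Reviewer note on the LALR(1) hunk above: the "run the state machine" phase can be sketched as a table-driven loop over a stack of (state, symbol) pairs. The tables below were built by hand for a hypothetical two-rule grammar (`exp: INT | exp PLUS INT`), so the state numbers do not correspond to the book's figure, and the names `RULES`, `ACTION`, `GOTO`, and `lr_parse` are illustrative assumptions rather than Lark's API. The first phase, compiling the grammar into the tables, is not shown.

```python
# Hypothetical grammar:  rule 1: exp -> INT      rule 2: exp -> exp PLUS INT
# RULES maps a rule number to (left-hand side, length of right-hand side).
RULES = {1: ("exp", 1), 2: ("exp", 3)}

# ACTION maps (state, next token) to a parser action; "$" marks end of input.
ACTION = {
    (0, "INT"): ("shift", 2),
    (1, "PLUS"): ("shift", 3),
    (1, "$"): ("accept", None),
    (2, "PLUS"): ("reduce", 1),
    (2, "$"): ("reduce", 1),
    (3, "INT"): ("shift", 4),
    (4, "PLUS"): ("reduce", 2),
    (4, "$"): ("reduce", 2),
}

# GOTO maps (state, nonterminal) to the state entered after a reduction.
GOTO = {(0, "exp"): 1}

def lr_parse(tokens):
    """Return True if the token-type sequence is accepted by the tables above."""
    stack = [(0, None)]            # each element is a (state number, grammar symbol) pair
    tokens = list(tokens) + ["$"]
    pos = 0
    while True:
        state = stack[-1][0]
        action = ACTION.get((state, tokens[pos]))
        if action is None:
            return False           # no transition on this token: syntax error
        kind, arg = action
        if kind == "shift":
            # Push the token together with the target state, then consume it.
            stack.append((arg, tokens[pos]))
            pos += 1
        elif kind == "reduce":
            lhs, length = RULES[arg]
            # Pop one stack element per right-hand-side symbol, then push
            # the left-hand side with the state given by GOTO.
            del stack[len(stack) - length:]
            stack.append((GOTO[(stack[-1][0], lhs)], lhs))
        else:
            return True            # accept
```

For the input `INT PLUS INT`, the loop shifts `INT`, reduces it to `exp`, shifts `PLUS` and `INT`, reduces by rule 2, and accepts at the end of the input.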

@@ -4797,7 +4806,7 @@ \section{The LALR(1) Algorithm}
 \emph{item}. There are several rules that could apply next, both rule
 2 and 3, so state 1 also shows those rules with a period at the
 beginning of their right-hand sides. The edges between states indicate
-which transitions the automata should make depending on the next input
+which transitions the machine should make depending on the next input
 token. So, for example, if the next input token is \code{INT} then the
 parser will push \code{INT} and the target state 4 on the stack and
 transition to state 4. Suppose we are now at the end of the input. In
@@ -10155,7 +10164,7 @@ \subsection{Optimize Blocks}
 the constant \TRUE{} in \code{explicate\_pred}, in which we discard the
 \code{els} continuation.
 %
-{\if\edition\racketEd
+{\if\edition\racketEd
 The following example program falls into this
 case, and it creates two unused blocks.
 \begin{center}
@@ -10277,11 +10286,12 @@ \subsection{Optimize Blocks}
 [else
 (let ([label (gensym 'block)])
 (set! basic-blocks (cons (cons label t) basic-blocks))
-(Goto label))]))
+(Goto label))])))
 \end{lstlisting}
 \end{minipage}
 \end{center}
 \fi}
+
 {\if\edition\pythonEd
 %
 Here is the new version of the \code{create\_block} auxiliary function
@@ -20663,6 +20673,7 @@ \section{Type Checking \LangGrad{}}
 
 \fi}
 
+
 \clearpage
 
 \section{Interpreting \LangCast{}}
@@ -20780,7 +20791,7 @@ \section{Interpreting \LangCast{}}
 from \CANYTY{} to \INTTY{}.
 }
 \python{
-For the subscript \code{v[i]} in \code{f([v[i])} of \code{map\_inplace},
+For the subscript \code{v[i]} in \code{f(v[i])} of \code{map\_inplace},
 the proxy casts the integer from \INTTY{} to \CANYTY{}.
 For the subscript on the left of the assignment,
 the proxy casts the tagged value from \CANYTY{} to \INTTY{}.