
Commit 3b4e7c7

final-edits
1 parent 45e21b7 commit 3b4e7c7

4 files changed: +36, -39 lines


abstract.tex

Lines changed: 3 additions & 3 deletions
@@ -6,13 +6,13 @@
 yet new algorithms appear frequently. However, there is
 no satisfactory methodology for testing such matchers.

-We propose such a methodology which is based on generating positive as
-well as negative examples of words in the language. To this end, we
+We propose a testing methodology which is based on generating positive
+as well as negative examples of words in the language. To this end, we
 present a new algorithm to generate the language described by a
 generalized regular expression with intersection and complement
 operators. The complement operator allows us to generate both
 positive and negative example words from a given regular expression.
-We implement our generator in Haskell and OCaml, and show that its
+We implement our generator in Haskell and OCaml and show that its
 performance is more than adequate for testing.

 %%% Local Variables:
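
The abstract above refers to a generalized regular expression with intersection and complement operators. As a minimal, hedged Haskell sketch of what such an expression type might look like (the type and constructor names are hypothetical, not taken from this repository):

    -- Hypothetical sketch, not the paper's code: a regular expression
    -- extended with intersection and complement. The complement
    -- constructor is what makes negative example words possible.
    data GRE c
      = Zero                    -- the empty language
      | One                     -- the language containing only the empty word
      | Atom c                  -- a single character
      | Or     (GRE c) (GRE c)  -- union
      | Concat (GRE c) (GRE c)  -- concatenation
      | Star   (GRE c)          -- Kleene star
      | And    (GRE c) (GRE c)  -- intersection
      | Not    (GRE c)          -- complement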

bench.tex

Lines changed: 31 additions & 34 deletions
@@ -4,13 +4,13 @@ \section{Benchmarks}
 % Remove indentation on all itemize lists.
 \setlist[itemize]{leftmargin=*}

-We now compare the performances of our implementations in two dimensions:
+We consider the performance of our implementations in two dimensions:
 first the successive algorithmic refinements in the \haskell
 implementation presented in \cref{sec:gener-cross-sect,sec:improvements},
-then the various segments representations in \ocaml
+then the various segment representations in \ocaml
 as described in \cref{sec:ocaml}.

-Benchmarks were done on a ThinkPad T470 with an i5-7200U CPU and 12G of memory.
+Benchmarks were executed on a ThinkPad T470 with an i5-7200U CPU and 12G of memory.
 The \haskell benchmarks use the Stackage LTS 10.8 release and the \texttt{-O2} option.
 The OCaml benchmarks use \ocaml 4.06.1 with the flambda optimizer and the
 \texttt{-O3} option.
@@ -20,14 +20,13 @@ \section{Benchmarks}

 \subsection{Comparing Algorithms in the \haskell Implementation}

-\cref{sec:gener-cross-sect,sec:improvements} develop the
-algorithm for generating languages in a sequence of changes applied to
-a naive baseline algorithm. We now evaluate the impact of these
-changes on performance, which we plan to measure in terms of
-generation speed in words per second. It turns out that this speed
-depends heavily on the characteristics of the the regular expression
-considered. We thus choose three representative regular expressions to highlight the
-strengths and weaknesses of the different approaches.
+\cref{sec:gener-cross-sect,sec:improvements} develop the algorithm for
+generating languages in a sequence of changes applied to a baseline
+algorithm. We evaluate the impact of these changes on performance by
+measuring the generation speed in words per second. This speed depends
+heavily on the particular regular expression. Thus, we select four
+representative regular expressions to highlight the strengths and
+weaknesses of the different approaches.
 \begin{itemize}
 \item $\Rstar a$: This expression describes a very small language with $P (w\in L) = 0$.
 Nevertheless, it puts a lot of stress on the underlying
@@ -36,7 +35,7 @@ \subsection{Comparing Algorithms in the \haskell Implementation}
 the output language contain exactly one element. This combination
 highlights the usefulness of sparse indexing and maps.
 \item $\Rstar{(\Rconcat{a}{\Rstar{b}})}$: On the opposite end of the
-spectrum, the language of this regular expression is fairly large
+spectrum, the language of this regular expression is large
 with $P (w\in L)=0.5$. The expression applies \code{star} to a
 language where segment $n+1$ consists of the word $ab^n$. Its
 evaluation measures the performance of \code{star} on a non-sparse
@@ -60,10 +59,9 @@ \subsection{Comparing Algorithms in the \haskell Implementation}
 \label{bench:haskell:all}
 \end{figure}

-In the evaluation, we consider five variants of the Haskell implementation.
+We consider five variants of the Haskell implementation.
 \begin{itemize}
-\item \textbf{McIlroy} is our implementation of
-the algorithm by \citet{DBLP:journals/jfp/McIlroy99}.
+\item \textbf{McIlroy} is our implementation of \citet{DBLP:journals/jfp/McIlroy99}.
 \item The \textbf{seg} implementation uses the infinite list-based segmented
 representation throughout (\cref{sec:segm-repr}).
 \item The \textbf{segConv} implementation additionally
@@ -75,12 +73,11 @@ \subsection{Comparing Algorithms in the \haskell Implementation}
 symbolic segments (\cref{sec:segm-repr,sec:more-finite-repr}) with the convolution approach.
 \end{itemize}

-Performances are evaluated by iterating through the stream of words
+Performance is evaluated by iterating through the stream of words
 produced by the generator, forcing their evaluation\footnote{In
 Haskell, forcing is done using \lstinline{Control.DeepSeq}.}
-and recording the elapsed timed every 20 words.
-We stop the iteration after 5 seconds.
-The resulting graph plots the time (x-axis) against the number of words (y-axis) produced so far. The slope of the graph indicates the generation speed of the plotted algorithm, high slope is correlated to high generation speed. \cref{bench:haskell:all} contains the results for the Haskell implementations.
+and recording the elapsed time every 20 words for 5 seconds.
+The resulting graph plots the time (x-axis) against the number of words (y-axis) produced so far. The slope of the graph indicates the generation speed of the algorithm; a high slope corresponds to a high generation speed. \cref{bench:haskell:all} contains the results for the Haskell implementations.

 Most algorithms generate between $1.3\cdot10^3$ and $1.4\cdot10^6$ words in the first
 second, which seems sufficient for testing purposes.
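
The seg, segConv, and refConv variants above all build on a segmented representation of languages, with concatenation driven by convolution in the Conv variants. The following is a rough Haskell sketch of that idea only; the type and function names are assumptions, not the code in this repository, and the actual implementations use more refined segment representations:

    -- Hedged sketch: a language as an infinite list of segments,
    -- where segment n holds exactly the words of length n.
    type Segment  = [String]
    type Language = [Segment]

    -- Union merges segments of equal length, avoiding duplicates.
    union :: Language -> Language -> Language
    union = zipWith (\xs ys -> xs ++ filter (`notElem` xs) ys)

    -- Concatenation by convolution: a word of length n splits as i + (n - i),
    -- so segment n combines segment i of the left language with
    -- segment n - i of the right language, for every i.
    concatenate :: Language -> Language -> Language
    concatenate l1 l2 =
      [ concat [ [ u ++ v | u <- l1 !! i, v <- l2 !! (n - i) ]
               | i <- [0 .. n] ]
      | n <- [0 ..] ]

The refConv variant described above refines this idea further with finite and symbolic segment representations combined with the convolution approach.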
@@ -97,15 +94,14 @@ \subsection{Comparing Algorithms in the \haskell Implementation}
 remarks:
 \begin{itemize}[leftmargin=*]
 \item All implementations are equally fast on $\Rstar a$ except
-\textbf{McIlroy}, which relies on list lookups without
-sparse indexing.
+\textbf{McIlroy}, which implements star inefficiently.
 \item The graph of some implementations
 has the shape of ``skewed stairs''. We believe this phenomenon is due to
 insufficient laziness: when arriving at a new segment, part of the
 work is done eagerly which causes a plateau. When that part is done,
 the enumeration proceeds lazily. As laziness and GHC
 optimizations are hard to control, we did not attempt to correct this.
-\item $\Rstar{(\Rconcat{a}{\Rstar{b}})}$ demonstrates that
+\item The expression $\Rstar{(\Rconcat{a}{\Rstar{b}})}$ demonstrates that
 the convolution technique presented in \cref{sec:convolution}
 leads to significant improvements when applying \code{star} to non-sparse languages.
 \item The \textbf{refConv} algorithm is
@@ -117,9 +113,10 @@ \subsection{Comparing Algorithms in the \haskell Implementation}
 is $\Lang{b}$, which is also represented finitely by
 \textbf{segConv} and should thus benefit from the convolution
 improvement in the same way as \textbf{refConv}.
-\item $\Rstar{(\Rconcat{a}{\Rstar{b}})}$ shows that all our algorithm have similar
-performance profiles on set-operation. They are also significantly
-faster than \textbf{McIlroy}.
+\item The expression $\Rintersect{\Rcomplement{(\Rstar{a})}}{\Rcomplement{(\Rstar{b})}}$
+shows that all our algorithms have similar performance profiles on
+set-operations. They are also significantly faster than
+\textbf{McIlroy}.
 \end{itemize}
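
The set operations in the last item above can be pictured per segment: over a fixed alphabet, segment n of the complement contains every word of length n that the language does not, which is exactly where negative example words come from. A hedged sketch along the lines of the previous one (again assumed names, not this repository's code):

    import Data.List ((\\), intersect)

    -- As before: segment n holds the words of length n.
    type Language = [[String]]

    -- All words over the alphabet, grouped by length: the segments of sigma*.
    sigmaStar :: [Char] -> Language
    sigmaStar sigma = iterate (\seg -> [ c : w | c <- sigma, w <- seg ]) [""]

    -- Complement per segment: the length-n words not in the language.
    complementL :: [Char] -> Language -> Language
    complementL sigma = zipWith (\\) (sigmaStar sigma)

    -- Intersection per segment.
    intersectL :: Language -> Language -> Language
    intersectL = zipWith intersect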

@@ -135,9 +132,9 @@ \subsection{Comparing Data Structures in the \ocaml Implementation}

 We have now established that the \textbf{refConv} algorithm
 provides the best overall performance. The \haskell implementation,
-however, only uses lazy lists to represent segments.
+however, uses lazy lists to represent segments.
 To measure the influence of strictness and data structures on
-performances, we consider the functorized \ocaml implementation, as it facilitates such experimentation.
+performance, we conduct experiments with the functorized \ocaml implementation.
 We follow the same methodology as the \haskell evaluation using the
 regular expressions $\Rstar a$, $\Rstar{(\Rconcat{a}{\Rstar{b}})}$ and
 $\Rconcat{\Rcomplement{(\Rstar{a})}}{b}$. The results are shown in
@@ -147,15 +144,15 @@ \subsection{Comparing Data Structures in the \ocaml Implementation}
 among the data structures.
 Lazy and Thunk lists, with or without memoizations, are the most ``well-rounded''
 implementations and perform decently on most languages.
-We can however make the following remarks.
+% We can however make the following remarks.
 \begin{itemize}[leftmargin=*]
 \item The \code{Trie} module
 is very fast thanks to its efficient concatenation.
-It however performs badly on $\Rstar a$
+It performs badly on $\Rstar a$
 due to the lack of path compression:
 in the case of $\Rstar a$, where each segment contains only one word, the
 trie degenerates to a list of characters.
-We believe an implementation of tries with path compressions would perform
+We believe an implementation of tries with path compression would perform
 significantly better.
 \item The other data structures exhibit a very pronounced slowdown on $\Rstar a$
 when reaching 150000 words.
@@ -168,11 +165,11 @@ \subsection{Comparing Data Structures in the \ocaml Implementation}
 \ocaml. These results also demonstrate that strict data structures
 should only be used when all elements up to a given length are
 needed. In such a case the stair pattern causes no problems.
-\item Memoization for thunk lists does not significantly increase performances.
+\item Memoization for thunk lists does not significantly improve performance.
 It seems that the linear cost of memoizing the thunk list and
 allocating the vectors
-is often higher than simply recomputing lists.
-\item $\Rstar{(\Rconcat{a}{\Rstar{b}})}$ shows that sorted enumerations and tries
+is often higher than simply recomputing the lists.
+\item The expression $\Rstar{(\Rconcat{a}{\Rstar{b}})}$ shows that sorted enumerations and tries
 perform well on set-operations, even compared to strict sets.
 \end{itemize}
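
The earlier remark about the \code{Trie} module and path compression has a language-independent reading: in a trie without path compression every character occupies its own node, so a segment containing a single word, as with $\Rstar a$, degenerates into a chain of one-child nodes, effectively a list of characters. A hedged Haskell sketch of such a trie (hypothetical, unrelated to the repository's OCaml module):

    -- A set of words as a trie without path compression.
    data Trie = Node { isWord :: Bool, children :: [(Char, Trie)] }

    emptyTrie :: Trie
    emptyTrie = Node False []

    insertWord :: String -> Trie -> Trie
    insertWord []      (Node _ cs) = Node True cs
    insertWord (c : w) (Node b cs) =
      case lookup c cs of
        Just t  -> Node b ((c, insertWord w t) : filter ((/= c) . fst) cs)
        Nothing -> Node b ((c, insertWord w emptyTrie) : cs)

    -- Inserting only "aaaa" produces one node per character:
    -- a list of characters in disguise.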

@@ -193,7 +190,7 @@ \subsection{The Influence of Regular Expressions on Performance}
 Before presenting the results, a word of warning:
 We do not claim to offer a fair comparison between languages!
 The two implementations are not exactly the same and we made no attempt
-to measure both language under exactly the same conditions.
+to measure both languages under exactly the same conditions.
 %
 \cref{bench:langs} contains the results with a
 logarithmic scale for the word count as it enables better comparison

intro.tex

Lines changed: 1 addition & 1 deletion
@@ -57,7 +57,7 @@ \section{Introduction}
 Source code for implementations in Haskell and in OCaml is available
 on GitHub. Examples can be run in a Web App.\footnote{%
 The Web App is available at
-\url{https://regex-generate.github.io/regenerate/}. Some links may
+\url{https://regex-generate.github.io/regenerate/}. Links in the App may
 reveal author identity, just using the App does not. }. Although
 not tuned for efficiency they generate languages at a rate
 between $1.3\cdot10^3$ and $1.4\cdot10^6$ strings per second, for

main.tex

Lines changed: 1 addition & 1 deletion
@@ -49,7 +49,7 @@
 \usepackage{textcomp}

 \usepackage[noabbrev,nameinlink,capitalize]{cleveref}
-
+\crefname{section}{\S}{\S}
 % Custom macros
 \input{prelude}