
Commit 38bd3d9
Rework the bench section.
1 parent 3186270

15 files changed: +83 -101 lines

bench.tex

Lines changed: 31 additions & 50 deletions
@@ -42,11 +42,15 @@ \subsection{Comparing Algorithms in the \haskell Implementation}
 evaluation measures the performance of \code{star} on a non-sparse
 language and of {concatenation} applied to a finite and an infinite
 language.
-\item $\Rconcat{\Rcomplement{(\Rstar{a})}}{b}$: Finally, this regular
+\item $\Rconcat{\Rcomplement{(\Rstar{a})}}{b}$: This regular
 expression exercises the complement operation and tests the
 concatenation of a very large language,
 $P (w\in \Lang{\Rcomplement{(\Rstar a)}}) = 1$, to a much smaller
 language.
+\item $\Rintersect{\Rcomplement{(\Rstar{a})}}{\Rcomplement{(\Rstar{b})}}$:
+This regular expression applies {intersection} to two large languages
+and makes use of the complement. Its goal is to measure the efficiency
+of set operations.
 \end{itemize}

 \begin{figure}[!t]
@@ -58,30 +62,27 @@ \subsection{Comparing Algorithms in the \haskell Implementation}

 In the evaluation, we consider four variants of the Haskell implementation.
 \begin{itemize}
-\item The \textbf{naive} implementation corresponds to the code developed by
-the end of Section~\ref{sec:motivation}. It transforms to and from
-segments on the fly and uses plain list indexing.
+\item \textbf{McIlroy} is our implementation of
+the algorithm by \citet{DBLP:journals/jfp/McIlroy99}.
 \item The \textbf{seg} implementation uses the infinite list-based segmented
-representation throughout (\cref{sec:segm-repr}). Moreover,
-it relies on maps and sparse indexing for concatenation and closure
-(\cref{sec:sparse-indexing}).
+representation throughout (\cref{sec:segm-repr}).
 \item The \textbf{segConv} implementation additionally
-applies the convolution approach presented in \cref{sec:convolution}.
-\item The \textbf{ref} implementation uses symbolic segments
-from \cref{sec:more-finite-repr} combined with
-maps and sparse indexing.
+applies the convolution approach (\cref{sec:convolution,sec:faster-closure}).
+% \item The \textbf{ref} implementation uses symbolic segments
+% from \cref{sec:more-finite-repr} combined with
+% maps and sparse indexing.
 \item The \textbf{refConv} implementation combines
-symbolic segments, sparse indexing, and the convolution approach.
+symbolic segments (\cref{sec:segm-repr,sec:more-finite-repr}) with the convolution approach.
 \end{itemize}

 Performance is evaluated by iterating through the stream of words
-produced by the generator, forcing their evaluation.\footnote{In
+produced by the generator, forcing their evaluation\footnote{In
 Haskell, forcing is done using \lstinline{Control.DeepSeq}.}
 and recording the elapsed time every 20 words.
 We stop the iteration after 5 seconds.
-The resulting graph plots the time (x-axis) against the number of words (y-axis) produced so far. The slope of the graph in indicates the generation speed of the plotted algorithm, high slope is correlated to high generation speed. \cref{bench:haskell:all} contains the results for the Haskell implementations.
-
-Most algorithms generate between 3000 and 150000 words in the first
+The resulting graph plots the time (x-axis) against the number of words (y-axis) produced so far. The slope of the graph indicates the generation speed of the plotted algorithm; a high slope corresponds to high generation speed. \cref{bench:haskell:all} contains the results for the Haskell implementations.
+
+Most algorithms generate between $1.3\cdot10^3$ and $1.4\cdot10^6$ words in the first
 second, which seems more than sufficient for testing purposes.
 The \textbf{refConv} implementation,
 which uses symbolic segments and convolutions, is consistently in the
@@ -91,24 +92,23 @@ \subsection{Comparing Algorithms in the \haskell Implementation}
 This observation validates that the
 changes proposed in \cref{sec:improvements} actually lead to
 improvements.
-
-Looking at each graph in more detail, we can make the following
+%
+Looking at each graph in detail, we can make the following
 remarks:
 \begin{itemize}[leftmargin=*]
 \item All implementations are equally fast on $\Rstar a$ except
-the naive implementation, which relies on list lookups without
+\textbf{McIlroy}, which relies on list lookups without
 sparse indexing.
-\item For $\Rstar{(\Rconcat{a}{\Rstar{b}})}$ and
-$\Rconcat{\Rcomplement{(\Rstar{a})}}{b}$, the graph of some implementations
+\item The graph of some implementations
 has the shape of ``skewed stairs''. We believe this phenomenon is due to
 insufficient laziness: when arriving at a new segment, part of the
 work is done eagerly, which causes a plateau. When that part is done,
 the enumeration proceeds lazily. As laziness and GHC
 optimizations are hard to control, we did not attempt to correct this.
-\item $\Rstar{(\Rconcat{a}{\Rstar{b}})}$ demonstrates that sparse indexing
-does degrade performance when applying \code{star} to non-sparse languages.
-Using the convolution technique presented in \cref{sec:convolution} resolves this problem.
-\item The \textbf{ref} and \textbf{refConv} algorithms are
+\item $\Rstar{(\Rconcat{a}{\Rstar{b}})}$ demonstrates that
+the convolution technique presented in \cref{sec:convolution}
+leads to significant improvements when applying \code{star} to non-sparse languages.
+\item The \textbf{refConv} algorithm is
 significantly faster on $\Rconcat{\Rcomplement{(\Rstar{a})}}{b}$
 compared to \textbf{seg} and \textbf{segConv}. We have no good
 explanation for this behavior as the code is identical up to the
@@ -117,12 +117,15 @@ \subsection{Comparing Algorithms in the \haskell Implementation}
 is $\Lang{b}$, which is also represented finitely by
 \textbf{segConv} and should thus benefit from the convolution
 improvement in the same way as \textbf{refConv}.
+\item $\Rintersect{\Rcomplement{(\Rstar{a})}}{\Rcomplement{(\Rstar{b})}}$ shows that all our algorithms have similar
+performance profiles on set operations. They are also significantly
+faster than \textbf{McIlroy}.
 \end{itemize}


 \subsection{Comparing Data Structures in the \ocaml Implementation}
 \label{sec:bench:ocaml}
-\begin{figure}[!p]
+\begin{figure}[!tb]
 \centering
 \includegraphics[width=\linewidth]{measure/ocaml_all.png}
 \caption{Benchmark for the \ocaml implementation with various data-structures}
@@ -169,32 +172,10 @@ \subsection{Comparing Data Structures in the \ocaml Implementation}
 It seems that the linear cost of memoizing the thunk list and
 allocating the vectors
 is often higher than simply recomputing lists.
+\item $\Rintersect{\Rcomplement{(\Rstar{a})}}{\Rcomplement{(\Rstar{b})}}$ shows that sorted enumerations and tries
+perform well on set operations, even compared to strict sets.
 \end{itemize}

-
-
-While the regular expressions presented previously do exercise
-\code{concatenation} and \code{star}, they do not exercise set
-operations. To test set operations on non-trivial segments (that are
-neither full nor empty), we consider the language of words with at
-least one $a$ and one $b$. This language can be built in two ways:
-$\Rintersect{\Rcomplement{(\Rstar{a})}}{\Rcomplement{(\Rstar{b})}}$
-and $\Rconcat{(\Runion{a\Rstar{a}b}{b\Rstar{b}a})}{\Rstar{\Sigma}}$.
-The first expression applies {intersection} to two large languages,
-the second expression takes the union of smaller languages, but uses a
-concatenation. The results are shown \cref{bench:ocaml:union}.
-Lazy and thunk lists, with or without
-memoization, perform well on unions and intersections but less so
-on concatenations. Performance of
-strict sets is surprisingly poor.
-Tries are very efficient on concatenations.
-
-% \begin{figure}[!t]
-% \includegraphics[width=\linewidth]{measure/ocaml_union.png}
-% \caption{Benchmarking \texttt{union} in the \ocaml data-structures}
-% \label{bench:ocaml:union}
-% \end{figure}
-
 \subsection{The Influence of Regular Expressions on Performance}
 \begin{figure*}[!tp]
 \centering
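The measurement procedure described in bench.tex (iterate the generator's stream of words, force each one, sample the elapsed time every 20 words, stop after a 5-second budget) can be sketched in OCaml. This is an illustrative sketch only, not the paper's Haskell harness: the names `measure` and `next`, and the `(time, count)` sample format, are hypothetical choices made here.

```ocaml
(* Hedged sketch of the benchmark loop: [next] is a hypothetical
   generator callback returning the next word, or None when exhausted.
   We record a (elapsed-time, word-count) sample every [every] words
   and stop once [budget] seconds have elapsed. *)
let measure ?(every = 20) ?(budget = 5.0) (next : unit -> string option) =
  let t0 = Sys.time () in
  let samples = ref [] in
  let rec loop n =
    let t = Sys.time () -. t0 in
    if t > budget then ()
    else
      match next () with
      | None -> ()                       (* stream exhausted *)
      | Some _word ->
        (* one sample every [every] generated words *)
        if n mod every = 0 then samples := (t, n) :: !samples;
        loop (n + 1)
  in
  loop 1;
  List.rev !samples
```

Plotting the returned `(time, count)` pairs gives the time-vs-words curves of the benchmark figures; the slope is the generation speed.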

conclusions.tex

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ \section{Conclusions and Future Work}
 Even though our implementations are not heavily optimized, our approach generates
 languages at a rate that is more than sufficient for testing
 purposes, between $1.3\cdot10^3$ and $1.4\cdot10^6$ strings per second.
-We can now to combine our generator with property based testing
+We can then combine our generator with property-based testing
 to test regular expression parsers on randomly-generated regular expressions.
 While our approach eliminated the need for an oracle, the burden of correctness
 now lies on the language generator. We would like to implement our algorithm

measure/haskell_all.gnuplot

Lines changed: 6 additions & 5 deletions
@@ -2,7 +2,7 @@

 # set terminal x11 size 1500,500 font 'Deja Vu Sans Mono,14' persist

-set terminal pngcairo transparent size 1000,1650 rounded font 'Deja Vu Sans,19'
+set terminal pngcairo transparent size 1000,2000 rounded font 'Deja Vu Sans,19'
 set output 'haskell_all.png'

 # set terminal tikz standalone size 15,6 textscale 0.5
@@ -29,7 +29,7 @@ set yrange [0:]
 # Put the legend at the bottom left of the plot
 set key left top

-set multiplot layout 3,1 columnsfirst scale 1,1 spacing 1,1
+set multiplot layout 4,1 columnsfirst scale 1,1 spacing 1,1

 set lmargin at screen 0.15; set rmargin at screen 0.98
 set tmargin 0.3
@@ -42,8 +42,9 @@ set style line 5 lt 1 lc rgb "#66a61e" lw 4 pt 7 ps 1.5 dt "_. "
 set style line 6 lt 1 lc rgb "#e6ab02" lw 4 pt 7 ps 1.5 dt ". "
 set style line 7 lt 1 lc rgb "#a6761d" lw 4 pt 7 ps 1.5 dt "-"

-re = '(ab*)* a* ~(a*)b'
-algo = "naive ref refConv seg segConv"
+re = '(ab*)* a* ~(a*)b ~(a*)&~(b*)'
+file = "McIlroy segStar segConvStar refConvStar"
+algo = "McIlroy seg segConv refConv"

 last = words(re)

@@ -61,7 +62,7 @@ do for [i = 1:last] {
   unset key
 }
 set label 1 word(re,i) noenhanced center at graph 0.5,0.95 font ',21'
-plot for [j = 1:words(algo)] word(re,i)."_".word(algo,j)."_haskell.csv" using 2:($1/10000) title word(algo,j) noenhanced with lines ls j
+plot for [j = 1:words(algo)] word(re,i)."_".word(file,j)."_haskell.csv" using 2:($1/10000) title word(algo,j) noenhanced with lines ls j
 }

 unset multiplot

measure/haskell_all.png

25.1 KB

measure/haskell_csv.sh

Lines changed: 6 additions & 6 deletions
@@ -5,18 +5,18 @@ IFS=$'\n\t'
 set -f # Disable globbing


-ReBase=('a*' '(ab*)*' '~(a*)b')
+ReBase=('a*' '(ab*)*' '~(a*)b' '~(a*)&~(b*)')
 ReMore=('a*' 'a*b' 'ba*' '(ab*)*' '~(a*)b' '((a|b)(a|b))*' '(1(01*0)*1|0)*' '~(a*)&~(b*)')

-BackendBase=("naive" "ref" "seg" "refConv")
-BackendMore=("segConv")
+BackendBase=("McIlroy" "segStar" "segConvStar")
+BackendMore=("refConvStar")

 function genH {
 file="$2_$1_haskell.csv"
 echo "Regex $2 on backend $1 to $file"
 re-generate-exe \
 --alphabet "ab" -s20 \
--b "${1^}Star" \
+-b "${1^}" \
 "$2" > "$file"
 }
@@ -38,5 +38,5 @@ go BackendMore ReMore

 echo "Gnuploting to haskell_all.png!"
 gnuplot haskell_all.gnuplot
-echo "Gnuploting to haskell_langs.png!"
-gnuplot haskell_langs.gnuplot
+# echo "Gnuploting to haskell_langs.png!"
+# gnuplot haskell_langs.gnuplot

measure/haskell_langs.gnuplot

Lines changed: 1 addition & 1 deletion
@@ -39,6 +39,6 @@ set style line 6 lt 1 lc rgb "#e6ab02" lw 4 pt 7 ps 1.5 dt ". "
 set style line 7 lt 1 lc rgb "#a6761d" lw 4 pt 7 ps 1.5 dt "-"

 re = 'a* a*b ba* (ab*)* ~(a*)b ((a|b)(a|b))* (1(01*0)*1|0)* ~(a*)&~(b*)'
-algo = "segConv"
+algo = "refConvStar"

 plot for [i = 1:words(re)] word(re,i)."_".algo."_haskell.csv" using 2:($1/10000) title word(re,i) noenhanced with lines ls i

measure/haskell_langs.png

-5.38 KB

measure/langs.gnuplot

Lines changed: 2 additions & 2 deletions
@@ -42,9 +42,9 @@ set multiplot layout 1,2 columnsfirst scale 1,1
 unset key

 re = 'a* a*b ba* (ab*)* ~(a*)b ((a|b)(a|b))* (1(01*0)*1|0)* ~(a*)&~(b*)'
-algo="segConv"
+algo="refConvStar"

-set title "Haskell with segConv"
+set title "Haskell with refConv"
 plot for [i = 1:words(re)] word(re,i)."_".algo."_haskell.csv" using 2:($1/10000) title word(re,i) noenhanced with lines ls i

measure/langs.png

5.49 KB

measure/ocaml_all.gnuplot

Lines changed: 3 additions & 3 deletions
@@ -2,7 +2,7 @@

 # set terminal x11 size 1500,500 font 'Deja Vu Sans Mono,14' persist

-set terminal pngcairo transparent size 1000,2550 rounded font 'Deja Vu Sans,19'
+set terminal pngcairo transparent size 1000,2000 rounded font 'Deja Vu Sans,19'
 set output 'ocaml_all.png'

 # set terminal tikz standalone size 15,6 textscale 0.5
@@ -29,7 +29,7 @@ set yrange [0:]
 # Put the legend at the bottom left of the plot
 set key left top

-set multiplot layout 5,1 columnsfirst scale 1,1 spacing 0,0
+set multiplot layout 4,1 columnsfirst scale 1,1 spacing 0,0

 set lmargin at screen 0.15; set rmargin at screen 0.98
 set tmargin 0.3
@@ -42,7 +42,7 @@ set style line 5 lt 1 lc rgb "#66a61e" lw 4 pt 7 ps 1.5 dt "_. "
 set style line 6 lt 1 lc rgb "#e6ab02" lw 4 pt 7 ps 1.5 dt ". "
 set style line 7 lt 1 lc rgb "#a6761d" lw 4 pt 7 ps 1.5 dt "-"

-re = '(ab*)* a* ~(a*)b ~(a*)&~(b*) (aa*b|bb*a)(a|b)*'
+re = '(ab*)* a* ~(a*)b ~(a*)&~(b*)'
 algo = "ThunkList ThunkListMemo LazyList StrictSet Trie"

 last = words(re)

measure/ocaml_all.png

-65.7 KB

measure/ocaml_csv.sh

Lines changed: 4 additions & 6 deletions
@@ -5,8 +5,8 @@ IFS=$'\n\t'
 set -f # Disable globbing


-ReBase=('a*' '(ab*)*' '~(a*)b' '~(a*)&~(b*)' '(aa*b|bb*a)(a|b)*')
-ReMore=('a*' 'a*b' 'ba*' '(ab*)*' '~(a*)b' '((a|b)(a|b))*' '(1(01*0)*1|0)*' '~(a*)&~(b*)' '(aa*b|bb*a)(a|b)*')
+ReBase=('a*' '(ab*)*' '~(a*)b' '~(a*)&~(b*)')
+ReMore=('a*' 'a*b' 'ba*' '(ab*)*' '~(a*)b' '((a|b)(a|b))*' '(1(01*0)*1|0)*' '~(a*)&~(b*)')

 BackendBase=("ThunkListMemo" "ThunkList" "StrictSet" "Trie")
 BackendMore=("LazyList")
@@ -38,7 +38,5 @@ go BackendMore ReMore

 echo "Gnuploting to ocaml_all.png!"
 gnuplot ocaml_all.gnuplot
-echo "Gnuploting to ocaml_langs.png!"
-gnuplot ocaml_langs.gnuplot
-echo "Gnuploting to ocaml_union.png!"
-gnuplot ocaml_union.gnuplot
+# echo "Gnuploting to ocaml_langs.png!"
+# gnuplot ocaml_langs.gnuplot

measure/ocaml_langs.png

-11.1 KB

ocaml.tex

Lines changed: 27 additions & 25 deletions
@@ -142,10 +142,11 @@ \subsection{Core Algorithm}

 An enumeration is represented by a function which takes a unit argument and returns
 a node. A node, in turn, is either \code{Nothing} or a \code{Cons} of an
-element and the tail of the sequence. The empty enumeration, for instance, is
-represented as \code{fun () -> Nothing}.
-This representation is lazy, fast, lightweight, and almost as easy to
-manipulate as regular lists.
+element and the tail of the sequence.
+% The empty enumeration, for instance, is
+% represented as \code{fun () -> Nothing}.
+% This representation is lazy, fast, lightweight, and almost as easy to
+% manipulate as regular lists.
 % \footnote{See
 % \url{https://github.com/ocaml/ocaml/pull/1002} for a long discussion
 % on the topic.}
@@ -156,7 +157,7 @@ \subsection{Core Algorithm}
 As an example, the implementation of language union is shown below.
 The trailing unit argument \code{()} drives the evaluation of the
 sequence lazily. With this definition, \code{union s1 s2} causes no
-evaluation before is applied to \code{()}.
+evaluation before it is applied to \code{()}.
 \begin{lstlisting}
 let rec union s1 s2 () = match s1(), s2() with
 | Nothing, x | x, Nothing -> x
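The `union` listing in this hunk is truncated by the diff. A self-contained sketch consistent with the text above is given below; only the `Nothing`/`Cons` representation and the idea of `union` come from the paper, while the ordered-merge details, the re-wrapping of already-forced nodes, and the `of_list`/`to_list` helpers are illustrative assumptions made here.

```ocaml
(* Thunk-list enumerations as described in ocaml.tex: forcing the
   function with () yields either Nothing or a Cons of head and tail. *)
type 'a node =
  | Nothing
  | Cons of 'a * 'a enum
and 'a enum = unit -> 'a node

(* Union of two sorted (strictly increasing) enumerations, merging
   duplicates. The [fun () -> n] re-wrapping avoids re-forcing a node
   that has already been computed. *)
let rec union s1 s2 () =
  match s1 (), s2 () with
  | Nothing, x | x, Nothing -> x
  | (Cons (x1, t1) as n1), (Cons (x2, t2) as n2) ->
    if x1 < x2 then Cons (x1, union t1 (fun () -> n2))
    else if x2 < x1 then Cons (x2, union (fun () -> n1) t2)
    else Cons (x1, union t1 t2)

(* Helpers for experimentation (not part of the paper's interface). *)
let rec of_list l () =
  match l with [] -> Nothing | x :: xs -> Cons (x, of_list xs)

let rec to_list s =
  match s () with Nothing -> [] | Cons (x, t) -> x :: to_list t
```

Note that no evaluation happens when `union s1 s2` is built; work only starts once the result is applied to `()`, exactly as the surrounding text explains.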
@@ -230,12 +231,12 @@ \subsection{Data Structures}
 data structures for segments. We present several possibilities before
 comparing their performance.

-\subsubsection{Ordered enumerations}
+\paragraph{Ordered enumerations}

 Ordered enumerations, represented by thunk-lists, make
 for a light-weight set representation.
-To use an order, we require a comparison and an
-\code{append} function on words. The \code{OrderedMonoid} signature
+To use an order, we require \code{compare} and
+\code{append} on words. The \code{OrderedMonoid} signature
 captures these requirements. \autoref{code:thunklist} shows the
 resulting functor \code{ThunkList}.


@@ -288,14 +289,15 @@ \subsubsection{Ordered enumerations}
288289
\label{code:thunklist}
289290
\end{figure}
290291

291-
\subsubsection{Transience and Memoization}
292+
\paragraph{Transience and Memoization}
292293

293294
During concatenation and star, we iterate over segments multiple times.
294295
As thunk lists are transient, iterating multiple times over the same list
295296
will compute it multiple times. To avoid this recomputation, we can implement memoization
296-
over thunk lists by pushing the elements in a growing vector as they are
297-
computed. Before evaluating a new thunk, we first check if it is already available
298-
in the vector. Otherwise, evaluate it, push it into the vector and return it.
297+
over thunk lists by using a growing vector as cache.
298+
% pushing the elements in a growing vector as they are
299+
% computed. Before evaluating a new thunk, we first check if it is already available
300+
% in the vector. Otherwise, evaluate it, push it into the vector and return it.
299301
% \begin{lstlisting}
300302
% let memoize f =
301303
% let r = CCVector.create () in
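The memoization idea summarized above (and elided into comments in this hunk) can be sketched as follows. The paper's code uses a growing `CCVector` from the containers library; here a position-indexed `Hashtbl` stands in for it so the example has no external dependencies, and the `memoize`/`at` names are assumptions made for illustration.

```ocaml
(* Self-contained thunk-list type, as in the paper's representation. *)
type 'a node = Nothing | Cons of 'a * (unit -> 'a node)

(* Memoize a thunk list: cache each forced node by its position, so a
   second traversal reuses the cache instead of recomputing the list. *)
let memoize (s : unit -> 'a node) : unit -> 'a node =
  let cache : (int, 'a node) Hashtbl.t = Hashtbl.create 16 in
  let rec at i s () =
    match Hashtbl.find_opt cache i with
    | Some n -> n                    (* already computed: reuse it *)
    | None ->
      let n =
        match s () with
        | Nothing -> Nothing
        | Cons (x, t) -> Cons (x, at (i + 1) t)
      in
      Hashtbl.add cache i n;
      n
  in
  at 0 s
```

As the benchmark discussion above notes, whether this pays off depends on whether the linear cost of maintaining the cache beats recomputing the list.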
@@ -318,21 +320,21 @@ \subsubsection{Transience and Memoization}
 \code{ThunkList} where memoization is the identity and \code{ThunkListMemo}
 with the implementation described above.

-\subsubsection{Lazy Lists}
+\paragraph{Lazy Lists}

-\ocaml also supports regular lazy lists using the builtin \code{Lazy.t} type:
-
-\begin{lstlisting}
-type 'a node =
-  | Nil
-  | Cons of 'a * 'a lazylist
-type 'a lazylist = 'a node Lazy.t
-\end{lstlisting}
-
-We implemented a \code{LazyList} functor which is identical to the
+\ocaml also supports regular lazy lists using the builtin \code{Lazy.t} type.
+%
+% \begin{lstlisting}
+% type 'a node =
+%   | Nil
+%   | Cons of 'a * 'a lazylist
+% type 'a lazylist = 'a node Lazy.t
+% \end{lstlisting}
+%
+We implemented a \code{LazyList} functor which is identical to
 \code{ThunkList} but uses lazy lists.

-\subsubsection{Strict Sets}
+\paragraph{Strict Sets}

 As the main operations on segments are set operations, one might
 expect a set implementation to perform well. We implemented segments as sets
342344
the n-way merge and the product.
343345
%, which can be implemented using folds and unions.
344346

345-
\subsubsection{Tries}
347+
\paragraph{Tries}
346348

347349
Tries \cite{Fredkin1960} are prefix trees where each branch is labeled
348350
with a character and each node may contain a value. Tries are commonly used
