Skip to content

Commit 4505443

Browse files
committed
snapshot
1 parent 423d698 commit 4505443

8 files changed

+649
-631
lines changed

abstract.tex

Lines changed: 12 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -1,20 +1,19 @@
11
Regular expressions are part of every programmer's toolbox. They are
22
used for a wide variety of language-related tasks and there are many algorithms for
33
manipulating them. In particular, matching algorithms that detect
4-
whether a word belong to the language described by a regular
5-
expression is both a well explored area,
6-
yet one that still receives frequent contributions. However, there is
7-
no satisfactory solution for testing such matchers, which would
8-
require generating positive as well as negative examples for the language.
4+
whether a word belongs to the language described by a regular
5+
expression are well explored,
6+
yet new algorithms appear frequently. However, there is
7+
no satisfactory methodology for testing such matchers.
98

10-
We propose an algorithm to generate a language matched by a generalized
11-
regular expression with intersection and complement operators.
12-
The complement operator allows us to generate both positive and
13-
negative example words
14-
from a given regular expression.
15-
We implement our generator in Haskell and OCaml,
16-
and show that its performance is more than
17-
adequate for the purpose of testing.
9+
We propose such a methodology which is based on generating positive as
10+
well as negative examples of words in the language. To this end, we
11+
present a new algorithm to generate the language described by a
12+
generalized regular expression with intersection and complement
13+
operators. The complement operator allows us to generate both
14+
positive and negative example words from a given regular expression.
15+
We implement our generator in Haskell and OCaml, and show that its
16+
performance is more than adequate for testing.
1817

1918
%%% Local Variables:
2019
%%% mode: latex

deleted-stuff.tex

Lines changed: 211 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,211 @@
1+
2+
3+
4+
contains an implementation
5+
of a finite version of the generator module according to the preceding
6+
discussion. The implementation of \texttt{star} follows the definition
7+
of $U^*$ literally. It first recursively creates a list \texttt{lxs}
8+
of the iterates $U_\bullet^i$ where
9+
$U_\bullet = U \setminus \{\varepsilon\}$, concatenates all of
10+
them\footnote{It is easy to see that $U^* = U_\bullet^*$.}, removes
11+
the duplicates, and imposes the limit. Removing duplicates introduces
12+
quadratic worst-case time complexity in the size of the output.
13+
14+
The implementation of \texttt{complement} generates the language
15+
$\Sigma^*$ analogously to the construction of $U^*$ in
16+
\texttt{star}, uses the list difference operator
17+
\texttt{L.\textbackslash\textbackslash} to remove elements of
18+
\texttt{lx}, and finally imposes the limit. Its worst-case run time is
19+
$O(m\cdot n)$ where $m$ is the limit and $n = \code{length lx}$.
20+
21+
In summary, the naive approach in
22+
Figure~\ref{fig:regular-operators-0} can generate infinite languages,
23+
but has many drawbacks that lead to duplication and incompleteness
24+
(words in the language are not enumerated). Moreover, the complement
25+
is not computable for this approach.
26+
27+
The finite approach in Figure~\ref{fig:finite-regular-operators}
28+
imposes an arbitrary limit on the number of generated words. This
29+
limit can lead to omitting words in nonempty languages where $P (w\in
30+
R) = 0$. Moreover, there are many places (in \texttt{union},
31+
\texttt{concatenate}, and \texttt{star}) with quadratic worst-case
32+
time complexity.
33+
34+
At this point, the question is: Can we do better? Can we come up with
35+
a generator that supports finite as well as infinite languages
36+
efficiently without incurring extraneous quadratic behavior?
37+
38+
39+
\subsection{Ordered Enumeration}
40+
\label{sec:ordered-enumeration}
41+
\begin{figure}[tp]
42+
\begin{lstlisting}
43+
union :: (Ord t) => [t] -> [t] -> [t]
44+
union xs@(x:xs') ys@(y:ys') =
45+
case compare x y of
46+
EQ -> x : union xs' ys'
47+
LT -> x : union xs' ys
48+
GT -> y : union xs ys'
49+
union xs ys = xs ++ ys
50+
\end{lstlisting}
51+
52+
\begin{lstlisting}
53+
intersect :: (Ord t) => [t] -> [t] -> [t]
54+
intersect xs@(x:xs') ys@(y:ys') =
55+
case compare x y of
56+
EQ -> x : intersect xs' ys'
57+
LT -> intersect xs' ys
58+
GT -> intersect xs ys'
59+
intersect xs ys = []
60+
\end{lstlisting}
61+
62+
\begin{lstlisting}
63+
difference :: (Ord t) => [t] -> [t] -> [t]
64+
difference xs@(x:xs') ys@(y:ys') =
65+
case compare x y of
66+
EQ -> difference xs' ys'
67+
LT -> x : difference xs' ys
68+
GT -> difference xs ys'
69+
difference xs ys = xs
70+
\end{lstlisting}
71+
\caption{Union, intersection, and difference by merging lists}
72+
\label{fig:merging-lists}
73+
\end{figure}
74+
75+
First, we concentrate on improving on the quadratic behavior. The key
76+
to improve the complexity of union, intersection, and complement lies
77+
in representing a language by a strictly increasingly sorted list.
78+
In this case, the three operations can be implemented by variations of
79+
the merge operation on lists as shown in
80+
Figure~\ref{fig:merging-lists}.
81+
82+
The merge-based operations run in linear time on finite
83+
lists. However, the operations in Figure~\ref{fig:merging-lists} are
84+
incomplete on infinite lists. As
85+
an example of the incompleteness, consider the languages $U = a\cdot
86+
(a+b)^*$ and the singleton language $V = \{b\}$ where $\Sigma = \{a, b\}$
87+
with $a<b$, represented as strictly increasing lists, the infinite
88+
list \code{lu} and the singleton list \code{lv}. The problem is that
89+
the list \code{union lu lv} does not contain the word $b$; more
90+
precisely, \code{T.singleton 'b' `elem` union lu lv} does not
91+
terminate whereas \code{u `elem` union lu lv} yields \code{True} for
92+
all \code{u} in \code{lu}. The source
93+
of the problem is that Haskell's standard ordering of lists and
94+
\code{Text} is the \emph{lexicographic ordering}, which we call
95+
$\le$. It relies on an underlying total ordering on $\Sigma$ and is
96+
defined inductively:
97+
\begin{mathpar}
98+
\inferrule{}{\varepsilon \le v}
99+
100+
\inferrule{u \le v}{au \le av}
101+
102+
\inferrule{a < b}{au \le bv}
103+
\end{mathpar}
104+
105+
This total ordering
106+
is often used for $\Sigma^*$, but it has the property that
107+
there are words $v<w$ such that there are infinitely many words $u$
108+
with $v<u$ and $u<w$. For example, $v=a$, $w=b$, and $u\in U \setminus
109+
\{a\}$, which explains the nonterminating behavior just exhibited.
110+
111+
Fortunately, there is another total ordering on words with better
112+
properties. The
113+
\emph{length-lexicographic ordering}
114+
For the sake of simplicity, we assume from now on that \code{T.Text}
115+
is ordered by \code{llocompare} and call the representation of a
116+
language by a strictly increasing list in length-lexicographic order
117+
its \emph{LLO representation}.
118+
119+
With the LLO representation, the operations \code{union},
120+
\code{intersect}, and \code{difference} run in linear time. If the input languages
121+
are finite of size $m$ and $n$, respectively, then $O(m+n)$
122+
comparisons, pattern matches, and cons operations are needed.
123+
Moreover, the operations are complete in the sense that any element in
124+
the output is sure to be detected by a terminating computation.
125+
It is easy to implement a version of \code{elem} that exploits the LLO ordering,
126+
such that the element test is decidable for any infinite
127+
language.
128+
129+
130+
131+
\subsubsection{Concatenation}
132+
To implement concatenation, we are in the following situation. Given
133+
two languages $U, V \subseteq \Sigma^*$ in LLO representation,
134+
produce the LLO representation of $U \cdot V = \{ u\cdot v \mid u\in
135+
U, v\in V\}$. If we compute the product naively as in
136+
Figure~\ref{fig:regular-operators-0}, then the output is not in LLO
137+
form:\footnote{The example relies on the operator
138+
\lstinline{Data.Monoid.<>} to append strings in any representation.}
139+
\begin{verbatim}
140+
λ> let lu = ["a", "ab"]
141+
λ> let lv = ["", "b", "bb"]
142+
λ> [ u<>v | u <- lu, v <- lv ]
143+
["a","ab","abb","ab","abb","abbb"]
144+
\end{verbatim}
145+
In fact, the output violates both constraints: it is not increasing
146+
and it has duplicates.
147+
148+
Perhaps the following observation helps: for each $u\in U$, the LLO
149+
representation of the language $u\cdot V$ can be trivially produced
150+
because the list \lstinline{[ u<>v | v <- lv ]} is strictly
151+
increasing. This observation motivates the following definition of
152+
language concatenation (using \code{union} from Figure~\ref{fig:merging-lists}).
153+
\begin{lstlisting}
154+
concatenate' :: Lang -> Lang -> Lang
155+
concatenate' lx ly =
156+
foldr union [] $ [[ T.append x y | y <- ly] | x <- lx]
157+
\end{lstlisting}
158+
This definition works fine as long as \code{lx} is finite. If it is
159+
infinite, then the \code{foldr} creates an infinitely deep nest of
160+
invocations of \code{union}, which do not make progress because
161+
\code{union} is strict in both arguments.
162+
163+
At this point, the theory of formal languages can help. The notion of
164+
a \emph{formal power series} has been invented to reason about and
165+
compute with entire languages \cite{DBLP:books/daglib/0067812,DBLP:books/sp/KuichS86}. In full
166+
generality, a formal power series is a mapping from $\Sigma^*$ into a
167+
semiring $S$ and we write $\FPS{S}$ for the set of these
168+
mappings. Formally, an element $r \in \FPS{S}$ is written as the
169+
formal sum
170+
\begin{align*}
171+
r &= \sum_{w \in \Sigma^*} (r,w) \cdot w
172+
\end{align*}
173+
where $(r,w) \in S$ is the coefficient of $w$ in $r$.
174+
A popular candidate for this semiring is the boolean semiring $B$
175+
because $\FPS{B}$ is isomorphic to the set of languages over
176+
$\Sigma$. This isomorphism maps $L\subseteq\Sigma^*$ to its
177+
characteristic series $r_L$ where $(r_L, w) = (w \in L)$.
178+
179+
The usual language operations have their counterparts on formal power
180+
series. We consider just three of them where the ``additions'' and ``multiplications'' on the
181+
right side of the definitions take place in the underlying semiring.
182+
\begin{align*}
183+
&\text{Sum:}
184+
& (r_1 + r_2, w) &= (r_1, w) + (r_2, w) \\
185+
&\text{Hadamard product:}
186+
&(r_1 \odot r_2, w) &= (r_1, w) (r_2, w) \\
187+
&\text{Product:}
188+
& (r_1 \cdot r_2, w) &= \sum_{u\cdot v=w} (r_1, u) (r_2,v)
189+
\end{align*}
190+
191+
Ok, so what is the connection between formal power series and the LLO
192+
representation? The LLO representation of a language $L$ can be viewed as a systematic
193+
enumeration of the indexes of the non-zero coefficients of the
194+
characteristic power series of $L$. Thus, the \code{union} operator
195+
corresponds to the sum and the \code{intersect} operator corresponds
196+
to the Hadamard product (in $B$ the $+$ and $\cdot$ operators are
197+
$\vee$ and $\wedge$).
198+
199+
We also get a hint for computing the product. To find out whether $w
200+
\in U \cdot V$ we need to find $u\in U$ and $v\in V$ such that $u\cdot
201+
v = w$. Abstracting from this observation, we obtain that building a
202+
word $w$ with $|w| = n$ requires two words $u\in U$ and $v\in V$ such that $|u| +
203+
|v| = |w| = n$. Hence, it would be useful to have a representation
204+
that exposes the lengths of words.
205+
206+
To obtain such a representation, we represent a language as a power series
207+
\begin{gather*}
208+
L = \sum_{n=0}^\infty L_nx^n
209+
\end{gather*}
210+
where, for all $n$, $L_n \subseteq \Sigma^n$, a decomposition which corresponds
211+
directly to the definition of $\Sigma^*$.

0 commit comments

Comments
 (0)