|
| 1 | + |
| 2 | + |
| 3 | + |
| 4 | + contains an implementation |
| 5 | +of a finite version of the generator module according to the preceding |
| 6 | +discussion. The implementation of \texttt{star} follows the definition |
| 7 | +of $U^*$ literally. It first recursively creates a list \texttt{lxs} |
| 8 | +of the iterates $U_\bullet^i$ where |
| 9 | +$U_\bullet = U \setminus \{\varepsilon\}$, concatenates all of |
| 10 | +them\footnote{It is easy to see that $U^* = U_\bullet^*$.}, removes |
| 11 | +the duplicates, and imposes the limit. Removing duplicates introduces |
| 12 | +quadratic worst-case time complexity in the size of the output. |
| 13 | + |
| 14 | +The implementation of \texttt{complement} generates the language |
| 15 | +$\Sigma^*$ analogously to the construction of $U^*$ in |
| 16 | +\texttt{star}, uses the list difference operator |
| 17 | +\texttt{L.\textbackslash\textbackslash} to remove elements of |
| 18 | +\texttt{lx}, and finally imposes the limit. Its worst-case run time is |
| 19 | +$O(m\cdot n)$ where $m$ is the limit and $n = \code{length lx}$. |
| 20 | + |
| 21 | +In summary, the naive approach in |
| 22 | +Figure~\ref{fig:regular-operators-0} can generate infinite languages, |
| 23 | +but has many drawbacks that lead to duplication and incompleteness |
| 24 | +(words in the language are not enumerated). Moreover, the complement |
| 25 | +is not computable for this approach. |
| 26 | + |
| 27 | +The finite approach in Figure~\ref{fig:finite-regular-operators} |
| 28 | +imposes an arbitrary limit on the number of generated words. This |
| 29 | +limit can lead to omitting words in nonempty languages where $P (w\in |
| 30 | +R) = 0$. Moreover, there are many places (in \texttt{union}, |
| 31 | +\texttt{concatenate}, and \texttt{star}) with quadratic worst-case |
| 32 | +time complexity. |
| 33 | + |
| 34 | +At this point, the question is: Can we do better? Can we come up with |
| 35 | +a generator that supports finite as well as infinite languages |
| 36 | +efficiently without incurring extraneous quadratic behavior? |
| 37 | + |
| 38 | + |
| 39 | +\subsection{Ordered Enumeration} |
| 40 | +\label{sec:ordered-enumeration} |
| 41 | +\begin{figure}[tp] |
| 42 | +\begin{lstlisting} |
| 43 | +union :: (Ord t) => [t] -> [t] -> [t] |
| 44 | +union xs@(x:xs') ys@(y:ys') = |
| 45 | + case compare x y of |
| 46 | + EQ -> x : union xs' ys' |
| 47 | + LT -> x : union xs' ys |
| 48 | + GT -> y : union xs ys' |
| 49 | +union xs ys = xs ++ ys |
| 50 | +\end{lstlisting} |
| 51 | + |
| 52 | +\begin{lstlisting} |
| 53 | +intersect :: (Ord t) => [t] -> [t] -> [t] |
| 54 | +intersect xs@(x:xs') ys@(y:ys') = |
| 55 | + case compare x y of |
| 56 | + EQ -> x : intersect xs' ys' |
| 57 | + LT -> intersect xs' ys |
| 58 | + GT -> intersect xs ys' |
| 59 | +intersect xs ys = [] |
| 60 | +\end{lstlisting} |
| 61 | + |
| 62 | +\begin{lstlisting} |
| 63 | +difference :: (Ord t) => [t] -> [t] -> [t] |
| 64 | +difference xs@(x:xs') ys@(y:ys') = |
| 65 | + case compare x y of |
| 66 | + EQ -> difference xs' ys' |
| 67 | + LT -> x : difference xs' ys |
| 68 | + GT -> difference xs ys' |
| 69 | +difference xs ys = xs |
| 70 | +\end{lstlisting} |
| 71 | + \caption{Union, intersection, and difference by merging lists} |
| 72 | + \label{fig:merging-lists} |
| 73 | +\end{figure} |
| 74 | + |
| 75 | +First, we concentrate on improving on the quadratic behavior. The key |
| 76 | +to improve the complexity of union, intersection, and complement lies |
| 77 | +in representing a language by a strictly increasingly sorted list. |
| 78 | +In this case, the three operations can be implemented by variations of |
| 79 | +the merge operation on lists as shown in |
| 80 | +Figure~\ref{fig:merging-lists}. |
| 81 | + |
| 82 | +The merge-based operations run in linear time on finite |
| 83 | +lists. However, the operations in Figure~\ref{fig:merging-lists} are |
| 84 | +incomplete on infinite lists. As |
| 85 | +an example of the incompleteness, consider the languages $U = a\cdot |
| 86 | +(a+b)^*$ and the singleton language $V = \{b\}$ where $\Sigma = \{a, b\}$ |
| 87 | +with $a<b$, represented as strictly increasing lists, the infinite |
| 88 | +list \code{lu} and the singleton list \code{lv}. The problem is that |
| 89 | +the list \code{union lu lv} does not contain the word $b$; more |
| 90 | +precisely, \code{T.singleton 'b' `elem` union lu lv} does not |
| 91 | +terminate whereas \code{u `elem` union lu lv} yields \code{True} for |
| 92 | +all \code{u} in \code{lu}. The source |
| 93 | +of the problem is that Haskell's standard ordering of lists and |
| 94 | +\code{Text} is the \emph{lexicographic ordering}, which we call |
| 95 | +$\le$. It relies on an underlying total ordering on $\Sigma$ and is |
| 96 | +defined inductively: |
| 97 | +\begin{mathpar} |
| 98 | + \inferrule{}{\varepsilon \le v} |
| 99 | + |
| 100 | + \inferrule{u \le v}{au \le av} |
| 101 | + |
| 102 | + \inferrule{a < b}{au \le bv} |
| 103 | +\end{mathpar} |
| 104 | + |
| 105 | +This total ordering |
| 106 | +is often used for $\Sigma^*$, but it has the property that |
| 107 | +there are words $v<w$ such that there are infinitely many words $u$ |
| 108 | +with $v<u$ and $u<w$. For example, $v=a$, $w=b$, and $u\in U \setminus |
| 109 | +\{a\}$, which explains the nonterminating behavior just exhibited. |
| 110 | + |
| 111 | +Fortunately, there is another total ordering on words with better |
| 112 | +properties. The |
| 113 | +\emph{length-lexicographic ordering} |
| 114 | +For the sake of simplicity, we assume from now on that \code{T.Text} |
| 115 | +is ordered by \code{llocompare} and call the representation of a |
| 116 | +language by a strictly increasing list in length-lexicographic order |
| 117 | +its \emph{LLO representation}. |
| 118 | + |
| 119 | +With the LLO representation, the operations \code{union}, |
| 120 | +\code{intersect}, and \code{difference} run in linear time. If the input languages |
| 121 | +are finite of size $m$ and $n$, respectively, then $O(m+n)$ |
| 122 | +comparisons, pattern matches, and cons operations are needed. |
| 123 | +Moreover, the operations are complete in the sense that any element in |
| 124 | +the output is sure to be detected by a terminating computation. |
| 125 | +It is easy to implement a version of \code{elem} that exploits the LLO ordering, |
| 126 | +such that the element test is decidable for any infinite |
| 127 | +language. |
| 128 | + |
| 129 | + |
| 130 | + |
| 131 | +\subsubsection{Concatenation} |
| 132 | +To implement concatenation, we are in the following situation. Given |
| 133 | +two languages $U, V \subseteq \Sigma^*$ in LLO representation, |
| 134 | +produce the LLO representation of $U \cdot V = \{ u\cdot v \mid u\in |
| 135 | +U, v\in V\}$. If we compute the product naively as in |
| 136 | +Figure~\ref{fig:regular-operators-0}, then the output is not in LLO |
| 137 | +form:\footnote{The example relies on the operator |
| 138 | + \lstinline{Data.Monoid.<>} to append strings in any representation.} |
| 139 | +\begin{verbatim} |
| 140 | +λ> let lu = ["a", "ab"] |
| 141 | +λ> let lv = ["", "b", "bb"] |
| 142 | +λ> [ u<>v | u <- lu, v <- lv ] |
| 143 | +["a","ab","abb","ab","abb","abbb"] |
| 144 | +\end{verbatim} |
| 145 | +In fact, the output violates both constraints: it is not increasing |
| 146 | +and it has duplicates. |
| 147 | + |
| 148 | +Perhaps the following observation helps: for each $u\in U$, the LLO |
| 149 | +representation of the language $u\cdot V$ can be trivially produced |
| 150 | +because the list \lstinline{[ u<>v | v <- lv ]} is strictly |
| 151 | +increasing. This observation motivates the following definition of |
| 152 | +language concatenation (using \code{union} from Figure~\ref{fig:merging-lists}). |
| 153 | +\begin{lstlisting} |
| 154 | +concatenate' :: Lang -> Lang -> Lang |
| 155 | +concatenate' lx ly = |
| 156 | + foldr union [] $ [[ T.append x y | y <- ly] | x <- lx] |
| 157 | +\end{lstlisting} |
| 158 | +This definition works fine as long as \code{lx} is finite. If it is |
| 159 | +infinite, then the \code{foldr} creates an infinitely deep nest of |
| 160 | +invocations of \code{union}, which do not make progress because |
| 161 | +\code{union} is strict in both arguments. |
| 162 | + |
| 163 | +At this point, the theory of formal languages can help. The notion of |
| 164 | +a \emph{formal power series} has been invented to reason about and |
| 165 | +compute with entire languages \cite{DBLP:books/daglib/0067812,DBLP:books/sp/KuichS86}. In full |
| 166 | +generality, a formal power series is a mapping from $\Sigma^*$ into a |
| 167 | +semiring $S$ and we write $\FPS{S}$ for the set of these |
| 168 | +mappings. Formally, an element $r \in \FPS{S}$ is written as the |
| 169 | +formal sum |
| 170 | +\begin{align*} |
| 171 | + r &= \sum_{w \in \Sigma^*} (r,w) \cdot w |
| 172 | +\end{align*} |
| 173 | +where $(r,w) \in S$ is the coefficient of $w$ in $r$. |
| 174 | +A popular candidate for this semiring is the boolean semiring $B$ |
| 175 | +because $\FPS{B}$ is isomorphic to the set of languages over |
| 176 | +$\Sigma$. This isomorphism maps $L\subseteq\Sigma^*$ to its |
| 177 | +characteristic series $r_L$ where $(r_L, w) = (w \in L)$. |
| 178 | + |
| 179 | +The usual language operations have their counterparts on formal power |
| 180 | +series. We consider just three of them where the ``additions'' and ``multiplications'' on the |
| 181 | +right side of the definitions take place in the underlying semiring. |
| 182 | +\begin{align*} |
| 183 | + &\text{Sum:} |
| 184 | + & (r_1 + r_2, w) &= (r_1, w) + (r_2, w) \\ |
| 185 | + &\text{Hadamard product:} |
| 186 | + &(r_1 \odot r_2, w) &= (r_1, w) (r_2, w) \\ |
| 187 | + &\text{Product:} |
| 188 | + & (r_1 \cdot r_2, w) &= \sum_{u\cdot v=w} (r_1, u) (r_2,v) |
| 189 | +\end{align*} |
| 190 | + |
| 191 | +Ok, so what is the connection between formal power series and the LLO |
| 192 | +representation? The LLO representation of a language $L$ can be viewed as a systematic |
| 193 | +enumeration of the indexes of the non-zero coefficients of the |
| 194 | +characteristic power series of $L$. Thus, the \code{union} operator |
| 195 | +corresponds to the sum and the \code{intersect} operator corresponds |
| 196 | +to the Hadamard product (in $B$ the $+$ and $\cdot$ operators are |
| 197 | +$\vee$ and $\wedge$). |
| 198 | + |
| 199 | +We also get a hint for computing the product. To find out whether $w |
| 200 | +\in U \cdot V$ we need to find $u\in U$ and $v\in V$ such that $u\cdot |
| 201 | +v = w$. Abstracting from this observation, we obtain that building a |
| 202 | +word $w$ with $|w| = n$ requires two words $u\in U$ and $v\in V$ such that $|u| + |
| 203 | +|v| = |w| = n$. Hence, it would be useful to have a representation |
| 204 | +that exposes the lengths of words. |
| 205 | + |
| 206 | +To obtain such a representation, we represent a language as a power series |
| 207 | +\begin{gather*} |
| 208 | + L = \sum_{n=0}^\infty L_nx^n |
| 209 | +\end{gather*} |
| 210 | +where, for all $n$, $L_n \subseteq \Sigma^n$, a decomposition which corresponds |
| 211 | +directly to the definition of $\Sigma^*$. |
0 commit comments