|
| 1 | +\documentclass{llncs} |
| 2 | +% Encoding and lang |
| 3 | +\usepackage[T1]{fontenc} |
| 4 | +\usepackage[utf8]{inputenc} |
| 5 | +\usepackage[english]{babel} |
| 6 | + |
| 7 | +% Graphical packages |
| 8 | +% \usepackage{graphicx} |
| 9 | +\usepackage{xcolor} |
| 10 | + |
| 11 | + |
| 12 | +% Specialized packages |
| 13 | +% \usepackage{syntax} % Grammar definitions |
| 14 | +% \usepackage{verbatim} |
| 15 | +\usepackage{listings} % Code |
| 16 | +% \usepackage{xspace} % Useful for macros |
| 17 | + |
| 18 | +\usepackage{hyperref} |
| 19 | +\usepackage[noabbrev,nameinlink,capitalize]{cleveref} |
| 20 | + |
| 21 | +% Custom macros |
| 22 | +\input{../prelude} |
| 23 | +\bibliographystyle{plain} |
| 24 | + |
| 25 | +\begin{document} |
| 26 | +\title{Generating Tests for Regular Expression Engines} |
| 27 | + |
| 28 | +\author{Gabriel Radanne \and Peter Thiemann} |
| 29 | +\institute{University of Freiburg, Germany \\ |
| 30 | + \email{\{radanne,thiemann\}@informatik.uni-freiburg.de} |
| 31 | +} |
| 32 | +% |
| 33 | + |
| 34 | + |
| 35 | +% \author{Peter Thiemann} |
| 36 | +% \affiliation{ |
| 37 | +% \institution{University of Freiburg} |
| 38 | +% \country{Germany} |
| 39 | +% } |
| 40 | + |
| 41 | + |
| 42 | +\maketitle |
| 43 | + |
| 44 | +\begin{abstract} |
| 45 | + \input{../abstract} |
| 46 | +\end{abstract} |
| 47 | + |
| 48 | +\section{Introduction} |
| 49 | + |
| 50 | +Regular languages are everywhere. Due to their apparent simplicity and |
| 51 | +their concise representability in the form of regular expressions, |
| 52 | +regular languages are used for many text processing |
| 53 | +applications, ranging from text editors |
| 54 | +\cite{DBLP:journals/cacm/Thompson68} to extracting data from web |
| 55 | +pages. |
| 56 | + |
| 57 | +Consequently, there are many algorithms and libraries that implement |
| 58 | +parsing for regular expressions. Some of them are based on Thompson's |
| 59 | +translation from regular expressions to nondeterministic finite |
| 60 | +automata and then apply the powerset construction to obtain a |
| 61 | +deterministic automaton. Others are based on Brzozowski's derivatives |
| 62 | +\cite{Brzozowski1964} and |
| 63 | +map a regular expression directly to a deterministic |
| 64 | +automaton. Antimirov's partial derivatives \cite{Antimirov96Partial} |
| 65 | +yield another transformation into a nondeterministic automaton. An |
| 66 | +implementation based on Glushkov automata has been proposed |
| 67 | +\cite{DBLP:conf/icfp/FischerHW10} with decent performance. |
| 68 | +Russ Cox's webpage gives a good overview |
| 69 | +of efficient implementations of regular expression search. It includes |
| 70 | +a discussion of his implementation of Google's RE2 \cite{cox10:_regul_expres_match_wild}. |
| 71 | + |
| 72 | +Some of the algorithms for regular expression matching are rather |
| 73 | +intricate, and the natural question arises of how to test these algorithms. |
| 74 | +While there are online repositories with reams of real-life regular |
| 75 | +expressions \cite{regul_expres_librar}, there are no satisfactory |
| 76 | +generators for test inputs. It is not too hard to come up with |
| 77 | +generators for strings that match a given regular expression, but that |
| 78 | +is only one side of the coin. The algorithm should also |
| 79 | +reject strings that do not match the regular expression, so it is |
| 80 | +equally important to come up with strings that do \textbf{not} match. |
| 81 | + |
| 82 | +This work presents generator algorithms for extended regular expressions that |
| 83 | +contain intersection and complement in addition to the regular operators. The |
| 84 | +presence of the complement operator enables the algorithms to generate |
| 85 | +strings that certainly do not match a given (extended) regular |
| 86 | +expression. |
| 87 | + |
| 88 | +Our implementations are useful in practice. They are guaranteed to be |
| 89 | +productive and produce total outputs. That is, a user can gauge the |
| 90 | +string size as well as the number of generated strings without risking |
| 91 | +partiality. |
| 92 | + |
| 93 | +Even though the implementations |
| 94 | +are not tuned for efficiency, they generate |
| 95 | +languages at a rate between $1.3\cdot10^3$ and $1.4\cdot10^6$ strings per |
| 96 | +second for Haskell and up to $3.6\cdot10^6$ strings per second for |
| 97 | +OCaml. The generation rate depends on the density of the language. |
| 98 | + |
| 99 | +\begin{itemize} |
| 100 | +\item Web app available at \url{https://regex-generate.github.io/regenerate/} |
| 101 | +\item OCaml code available at \url{https://github.com/regex-generate/regenerate} |
| 102 | +\item Haskell code available at \url{https://github.com/peterthiemann/re-generate} |
| 103 | +\end{itemize} |
| 104 | +\bibliography{../biblio} |
| 105 | +\end{document} |
| 106 | + |
| 107 | +%%% Local Variables: |
| 108 | +%%% mode: latex |
| 109 | +%%% TeX-master: t |
| 110 | +%%% End: |