It depends... on what method of matching you use and what operations one wishes to perform while matching a string against a regexp. See Regular Expression matching in the wild: Match for the possible questions one might ask of a regular expression matcher. I'll list them out here for brevity.
-
Does the regexp match the whole string?
This question can be answered by constructing a DFA from the regexp and simulating the input on the resulting DFA.
-
Does the regexp match a substring of the string?
This question can be answered by prefixing the original regular expression with .* and constructing a DFA from the regexp and simulating the input on the resulting DFA.
-
Does the regexp match a substring of the string? If so, where?
This question can be answered by prefixing the original regular expression with .* and constructing a DFA from the regexp and simulating the input on the resulting DFA. Then, we construct the reverse DFA and match the reverse DFA with the reversed input, starting from the point where the match terminated, and trace where the reverse DFA matches the reversed string.
-
Does this regexp match this string? If so, where? Where are the submatches?
This question can only be answered by constructing an NFA from the regexp and simulating the input on the resulting NFA.
-
Constructing an NFA from a regular expression using Thompson's construction costs
O(m)
, wherem
is the length of the regular expression. -
Constructing a DFA from a regular expression involves first constructing an NFA from the regular expression and then converting the NFA to a DFA. Converting an NFA to a DFA costs O(2mS), where
m
is the length of the regular expression, andS
is the number of symbols in the input alphabet.
-
Matching an NFA with
m
states against a string of lengthn
costs O(mn). See Implementation: Simulating the NFA for details on how to simulate an NFA on an input string. The extra space requirement is O(m). -
Matching an DFA with
m
states against a string of lengthn
costs O(n). The extra space requirement is O(1). -
Matching an NFA with
m
states andc
capture groups against a string of lengthn
costs O(mn log c). Thelog c
factor comes from the fact that capture sets need to be updated and copied from one node to the next in the queue of nodes. This copy can be performed using a functional data structure using path copying with an extra pointer, which allows us to support this case using O(m) extra memory, which is the same as matching using an NFA without supporting sub-match captures.
To capture sub-matches, we need to create pseudo nodes in the NFA that update the capture set and tag the corresponding sub-match group as having started/ended when that state is hit (typically when an open parenthesis and a close parenthesis in the regular expression is seen).
We start off with an empty list (balanced binary tree search tree keyed by capture group number) that represents the capture set for the initial node in the NFA. Every time we want to update the capture set, we need to insert/update the corresponding node in the binary tree, which costs O(log c) per operation, but the space requirement per operation is amortized O(1) since we use a persistent data structre with path copying and an extra pointer for this operation.
If we don't update the capture set, we can just copy the root node of the binary tree that holds the capture set, and the cost of doing so is O(1). However, in the worst case, we could copy capture sets for every node added to the queue. Every node can represent only at most one open or close parenthesis, so we update at most one sub-match capture group per node added to the queue.
- Regular Expression Matching Can Be Simple And Fast and an insightful comment set
- Regular Expression Matching: the Virtual Machine Approach
- Regular Expression Matching in the Wild
- Regular Expression Matching with a Trigram Index OR How Google Code Search Worked
- Implementation of Algorithms for State Minimisation and Conversion of Regular Expressions to and from Finite Automata
- Compiler construction toolkit: Regexp to NFA
- Regular Expressions/Implementations: Wikibooks
- Path Copying with an Extra Pointer to implement balanced search trees using O(1) extra memory per operation