What is the complexity of matching a string against a regular expression?

It depends... on what method of matching you use and what operations one wishes to perform while matching a string against a regexp. See Regular Expression matching in the wild: Match for the possible questions one might ask of a regular expression matcher. I'll list them out here for brevity.

Does the regexp match the whole string?

This question can be answered by constructing a DFA from the regexp and simulating the input on the resulting DFA.
Does the regexp match a substring of the string?

This question can be answered by prefixing the original regular expression with .* and constructing a DFA from the regexp and simulating the input on the resulting DFA.
Does the regexp match a substring of the string? If so, where?

This question can be answered by prefixing the original regular expression with .* and constructing a DFA from the regexp and simulating the input on the resulting DFA. Then, we construct the reverse DFA and match the reverse DFA with the reversed input, starting from the point where the match terminated, and trace where the reverse DFA matches the reversed string.
Does this regexp match this string? If so, where? Where are the submatches?

This question can only be answered by constructing an NFA from the regexp and simulating the input on the resulting NFA.

What are the costs of constructing an NFA and a DFA from a regular expression?

Constructing an NFA from a regular expression using Thompson's construction costs O(m), where m is the length of the regular expression.
Constructing a DFA from a regular expression involves first constructing an NFA from the regular expression and then converting the NFA to a DFA. Converting an NFA to a DFA costs O(2^mS), where m is the length of the regular expression, and S is the number of symbols in the input alphabet.

What are the costs of matching input against an NFA and a DFA?

Matching an NFA with m states against a string of length n costs O(mn). See Implementation: Simulating the NFA for details on how to simulate an NFA on an input string. The extra space requirement is O(m).
Matching an DFA with m states against a string of length n costs O(n). The extra space requirement is O(1).
Matching an NFA with m states and c capture groups against a string of length n costs O(mn log c). The log c factor comes from the fact that capture sets need to be updated and copied from one node to the next in the queue of nodes. This copy can be performed using a functional data structure using path copying with an extra pointer, which allows us to support this case using O(m) extra memory, which is the same as matching using an NFA without supporting sub-match captures.

Capturing sub-matches

To capture sub-matches, we need to create pseudo nodes in the NFA that update the capture set and tag the corresponding sub-match group as having started/ended when that state is hit (typically when an open parenthesis and a close parenthesis in the regular expression is seen).

We start off with an empty list (balanced binary tree search tree keyed by capture group number) that represents the capture set for the initial node in the NFA. Every time we want to update the capture set, we need to insert/update the corresponding node in the binary tree, which costs O(log c) per operation, but the space requirement per operation is amortized O(1) since we use a persistent data structre with path copying and an extra pointer for this operation.

If we don't update the capture set, we can just copy the root node of the binary tree that holds the capture set, and the cost of doing so is O(1). However, in the worst case, we could copy capture sets for every node added to the queue. Every node can represent only at most one open or close parenthesis, so we update at most one sub-match capture group per node added to the queue.

Useful Links

Regular Expression Matching Can Be Simple And Fast and an insightful comment set
Regular Expression Matching: the Virtual Machine Approach
Regular Expression Matching in the Wild
Regular Expression Matching with a Trigram Index OR How Google Code Search Worked
Implementation of Algorithms for State Minimisation and Conversion of Regular Expressions to and from Finite Automata
Compiler construction toolkit: Regexp to NFA
Regular Expressions/Implementations: Wikibooks
Path Copying with an Extra Pointer to implement balanced search trees using O(1) extra memory per operation

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

complexity.md

complexity.md

What is the complexity of matching a string against a regular expression?

What are the costs of constructing an NFA and a DFA from a regular expression?

What are the costs of matching input against an NFA and a DFA?

Capturing sub-matches

Useful Links

Files

complexity.md

Latest commit

History

complexity.md

File metadata and controls

What is the complexity of matching a string against a regular expression?

What are the costs of constructing an NFA and a DFA from a regular expression?

What are the costs of matching input against an NFA and a DFA?

Capturing sub-matches

Useful Links