Sooner or later most developers run into a regular expression that works fine most of the time, but “hangs” on certain strings, consuming 100% of CPU.
In that case a web browser offers to kill the script and reload the page. Not a good thing for sure.
Let’s say we have a string, and we’d like to check if it consists of words \w+ with an optional space \s? after each.
An obvious way to construct a regexp would be to take a word followed by an optional space \w+\s? and then repeat it with the star quantifier *.
That leads us to the regexp ^(\w+\s?)*$: it specifies zero or more such words, starting at the beginning ^ and finishing at the end $ of the line.
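As a quick sanity check, the pattern can be tried on a couple of short strings. This is only a minimal sketch: the sample strings are illustrative, and console.log is used so that it runs in Node.js as well as in a browser console.

```javascript
// The pattern from the text: zero or more words, each with an optional trailing space.
const regexp = /^(\w+\s?)*$/;

// Illustrative inputs; both are short, so the check returns immediately.
const good = regexp.test("A good string");
const bad = regexp.test("Bad characters: $@#");

console.log(good); // true
console.log(bad);  // false
```

On inputs this short the result is correct and fast; the trouble only shows up on longer strings that almost match.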
To be fair, let’s note that some regular expression engines can handle such a search efficiently. But most of them can’t. Browser engines usually hang.
What’s the matter? Why does the regular expression hang?
To understand that, let’s simplify the example: remove spaces \s?. Then it becomes ^(\w+)*$.
And, to make things more obvious, let’s replace \w with \d. The resulting regular expression ^(\d+)*$ still hangs, for instance on a long run of digits followed by a non-digit character.
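To get a feeling for the slowdown without freezing anything, one option is to time the simplified pattern on short inputs of growing length. The helper name timeFailedMatch and the chosen lengths are assumptions made for this sketch, and actual timings depend on the engine and the machine.

```javascript
// Simplified pattern from the text.
const regexp = /^(\d+)*$/;

// Hypothetical helper: time one failed match against n digits followed by "z".
function timeFailedMatch(n) {
  const str = "1".repeat(n) + "z";
  const start = Date.now();
  const matched = regexp.test(str);
  return { n, matched, ms: Date.now() - start };
}

// Keep n small: in a backtracking engine each extra digit roughly doubles the work.
const results = [10, 14, 18, 20].map(timeFailedMatch);
for (const r of results) {
  console.log(`n=${r.n}: matched=${r.matched}, ~${r.ms} ms`);
}
```

In a backtracking engine the reported time should climb steeply with n, which is why the lengths are deliberately kept small here.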
So what’s wrong with the regexp?
First, one may notice that the regexp (\d+)* is a little bit strange. The quantifier * looks extraneous. If we want a number, we can use \d+.
Indeed, the regexp is artificial; we got it by simplifying the previous example. But the reason why it is slow is the same. So let’s understand it, and then the previous example will become obvious.
What happens during the search of ^(\d+)*$ in the line 123456789z, and why does it take so long? (The line is shortened a bit for clarity. Note the non-digit character z at the end: it’s important.)
Here’s what the regexp engine does:
First, the regexp engine tries to find the content of the parentheses: the number \d+. The plus + is greedy by default, so it consumes all digits:
\d+........
(123456789)z
After all digits are consumed, \d+ is considered found (as 123456789).
Then the star quantifier (\d+)* applies. But there are no more digits in the text, so the star doesn’t give anything.
The next character in the pattern is the string end $. But in the text we have z instead, so there’s no match:
           X
\d+........$
(123456789)z
As there’s no match, the greedy quantifier + decreases the count of repetitions and backtracks one character.
Now \d+ takes all digits except the last one (12345678):
\d+.......
(12345678)9z
Then the engine tries to continue the search from the next position (right after 12345678).
The star (\d+)* can be applied: it gives one more match of \d+, the number 9:
\d+.......\d+
(12345678)(9)z
The engine tries to match $ again, but fails, because it meets z instead:
             X
\d+.......\d+
(12345678)(9)z
There’s no match, so the engine will continue backtracking, decreasing the number of repetitions. Backtracking generally works like this: the last greedy quantifier decreases the number of repetitions for as long as it can. Then the previous greedy quantifier decreases, and so on.
All possible combinations are attempted. Here are some examples.
The first number \d+ has 7 digits, and then a number of 2 digits:
             X
\d+......\d+.
(1234567)(89)z
The first number has 7 digits, and then two numbers of 1 digit each:
               X
\d+......\d+\d+
(1234567)(8)(9)z
The first number has 6 digits, and then a number of 3 digits:
             X
\d+.....\d+..
(123456)(789)z
The first number has 6 digits, and then 2 numbers:
               X
\d+.....\d+.\d+
(123456)(78)(9)z
…And so on.
There are many ways to split a sequence of digits 123456789 into numbers. To be precise, there are 2^n - 1 of them, where n is the length of the sequence.
- For 123456789 we have n=9, that gives 511 combinations.
- For a longer sequence with n=20 there are about one million (1048575) combinations.
- For n=30 – a thousand times more (1073741823 combinations).
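The figures above can be reproduced with a small counting routine. The sketch below assumes that a “combination” means any way for (\d+)* to consume a non-empty prefix of the digits, cut into consecutive groups. That reading is an interpretation chosen here because it yields 2^n - 1, and the function name countCombinations is hypothetical.

```javascript
// Count the ways (\d+)* can consume a non-empty prefix of n digits,
// i.e. every way to cut the first k digits (1 <= k <= n) into consecutive groups.
function countCombinations(n) {
  // ways[k] = number of ways to cut exactly k digits into non-empty groups
  const ways = [1]; // zero digits: one (empty) way
  for (let k = 1; k <= n; k++) {
    let total = 0;
    for (let first = 1; first <= k; first++) {
      total += ways[k - first]; // choose the length of the first group
    }
    ways.push(total);
  }
  // Sum over all non-empty prefixes.
  return ways.slice(1).reduce((a, b) => a + b, 0);
}

console.log(countCombinations(9));  // 511
console.log(countCombinations(20)); // 1048575
console.log(countCombinations(30)); // 1073741823
```

Each extra digit doubles the count (plus one), which is the exponential growth behind the hang.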
Trying each of them is exactly the reason why the search takes so long.
A similar thing happens in our first example, when we look for words by the pattern ^(\w+\s?)*$ in the string An input that hangs!.
The reason is that a word can be represented as one \w+ or many:
(input)
(inpu)(t)
(inp)(u)(t)
(in)(p)(ut)
...
For a human, it’s obvious that there can be no match, because the string ends with an exclamation sign !, but the regular expression expects a word character \w or a space \s at the end. But the engine doesn’t know that.
It tries all combinations of how the regexp (\w+\s?)* can “consume” the string, including variants with spaces (\w+\s)* and without them (\w+)* (because spaces \s? are optional). As there are many such combinations (we’ve seen it with digits), the search takes a lot of time.
What to do?
Should we turn on the lazy mode?
Unfortunately, that won’t help: if we replace \w+ with \w+?, the regexp will still hang. The order of combinations will change, but not their total count.
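A rough way to see this is to run the greedy and the lazy variant side by side on a short failing input. The helper timed and the 18-character input are assumptions made for this sketch; timings vary by engine and machine, so only the match results are meant to be relied on.

```javascript
const greedy = /^(\w+\s?)*$/;
const lazy = /^(\w+?\s?)*$/;

// Hypothetical helper: run one test and report the result with a rough timing.
function timed(regexp, str) {
  const start = Date.now();
  const matched = regexp.test(str);
  return { matched, ms: Date.now() - start };
}

// Short illustrative input: 18 word characters and a trailing "!".
const str = "a".repeat(18) + "!";
const greedyRun = timed(greedy, str);
const lazyRun = timed(lazy, str);

console.log("greedy:", greedyRun);
console.log("lazy:  ", lazyRun);
```

Both variants reject the string, and in a backtracking engine both are expected to take a comparable amount of time, since the same set of splits has to be ruled out.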
Some regular expression engines have tricky tests and finite automata that allow them to avoid going through all combinations or make the search much faster, but most engines don’t, and even then it doesn’t always help.
There are two main approaches to fixing the problem.
The first is to lower the number of possible combinations.
Let’s make the space non-optional by rewriting the regular expression as ^(\w+\s)*\w*$: we’ll look for any number of words followed by a space (\w+\s)*, and then (optionally) a final word \w*.
This regexp is equivalent to the previous one (matches the same strings) and works well.
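The following sketch spot-checks both claims: that the two patterns agree on a handful of short inputs, and that the rewritten one rejects a long, almost-matching string quickly. The sample strings are illustrative; a few samples can suggest equivalence but do not prove it.

```javascript
const original = /^(\w+\s?)*$/;
const rewritten = /^(\w+\s)*\w*$/;

// Illustrative short inputs: both patterns should agree on each of them.
const samples = ["", "word", "two words", "trailing space ", "two  spaces", "bad!"];
const agree = samples.every((s) => original.test(s) === rewritten.test(s));
console.log(agree); // true

// A long input ending in "!"; only the rewritten pattern is run on it.
const long = "An input string that takes a long time or even makes this regexp hang!";
const longResult = rewritten.test(long);
console.log(longResult); // false
```

The original pattern is deliberately not run on the long string, because in a backtracking engine that call may not return in reasonable time.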
Why did the problem disappear?
That’s because now the space is mandatory.
The previous regexp, if we omit the space, becomes (\w+)*, leading to many combinations of \w+ within a single word.
So input could be matched as two repetitions of \w+, like this:
\w+  \w+
(inp)(ut)
The new pattern is different: (\w+\s)* specifies repetitions of words followed by a space! The input string can’t be matched as two repetitions of \w+\s, because the space is mandatory.
The time needed to try a lot of (actually most of) combinations is now saved.
It’s not always convenient to rewrite a regexp though. In the example above it was easy, but it’s not always obvious how to do it.
Besides, a rewritten regexp is usually more complex, and that’s not good. Regexps are complex enough without extra effort.
Luckily, there’s an alternative approach. We can forbid backtracking for the quantifier.
The root of the problem is that the regexp engine tries many combinations that are obviously wrong for a human.
E.g. in the regexp (\d+)*$ it’s obvious to a human that + shouldn’t backtrack. If we replace one \d+ with two separate \d+\d+, nothing changes:
\d+........
(123456789)!

\d+...\d+....
(1234)(56789)!
And in the original example ^(\w+\s?)*$ we may want to forbid backtracking in \w+. That is: \w+ should match a whole word, with the maximal possible length. There’s no need to lower the repetitions count in \w+, or to try to split it into two words \w+\w+ and so on.
Many modern regular expression engines support possessive quantifiers for that. Regular quantifiers become possessive if we add + after them. That is, we use \d++ instead of \d+ to stop + from backtracking.
Possessive quantifiers are in fact simpler than “regular” ones. They just match as many as they can, without any backtracking. The search process without backtracking is simpler.
There are also so-called “atomic groups” – a way to disable backtracking inside parentheses.
Unfortunately, JavaScript supports neither possessive quantifiers nor atomic groups. We can emulate them though using a “lookahead transform”.
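Whether a given engine accepts these syntaxes can be probed directly. In the sketch below the helper name compiles is hypothetical; in current JavaScript engines the possessive form \d++ and the atomic-group form (?>\d+) are expected to be rejected with a SyntaxError, though that could change if the language ever adopts them.

```javascript
// Hypothetical helper: report whether a pattern source compiles in this engine.
function compiles(source) {
  try {
    new RegExp(source);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected
  }
}

console.log(compiles("\\d+"));     // regular greedy quantifier
console.log(compiles("\\d++"));    // possessive quantifier syntax
console.log(compiles("(?>\\d+)")); // atomic group syntax
```

That is the reason an emulation is needed in JavaScript rather than a built-in construct.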
So we’ve come to really advanced topics. We’d like a quantifier, such as +, not to backtrack, because sometimes backtracking makes no sense.
The pattern to take as many repetitions of \w as possible without backtracking is: (?=(\w+))\1. Of course, we could take another pattern instead of \w.
That may seem odd, but it’s actually a very simple transform.
Let’s decipher it:
- Lookahead ?= looks forward for the longest word \w+ starting at the current position.
- The contents of parentheses with ?=... isn’t memorized by the engine, so wrap \w+ into parentheses. Then the engine will memorize their contents
- …And allow us to reference it in the pattern as \1.
That is: we look ahead – and if there’s a word \w+, then match it as \1.
Why? That’s because the lookahead finds a word \w+ as a whole and we capture it into the pattern with \1. So we essentially implemented a possessive plus + quantifier. It captures only the whole word \w+, not a part of it.
For instance, in the word JavaScript it can’t match only Java and leave out Script for the rest of the pattern to match.
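Here’s the comparison of two patterns applied to the word JavaScript, \w+Script and (?=(\w+))\1Script:
- In the first variant \w+ first captures the whole word JavaScript, but then + backtracks character by character, to try to match the rest of the pattern, until it finally succeeds (when \w+ matches Java).
- In the second variant (?=(\w+)) looks ahead and finds the word JavaScript, which is included into the pattern as a whole by \1, so there remains no way to find Script after it.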
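The two variants can be compared directly. This is a minimal sketch using console.log so that it runs in Node.js or a browser console; the variable names are illustrative.

```javascript
const word = "JavaScript";

// Regular greedy \w+ can give characters back so that "Script" still matches.
const withBacktracking = word.match(/\w+Script/);

// The lookahead captures the whole word, and \1 must repeat all of it.
const withoutBacktracking = word.match(/(?=(\w+))\1Script/);

console.log(withBacktracking && withBacktracking[0]); // JavaScript
console.log(withoutBacktracking);                     // null
```

The second result is null because nothing is left for Script once \1 has consumed the whole word, and the engine cannot go back into the lookahead to shorten the capture.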
We can put a more complex regular expression into (?=(\w+))\1 instead of \w, when we need to forbid backtracking for + after it.
There’s more about the relation between possessive quantifiers and lookahead in articles Regex: Emulate Atomic Grouping (and Possessive Quantifiers) with LookAhead and Mimicking Atomic Groups.
Let’s rewrite the first example using lookahead to prevent backtracking. The pattern ^(\w+\s?)*$ becomes ^((?=(\w+))\2\s?)*$.
Here \2 is used instead of \1, because there are additional outer parentheses. To avoid mixing up the numbers, we can give the parentheses a name and refer to them by that name.
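Both forms of the rewritten pattern can be sketched as follows. The group name word and the sample strings are illustrative choices made here; the named form relies on the (?<name>...) group and \k<name> backreference syntax of modern JavaScript.

```javascript
// Numbered version: the outer group is 1, the lookahead capture is 2.
const numbered = /^((?=(\w+))\2\s?)*$/;

// Named version; the name "word" is an illustrative choice.
const named = /^((?=(?<word>\w+))\k<word>\s?)*$/;

const good = "A good string";
const bad = "An input string that takes a long time or even makes this regexp hang!";

console.log(numbered.test(good), named.test(good)); // true true
console.log(numbered.test(bad), named.test(bad));   // false false
```

Since each word is now taken as a whole, the failing input is rejected after a single pass over its words instead of an exponential number of splits.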
The problem described in this article is called “catastrophic backtracking”.
We covered two ways to solve it:
- Rewrite the regexp to lower the possible combinations count.
- Prevent backtracking.