Help:CirrusSearch/RegexTooComplex

From mediawiki.org
Note: When you edit this page, you agree to release your contribution under CC0. See the Public Domain Help Pages for more information.

The insource:// syntax implements reasonably efficient regular expression searches written in Lucene's dialect. For efficiency reasons, there is a limit on how complex these regexes can be.

What syntax is considered complex?

The biggest increases in complexity come from non-determinism followed by repetition. That looks like this:

insource:/[ac]*a[ac]{50,200}/

The [ac]* part is non-deterministic, because the a that follows it could also be matched by [ac]*, and [ac]{50,200} is a repeat over those same characters. On the other hand, this is better:

insource:/[ac]*a[de]{50,200}/

This is better because [de]{50,200} doesn't overlap with [ac]*. It is still complex and still cannot be fully accelerated, but it isn't rejected outright and we will try to match it.

Generally, a bounded repetition such as {50,100} adds more complexity than it is worth. It is better to use a plain repeat such as + or *, so:

insource:/[ac]*a.*[^"]+\"/

is much less complex than:

insource:/[ac]*a.*[^"]{50,100}\"/
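Note that the simpler pattern is also looser: it accepts any run of one or more non-quote characters, not only runs of 50 to 100. The sketch below illustrates this using Python's re module as a stand-in; its dialect is not identical to Lucene's, and the sample strings and the matches helper are made up for illustration.

```python
import re

# Python's re stands in for Lucene's dialect here; the two differ in places.
bounded = re.compile(r'[ac]*a.*[^"]{50,100}"')
unbounded = re.compile(r'[ac]*a.*[^"]+"')

# Made-up sample texts, purely for illustration.
long_run = 'a' + 'x' * 60 + '"'   # 60 non-quote characters before the quote
short_run = 'a x"'                # only two non-quote characters before the quote


def matches(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return pattern.search(text) is not None


print(matches(bounded, long_run), matches(unbounded, long_run))    # True True
print(matches(bounded, short_run), matches(unbounded, short_run))  # False True
```

Everything the bounded pattern matches is also matched by the unbounded one, so the simpler query may return extra results that need to be narrowed by other means.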

Why?

Lucene compiles regular expressions to deterministic finite automata (DFAs). It does this by first converting the regular expression to a non-deterministic finite automaton (NFA) and then converting that NFA to a DFA. The worst-case complexity of that conversion is exponential in the number of NFA states, and the number of NFA states grows with the size of the regular expression. Non-determinism followed by repetition triggers that exponential state growth. We limit the number of states to 20,000 to prevent the automata from eating all of our memory.
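The state growth can be illustrated with a textbook subset construction. The following is a minimal sketch, not Lucene's actual code: it hand-builds an NFA for a simplified pattern of the form [prefix]*a[repeat]{n} (an exact count n instead of a range such as {50,200}) and counts the DFA states reachable from the start state. The function names build_nfa and count_dfa_states, and the way the 20,000 cap is enforced here, are assumptions made for illustration.

```python
from collections import deque


def build_nfa(prefix_chars, repeat_chars, n):
    """Hand-built NFA for the pattern [prefix_chars]*a[repeat_chars]{n}.

    States are 0..n+1, with 0 as the start state and n+1 as the accepting
    state. Returns a dict mapping (state, char) to a set of next states.
    """
    trans = {}
    for ch in prefix_chars:
        trans.setdefault((0, ch), set()).add(0)    # loop for [prefix_chars]*
    trans.setdefault((0, "a"), set()).add(1)       # the literal 'a'
    for i in range(1, n + 1):
        for ch in repeat_chars:
            trans.setdefault((i, ch), set()).add(i + 1)  # one step of {n}
    return trans


def count_dfa_states(trans, alphabet, limit=20000):
    """Count DFA states reachable via subset construction (dead state excluded).

    Raises ValueError once more than `limit` states have been created,
    loosely mimicking a cap on automaton size.
    """
    start = frozenset({0})
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for ch in alphabet:
            nxt = frozenset(t for s in current for t in trans.get((s, ch), ()))
            if nxt and nxt not in seen:
                seen.add(nxt)
                if len(seen) > limit:
                    raise ValueError("regex too complex: state limit exceeded")
                queue.append(nxt)
    return len(seen)


for n in (3, 10):
    overlapping = count_dfa_states(build_nfa("ac", "ac", n), "ac")
    disjoint = count_dfa_states(build_nfa("ac", "de", n), "acde")
    print(n, overlapping, disjoint)
# prints "3 16 5" and then "10 2048 12"
```

In this simplified model the overlapping pattern needs 2^(n+1) DFA states, because the automaton has to remember where every a occurred among the last n+1 characters, while the non-overlapping pattern needs only n+2. With n=15 the overlapping case already exceeds the 20,000 cap used in the sketch and raises an error.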