re is much slower than cpython #445
Comments
Regexes are their own compilation units: their DFAs are JIT-compiled and optimized separately, just like Python functions. You have 628 regexes, which take a long time to compile. I got a better time (still worse than CPython) with the JIT compilation for regexes turned off (
So I talked to one of the devs of our regex engine. The main problem is that your regexes use a lot of
Ah I see, I didn't know graal used a DFA internally, that explains things. I've had the same issue in rust, where I improved things by transforming the bounded repetitions back into unbounded ones before compilation. I'll see if I can do that for graal. From my understanding, though, the source dataset uses bounded repetitions to limit the risk of catastrophic backtracking in backtracking regex engines (like cpython's own). Is there a flag exposed somewhere which indicates whether a regex uses backtracking or finite automata, so that I only perform the rewriting when using a DFA?
Currently, no. Our regex engine seems to have a property for that, but we currently don't expose it on the Python Pattern object. (
OK, I'll go with an implementation check then, at least for the time being (assuming the rewriting plan works out).
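For reference, the implementation check mentioned above could look roughly like the sketch below. It is only an illustration: the helper names (`uses_automaton_regex`, `prepare_pattern`) are made up for this example, and the set of names that GraalPy reports through `sys.implementation.name` is an assumption that should be verified against the GraalPy version actually in use.

```python
import sys

# Assumption: GraalPy reports one of these names in sys.implementation.name.
# Verify this against the GraalPy release you target before relying on it.
AUTOMATON_IMPLEMENTATIONS = frozenset({"graalpy", "graalpython"})


def uses_automaton_regex(implementation_name=None):
    """Guess whether `re` is backed by a finite-automaton engine.

    This is a heuristic based on the interpreter name only; it does not
    inspect the regex engine itself.
    """
    name = implementation_name
    if name is None:
        name = sys.implementation.name
    return name.lower() in AUTOMATON_IMPLEMENTATIONS


def prepare_pattern(pattern, simplifier, implementation_name=None):
    """Apply `simplifier` only when an automaton-based engine is assumed."""
    if uses_automaton_regex(implementation_name):
        return simplifier(pattern)
    return pattern
```

The optional `implementation_name` argument exists only so the behaviour can be exercised on any interpreter; in normal use it would be left out.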
@msimacek with a "simplifier" added to the script, the timings do improve by about 25% on this machine, but that's still quite far behind cpython.

    > graalpy run.py -c
    75158 lines in 35.8s
    476.8 us/line
    graalpy run.py -c 142.91s user 0.93s system 391% cpu 36.729 total

The very minor improvement in runtime you noticed without regex compilation does seem to hold here (I get about 1 second less with the options you provided), though the main benefit is likely CPU consumption dropping from 400% to 100%. Somewhat oddly, it doesn't seem to change memory consumption much, if at all, even though that was by far the biggest effect on both rust-regex and re2.

Here's the simplification routine I implemented:

    import re

    # Bounded repetitions with a lower bound of 0 or 1 and an upper bound
    # of 3 or more digits, e.g. {0,500} or {1, 1000}.
    REPETITION_PATTERN = re.compile(r"\{(0|1)\s*,\s*\d{3,}\}")
    # \d or \w, either inside a bracketed class (group 1) or standalone (group 2).
    CLASS_PATTERN = re.compile(
        r"""
        \[[^]]*\\(d|w)[^]]*\]
        |
        \\(d|w)
        """,
        re.VERBOSE,
    )

    def class_replacer(m: re.Match[str]) -> str:
        # Inside an existing class, splice in bare ranges; otherwise wrap them in brackets.
        d, w = ("0-9", "A-Za-z0-9_") if m[1] else ("[0-9]", "[A-Za-z0-9_]")
        return m[0].replace(r"\d", d).replace(r"\w", w)

    def fa_simplifier(pattern: str) -> str:
        # {0,N} becomes *, {1,N} becomes +
        pattern = REPETITION_PATTERN.sub(lambda m: "*" if m[1] == "0" else "+", pattern)
        return CLASS_PATTERN.sub(class_replacer, pattern)

It basically converts repetitions with a lower bound of 0 or 1 and an upper bound of 3 or more digits to unbounded ones, and replaces the perl-style character classes with enumerations (that might be completely irrelevant for tregex; it's useful for rust-regex, as its perl-style character classes are full unicode and thus take a large amount of state).
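To make the effect of the routine concrete, here is a small self-contained run of it on a few patterns. The routine is repeated so the snippet runs on its own; the sample patterns are invented for illustration and are not taken from the dataset in the issue. Note that the rewrite is not strictly equivalent: an unbounded repetition accepts inputs longer than the original bound, and `[0-9]` is narrower than `\d` on `str` patterns, which also matches non-ASCII digits.

```python
import re

# Same routine as in the comment above, repeated so this runs standalone.
REPETITION_PATTERN = re.compile(r"\{(0|1)\s*,\s*\d{3,}\}")
CLASS_PATTERN = re.compile(
    r"""
    \[[^]]*\\(d|w)[^]]*\]
    |
    \\(d|w)
    """,
    re.VERBOSE,
)


def class_replacer(m: "re.Match[str]") -> str:
    d, w = ("0-9", "A-Za-z0-9_") if m[1] else ("[0-9]", "[A-Za-z0-9_]")
    return m[0].replace(r"\d", d).replace(r"\w", w)


def fa_simplifier(pattern: str) -> str:
    pattern = REPETITION_PATTERN.sub(lambda m: "*" if m[1] == "0" else "+", pattern)
    return CLASS_PATTERN.sub(class_replacer, pattern)


# Hypothetical sample patterns, chosen to hit each rewrite rule once.
SAMPLES = [
    r"v\d{1,200}",       # standalone \d, large bound with lower bound 1
    r"[\w.-]{0,500}@",   # \w inside a class, large bound with lower bound 0
    r"x{2,5}",           # small bound: left untouched
    r"y{0,99}",          # two-digit bound: left untouched
]

for original in SAMPLES:
    print(f"{original!r} -> {fa_simplifier(original)!r}")
```

Only quantifiers whose upper bound has at least three digits are rewritten, so short bounds such as `{2,5}` or `{0,99}` pass through unchanged.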
I've been adding graal support to a classifier-type project naively based on applying a bunch of regexes to an input, and while Graal works, the regex application is quite slow: it's about 4x slower than cpython, while using 4 times the CPU.
Here's a repro script and attending data (basically a cut down version of the naive classifier implementation): script.zip
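For readers without the attachment, the general shape of such a benchmark might look like the sketch below. This is not the attached script: the function names, the pattern table, and the input lines are all hypothetical, and only the output format (lines, seconds, microseconds per line) mirrors the figures quoted in this thread.

```python
import re
import time


def classify(line, compiled):
    """Return the labels of every pattern that matches somewhere in the line."""
    return [label for label, rx in compiled if rx.search(line)]


def benchmark(lines, patterns):
    """Apply every pattern to every line and report wall-clock timings."""
    # Compile once up front so the loop measures matching only.
    compiled = [(label, re.compile(p)) for label, p in patterns.items()]
    start = time.perf_counter()
    results = [classify(line, compiled) for line in lines]
    elapsed = time.perf_counter() - start
    print(f"{len(lines)} lines in {elapsed:.1f}s")
    print(f"{elapsed / max(len(lines), 1) * 1e6:.1f} us/line")
    return results


if __name__ == "__main__":
    # Hypothetical data; the real repro uses 628 regexes and tens of thousands of lines.
    demo_patterns = {"versioned": r"v\d{1,100}", "email": r"[\w.-]{1,200}@"}
    demo_lines = ["release v42", "contact: someone@example.org", "nothing here"]
    benchmark(demo_lines, demo_patterns)
```

Compiling outside the timed section keeps the measurement focused on matching; on an engine that JIT-compiles regexes lazily, part of the compilation cost may still show up inside the loop.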
timings:
This is on a 10-core M1 Pro. Using cpusampler I confirmed that essentially all the "user" time is in _sre: