Timing and Waiting
It was time to confront the naked truth. I did some timings (time_sre.py) on different Python/_sre combinations and this is how it turned out:
CPython 2.4 with _sre.c:
Pure literals: re.search(r'bar', 'bazbarfoo')
100 passes took 0.000492, 0.000005 per pass
Classes and stuff: re.search(r'\d+.\d+\s\w{,2}', 'Price 144,50 USD')
100 passes took 0.000891, 0.000009 per pass
Branching and grouping: re.search(r'<(strong|b|em)>.+?', 'Bla <em>bla</em>')
100 passes took 0.000595, 0.000006 per pass
CPython 2.4 with _sre.py:
Pure literals: re.search(r'bar', 'bazbarfoo')
100 passes took 0.011108, 0.000111 per pass
Classes and stuff: re.search(r'\d+.\d+\s\w{,2}', 'Price 144,50 USD')
100 passes took 0.357161, 0.003572 per pass
Branching and grouping: re.search(r'<(strong|b|em)>.+?', 'Bla <em>bla</em>')
100 passes took 0.188313, 0.001883 per pass
PyPy with faked _sre:
Pure literals: re.search(r'bar', 'bazbarfoo')
100 passes took 2.404304, 0.024043 per pass
Classes and stuff: re.search(r'\d+.\d+\s\w{,2}', 'Price 144,50 USD')
100 passes took 2.269383, 0.022694 per pass
Branching and grouping: re.search(r'<(strong|b|em)>.+?', 'Bla <em>bla</em>')
100 passes took 2.268665, 0.022687 per pass
PyPy with _sre.py:
Pure literals: re.search(r'bar', 'bazbarfoo')
100 passes took 47.365551, 0.473656 per pass
Classes and stuff: re.search(r'\d+.\d+\s\w{,2}', 'Price 144,50 USD')
100 passes took 988.771270, 9.887713 per pass
Branching and grouping: re.search(r'<(strong|b|em)>.+?', 'Bla <em>bla</em>')
100 passes took 604.840673, 6.048407 per pass
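The time_sre.py script itself is not reproduced here. As a rough sketch only, a harness that prints results in the same shape as the output above could look like the following. The names `CASES`, `time_search` and `format_result` are hypothetical, and `time.perf_counter` is an assumption about how one would measure this on a current Python, not a statement about what the original script did.

```python
import re
import time

# Labels, patterns and subject strings are copied from the results above.
# Everything else in this sketch is assumed, not taken from time_sre.py.
CASES = [
    ("Pure literals", r'bar', 'bazbarfoo'),
    ("Classes and stuff", r'\d+.\d+\s\w{,2}', 'Price 144,50 USD'),
    ("Branching and grouping", r'<(strong|b|em)>.+?', 'Bla <em>bla</em>'),
]


def time_search(pattern, string, passes=100):
    """Return the total wall-clock seconds for `passes` calls to re.search."""
    start = time.perf_counter()
    for _ in range(passes):
        re.search(pattern, string)
    return time.perf_counter() - start


def format_result(label, pattern, string, total, passes=100):
    """Format one measurement as two lines: the call, then total and per-pass time."""
    return "%s: re.search(r'%s', '%s')\n%d passes took %f, %f per pass" % (
        label, pattern, string, passes, total, total / passes)


if __name__ == "__main__":
    for label, pattern, string in CASES:
        total = time_search(pattern, string)
        print(format_result(label, pattern, string, total))
```

Note that the `re` module caches compiled patterns, so in a loop like this only the first pass pays for compilation and the remaining passes mostly measure matching.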
In summary: _sre.py on top of CPython is about 20 times slower than the native _sre.c for pure literals, and roughly 300 to 400 times slower for the character class and branching cases. PyPy with _sre.py has such an ungodly overhead that I won’t even bother to spell out the orders of magnitude it’s slower than CPython … The upside of this is that I will get very satisfying performance improvements from rewriting parts of _sre.py in RPython.
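The slowdown factors can be recomputed directly from the totals listed above. The snippet below is just that arithmetic: the numbers are copied from the timing output, while the names `TOTALS`, `BASELINE` and `slowdown` are made up for this illustration.

```python
# Total seconds for 100 passes, copied from the timing output above.
# Tuple order: pure literals, classes and stuff, branching and grouping.
TOTALS = {
    "CPython 2.4 with _sre.c": (0.000492, 0.000891, 0.000595),
    "CPython 2.4 with _sre.py": (0.011108, 0.357161, 0.188313),
    "PyPy with faked _sre": (2.404304, 2.269383, 2.268665),
    "PyPy with _sre.py": (47.365551, 988.771270, 604.840673),
}

# Every other setup is compared against the native C implementation.
BASELINE = "CPython 2.4 with _sre.c"


def slowdown(setup):
    """Return, per test case, how many times slower `setup` is than the baseline."""
    return [t / b for t, b in zip(TOTALS[setup], TOTALS[BASELINE])]


if __name__ == "__main__":
    for setup in TOTALS:
        factors = ", ".join("%.0fx" % r for r in slowdown(setup))
        print("%s: %s" % (setup, factors))
```

Dividing totals rather than per-pass times avoids the rounding in the per-pass column, which has too few digits to be useful for the fastest setup.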
Another very interesting datapoint would be the performance of translated/compiled PyPy with _sre.py. I will try that later, when translation is stable again.