Archive for the '_sre' Category

Approaching acceptable performance

Wednesday, August 31st, 2005

I finally did the timings of _sre on pypy-c, the C translated PyPy:
Pure literals: re.search(r'bar', 'bazbarfoo')
100 passes took 1.067957, 0.010680 per pass
Classes and stuff: re.search(r'\d+.\d+\s\w{,2}', 'Price 144,50 USD')
100 passes took 1.278839, 0.012788 per pass
Branching and grouping: re.search(r'<(strong|b|em)>.+?', 'Bla <em>bla</em>')
100 passes took 1.369234, 0.013692 per pass
I am pleasantly surprised: this is only around 1000 times slower […]
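
For readers who want to reproduce numbers of this shape, here is a minimal sketch of a timing loop. It is not the actual time_sre.py script: the names CASES, time_case and run are made up for illustration, and it assumes a modern Python 3 with time.perf_counter. Only the patterns, the subject strings and the output format are taken from the timings above.

```python
import re
import time

# Labels, patterns and subject strings copied from the timings above.
CASES = [
    ("Pure literals", r"bar", "bazbarfoo"),
    ("Classes and stuff", r"\d+.\d+\s\w{,2}", "Price 144,50 USD"),
    ("Branching and grouping", r"<(strong|b|em)>.+?", "Bla <em>bla</em>"),
]


def time_case(pattern, subject, passes=100):
    """Return (total_seconds, seconds_per_pass) for `passes` re.search calls.

    Hypothetical helper, not taken from time_sre.py.
    """
    start = time.perf_counter()
    for _ in range(passes):
        re.search(pattern, subject)
    total = time.perf_counter() - start
    return total, total / passes


def run(passes=100):
    """Build report lines in the same layout as the results quoted above."""
    lines = []
    for label, pattern, subject in CASES:
        total, per_pass = time_case(pattern, subject, passes)
        lines.append("%s: re.search(%r, %r)" % (label, pattern, subject))
        lines.append("%d passes took %f, %f per pass" % (passes, total, per_pass))
    return lines


if __name__ == "__main__":
    print("\n".join(run()))
```

Note that the re module caches compiled patterns, so in a loop like this only the first pass pays for compilation; the remaining passes mostly measure matching.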

Final _sre.py release

Tuesday, August 30th, 2005

For what it’s worth, I assembled a final release of _sre.py. It’s available from the _sre.py website. Since the previous release, one bug in the definition of a Unicode word character was fixed and a timing script was added. In the absence of any bug reports, I consider it stable and done.
I’ll be compiling […]

Timing and Waiting

Wednesday, August 10th, 2005

It was time to confront the naked truth. I did some timings (time_sre.py) on different Python/_sre combinations and this is how it turned out:
CPython 2.4 with _sre.c:
Pure literals: re.search(r'bar', 'bazbarfoo')
100 passes took 0.000492, 0.000005 per pass
Classes and stuff: re.search(r'\d+.\d+\s\w{,2}', 'Price 144,50 USD')
100 passes took 0.000891, 0.000009 per pass
Branching and grouping: re.search(r'<(strong|b|em)>.+?', 'Bla <em>bla</em>')
100 passes […]

Released: _sre.py 2.4b

Tuesday, August 9th, 2005

I’ve just created another release of _sre.py. No bugs have been discovered since the first alpha release (which makes me a bit suspicious - I know they’re hiding in there somewhere). The only new things in this release are some optimizations, mainly for special cases such as pure literal regular expressions.
There is a tarball and a […]

First release

Monday, July 25th, 2005

I have a reputation for overestimating the time it takes to accomplish a given task. This has led to interesting discussions with software project managers in the past: “You told me you’d finish this in 2 days, so I assumed it’d take you 4 days, and now you’re telling me you completed it in 1 […]

Bit-fiddling

Sunday, July 24th, 2005

When I was talking about not walking in the park, I meant exactly the kind of thing that has held me up yesterday and today: a bug related to character sets, manifesting itself on one machine but not on the other, for seemingly random reasons.
Character sets are things like [a-cg-j] in a regular expression. […]
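
As a quick illustration of what such a character set accepts, here is a small sketch using the standard re module (the variable names are only for illustration):

```python
import re

# [a-cg-j] accepts any single character in the ranges a..c or g..j.
charset = re.compile(r"[a-cg-j]")

# Test each letter on its own and keep the ones the set accepts.
matched = [ch for ch in "abcdefghijk" if charset.match(ch)]
print("".join(matched))  # abcghij
```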

45 down, 6 to go

Thursday, July 21st, 2005

I’m making progress much faster than I thought. Blame Canada, er, blame the unspectacular weather this week, keeping my head cool and the urge to go out low. I am now down to just 6 (of 51 in total) regex unit tests from CPython 2.4 failing. Of these failures, 4 seem to be Unicode- or locale-related, […]

Exhaustive Testing

Wednesday, July 20th, 2005

I’ve been working a bit on test infrastructure today. I can now

run my own tests on CPython using _sre.py,
run my own tests on CPython using _sre.c (crosschecking that my tests are actually correct),
run the CPython re tests on CPython using _sre.py,
run my own tests on PyPy using _sre.py,
and run the CPython re tests on PyPy […]

Coder’s Little Helpers

Wednesday, July 20th, 2005

The Python regex implementation compiles regex patterns to an intermediate bytecode form. Since what I’m writing is basically the interpreter for this bytecode, I’ve spent quite some time trying to make sense of the numeric bytecode representation produced by sre. It’s not that hard, but you really lose track very quickly, mentally parsing a sequence like […]

To recurse or not to recurse

Sunday, July 17th, 2005

The university project I talked about was postponed until October, so I could already pick up SoC again on Friday. That day I implemented pretty much all regex syntax that doesn’t require interpreting a subexpression, among them most categories like \d (matches any digit) and character sets (e.g., [a-d] to match all letters from […]