Relaxing How-to

August 1st, 2005

It’s been a while since I’ve been so deep into a computer-related project as I’m now with SoC. I start to mentally program in Python as soon as I’m only half awake. It’s actually become somewhat annoying, because that prevents me from relaxing and getting other stuff done. Strategies to trick my mind into not thinking about Python for longer periods of time have mostly failed so far - until this weekend.

The solution: party hard, very hard. I will spare you the details about yesterday, it suffices to say that I didn’t get to bed before 7 am (today is a national holiday in Switzerland, so there was quite a bit of nightlife happening). After only a few hours of sleep I was in such a state of comfortable tiredness that I could divide my time blissfully apathetic between the TV couch and the chaise longue in the garden, totally forgetting SoC over the Gilmore Girls (of all things) and the latest Terry Pratchett novel. Until that nagging feeling of guilt crept up in the back of my head, and I had to go and blog this to make up for my laziness today. Oh well.

Not Lost in Translation

July 31st, 2005

I have been working on the array module the past days. The current version already passes all compliance tests when run on top of CPython. But the integration with PyPy still poses some potentially hard problems. They’re mostly caused by the tight coupling of the array module to C data types. E.g., how can I tell from Python whether a machine’s short integer is 2 or 4 bytes wide? Should I even respect that or just treat PyPy like a virtual machine with fixed-size data types across all architectures? These kinds of issues probably have to be worked out with PyPy core developers after I integrate the array module with the PyPy trunk at some point in the next few days.

A Python clone of the array module is of even less interest to the general public than _sre, so I’m tailoring it mostly to PyPy. This prompted me to delve a bit deeper into the innards of PyPy. My explorations mainly revolved around PyPy’s notion of RPython, a somewhat static subset of the Python language. Imagine Java with Python syntax where you try to use primitive types (integers, strings, etc.) most of the time. There is a description of RPython restrictions in the official docs.

The point of RPython is that PyPy can translate it to more efficient lower-level languages, the big target being C of course. So, I floated the idea of writing the array module directly in RPython to reclaim some efficiency by translating it to C (yeah, that is slightly ironic, after laboriously having translated it from C to Python). But after some back and forth with my mentors and on pypy-dev it turns out this makes little sense, because it’s currently too cumbersome. When PyPy has evolved a bit, this might get easier.

A few people (me included, before I entered the project) apparently have had the impression, that PyPy’s ultimate goal is to translate arbitrary Python code to C. This is definitely not the case. Given Python’s highly dynamic nature this would be rather difficult, if not to say impossible (without cheating and including the whole interpreter in the generated C code). RPython is mostly designed for the purpose of translating just the PyPy core interpreter to C. The imagined advantages and speed gains of PyPy will come from the ease of innovating and experimenting on top of it, e.g., from a better VM, a JIT compiler or garbage collection schemes other than CPython’s reference counting.

In related news: As a result of the on-going sprint in Hildesheim, the PyPy interpreter has just been successfully translated to C for the first time!

First release

July 25th, 2005

I have a reputation for overestimating the time it takes to accomplish a given task. This has led to interesting discussions with software project managers in the past: “You told me you’d finish this in 2 days, so I assumed it’d take you 4 days, and now you’re telling me you completed it in 1 day? Can I offer you a bottle of champagne?”

OK, so that was my meager attempt at a funny explanation why you get the alpha release already today instead of tomorrow. I released _sre.py 2.4a. It passes all compliance tests everywhere I tested it, so any remaining bugs will be rather obscure and can only be discovered by real-world use of the module. This is where I need your help. Grab the release as a tarball or directly from the SVN repository:

svn co http://codespeak.net/svn/user/nik/_sre_py/2.4a _sre_py

Now, from the base directory of the package, run the tests by executing in your shell:

$ python run_tests.py

If you get 2 OK’s, go relax. If you have any failures, please send the whole test output as well as information about your platform and Python version to me at nhaldimann AT gmx DOT ch. Thanks very much! Note: _sre.py is supposed to work with Python 2.3.x and 2.4.x, it’s not compatible with earlier versions and I haven’t tested with 2.5 from CVS.

If you’ve got some more spare time, you can do even more for me. Do you know any Python projects that use regular expressions extensively and have a comprehensive test suite? I welcome any hints since I want to test _sre.py on as much third-party code as possible. You can even do this yourself. First, you’ll want to install the package using distutils, i.e., still in _sre.py’s base directory, execute in your shell:

$ python setup.py install

This does not, however, override or shadow the standard _sre extension module (in fact it can’t because standard _sre is a builtin module, compiled directly into the interpreter). If you do “import _sre” now, you still get standard _sre. To use _sre.py from your code, you must insert the following two lines before any imports of the re module:

import _sre_support
_sre_support.import_sre_py()

I’ll make an example for the sake of clarity. Say you have unit tests in a file called test_frazzle.py, something like:

import re, unittest
...
class FrazzleTest(unittest.TestCase):
...

You add the _sre.py hook before the import of re:

import _sre_support
_sre_support.import_sre_py()
import re, unittest
...
class FrazzleTest(unittest.TestCase):
...

If you run test_frazzle.py now, _sre.py will be used. I would love to hear about any results you get from that.

Bit-fiddling

July 24th, 2005

When I was talking about not walking in the park, I meant exactly the kind of thing that has been holding me up yesterday and today: A bug related to character sets, manifesting itself on one machine but not on the other, for seemingly random reasons.

Character sets are things like [a-cg-j] in a regular expression. A big charset is one containing at least one character outside of the 8-bit range, e.g., all greek sets like [πΩµ]. When compiled to bytecode by sre, these are represented by a kind of triple-indexed bitmap data structure (think triple indirect i-node blocks) with the nice property of a constant-time membership test for any character.

For space efficiency, sre packs several indices, each 1 byte wide, in one bytecode. A bytecode is actually up to 4 bytes wide, represented by a Python long. Now, extracting bytes out of a Python long seems easy enough. Not! For hours I was stumped because my charset lookup algorithm would work on my iBook but would give strange results on my (x86) Debian machine. Until it finally dawned on me: It’s the endianness, stupid! I was shifting bits around manually, not accounting for the reverse order of bytes on the little-endian x86 compared to the big-endian PowerPC. It took me another hour or so to figure out how to rewrite this in an architecture-independent way using the array module (incidentally, my first real-world use of array).

Lesson learned: Low-level bit-fiddling is not Pythonic. Use the high-level facilities provided by the Python stdlib.

In related news: All compliance tests pass now! Even on UCS-4 builds. Realistically, I can post an alpha release of _sre.py on Tuesday.

45 down, 6 to go

July 21st, 2005

I’m making progress much faster than I thought. Blame Canada, er, blame the unspectacular weather this week, keeping my head cool and the urge to go out low. I am now down to just 6 (of totally 51) regex unit tests from CPython 2.4 failing. Of these failures, 4 seem to be unicode- or locale-related, the other 2 look like obscure bugs that I’m currently clueless about.

In terms of functionality and lines of code I’m now probably 90% done with _sre.py. But of course the remaining 10% are not going to be a walk in the park. I have willfully ignored unicode issues so far, and I expect recreating the exact, inconsequential unicode semantics of _sre.c to be somewhat painful. There are also a few optimizations that I haven’t implemented, yet.

The code is a bit messy in certain areas, too. After I reach the milestone of these last few compliance tests passing, I will refactor towards simplicity and readability and then put out an alpha release. I hope I will have that ready by the middle of next week.

Exhaustive Testing

July 20th, 2005

I’ve been working a bit on test infrastructure today. I can now

  • run my own tests on CPython using _sre.py,
  • run my own tests on CPython using _sre.c (crosschecking that my tests are actually correct),
  • run the CPython re tests on CPython using _sre.py,
  • run my own tests on PyPy using _sre.py,
  • and run the CPython re tests on PyPy using _sre.py.

Can you tell I’m seriously test-infected? At least one item is even still missing in this list. I need to run my tests on a UCS-4 build of CPython as well (a build that internally uses 4 bytes for every unicode character), because sre compiles some stuff differently for those.

All of this is glued together with my minimal automation framework Sysyphus. I basically use this as a substitute for shell scripts because I can never remember shell syntax. I now have a task AcceptanceTest in the build file that runs all of the possible test/platform combinations. This marks the goal for _sre.py at the end of SoC.

Coder’s Little Helpers

July 20th, 2005

The Python regex implementation compiles regex patterns to an intermediate bytecode form. Since what I’m writing is basically the interpreter for this bytecode, I’ve spent quite some time trying to make sense of numeric bytecode representation produced by sre. It’s not that hard, but you really loose track very quickly, mentally parsing a sequence like 17, 4, 0, 11, 0, 28, 29, 1, 65535, 21, 0, 21, 2, 19, 101, 19, 110, 19, 111, 19, 117, 19, 103, 19, 104, 21, 3, 19, 32, 19, 105, 19, 115, 21, 1, 22, 19, 32, 19, 2, 1. That stands for the pattern “((enough) is)+ \2“, by the way.

Today, I invested the whole of half an hour and about 20 lines of code into a small utility to display regex bytecode in symbolic form. It’s available from the _sre_support module. The _sre_support.opcodes(pattern) function returns a plain list of numeric bytecodes and _sre_support.dis(pattern), reminiscent of the dis module, prints a symbolic representation. Here’s an example:

>>> _sre_support.opcodes("^no\d.*")
[17, 4, 0, 3, 0, 6, 0, 19, 110, 19, 111, 15, 4, 9, 0, 0, 29, 5, 0, 65535, 2, 1, 1]
>>> _sre_support.dis("^no\d.*")
info [4, 0, 3, 0]
at 0
literal 110
literal 111
in [4, 9, 0, 0]
repeat_one [5, 0, 65535, 2, 1]
success

In case you wonder, here’s a very brief explanation of the above: All bytecode sequences start out with an info section that contains optimization hints for the interpreter (e.g., here the third argument means that a string must be at least 3 characters long to match). at 0 means that the beginning of the string must match here. literals expect a character matching the argument. in is for character sets which \d is a type of. repeat_one greedily matches repetitions of a single character, here at least 0 times (and at most inifinite times, 65535 is infinity), and the type of character to match is “any” (which the fourth argument, 2, stands for). success probably speaks for itself; when the interpreter gets here, we have match.

To recurse or not to recurse

July 17th, 2005

The university project I talked about was postponed until October, so I could already pick up SoC again on Friday. That day I implemented pretty much all regex syntax that doesn’t require interpreting a subexpression, among them most categories like \d (matches all digits) and character sets (e.g., [a-d] to match all letters from a to d). In other words, everything that doesn’t require recursion or an emulation thereof.

Yesterday, I stared at the _sre.c code for a couple of hours until I understood the basic idea of the scheme implemented by Gustavo that avoids recursion. I don’t think he’ll object if I call it a twisted little optimization hack here. Effectively, the core interpreter loop is one big C function, although that fact is obscured by the use of many macros which are in fact functions in disguise. It uses these macros instead of real functions because they share the scope of the main interpreter loop, so to speak, and this allows for gotos to be used safely.

Curiously, this is the first time I ever had to deal with gotos in any language (except for jumps in assemblers). It’s fascinating to see what you can use them for. In the same way it’s fascinating to watch a car crash test in slow motion. Here they are used to jump to the beginning of the interpreter loop when a subexpression is encountered. Prior to the jump, parts of the current interpreter state are saved on an explicit stack, including a constant that represents the label to jump back to once the subexpression has been evaluated. Once you see through all the implementation-specific cruft, this is an emulation of the C runtime stack, minus the recursion limit.

With Python it’s rather non-obvious how you’d do a similar thing. It took me a while to figure out how to resume the interpretation of an expression after a subexpression was evaluated (effectively, a kind of continuation). I did come up with a scheme that works, and I could implement regex branches (e.g., (this|or|that)) with it. I’m not sure if the model is general enough to also handle all the repetition operators (*, + and {x,y}), though. Implementing those will be my next task.

All the code is available from the Codespeak Subversion repository at http://codespeak.net/svn/user/nik, by the way. (Note that Codespeak has been down since Friday and probably will be for another day or so.)

It’s 30° celsius around here. Excuse me while I hang out down at the river, now.

PyPy sprint, part 2

July 16th, 2005

This is part 2 of my informal sprint report. Here’s part 1.

On the third sprint day in Göteborg I finally kicked off my actual SoC project, rewriting the _sre and array modules in Python. I sat together with Armin to discuss some details. We agreed on the basic requirement that will be used to evaluate it: The rewritten modules must pass the corresponding unit tests from the CPython source. As a bonus, I might try to convert it to RPython as well, the static Python subset to which PyPy can apply its optimization magic (e.g., compiling to machinecode via C). Right now, I don’t think that I will have working RPython versions by the end of SoC, that could possibly keep me busy for another few weeks.

Incidentally, Gustavo was around at the sprint that day as well. He’s the current maintainer of _sre for CPython, who accomplished the rather mindboggling feat of removing recursion from _sre for 2.4. He gave me a short tour of the code that I will live and breath for a while, now. You’ll probably get some more complaining about it here. Just this much: It’s highly optimized for speed, there’s lots of gotos, and the new nonrecursiveness doesn’t exactly make it more accessible.

In the evening we had to pack our stuff and leave the sprint site at Chalmers university. On the next day, my last full day in Göteborg, sprinting continued at Laura’s and Jacob’s apartment instead. The word “apartment” doesn’t really adequately describe what we’re talking about here. It’s got a staircase with marble stairs covered by a plush red carpet, it’s got a bathroom with a builtin sauna, and it’s huge. There was enough space to host the ten or so sprinters who were eager to code even though this had officially been declared a rest day.

I worked alone on _sre all day, making good progress. In the end I had a basic framework set up and could already match against very simple regexes. Like ^foo or bar$. Well, you gotta start small. I knew that I would have to revise some of the code I wrote then, but a quick spike solution was important to explore as much of the problem space as possible.

The day culminated in a late-night dinner cooked by Laura and Jacob, both apparently first-rate gourmets. There was lamb in red wine sauce, yellow beans and potatoes. Simple, but absolutely delicious. For dessert there was a fabulous strawberry soup that I will try to make for myself sometime soon. (I’m a strawberry kind of guy. My flatmates ridicule me for the amount of strawberries I eat during spring and summer.)

We got back to the hotel rather late, and I had to get up very early next morning. Don’t book flights at 7 am if you can avoid it … Only when I got home I realized how intense the sprint was. In Göteborg I never really felt tired, but back in Berne it needed a day or two until I felt fully awake again.

Academic writing

July 13th, 2005

I’m trying very hard to not work on my SoC project this week (blogging about it here instead, is just a substitute for the real thing). I know my working habits too well. Would I just casually start coding again, I would quickly get so immersed that everything else that needs to be done this week would fall short.

Especially if one of the things to be done is writing a report/paper about a project at university. In general, I enjoy writing, but that kind of writing is hard and demands a lot of time, discipline and dedication. It’s also the first time I do this seriously (crappy term papers about uninteresting topics don’t count). The project is about classboxes, a novel module system invented by Alexandre, a PhD student in my research group. He had a paper about the Java classboxes implementation accepted at upcoming OOPSLA, so the topic is hot.

Today I drafted the introduction for my report/paper. It turned out somewhat polemical and will probably have to be toned down a bit for the final version. But I think it presents the base problem well, so there it is for you’re reading pleasure:

Software is enormously complex. Every piece of software today, however small, is built on top of a tower (or maybe more of a heap) of existing environments, libraries and components. At the least, we rely on operating system services and a programming language environment. But more often than not, we also use a plethora of components provided by a language’s standard library as well as by third-party libraries. This, reusing existing components, is of course a good thing. It spares us from reinventing the wheel all the time and allows for efficient, rapid software development.

Component reuse is not effortless, though. When we build a software system out of existing components, it often happens that a component fulfills our basic need, but that the specific implementation or interface does not quite exactly do what would be most convenient for our use case. The desire arises to modify the component. It would make our own code just so much cleaner. But this is mostly not an option. Maybe the component’s code is not open for modification, or maybe the modifications would affect other components in our system that rely on the code we would like to change. In either case, we normally have to give up and just program around the inconvenient component in our own code.

In object-oriented programming, classes are such components. Some modifications we might want to do to an existing class: adding a method, redefining a method or adding an instance variable. Some languages have recognized this need for a flexible base system. One of them is Smalltalk, which in fact supports all of the mentioned operations. Indeed, it’s almost de rigeur that everyone programming on top of Smalltalk freely modifies system classes in arbitrary ways. Smalltalkers want to control their system.

Control is good. But what if all the different authors of libraries you are using claim control over your system? Things get ouf of control. This is what happens, to a certain extent, to Smalltalk software reusing a lot of third-party libraries. At the very least you get lots of new methods on system classes. In the worst case, this creates actual conflicts, e.g., when two libraries try to add the same method to a class.

Still, being able to modify system classes is a very useful feature of Smalltalk. But what if the scope of such changes could be controlled? What if you could use all of Smalltalk’s power while avoiding the pitfalls usually associated with that power? The classboxes model proposes a novel module system which achieves just that.

In this paper we present the implementation of an environment that allows the programmer to take full advantage of classboxes in Squeak Smalltalk. We introduce a classbox browser that supports views and operations specific to classboxes.

I have fleshed out all parts of the paper, now. The final two days are for refining everything and sorting out all kinds of interesting LaTeX issues (not).