Segfaulting Fun
I’m kinda inadvertently becoming an expert on Unicode and UCS-4 issues. I just discovered a bug in CPython. This is on a UCS-4 build:
Python 2.4.1 (#2, Jul 23 2005, 13:16:23)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1671)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> '\x7f\x00\x00\x00'.decode("unicode_internal")
u'\U7f000000'
>>> '\x80\x00\x00\x00'.decode("unicode_internal")
u'\x00'
That’s strange enough, but now watch this:
>>> '\x81\x00\x00\x00'.decode("unicode_internal")
Segmentation fault
This is admittedly an edge case, but a segfault still seems a bit harsh. As Python doesn’t even know any Unicode code points above 0x10FFFF, I think the correct behaviour would be for unicode_internal to raise a UnicodeDecodeError in these cases. The unicodeescape codec does this as well:
>>> u'\U7f000000'
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 0-9: illegal Unicode character
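For illustration, here is roughly the kind of range check I have in mind. This is only a sketch in modern Python, not CPython's actual C code and not necessarily how a fix would look: the function name checked_ucs4_decode, the codec label "ucs4_checked" and the error messages are all made up, and I'm assuming big-endian 4-byte units because that is what the session above shows.

```python
import struct

MAX_CODE_POINT = 0x10FFFF  # highest code point Unicode defines


def checked_ucs4_decode(data, byteorder=">"):
    """Decode 4-byte code units, rejecting values outside the Unicode range.

    Hypothetical helper for illustration only; byteorder is a struct
    prefix (">" big-endian, "<" little-endian).
    """
    if len(data) % 4:
        # Report the trailing partial unit as the offending span.
        start = len(data) - len(data) % 4
        raise UnicodeDecodeError("ucs4_checked", data, start, len(data),
                                 "truncated data")
    chars = []
    for offset in range(0, len(data), 4):
        (value,) = struct.unpack(byteorder + "I", data[offset:offset + 4])
        if value > MAX_CODE_POINT:
            # Fail loudly instead of building an invalid character.
            raise UnicodeDecodeError("ucs4_checked", data, offset, offset + 4,
                                     "illegal Unicode character")
        chars.append(chr(value))
    return "".join(chars)
```

With a check like this, b'\x00\x00\x00A' decodes to 'A', while b'\x7f\x00\x00\x00' and b'\x81\x00\x00\x00' both raise a UnicodeDecodeError instead of producing a bogus character or crashing.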
I’ll have to check whether CPython 2.5 from CVS acts the same way and file a bug report. A documentation bug for the re module that I filed last week just got closed, by the way.
Edit: Bug filed.