Arc Forumnew | comments | leaders | submit | kirkeby's commentslogin
6 points by kirkeby 6140 days ago | link | parent | on: Clarification about Character Sets

I think the Py3K solution sounds right: Two different types of strings, 8-bit byte-strings and full unicode strings, with encode and decode functions to convert between the two.

Without this strict separation you get the wart that is Pythons current string-support.

-----

9 points by olavk 6140 days ago | link

Or just one type of string: unicode character strings (which is sequences of unicode code points). Then a seperate type for byte arrays. Byte arrays are not character strings, but can easily be translated into a string (and back).

-----

3 points by olavk 6140 days ago | link

...and this seem to be exactly what MzScheme provides :-) Strings in MzScheme are sequences of unicode code points. "bytes" is a seperate type which is a sequence of bytes. There are functions to translate between the two, given an encoding.

Python 3000 is close to this, but I think Python muddles the issue by providing character-releated operations like "capitalize" and so on on byte arrays. This is bound to lead to confusion. (The reason seem to be that the byte array is really the old 8-bit string type renamed. Will it never go away?) MzScheme does not have that issue.

-----