Arc Forum | olavk's comments
3 points by olavk 6465 days ago | link | parent | on: Clarification about Character Sets

That just shows how agile PG is. He added unicode support the minute he saw people request it! :)

Seriously, PG explicitly claims that Arc intentionally doesn't support anything but ASCII (http://www.arclanguage.org/), so that might be why people (including me) believed that to be the case.

-----

1 point by aaco 6465 days ago | link

Yes, I think Arc intentionally supports only ASCII just to avoid dealing with Unicode issues for now.

Anyway, I can't see how Unicode can break in Arc. I'm not a Lisper, but I don't think you can extract a single byte from an Arc string (since it's just an MzScheme string); you get a character instead. That's a different concept, because in a Unicode encoding one character can be formed from one, two, or more bytes.
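To illustrate the distinction in Python 3 (not Arc, but it makes the same character/byte split, so it's a reasonable stand-in):

```python
s = "héllo"                 # a string: a sequence of characters (code points)
b = s.encode("utf-8")       # the same text as raw bytes

print(len(s))               # 5 characters
print(len(b))               # 6 bytes: "é" takes two bytes in UTF-8

print(s[1])                 # indexing a string yields a character: 'é'
print(b[1])                 # indexing bytes yields a raw byte value: 195 (0xc3)
```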

-----

9 points by olavk 6465 days ago | link | parent | on: Clarification about Character Sets

Or just one type of string: Unicode character strings (which are sequences of Unicode code points). Then a separate type for byte arrays. Byte arrays are not character strings, but they can easily be translated into a string (and back).
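A sketch of that model in Python 3, which happens to work exactly this way (illustrative only; Arc/MzScheme names would differ):

```python
text = "naïve"               # character string: a sequence of code points
raw = text.encode("utf-8")   # translated to a byte array, given an encoding
back = raw.decode("utf-8")   # and translated back

print(raw)                   # b'na\xc3\xafve' -- a bytes object, not a string
print(back == text)          # True: the round trip is lossless
print(text.encode("latin-1"))  # b'na\xefve' -- different encoding, different bytes
```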

-----

3 points by olavk 6465 days ago | link

...and this seems to be exactly what MzScheme provides :-) Strings in MzScheme are sequences of Unicode code points, and "bytes" is a separate type which is a sequence of bytes. There are functions to translate between the two, given an encoding.

Python 3000 is close to this, but I think Python muddles the issue by providing character-related operations like "capitalize" and so on on byte arrays. This is bound to lead to confusion. (The reason seems to be that the byte array is really the old 8-bit string type renamed. Will it never go away?) MzScheme does not have that issue.
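For what it's worth, here is the kind of muddle I mean, in Python 3 (the behavior shown is Python's, not Arc's): bytes objects grew string-like methods, but they only understand ASCII.

```python
# bytes support "text" operations like capitalize/upper...
print(b"hello world".capitalize())       # b'Hello world'

# ...but they are ASCII-only, so they silently do the wrong thing for real text:
print("straße".upper())                  # 'STRASSE' -- the string type knows Unicode rules
print("straße".encode("utf-8").upper())  # b'STRA\xc3\x9fE' -- the bytes for 'ß' are untouched
```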

-----

3 points by olavk 6465 days ago | link | parent | on: Clarification about Character Sets

Thank you very much for the clarification! People got riled up because it sounded like you didn't want Arc to support Unicode ever (or that the current support would be removed). As long as the language is in flux it's not a problem.

However, fundamental Unicode support probably has to be in place before release 1.0. It will be painful to add later if backwards compatibility is an issue. For example, a lot of string-processing code might assume that accessing characters by index is constant time; if the internal representation is changed to, e.g., UTF-8, this might lead to performance issues. On the other hand, if code assumes that strings are equivalent to byte arrays, it might lead to trouble if they are changed to arrays of 32-bit values.

I believe the simplest solution is to just have characters be 32-bit integers, with the internal representation of a string being an array of 32-bit characters. Sure, this consumes more space, but who cares? As long as strings are a type separate from byte arrays, encoding/decoding issues can be handled in libraries.
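In Python terms (just a sketch of the model, not Arc code), that's essentially treating every character as an integer code point, all of which fit comfortably in 32 bits:

```python
s = "aé€𝄞"                    # characters from the 1-, 2-, 3- and 4-byte UTF-8 ranges
print(len(s))                  # 4 characters, regardless of encoded size
print([ord(c) for c in s])     # [97, 233, 8364, 119070] -- plain integers
print(all(ord(c) < 2**32 for c in s))  # True: every code point fits in 32 bits
print(s[3])                    # constant-time index access: '𝄞'
```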

-----

1 point by weeble 6465 days ago | link

I think the point is that, in the presence of combining diacritics, even 32 bits isn't enough. A character is (roughly) one "base" 32-bit code plus zero or more "combining" 32-bit codes. And equality between two characters isn't purely structural: two characters can be equivalent even if the combining codes are re-ordered, or if one uses a pre-combined code instead. (Not all combinations have pre-combined codes.)
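A concrete case of this, sketched in Python since it exposes code points directly (hypothetical as far as Arc is concerned):

```python
e1 = "\u00e9"            # 'é' as a single pre-combined code point
e2 = "e\u0301"           # 'e' followed by a combining acute accent

print(e1, e2)            # both render as 'é'
print(len(e1), len(e2))  # 1 vs 2 code points for the "same" character
print(e1 == e2)          # False: code-point equality is not character equality
```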

I will point out that I know very little about Unicode, so I might be a bit off. I can't say that I'm even very interested in the whole Unicode debate, so long as it all gets sorted out at some point in the future.

-----

1 point by tree 6465 days ago | link

The only reason Unicode contains combined forms is for compatibility with existing standards: you cannot invent new code points representing a novel combination of base and combining characters. The Unicode normalization forms deal with these issues.

Unicode support is a complex issue: fundamentally there is the low-level question of internal character representation, followed by library support for normalization and higher-level text-processing operations.
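For example, Python's standard library exposes the Unicode normalization forms through `unicodedata` (shown here only to illustrate what normalization buys you; any language could provide the equivalent):

```python
import unicodedata

precombined = "\u00e9"   # 'é' as one code point
decomposed = "e\u0301"   # 'e' + combining acute accent

# NFC composes, NFD decomposes; after normalizing, the two compare equal:
print(unicodedata.normalize("NFC", decomposed) == precombined)  # True
print(unicodedata.normalize("NFD", precombined) == decomposed)  # True
```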

-----

1 point by olavk 6465 days ago | link

True, I should have said Unicode code points rather than characters. I believe the fundamental point is that strings should always be sequences of Unicode code points, and shouldn't be conflated with byte arrays. The thorny issues of normalization, comparison, sorting, rendering combined characters, and so on could be handled with libraries at a later stage.

-----