Arc Forum | About arrays: I think the major concern of most people has more to do with effic...

3 points by almkglor 6625 days ago | link | parent

About arrays: I think the major concern of most people has more to do with efficiency than usability. Generally, lists are more usable (easier to construct and insert into) than arrays, but arrays have O(1) indexed lookup compared to O(n) for lists.

Since lists in Arc have the cute (lst ind) syntax, indexed lookups are expected to be common (because the syntax is simple) but the problem is efficiency.

However, I would like to propose the use of an unrolled linked list as an implementation of lists:

http://en.wikipedia.org/wiki/Unrolled_linked_list

Unrolled linked lists have O(n/m) lookup time, where m is the number of elements stored per node, at the cost of significantly slower insertion. We expect insertion to be less common, however.
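To make the O(n/m) figure concrete, here is a minimal Python sketch of a conventional unrolled linked list (not the Arc mockup proposed in this post). The names Node, build and lookup are made up for illustration; lookup also reports how many node hops it took.

```python
class Node:
    """One node of an unrolled linked list: up to m elements plus a next pointer."""

    def __init__(self, items, next_node=None):
        self.items = items      # small array holding at most m elements
        self.next = next_node   # following node, or None at the end


def build(values, m):
    """Pack values into a chain of nodes holding at most m elements each."""
    chunks = [values[i:i + m] for i in range(0, len(values), m)]
    head = None
    # Build back to front so every node can point at the one after it.
    for chunk in reversed(chunks):
        head = Node(list(chunk), head)
    return head


def lookup(head, i):
    """Return (element i, number of node hops needed to reach it)."""
    hops = 0
    node = head
    while node is not None:
        if i < len(node.items):
            return node.items[i], hops   # direct array access inside the node
        i -= len(node.items)             # skip a whole node at once
        node = node.next
        hops += 1
    raise IndexError("index out of range")
```

With 100 elements and m = 16, reaching the last element takes 6 hops rather than the 99 a singly-linked list would need.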

The naive implementation of unrolled linked lists cannot have safe scdr implementations (i.e. indistinguishable from singly-linked list scdr). However, with some thinking, scdr can be implemented.

Instead, we should use the following structures (Arc mockup; obviously this is much, much more efficiently expressed in C++):

  (= m 16) ; the unrolling factor: larger numbers trade off memory for speed
  (deftem unrolled-list
    'elements (vec m) ; where (vec m) returns a basic true array
    'start m
    ; best implemented as a C++ std::map
    'cdr-table (table-that-allows-nil))
  ; this is the actual cons cell
  (deftem cons
    'list-struc (inst 'unrolled-list)
    'index 0)

Now in order to implement a cons operation, we first check if the second argument is a cons cell. If it is, we check its list-struc's start parameter. If the index and the start parameters are equal and non-zero, we just construct a new cons cell with the same list-struc; otherwise we also construct a new unrolled-list list-struc.

  (def cons (a d)
    (if (and
          (isa  d 'cons)
          (is   (d!list-struc 'start) d!index)
          (isnt (d!list-struc 'start) 0))
      ; use the same array
      (withs (list-struc d!list-struc
              elements list-struc!elements)
        ; decrement the start
        (-- list-struc!start)
        ; add the element
        (= (list-struc!elements list-struc!start) a)
        ; create a new cons cell, sharing structure
        (inst 'cons
              'list-struc list-struc
              'index list-struc!start))
      ; *don't* use the same array; create a new one
      (withs (elements (vec m)
              start (- m 1)
              list-struc (inst 'unrolled-list
                               'elements elements
                               'start start
                               'cdr-table
                               (fill-table (table-that-allows-nil)
                                 `(,start ,d))))
        ; add it to the new array
        (= (elements start) a)
        ; create the cons cell
        (inst 'cons
              'list-struc list-struc
              'index start))))

Getting the car of the cons cell only requires looking it up directly in the list-struc's elements:

  (def car (l)
    (let elements (l!list-struc 'elements)
         (elements l!index)))

Setting the car is similar:

  (def scar (l a)
    (let elements (l!list-struc 'elements)
      (= (elements l!index) a)))

Getting the cdr is more difficult: first we need to look up our index in the list-struc's cdr-table. If it's in there, we return the stored cdr. If it's not, we create a new cons cell that refers to the next index in the same list-struc.

  (def cdr (l)
    (withs (list-struc l!list-struc
            elements list-struc!elements
            cdr-table list-struc!cdr-table)
      ; cdr-table must support values that are nil
      (if (has-key cdr-table l!index)
          (cdr-table l!index)
          (inst 'cons
                'list-struc list-struc
                'index (+ 1 l!index)))))

The above means that comparing cons cells requires us to compare not the cons objects themselves (which could be different) but what they refer to:

  ; the anarki redef binds the old definition to `old'
  (redef is (a b)
    (if (and (acons a) (acons b))
        (or (old a b)
            (and
              (is a!list-struc b!list-struc)
              (is a!index b!index)))
        (old a b)))

Now setting the scdr requires us to first determine if the index is in the cdr-table. If it is, we modify the cdr-table; if it isn't, we insert it.

  (def scdr (l d)
    (withs (list-struc l!list-struc
            cdr-table list-struc!cdr-table)
      ; insertion and replacement use the
      ; same semantics.
      (= (cdr-table l!index) d)))

Since cdr first checks the cdr-table, any other cons cells which point to the same list-struc will also see the replacement.

Indexed lookups must first check whether the requested index runs past the nearest cdr-table entry ahead of the starting cell:

  (defcall cons (l i)
    (withs (start-index l!index
            index (+ start-index i)
            list-struc l!list-struc
            cdr-table list-struc!cdr-table
            elements list-struc!elements
            nearest-cdr-table
            ; the arc-wiki 'breakable creates a control structure
            ; that can be returned from by (break <value>)
            ; this part should really be done by binary search;
            ; C++ std::map fortunately sorts keys, so binary
            ; search should be easy if that is the underlying
            ; implementation
            (breakable:each k (sort < (keys cdr-table))
              ; >= because l's own index may have a cdr-table entry
              (if (>= k start-index)
                  (break k))))
      (if (> index nearest-cdr-table)
          ; pass it on, then
          ((cdr-table nearest-cdr-table) (- index nearest-cdr-table 1))
          ; it's in this array
          (elements index))))

Although looking up nearest-cdr-table above may seem expensive, reasonably clean lists (those where scdr hasn't been used often) will only have one entry in the cdr-table anyway; the check also doubles as an array bounds check! At the same time, even if scdr has been used often, the lookup and cdr operations behave exactly as they would on singly-linked lists.
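For anyone who wants to poke at the idea, the whole mockup can be modelled in a few lines of Python. This is only a sketch under assumptions: Block stands in for the unrolled-list template, a plain dict replaces table-that-allows-nil, Python's None plays the role of nil, __eq__ stands in for the redefined is, and nth is a made-up name for the (lst i) call.

```python
M = 16  # unrolling factor, the post's m


class Block:
    """The post's unrolled-list: one array shared by many cons cells."""

    def __init__(self):
        self.elements = [None] * M
        self.start = M          # lowest slot in use; M means "empty"
        self.cdr_table = {}     # slot -> explicit cdr (None is a legal value)


class Cons:
    """A cons cell is only a (block, index) pair."""

    def __init__(self, block, index):
        self.block = block
        self.index = index

    def __eq__(self, other):    # plays the role of the redefined `is`
        return (isinstance(other, Cons)
                and self.block is other.block
                and self.index == other.index)

    def __hash__(self):
        return hash((id(self.block), self.index))


def cons(a, d):
    if isinstance(d, Cons) and d.block.start == d.index and d.index != 0:
        block = d.block         # d is the front of its array: grow downwards
        block.start -= 1
    else:
        block = Block()         # new array; its last slot records d as the cdr
        block.start = M - 1
        block.cdr_table[block.start] = d
    block.elements[block.start] = a
    return Cons(block, block.start)


def car(l):
    return l.block.elements[l.index]


def cdr(l):
    table = l.block.cdr_table
    if l.index in table:        # explicit cdr, possibly None
        return table[l.index]
    return Cons(l.block, l.index + 1)


def scdr(l, d):
    l.block.cdr_table[l.index] = d   # insert or replace


def nth(l, i):
    """Indexed lookup, the post's (lst i)."""
    target = l.index + i
    # Nearest explicit cdr at or after our own slot. One always exists,
    # because slot M - 1 gets an entry when the block is created.
    boundary = min(k for k in l.block.cdr_table if k >= l.index)
    if target <= boundary:
        return l.block.elements[target]
    rest = l.block.cdr_table[boundary]
    if not isinstance(rest, Cons):
        raise IndexError("index past the end of the list")
    return nth(rest, target - boundary - 1)
```

Building a 40-element list with repeated cons calls fills three arrays, and nth reaches any element in at most three steps. After an scdr in the middle of an array, cells that already pointed past the split still see the old tail, while lookups that start before the split follow the new cdr.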

We can also build a 'copy-list operation which "optimizes" lookup - it measures the length until the source list is improper (cdr is not a cons - either nil or some other value) and returns a cons whose list-struc!elements contains the entire proper list, with the (cdr-table (- (length source) 1)) being the final cdr.

2 points by nex3 6625 days ago | link

This is a cool idea, but I'm not sure that it's a good idea in general to add complexity to cons cells. Even if the interface is the same, the performance characteristics are different (that's the whole point, I suppose), and that makes reasoning about them more complicated.

Also, from a more wishy-washy perspective, it just feels right to me that the core data structure of Lisp has such a conceptually simple implementation. Two pointers: a car and a cdr. That's it.

It seems to me that it's worth just making arrays available (which we'd have to do anyway, really) to keep lists as simple as possible.

Maybe I'm being naïve, though ;-).

-----

3 points by almkglor 6625 days ago | link

Come now. It's always possible to "seamlessly" funge the division between lists and arrays. For the most part, they have virtually the same purpose: to keep a sequence of objects. One can imagine an implementation which keeps lists as arrays underneath, but switches over to singly-linked lists when asked to do something that is really difficult with an array - say insertion, or scdr.

Really, though, I think what's missing in Arc is the layers of abstraction. We don't need two sequences - singly-linked lists and arrays. What we should have is the concept of a sequence. Iterating over a sequence, regardless of its implementation, should be handled by 'each, 'map, etc. I think as pg envisioned it, a profiler would eventually be integrated into Arc, which would measure the performance difference in using either of the implementations. If insertion is common in one place, and then indexed access in another, the compiler+profiler should be able to figure out that it should use linked lists in one place, then copy it over to the other place as an array to be used as indexed access.

Basically, what I'd like is a layer of abstraction - "this thing is a sequence, I'll choose array or linked-list later". Then, after profiling, I'll just add some tags and magically, this part of the code building the array uses linked-lists and that part doing weird indexed stuff (e.g. heap access) uses arrays.
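As a rough illustration of that layer, here is a Python sketch in which the code that builds a sequence does not care which representation it gets. LinkedSeq, ArraySeq, build_squares and as_array are made-up names; the seq_type argument plays the role of the "tag" chosen after profiling.

```python
class LinkedSeq:
    """Sequence backed by a singly-linked chain: cheap push_front, O(n) indexing."""

    def __init__(self, items=()):
        self.head = None
        for x in reversed(list(items)):
            self.push_front(x)

    def push_front(self, x):
        self.head = (x, self.head)        # O(1)

    def __iter__(self):
        node = self.head
        while node is not None:
            yield node[0]
            node = node[1]

    def __getitem__(self, i):
        for k, x in enumerate(self):      # O(n) walk
            if k == i:
                return x
        raise IndexError(i)


class ArraySeq:
    """Sequence backed by an array: O(1) indexing, O(n) push_front."""

    def __init__(self, items=()):
        self.items = list(items)

    def push_front(self, x):
        self.items.insert(0, x)           # shifts every element

    def __iter__(self):
        return iter(self.items)

    def __getitem__(self, i):
        return self.items[i]


def build_squares(n, seq_type):
    """Written once against the sequence interface; seq_type is picked later."""
    seq = seq_type()
    for k in range(n):
        seq.push_front(k * k)
    return seq


def as_array(seq):
    """Copy any sequence into the array form for index-heavy code."""
    return ArraySeq(seq)
```

Code that inserts a lot would pass LinkedSeq, and the index-heavy part would call as_array on the result before using it.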

When you look at it, most other dynamic languages don't have a separate linked-list type. They're all just "array", and not all of them use a sequential-cells-of-memory implementation of arrays. Say what you will about lesser languages, but first think - if everyone's doing it, there must be a reason. They could just be copying each other, of course, but one must always consider whether having two types of sequence, whose only true difference is their difference in access, is such an important thing.

-----

4 points by nex3 6625 days ago | link

I don't see lists in Lisp as just sequences. Rather, I don't see cons cells as just sequences. They're so much more versatile than that, which is part of what gives Lisp its power (cue "hundred operations on one data structure" quote). They can be used as maps, trees, etc. I think it would be a mistake to say, "these are mostly the same as arrays, let's implement them as arrays most of the time." Cons cells aren't the same as arrays.

I guess you're right in that arrays have more-or-less a subset of the functionality that cons cells do. Maybe it would be a good idea to have lists as the default and switch to arrays under some circumstances (lots of indexing or index-assignment?). But I'm skeptical about this as well.

Also, I foresee some unexpected behavior if the transition between cons cells and arrays is entirely behind-the-scenes. For example:

  ; Suppose foo is an array acting like a cons cell
  (= bar (cons 'baz (cdr foo)))
  (scdr (cdr bar) 'baz)

Now we'd need to somehow update the foo variable to point to a cons cell rather than an array. You could imagine this getting even more tricky, even incurring large unexpected cost, with many variables pointing at different parts of an array-list and one of them suddenly scdr-ing.
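For reference, this is what the same three steps do with ordinary two-pointer cons cells, sketched in Python. The Cons class and to_list helper are made up for illustration, and foo is assumed to start out as the list (1 2 3). Whatever representation sits underneath would have to reproduce this sharing.

```python
class Cons:
    """A plain two-pointer cons cell."""

    def __init__(self, car, cdr):
        self.car = car
        self.cdr = cdr


def to_list(cell):
    """Collect the cars of a chain, stopping at the first non-cons cdr."""
    out = []
    while isinstance(cell, Cons):
        out.append(cell.car)
        cell = cell.cdr
    return out


foo = Cons(1, Cons(2, Cons(3, None)))
bar = Cons('baz', foo.cdr)    # (= bar (cons 'baz (cdr foo)))
bar.cdr.cdr = 'baz'           # (scdr (cdr bar) 'baz)

# foo and bar share the cell holding 2, so the scdr made through bar
# is visible through foo as well: foo is now the improper list (1 2 . baz).
shared = foo.cdr is bar.cdr
```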

-----

2 points by almkglor 6625 days ago | link

All of that is solved by the unrolled-list mockup. It works, and works with scdr, no matter how many pointers point in the middle of otherwise-proper lists, and even works with lists split by scdr, handling both the parts before and after the scdr-split seamlessly. Costs are also not very large - mostly the added cost is in the search through cdr-table for the next highest key. That said, the added cost is O(n), where n is the number of individual cons cells scdr has been used on.

So yes, the above code you show will add cost. But how often is it used in actual coding anyway? The most common use of cons cells is straight sequences, and quite a bit of arcn can't handle improper lists (quasiquote comes to mind) - yet arc is useful enough to write news.yc.

Edit: Come to think of it, for the common case where scdr completely trashes the old cdr of a cons cell (i.e. no references to it are kept), the linear lookup through cdr-table will still end up being O(1), since keys are sorted, and the earlier cdr-table entry gets found first.

-----

2 points by tokipin 6625 days ago | link

i don't know the technical terms, but probably one of the things that gives Lua its speed is that if you have multiple strings in the program that are the same, the VM assigns them the same pointer. string comparisons are therefore trivial and i imagine this mechanism would make table lookup very direct
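The usual name for this mechanism is string interning. A small Python sketch of the effect, using the standard sys.intern call (Python here only stands in for what the Lua VM is described as doing):

```python
import sys

# Two equal strings, one of them assembled at run time.
a = sys.intern("".join(["hel", "lo"]))
b = sys.intern("hello")

# After interning, both names refer to one object, so equality can be
# decided by comparing pointers instead of comparing characters.
same_object = a is b
```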

-----

5 points by absz 6625 days ago | link

That's what Arc's symbols are for. Generally speaking, you should be keying your tables with symbols for exactly that reason: every 'a is at the same place in memory, but every "a" is not (which allows mutable strings).

-----

3 points by kens1 6624 days ago | link

You had me worried, but I'm pretty sure there's absolutely no problem with using strings as the keys for tables.

The MzScheme documentation says: make-hash-table ... 'equal -- creates a hash table that compares keys using equal? instead of eq? (needed, for example, when using strings as keys).

Checking ac.scm, sure enough:

  (xdef 'table (lambda () (make-hash-table 'equal)))

Likewise, the "is" operation in Arc uses MzScheme's string=? . (That's a statement, not a question :-) So string comparison works, although in O(n) time.

Net net: strings are okay for comparison and table keys in Arc.

-----

2 points by absz 6623 days ago | link

I wasn't saying that they weren't usable, just that they were, in fact, slower; that's what you observed. Symbol comparison is O(1) time, whereas string comparison is O(n) time. That's all.
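Both points can be seen side by side in a small Python sketch. Symbol and sym are hypothetical names for a hand-rolled symbol table, not anything from Arc or MzScheme; the string-keyed dict stands in for a table created with 'equal.

```python
class Symbol:
    """A symbol is identified by which object it is, not by its spelling."""

    def __init__(self, name):
        self.name = name


_symbols = {}


def sym(name):
    """Return the one Symbol object for this name, creating it on first use."""
    s = _symbols.get(name)
    if s is None:
        s = _symbols[name] = Symbol(name)
    return s


# Symbols: comparison is a pointer check, independent of name length.
same = sym("foo") is sym("foo")

# Strings as keys still work, because the table compares by content;
# that comparison has to look at the characters.
by_string = {"".join(["fo", "o"]): 1}
found = by_string["foo"]

# Symbols as keys are hashed and compared by identity.
by_symbol = {sym("foo"): 2}
```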

-----

3 points by are 6624 days ago | link

I would rather have immutable strings + unification of symbols and strings.

- Any string could have an associated value, like symbols today.

- "foo", 'foo and (quote foo) would be the same object (you would allow Lisp-style prepend-quoting of non-whitespace strings for convenience).

- foo, (unquote "foo") or (unquote 'foo) would then be used to evaluate, so even non-whitespace strings like "bar baz" could act as symbols (but with less convenience, of course, since you would have to use the unquote form to get them evaluated).

- Since such a unified string/symbol would also act as a perfectly good key/value pair, a simple list of such strings will in effect act as a string-keyed hashtable (since duplicate strings in the list would be the same immutable key), and can be used wherever you need symbol tables (e.g. for lexical environments). In fact, string-keyed hash tables would be a subtype of any-sort-of-key hashtables, and probably used much more.

-----

2 points by absz 6624 days ago | link

Right now, you can do (= |x y| 3) to get at oddly-named symbols, or

  arc> (eval `(= ,(coerce "x y" 'sym) 42))
  42
  arc> |x y|
  42

. And by (unquote "foo"), do you mean (eval "foo")? Or do you mean `,"foo"? The latter makes more sense here.

At any rate, I'm not convinced that this is actually a benefit. Strings and symbols are logically distinct things. Strings are used when you want to know what they say, symbols when you want to know what they are. Unifying them doesn't seem to add anything, and you lose mutability (which, though unfunctional, can be quite useful).

-----

3 points by are 6624 days ago | link

Good feedback.

> Strings and symbols are logically distinct things. Strings are used when you want to know what they say, symbols when you want to know what they are.

Fine. But this breaks down anyway when you force people to use (immutable) symbols instead of strings for efficient allocation. When using symbols as keys in hashtables, you do not "want to know what they are", you "want to know what they say".

And unification would possibly have good consequences for simplifying macros and code-as-data (especially if characters are also just strings of length 1). Most code fragments would then literally be strings (most notable exceptions would be numeric literals, list literals and the like).

-----

2 points by absz 6624 days ago | link

Actually, in a hash table, I usually don't care what the key says, any more than I care about the name of the variable used to store an integer. I care about it for code readability, but I'm usually not concerned about getting a rogue key (where I do care what it says). In that case, I would either use string keys or (coerce input 'sym).

I'm not convinced that characters being strings of length one is a good idea... it seems like the "character" is actually a useful concept. But I don't have a huge opinion about this.

Code fragments would still be lists, actually: most code involves at least one function application, and that's a list structure. Only the degenerate case of 'var would be a string.

-----

1 point by are 6623 days ago | link

> Actually, in a hash table, I usually don't care what the key says, any more than I care about the name of the variable used to store an integer.

That's fine again, but my point is just that by using symbols as keys in hashtables, you never care about the value part of that symbol (you just need an immutable key); you're not using the symbol "as intended", for value storage.

> most code involves at least one function application, and that's a list structure.

Yep. But in the case where that function application does not contain another function application (or list literal) in any of its argument positions, we would, with my proposal, be talking about a list of strings, which could then again be seen as a string-keyed hash table...

-----

1 point by absz 6623 days ago | link

Symbols are not "intended" for value storage, symbols happen to be used for value storage. Symbols have exactly the same properties as, say, named constants in C, and thus fit in the same niche. They also have the properties of variable names, and so fit in that niche too. Symbols are a generally useful datatype, and they are no more intended for just "value storage" than cons cells are intended for just list construction.

A list of strings is still a list, which is not what you said; right now, it's a list of symbols, and I don't see the benefit of a list of strings over a list of symbols.

-----

3 points by almkglor 6624 days ago | link

Really, I think most people are confused by the boundary between interface and implementation.

It's possible to have a mutable string interface built on an immutable string implementation. Just add indirection! We can get the best of both worlds: immutability of actual strings optimizes comparison of strings, while the pseudo-mutable strings allow you to pass data across functions by a method other than returning a value (i.e. mutating the string).
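A minimal Python sketch of that indirection, assuming a made-up MutableString wrapper: the wrapper is the mutable handle, and what it points at is always an ordinary immutable string.

```python
class MutableString:
    """A mutable handle whose contents are always an immutable str."""

    def __init__(self, text=""):
        self._text = text

    def value(self):
        return self._text                 # the current immutable string

    def set_char(self, i, ch):
        if not 0 <= i < len(self._text):
            raise IndexError(i)
        # "Mutation" builds a new immutable string and repoints the handle.
        self._text = self._text[:i] + ch + self._text[i + 1:]


def capitalize_first(s):
    """Passes data back to the caller by mutating its argument."""
    s.set_char(0, s.value()[0].upper())


s = MutableString("abc")
before = s.value()      # an immutable snapshot; later edits cannot change it
capitalize_first(s)
after = s.value()
```

The snapshot taken before the call is untouched, so immutable values can still be compared or shared freely while callers see the handle change.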

-----

1 point by absz 6624 days ago | link

That's a good point. However, it leaves open the question of what "a new string" creates. One can build either one on top of something else (e.g. immutable strings on top of symbols [though one could argue that that's what symbols are, I suppose]), so the real question (it seems to me) is what the default should be.

-----

3 points by almkglor 6624 days ago | link

This is where "code is spec" breaks down, again ^^; \/

I suppose if the user uses symbol syntax, it's an immutable string, while if the user uses "string syntax", it's a mutable string. Interface, anyone? ^^

edit: typical lisps implement symbols as references to mutable strings; some newer ones implement mutable strings as references to immutable strings, with symbols referring also to immutable strings.

-----

3 points by absz 6624 days ago | link

This isn't so much code is spec, though: Arc only has mutable strings and symbols. You could consider symbols immutable strings, but they exist to fill a different niche.[1] If mutable and immutable strings were created, then the code-spec would have to deal with this; I think it would be capable of doing so.

I'm not so concerned with how Lisps represent symbols and (mutable) strings as long as (1) my strings can be modified, and (2) comparing symbols takes constant time. If it's the Lisp interpreter protecting the string-representing-the-symbol, so be it; that doesn't affect me as a Lisp programmer.

[1]: Although if I recall, Ruby 2.0 will make its Symbols "frozen" (immutabilified) Strings, thus allowing things like regex matching on Symbols. This might be an interesting approach...

-----