Arc Forum | So that we don't walk on each other's toes, I propose to work myself on the foll...

Arc Forum

3 points by sacado 6632 days ago | link | parent

So that we don't walk on each other's toes, I propose to work myself on the following (when I'll have time and the possibility) :

1) annotate and friends, so as to make macros possible (as a macro is an annotated list).

2) chars and strings (and unicoding symbols too. For now on, it's not a problem as they can't be destructured : an utf8 string looks like a byte of chars. But, with strings, as you can access individual chars, it won't work anymore).

3) bignums, rational and inexact numbers. Maybe complex numbers, too.

4) hash tables. Maybe I'll use Lua's approach so as to make them more array-friendly.

Well, in fact, everything about typing.

4 points by sacado 6632 days ago | link

Ok, I've got annotations working, with annotate, type and rep tuned to make everything work. Displaying annotated data gives the same output as canonical Arc.

pr also displays lists the same way as canonical Arc : instead of

  (foo . (bar . (baz . nil)))

it displays

  (foo bar baz)

And

  (foo . (bar . baz))

is displayed

  (foo bar . baz)

I'm fighting with strings now. I can display individual characters, as long as they are regular ASCII characters. A character is only a Unicode codepoint, encoded as a long. A string is internally an array of codepoints, so it occupies a lot of space, as each character uses 8 bytes. But accessing / modifying characters is fast & easy. When output, strings are converted to UTF-8 (that's where it does not work anymore :).

Edit : it now works, as long as you only use ASCII characters, which is, I admit it, rather stupid, but I had to do that first. And the len primitive is now implemented and works on cons cells and strings.

I didn't play with numbers & tables yet.

-----

2 points by almkglor 6632 days ago | link

Re: symbols - perhaps it's better to leave them in UTF-8, then only convert them to UTF-32 if and only if they are coerced to strings.

I suggest making tables work first before the really weird numbers ^^

Edit: there's also the problem of making I/O ports redirect to strings under construction. 'tostring is pretty much the idiomatic way of creating strings in Arc.

Edit2: As an aside, here's what I intend to do with ac.scm built-ins:

1) Remove ac.scm xdef'ed functions from the mac* stuff in xe.arc

2) Create a special lib-ac.scm.arc, where we can access primitives:

  (set car
    (fn (x)
      (%car x))) ;will access primitive %car, but only in lib-ac.scm.arc

Code that isn't in lib-ac.scm.arc will not be able to access the primitives %foo.

The above will of course allow code like:

  (map car (pair lst))

3) Finish up my inliner, so that user code of the form:

  (car foo)
  =>
  #hash( (type . app) (subx .
   (#hash( (type . ref) (var . #hash( (id . car) (uid . car)))))
      #hash( (type . ref) (var . #hash( (id . foo) (uid . foo))))))

Gets converted, by inlining, to:

  (%car foo)
  =>
  #hash( (type . prim) (prim . %car) (subx .
    #hash( (type . ref) (var . #hash( (id . foo) (uid . foo))))))

-----

3 points by sacado 6631 days ago | link

Yep. I left symbols encoded in UTF-8. When a string is output to anything (included when transformed to a symbol) it is translated to UTF-8. And I was planning to implement tables before the numeric tower. I think there's more fun in tables rather than in numbers :)

As for ports I didn't think about it yet.

-----

1 point by eds 6632 days ago | link

Is this on Anarki yet? (For that matter, I don't even see your version using Boehm GC on Anarki.)

Edit: I just found http://github.com/sacado/arc2c/tree/master. Is this the repo you are talking about?

-----

1 point by sacado 6632 days ago | link

No, it's not on the git yet as I cannot access it this week. And yes, arc2c's git is http://github.com/sacado/arc2c/tree/master

-----

3 points by almkglor 6631 days ago | link

Just a question, but I wonder if you could give eds write access to the repo? And of course, if there's anyone else out there who would like to contribute, just let us know, I think sacado would be glad to accommodate you ^^

-----

1 point by stefano 6631 days ago | link

I'd like to have access too. Currently I don't have much free time to contribute, but if I find some time i'll be glad to contribute. My github username is, surprisingly, 'stefano'.

-----

1 point by sacado 6631 days ago | link

You're on the list now :)

-----

1 point by stefano 6630 days ago | link

Thanks :)

-----

1 point by sacado 6631 days ago | link

Sure, eds, do you have a github account ? What's your login ?

-----

1 point by eds 6631 days ago | link

Yeah, my github username is slaguth.

-----

1 point by sacado 6631 days ago | link

done.

-----

1 point by almkglor 6632 days ago | link

Yes ^^

-----

4 points by sacado 6628 days ago | link

Now, I've got first class functions : they have a type tag and can thus be printed (it only displays #<procedure>, as canonical arc), asked their type, etc. As soon as I did that, Boehm GC started complaining a lot (always saying "hmm, is that a pointer or not ? I dunno, I'm lost...").

Not very dramatic, these are just warning messages. But as Boehm seems almost borken on some machines and as it was starting to bother me, I did it again : I implemented a new version of my own, hand-made, garbage collector. It was easier with real first-class closures and it is much faster this time. On the test code I've written, the program runs twice as fast with my buggy-GC.

The code still partially relies on Boehm GC, as I have to make it know how every data type has to be collected, but it currently collects tagged objects, cons cells and closures. Will be pushed on Monday.

-----

2 points by almkglor 6628 days ago | link

We'll all be waiting ^^. How'd you implement closures? As a structure or just an array? Boehm might get confused in the array case if you're using the first entry of the array as a type tag (which isn't a pointer). Or maybe not; I haven't studied Boehm GC very well.

As for GC: what kind did you write? Copy or mark? If it's marking, I'd suggest a mark-and-don't-sweep collector. I think most incremental and thread-friendly modern GC's are copying though.

Edit: as for me doing the macro hacking stuff, well, it looks like I'm all hacked out. Hehehe^^

-----

1 point by sacado 6627 days ago | link

Hmm... I'm not very good at terminology, but I'm almost sure it's a mark-and-sweep. The implementation relies on system malloc. Every time some memory is required, the user calls gc_malloc. This function calls malloc stores the pointer in an array and returns that pointer. Once the array is full (we're not talking about consumed memory yet, but about built references, so it can break down if you build very big objects), collection is performed : everything not reachable from the stack (or recursively from reachable objects) is freed. It has to be improved, but for now on it's working.

I implemented closures as an array of long. Very easy to deal with. The first one is the tag type, the second one is the goto label, the third is size of the array (we need it for garbage collection) and all others are the arguments (well, they are objs, but they are implemented as a long).

-----

2 points by almkglor 6626 days ago | link

I see.

It does indeed seem to be a mark-and-sweep. Generally though most GC's will handle the heap: they allocate one big bunch of memory via malloc() and allocate from that.

"Mark" means to determine if a memory area is accessible. Usually this means setting some sort of bit or variable for each memory area. After you've marked all reachable memory, you perform a "sweep": any unmarked memory is freed.

A slightly-more-efficient algorithm than mark-and-sweep is mark-and-don't-sweep (obviously because you skip the "sweep" step), but this requires us to handle the heap directly. Here's an explanation:

Each memory area in the heap has a "free/in-use" bit. This bit's sense of "free" can vary. For example, at any one time, all "free/in-use" bits may have the meaning:

  0 = FREE
  1 = IN-USE

At another time, however, the meaning might be:

  0 = IN-USE
  1 = FREE

The magic here is the way the free/in-use bit is interpreted by the memory manager.

Let's start with the following assumption:

  MEANING:
  0 = FREE
  1 = IN-USE
  +---------+--------------+---+------------+---------------+-------+
  |    0    |       1      | 1 |     0      |      1        |   1   |
  +---------+--------------+---+------------+---------------+-------+
   ^
   Alloc pointer

Now, suppose the application requests for memory. The allocator moves the alloc pointer and marks the memory allocated as "in-use".

  +---+-----+--------------+---+------------+---------------+-------+
  | 1 |  0  |       1      | 1 |     0      |      1        |   1   |
  +---+-----+--------------+---+------------+---------------+-------+
   |   ^
   v   alloc pointer
  returned

Now suppose we allocate a bit of memory that is too large for the current free memory pointed at the alloc pointer:

     |-------| <- I need something this big
  +---+-----+--------------+---+------------+---------------+-------+
  | 1 |  0  |       1      | 1 |     0      |      1        |   1   |
  +---+-----+--------------+---+------------+---------------+-------+
       ^
       alloc pointer

Obviously, we have to skip the free memory that's too small. However, let me introduce an invariant: everything to the left of the alloc pointer must be in-use. So if ever we skip free memory that's too small, we still mark it in-use, but we don't return it (obviously, it's too small!). Instead we continue over to the next free memory and see if that is large enough, and so on.

In this case the very next portion of memory is available:

                               |-------| <- I need something this big
  +---+-----+--------------+---+-------+----+---------------+-------+
  | 1 |  1  |       1      | 1 |   1   |  0 |      1        |   1   |
  +---+-----+--------------+---+-------+----+---------------+-------+
                                |       ^alloc pointer
                                v
                                returned

And so on, until we consume the heap:

  +---+-----+--------------+---+-------+----+---------------+-------+
  | 1 |  1  |       1      | 1 |   1   |  1 |      1        |   1   |
  +---+-----+--------------+---+-------+----+---------------+-------+

..which now requires garbage collection.

Then the magic here comes in: we flip the meaning of the free/in-use bit. This frees everyone!

  MEANING:
  0 = IN-USE
  1 = FREE
  +---+-----+--------------+---+-------+----+---------------+-------+
  | 1 |  1  |       1      | 1 |   1   |  1 |      1        |   1   |
  +---+-----+--------------+---+-------+----+---------------+-------+

Then we begin the "mark" step, specifying reachable memory areas as in-use:

  +---+-----+--------------+---+-------+----+---------------+-------+
  | 0 |  1  |       0      | 1 |   0   |  1 |      1        |   1   |
  +---+-----+--------------+---+-------+----+---------------+-------+
       ^alloc pointer

...and afterwards... uhh... well.... we just allocate as normal, except the meaning of 0/1 of the free/in-use bit has flipped. "Don't sweep". Thus our sweep step is part of our allocation.

As an aside: I've started writing an 'eval function for use with macros in arc2c. This is done by creating a new "eval" function using (make-eval) in make-eval.arc. It's not done yet though.

My plan is that for each compilation run, we (make-eval) a new version of 'eval. Why? Because we want to protect the global environment.

For example, the user code might want to use the following macro:

  (mac xe body
    `(tag (div class 'xe)
        ,@body))

Unfortunately, 'xe is a function defined and used by arc2c. If we were to simply 'eval all 'mac forms, then user code could thrash arc2c.

Instead, we create a "protected" eval. This eval, when used, will prevent writes to global variables. Instead, writes to global variables will mutate a global-variable table, not the "real" global variables.

However, it's not done yet, there are a bunch of TODO's floating around. And unfortunately, I might not be able to do this for a week. Or maybe two weeks, or maybe a month.

A friend of mine has a pretty big personal Real Life(TM) problem (it involves, like nearly every big personal RL problem, a member of the opposite sex). I'll need to help him for now. Sorry.

(the guy will, usa embassy willing, be in san francisco, california, usa a month from now. he's had to borrow quite a bit from his friends too, so we're all pretty tapped out and can't accompany him. err. just wondering if someone near there could keep an eye on him.)

The code for the 'eval interpreter is on github. Anyone who wants to try continuing it is welcome to do so. You're even welcome to completely dismantle that bit and start some other way of doing macros.

Bye for now, AmkG

-----

1 point by sacado 6626 days ago | link

That looks clever, and not too complicated... I'll try to implement it when I'll have enough time... As a matter of fact, dealing with heap space myself would let me reduce the size of closure objects (I wouldn't need to know the # of arguments they hold anymore).

Well, good luck with your friend, and see you soon !

-----

3 points by almkglor 6632 days ago | link

Okay. Although I'd like to ask if you can implement the Anarki 'defcall and 'call* too?

Basically this means that instead of updating the pc C-var at END_JUMP(), our jump: C-label does the updating. If it's an ordinary function, extract pc from CLOSURE_REF(LOCAL(0),0). If it's not, lookup its type in the call* table, and rearrange the stack:

  (obj k . ind)
  =>
  ((call* (type obj)) k obj . ind)

This also means that closures need type tags now too.

-----

3 points by sacado 6631 days ago | link

Ok, I'll work on this too. I need to type closures now anyway, as I am implementing full support for things like pr and type.

Btw, about type tags and the pointer-as-fixnum hack : I read a paper about the implementation of Lua (for tips about tables implementation). In Lua, everything, including numbers, is implemented as a structure containing first the type, then the data for the actual object (somebody already mentioned this in the forum). The reason is that the ANSI C standard does not allow the pointer-as-fixnum trick : you cannot know for sure that addresses will have a 0 low bit. In practice, it works on the most common architectures, but it's not fully portable (which is a problem given Lua's target). Hmm, you were right, maybe later we could add another version of codegen that would be slower and more memory-consuming but completely portable.

-----

3 points by almkglor 6631 days ago | link

True, which is why I was always a bit leery of the trick. Besides, if you're going to implement bignums anyway, you might as well start off "fixnums" as very small bignums ^^. LOL. Of course there's the obvious problem that most applications won't use bignums, and in applications that do, most numbers still aren't bignums.

But then, if you add two fixnums together and the result won't fit in a fixnum...

As an aside, in the current mzscheme implementation, it seems fixnums are type 'int and everything else is type 'num

Finally: if you need to access call* , it may be possible to determine its position in GLOBAL() and add a C-global CALL_STAR:

  ; in codegen...
  (list
     "obj * CALL_STAR;"
     ...
     "int main(){
     CALL_STAR = &GLOBAL(" (pos [is _!uid 'call*] global-vars) ");
     initialize_constants();
     execute(0);
     }")

Quite obviously the unused global elimination step will have to avoid eliminating 'call* though.

Edit: regarding typing closures: I've been reading about flexible array members of structs, so it might be possible to define the closure type as:

  struct{
    long type; /* T_FN */
    obj vars[];
  } closure_type;

For non-C99 compliant compilers (practically everything except GCC), though, you need:

  struct{
    long type; /* T_FN */
    obj vars[1];
  } closure_type;

-----