Arc Forum
What's broken about my "copy file" function?
4 points by zck 2623 days ago | 7 comments
I have a static site generator I wrote in Arc, and one thing it needs to do is copy non-generated files (e.g. CSS, some JavaScript files) to the output.

I now know that Anarki's libraries include `cp`, which copies files. That should work (subject to getting an answer for http://www.arclanguage.org/item?id=20200). But before I knew about that, I wrote my own:

    (def copy-file (src-loc dest-loc)
      (w/infile src src-loc
        (w/outfile dest dest-loc
          (whilet char (readc src)
            (disp char dest)))))

This seems to work just fine for text. But for my favicon file, I get corrupted results. Most of the file seems to be copied fine. The first part that is different starts like this:

    00000060: 7f00 8989 8900 8f8f 8f00 9898 9800 a4a4  ................
    00000070: a400 acac ac00 aeae ae00 b2b2 b200 b5b5  ................

and ends up being copied like this:

    00000060: 7f00 efbf bdef bfbd efbf bd00 efbf bdef  ................
    00000070: bfbd efbf bd00 efbf bdef bfbd efbf bd00  ................

The file also grows: the original is 1406 bytes, and the copy is 1526 bytes.

Any idea why this would be? The code seems pretty simple, so I'm not sure where the bug is.



3 points by Oscar-Belletti 2623 days ago | link

I don't know, this is strange...

Perhaps you could try using readb and writeb.

-----

3 points by zck 2622 days ago | link

`readb` and `writeb` work! Thanks so much!

I'm not sure why it doesn't work otherwise, but it points to something about interpreting the bytes as characters.
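
For anyone who lands here later, a byte-oriented version of the copy might look like the sketch below. It assumes the same `w/infile`/`w/outfile` structure as the original and only swaps `readc`/`disp` for `readb`/`writeb`. The name `copy-file-bytes` is made up for illustration; it is not something from Arc or Anarki.

```arc
; Hypothetical name; a sketch, not a library function.
; Copies src-loc to dest-loc one byte at a time, so no UTF-8
; decoding happens and binary data passes through unchanged.
(def copy-file-bytes (src-loc dest-loc)
  (w/infile src src-loc
    (w/outfile dest dest-loc
      ; readb returns nil at end of file. A 0 byte is not nil in Arc,
      ; so it does not end the loop early.
      (whilet b (readb src)
        (writeb b dest)))))
```

Usage would be something like `(copy-file-bytes "favicon.ico" "out/favicon.ico")`, where both paths are placeholders.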

-----

4 points by aw 2622 days ago | link

Not all byte sequences are valid UTF-8.

-----

4 points by akkartik 2622 days ago | link

This was my first thought as well. But why does `readc` (Racket's `read-char`) silently accept invalid UTF-8?

-----

5 points by rocketnia 2593 days ago | link

It's reading the invalid sequence as � U+FFFD REPLACEMENT CHARACTER, which translates back to UTF-8 as EF BF BD (as we can see in the actual results above). The replacement character is what Unicode offers for use as a placeholder for corrupt sequences in encoded Unicode text, just like the way it's being used here.
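
The size change fits this explanation too. Each rejected byte is read as one U+FFFD and written back as three bytes, a net gain of two bytes per bad byte. As a rough check (assuming every bad byte was replaced individually, as in the dump above), the sizes from the original post work out like this:

```arc
; Sizes are the ones reported in the original post.
(- 1526 1406)        ; extra bytes in the copy => 120
(/ (- 1526 1406) 2)  ; bytes replaced, at +2 bytes each => 60
```

So under that assumption, about 60 of the 1406 bytes were not valid UTF-8 on their own.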

-----

3 points by akkartik 2623 days ago | link

The files seem to diverge after a 0 byte. Can you check if it's the first 0 byte in the file?

Edit: never mind, I was hallucinating.

-----

2 points by zck 2623 days ago | link

I just checked; it's not the first 0 byte.

Just to provide a little extra information, the image I'm testing with is my favicon: http://zck.me/favicon.ico

-----