Strings as lists are very useful; they made Arkani, the wiki in Arc, much easier to implement. Scanners allow us to treat strings as lists; this also makes a parser combinator library, such as raymyers' treeparse.arc, very much usable for string parsing.
pg once mentioned that he might actually, some day, implement strings such that their interface would be identical to that of lists. Memory inefficiency concerns aside (I've posted a memory-cheap implementation of lists which uses arrays for much of the list run; it has all the semantics of lists but has some of the access times of arrays), I've found treating strings as lists very useful in implementing Arkani (Arki?), the wiki in Anarki. (It's in the file wiki-arc.arc on what used to be arc-wiki.git.)
Scanners are an attempt to, primarily, use strings as lists. pg hasn't implemented it in ArcN yet because it's "misleading", since 'scar (= (car foo) 42) and 'scdr (= (cdr foo) 42) won't work properly on strings. However, scanners represent the realization that 'scar and 'scdr are pretty rare anyway; so you might as well create an abstract "limited" form of list, which supports only 'car and 'cdr operations. These scanners, among other things, can be used to scan into strings.
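To make the "limited list" idea concrete, here is a minimal sketch in plain Arc. It is not the scanner library's actual interface: the names 'str-scan, 'scan-car and 'scan-cdr are made up for illustration, and unlike the real library this sketch does not make the built-in 'car and 'cdr work on the result.

```arc
; Hypothetical sketch, not the scanner library's API.
; A scanner is either nil (end of string) or a two-element list:
; the current character and a thunk that produces the rest.
(def str-scan (s (o i 0))
  (if (< i (len s))
      (let rest nil
        (list (s i)
              ; the tail is only built when asked for, then remembered
              (fn () (or rest (= rest (str-scan s (+ i 1)))))))))

(def scan-car (sc) (car sc))

(def scan-cdr (sc) (if sc ((cadr sc))))

; (scan-car (str-scan "abc"))             => #\a
; (scan-car (scan-cdr (str-scan "abc")))  => #\b
```

Only the two read operations exist here; there is deliberately no counterpart to 'scar or 'scdr.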
For example, in Arkani, the history list is modelled as exactly that: a list of changes between revisions of the article. However, it has to be stored on-disk, as a string of UTF-8 (it seems that mzscheme can actually handle this properly). We could store it as an Arc-readable representation of the list, but this has the drawback that it makes the metadata longer. As an example, this is how the diff between revisions might look:
((4 skip) delete (insert "article.\r\n\r\n") delete (68 skip))
Instead, Arkani uses its own, more compact format: a string of one-letter commands (any line breaks inside it come from inserted \r\n sequences). Each one-letter command may be preceded by a number, giving the number of times it is executed or, in the case of `i', the number of characters to insert. An `@' ends the diff list. However, this is obviously not parseable by 'read.
In addition, most of the time we expect that users would be more interested in changes in more recent versions of the article rather than in older ones. If we were to use the built-in Arc reader, it would parse the entire history; however, scanners are inherently lazy, and won't execute the 'cdr unless you actually ask for it (and will also take the liberty of memoizing it).
Now although the scanner library I created includes scanners for strings (as lists of characters), it also allows you to create your own scanner. In the case of the Arkani history reader, it reads through the string, decomposing each history entry in the string and creating a virtual object for each history entry for us.
However, it doesn't scan through the entire string. Instead it just computes the 'car of the history list, and then adds a promise for the 'cdr - the promise being to call itself, but with the index set to after the end of the current entry.
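The "compute the 'car, promise the 'cdr" pattern can be sketched as follows. This is not the code from wiki-arc.arc: 'next-at and 'hist-scan are hypothetical names, the entry is returned as a raw substring rather than a decoded object, and the only thing assumed about the format is that an `@' ends each entry.

```arc
; Hypothetical sketch of a lazy history reader.
; Index of the first #\@ at or after i, or nil if there is none.
(def next-at (s i)
  (if (>= i (len s)) nil
      (is (s i) #\@) i
      (next-at s (+ i 1))))

; Returns nil, or a list of the current entry and a thunk for the rest.
; Only the current entry is read; the thunk restarts just past its `@'.
(def hist-scan (s (o i 0))
  (let end (next-at s i)
    (if end
        (let rest nil
          (list (cut s i end)
                (fn () (or rest (= rest (hist-scan s (+ end 1))))))))))

; The strings below are made-up test data, not real Arkani history.
; (car (hist-scan "4sd@2s@"))          => "4sd"
; (car ((cadr (hist-scan "4sd@2s@")))) => "2s"
```

If the caller only ever looks at the first entry, the remainder of the string is never examined.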
Another part of Arkani which uses scanners is the paragraph divider. When rendering the page, Arkani first tries to figure out paragraph divisions. Similar to the way the history log scanner works, the paragraph divider first scans through the text, ignoring empty lines until it reaches a set of non-empty lines. It then ends the paragraph just prior to an empty line, and adds a promise to look at the next paragraph starting after the empty line.
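A simplified sketch of that paragraph divider is shown here. It assumes the text has already been split into a list of line strings, which is not how Arkani does it (Arkani scans the text itself), and 'para-scan is a hypothetical name.

```arc
; Hypothetical sketch: lazily group a list of lines into paragraphs.
; Returns nil, or a list of the first paragraph (a list of lines)
; and a thunk that yields the scanner for the remaining lines.
(def para-scan (lines)
  ; ignore empty lines before the paragraph
  (while (and lines (is (car lines) ""))
    (= lines (cdr lines)))
  (when lines
    (let para nil
      ; the paragraph ends just before the next empty line
      (while (and lines (isnt (car lines) ""))
        (push (car lines) para)
        (= lines (cdr lines)))
      (let rest nil
        (list (rev para)
              (fn () (or rest (= rest (para-scan lines)))))))))

; (car (para-scan '("" "a" "b" "" "c")))          => ("a" "b")
; (car ((cadr (para-scan '("" "a" "b" "" "c"))))) => ("c")
```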
However, the main advantage of scanners is really the way in which they can be used in conjunction with treeparse.arc. Treeparse.arc was designed for use with lists, not strings; fortunately, by using scanners, strings are lists (or rather, can be wrapped by something which quacks convincingly like a list).
For example, to detect links [[like this]], we have the following code in wiki-arc.arc:
(= p-alphadig
   (pred alphadig:car anything))
...
  ; should really be (many anything), however treeparse.arc
  ; currently does not do backtracking on 'many
  (sem on-plain-wiki-link (many (anything-but #\| close-br)))
  (sem on-wiki-link-completed (many p-alphadig))))
'seq-str is an extension to 'seq: it wraps the string in a scanner that 'seq can understand, so it simply matches the literal sequence of characters in the given string. 'seq itself matches the given series of sub-parsers in order. 'many means 0 or more instances of a parser, while 'anything-but means any element except those listed. 'pred adds a predicate function, so p-alphadig is 'anything restricted by an 'alphadig predicate.
'sem is used to add "semantics", or meaning. 'sem accepts a function and a parser. If the parser succeeds, 'sem passes the parsed sublist to the function; in the case above, the function 'on-plain-wiki-link stores the link destination, while 'on-wiki-link-completed prints the link's text (which includes the trailing alphanumeric characters on [[link]]s).
Using the treeparse library on scanners is quite easy, and allows us to use the same library for lists (the original intended function for treeparse) and strings (made possible by the use of scanners).