Lemonodor: Chio

September 29, 2003

Chio

Jonathan Springarn's Chio, a string processing library, kind of freaks me out.

An example from the Chio page:

(let* ((str "  paper 16  --  stone 25  ---  scissors 36  ")
       (word #~"\w*")
       (num #~"[0-9]+")
       (test (compile-test (:& (:&0 word) #~"\s+" (:&1 num)))))
  (with-test-split foo (#~"-+" test str :until (= foo-count 2) :tags (:map))
    (cons (mref foo 0 :R)                ; read the word as a symbol
          (mref foo 1 :I))))

Posted by jjwiseman at September 29, 2003 07:37 PM

Comments

Careful there... From the copyright page:

The materials on this website, including, but not limited to, the document Chio: a String Processing Library for Common Lisp, and the Chio examples, may not be reproduced without the written consent of its author. Online distribution is restricted to the author's site: http://www.toiling-in-obscurity.net/chio/

Posted by: Andreas Fuchs on September 30, 2003 12:58 AM

Uh, regardless of the author's intentions, this is defensible as fair use.

Posted by: Reilly Hayes on September 30, 2003 06:25 AM

I checked out the page with interest after the announcement, but I found the code difficult to read. I'm glad I'm not the only one!

Posted by: Zach Beane on September 30, 2003 08:49 AM

Thanks for noticing Chio.

Sorry to "freak" you out! But let me point out that my
Chio package comes with documentation. Learning to use Chio
is something like learning Perl. You can't just look at a few
examples and be using it like a pro. It's like a new language
-- perhaps a good comparison would be Lisp's "LOOP" macro (which
I admit I don't use) or Lisp's "FORMAT" function. Each of these,
like Chio, is a sort of sub-language of Lisp. If you are a Lisp
programmer, I believe that you will find Chio to be more powerful,
flexible, and intuitive than Perl, in terms of its string processing
abilities. If you want to work with strings, I suggest making
the effort to read the manual. Before looking at the examples on
the Chio site, please at least read the manual's introduction.

Here is an example similar to the one you posted on Sept 29.
This example uses WITH-TEST-BINDS instead of WITH-TEST-SPLIT.
(When learning Chio, it is best to start with with-test-binds,
then with-test-format, and finally with-test-split).

(let* ((str "  paper 16  --  stone 25  ---  scissors 36  ")
       (word #~"\w*")
       (num #~"[0-9]+")
       (test (compile-test (:& (:&0 word) #~"\s+" (:&1 num)))))
  (with-test-binds foo (test str :tags (:map))
    (cons (mref foo 0 :R)         ; read the word as a symbol
          (mref foo 1 :I))))      ; read the number as a fixnum

In this example,

WORD = #~"\w*" is a regular expression that matches zero or
more word (alpha or numerical) characters

  NUM = #~"[0-9]+" matches one or more digits

  #~"\s+" matches one or more whitespace characters

In the string STR = " paper 16 -- stone 25 --- scissors 36 "
we want to find matches for a concatenation of those three tests: that is, a WORD followed by one or more whitespaces followed by a NUM. There are three such substrings that match this concatenation:
     "paper 16"
     "stone 25"
     "scissors 36"
If that were all we wanted to do, we would use the "binding-tree"
    (:&0 word #~"\s+" num)
because keywords starting with & represent concatenation in Chio.

But we want to do more than this -- we also want to separately access the WORD part and the NUM part of each match. In order to capture the WORD part, we replace WORD with (:&0 WORD) and to capture the NUM part, we replace NUM with (:&1 NUM). Chio knows it is supposed to capture the results from the :&0 and :&1 because the symbol-names "&0" and "&1" have length greater than one.

The symbol FOO in the example is a label we attach to the macro call. You need this label in order to refer to match results. Thus
(MREF FOO 0 :R) refers to the WORD part, and
(MREF FOO 1 :I) refers to the NUM part.

The keywords :R and :I tell Chio how to read the matching substring.

:R tells Chio to read the match the way the Lisp reader would
read it, so for example (mref foo 0 :r) reads "paper" as the
symbol PAPER and (mref foo 1 :r) would read "16" as the fixnum 16.

:I would tell Chio to read the match as an integer, so for example
(mref foo 0 :i) would try to read "paper" as an integer and report
an error, but (mref foo 1 :i) would read "16" as the fixnum 16.

There are many other possibilities not mentioned here.
For instance,
:S would tell Chio to read the match as a string, so for example
(mref foo 0 :s) would read "paper" as the string "paper" and
(mref foo 1 :s) would read "16" as the string "16".

The :MAP tag tells Chio to collect the results in a list. For each match, the result is (CONS (MREF FOO 0 :R) (MREF FOO 1 :I)), so the macro returns a list of three conses:
((PAPER . 16) (STONE . 25) (SCISSORS . 36))
There are many other tags that can be used to customize the behavior
of Chio in various ways.

- Jonathan Spingarn

PS: please note the correct spelling of my name: S P I N G A R N
also, about posting an example, I agree with you - fair use. Honored, in fact. Let me know if you have any questions.

Posted by: Jonathan Spingarn on October 6, 2003 10:45 PM

Jonathan,

I for one would like to know what Chio buys me that regular expressions (for example cl-ppcre) do not. Any clear advantages?

Posted by: waiter on October 7, 2003 02:28 AM

The most significant difference between CL-PPCRE and Chio is that where cl-ppcre takes a regex as an argument, Chio takes instead a "binding-tree". A binding-tree differs from a regex in that it can contain arbitrary Lisp code, whereas the regex (or the parse-tree that ppcre creates from a regex) allows only certain tokens. Using Chio, it is as if you could write your own code and insert it smack into the middle of a regex. Chio's contribution is the library of routines that helps you to write such code and the syntactic sugar that makes it convenient to use. This allows you to encapsulate specific search tasks as functions ("simple-tests") and then use (and reuse) those function search objects whenever you need them, by placing them into other binding-trees. A Perl programmer waits for new features to be added to the language and then says wow look what Perl can do now! A Lisp programmer shouldn't need to wait for new features.

There have to be some trade-offs for this added flexibility. A tool like ppcre sets for itself a more specific and limited task, namely to parse trees having a specific structure using only certain tokens. You would expect it, therefore, to probably be faster (although how much faster, I do not know -- I have done no comparisons). I have not studied its internals, but I suppose that it does a search almost exclusively by putting stuff onto a stack and popping it off. Chio uses a stack whenever it can, but it also sometimes substitutes function calls for a lot of what would otherwise be put on the stack. So instead of popping little stuff off of the stack, you are sometimes instead tracing back through a chain of function calls. That has to be slower. Remember though, that Chio allows you to use arbitrary Lisp code so if speed did turn out to be a problem, in speed critical situations you are free to insert the worlds-fastest-regex-engine, whatever that might be.

In addition to the one major difference, there are several other, more syntactic, ones:

Chio does not attempt to duplicate Perl's regular expressions. While the basics (* + . \w \s \b etc.) are the same, Chio adds some new uses for ! < and > and does not have any of Perl's more esoteric features. I don't think it needs them either, since its added flexibility allows so much freedom.

cl-ppcre uses newly allocated (I think) vectors to contain match results. Chio leaves those results instead on a global *stack* simple-vector so that no new vectors are allocated by the macro. Chio then provides a convenient interface to those results using the macro MREF. MREF reads match results in a number of different formats, depending on a keyword argument:
:S to read match as a string
:I to read as with parse-integer
:R to read as would the Lisp reader
:A to read the position of the start of the match
:E to read the position of the end of the match
Instead of one of these keyword arguments you can use the name of any Lisp function (or lambda expression) to read the match any way you want.

cl-ppcre macros surround the BODY code with a block named NIL allowing an exit by RETURN. WITH-TEST-BINDS forces the user to provide a NAME for the enclosing block. This allows the user two ways to exit early: a RETURN-FROM NAME exits the block with any return value, overriding whatever value would have been returned otherwise, whereas a RETURN-FROM NIL merely stops processing additional matches (assuming the :g tag is used) without disturbing the return value. NAME is also used as an argument to MREF to refer to match results.

cl-ppcre provides several macros to return results in different ways. With Chio, these options are governed more simply by the use of tag arguments.
The :map tag causes a list of values to be returned. The :map-if tag does the same but omits NIL values from the list. There is no need to use a different macro in order to return a list of strings rather than a list of positions: with :map you return a list of whatever the BODY evaluates to.
The :g tag (as in Perl) causes additional matches to be sought after the first is found. You are free to specify a form to provide a return value. Otherwise, NIL is returned and your BODY can be executed for any desired side-effects, once for each match.

Like ppcre, Chio provides substitution and splitting abilities. Those macros have great flexibility, and I will be glad to expound on them if someone gives me an excuse, but I fear I am getting long winded, so I'll stop for now. The essence of Waiter's question was answered in the first paragraph.

Here are two examples with some comments:

********** example 1: **********

  Define a function (CHARS-TO-TRIM CHAR-BAG STRING) that does the same
  as the Common Lisp function STRING-TRIM, except that it returns the
  coordinates rather than the trimmed string: If all characters in
  STRING belong to CHAR-BAG, return NIL. Otherwise, return
    (1) index of first character in STRING not belonging to CHAR-BAG and
    (2) index after last character not belonging to CHAR-BAG. CHAR-BAG can
        be any sequence of characters. For example, if CHAR-BAG contains
        all the whitespace characters then CHARS-TO-TRIM gives you the
        information you would need to cut off all leading and trailing
        whitespace characters.

  (defun chars-to-trim (char-bag string)
    (let ((in-bag (chtes+! char-bag nil nil nil :starp))    ; longest run of >=0 characters in bag (no back-tracking)
          (out-char (chtes char-bag nil nil :negatep)))     ; single character not in bag
      (with-test-binds AA ((:& in-bag (:&0 dot* out-char)) string)    ; dot* is same as #~".*"
        (values (mref AA 0 :a) (mref AA 0 :e)))))    ; :a and :e read start and of match, respectively

  comments:
  (CHTES+! char-set &optional char-ranges fns negatep starp case-insensitivep)
  returns a test for a longest possible run of length >0 (with no backtracking) of
  characters belonging to CHAR-SET, CHAR-RANGES, or satisfying FNS. Negate if NEGATEP is
  not NIL. Length >=0 if STARP is not NIL. Case insensitive if CASE-INSENSITIVEP is not NIL.

  CHTES is a similar macro that returns a test for just a single character.

Notice that no regular expressions are used in this example. IN-BAG, OUT-CHAR,
and DOT* are Lisp functions, not some sort of tokens.

Can anyone tell me how to do this with cl-ppcre, without knowing in advance
what characters are in CHAR-BAG ?

********** example 2: **********

Search for two words such that
(1) each word contains exactly two vowels, and
(2) the words have the same vowels.

(let ((vowels '(#\a #\e #\i #\o #\u))
      save)                                                  ; to save the two vowels found in the first word
  (let ((has-two-vowels (predicate-simple-test (start)       ; simple-test that accepts any string with two vowels
                          (let ((count 0) save0)
                            (dolist (ch vowels)
                              (when (find ch *string* :start start :end *end* :test #'char-equal)
                                (incf count)
                                (push ch save0)))
                            (when (= count 2)
                              (setf save save0)))))
        (has-same-vowels (predicate-simple-test (start)      ; simple-test accepting any string with same 2 vowels as SAVE
                           (let ((count 0))
                             (when
                               (dolist (ch vowels t)
                                 (when (find ch *string* :start start :end *end* :test #'char-equal)
                                   (when (or (> (incf count) 2) (not (member ch save))) (return nil))))
                               (= count 2))))))
    (with-test-binds foo ((:& (:&0 (t_and #~"\b\a+!" has-two-vowels)) #~".+<" (:&1 (t_and #~"\b\a+!" has-same-vowels)))
                      "How pleasant the banks of the clear-winding Devon,
                         With green-spreading bushes, and flowers blooming fair!
                       But the bonniest flower on the banks of the Devon
                         Was once a sweet bud on the braes of the Ayr.")   ; from Robert Burns "The banks of the Devon"
      (format t "~s has the same vowels as ~s~%" (mref foo 1 :S) (mref foo 0 :S)))))

prints: "clear" has the same vowels as "pleasant", returns: NIL

comments:
  PREDICATE-SIMPLE-TEST is a macro that helps you to write
  code to filter strings satisfying a given property. For example,
  HAS-TWO-VOWELS does what it's name implies, but it also has a
  side-effect -- when it finds a match, it saves the
  vowels where they can be seen by HAS-SAME-VOWELS.

  (T_AND first-test &rest other-tests) is a Chio function that
  creates a new test by accepting strings that pass first-test and
  also all of other-tests. Thus (t_and #~"\b\a+!" has-two-vowels)
  accepts runs of alphabetical characters having exactly two vowels.

  The exclamation mark ! is used by Chio to indicate that only the
  first ending found should be used. In other words, it eliminates
  back-tracking. It makes sense to use it here because it would
  not make sense, if a second word were not found to match "pleasant",
  to try again with "pleasan", "pleasa", and so forth. The same
  effect could be achieved by using #~"\b\a+\b" instead of #~"\b\a+!".

  The < is used by Chio (in certain contexts) to return results
  in ascending order. Thus the regex #~".+<" is the same as #~".+"
  except that it prefers shorter matches.

  I do not doubt that this could be done using Perl, but I would
  hate to read the code! If you were using a regex tool like cl-ppcre,
  you would have to check the properties (vowels, same) inside
  the body of the macro since those tests couldn't be put into the
  parse-tree. I think what you gain here is mostly flexibility and
  clarity of code. I do not know what the costs are in terms of speed.
  It is also possible that my macros may expand into a larger volume of
  code. I will be interested in hearing about experiences from people
  who try it out.

Posted by: Jonathan Spingarn on October 7, 2003 12:10 PM

Jonathan,

Thanks for your answer. I have to take a new look at the Chio package to understand it more fully, and also to get some awareness of what sort of problems it is suited to solve. The documentation seems extensive, so from there I'll hopefully understand better what learning the Chio syntax buys me.

Again, thanks for your answer and examples.

Posted by: waiter on October 7, 2003 04:47 PM