This page was generated at 2007-05-07 07:33:29 GMT.
(syn r14389, pugs-tests r16195)

TITLE

Synopsis 5: Regexes and Rules

AUTHORS

Damian Conway <damian@conway.org> and Allison Randal <al@shadowed.net>

VERSION

   Maintainer: Patrick Michaud <pmichaud@pobox.com> and
               Larry Wall <larry@wall.org>
   Date: 24 Jun 2002
   Last Modified: 27 Apr 2007
   Number: 5
   Version: 58

This document summarizes Apocalypse 5, which is about the new regex syntax. We now try to call them "regexes" rather than "regular expressions", because they haven't been regular expressions for a long time, and we think the popular term "regex" is in the process of becoming a technical term with a precise meaning of: "something you do pattern matching with, kinda like a regular expression". On the other hand, one of the purposes of the redesign is to make portions of our patterns more amenable to analysis under traditional regular expression and parser semantics, and that involves making careful distinctions between which parts of our patterns and grammars are to be treated as declarative, and which parts as procedural.

In any case, when referring to recursive patterns within a grammar, the terms rule and token are generally preferred over regex.

New match result and capture variables

The underlying match result object is now available as the $/ variable, which is implicitly lexically scoped. All access to the current (or most recent) match is through this variable, even when it doesn't look like it. The individual capture variables (such as $0, $1, etc.) are just elements of $/.

By the way, unlike in Perl 5, the numbered capture variables now start at $0 instead of $1. See below.
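A small illustration of the intended behavior (the example string here is invented):

    "Hello, World" ~~ / (\w+) ', ' (\w+) /;
    say $/;    # the whole Match object: matched "Hello, World"
    say $0;    # first capture, "Hello" -- numbering starts at zero
    say $1;    # second capture, "World"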

Unchanged syntactic features

The following regex features use the same syntax as in Perl 5:

While the syntax of | does not change, the default semantics do change slightly. We are attempting to concoct a pleasing mixture of declarative and procedural matching so that we can have the best of both. See the section below on "Longest-token matching".

Simplified lexical parsing

Unlike traditional regular expressions, Perl 6 does not require you to memorize an arbitrary list of metacharacters. Instead it classifies characters by a simple rule. All glyphs (graphemes) whose base characters either are the underscore (_) or have a Unicode classification beginning with 'L' (i.e. letters) or 'N' (i.e. numbers) are always literal (i.e. self-matching) in regexes. They must be escaped with a \ to make them metasyntactic (in which case that single alphanumeric character is itself metasyntactic, but any immediately following alphanumeric character is not).

All other glyphs--including whitespace--are exactly the opposite: they are always considered metasyntactic (i.e. non-self-matching) and must be escaped or quoted to make them literal. As is traditional, they may be individually escaped with \, but in Perl 6 they may be also quoted as follows.

Sequences of one or more glyphs of either type (i.e. any glyphs at all) may be made literal by placing them inside single quotes. (Double quotes are also allowed, with the usual interpolative semantics.) Quotes create a quantifiable atom, so while

    moose*

quantifies only the 'e' and would match "mooseee", saying

    'moose'*

quantifies the whole string and would match "moosemoose".
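So, schematically (both target strings here are invented examples):

    "mooseee"    ~~ / moose* /       # the final 'e' is quantified
    "moosemoose" ~~ / 'moose'* /     # the whole quoted atom is quantified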

Here is a table that summarizes the distinctions:

                 Alphanumerics        Non-alphanumerics         Mixed

 Literal glyphs   a    1    _        \*  \$  \.   \\   \'       K\-9\!
 Metasyntax      \a   \1   \_         *   $   .    \    '      \K-\9!
 Quoted glyphs   'a'  '1'  '_'       '*' '$' '.' '\\' '\''     'K-9!'

In other words, identifier glyphs are literal (or metasyntactic when escaped), non-identifier glyphs are metasyntactic (or literal when escaped), and single quotes make everything inside them literal.

Note, however, that not all non-identifier glyphs are currently meaningful as metasyntax in Perl 6 regexes (e.g. \1 \_ - !). It is more accurate to say that all unescaped non-identifier glyphs are potential metasyntax, and reserved for future use. If you use such a sequence, a helpful compile-time error is issued indicating that you either need to quote the sequence or define a new operator to recognize it.

Modifiers

Changed metacharacters

New metacharacters

Bracket rationalization

Variable (non-)interpolation

Extensible metasyntax (<...>)

Both < and > are metacharacters, and are usually (but not always) used in matched pairs. (Some combinations of metacharacters function as standalone tokens, and these may include angles. These are described below.) Most assertions are considered declarative; procedural assertions will be marked as exceptions.

For matched pairs, the first character after < determines the nature of the assertion:

The following tokens include angles but are not required to balance:

Backslash reform

Regexes are now first-class language, not strings

Backtracking control

Within those portions of a pattern that are considered procedural rather than declarative, you may control the backtracking behavior.

Named Regexes

Nothing is illegal

Longest-token matching

Instead of representing temporal alternation, | now represents logical alternation with declarative longest-token semantics. (You may now use || to indicate the old temporal alternation. That is, | and || now work within regex syntax much the same as they do outside of regex syntax, where they represent junctional and short-circuit OR. This includes the fact that | has tighter precedence than ||.)
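For example (a sketch; the target string "foobar" is invented for illustration):

    "foobar" ~~ / foo | foobar /;    # matches "foobar": longest token wins
    "foobar" ~~ / foo || foobar /;   # matches "foo": first alternative wins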

Historically regex processing has proceeded in Perl via a backtracking NFA algorithm. This is quite powerful, but many parsers work more efficiently by processing rules in parallel rather than one after another, at least up to a point. If you look at something like a yacc grammar, you find a lot of pattern/action declarations where the patterns are considered in parallel, and eventually the grammar decides which action to fire off. While the default Perl view of parsing is essentially top-down (perhaps with a bottom-up "middle layer" to handle operator precedence), it is extremely useful for user understanding if at least the token processing proceeds deterministically. So for regex matching purposes we define token patterns as those patterns containing no whitespace that can be matched without side effects or self-reference.

To that end, every regex in Perl 6 is required to be able to distinguish its "pure" patterns from its actions, and return its list of initial token patterns (transitively including the token patterns of any subrule called by the "pure" part of that regex, but not including any subrule more than once, since that would involve self reference, which is not allowed in traditional regular expressions). A logical alternation using | then takes two or more of these lists and dispatches to the alternative that matches the longest token prefix. This may or may not be the alternative that comes first lexically. (However, in the case of a tie between alternatives, the textually earlier alternative does take precedence.)
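As a sketch of how dispatch on the longest token prefix plays out (the grammar and input here are invented):

    grammar G {
        token TOP     { <keyword> | <ident> }
        token keyword { if | unless }
        token ident   { <alpha> \w* }
    }
    # Matching "iffy" dispatches to <ident>, whose token prefix covers
    # four characters, rather than to the textually earlier <keyword>,
    # whose prefix covers only the two characters of "if".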

This longest token prefix corresponds roughly to the notion of "token" in other parsing systems that use a lexer, but in the case of Perl this is largely an epiphenomenon derived automatically from the grammar definition. However, despite being automatically calculated, the set of tokens can be modified by the user; various constructs within a regex declaratively tell the grammar engine that it is finished with the pattern part and starting in on the side effects, so by inserting such constructs the user controls what is considered a token and what is not. The constructs deemed to terminate a token declaration and start the "action" part of the pattern include:

Subpatterns (captures) specifically do not terminate the token pattern, but may require a reparse of the token to find the location of the subpatterns. Likewise assertions may need to be checked out after the longest token is determined. (Alternately, if DFA semantics are simulated in any of various ways, such as by Thompson NFA, it may be possible to know when to fire off the assertions without backchecks.)

Ordinary quantifiers and character classes do not terminate a token pattern. Zero-width assertions such as word boundaries are also okay.

Oddly enough, the token keyword specifically does not determine the scope of a token, except insofar as a token pattern usually doesn't do much matching of whitespace. In contrast, the rule keyword (which assumes :sigspace) defines a pattern that tends to disqualify itself on the first whitespace. So most of the token patterns will end up coming from token declarations. For instance, a token declaration such as

    token list_composer { \[ <expr> \] }

considers its "longest token" to be just the left square bracket, because the first thing the expr rule will do is traverse optional whitespace.

The initial token matcher must take into account case sensitivity (or any other canonicalization primitives) and do the right thing even when propagated up to rules that don't have the same canonicalization. That is, they must continue to represent the set of matches that the lower rule would match.

The || form has the old short-circuit semantics, and will not attempt to match its right side unless all possibilities (including all | possibilities) are exhausted on its left. The first || in a regex makes the token patterns on its left available to the outer longest-token matcher, but hides any subsequent tests from longest-token matching. Every || establishes a new longest-token matcher. That is, if you use | on the right side of ||, that right side establishes a new top level scope for longest-token processing for this subexpression and any called subrules. The right side's longest-token automaton is invisible to the left of the || or outside the regex containing the ||.

Return values from matches

Match objects

Subpattern captures

Accessing captured subpatterns

Nested subpattern captures

Quantified subpattern captures

Indirectly quantified subpattern captures

Subpattern numbering

Subrule captures

Accessing captured subrules

Repeated captures of the same subrule

Aliasing

Aliases can be named or numbered. They can be scalar-, array-, or hash-like. And they can be applied to either capturing or non-capturing constructs. The following sections highlight special features of the semantics of some of those combinations.

Named scalar aliasing to subpatterns

Named scalar aliases applied to non-capturing brackets

Named scalar aliasing to subrules

Numbered scalar aliasing

Scalar aliases applied to quantified constructs

Array aliasing

Hash aliasing

External aliasing

Capturing from repeated matches

:keepall

Grammars

Syntactic categories

For writing your own backslash and assertion subrules or macros, you may use the following syntactic categories:

     token rule_backslash:<w>    { ... }  # define your own \w and \W
     token rule_assertion:<*>    { ... }  # define your own <*stuff>
     macro rule_metachar:<,>     { ... }  # define a new metacharacter
     macro rule_mod_internal:<x> { ... }  # define your own /:x() stuff/
     macro rule_mod_external:<x> { ... }  # define your own m:x()/stuff/

As with any such syntactic shenanigans, the declaration must be visible in the lexical scope to have any effect. It's possible the internal/external distinction is just a trait, and that some of those things are subs or methods rather than subrules or macros. (The numeric regex modifiers are recognized by fallback macros defined with an empty operator name.)

Pragmas

Various pragmas may be used to control various aspects of regex compilation and usage not otherwise provided for. These are tied to the particular declarator in question:

    use s :foo;         # control s defaults
    use m :foo;         # control m defaults
    use rx :foo;        # control rx defaults
    use regex :foo;     # control regex defaults
    use token :foo;     # control token defaults
    use rule :foo;      # control rule defaults

(It is a general policy in Perl 6 that any pragma designed to influence the surface behavior of a keyword is identical to the keyword itself, unless there is good reason to do otherwise. On the other hand, pragmas designed to influence deep semantics should not be named identically, though of course some similarity is good.)

Transliteration

Substitution

There are also method forms of m// and s///:

     $str.match(/pat/);
     $str.subst(/pat/, "replacement");
     $str.subst(/pat/, {"replacement"});
     $str.=subst(/pat/, "replacement");
     $str.=subst(/pat/, {"replacement"});
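For example, a replacement that must be computed from each match goes in a closure (a sketch; the example string and the numification of $/ here are assumed for illustration):

    my $str = "version 41";
    $str.=subst(/\d+/, { $/ + 1 });    # closure is evaluated at match time
    # $str is now "version 42"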

There is no syntactic sugar here, so in order to get deferred evaluation of the replacement you must put it into a closure. The syntactic sugar is provided only by the quotelike forms. First there is the standard "triple quote" form:

    s/pattern/replacement/

Only non-bracket characters may be used for the "triple quote". The right side is always evaluated as if it were a double-quoted string regardless of the quote chosen.

As with Perl 5, a bracketing form is also supported, but unlike Perl 5, Perl 6 uses the brackets only around the pattern. The replacement is then specified as if it were an ordinary item assignment, with ordinary quoting rules. To pick your own quotes on the right just use one of the q forms. The substitution above is equivalent to:

    s[pattern] = "replacement"

or

    s[pattern] = qq[replacement]

This is not a normal assignment, since the right side is evaluated each time the substitution matches (much like the pseudo-assignment to declarators can happen at strange times). It is therefore treated as a "thunk", that is, as if it has implicit curlies around it. In fact, it makes no sense at all to say

    s[pattern] = { doit }

because that would try to substitute a closure into the string.

Any scalar assignment operator may be used; the substitution macro knows how to turn

    $target ~~ s:g[pattern] op= expr

into something like:

    $target.subst(rx:g[pattern], { $() op expr })

So, for example, you can multiply every dollar amount by 2 with:

    s:g[\$ <( \d+ )>] *= 2

(Of course, the optimizer is free to do something faster than an actual method call.)

You'll note from the last example that substitutions only happen on the "official" string result of the match, that is, the $() value. (Here we captured $() using the <(...)> pair; otherwise we would have had to use lookbehind to match the $.)

Positional matching, fixed width types

Matching against non-strings