TITLE

Synopsis 5: Regexes and Rules

AUTHORS

    Damian Conway <damian@conway.org>
    Allison Randal <al@shadowed.net>
    Patrick Michaud <pmichaud@pobox.com>
    Larry Wall <larry@wall.org>

VERSION

    Created: 24 Jun 2002
    Last Modified: 11 Jun 2010
    Version: 125

This document summarizes Apocalypse 5, which is about the new regex syntax. We now try to call them regexes rather than "regular expressions" because they haven't been regular expressions for a long time, and we think the popular term "regex" is in the process of becoming a technical term with a precise meaning of: "something you do pattern matching with, kinda like a regular expression". On the other hand, one of the purposes of the redesign is to make portions of our patterns more amenable to analysis under traditional regular expression and parser semantics, and that involves making careful distinctions between which parts of our patterns and grammars are to be treated as declarative, and which parts as procedural.

In any case, when referring to recursive patterns within a grammar, the terms rule and token are generally preferred over regex.

Overview

In essence, Perl 6 natively implements Parsing Expression Grammars (PEGs) as an extension of regular expression notation. PEGs require that you provide a "pecking order" for ambiguous parses. Perl 6's pecking order is determined by a multi-level tie-breaking test:

    1) Longest token matching: food\s+ beats foo by 2 or more positions
    2) Longest literal prefix: food\w* beats foo\w* by 1 position
    3) Declaration from most-derived grammar beats less-derived
    4) Within a given compilation unit, earlier declaration wins
    5) Declaration with least number of 'uses' wins

Note that tiebreaker #5 can occur only when a grammar is monkey-patched from another compilation unit. Like #3, it privileges local declarations over distant ones.

In addition to this pecking order, if any rule chosen under the pecking order backtracks, the next best rule is chosen. That is, the pecking order determines a candidate list; just because one candidate is chosen does not mean the rest are thrown away. They may, however, be explicitly thrown away by an appropriate backtracking control (sometimes called a "cut" operator, but Perl 6 has several of them, depending on how much you want to cut).
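
For instance, longest-token matching (#1) picks the alternative with the longer token regardless of declaration order. Here is a minimal sketch (the grammar and rule names are illustrative only, not defined by this document):

    grammar Greeting {
        token TOP { <word> }
        proto token word {*}
        token word:sym<foo>  { 'foo' }    # declared first...
        token word:sym<food> { 'food' }   # ...but longer, so it wins under LTM
    }
    say Greeting.parse('food');   # matches via word:sym<food>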

New match result and capture variables

The underlying match object is now available via the $/ variable, which is implicitly lexically scoped. All user access to the most recent match is through this variable, even when it doesn't look like it. The individual capture variables (such as $0, $1, etc.) are just elements of $/.

By the way, unlike in Perl 5, the numbered capture variables now start at $0 instead of $1. See below.
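
For example:

    "abc123" ~~ / (\w+?) (\d+) /;
    say $/;    # the complete Match object
    say $0;    # the first capture -- numbering starts at $0
    say $1;    # the second capture; the same as $/[1]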

Unchanged syntactic features

The following regex features use the same syntax as in Perl 5:

    Capturing: (...)
    Repetition quantifiers: *, +, and ?
    Alternatives: |
    Backslash escape: \
    Minimal matching suffix: ??

While the syntax of | does not change, the default semantics do change slightly. We are attempting to concoct a pleasing mixture of declarative and procedural matching so that we can have the best of both. In short, you need not write your own tokener for a grammar because Perl will write one for you. See the section below on "Longest-token matching".

Simplified lexical parsing of patterns

Unlike traditional regular expressions, Perl 6 does not require you to memorize an arbitrary list of metacharacters. Instead it classifies characters by a simple rule. All glyphs (graphemes) whose base character is the underscore (_) or has a Unicode classification beginning with 'L' (i.e. letters) or 'N' (i.e. numbers) are always literal (i.e. self-matching) in regexes. They must be escaped with a \ to make them metasyntactic (in which case that single alphanumeric character is itself metasyntactic, but any immediately following alphanumeric character is not).

All other glyphs--including whitespace--are exactly the opposite: they are always considered metasyntactic (i.e. non-self-matching) and must be escaped or quoted to make them literal. As is traditional, they may be individually escaped with \, but in Perl 6 they may also be quoted as follows.

Sequences of one or more glyphs of either type (i.e. any glyphs at all) may be made literal by placing them inside single quotes. (Double quotes are also allowed, with the same interpolative semantics as the current language in which the regex is lexically embedded.) Quotes create a quantifiable atom, so while

    moose*

quantifies only the 'e' and matches "mooseee", saying

    'moose'*

quantifies the whole string and would match "moosemoose".
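
A quick demonstration:

    say so "mooseee"    ~~ / ^ moose*   $ /;   # True:  * quantifies only the 'e'
    say so "moosemoose" ~~ / ^ 'moose'* $ /;   # True:  * quantifies the quoted atom
    say so "moosemoose" ~~ / ^ moose*   $ /;   # False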

Here is a table that summarizes the distinctions:

                 Alphanumerics        Non-alphanumerics         Mixed
 Literal glyphs   a    1    _        \*  \$  \.   \\   \'       K\-9\!
 Metasyntax      \a   \1   \_         *   $   .    \    '      \K-\9!
 Quoted glyphs   'a'  '1'  '_'       '*' '$' '.' '\\' '\''     'K-9!'

In other words, identifier glyphs are literal (or metasyntactic when escaped), non-identifier glyphs are metasyntactic (or literal when escaped), and single quotes make everything inside them literal.

Note, however, that not all non-identifier glyphs are currently meaningful as metasyntax in Perl 6 regexes (e.g. \1 \_ - !). It is more accurate to say that all unescaped non-identifier glyphs are potential metasyntax, and reserved for future use. If you use such a sequence, a helpful compile-time error is issued indicating that you either need to quote the sequence or define a new operator to recognize it.

The semicolon character is specifically reserved as a non-meaningful metacharacter; if an unquoted semicolon is seen, the compiler will complain that the regex is missing its terminator.

Modifiers

Changed metacharacters

New metacharacters

Bracket rationalization

Variable (non-)interpolation

Extensible metasyntax (<...>)

Both < and > are metacharacters, and are usually (but not always) used in matched pairs. (Some combinations of metacharacters function as standalone tokens, and these may include angles. These are described below.) Most assertions are considered declarative; procedural assertions will be marked as exceptions.

For matched pairs, the first character after < determines the nature of the assertion:

The following tokens include angles but are not required to balance:

Predefined Subrules

These are the predefined subrules for any grammar or regex:
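
For example, assuming the standard set (which includes subrules such as <ident> and <alpha>):

    say "foo_bar baz" ~~ / <ident> /;    # matches 'foo_bar' via the predefined <ident>
    say "abc123"      ~~ / <alpha>+ /;   # matches 'abc' via the predefined <alpha>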

Backslash reform

Regexes constitute a first-class language, rather than just being strings

Backtracking control

Within those portions of a pattern that are considered procedural rather than declarative, you may control the backtracking behavior.
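
For example, appending : to a quantifier ratchets it so that it will not give back what it has matched (a small sketch):

    say so "aaab" ~~ / a+  ab /;   # True:  a+ backtracks from 'aaa' to 'aa'
    say so "aaab" ~~ / a+: ab /;   # False: the ratcheted a+ keeps all three 'a's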

Regex Routines, Named and Anonymous

Nothing is illegal

Longest-token matching

Instead of representing temporal alternation, | now represents logical alternation with declarative longest-token semantics. (You may now use || to indicate the old temporal alternation. That is, | and || now work within regex syntax much the same as they do outside of regex syntax, where they represent junctional and short-circuit OR. This includes the fact that | has tighter precedence than ||.)
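
The difference is easy to demonstrate:

    say "food" ~~ / foo  | food /;   # matches 'food': longest token wins
    say "food" ~~ / foo || food /;   # matches 'foo':  leftmost alternative wins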

Historically regex processing has proceeded in Perl via a backtracking NFA algorithm. This is quite powerful, but many parsers work more efficiently by processing rules in parallel rather than one after another, at least up to a point. If you look at something like a yacc grammar, you find a lot of pattern/action declarations where the patterns are considered in parallel, and eventually the grammar decides which action to fire off. While the default Perl view of parsing is essentially top-down (perhaps with a bottom-up "middle layer" to handle operator precedence), it is extremely useful for user understanding if at least the token processing proceeds deterministically. So for regex matching purposes we define token patterns as those patterns that can be matched without potential side effects or self-reference. (Since whitespace often has side effects at line transitions, it is usually excluded from such patterns, give or take a little lookahead.) Basically, Perl automatically derives a lexer from the grammar without you having to write one yourself.

To that end, every regex in Perl 6 is required to be able to distinguish its "pure" patterns from its actions, and return its list of initial token patterns (transitively including the token patterns of any subrule called by the "pure" part of that regex, but not including any subrule more than once, since that would involve self reference, which is not allowed in traditional regular expressions). A logical alternation using | then takes two or more of these lists and dispatches to the alternative that matches the longest token prefix. This may or may not be the alternative that comes first lexically.

However, if two alternatives match at the same length, the tie is broken first by specificity. The alternative that starts with the longest fixed string wins; that is, an exact match counts as closer than a match made using character classes. If that doesn't work, the tie is broken by one of two methods. If the alternatives are in different grammars, standard MRO (method resolution order) determines which one to try first. If the alternatives are in the same grammar file, the textually earlier alternative takes precedence. (If a grammar's rules are defined in more than one file, the order is undefined, and an explicit assertion must be used to force failure if the wrong one is tried first.)

This longest token prefix corresponds roughly to the notion of "token" in other parsing systems that use a lexer, but in the case of Perl this is largely an epiphenomenon derived automatically from the grammar definition. However, despite being automatically calculated, the set of tokens can be modified by the user; various constructs within a regex declaratively tell the grammar engine that it is finished with the pattern part and starting in on the side effects, so by inserting such constructs the user controls what is considered a token and what is not. The constructs deemed to terminate a token declaration and start the "action" part of the pattern include:

Subpatterns (captures) specifically do not terminate the token pattern, but may require a reparse of the token to find the location of the subpatterns. Likewise assertions may need to be checked out after the longest token is determined. (Alternately, if DFA semantics are simulated in any of various ways, such as by Thompson NFA, it may be possible to know when to fire off the assertions without backchecks.)

Greedy quantifiers and character classes do not terminate a token pattern. Zero-width assertions such as word boundaries are also okay.

Because such assertions can be part of the token, the lexer engine must be able to recover from the failure of such an assertion and backtrack to the next best token candidate, which might be the same length or shorter, but can never be longer than the current candidate.

For a pattern that starts with a positive lookahead assertion, the assertion is assumed to be more specific than the subsequent pattern, so the lookahead's pattern is treated as the longest token; the longest-token matcher will be smart enough to rematch any text traversed by the lookahead when (and if) it continues the match.

Oddly enough, the token keyword specifically does not determine the scope of a token, except insofar as a token pattern usually doesn't do much matching of whitespace. In contrast, the rule keyword (which assumes :sigspace) defines a pattern that tends to disqualify itself on the first whitespace. So most of the token patterns will end up coming from token declarations. For instance, a token declaration such as

    token list_composer { \[ <expr> \] }

considers its "longest token" to be just the left square bracket, because the first thing the expr rule will do is traverse optional whitespace. As an exception to this, and in order to promote readability, alternations inside rules are treated specially. If an alternation in a rule, or in any other context where :sigspace is active, has whitespace before a group of alternatives, then any leading whitespace on the alternatives is ignored. That is, rule { [ a | b ] } is treated as if it were rule { [a |b ] }, and the LTM match begins with the first non-sigspace atom.
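
As a brief sketch of how token and rule treat whitespace differently (the grammar here is illustrative only):

    grammar G {
        token t { foo bar }   # pattern whitespace not significant: matches 'foobar'
        rule  r { foo bar }   # :sigspace: pattern whitespace implies <.ws>
    }
    say so G.parse('foobar',  :rule<t>);   # True
    say so G.parse('foo bar', :rule<r>);   # True
    say so G.parse('foobar',  :rule<r>);   # False: <.ws> cannot match inside a word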

The initial token matcher must take into account case sensitivity (or any other canonicalization primitives) and do the right thing even when propagated up to rules that don't have the same canonicalization. That is, they must continue to represent the set of matches that the lower rule would match.

The || form has the old short-circuit semantics, and will not attempt to match its right side unless all possibilities (including all | possibilities) are exhausted on its left. The first || in a regex makes the token patterns on its left available to the outer longest-token matcher, but hides any subsequent tests from longest-token matching. Every || establishes a new longest-token matcher. That is, if you use | on the right side of ||, that right side establishes a new top level scope for longest-token processing for this subexpression and any called subrules. The right side's longest-token automaton is invisible to the left of the || or outside the regex containing the ||.

Return values from matches

Match objects

Subpattern captures

Accessing captured subpatterns

Nested subpattern captures

Quantified subpattern captures

Indirectly quantified subpattern captures

Subpattern numbering

Subrule captures

Accessing captured subrules

Repeated captures of the same subrule

Aliasing

Aliases can be named or numbered. They can be scalar-, array-, or hash-like. And they can be applied to either capturing or non-capturing constructs. The following sections highlight special features of the semantics of some of those combinations.
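
For instance, a named scalar alias binds the matched text to a name in the match object (a small sketch):

    "Name: Larry" ~~ / 'Name:' \s* $<who>=[\w+] /;
    say $<who>;    # 「Larry」 -- equivalent to $/<who>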

Named scalar aliasing to subpatterns

Named scalar aliases applied to non-capturing brackets

Named scalar aliasing to subrules

Numbered scalar aliasing

Scalar aliases applied to quantified constructs

Array aliasing

Hash aliasing

External aliasing

Capturing from repeated matches

Grammars

Syntactic categories

For writing your own backslash and assertion subrules, you may augment (your copy of) the Regex sublanguage, using the following syntactic categories:

    augment slang Regex {
        token backslash:sym<y> { ... }   # define your own \y and \Y
        token assertion:sym<*> { ... }   # define your own <*stuff>
        token metachar:sym<,> { ... }    # define a new metacharacter
        multi method tweak (:$x) {...}   # define your own :x modifier
    }

Pragmas

Various pragmas may be used to control aspects of regex compilation and usage not otherwise provided for. These are tied to the particular declarator in question:

    use s :foo;         # control s defaults
    use m :foo;         # control m defaults
    use rx :foo;        # control rx defaults
    use regex :foo;     # control regex defaults
    use token :foo;     # control token defaults
    use rule :foo;      # control rule defaults

(It is a general policy in Perl 6 that any pragma designed to influence the surface behavior of a keyword is identical to the keyword itself, unless there is good reason to do otherwise. On the other hand, pragmas designed to influence deep semantics should not be named identically, though of course some similarity is good.)