This page was generated at 2009-07-03 01:01:28 GMT.
(syn r27372, pugs-smoke 19912)

TITLE

Synopsis 5: Regexes and Rules

AUTHORS

    Damian Conway <damian@conway.org>
    Allison Randal <al@shadowed.net>
    Patrick Michaud <pmichaud@pobox.com>
    Larry Wall <larry@wall.org>

VERSION

    Created: 24 Jun 2002
    Last Modified: 30 May 2009
    Version: 99

This document summarizes Apocalypse 5, which is about the new regex syntax. We now try to call them regex rather than "regular expressions" because they haven't been regular expressions for a long time, and we think the popular term "regex" is in the process of becoming a technical term with a precise meaning of: "something you do pattern matching with, kinda like a regular expression". On the other hand, one of the purposes of the redesign is to make portions of our patterns more amenable to analysis under traditional regular expression and parser semantics, and that involves making careful distinctions between which parts of our patterns and grammars are to be treated as declarative, and which parts as procedural.

In any case, when referring to recursive patterns within a grammar, the terms rule and token are generally preferred over regex.

New match result and capture variables

The underlying match object is now available via the $/ variable, which is implicitly lexically scoped. All user access to the most recent match is through this variable, even when it doesn't look like it. The individual capture variables (such as $0, $1, etc.) are just elements of $/.

By the way, unlike in Perl 5, the numbered capture variables now start at $0 instead of $1. See below.
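As a minimal sketch of the relationship between $/ and the numbered captures, assuming an implementation (such as a current Raku compiler) that follows this synopsis; the variable names and the sample string are purely illustrative:

```raku
# Assumption: illustrative names and data, not taken from the synopsis.
my $date = "2009-07-03";
my ($year, $same-year, $month);

if $date ~~ / (\d+) '-' (\d+) / {
    $year      = ~$0;      # numbered captures start at $0, not $1
    $same-year = ~$/[0];   # $0 is just the first positional element of $/
    $month     = ~$1;      # second capture group
}

say "$year $same-year $month";   # 2009 2009 07
```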

Unchanged syntactic features

The following regex features use the same syntax as in Perl 5:

While the syntax of | does not change, the default semantics do change slightly. We are attempting to concoct a pleasing mixture of declarative and procedural matching so that we can have the best of both. In short, you need not write your own tokener for a grammar because Perl will write one for you. See the section below on "Longest-token matching".
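The change in the default semantics of | can be sketched as follows, assuming an implementation that follows this synopsis. The || form used for contrast is the sequential alternation described elsewhere in the synopsis, and the variable names are hypothetical:

```raku
# Assumption: illustrative names; behaviour as described for longest-token matching.
my $ltm = ~("foobar" ~~ / foo | foobar /);    # declarative: the longest alternative wins
my $seq = ~("foobar" ~~ / foo || foobar /);   # procedural: the first alternative that matches wins

say "$ltm $seq";   # foobar foo
```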

Simplified lexical parsing of patterns

Unlike traditional regular expressions, Perl 6 does not require you to memorize an arbitrary list of metacharacters. Instead it classifies characters by a simple rule. All glyphs (graphemes) whose base characters are either the underscore (_) or have a Unicode classification beginning with 'L' (i.e. letters) or 'N' (i.e. numbers) are always literal (i.e. self-matching) in regexes. They must be escaped with a \ to make them metasyntactic (in which case that single alphanumeric character is itself metasyntactic, but any immediately following alphanumeric character is not).

All other glyphs--including whitespace--are exactly the opposite: they are always considered metasyntactic (i.e. non-self-matching) and must be escaped or quoted to make them literal. As is traditional, they may be individually escaped with \, but in Perl 6 they may also be quoted as follows.

Sequences of one or more glyphs of either type (i.e. any glyphs at all) may be made literal by placing them inside single quotes. (Double quotes are also allowed, with the same interpolative semantics as the current language in which the regex is lexically embedded.) Quotes create a quantifiable atom, so while

    moose*

quantifies only the 'e' and matches "mooseee", saying

    'moose'*

quantifies the whole string and would match "moosemoose".
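The two patterns above can be checked with a short sketch. The ^ and $ anchors are added here only to force a whole-string match, and the variable names are illustrative:

```raku
# Assumption: anchors and variable names are additions for the demonstration.
my $tail  = ("mooseee"    ~~ / ^ moose* $ /).so;     # * applies to the final e only
my $whole = ("moosemoose" ~~ / ^ 'moose'* $ /).so;   # * applies to the whole quoted atom
my $mixed = ("moosemoose" ~~ / ^ moose* $ /).so;     # fails: only the e may repeat

say ($tail, $whole, $mixed);   # (True True False)
```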

Here is a table that summarizes the distinctions:

                 Alphanumerics        Non-alphanumerics         Mixed
 Literal glyphs   a    1    _        \*  \$  \.   \\   \'       K\-9\!
 Metasyntax      \a   \1   \_         *   $   .    \    '      \K-\9!
 Quoted glyphs   'a'  '1'  '_'       '*' '$' '.' '\\' '\''     'K-9!'

In other words, identifier glyphs are literal (or metasyntactic when escaped), non-identifier glyphs are metasyntactic (or literal when escaped), and single quotes make everything inside them literal.
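A few rows of the table can be exercised directly. This is a sketch under the assumption that the implementation follows the classification above; the sample strings and the @checks name are illustrative:

```raku
# Assumption: sample strings chosen for illustration only.
my @checks =
    ("a_1"  ~~ / ^ a_1 $ /).so,        # identifier glyphs match themselves
    ("a*b"  ~~ / ^ a \* b $ /).so,     # an escaped non-alphanumeric is literal
    ("a*b"  ~~ / ^ a '*' b $ /).so,    # a quoted non-alphanumeric is literal
    ("K-9!" ~~ / ^ K\-9\! $ /).so,     # mixed: escape each non-alphanumeric
    ("K-9!" ~~ / ^ 'K-9!' $ /).so,     # mixed: quote the whole sequence
    ("a b"  ~~ / ^ a ' ' b $ /).so;    # whitespace must be quoted to match itself

say @checks;
```

Every entry is expected to be true; the unquoted spaces inside each pattern are metasyntactic and match nothing.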

Note, however, that not all non-identifier glyphs are currently meaningful as metasyntax in Perl 6 regexes (e.g. \1 \_ - !). It is more accurate to say that all unescaped non-identifier glyphs are potential metasyntax, and reserved for future use. If you use such a sequence, a helpful compile-time error is issued indicating that you either need to quote the sequence or define a new operator to recognize it.
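The compile-time check can be observed by compiling a pattern from a string. This sketch assumes a current Raku compiler, which rejects an unquoted - in a regex; the exact wording and type of the error are implementation details and are not inspected here:

```raku
use MONKEY-SEE-NO-EVAL;   # the patterns below are compiled from strings at run time

# Assumption: the source strings and variable names are illustrative.
my $bad  = '"a-b" ~~ / a-b /';       # unquoted - is reserved metasyntax
my $good = q{"a-b" ~~ / a '-' b /};  # quoting makes it literal

try EVAL $bad;
my $rejected = $!.defined;           # true if compilation raised an error

my $accepted = (EVAL $good).so;      # compiles and matches

say "$rejected $accepted";
```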

Modifiers

Changed metacharacters

New metacharacters

Bracket rationalization

Variable (non-)interpolation

Extensible metasyntax (<...>)

Both < and > are metacharacters, and are usually (but not always) used in matched pairs. (Some combinations of metacharacters function as standalone tokens, and these may include angles. These are described below.) Most assertions are considered declarative; procedural assertions will be marked as exceptions.

For matched pairs, the first character after < determines the nature of the assertion: