Synopsis 5: Regexes and Rules
Damian Conway <damian@conway.org>
Allison Randal <al@shadowed.net>
Patrick Michaud <pmichaud@pobox.com>
Larry Wall <larry@wall.org>
Created: 24 Jun 2002
Last Modified: 30 May 2009
Version: 99
This document summarizes Apocalypse 5, which is about the new regex syntax. We now try to call them regex rather than "regular expressions" because they haven't been regular expressions for a long time, and we think the popular term "regex" is in the process of becoming a technical term with a precise meaning of: "something you do pattern matching with, kinda like a regular expression". On the other hand, one of the purposes of the redesign is to make portions of our patterns more amenable to analysis under traditional regular expression and parser semantics, and that involves making careful distinctions between which parts of our patterns and grammars are to be treated as declarative, and which parts as procedural.
In any case, when referring to recursive patterns within a grammar, the terms rule and token are generally preferred over regex.
The underlying match object is now available via the $/ variable, which is implicitly lexically scoped. All user access to the most recent match is through this variable, even when it doesn't look like it. The individual capture variables (such as $0, $1, etc.) are just elements of $/.
By the way, unlike in Perl 5, the numbered capture variables now start at $0 instead of $1. See below.
The following regex features use the same syntax as in Perl 5:
While the syntax of | does not change, the default semantics do change slightly. We are attempting to concoct a pleasing mixture of declarative and procedural matching so that we can have the best of both. In short, you need not write your own tokener for a grammar because Perl will write one for you. See the section below on "Longest-token matching".
Unlike traditional regular expressions, Perl 6 does not require you to memorize an arbitrary list of metacharacters. Instead it classifies characters by a simple rule. All glyphs (graphemes) whose base characters are either the underscore (_) or have a Unicode classification beginning with 'L' (i.e. letters) or 'N' (i.e. numbers) are always literal (i.e. self-matching) in regexes. They must be escaped with a \ to make them metasyntactic (in which case that single alphanumeric character is itself metasyntactic, but any immediately following alphanumeric character is not).
All other glyphs--including whitespace--are exactly the opposite: they are always considered metasyntactic (i.e. non-self-matching) and must be escaped or quoted to make them literal. As is traditional, they may be individually escaped with \, but in Perl 6 they may be also quoted as follows.
Sequences of one or more glyphs of either type (i.e. any glyphs at all) may be made literal by placing them inside single quotes. (Double quotes are also allowed, with the same interpolative semantics as the current language in which the regex is lexically embedded.) Quotes create a quantifiable atom, so while
moose*
quantifies only the 'e' and matches "mooseee", saying
'moose'*
quantifies the whole string and would match "moosemoose".
Here is a table that summarizes the distinctions:
                 Alphanumerics      Non-alphanumerics        Mixed
Literal glyphs   a    1    _        \*   \$   \.   \\   \'   K\-9\!
Metasyntax       \a   \1   \_       *    $    .    \    '    \K-\9!
Quoted glyphs    'a'  '1'  '_'      '*'  '$'  '.'  '\\' '\'' 'K-9!'
In other words, identifier glyphs are literal (or metasyntactic when escaped), non-identifier glyphs are metasyntactic (or literal when escaped), and single quotes make everything inside them literal.
Note, however, that not all non-identifier glyphs are currently meaningful as metasyntax in Perl 6 regexes (e.g. \1 \_ - !). It is more accurate to say that all unescaped non-identifier glyphs are potential metasyntax, and reserved for future use. If you use such a sequence, a helpful compile-time error is issued indicating that you either need to quote the sequence or define a new operator to recognize it.
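The classification rule above can be modelled outside Perl 6. The following is a minimal Python sketch, not part of any Perl 6 implementation; the function name is_literal_glyph is hypothetical, and the sketch assumes that the "base character" of a glyph is the first code point of its NFD decomposition.

```python
import unicodedata

def is_literal_glyph(glyph: str) -> bool:
    """Return True if the glyph would be self-matching under the rule above."""
    # Assumption: the base character is the first code point after NFD.
    base = unicodedata.normalize("NFD", glyph)[0]
    if base == "_":
        return True
    # General categories starting with 'L' are letters, 'N' are numbers.
    return unicodedata.category(base)[0] in ("L", "N")

for glyph in ["a", "1", "_", "é", "*", "$", " ", "-"]:
    kind = "literal" if is_literal_glyph(glyph) else "metasyntactic"
    print(repr(glyph), kind)
```

Everything the function rejects, whitespace included, falls into the "potential metasyntax" group described above.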
Extended syntax (/x) is no longer required...it's the default. (In fact, it's pretty much mandatory--the only way to get back to the old syntax is with the :Perl5/:P5 modifier.)
There are no /s or /m modifiers (changes to the meta-characters replace them - see below).
There is no /e evaluation modifier on substitutions; instead use:
s/pattern/{ doit() }/
or:
s[pattern] = doit()
Instead of /ee say:
s/pattern/{ eval doit() }/
or:
s[pattern] = eval doit()
Modifiers are now placed as adverbs at the start of a match or substitution:
m:g:i/\s* (\w*) \s* ,?/;
Every modifier must start with its own colon. The delimiter must be separated from the final modifier by whitespace if it would otherwise be taken as an argument to the preceding modifier (which is true if and only if the next character is a left parenthesis.)
The single-character modifiers also have longer versions:
:i    :ignorecase
:a    :ignoreaccent
:g    :global
The :i (or :ignorecase) modifier causes case distinctions to be ignored in its lexical scope, but not in its dynamic scope. That is, subrules always use their own case settings.
The :ii (or :samecase) variant may be used on a substitution to change the substituted string to the same case pattern as the matched string.
If the pattern is matched without the :sigspace modifier, case info is carried across on a character by character basis. If the right string is longer than the left one, the case of the final character is replicated. Titlecase is carried across if possible regardless of whether the resulting letter is at the beginning of a word or not; if there is no titlecase character available, the corresponding uppercase character is used. (This policy can be modified within a lexical scope by a language-dependent Unicode declaration to substitute titlecase according to the orthographic rules of the specified language.) Characters that carry no case information leave their corresponding replacement character unchanged.
If the pattern is matched with :sigspace, then a slightly smarter algorithm is used which attempts to determine if there is a uniform capitalization policy over each matched word, and applies the same policy to each replacement word. If there doesn't seem to be a uniform policy on the left, the policy for each word is carried over word by word, with the last pattern word replicated if necessary. If a word does not appear to have a recognizable policy, the replacement word is translated character for character as in the non-sigspace case. Recognized policies include:
lc()
uc()
ucfirst(lc())
lcfirst(uc())
capitalize()
In any case, only the officially matched string part of the pattern match counts, so any sort of lookahead or contextual matching is not included in the analysis.
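The character-by-character policy used without :sigspace can be modelled roughly as follows. This is a Python sketch under stated assumptions, not the specified algorithm: the function name samecase is hypothetical, titlecase handling is left out, and the word-by-word policies used under :sigspace are not attempted.

```python
def samecase(matched: str, replacement: str) -> str:
    """Copy the case pattern of `matched` onto `replacement`, char by char."""
    out = []
    for i, ch in enumerate(replacement):
        # Past the end of the matched text, reuse the case of its final character.
        src = matched[min(i, len(matched) - 1)] if matched else ""
        if src.isupper():
            out.append(ch.upper())
        elif src.islower():
            out.append(ch.lower())
        else:
            # No case information: leave the replacement character unchanged.
            out.append(ch)
    return "".join(out)

print(samecase("Hello", "wonderful"))  # case of the final 'o' is replicated
print(samecase("a-B", "xyz"))          # '-' carries no case information
```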
The :a (or :ignoreaccent) modifier scopes exactly like :ignorecase except that it ignores accents instead of case. It is equivalent to taking each grapheme (in both target and pattern), converting both to NFD (maximally decomposed) and then comparing the two base characters (Unicode non-mark characters) while ignoring any trailing mark characters. The mark characters are ignored only for the purpose of determining the truth of the assertion; the actual text matched includes all ignored characters, including any that follow the final base character.
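The comparison described above (decompose to NFD, then compare base characters while ignoring marks) can be sketched in Python. The helper names strip_marks and same_ignoring_accents are hypothetical; the sketch only decides whether two strings are equal under the rule and does not reproduce how the matched text keeps its ignored marks.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Decompose to NFD and drop mark characters (categories Mn, Mc, Me)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        c for c in decomposed if not unicodedata.category(c).startswith("M")
    )

def same_ignoring_accents(a: str, b: str) -> bool:
    """Compare two strings by their base characters only."""
    return strip_marks(a) == strip_marks(b)

print(same_ignoring_accents("résumé", "resume"))
```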
The :aa (or :sameaccent) variant may be used on a substitution to change the substituted string to the same accent pattern as the matched string. Accent info is carried across on a character by character basis. If the right string is longer than the left one, the remaining characters are substituted without any modification. (Note that NFD/NFC distinctions are usually immaterial, since Perl encapsulates that in grapheme mode.) Under :sigspace the preceding rules are applied word by word.
The :c (or :continue) modifier causes the pattern to continue scanning from the specified position (defaulting to $/.to):
m:c($p)/ pattern / # start scanning at position $p
Note that this does not automatically anchor the pattern to the starting location. (Use :p for that.) The pattern you supply to split has an implicit :c modifier.
String positions are of type StrPos and should generally be treated as opaque.
The :p (or :pos) modifier causes the pattern to try to match only at the specified string position:
m:pos($p)/ pattern / # match at position $p
If the argument is omitted, it defaults to $/.to. (Unlike in Perl 5, the string itself has no clue where its last match ended.) All subrule matches are implicitly passed their starting position. Likewise, the pattern you supply to a Perl macro's is parsed trait has an implicit :p modifier.
Note that
m:c($p)/pattern/
is roughly equivalent to
m:p($p)/.*? <( pattern )> /
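For readers who know Python's re module, the distinction resembles the one between Pattern.search and Pattern.match when both are given a start position. This is only an analogy: Python positions are plain integers, whereas the StrPos values here are meant to be opaque.

```python
import re

pattern = re.compile(r"\d+")
text = "ab12cd34"

# Like :c($p): start scanning at position 4; the match may begin later.
scanned = pattern.search(text, 4)

# Like :p($p): the match must begin exactly at position 4.
anchored = pattern.match(text, 4)

print(scanned.group(0) if scanned else None)
print(anchored.group(0) if anchored else None)
```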
The :s (:sigspace) modifier causes whitespace sequences to be considered "significant"; they are replaced by a whitespace matching rule, <.ws>. That is,
m:s/ next cmd '=' <condition>/
is the same as:
m/ <.ws> next <.ws> cmd <.ws> '=' <.ws> <condition>/
which is effectively the same as:
m/ \s* next \s+ cmd \s* '=' \s* <condition>/
But in the case of
m:s{(a|\*) (b|\+)}
or equivalently,
m { (a|\*) <.ws> (b|\+) }
<.ws> can't decide what to do until it sees the data. It still does the right thing. If not, define your own ws and :sigspace will use that.
In general you don't need to use :sigspace within grammars because the parser rules automatically handle whitespace policy for you. In this context, whitespace often includes comments, depending on how the grammar chooses to define its whitespace rule. Although the default <.ws> subrule recognizes no comment construct, any grammar is free to override the rule. The <.ws> rule is not intended to mean the same thing everywhere.
It's also possible to pass an argument to :sigspace specifying a completely different subrule to apply. This can be any rule; it doesn't have to match whitespace. When discussing this modifier, it is important to distinguish the significant whitespace in the pattern from the "whitespace" being matched, so we'll call the pattern's whitespace sigspace, and generally reserve whitespace to indicate whatever <.ws> matches in the current grammar. The correspondence between sigspace and whitespace is primarily metaphorical, which is why the correspondence is both useful and (potentially) confusing.
The :ss (or :samespace) variant may be used on substitutions to do smart space mapping. For each sigspace-induced call to <ws> on the left, the matched whitespace is copied over to the corresponding slot on the right, as represented by a single whitespace character in the replacement string wherever space replacement is desired. If there are more whitespace slots on the right than the left, those righthand characters remain themselves. If there are not enough whitespace slots on the right to map all the available whitespace slots from the match, the algorithm tries to minimize information loss by randomly splicing "common" whitespace characters out of the list of whitespace. From least valuable to most, the pecking order is:
spaces
tabs
all other horizontal whitespace, including Unicode
newlines (including crlf as a unit)
all other vertical whitespace, including Unicode
The primary intent of these rules is to minimize format disruption when substitution happens across line boundaries and such. There is, of course, no guarantee that the result will be exactly what a human would do.
The :s modifier is considered sufficiently important that match variants are defined for it:
mm/match some words/ # same as m:sigspace
ss/match some words/replace those words/ # same as s:samespace
Note that ss/// is defined in terms of :ss, so:
$_ = "a b\nc\td";
ss/b c d/x y z/;
ends up with a value of "a x\ny\tz".
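The slot-by-slot copying in that example can be modelled with a small Python sketch. It assumes \s+ as a stand-in for <.ws> and only handles the case where both sides have the same number of words; the splicing rules for mismatched slot counts are not attempted. The function name samespace_sub is hypothetical.

```python
import re

def samespace_sub(text: str, pattern_words: list[str],
                  replacement_words: list[str]) -> str:
    """Replace the first match, copying each matched whitespace slot across."""
    if len(pattern_words) != len(replacement_words):
        raise ValueError("this sketch only handles equal numbers of words")
    # Capture the whitespace matched in each slot between pattern words.
    regex = re.compile(r"(\s+)".join(re.escape(w) for w in pattern_words))

    def rebuild(m: re.Match) -> str:
        pieces = [replacement_words[0]]
        for gap, word in zip(m.groups(), replacement_words[1:]):
            pieces.append(gap)   # whitespace copied from the matched text
            pieces.append(word)
        return "".join(pieces)

    return regex.sub(rebuild, text, count=1)

print(repr(samespace_sub("a b\nc\td", ["b", "c", "d"], ["x", "y", "z"])))
```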
New modifiers specify the Unicode level:
m:bytes / .**2 / # match two bytes
m:codes / .**2 / # match two codepoints
m:graphs / .**2 / # match two language-independent graphemes
m:chars / .**2 / # match two characters at current max level
There are corresponding pragmas to default to these levels. Note that the :chars modifier is always redundant because dot always matches characters at the highest level allowed in scope. This highest level may be identical to one of the other three levels, or it may be more specific than :graphs when a particular language's character rules are in use. Note that you may not specify language-dependent character processing without specifying which language you're depending on. [Conjecture: the :chars modifier could take an argument specifying which language's rules to use for this match.]
The :Perl5/:P5 modifier allows Perl 5 regex syntax to be used instead. (It does not go so far as to allow you to put your modifiers at the end.) For instance,
m:P5/(?mi)^(?:[a-z]|\d){1,2}(?=\s)/
is equivalent to the Perl 6 syntax:
m/ :i ^^ [ <[a..z]> || \d ] ** 1..2 <?before \s> /
If an integer modifier is followed by an x, it means repetition. Use :x(4) for the general form. So
s:4x [ (<.ident>) '=' (\N+) $$] = "$0 => $1";
is the same as:
s:x(4) [ (<.ident>) '=' (\N+) $$] = "$0 => $1";
which is almost the same as:
s:c[ (<.ident>) '=' (\N+) $$] = "$0 => $1" for 1..4;
except that the string is unchanged unless all four matches are found. However, ranges are allowed, so you can say :x(1..4) to change anywhere from one to four matches.
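The all-or-nothing behaviour for an exact count can be sketched in Python with re.subn, which reports how many replacements were made. The function name sub_exactly is hypothetical, and ranges such as :x(1..4) are not covered.

```python
import re

def sub_exactly(pattern: str, repl: str, text: str, times: int) -> str:
    """Replace the first `times` matches, or return `text` unchanged.

    Assumes times >= 1 (re.subn treats a count of 0 as "no limit").
    """
    result, made = re.subn(pattern, repl, text, count=times)
    return result if made == times else text

print(sub_exactly(r"\d", "#", "a1b2c3", 2))
print(sub_exactly(r"\d", "#", "a1b2c3", 4))  # only three digits: unchanged
```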
If an integer modifier is followed by st, nd, rd, or th, it means find the Nth occurrence. Use :nth(3) for the general form. So
s:3rd/(\d+)/@data[$0]/;
is the same as
s:nth(3)/(\d+)/@data[$0]/;
which is the same as:
m/(\d+)/ && m:c/(\d+)/ && s:c/(\d+)/@data[$0]/;
Lists and junctions are allowed: :nth(1|2|3|5|8|13|21|34|55|89).
So are closures: :nth({.is_fibonacci})
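Selecting occurrences by ordinal, whether by a single number, a set of numbers, or a predicate, can be modelled in Python with a replacement callback that counts matches. The function name sub_nth and its `wanted` parameter are hypothetical; a predicate stands in for both the junction and the closure forms.

```python
import re
from typing import Callable

def sub_nth(pattern: str, repl: str, text: str,
            wanted: Callable[[int], bool]) -> str:
    """Replace only the matches whose 1-based ordinal satisfies `wanted`."""
    seen = 0

    def choose(m: re.Match) -> str:
        nonlocal seen
        seen += 1
        # Keep the original text for occurrences that were not selected.
        return repl if wanted(seen) else m.group(0)

    return re.sub(pattern, choose, text)

print(sub_nth(r"\d+", "X", "a1 b2 c3 d4", lambda n: n == 3))
print(sub_nth(r"\d+", "X", "a1 b2 c3 d4", lambda n: n in {1, 2, 3, 5, 8}))
```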
With the :ov (:overlap) modifier, the current regex will match at all possible character positions (including overlapping) and return all matches in list context, or a disjunction of matches in item context. The first match at any position is returned. The matches are guaranteed to be returned in left-to-right order with respect to the starting positions.
$str = "abracadabra";
if $str ~~ m:overlap/ a (.*) a / {
    @substrings = @@(); # bracadabr cadabr dabr br
}
With the :ex (:exhaustive) modifier, the current regex will match every possible way (including overlapping) and return all matches in a list context, or a disjunction of matches in item context. The matches are guaranteed to be returned in left-to-right order with respect to the starting positions. The order within each starting position is not guaranteed and may depend on the nature of both the pattern and the matching engine. (Conjecture: or we could enforce backtracking engine semantics. Or we could guarantee no order at all unless the pattern starts with "::" or some such to suppress DFAish solutions.)
$str = "abracadabra";
if $str ~~ m:exhaustive/ a (.*?) a / {
    say "@()"; # br brac bracad bracadabr c cad cadabr d dabr br
}
Note that the ~~ above can return as soon as the first match is found, and the rest of the matches may be performed lazily by @().
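Both result lists for "abracadabra" can be reproduced with a Python model. The sketch is eager rather than lazy, the function names are hypothetical, and the exhaustive version enumerates one match per (start, end) span, which is enough for this pattern but is an assumption about what "every possible way" means in general. Within one starting position it happens to yield shorter spans first, an order the text above does not guarantee.

```python
import re

text = "abracadabra"

def overlapping(pattern: re.Pattern, s: str) -> list[str]:
    """First match at each starting position, in left-to-right order."""
    found = []
    for start in range(len(s) + 1):
        m = pattern.match(s, start)
        if m:
            found.append(m.group(1))
    return found

def exhaustive(pattern: re.Pattern, s: str) -> list[str]:
    """One match for every (start, end) span the pattern can cover."""
    found = []
    for start in range(len(s) + 1):
        for end in range(start, len(s) + 1):
            m = pattern.fullmatch(s, start, end)
            if m:
                found.append(m.group(1))
    return found

print(" ".join(overlapping(re.compile(r"a(.*)a", re.S), text)))
print(" ".join(exhaustive(re.compile(r"a(.*?)a", re.S), text)))
```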
The :rw modifier causes this regex to claim the current string for modification rather than assuming copy-on-write semantics. All the captures in $/ become lvalues into the string, such that if you modify, say, $1, the original string is modified in that location, and the positions of all the other fields modified accordingly (whatever that means). In the absence of this modifier (especially if it isn't implemented yet, or is never implemented), all pieces of $/ are considered copy-on-write, if not read-only.
[Conjecture: this should really associate a pattern with a string variable, not a (presumably immutable) string value.]
The :keepall modifier causes this regex and all invoked subrules to remember everything, even if the rules themselves don't ask for their subrules to be remembered. This is for forcing a grammar that throws away whitespace and comments to keep them instead.
The :ratchet modifier causes this regex to not backtrack by default. (Generally you do not use this modifier directly, since it's implied by token and rule declarations.) The effect of this modifier is to imply a : after every construct that could backtrack, including bare *, +, and ? quantifiers, as well as alternations. (Note: for portions of patterns subject to longest-token analysis, a : is ignored in any case, since there will be no backtracking necessary.)
The :panic modifier causes this regex and all invoked subrules to try to backtrack on any rules that would otherwise default to not backtracking because they have :ratchet set. Never panic unless you're desperate and want the pattern matcher to do a lot of unnecessary work. If you have an error in your grammar, it's almost certainly a bad idea to fix it by backtracking.
The :i, :s, :Perl5, and Unicode-level modifiers can be placed inside the regex (and are lexically scoped):
m/:s alignment '=' [:i left|right|cent[er|re]] /
As with modifiers outside, only parentheses are recognized as valid brackets for args to the adverb. In particular:
m/:foo[xxx]/ Parses as :foo [xxx]
m/:foo{xxx}/ Parses as :foo {xxx}
m/:foo<xxx>/ Parses as :foo <xxx>
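A modifier (here one named :fuzzy) may be written with or without a parenthesized argument:
m:fuzzy/pattern/;
m:fuzzy('bare')/pattern/;
To use parentheses as the pattern delimiters, you must separate them from the modifier with whitespace:
m:fuzzy (pattern);
or you'll end up with:
m:fuzzy(fuzzyargs); pattern ;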
Dot . now matches any character including newline. (The /s modifier is gone.)
^ and $ now always match the start/end of a string, like the old \A and \z. (The /m modifier is gone.) On the right side of an embedded ~~ or !~~ operator they always match the start/end of the indicated submatch because that submatch is logically being treated as a separate string.
$ no longer matches an optional preceding \n so it's necessary to say \n?$ if that's what you mean.
\n now matches a logical (platform independent) newline not just \x0a.
The \A, \Z, and \z metacharacters are gone.
Because /x is the default:
An unescaped # now always introduces a comment. If followed by an opening bracket character (and if not in the first column), it introduces an embedded comment that terminates with the closing bracket. Otherwise the comment terminates at the newline.
Whitespace is now always metasyntactic, i.e. used only for layout and not matched literally (but see the :sigspace modifier described above).
^^ and $$ match line beginnings and endings. (The /m modifier is gone.) They are both zero-width assertions. $$ matches before any \n (logical newline), and also at the end of the string if the final character was not a \n. ^^ always matches the beginning of the string and after any \n that is not the final character in the string.
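The two rules can be restated as a Python sketch that lists the offsets where each assertion would succeed. It assumes a plain "\n" for the logical newline, and the function names are hypothetical.

```python
def line_start_positions(s: str) -> list[int]:
    """Offsets where ^^ would match."""
    positions = [0]  # the beginning of the string always matches
    for i, ch in enumerate(s):
        # After a newline, unless that newline is the final character.
        if ch == "\n" and i != len(s) - 1:
            positions.append(i + 1)
    return positions

def line_end_positions(s: str) -> list[int]:
    """Offsets where $$ would match."""
    # Immediately before every newline.
    positions = [i for i, ch in enumerate(s) if ch == "\n"]
    # Also the end of the string when the final character is not a newline.
    if not s.endswith("\n"):
        positions.append(len(s))
    return positions

print(line_start_positions("a\nb\n"), line_end_positions("a\nb\n"))
```

Note that a trailing newline adds neither an extra line start nor an extra line end.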
. matches any character, while \N matches any character except a newline. (The /s modifier is gone.) In particular, \N matches neither carriage return nor line feed.
The & metacharacter separates conjunctive terms. The patterns on either side must match with the same beginning and end point. Note: if you don't want your two terms to end at the same point, then you really want to use a lookahead instead.
As with the disjunctions | and ||, conjunctions come in both & and && forms. The & form is considered declarative rather than procedural; it allows the compiler and/or the run-time system to decide which parts to evaluate first, and it is erroneous to assume either order happens consistently. The && form guarantees left-to-right order, and backtracking makes the right argument vary faster than the left. In other words, && and || establish sequence points. The left side may be backtracked into when backtracking is allowed into the construct as a whole.
The & operator is list associative like |, but has slightly tighter precedence. Likewise && has slightly tighter precedence than ||. As with the normal junctional and short-circuit operators, & and | are both tighter than && and ||.
The ~~ and !~~ operators cause a submatch to be performed on whatever was matched by the variable or atom on the left. String anchors consider that submatch to be the entire string. So, for instance, you can ask to match any identifier that does not contain the word "moose":
<ident> !~~ 'moose'
In contrast
<ident> !~~ ^ 'moose' $
would allow any identifier (including any identifier containing "moose" as a substring) as long as the identifier as a whole is not equal to "moose". (Note the anchors, which attach the submatch to the beginning and end of the identifier as if that were the entire match.) When used as part of a longer match, for clarity it might be good to use extra brackets:
[ <ident> !~~ ^ 'moose' $ ]
The precedence of ~~ and !~~ fits in between the junctional and sequential versions of the logical operators just as it does in normal Perl expressions (see S03). Hence
<ident> !~~ 'moose' | 'squirrel'
parses as
<ident> !~~ [ 'moose' | 'squirrel' ]
while
<ident> !~~ 'moose' || 'squirrel'
parses as
[ <ident> !~~ 'moose' ] || 'squirrel'
The ~ operator is a helper for matching nested subrules with a specific terminator as the goal. It is designed to be placed between an opening and closing bracket, like so:
'(' ~ ')' <expression>
However, it mostly ignores the left argument, and operates on the next two atoms (which may be quantified). Its operation on those next two atoms is to "twiddle" them so that they are actually matched in reverse order. Hence the expression above, at first blush, is merely shorthand for:
'(' <expression> ')'
But beyond that, when it rewrites the atoms it also inserts the apparatus that will set up the inner expression to recognize the terminator, and to produce an appropriate error message if the inner expression does not terminate on the required closing atom. So it really does pay attention to the left bracket as well, and it actually rewrites our example to something more like:
$<OPEN> = '(' <SETGOAL: ')'> <expression> [ $GOAL || <FAILGOAL> ]
Note that you can use this construct to set up expectations for a closing construct even when there's no opening bracket:
<?> ~ ')' \d+
Here <?> returns true on the first null string.
By default the error message uses the name of the current rule as an indicator of the abstract goal of the parser at that point. However, often this is not terribly informative, especially when rules are named according to an internal scheme that will not make sense to the user. The :dba("doing business as") adverb may be used to set up a more informative name for what the following code is trying to parse:
token postfix:sym<[ ]> {
    :dba('array subscript')
    '[' ~ ']' <expression>
}
Then instead of getting a message like:
Unable to parse expression in postfix:sym<[ ]>; couldn't find final ']'
you'll get a message like:
Unable to parse expression in array subscript; couldn't find final ']'
(The :dba adverb may also be used to give names to alternations and alternatives, which helps the lexer give better error messages.)
(...) still delimits a capturing group. However the ordering of these groups is hierarchical rather than linear. See "Nested subpattern captures".
[...] is no longer a character class. It now delimits a non-capturing group.
{...} is no longer a repetition quantifier. It now delimits an embedded closure. It is always considered procedural rather than declarative; it establishes a sequence point between what comes before and what comes after. (To avoid this use the <?{...}> assertion syntax instead.)
/ (\S+) { print "string not blank\n"; $text = $0; }
\s+ { print "but does contain whitespace\n" }
/
An explicit reduction using the make function generates the abstract syntax tree object (abstract object or ast for short) for this match:
/ (\d) { make $0.sqrt } Remainder /;
This has the effect of capturing the square root of the numified string, instead of the string. The Remainder part is matched and returned as part of the Match object but is not returned as part of the abstract object. Since the abstract object usually represents the top node of an abstract syntax tree, the abstract object may be extracted from the Match object by use of the .ast method.
A second call to make overrides any previous call to make.
These closures are invoked with a topic ($_) of the current match state (a Cursor object). Within a closure, the instantaneous position within the search is denoted by the .pos method on that object. As with all string positions, you must not treat it as a number unless you are very careful about which units you are dealing with.
The Cursor object can also return the original item that we are matching against; this is available from the .orig method.
The closure is also guaranteed to start with a $/ Match object representing the match so far. However, if the closure does its own internal matching, its $/ variable will be rebound to the result of that match until the end of the embedded closure. (The match will actually continue with the current value of the $¢ object after the closure. $/ and $¢ just start out the same in your closure.)
A closure can also cause the match to fail by calling fail:
/ (\d+) { $0 < 256 or fail } /
Since closures establish a sequence point, they are guaranteed to be called at the canonical time even if the optimizer could prove that something after them can't match. (Anything before is fair game, however. In particular, a closure often serves as the terminator of a longest-token pattern.)
The general repetition quantifier is now ** for maximal matching, with a corresponding **? for minimal matching. (All such quantifier modifiers now go directly after the **.) Space is allowed on either side of the complete quantifier. This space is considered significant under :sigspace, and will be distributed as a call to <.ws> between all the elements of the match but not on either end.
The next token will determine what kind of repetition is desired:
If the next thing is an integer, then it is parsed either as an exact count or as a range:
. ** 42 # match exactly 42 times
<item> ** 3..* # match 3 or more times
This form is considered declarational.
If you supply a closure, it should return either an Int or a Range object.
'x' ** {$m} # exact count returned from closure
<foo> ** {$m..$n} # range returned from closure
/ value was (\d **? {1..6}) with ([ <alpha>\w* ]**{$m..$n}) /
It is illegal to return a list, so this easy mistake fails:
/ [foo] ** {1,3} /
The closure form is always considered procedural, so the item it is modifying is never considered part of the longest token.
If you supply any other atom (which may be quantified), it is interpreted as a separator (such as an infix operator), and the initial item is quantified by the number of times the separator is seen between items:
<alt> ** '|' # repetition controlled by presence of character
<addend> ** <addop> # repetition controlled by presence of subrule
<item> ** [ \!?'==' ] # repetition controlled by presence of operator
<file>**\h+ # repetition controlled by presence of whitespace
A successful match of such a quantifier always ends "in the middle", that is, after the initial item but before the next separator. Therefore
/ <ident> ** ',' /
can match
foo
foo,bar
foo,bar,baz
but never
foo,
foo,bar,
It is legal for the separator to be zero-width as long as the pattern on the left progresses on each iteration:
. ** <?same> # match sequence of identical characters
The separator never matches independently of the next item; if the separator matches but the next item fails, it backtracks all the way back through the separator. Likewise, this matching of the separator does not count as "progress" under :ratchet semantics unless the next item succeeds.
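In a conventional backtracking engine the separator form behaves like "item, then zero or more (separator, item) pairs". The Python sketch below builds such a pattern from two regex fragments; the helper name `separated` and the identifier fragment are assumptions made for illustration, and :ratchet semantics are not modelled.

```python
import re

def separated(item: str, separator: str) -> re.Pattern:
    """Build item (separator item)* from two regex fragments."""
    return re.compile(f"(?:{item})(?:(?:{separator})(?:{item}))*")

# Hypothetical stand-in for <ident>.
idents = separated(r"[A-Za-z_]\w*", ",")

for candidate in ["foo", "foo,bar", "foo,bar,baz", "foo,", "foo,bar,"]:
    print(candidate, bool(idents.fullmatch(candidate)))

# A match always ends "in the middle", before a dangling separator.
print(idents.match("foo,bar,").group(0))
```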
When significant space is used under :sigspace with the separator form, it applies on both sides of the separator, so
mm/<element> ** ','/
mm/<element>** ','/
mm/<element> **','/
all allow whitespace around the separator like this:
/ <element>[<.ws>','<.ws><element>]* /
while
mm/<element>**','/
excludes all significant whitespace:
/ <element>[','<element>]* /
Of course, you can always match whitespace explicitly if necessary, so to allow whitespace after the comma but not before, you can say:
/ <element>**[','\s*] /
<...> are now extensible metasyntax delimiters or assertions (i.e. they replace Perl 5's crufty (?...) syntax).
By default, a scalar variable in a regex is matched as a '...' literal (i.e. it does not treat the interpolated string as a subpattern). In other words, a Perl 6:
/ $var /
is like a Perl 5:
/ \Q$var\E /
However, if $var contains a Regex object, instead of attempting to convert it to a string, it is called as a subrule, as if you said <$var>. (See assertions below.) This form does not capture, and it fails if $var is tainted.
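The same two-way choice can be imitated in Python, where re.escape plays the role of \Q...\E. This is a sketch only: the function name interpolate is hypothetical, tainting is not modelled, and a compiled pattern is spliced in by its source text, so its flags are lost and any groups inside it still capture.

```python
import re

def interpolate(var) -> str:
    """Return a regex fragment for `var` following the rule above."""
    if isinstance(var, re.Pattern):
        # A compiled pattern is used as a subpattern.
        return f"(?:{var.pattern})"
    # Anything else is stringified and matched literally.
    return re.escape(str(var))

print(bool(re.fullmatch(interpolate("a.b"), "axb")))              # literal dot
print(bool(re.fullmatch(interpolate(re.compile(r"a.b")), "axb")))  # subpattern
```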
However, a variable used as the left side of an alias or submatch operator is not used for matching.
$x = <ident>
$0 ~~ <ident>
If you do want to match $0 again and then use that as the submatch, you can force the match using double quotes:
"$0" ~~ <ident>
On the other hand, it is nonsensical to alias to something that is not a variable:
"$0" = <ident> # ERROR
$0 = <ident> # okay
$x = <ident> # okay, temporary capture
$<x> = <ident> # okay, persistent capture
<x=ident> # same thing
Variables declared in capture aliases are lexically scoped to the rest of the regex. You should not confuse this use of = with either ordinary assignment or ordinary binding. You should read the = more like the pseudoassignment of a declarator than like normal assignment. It's more like the ordinary := operator, since at the level regexes work, strings are immutable, so captures are really just precomputed substr values. Nevertheless, when you eventually use the values independently, the substr may be copied, and then it's more like it was an assignment originally.
Capture variables of the form $<ident> may persist beyond the lexical scope; if the match succeeds they are remembered in the Match object's hash, with a key corresponding to the variable name's identifier. Likewise bound numeric variables persist as $0, etc.
The capture performed by = creates a new lexical variable if it does not already exist in the current lexical scope. To capture to an outer lexical variable you must supply an OUTER:: as part of the name, or perform the assignment from within a closure.
$x = [...] # capture to our own lexical $x
$OUTER::x = [...] # capture to existing lexical $x
[...] -> $tmp { let $x = $tmp } # capture to existing lexical $x
Note however that let (and temp) are not guaranteed to be thread safe on shared variables, so don't do that.
/ @cmds /
is matched as if it were an alternation of its elements. Ordinarily it matches using junctive semantics:
/ [ @cmds[0] | @cmds[1] | @cmds[2] | ... ] /
However, if it is a direct member of a || list, it uses sequential matching semantics, even if it's the only member of the list. Conveniently, you can put || before the first member of an alternation, hence
/ || @cmds /
is equivalent to
/ [ @cmds[0] || @cmds[1] || @cmds[2] || ... ] /
Of course, you can also write
/ | @cmds /
to be clear that you mean junctive semantics.
As with a scalar variable, each element is matched as a literal unless it happens to be a Regex object, in which case it is matched as a subrule. As with scalar subrules, a tainted subrule always fails. All string values pay attention to the current :ignorecase and :ignoreaccent settings, while Regex values use their own :ignorecase and :ignoreaccent settings.
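The difference between junctive and sequential array matching can be sketched as follows, assuming a current Rakudo compiler. The array contents are hypothetical, chosen so that every element matches at the same position.

```raku
my @cmds = <a ab abc>;    # hypothetical command names

# Bare array: junctive alternation, so the longest element wins.
my $junctive   = ~('abc' ~~ / @cmds /);

# Leading ||: sequential alternation, so the first matching element wins.
my $sequential = ~('abc' ~~ / || @cmds /);

# The same contrast with the alternatives written out by hand.
my $spelled-long  = ~('abc' ~~ / [ a | ab | abc ] /);
my $spelled-first = ~('abc' ~~ / [ a || ab || abc ] /);

say $junctive;      # abc
say $sequential;    # a
```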
When you get tired of writing:
token sigil { '$' | '@' | '@@' | '%' | '&' | '::' }
you can write:
token sigil { < $ @ @@ % & :: > }
as long as you're careful to put a space after the initial angle so that it won't be interpreted as a subrule. With the space it is parsed like angle quotes in ordinary Perl 6 and treated as a literal array value.
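A short sketch, assuming a current Rakudo compiler, suggesting that the two spellings behave alike. The token names and the input string are made up for the example.

```raku
# Two spellings of the same alternation; the names are illustrative.
my token sigil-quoted { '$' | '@' | '@@' | '%' | '&' | '::' }
my token sigil-listed { < $ @ @@ % & :: > }

# Both '@' and '@@' match here; the longer alternative is chosen.
my $from-quoted = ~('@@args' ~~ / <sigil-quoted> /);
my $from-listed = ~('@@args' ~~ / <sigil-listed> /);

say $from-quoted;   # @@
say $from-listed;   # @@
```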
<sym>, like this:
proto token sigil { }
multi token sigil:sym<$> { <sym> }
multi token sigil:sym<@> { <sym> }
multi token sigil:sym<@@> { <sym> }
multi token sigil:sym<%> { <sym> }
multi token sigil:sym<&> { <sym> }
multi token sigil:sym<::> { <sym> }
(The multi is optional and generally omitted with a grammar.)
This can be viewed as a form of multiple dispatch, except that it's based on longest-token matching rather than signature matching. The advantage of writing it this way is that it's easy to add additional rules to the same category in a derived grammar. All of them will be matched in parallel when you try to match /<sigil>/.
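The following is a sketch of extending such a category in a derived grammar, assuming a current Rakudo compiler. Note one assumption: Rakudo expects the proto body to be written `{*}`, where this synopsis shows an empty block. The grammar names and the TOP rule are invented for the example.

```raku
grammar Sigil {
    token TOP { <sigil> <ident> }
    proto token sigil { * }           # written {*} in current Rakudo
    token sigil:sym<$> { <sym> }
    token sigil:sym<@> { <sym> }
}

# A derived grammar adds candidates to the same category.
grammar MoreSigils is Sigil {
    token sigil:sym<@@> { <sym> }
    token sigil:sym<%>  { <sym> }
}

my $derived-sigil = ~MoreSigils.parse('@@list')<sigil>;
my $base-parses   = ?Sigil.parse('@@list');

say $derived-sigil;   # @@
say $base-parses;     # False
```

The base grammar fails on this input because its only candidate, `@`, leaves a second `@` where an identifier is required.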
If there are formal parameters on multi regex methods, matching still proceeds via longest-token rules first. If that results in a tie, a normal multiple dispatch is made using the arguments to the remaining variants, assuming they can be differentiated by type.
<...>)
Both < and > are metacharacters, and are usually (but not always) used in matched pairs. (Some combinations of metacharacters function as standalone tokens, and these may include angles. These are described below.) Most assertions are considered declarative; procedural assertions will be marked as exceptions.
For matched pairs, the first character after < determines the nature of the assertion:
< adam & eve > # equivalent to [ 'adam' | '&' | 'eve' ]
Note that the space before the ending > is optional and therefore < adam & eve> would be acceptable.
/ <sign>? <mantissa> <exponent>? /
The first character after the identifier determines the treatment of the rest of the text before the closing angle. The underlying semantics is that of a function or method call, so if the first character is a left parenthesis, it really is a call:
<foo('bar')>
If the first character after the identifier is an =, then the identifier is taken as an alias for what follows. In particular,
<foo=bar>
is just shorthand for
$<foo> = <bar>
If the first character after the identifier is whitespace, the subsequent text (following any whitespace) is passed as a regex, so:
<foo bar>
is more or less equivalent to
<foo(/bar/)>
To pass a regex with leading whitespace you must use the parenthesized form.
If the first character is a colon followed by whitespace, the rest of the text is taken as a list of arguments to the method, just as in ordinary Perl syntax. So these mean the same thing:
<foo('foo', $bar, 42)>
<foo: 'foo', $bar, 42>
No other characters are allowed after the initial identifier.
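The alias, parenthesized, and colon forms can be exercised together in a small grammar. This is only a sketch assuming a current Rakudo compiler; the grammar, its token names, and the input are hypothetical.

```raku
grammar CallForms {
    token TOP {
        <key=word>       # alias: the match is stored as $<key>
        '='
        <lit('foo')>     # parenthesized argument list
        <lit: 'bar'>     # colon form of the same kind of call
    }
    token word    { \w+ }
    token lit($s) { $s }     # the argument is matched as a literal
}

my $m    = CallForms.parse('k=foobar');
my $key  = ~$m<key>;
my $lits = $m<lit>.map(*.Str).join(',');

say $key;    # k
say $lits;   # foo,bar
```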
Subrule matches are considered declarative to the extent that the front of the subrule is itself considered declarative. If a subrule contains a sequence point, then so does the subrule match. Longest-token matching does not proceed past such a subrule, for instance.
. causes a named assertion not to capture what it matches (see "Subrule captures"). For example:
/ <ident> <ws> / # $/<ident> and $/<ws> both captured
/ <.ident> <ws> / # only $/<ws> captured
/ <.ident> <.ws> / # nothing captured
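The effect of the leading dot can be observed by listing the named captures of each match. This is a sketch assuming a current Rakudo compiler; the input string is arbitrary.

```raku
my $both = 'foo bar' ~~ / <ident> <ws> <ident> /;
my $one  = 'foo bar' ~~ / <.ident> <ws> <.ident> /;
my $none = 'foo bar' ~~ / <.ident> <.ws> <.ident> /;

# Collect the names of whatever each match captured.
my $both-keys = $both.hash.keys.sort.join(' ');
my $one-keys  = $one.hash.keys.sort.join(' ');
my $none-keys = $none.hash.keys.sort.join(' ');

say $both-keys;   # ident ws
say $one-keys;    # ws
say $none-keys;   # (empty line)
```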
The assertion is otherwise parsed identically to an assertion beginning with an identifier, provided the next thing after the dot is an identifier. As with the identifier form, any extra arguments pertaining to the matching engine are automatically supplied to the argument list.
If the dot is not followed by an identifier, it is parsed as a "dotty" postfix of some type, such as an indirect method call:
<.$indirect($depth, $binding, $fate, @args)>
In this case the object passed as the invocant is the current match state, and the method is expected to return a new match state object. The extra pattern matching arguments ($depth, $binding, and $fate) must be supplied explicitly.
The non-capturing behavior may be overridden with a :keepall.
$ indicates an indirect subrule. The variable must contain either a Regex object, or a string to be compiled as the regex. The string is never matched literally.
Such an assertion is not captured. (No assertion with leading punctuation is captured by default.) You may always capture it explicitly, of course.
A subrule is considered declarative to the extent that the front of it is declarative, and to the extent that the variable doesn't change. Prefix with a sequence point to defeat repeated static optimizations.
:: indicates a symbolic indirect subrule:
/ <::($somename)> /
The variable must contain the name of a subrule. By the rules of single method dispatch this is first searched for in the current grammar and its ancestors. If this search fails an attempt is made to dispatch via MMD, in which case it can find subrules defined as multis rather than methods. This form is not captured by default. It is always considered procedural, not declarative.
@ matches like a bare array except that each element is treated as a subrule (string or Regex object) rather than as a literal. That is, a string is forced to be compiled as a subrule instead of being matched literally. (There is no difference for a Regex object.)
This assertion is not automatically captured.
{ indicates code that produces a regex to be interpolated into the pattern at that point as a subrule:
/ (<.ident>) <{ %cache{$0} //= get_body_for($0) }> /
The closure is guaranteed to be run at the canonical time; it declares a sequence point, and is considered to be procedural.
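The `$`, `@`, and `{` forms above can be contrasted with bare interpolation in a brief sketch, assuming a current Rakudo compiler. The pattern strings are hypothetical.

```raku
my $src  = 'a.c';            # a string holding regex source
my @srcs = 'x.z', 'a.c';     # several such strings

# Bare variable: matched literally, so 'abc' does not match.
my $as-literal = ?('abc' ~~ / $src /);

# Angle forms: each string is compiled and matched as a subrule.
my $as-subrule = ?('abc' ~~ / <$src> /);
my $from-array = ?('abc' ~~ / <@srcs> /);

# Closure form: the block's result is interpolated as a regex.
my $from-code  = ?('abc' ~~ / <{ $src }> /);

say $as-literal;   # False
say $as-subrule;   # True
say $from-array;   # True
say $from-code;    # True
```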
& interpolates the return value of a subroutine call as a regex. Hence
<&foo()>
is short for
<{ foo() }>
This is considered procedural.
Regex object, it is not recompiled. If it is a string, the compiled form is cached with the string so that it is not recompiled next time you use it unless the string changes. (Any external lexical variable names must be rebound each time though.) Subrules may not be interpolated with unbalanced bracketing. An interpolated subrule keeps its own inner match results as a single item, so its parentheses never count toward the outer regex's groupings. (In other words, parenthesis numbering is always lexically scoped.)
?{ or !{ indicates a code assertion:
/ (\d**1..3) <?{ $0 < 256 }> /
/ (\d**1..3) <!{ $0 < 256 }> /
Similar to:
/ (\d**1..3) { $0 < 256 or fail } /
/ (\d**1..3) { $0 < 256 and fail } /
Unlike closures, code assertions are considered declarative; they are not guaranteed to be run at the canonical time if the optimizer can prove something later can't match. So you can sneak in a call to a non-canonical closure that way:
token { foo .* <?{ do { say "Got here!" } or 1 }> .* bar }
The do block is unlikely to run unless the string ends with "bar".
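A minimal sketch of positive and negative code assertions, assuming a current Rakudo compiler. The anchors and the variable names are additions for the example, so that backtracking to a shorter digit run cannot rescue a failed match.

```raku
# Anchored so that a shorter run of digits cannot satisfy the pattern.
my $octet     = rx/ ^ (\d ** 1..3) <?{ $0 < 256 }> $ /;
my $not-octet = rx/ ^ (\d ** 1..3) <!{ $0 < 256 }> $ /;

my $accepts-255 = ?('255' ~~ $octet);
my $accepts-256 = ?('256' ~~ $octet);
my $rejects-300 = ?('300' ~~ $not-octet);
my $rejects-25  = ?('25'  ~~ $not-octet);

say $accepts-255;   # True
say $accepts-256;   # False
say $rejects-300;   # True
say $rejects-25;    # False
```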
[ indicates an enumerated character class. Ranges in enumerated character classes are indicated with ".." rather than "-".
/ <[a..z_]>* /
Whitespace is ignored within square brackets:
/ <[ a..z _ ]>* /
- indicates a complemented character class:
/ <-[a..z_]> <-alpha> /
/ <- [a..z_]> <- alpha> / # whitespace allowed after -
This is essentially the same as using negative lookahead and dot:
/ <![a..z_]> . <!alpha> . /
Whitespace is ignored after the initial -.
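The enumerated and complemented classes can be tried out as below. This is a sketch assuming a current Rakudo compiler, with arbitrary sample strings.

```raku
# Enumerated class, with and without internal whitespace.
my $ident-run = ~('foo_bar-baz' ~~ / <[a..z_]>+ /);
my $spaced    = ~('foo_bar-baz' ~~ / <[ a..z _ ]>+ /);

# Complemented class, and the lookahead-plus-dot spelling of the same idea.
my $negated   = ~('foo_bar-baz' ~~ / <-[a..z_]> /);
my $lookahead = ~('foo_bar-baz' ~~ / <![a..z_]> . /);

# Complement of a named class.
my $non-alpha = ~('abc1' ~~ / <-alpha> /);

say $ident-run;   # foo_bar
say $spaced;      # foo_bar
say $negated;     # -
say $lookahead;   # -
say $non-alpha;   # 1
```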
+ may also be supplied to indicate that the following character class is to be matched in a positive sense.