Perly Parsers:
Perl-byacc
Parse::Yapp
Parse::RecDescent
Regexp::Grammars
Steven Lembark
Workhorse Computing
lembark@wrkhors.com
Grammars are the guts of compilers
● Compilers convert text from one form to another.
– C compilers convert C source to CPU-specific assembly.
– Databases compile SQL into RDBMS op's.
● Grammars define structure, precedence, valid inputs.
– Realistic ones are often recursive or context-sensitive.
– The complexity in defining grammars led to a variety of tools for defining
them.
– The standard format for a long time has been “BNF”, which is the input to
YACC.
● They are wasted on 'flat text'.
– If “split /\t/” does the job, skip grammars entirely.
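For flat text the whole “parser” is one line. A minimal sketch, assuming tab-delimited records:
while( my $line = readline )
{
    chomp $line;

    my @fieldz = split /\t/, $line;

    # process the fields...
}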
The first Yet Another: YACC
● Yet Another Compiler Compiler
– YACC takes in a standard-format grammar structure.
– It processes tokens and their values, organizing the
results according to the grammar into a structure.
● Between the source and YACC is a tokenizer.
– This parses the inputs into individual tokens defined by
the grammar.
– It doesn't know about structure, only breaking the text
stream up into tokens.
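In Perl the tokenizer is usually a hand-rolled sub that chews one token at a time off a buffer. A hypothetical sketch (the sub name and token names are assumptions, matching the calculator grammar shown later):
# Return one ( token => value ) pair per call, '' at end of input.
sub next_token
{
    for( $_[0] )    # alias the caller's input buffer
    {
        s{^[ \t]+}{};

        return ( NUM => $1 ) if s{^(\d+)}{};
        return ( VAR => $1 ) if s{^([A-Za-z]\w*)}{};
        return ( $1  => $1 ) if s{^([-+*/^=\n])}{};

        return ( '' => undef );
    }
}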
Parsing is a pain in the lex
● The real pain is gluing the parser and tokenizer
together.
– Tokenizers deal in the language of patterns.
– Grammars are defined in terms of structure.
● Passing data between them makes for most of the
difficulty.
– One issue is the global yylex call, which makes having
multiple parsers difficult.
– Context-sensitive grammars with multiple sub-
grammars are painful.
The perly way
● Regexen, logic, glue... hmm... been there before.
– The first approach most of us try is lexing with regexen.
– Then add captures and if-blocks or execute (?{code})
blocks inside of each regex.
● The problem is that the grammar is embedded in
your code structure.
– You have to modify the code structure to change the
grammar or its tokens.
– Hubris, maybe, but Truly Lazy it ain't.
– Which was the whole reason for developing standard
grammars & their handlers in the first place.
Early Perl Grammar Modules
● These take in a YACC grammar and spit out
compiler code.
● Intentionally looked like YACC:
– Able to re-cycle existing YACC grammar files.
– Benefit from using Perl as a built-in lexer.
– Perl-byacc & Parse::Yapp.
● Good: Recycles knowledge for YACC users.
● Bad: Still not lazy: The grammars are difficult to
maintain and you still have to plug in post-
processing code to deal with the results.
%right '='
%left '-' '+'
%left '*' '/'
%left NEG
%right '^'
%%
input: #empty
| input line { push(@{$_[1]},$_[2]); $_[1] }
;
line: '\n' { $_[1] }
| exp '\n' { print "$_[1]\n" }
| error '\n' { $_[0]->YYErrok }
;
exp: NUM
| VAR { $_[0]->YYData->{VARS}{$_[1]} }
| VAR '=' exp { $_[0]->YYData->{VARS}{$_[1]}=$_[3] }
| exp '+' exp { $_[1] + $_[3] }
| exp '-' exp { $_[1] - $_[3] }
| exp '*' exp { $_[1] * $_[3] }
Example: Parse::Yapp grammar
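Driving such a grammar takes one pass with yapp plus a small wrapper. A hedged sketch: the module name “Calc”, the file names, and the lexer are assumptions, but new and YYParse are the standard Parse::Yapp driver interface:
# yapp -m Calc calc.yp    # generates Calc.pm from the grammar
use Calc;

my $parser = Calc->new;

my $result = $parser->YYParse
(
    yylex   => \&next_token,    # hypothetical tokenizer
    yyerror => sub { warn 'Parse error' },
);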
The Swiss Army Chainsaw
● Parse::RecDescent extended the original BNF
syntax, combining the tokens & handlers.
● Grammars are largely declarative, using OO Perl to
do the heavy lifting.
– OO interface allows multiple, context-sensitive parsers.
– Rules with Perl blocks allow the code to do anything.
– Results can be acquired from a hash, an array, or $1.
– Left- and right-associative tags simplify messy situations.
Example P::RD
● This is part
of an infix
formula
compiler I
wrote.
● It compiles
equations to
a sequence
of closures.
add_op : '+' | '-' | '%' { $item[ 1 ] }
mult_op : '*' | '/' | '^' { $item[ 1 ] }
add : <leftop: mult add_op mult>
{
compile_binop @{ $item[1] }
}
mult : <leftop: factor mult_op factor>
{
compile_binop @{ $item[1] }
}
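Using it is a two-step affair: compile the grammar, then call the start rule as a method. A minimal sketch; the start rule “add”, $grammar, and $formula are assumptions based on the excerpt above:
use Parse::RecDescent;

my $parser = Parse::RecDescent->new( $grammar )
    or die 'Bad grammar';

# each rule becomes a method; undef means the parse failed.
defined( my $code = $parser->add( $formula ) )
    or die "Parse failed: $formula";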
Just enough rope to shoot yourself...
● The biggest problem: P::RD is sloooooooow.
● Learning curve is perl-ish: shallow and long.
– Unless you really know what all of it does you may not
be able to figure out the pieces.
– Lots of really good docs that most people never read.
● Perly blocks also made it look too much like a job-
dispatcher.
– People used it for a lot of things that are not compilers.
– Good & Bad thing: it really is a compiler.
R.I.P. P::RD
● Supposed to be replaced with Parse::FastDescent.
– Damian dropped work on P::FD for Perl6.
– His goal was to replace the shortcomings of P::RD with
something more complete, and quite a bit faster.
● The result is Perl6 Grammars.
– Declarative syntax extends matching with rules.
– Built into Perl6 as a structure, not an add-on.
– Much faster.
– Not available in Perl5.
Regexp::Grammars
● Perl5 implementation derived from Perl6.
– Back-porting an idea, not the Perl6 syntax.
– Much better performance than P::RD.
● Extends the v5.10 recursive matching syntax,
leveraging the regex engine.
– Most of the speed issues are with regex design, not the
parser itself.
– Simplifies mixing code and matching.
– Single place to get the final results.
– Cleaner syntax with automatic whitespace handling.
Extending regexen
● “use Regexp::Grammars” turns on added syntax.
– block-scoped (avoids collisions with existing code).
● You will probably want to add “xm” or “xs”
– extended syntax avoids whitespace issues.
– multi-line mode (m) simplifies line anchors for line-
oriented parsing.
– single-line mode (s) makes ignoring line-wrap
whitespace largely automatic.
– I use “xm” with explicit “\n” or “\s” matches to span
lines where necessary.
What you get
● The parser is simply a regex-ref.
– You can bless it or have multiple parsers for context
grammars.
● Grammars can reference one another.
– Extending grammars via objects or modules is
straightforward.
● Comfortable for incremental development or
refactoring.
– Largely declarative syntax helps.
– OOP provides inheritance with overrides for rules.
my $compiler
= do
{
use Regexp::Grammars;
qr
{
<data>
<rule: data > <[text]>+
<rule: text > .+
}xm
};
Example: Creating a compiler
● Context can be
a do-block,
subroutine, or
branch logic.
● “data” is the
entry rule.
● All this does is
read lines into
an array with
automatic ws
handling.
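Running it is nothing more than applying the regex. A sketch, assuming the input arrives on STDIN:
my $input = do { local $/; <STDIN> };   # slurp

if( $input =~ $compiler )
{
    my $linz = $/{ data }{ text };  # array[ref] of matched lines

    print "$_\n" for @$linz;
}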
Results: %/
● The results of parsing are in a tree-hash named %/.
– Keys are the rule names that produced the results.
– Empty keys ('') hold input text (for errors or
debugging).
– Easy to handle with Data::Dumper.
● The hash has at least one key for the entry rule, one
empty key for input data if context is being saved.
● For example, feeding two lines of a Gentoo emerge
log through the line grammar gives:
{
'' => '1367874132: Started emerge on: May 06, 2013
21:02:12
1367874132: *** emerge --jobs --autounmask-write --keep-
going --load-average=4.0 --complete-graph --with-bdeps=y
--deep talk',
data =>
{
'' => '1367874132: Started emerge on: May 06, 2013
21:02:12
1367874132: *** emerge --jobs --autounmask-write --keep-
going --load-average=4.0 --complete-graph --with-bdeps=y
--deep talk',
text =>
[
'1367874132: Started emerge on: May 06, 2013
21:02:12',
'
1367874132: *** emerge --jobs --autounmask-write --keep-
going --load-average=4.0 --complete-graph --with-bdeps=y
--deep talk'
]
Parsing a few lines of logfile
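The dump above takes only a couple of lines to produce. A sketch, with $log_text standing in for the two input lines:
use Data::Dumper;

$log_text =~ $compiler
    and print Dumper \%/;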
Getting rid of context
● The empty-keyed values are useful for
development or explicit error messages.
● They also get in the way and can cost a lot of
memory on large inputs.
● You can turn them on and off with <context:> and
<nocontext:> in the rules.
qr
{
<nocontext:> # turn off globally
<data>
<rule: data > <text>+ # oops, left off the []!
<rule: text > .+
}xm;
warn | Repeated subrule <text>+ will only capture its
final match
| (Did you mean <[text]>+ instead?)
|
{
data => {
text => '
1367874132: *** emerge --jobs --autounmask-write --keep-
going --load-average=4.0 --complete-graph --with-bdeps=y
--deep talk'
}
}
You usually want [] with +
{
data =>
{
text =>   # the <[text]> parses to an array of text
[
'1367874132: Started emerge on: May 06, 2013 21:02:12',
'
1367874132: *** emerge --jobs --autounmask-write --...
],
...
qr
{
<nocontext:> # turn off globally
<data>
<rule: data > <[text]>+
<rule: text > (.+)
}xm;
An array[ref] of text
Breaking up lines
● Each log entry is prefixed with an entry id.
● Parsing the ref_id off the front adds:
<data>
<rule: data > <[line]>+
<rule: line > <ref_id> <[text]>
<token: ref_id > ^(\d+)
<rule: text > .+
line =>
[
{
ref_id => '1367874132',
text => ': Started emerge on: May 06, 2013 21:02:12'
},
…
]
Removing cruft: “ws”
● It would be nice to remove the leading “: “ from text lines.
● In this case the “whitespace” needs to include a
colon along with the spaces.
● Whitespace is defined by <ws: … >
<rule: line> <ws:[\s:]+> <ref_id> <text>
{
ref_id => '1367874132',
text => '*** emerge --jobs --autounmask-wr...
}
The '***' prefix means something
● It would be nice to know what type of line was being
processed.
● <prefix= regex > assigns the regex's capture to the
“prefix” tag:
<rule: line > <ws:[\s:]*> <ref_id> <entry>
<rule: entry >
<prefix=([*][*][*])> <text>
|
<prefix=([>][>][>])> <text>
|
<prefix=([=][=][=])> <text>
|
<prefix=([:][:][:])> <text>
|
<text>
{
entry => {
text => 'Started emerge on: May 06, 2013 21:02:12'
},
ref_id => '1367874132'
},
{
entry => {
prefix => '***',
text => 'emerge --jobs --autounmask-write...
},
ref_id => '1367874132'
},
{
entry => {
prefix => '>>>',
text => 'emerge (1 of 2) sys-apps/...
},
ref_id => '1367874256'
}
“entry” now contains optional prefix
Aliases can also assign tag results
● Aliases assign a
key to rule
results.
● The match from
“text” is aliased
to a named type
of log entry.
<rule: entry>
<prefix=([*][*][*])> <command=text>
|
<prefix=([>][>][>])> <stage=text>
|
<prefix=([=][=][=])> <status=text>
|
<prefix=([:][:][:])> <final=text>
|
<message=text>
{
entry => {
message => 'Started emerge on: May 06, 2013 21:02:12'
},
ref_id => '1367874132'
},
{
entry => {
command => 'emerge --jobs --autounmask-write --...
prefix => '***'
},
ref_id => '1367874132'
},
{
entry => {
command => 'terminating.',
prefix => '***'
},
ref_id => '1367874133'
},
Generic “text” replaced with a type:
Parsing without capturing
● At this point we don't really need the prefix strings
since the entries are labeled.
● A leading '.' tells R::G to parse but not store the
results in %/:
<rule: entry >
<.prefix=([*][*][*])> <command=text>
|
<.prefix=([>][>][>])> <stage=text>
|
<.prefix=([=][=][=])> <status=text>
|
<.prefix=([:][:][:])> <final=text>
|
<message=text>
{
entry => {
message => 'Started emerge on: May 06, 2013 21:02:12'
},
ref_id => '1367874132'
},
{
entry => {
command => 'emerge --jobs --autounmask-write -...
},
ref_id => '1367874132'
},
{
entry => {
command => 'terminating.'
},
ref_id => '1367874133'
},
“entry” now has typed keys:
The “entry” nesting gets in the way
● The named subrule is not hard to get rid of: just
move its syntax up one level:
<ws:[\s:]*> <ref_id>
(
<.prefix=([*][*][*])> <command=text>
|
<.prefix=([>][>][>])> <stage=text>
|
<.prefix=([=][=][=])> <status=text>
|
<.prefix=([:][:][:])> <final=text>
|
<message=text>
)
data => {
line => [
{
message => 'Started emerge on: May 06, 2013 21:02:12',
ref_id => '1367874132'
},
{
command => 'emerge --jobs --autounmask-write --keep-
going --load-average=4.0 --complete-graph --with-bdeps=y --deep
talk',
ref_id => '1367874132'
},
{
command => 'terminating.',
ref_id => '1367874133'
},
{
message => 'Started emerge on: May 06, 2013 21:02:17',
ref_id => '1367874137'
},
Result: array of “line” with ref_id & type
Funny names for things
● Maybe “command” and “status” aren't the best way
to distinguish the text.
● You can store an optional token followed by text:
<rule: entry > <ws:[\s:]*> <ref_id> <type>? <text>
<token: type>
(
[*][*][*]
|
[>][>][>]
|
[=][=][=]
|
[:][:][:]
)
Entries now have “text” and “type”
entry => [
{
ref_id => '1367874132',
text => 'Started emerge on: May 06, 2013 21:02:12'
},
{
ref_id => '1367874133',
text => 'terminating.',
type => '***'
},
{
ref_id => '1367874137',
text => 'Started emerge on: May 06, 2013 21:02:17'
},
{
ref_id => '1367874137',
text => 'emerge --jobs --autounmask-write --...
type => '***'
},
Prefix alternations look ugly.
● Using a count works:
[*]{3} | [>]{3} | [:]{3} | [=]{3}
but isn't all that much more readable.
● Given the way these are used, use a character class:
[*>:=] {3}
qr
{
<nocontext:>
<data>
<rule: data > <[entry]>+
<rule: entry >
<ws:[\s:]*>
<ref_id> <prefix>? <text>
<token: ref_id > ^(\d+)
<token: prefix > [*>=:]{3}
<token: text > .+
}xm;
This is the skeleton parser:
● Doesn't take much:
– Declarative syntax.
– No Perl code at all!
● Easy to modify by
extending the
definition of “text”
for specific types of
messages.
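Wrapped in a driver the skeleton runs as-is. A sketch of the complete program (slurping via readline is an assumption):
#!/usr/bin/env perl
use v5.10;
use strict;
use warnings;
use Data::Dumper;

my $parser = do
{
    use Regexp::Grammars;

    qr
    {
        <nocontext:>
        <data>
        <rule: data > <[entry]>+
        <rule: entry >
        <ws:[\s:]*>
        <ref_id> <prefix>? <text>
        <token: ref_id > ^(\d+)
        <token: prefix > [*>=:]{3}
        <token: text > .+
    }xm;
};

my $input = do { local $/; readline }; # slurp the logfile

print Dumper \%/
    if $input =~ $parser;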
Finishing the parser
● Given the different line types it will be useful to
extract commands, switches, outcomes from
appropriate lines.
– Sub-rules can be defined for the different line types.
<rule: command> emerge
<.ws> <[switch]>+
<token: switch> ([-][-]\S+)
● This is what makes the grammars useful: nested,
context-sensitive content.
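Plugged into the skeleton, that might look like the sketch below; the bare “emerge” literal and the alternation are assumptions about the log format:
<rule: entry >
<ws:[\s:]*> <ref_id>
( <command> | <type>? <text> )

<rule: command > emerge <.ws> <[switch]>+
<token: switch > ([-][-]\S+)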
Inheriting & Extending Grammars
● <grammar: name> and <extends: name> allow a
building-block approach.
● Code can assemble the contents of a qr{} without
having to eval or deal with messy quoted strings.
● This makes modular or context-sensitive grammars
relatively simple to compose.
– References can cross package or module boundaries.
– Easy to define a basic grammar in one place and reference
or extend it from multiple other parsers.
The Non-Redundant File
● NCBI's “nr.gz” file is a list of sequences and all of
the places they are known to appear.
● It is moderately large: 140+GB uncompressed.
● The file consists of a simple FASTA format with
headings separated by ctrl-A characters:
>Heading 1
[amino-acid sequence characters...]
>Heading 2
...
Example: A short nr.gz FASTA entry
● Headings are grouped by species, separated by ctrl-A
(“\cA”) characters.
– Each species has a set of sources & identifier pairs
followed by a single description.
– Within-species separator is a pipe (“|”) with optional
whitespace.
– Species counts in some headers run into the thousands.
>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827
[Dictyostelium discoideum AX4]gi|1705556|sp|P54670.1|CAF1_DICDI
RecName: Full=Calfumirin-1; Short=CAF-1gi|793761|dbj|BAA06266.1|
calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL68086.1|
hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQ...
KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEK...
VQKLLNPDQ
First step: Parse FASTA
qr
{
<grammar: Parse::Fasta>
<nocontext:>
<rule: fasta > <.start> <head> <.ws> <[body]>+
<rule: head > .+ <.ws>
<rule: body > ( <[seq]> | <.comment> ) <.ws>
<token: start > ^ [>]
<token: comment > ^ [;] .+
<token: seq > ^ [\n\w\-]+
}xm;
● Instead of defining an entry rule, this just defines a
name “Parse::Fasta”.
– This cannot be used to generate results by itself.
– Accessible anywhere via Regexp::Grammars.
The output needs help, however.
● The “<seq>” token captures newlines that need to be
stripped out to get a single string.
● Munging these requires adding code to the parser using
Perl's regex code-block syntax: (?{...})
– Allows inserting almost-arbitrary code into the regex.
– “almost” because the code cannot include regexen.
seq =>
[ 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYD
KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDP
VQKLLNPDQ
'
]
Munging results: $MATCH
● The $MATCH and %MATCH can be assigned to alter
the results from the current or lower levels of the parse.
● In this case I take the “seq” match contents out of %/,
join them with nothing, and use “tr” to strip the
newlines.
– join + split won't work because split uses a regex.
<rule: body > ( <[seq]> | <.comment> ) <.ws>
(?{
$MATCH = join '' => @{ delete $MATCH{ seq } };
$MATCH =~ tr/\n//d;
})
One more step: Remove the arrayref
● Now the body is a single string.
● No need for an arrayref to contain one string.
● Since the body has one entry, assign offset zero:
body =>
[
'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDK
DNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDT
KDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ'
],
<rule: fasta> <.start> <head> <.ws> <[body]>+
(?{
$MATCH{ body } = $MATCH{ body }[0];
})
Result: a generic FASTA parser.
{
fasta => [
{
body =>
'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDK
DNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDIT
KDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
head => 'gi|66816243|ref|XP_642131.1| hypothetical p
rotein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556
|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=C
AF-1gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium
discoideum]gi|60470106|gb|EAL68086.1| hypothetical protein
DDB_G0277827 [Dictyostelium discoideum AX4]
'
}
]
}
● The head and body are easily accessible.
● Next: parse the nr-specific header.
Deriving a grammar
● Existing grammars are “extended”.
● The derived grammars are capable of producing
results.
● In this case:
● References the grammar and extracts a list of fasta
entries.
<extends: Parse::Fasta>
<[fasta]>+
Splitting the head into identifiers
● Overloading fasta's “head” rule handles splitting
identifiers for individual species.
● Catch: \cA is a separator, not a terminator.
– The tail item on the list doesn't have a \cA to anchor on.
– Using “.+ [\cA\n]” walks off the header onto the sequence.
– This is a common problem with separators & tokenizers.
– This can be handled with special tokens in the grammar,
but R::G provides a cleaner way.
First pass: Literal “tail” item
● This works but is ugly:
– Have two rules for the main list and tail.
– Alias the tail to get them all in one place.
<rule: head> <[ident]>+ <[ident=final]>
(?{
# remove the matched anchors
tr/\cA\n//d for @{ $MATCH{ ident } };
})
<token: ident > .+? \cA
<token: final > .+ \n
Breaking up the header
● The last header item is aliased to “ident”.
● Breaks up all of the entries:
head => {
ident => [
'gi|66816243|ref|XP_642131.1| hypothetical protein
DDB_G0277827 [Dictyostelium discoideum AX4]',
'gi|1705556|sp|P54670.1|CAF1_DICDI RecName:
Full=Calfumirin-1; Short=CAF-1',
'gi|793761|dbj|BAA06266.1| calfumirin-1
[Dictyostelium discoideum]',
'gi|60470106|gb|EAL68086.1| hypothetical protein
DDB_G0277827 [Dictyostelium discoideum AX4]'
]
}
Dealing with separators: '%' <sep>
● Separators happen often enough:
– 1, 2, 3 , 4 ,13, 91 # numbers by commas, spaces
– g-c-a-g-t-t-a-c-a # characters by dashes
– /usr/local/bin # basenames by dir markers
– /usr:/usr/local:bin # dir's separated by colons
that R::G has special syntax for dealing with them.
● Combining the item with '%' and a separator:
<rule: list> <[item]>+ % <separator> # one-or-more
<rule: list_zom> <[item]>* % <separator> # zero-or-more
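A tiny standalone example of the separator syntax; the grammar and input here are hypothetical:
my $listp = do
{
    use Regexp::Grammars;

    qr
    {
        <nocontext:>
        <list>
        <rule: list > <[item]>+ % [,]
        <token: item > \d+
    }xm;
};

'1, 2, 3 , 4 ,13, 91' =~ $listp;

# %/ now holds: { list => { item => [ 1, 2, 3, 4, 13, 91 ] } }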
Cleaner nr.gz header rule
● Separator syntax cleans things up:
– No more tail rule with an alias.
– No code block required to strip the separators and trailing
newline.
– Non-greedy match “.+?” avoids capturing separators.
qr
{
<nocontext:>
<extends: Parse::Fasta>
<[fasta]>+
<rule: head > <[ident]>+ % [\cA]
<token: ident > .+?
}xm
Nested “ident” tag is extraneous
● Simpler to replace the “head” with a list of
identifiers.
● Replace $MATCH from the “head” rule with the
nested identifier contents:
qr
{
<nocontext:>
<extends: Parse::Fasta>
<[fasta]>+
<rule: head > <[ident]>+ % [\cA]
(?{
$MATCH = delete $MATCH{ ident };
})
<token: ident > .+?
}xm
Result:
{
fasta => [
{
body => 'MASTQNIVEEVQKMLDT...NPDQ',
head => [
'gi|66816243|ref|XP_6...rt=CAF-1',
'gi|793761|dbj|BAA0626...oideum]',
'gi|60470106|gb|EAL68086...m discoideum AX4]'
]
}
]
}
● The fasta content is broken into the usual “body” plus
a “head” broken down on \cA boundaries.
● Not bad for a dozen lines of grammar with a few
lines of code:
One more level of structure: idents.
● Species have <source> | <identifier> pairs followed
by a description.
● Add a separator clause “ % (?: \s* [|] \s* )”
– This can be parsed into a hash something like:
gi|66816243|ref|XP_642131.1|hypothetical ...
Becomes:
{
gi => '66816243',
ref => 'XP_642131.1',
desc => 'hypothetical...'
}
Munging the separated input
<fasta>
(?{
my $identz = delete $MATCH{ fasta }{ head }{ ident };
for( @$identz )
{
my $pairz = $_->{ taxa };
my $desc = pop @$pairz;
$_ = { @$pairz, desc => $desc }
}
$MATCH{ fasta }{ head } = $identz;
})
<rule: head > <[ident]>+ % [\cA]
<token: ident > <[taxa]>+ % (?: \s* [|] \s* )
<token: taxa > .+?
Result: head with sources, “desc”
{
fasta => {
body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKR...EDQN',
head => [
{
desc => '30S ribosomal protein S18 [Lactococ...
gi => '15674171',
ref => 'NP_268346.1'
},
{
desc => '30S ribosomal protein S18 [Lactoco...
gi => '116513137',
ref => 'YP_812044.1'
},
...
Balancing R::G with calling code
● The regex engine could process all of nr.gz.
– Catch: <[fasta]>+ returns about 250_000 keys and literally
millions of total identifiers in the heads.
– Better approach: <fasta> on single entries, but chunking input
on '>' removes it as a leading character.
– Making it optional with <.start>? fixes the problem:
local $/ = '>';
while( my $chunk = readline )
{
chomp;
length $chunk or do { --$.; next };
$chunk =~ $nr_gz;
# process single fasta record in %/
}
Fasta base grammar: 3 lines of code
qr
{
<grammar: Parse::Fasta>
<nocontext:>
<rule: fasta > <.start>? <head> <.ws> <[body]>+
(?{
$MATCH{ body } = $MATCH{ body }[0];
})
<rule: head > .+ <.ws>
<rule: body > ( <[seq]> | <.comment> ) <.ws>
(?{
$MATCH = join '' => @{ delete $MATCH{ seq } };
$MATCH =~ tr/\n//d;
})
<token: start > ^ [>]
<token: comment > ^ [;] .+
<token: seq > ^ ( [\n\w\-]+ )
}xm;
Extension to Fasta: 6 lines of code.
qr
{
<nocontext:>
<extends: Parse::Fasta>
<fasta>
(?{
my $identz = delete $MATCH{ fasta }{ head }{ ident };
for( @$identz )
{
my $pairz = $_->{ taxa };
my $desc = pop @$pairz;
$_ = { @$pairz, desc => $desc };
}
$MATCH{ fasta }{ head } = $identz;
})
<rule: head > <[ident]>+ % [\cA]
<rule: ident > <[taxa]>+ % (?: \s* [|] \s* )
<token: taxa > .+?
}xm
Result: Use grammars
● Most of the “real” work is done under the hood.
– Regexp::Grammars does the lexing, basic compilation.
– Code only needed for cleanups or re-arranging structs.
● Code can simplify your grammar.
– Too much code makes them hard to maintain.
– Trick is keeping the balance between simplicity in the
grammar and cleanup in the code.
● Either way, the result is going to be more
maintainable than hardwiring the grammar into code.
Aside: KwikFix for Perl v5.18
● v5.17 changed how the regex engine handles inline
code.
● Code that used to be eval-ed in the regex is now
compiled up front.
– This requires “use re 'eval'” and “no strict 'vars'”.
– One for the Perl code, the other for $MATCH and friends.
● The immediate fix for this is in the last few lines of
R::G::import, which push the pragmas into the caller:
● Look up $^H in perlvars to see how it works.
require re; re->import( 'eval' );
require strict; strict->unimport( 'vars' );
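Stuck with an older R::G on v5.18? The same pragmas can go into the block by hand. A sketch; the grammar itself is only a placeholder:
my $parser = do
{
    use Regexp::Grammars;
    use re 'eval';      # allow the inline (?{...}) blocks
    no strict 'vars';   # $MATCH & friends are package variables

    qr
    {
        <nocontext:>
        <data>
        <rule: data > <[text]>+
        (?{ $MATCH{ count } = @{ $MATCH{ text } } })
        <token: text > .+
    }xm;
};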
Use Regexp::Grammars
● Unless you have old YACC BNF grammars to
convert, the newer facility for defining the
grammars is cleaner.
– Frankly, even if you do have old grammars...
● Regexp::Grammars avoids the performance pitfalls
of P::RD.
– It is worth taking time to learn how to optimize
non-deterministic regexen, however.
● Or, better yet, use Perl6 grammars, available today
at your local copy of Rakudo Perl6.
More info on Regexp::Grammars
● The POD is thorough and quite descriptive
[comfortable chair, enjoyable beverage suggested].
● The ./demo directory has a number of working – if
un-annotated – examples.
● “perldoc perlre” shows how recursive matching works in
v5.10+.
● PerlMonks has plenty of good postings.
● Perl Review article by brian d foy on recursive
matching in Perl 5.10.

Más contenido relacionado

La actualidad más candente

Hyperledger 구조 분석
Hyperledger 구조 분석Hyperledger 구조 분석
Hyperledger 구조 분석Jongseok Choi
 
typemap in Perl/XS
typemap in Perl/XS  typemap in Perl/XS
typemap in Perl/XS charsbar
 
Introduction to PHP 5.3
Introduction to PHP 5.3Introduction to PHP 5.3
Introduction to PHP 5.3guestcc91d4
 
30 Minutes To CPAN
30 Minutes To CPAN30 Minutes To CPAN
30 Minutes To CPANdaoswald
 
Use perl creating web services with xml rpc
Use perl creating web services with xml rpcUse perl creating web services with xml rpc
Use perl creating web services with xml rpcJohnny Pork
 
Open Gurukul Language PL/SQL
Open Gurukul Language PL/SQLOpen Gurukul Language PL/SQL
Open Gurukul Language PL/SQLOpen Gurukul
 
Generating parsers using Ragel and Lemon
Generating parsers using Ragel and LemonGenerating parsers using Ragel and Lemon
Generating parsers using Ragel and LemonTristan Penman
 
10 Most Important Features of New PHP 5.6
10 Most Important Features of New PHP 5.610 Most Important Features of New PHP 5.6
10 Most Important Features of New PHP 5.6Webline Infosoft P Ltd
 
Aura for PHP at Fossmeet 2014
Aura for PHP at Fossmeet 2014Aura for PHP at Fossmeet 2014
Aura for PHP at Fossmeet 2014Hari K T
 
Internet Technology and its Applications
Internet Technology and its ApplicationsInternet Technology and its Applications
Internet Technology and its Applicationsamichoksi
 
Perl 5.10 in 2010
Perl 5.10 in 2010Perl 5.10 in 2010
Perl 5.10 in 2010guest7899f0
 
Modern Perl Catch-Up
Modern Perl Catch-UpModern Perl Catch-Up
Modern Perl Catch-UpDave Cross
 
Stop overusing regular expressions!
Stop overusing regular expressions!Stop overusing regular expressions!
Stop overusing regular expressions!Franklin Chen
 

La actualidad más candente (20)

Hyperledger 구조 분석
Hyperledger 구조 분석Hyperledger 구조 분석
Hyperledger 구조 분석
 
typemap in Perl/XS
typemap in Perl/XS  typemap in Perl/XS
typemap in Perl/XS
 
Introduction to PHP 5.3
Introduction to PHP 5.3Introduction to PHP 5.3
Introduction to PHP 5.3
 
30 Minutes To CPAN
30 Minutes To CPAN30 Minutes To CPAN
30 Minutes To CPAN
 
Use perl creating web services with xml rpc
Use perl creating web services with xml rpcUse perl creating web services with xml rpc
Use perl creating web services with xml rpc
 
Php
PhpPhp
Php
 
Cs3430 lecture 15
Cs3430 lecture 15Cs3430 lecture 15
Cs3430 lecture 15
 
Javascript
JavascriptJavascript
Javascript
 
Open Gurukul Language PL/SQL
Open Gurukul Language PL/SQLOpen Gurukul Language PL/SQL
Open Gurukul Language PL/SQL
 
Generating parsers using Ragel and Lemon
Generating parsers using Ragel and LemonGenerating parsers using Ragel and Lemon
Generating parsers using Ragel and Lemon
 
10 Most Important Features of New PHP 5.6
10 Most Important Features of New PHP 5.610 Most Important Features of New PHP 5.6
10 Most Important Features of New PHP 5.6
 
Perl Basics with Examples
Perl Basics with ExamplesPerl Basics with Examples
Perl Basics with Examples
 
Aura for PHP at Fossmeet 2014
Aura for PHP at Fossmeet 2014Aura for PHP at Fossmeet 2014
Aura for PHP at Fossmeet 2014
 
Internet Technology and its Applications
Internet Technology and its ApplicationsInternet Technology and its Applications
Internet Technology and its Applications
 
Introduction to Perl and BioPerl
Introduction to Perl and BioPerlIntroduction to Perl and BioPerl
Introduction to Perl and BioPerl
 
Perl Programming - 01 Basic Perl
Perl Programming - 01 Basic PerlPerl Programming - 01 Basic Perl
Perl Programming - 01 Basic Perl
 
Perl 5.10 in 2010
Perl 5.10 in 2010Perl 5.10 in 2010
Perl 5.10 in 2010
 
Modern Perl Catch-Up
Modern Perl Catch-UpModern Perl Catch-Up
Modern Perl Catch-Up
 
Stop overusing regular expressions!
Stop overusing regular expressions!Stop overusing regular expressions!
Stop overusing regular expressions!
 
Php
PhpPhp
Php
 

Similar a Perly Parsing with Regexp::Grammars

Angular JS in 2017
Angular JS in 2017Angular JS in 2017
Angular JS in 2017Ayush Sharma
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchAndrew Lowe
 
Killing the Angle Bracket
Killing the Angle BracketKilling the Angle Bracket
Killing the Angle Bracketjnewmanux
 
NoSQL and SQL Anti Patterns
NoSQL and SQL Anti PatternsNoSQL and SQL Anti Patterns
NoSQL and SQL Anti PatternsGleicon Moraes
 
Dart the Better JavaScript
Dart the Better JavaScriptDart the Better JavaScript
Dart the Better JavaScriptJorg Janke
 
Perly Parallel Processing of Fixed Width Data Records
Perly Parallel Processing of Fixed Width Data RecordsPerly Parallel Processing of Fixed Width Data Records
Perly Parallel Processing of Fixed Width Data RecordsWorkhorse Computing
 
Functional Smalltalk
Functional SmalltalkFunctional Smalltalk
Functional SmalltalkESUG
 
msc_pyparser - ModSecurity config parser presentation @CRS Community Summit i...
msc_pyparser - ModSecurity config parser presentation @CRS Community Summit i...msc_pyparser - ModSecurity config parser presentation @CRS Community Summit i...
msc_pyparser - ModSecurity config parser presentation @CRS Community Summit i...digitalwave
 
Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014Michael Renner
 
The use of the code analysis library OpenC++: modifications, improvements, er...
The use of the code analysis library OpenC++: modifications, improvements, er...The use of the code analysis library OpenC++: modifications, improvements, er...
The use of the code analysis library OpenC++: modifications, improvements, er...PVS-Studio
 
Perl - laziness, impatience, hubris, and one liners
Perl - laziness, impatience, hubris, and one linersPerl - laziness, impatience, hubris, and one liners
Perl - laziness, impatience, hubris, and one linersKirk Kimmel
 
How to check valid Email? Find using regex.
How to check valid Email? Find using regex.How to check valid Email? Find using regex.
How to check valid Email? Find using regex.Poznań Ruby User Group
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastHolden Karau
 
Understanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQLUnderstanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQLHyderabad Scalability Meetup
 
Practical catalyst
Practical catalystPractical catalyst
Practical catalystdwm042
 
Our Friends the Utils: A highway traveled by wheels we didn't re-invent.
Our Friends the Utils: A highway traveled by wheels we didn't re-invent. Our Friends the Utils: A highway traveled by wheels we didn't re-invent.
Our Friends the Utils: A highway traveled by wheels we didn't re-invent. Workhorse Computing
 

Similar a Perly Parsing with Regexp::Grammars (20)

Angular JS in 2017
Angular JS in 2017Angular JS in 2017
Angular JS in 2017
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible research
 
Killing the Angle Bracket
Killing the Angle BracketKilling the Angle Bracket
Killing the Angle Bracket
 
NoSQL and SQL Anti Patterns
NoSQL and SQL Anti PatternsNoSQL and SQL Anti Patterns
NoSQL and SQL Anti Patterns
 
Os Wilhelm
Os WilhelmOs Wilhelm
Os Wilhelm
 
Dart the Better JavaScript
Dart the Better JavaScriptDart the Better JavaScript
Dart the Better JavaScript
 
Perly Parallel Processing of Fixed Width Data Records
Perly Parallel Processing of Fixed Width Data RecordsPerly Parallel Processing of Fixed Width Data Records
Perly Parallel Processing of Fixed Width Data Records
 
Functional Smalltalk
Functional SmalltalkFunctional Smalltalk
Functional Smalltalk
 
JavaScripts & jQuery
JavaScripts & jQueryJavaScripts & jQuery
JavaScripts & jQuery
 
msc_pyparser - ModSecurity config parser presentation @CRS Community Summit i...
msc_pyparser - ModSecurity config parser presentation @CRS Community Summit i...msc_pyparser - ModSecurity config parser presentation @CRS Community Summit i...
msc_pyparser - ModSecurity config parser presentation @CRS Community Summit i...
 
Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014
 
The use of the code analysis library OpenC++: modifications, improvements, er...
The use of the code analysis library OpenC++: modifications, improvements, er...The use of the code analysis library OpenC++: modifications, improvements, er...
The use of the code analysis library OpenC++: modifications, improvements, er...
 
Perl - laziness, impatience, hubris, and one liners
Perl - laziness, impatience, hubris, and one linersPerl - laziness, impatience, hubris, and one liners
Perl - laziness, impatience, hubris, and one liners
 
How to check valid Email? Find using regex.
How to check valid Email? Find using regex.How to check valid Email? Find using regex.
How to check valid Email? Find using regex.
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
 
Understanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQLUnderstanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQL
 
7986-lect 7.pdf
7986-lect 7.pdf7986-lect 7.pdf
7986-lect 7.pdf
 
Oct.22nd.Presentation.Final
Oct.22nd.Presentation.FinalOct.22nd.Presentation.Final
Oct.22nd.Presentation.Final
 
Practical catalyst
Practical catalystPractical catalyst
Practical catalyst
 
Our Friends the Utils: A highway traveled by wheels we didn't re-invent.
Our Friends the Utils: A highway traveled by wheels we didn't re-invent. Our Friends the Utils: A highway traveled by wheels we didn't re-invent.
Our Friends the Utils: A highway traveled by wheels we didn't re-invent.
 

Más de Workhorse Computing

Wheels we didn't re-invent: Perl's Utility Modules
Wheels we didn't re-invent: Perl's Utility ModulesWheels we didn't re-invent: Perl's Utility Modules
Wheels we didn't re-invent: Perl's Utility ModulesWorkhorse Computing
 
Paranormal statistics: Counting What Doesn't Add Up
Paranormal statistics: Counting What Doesn't Add UpParanormal statistics: Counting What Doesn't Add Up
Paranormal statistics: Counting What Doesn't Add UpWorkhorse Computing
 
The $path to knowledge: What little it take to unit-test Perl.
The $path to knowledge: What little it take to unit-test Perl.The $path to knowledge: What little it take to unit-test Perl.
The $path to knowledge: What little it take to unit-test Perl.Workhorse Computing
 
Generating & Querying Calendar Tables in Posgresql
Generating & Querying Calendar Tables in PosgresqlGenerating & Querying Calendar Tables in Posgresql
Generating & Querying Calendar Tables in PosgresqlWorkhorse Computing
 
Hypers and Gathers and Takes! Oh my!
Hypers and Gathers and Takes! Oh my!Hypers and Gathers and Takes! Oh my!
Hypers and Gathers and Takes! Oh my!Workhorse Computing
 
BSDM with BASH: Command Interpolation
BSDM with BASH: Command InterpolationBSDM with BASH: Command Interpolation
BSDM with BASH: Command InterpolationWorkhorse Computing
 
BASH Variables Part 1: Basic Interpolation
BASH Variables Part 1: Basic InterpolationBASH Variables Part 1: Basic Interpolation
BASH Variables Part 1: Basic InterpolationWorkhorse Computing
 
The W-curve and its application.
The W-curve and its application.The W-curve and its application.
The W-curve and its application.Workhorse Computing
 
Keeping objects healthy with Object::Exercise.
Keeping objects healthy with Object::Exercise.Keeping objects healthy with Object::Exercise.
Keeping objects healthy with Object::Exercise.Workhorse Computing
 
Perl6 Regexen: Reduce the line noise in your code.
Perl6 Regexen: Reduce the line noise in your code.Perl6 Regexen: Reduce the line noise in your code.
Perl6 Regexen: Reduce the line noise in your code.Workhorse Computing
 
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6Workhorse Computing
 

Más de Workhorse Computing (20)

Wheels we didn't re-invent: Perl's Utility Modules
Wheels we didn't re-invent: Perl's Utility ModulesWheels we didn't re-invent: Perl's Utility Modules
Wheels we didn't re-invent: Perl's Utility Modules
 
mro-every.pdf
mro-every.pdfmro-every.pdf
mro-every.pdf
 
Paranormal statistics: Counting What Doesn't Add Up
Paranormal statistics: Counting What Doesn't Add UpParanormal statistics: Counting What Doesn't Add Up
Paranormal statistics: Counting What Doesn't Add Up
 
The $path to knowledge: What little it take to unit-test Perl.
The $path to knowledge: What little it take to unit-test Perl.The $path to knowledge: What little it take to unit-test Perl.
The $path to knowledge: What little it take to unit-test Perl.
 
Unit Testing Lots of Perl
Unit Testing Lots of PerlUnit Testing Lots of Perl
Unit Testing Lots of Perl
 
Generating & Querying Calendar Tables in Posgresql
Generating & Querying Calendar Tables in PosgresqlGenerating & Querying Calendar Tables in Posgresql
Generating & Querying Calendar Tables in Posgresql
 
Hypers and Gathers and Takes! Oh my!
Hypers and Gathers and Takes! Oh my!Hypers and Gathers and Takes! Oh my!
Hypers and Gathers and Takes! Oh my!
 
BSDM with BASH: Command Interpolation
BSDM with BASH: Command InterpolationBSDM with BASH: Command Interpolation
BSDM with BASH: Command Interpolation
 
Findbin libs
Findbin libsFindbin libs
Findbin libs
 
Memory Manglement in Raku
Memory Manglement in RakuMemory Manglement in Raku
Memory Manglement in Raku
 
BASH Variables Part 1: Basic Interpolation
BASH Variables Part 1: Basic InterpolationBASH Variables Part 1: Basic Interpolation
BASH Variables Part 1: Basic Interpolation
 
Effective Benchmarks
Effective BenchmarksEffective Benchmarks
Effective Benchmarks
 
Metadata-driven Testing
Metadata-driven TestingMetadata-driven Testing
Metadata-driven Testing
 
The W-curve and its application.
The W-curve and its application.The W-curve and its application.
The W-curve and its application.
 
Keeping objects healthy with Object::Exercise.
Keeping objects healthy with Object::Exercise.Keeping objects healthy with Object::Exercise.
Keeping objects healthy with Object::Exercise.
 
Perl6 Regexen: Reduce the line noise in your code.
Perl6 Regexen: Reduce the line noise in your code.Perl6 Regexen: Reduce the line noise in your code.
Perl6 Regexen: Reduce the line noise in your code.
 
Smoking docker
Smoking dockerSmoking docker
Smoking docker
 
Getting Testy With Perl6
Getting Testy With Perl6Getting Testy With Perl6
Getting Testy With Perl6
 
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
 
Neatly folding-a-tree
Neatly folding-a-treeNeatly folding-a-tree
Neatly folding-a-tree
 

Último

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 

Último (20)

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 

Perly Parsing with Regexp::Grammars

  • 2. Grammars are the guts of compilers ● Compilers convert text from one form to another. – C compilers convert C source to CPU-specific assembly. – Databases compile SQL into RDBMS op's. ● Grammars define structure, precedence, valid inputs. – Realistic ones are often recursive or context-sensitive. – The complexity in defining grammars led to a variety of tools for defining them. – The standard format for a long time has been “BNF”, which is the input to YACC. ● They are wasted on for 'flat text'. – If “split /t/” does the job skip grammars entirely.
  • 3. The first Yet Another: YACC ● Yet Another Compiler Compiler – YACC takes in a standard-format grammar structure. – It processes tokens and their values, organizing the results according to the grammar into a structure. ● Between the source and YACC is a tokenizer. – This parses the inputs into individual tokens defined by the grammar. – It doesn't know about structure, only breaking the text stream up into tokens.
  • 4. Parsing is a pain in the lex ● The real pain is gluing the parser and tokenizer together. – Tokenizers deal in the language of patterns. – Grammars are defined in terms of structure. ● Passing data between them makes for most of the difficulty. – One issue is the global yylex call, which makes having multiple parsers difficult. – Context-sensitive grammars with multiple sub- grammars are painful.
  • 5. The perly way ● Regexen, logic, glue... hmm... been there before. – The first approach most of us try is lexing with regexen. – Then add captures and if-blocks or excute (?{code}) blocks inside of each regex. ● The problem is that the grammar is embedded in your code structure. – You have to modify the code structure to change the grammar or its tokens. – Hubris, maybe, but Truly Lazy it ain't. – Was the whole reason for developing standard grammars & their handlers in the first place.
  • 6. Early Perl Grammar Modules ● These take in a YACC grammar and spit out compiler code. ● Intentionally looked like YACC: – Able to re-cycle existing YACC grammar files. – Benefit from using Perl as a built-in lexer. – Perl-byacc & Parse::Yapp. ● Good: Recycles knowledge for YACC users. ● Bad: Still not lazy: The grammars are difficult to maintain and you still have to plug in post- processing code to deal with the results.
  • 7. %right '=' %left '-' '+' %left '*' '/' %left NEG %right '^' %% input: #empty | input line { push(@{$_[1]},$_[2]); $_[1] } ; line: 'n' { $_[1] } | exp 'n' { print "$_[1]n" } | error 'n' { $_[0]->YYErrok } ; exp: NUM | VAR { $_[0]->YYData->{VARS}{$_[1]} } | VAR '=' exp { $_[0]->YYData->{VARS}{$_[1]}=$_[3] } | exp '+' exp { $_[1] + $_[3] } | exp '-' exp { $_[1] - $_[3] } | exp '*' exp { $_[1] * $_[3] } Example: Parse::Yapp grammar
  • 8. The Swiss Army Chainsaw ● Parse::RecDescent extended the original BNF syntax, combining the tokens & handlers. ● Grammars are largely declarative, using OO Perl to do the heavy lifting. – OO interface allows multiple, context sensitive parsers. – Rules with Perl blocks allows the code to do anything. – Results can be acquired from a hash, an array, or $1. – Left, right, associative tags simplify messy situations.
  • 9. Example P::RD ● This is part of an infix formula compiler I wrote. ● It compiles equations to a sequence of closures. add_op : '+' | '-' | '%' { $item[ 1 ] } mult_op : '*' | '/' | '^' { $item[ 1 ] } add : <leftop: mult add_op mult> { compile_binop @{ $item[1] } } mult : <leftop: factor mult_op factor> { compile_binop @{ $item[1] } }
  • 10. Just enough rope to shoot yourself... ● The biggest problem: P::RD is sloooooooowsloooooooow. ● Learning curve is perl-ish: shallow and long. – Unless you really know what all of it does you may not be able to figure out the pieces. – Lots of really good docs that most people never read. ● Perly blocks also made it look too much like a job- dispatcher. – People used it for a lot of things that are not compilers. – Good & Bad thing: it really is a compiler.
  • 11. R.I.P. P::RD ● Supposed to be replaced with Parse::FastDescent. – Damian dropped work on P::FD for Perl6. – His goal was to replace the shortcomings with P::RD with something more complete, and quite a bit faster. ● The result is Perl6 Grammars. – Declarative syntax extends matching with rules. – Built into Perl6 as a structure, not an add-on. – Much faster. – Not available in Perl5
  • 12. Regex::Grammars ● Perl5 implementation derived from Perl6. – Back-porting an idea, not the Perl6 syntax. – Much better performance than P::RD. ● Extends the v5.10 recursive matching syntax, leveraging the regex engine. – Most of the speed issues are with regex design, not the parser itself. – Simplifies mixing code and matching. – Single place to get the final results. – Cleaner syntax with automatic whitespace handling.
  • 13. Extending regexen ● “use Regexp::Grammar” turns on added syntax. – block-scoped (avoids collisions with existing code). ● You will probably want to add “xm” or “xs” – extended syntax avoids whitespace issues. – multi-line mode (m) simplifies line anchors for line- oriented parsing. – single-line mode (s) makes ignoring line-wrap whitespace largely automatic. – I use “xm” with explicit “n” or “s” matches to span lines where necessary.
  • 14. What you get ● The parser is simply a regex-ref. – You can bless it or have multiple parsers for context grammars. ● Grammars can reference one another. – Extending grammars via objects or modules is straightforward. ● Comfortable for incremental development or refactoring. – Largely declarative syntax helps. – OOP provides inheritance with overrides for rules.
  • 15. my $compiler = do { use Regexp::Grammars; qr { <data> <rule: data > <[text]>+ <rule: text > .+ }xm }; Example: Creating a compiler ● Context can be a do-block, subroutine, or branch logic. ● “data” is the entry rule. ● All this does is read lines into an array with automatic ws handling.
  • 16. Results: %/ ● The results of parsing are in a tree-hash named %/. – Keys are the rule names that produced the results. – Empty keys ('') hold input text (for errors or debugging). – Easy to handle with Data::Dumper. ● The hash has at least one key for the entry rule, one empty key for input data if context is being saved. ● For example, feeding two lines of a Gentoo emerge log through the line grammar gives:
  • 17. { '' => '1367874132: Started emerge on: May 06, 2013 21:02:12 1367874132: *** emerge --jobs --autounmask-write --keep- going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk', data => { '' => '1367874132: Started emerge on: May 06, 2013 21:02:12 1367874132: *** emerge --jobs --autounmask-write --keep- going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk', text => [ '1367874132: Started emerge on: May 06, 2013 21:02:12', ' 1367874132: *** emerge --jobs --autounmask-write --keep- going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk' ] Parsing a few lines of logfile
  • 18. Getting rid of context ● The empty-keyed values are useful for development or explicit error messages. ● They also get in the way and can cost a lot of memory on large inputs. ● You can turn them on and off with <context:> and <nocontext:> in the rules.
  • 19. qr { <nocontext:> # turn off globally <data> <rule: data > <text>+ # oops, left off the []! <rule: text > .+ }xm; warn | Repeated subrule <text>+ will only capture its final match | (Did you mean <[text]>+ instead?) | { data => { text => ' 1367874132: *** emerge --jobs --autounmask-write --keep- going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk' } } You usually want [] with +
  • 20. { data => { text => the [text] parses to an array of text [ '1367874132: Started emerge on: May 06, 2013 21:02:12', ' 1367874132: *** emerge --jobs --autounmask-write –... ], ... qr { <nocontext:> # turn off globally <data> <rule: data > <[text]>+ <rule: text > (.+) }xm; An array[ref] of text
• 21. Breaking up lines ● Each log entry is prefixed with an entry id. ● Parsing the ref_id off the front adds: <data> <rule: data > <[line]>+ <rule: line > <ref_id> <[text]> <token: ref_id > ^(\d+) <rule: text > .+ line => [ { ref_id => '1367874132', text => ': Started emerge on: May 06, 2013 21:02:12' }, … ]
• 22. Removing cruft: “ws” ● It would be nice to remove the leading “: “ from text lines. ● In this case the “whitespace” needs to include a colon along with the spaces. ● Whitespace is defined by <ws: … > <rule: line> <ws:[\s:]+> <ref_id> <text> { ref_id => '1367874132', text => '*** emerge --jobs --autounmask-wr... }
• 23. The '***' prefix means something ● It would be nice to know what type of line was being processed. ● <prefix= regex > assigns the regex's capture to the “prefix” tag: <rule: line > <ws:[\s:]*> <ref_id> <entry> <rule: entry > <prefix=([*][*][*])> <text> | <prefix=([>][>][>])> <text> | <prefix=([=][=][=])> <text> | <prefix=([:][:][:])> <text> | <text>
• 24. { entry => { text => 'Started emerge on: May 06, 2013 21:02:12' }, ref_id => '1367874132' }, { entry => { prefix => '***', text => 'emerge --jobs --autounmask-write... }, ref_id => '1367874132' }, { entry => { prefix => '>>>', text => 'emerge (1 of 2) sys-apps/... }, ref_id => '1367874256' } “entry” now contains optional prefix
  • 25. Aliases can also assign tag results ● Aliases assign a key to rule results. ● The match from “text” is aliased to a named type of log entry. <rule: entry> <prefix=([*][*][*])> <command=text> | <prefix=([>][>][>])> <stage=text> | <prefix=([=][=][=])> <status=text> | <prefix=([:][:][:])> <final=text> | <message=text>
• 26. { entry => { message => 'Started emerge on: May 06, 2013 21:02:12' }, ref_id => '1367874132' }, { entry => { command => 'emerge --jobs --autounmask-write --... prefix => '***' }, ref_id => '1367874132' }, { entry => { command => 'terminating.', prefix => '***' }, ref_id => '1367874133' }, Generic “text” replaced with a type:
  • 27. Parsing without capturing ● At this point we don't really need the prefix strings since the entries are labeled. ● A leading '.' tells R::G to parse but not store the results in %/: <rule: entry > <.prefix=([*][*][*])> <command=text> | <.prefix=([>][>][>])> <stage=text> | <.prefix=([=][=][=])> <status=text> | <.prefix=([:][:][:])> <final=text> | <message=text>
  • 28. { entry => { message => 'Started emerge on: May 06, 2013 21:02:12' }, ref_id => '1367874132' }, { entry => { command => 'emerge --jobs --autounmask-write -... }, ref_id => '1367874132' }, { entry => { command => 'terminating.' }, ref_id => '1367874133' }, “entry” now has typed keys:
• 29. The “entry” nesting gets in the way ● The named subrule is not hard to get rid of: just move its syntax up one level: <ws:[\s:]*> <ref_id> ( <.prefix=([*][*][*])> <command=text> | <.prefix=([>][>][>])> <stage=text> | <.prefix=([=][=][=])> <status=text> | <.prefix=([:][:][:])> <final=text> | <message=text> )
  • 30. data => { line => [ { message => 'Started emerge on: May 06, 2013 21:02:12', ref_id => '1367874132' }, { command => 'emerge --jobs --autounmask-write --keep- going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk', ref_id => '1367874132' }, { command => 'terminating.', ref_id => '1367874133' }, { message => 'Started emerge on: May 06, 2013 21:02:17', ref_id => '1367874137' }, Result: array of “line” with ref_id & type
• 31. Funny names for things ● Maybe “command” and “status” aren't the best way to distinguish the text. ● You can store an optional token followed by text: <rule: entry > <ws:[\s:]*> <ref_id> <type>? <text> <token: type> ( [*][*][*] | [>][>][>] | [=][=][=] | [:][:][:] )
• 32. Entries now have “text” and “type” entry => [ { ref_id => '1367874132', text => 'Started emerge on: May 06, 2013 21:02:12' }, { ref_id => '1367874133', text => 'terminating.', type => '***' }, { ref_id => '1367874137', text => 'Started emerge on: May 06, 2013 21:02:17' }, { ref_id => '1367874137', text => 'emerge --jobs --autounmask-write --... type => '***' },
• 33. Prefix alternations look ugly. ● Using a count works: [*]{3} | [>]{3} | [:]{3} | [=]{3} but isn't all that much more readable. ● Given the way these are used, a character class is cleaner: [*>:=]{3}
• 34. qr { <nocontext:> <data> <rule: data > <[entry]>+ <rule: entry > <ws:[\s:]*> <ref_id> <prefix>? <text> <token: ref_id > ^(\d+) <token: prefix > [*>=:]{3} <token: text > .+ }xm; This is the skeleton parser: ● Doesn't take much: – Declarative syntax. – No Perl code at all! ● Easy to modify by extending the definition of “text” for specific types of messages.
• 35. Finishing the parser ● Given the different line types it will be useful to extract commands, switches, outcomes from appropriate lines, as sketched below. – Sub-rules can be defined for the different line types. <rule: command> “emerge” <.ws><[switch]>+ <token: switch> ([-][-]\S+) ● This is what makes the grammars useful: nested, context-sensitive content.
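A sketch of how that sub-rule might slot into the skeleton parser (the alternation and the exact rule layout are assumptions, not the finished parser from the talk):

    qr {
        <nocontext:>
        <data>
        <rule:  data    >   <[entry]>+
        <rule:  entry   >   <ws:[\s:]*> <ref_id> <prefix>? ( <command> | <text> )
        <rule:  command >   emerge <.ws> <[switch]>+
        <token: switch  >   ([-][-]\S+)
        <token: ref_id  >   ^(\d+)
        <token: prefix  >   [*>=:]{3}
        <token: text    >   .+
    }xm;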
• 36. Inheriting & Extending Grammars ● <grammar: name> and <extends: name> allow a building-block approach. ● Code can assemble the contents of a qr{} without having to eval or deal with messy quoted strings. ● This makes modular or context-sensitive grammars relatively simple to compose, as in the sketch below. – References can cross package or module boundaries. – Easy to define a basic grammar in one place and reference or extend it from multiple other parsers.
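A minimal sketch of the building-block approach (the grammar and rule names here are invented for illustration):

    # reusable base grammar: defines rules but cannot match on its own
    do {
        use Regexp::Grammars;
        qr {
            <grammar: My::Base>
            <token: num > \d+
        }x;
    };

    # derived parser: supplies an entry rule, so it can produce results
    my $parser = do {
        use Regexp::Grammars;
        qr {
            <extends: My::Base>
            <nums>
            <rule: nums > <[num]>+ % [,]
        }x;
    };

    '1, 2, 3' =~ $parser;  # %/ holds nums => { num => [ 1, 2, 3 ] } (plus context keys)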
• 37. The Non-Redundant File ● NCBI's “nr.gz” file is a list of sequences and all of the places they are known to appear. ● It is moderately large: 140+GB uncompressed. ● The file consists of a simple FASTA format with headings separated by ctrl-A characters: >Heading 1 [amino-acid sequence characters...] >Heading 2 ...
• 38. Example: A short nr.gz FASTA entry ● Headings are grouped by species, separated by ctrl-A (“\cA”) characters. – Each species has a set of source & identifier pairs followed by a single description. – The within-species separator is a pipe (“|”) with optional whitespace. – Species counts in some headers run into the thousands. >gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQ... KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEK... VQKLLNPDQ
• 39. First step: Parse FASTA qr { <grammar: Parse::Fasta> <nocontext:> <rule: fasta > <.start> <head> <.ws> <[body]>+ <rule: head > .+ <.ws> <rule: body > ( <[seq]> | <.comment> ) <.ws> <token: start > ^ [>] <token: comment > ^ [;] .+ <token: seq > ^ [\n\w-]+ }xm; ● Instead of defining an entry rule, this just defines a name “Parse::Fasta”. – This cannot be used to generate results by itself. – Accessible anywhere via Regexp::Grammars.
  • 40. The output needs help, however. ● The “<seq>” token captures newlines that need to be stripped out to get a single string. ● Munging these requires adding code to the parser using Perl's regex code-block syntax: (?{...}) – Allows inserting almost-arbitrary code into the regex. – “almost” because the code cannot include regexen. seq => [ 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYD KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDP VQKLLNPDQ ' ]
• 41. Munging results: $MATCH ● The $MATCH and %MATCH can be assigned to alter the results from the current or lower levels of the parse. ● In this case I take the “seq” match contents out of %/, join them with nothing, and use “tr” to strip the newlines. – join + split won't work because split uses a regex. <rule: body > ( <[seq]> | <.comment> ) <.ws> (?{ $MATCH = join '' => @{ delete $MATCH{ seq } }; $MATCH =~ tr/\n//d; })
  • 42. One more step: Remove the arrayref ● Now the body is a single string. ● No need for an arrayref to contain one string. ● Since the body has one entry, assign offset zero: body => [ 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDK DNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDT KDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ' ], <rule: fasta> <.start> <head> <.ws> <[body]>+ (?{ $MATCH{ body } = $MATCH{ body }[0]; })
  • 43. Result: a generic FASTA parser. { fasta => [ { body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDK DNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDIT KDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ', head => 'gi|66816243|ref|XP_642131.1| hypothetical p rotein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556 |sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=C AF-1gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ' } ] } ● The head and body are easily accessible. ● Next: parse the nr-specific header.
• 44. Deriving a grammar ● Existing grammars are “extended”. ● The derived grammars are capable of producing results. ● In this case the derived parser references the grammar and extracts a list of fasta entries (full sketch below): <extends: Parse::Fasta> <[fasta]>+
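Put together, the derived parser is only a few lines (a sketch, assuming the Parse::Fasta grammar was compiled earlier in the same program):

    my $nr = do {
        use Regexp::Grammars;
        qr {
            <nocontext:>
            <extends: Parse::Fasta>
            <[fasta]>+
        }xm;
    };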
• 45. Splitting the head into identifiers ● Overriding fasta's “head” rule allows splitting identifiers for individual species. ● Catch: \cA is a separator, not a terminator. – The tail item on the list doesn't have a \cA to anchor on. – Using “.+ [\cA\n]” walks off the header onto the sequence. – This is a common problem with separators & tokenizers. – This can be handled with special tokens in the grammar, but R::G provides a cleaner way.
• 46. First pass: Literal “tail” item ● This works but is ugly: – Have two rules for the main list and tail. – Alias the tail to get them all in one place. <rule: head> <[ident]>+ <[ident=final]> (?{ # remove the matched anchors tr/\cA\n//d for @{ $MATCH{ ident } }; }) <token: ident > .+? \cA <token: final > .+ \n
  • 47. Breaking up the header ● The last header item is aliased to “ident”. ● Breaks up all of the entries: head => { ident => [ 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]', 'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1', 'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]', 'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]' ] }
• 48. Dealing with separators: '%' <sep> ● Separators happen often enough: – 1, 2, 3 , 4 ,13, 91 # numbers by commas, spaces – g-c-a-g-t-t-a-c-a # characters by dashes – /usr/local/bin # basenames by dir markers – /usr:/usr/local:bin # dir's separated by colons that R::G has special syntax for dealing with them. ● Combining the item with '%' and a separator: <rule: list> <[item]>+ % <separator> # one-or-more <rule: list_zom> <[item]>* % <separator> # zero-or-more
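For example, a toy comma-separated number list (an illustration, not from the talk's parser):

    my $csv = do {
        use Regexp::Grammars;
        qr {
            <list>
            <rule: list > <[item]>+ % [,]
            <token: item > \d+
        }x;
    };

    '1, 2, 3 , 4 ,13, 91' =~ $csv;
    # %/ contains list => { item => [ 1, 2, 3, 4, 13, 91 ] }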
• 49. Cleaner nr.gz header rule ● Separator syntax cleans things up: – No more tail rule with an alias. – No code block required to strip the separators and trailing newline. – Non-greedy match “.+?” avoids capturing separators. qr { <nocontext:> <extends: Parse::Fasta> <[fasta]>+ <rule: head > <[ident]>+ % [\cA] <token: ident > .+? }xm
• 50. Nested “ident” tag is extraneous ● Simpler to replace the “head” with a list of identifiers. ● Replace $MATCH from the “head” rule with the nested identifier contents: qr { <nocontext:> <extends: Parse::Fasta> <[fasta]>+ <rule: head > <[ident]>+ % [\cA] (?{ $MATCH = delete $MATCH{ ident }; }) <token: ident > .+? }xm
• 51. Result: { fasta => [ { body => 'MASTQNIVEEVQKMLDT...NPDQ', head => [ 'gi|66816243|ref|XP_6...rt=CAF-1', 'gi|793761|dbj|BAA0626...oideum]', 'gi|60470106|gb|EAL68086...m discoideum AX4]' ] } ] } ● The fasta content is broken into the usual “body” plus a “head” broken down on \cA boundaries. ● Not bad for a dozen lines of grammar with a few lines of code.
• 52. One more level of structure: idents. ● Species have <source> | <identifier> pairs followed by a description. ● Add a separator clause “% (?: \s* [|] \s* )” – This can be parsed into a hash something like: gi|66816243|ref|XP_642131.1|hypothetical ... Becomes: { gi => '66816243', ref => 'XP_642131.1', desc => 'hypothetical...' }
• 53. Munging the separated input <fasta> (?{ my $identz = delete $MATCH{ fasta }{ head }{ ident }; for( @$identz ) { my $pairz = $_->{ taxa }; my $desc = pop @$pairz; $_ = { @$pairz, desc => $desc } } $MATCH{ fasta }{ head } = $identz; }) <rule: head > <[ident]>+ % [\cA] <token: ident > <[taxa]>+ % (?: \s* [|] \s* ) <token: taxa > .+?
  • 54. Result: head with sources, “desc” { fasta => { body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKR...EDQN', head => [ { desc => '30S ribosomal protein S18 [Lactococ... gi => '15674171', ref => 'NP_268346.1' }, { desc => '30S ribosomal protein S18 [Lactoco... gi => '116513137', ref => 'YP_812044.1' }, ...
• 55. Balancing R::G with calling code ● The regex engine could process all of nr.gz. – Catch: <[fasta]>+ returns about 250_000 keys and literally millions of total identifiers in the heads. – Better approach: <fasta> on single entries, but chunking input on '>' removes it as a leading character. – Making it optional with <.start>? fixes the problem: local $/ = '>'; while( my $chunk = readline ) { chomp; length $chunk or do { --$.; next }; $chunk =~ $nr_gz; # process single fasta record in %/ }
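A fuller sketch of that driver loop (the 'gzip -dc' pipe open and the $nr_gz variable name are assumptions):

    open my $fh, '-|', 'gzip -dc nr.gz' or die "nr.gz: $!";

    local $/ = '>';             # chunk the input on record starts

    while( my $chunk = <$fh> )
    {
        chomp $chunk;           # strip the trailing '>'
        length $chunk or do { --$.; next };

        $chunk =~ $nr_gz
            or warn "Unparsed chunk at record $.\n";

        # single fasta record is now in %/ -- process it here
    }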
• 56. Fasta base grammar: 3 lines of code qr { <grammar: Parse::Fasta> <nocontext:> <rule: fasta > <.start>? <head> <.ws> <[body]>+ (?{ $MATCH{ body } = $MATCH{ body }[0]; }) <rule: head > .+ <.ws> <rule: body > ( <[seq]> | <.comment> ) <.ws> (?{ $MATCH = join '' => @{ delete $MATCH{ seq } }; $MATCH =~ tr/\n//d; }) <token: start > ^ [>] <token: comment > ^ [;] .+ <token: seq > ^ ( [\n\w-]+ ) }xm;
• 57. Extension to Fasta: 6 lines of code. qr { <nocontext:> <extends: Parse::Fasta> <fasta> (?{ my $identz = delete $MATCH{ fasta }{ head }{ ident }; for( @$identz ) { my $pairz = $_->{ taxa }; my $desc = pop @$pairz; $_ = { @$pairz, desc => $desc }; } $MATCH{ fasta }{ head } = $identz; }) <rule: head > <[ident]>+ % [\cA] <rule: ident > <[taxa]>+ % (?: \s* [|] \s* ) <token: taxa > .+? }xm
• 58. Result: Use grammars ● Most of the “real” work is done under the hood. – Regexp::Grammars does the lexing, basic compilation. – Code is only needed for cleanups or re-arranging structs. ● Code can simplify your grammar. – Too much code makes them hard to maintain. – The trick is keeping the balance between simplicity in the grammar and cleanup in the code. ● Either way, the result is going to be more maintainable than hardwiring the grammar into code.
  • 59. Aside: KwikFix for Perl v5.18 ● v5.17 changed how the regex engine handles inline code. ● Code that used to be eval-ed in the regex is now compiled up front. – This requires “use re 'eval'” and “no strict 'vars'”. – One for the Perl code, the other for $MATCH and friends. ● The immediate fix for this is in the last few lines of R::G::import, which push the pragmas into the caller: ● Look up $^H in perlvars to see how it works. require re; re->import( 'eval' ); require strict; strict->unimport( 'vars' );
• 60. Use Regexp::Grammars ● Unless you have old YACC BNF grammars to convert, the newer facility for defining grammars is cleaner. – Frankly, even if you do have old grammars... ● Regexp::Grammars avoids the performance pitfalls of P::RD. – It is worth taking the time to learn how to optimize NFA regexen, however. ● Or, better yet, use Perl6 grammars, available today in your local copy of Rakudo Perl6.
• 61. More info on Regexp::Grammars ● The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested]. ● The ./demo directory has a number of working – if un-annotated – examples. ● “perldoc perlre” shows how recursive matching works in v5.10+. ● PerlMonks has plenty of good postings. ● Perl Review article by brian d foy on recursive matching in Perl 5.10.