
LINUX GAZETTE


"Linux Gazette...making Linux just a little more fun!"


Learning Perl, part 2

By Ben Okopnik


"I realized at that point that there was a huge ecological niche between the C language and Unix shells. C was good for manipulating complex things  - you can call it 'manipulexity.' And the shells were good at whipping up things - what I call 'whipupitude.' But there was this big blank area where neither C nor shell were good, and that's where I aimed Perl."
 -- Larry Wall, author of Perl

Overview

In the first part, we talked about some basics and general issues in Perl - writing a script, hash-bangs, style - as well as a number of specifics, such as scalars, arrays, hashes, operators, and quoting methods. This month, we'll take a look at the intrinsic Perl tools that make it so easy to use from the command line, as well as their equivalents in scripts. We'll also go a little deeper into quoting methods, and get a bit of a start on regexes (regular expressions, or REs) - one of the most powerful tools in Perl, and one that deserves an entire book all its own. [1]
 
 

Quote Mechanisms

Most of you will be familiar with the standard quoting mechanisms in Unix: the single and the double quote, which I'd already mentioned in my previous article, have much the same functionality in Perl as they do in the shell. Sometimes, though, escaping all the in-line metacharacters can be a bit painful. Imagine trying to print a string like this:

``/// Don't say "shan't," "can't," or "won't." ///''

Good grief! What can we do with a mess like that?

Well, we could put in a whole bunch of escapes ("\"), but that would be a pain - as well as a case of the LTS ("Leaning Toothpick Syndrome"):

print '\`\`\/\/\/ Don\'t...

<shudder> Obviously not a good answer. For times like these, Perl provides alternate quoting mechanisms:

q//        # Single quotes
qq//       # Double quotes
qx//       # Back quotes, for shell execution
qw//       # Word list - useful for populating arrays

Note also that the delimiter does not have to be '/'; it can be almost any non-alphanumeric, non-whitespace character, and bracketing pairs such as '{}' or '()' work too. Now our job becomes a bit easier:

print q-``/// Don't say "shan't," "can't," or "won't." ///''-;

Simple, eh? By the way, this is something you would use only inside a script; the shell interpretation mechanism would make a horrendous mess of this if you tried it from the command line, especially things like back quotes and slashes.
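
Here is a minimal sketch of the four operators in a script. The delimiters, the variable names, and the "echo ahoy" command are arbitrary choices made for illustration, and the "qx" line assumes a Unix-like shell with "echo" available:

```perl
#!/usr/bin/perl -w
use strict;

# q// behaves like single quotes: nothing inside is interpolated.
# The delimiter is '-', chosen because it does not occur in the text.
my $plain = q-``/// Don't say "shan't," "can't," or "won't." ///''-;
print "$plain\n";

# qq// behaves like double quotes: variables and escapes are interpolated.
# Paired delimiters such as {} are allowed as well.
my $who      = "Larry";                      # illustrative value only
my $greeting = qq{"Hello," said $who.};
print "$greeting\n";

# qw// splits its contents on whitespace and returns a list of words.
my @pirates = qw/Anne Bartholomew Charles/;
print scalar(@pirates), " names: @pirates\n";

# qx// runs a shell command and returns its output, like back quotes.
my $echoed = qx/echo ahoy/;
print $echoed;
```

Picking a delimiter that never appears in the text is the whole trick: no escapes are needed, so there are no toothpicks to lean.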
 
 

Perl Invocation

"Hear my plea, O Perl of Great Wisdom!" Oh, never mind; I think that was standard in Perl3, and is now deprecated... :)

The most commonly-used switch in invoking Perl, if you're running it from the command line, is '-e'; this one tells Perl to execute the argument that comes immediately after it as code. In fact, '-e' must be the last switch in a bundled group such as '-we', because whatever follows it is taken to be the script!

perl -we 'print "The Gods send thread for the Web begun.\n"'

"-w" is the "warn" switch that I mentioned the last time. It tells you about all the non-fatal errors in your code, including variables that you set but didn't use (invaluable for finding mistyped variable names) as well as many, many other things. You should always - yes, always - use "-w", whether on the command line or in a script.

"-n" is the "non-printing loop" switch, which causes Perl to iterate over the input, one line at a time - somewhat like "awk". If you want to print a given line, you'll need to specify a condition for it:

perl -wne 'print if /holiday/' schedule.txt

Perl will loop through "schedule.txt" and print any line that contains the word "holiday", so you can get depressed about how little time off you actually have.
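
For comparison, here is roughly what that one-liner looks like as a script. This is a sketch rather than the exact code Perl generates: the sample text stands in for "schedule.txt" (its contents are made up), and an in-memory filehandle replaces the "while (<>)" loop that "-n" really supplies, so that the example runs on its own:

```perl
#!/usr/bin/perl -w
use strict;

# Made-up stand-in for schedule.txt, so the sketch is self-contained.
my $schedule = "Mon: work\nTue: holiday\nWed: work\nThu: holiday\n";
open(my $in, '<', \$schedule) or die "Cannot open sample data: $!";

# On the command line, -n supplies this loop for you (as "while (<>)",
# which reads the files named as arguments). Each line is placed in $_,
# the default variable discussed later in this article.
my @found;
while (<$in>) {
    push @found, $_ if /holiday/;    # same test as in the one-liner
}
close $in;

print @found;                        # only the lines that matched
```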

"-p" is the invocation for a "printing loop", which acts just like "-n" except that it prints every line that it loops over. This is very useful for "sed"-like operations, like modifying a file and writing it back out (we'll discuss 's///', the substitution operator, in just a bit):

perl -wpe 's/holiday/Party time!/' schedule.txt

This will perform the substitution on the first occurrence of the word 'holiday' in any given line (see "perldoc perlop" for discussion of the modifiers used with 's///', such as 'g'lobal).
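
A script version of the printing loop might look like the sketch below. The sample text is made up, and the output is collected in a variable (an assumption made here so the result can be inspected) where "-p" would simply print each line:

```perl
#!/usr/bin/perl -w
use strict;

# Made-up stand-in for schedule.txt.
my $schedule = "holiday today, holiday tomorrow\nback to work\n";
open(my $in, '<', \$schedule) or die "Cannot open sample data: $!";

# -p supplies the same loop as -n, plus a "print" after every pass.
my $result = '';
while (<$in>) {
    s/holiday/Party time!/;    # first match on each line; 's///g' hits all
    $result .= $_;             # collected here; -p would print it directly
}
close $in;

print $result;
```

Note that the second line passes through untouched: the loop emits every line, whether or not the substitution changed it.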

The "-i" switch works well in combination with either of the above, depending on the desired action; it allows you to perform an "in-place" edit, i.e. make the changes in the specified file (optionally performing a backup beforehand) rather than printing them out to the screen. Note that we can't just tack an "i" onto the "wpe" string: it takes an optional argument - the extension to be appended to the backup copy - and the text that follows it is what specifies that extension.

perl -i~ -wpe 's/holiday/Party time!/' schedule.txt

The above line will produce a "schedule.txt" with the modified text in it, and a "schedule.txt~" that is the original file. "-i" without any extension overwrites the original file; this is far more convenient than producing a modified file and renaming it back to the original, but be sure that your code is correct, or you'll wipe out your original data!
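
Inside a script, the same effect is available through the special variable "$^I", which holds the backup extension. The sketch below is one way to do it, not the only one; it builds a throwaway copy of "schedule.txt" in a temporary directory (the file contents are made up) so that no real data is at risk:

```perl
#!/usr/bin/perl -w
use strict;
use File::Temp qw(tempdir);

# Work in a scratch directory that is removed when the script exits.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/schedule.txt";          # throwaway sample file
open(my $out, '>', $file) or die "Cannot write $file: $!";
print $out "Monday: work\nTuesday: holiday\n";
close $out;

{
    # $^I holds the backup extension, just as "-i~" would set it;
    # @ARGV lists the files that <> will read and rewrite.
    local $^I   = '~';
    local @ARGV = ($file);
    while (<>) {
        s/holiday/Party time!/;
        print;                           # goes into the file, not the screen
    }
}

# Read the edited file back to see what happened.
open(my $in, '<', $file) or die "Cannot read $file: $!";
my $new = do { local $/; <$in> };        # slurp the whole file at once
close $in;

print $new;
print "Backup kept as schedule.txt~\n" if -e "$file~";
```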
 
 

RegExes, or "Has The Cat Been Walking On My Keyboard Again?"

One of the most powerful tools available in Perl, the regular expression is the way to match almost any imaginable character arrangement. Here (necessarily) I'll cover only the very basics; if you find that you need more information, dig into the "perlre" manpage that comes with Perl. That should keep you busy for a while. :)

REs are used for pattern matching, most commonly with the "m//" (matching) and "s///" (substitution) operators. Note that the delimiters in these, just like in the quoting mechanisms, are not restricted to '/'; in fact, the leading 'm' in the matching operator is required only if a non-default delimiter is used. Otherwise, just the "//" is sufficient.
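
A short sketch of why alternate delimiters are worth having. The path is an arbitrary sample string, and the '=~' used here is the binding operator covered under "The Default Buffer/Variable":

```perl
#!/usr/bin/perl -w
use strict;

my $path = "/usr/local/bin/perl";        # arbitrary sample string

# With the default '/' delimiter, every slash in the pattern must be
# escaped, and the leading 'm' is optional.
my $with_slashes = ($path =~ /\/usr\/local\//) ? 1 : 0;

# With another delimiter, the 'm' is required but the escapes disappear.
my $with_braces  = ($path =~ m{/usr/local/}) ? 1 : 0;

# 's' accepts alternate delimiters too; '#' is a common choice for paths.
(my $moved = $path) =~ s#/usr/local#/opt#;

print "$with_slashes $with_braces $moved\n";
```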

Here are some of the metacharacters used with REs. Note that there are many more; these are just enough to get us started:

.        Matches any character except the newline
^        Match the beginning of the line
$        Match the end of the line
|        Alternation (match "left|right|up|down|sideways")
*        Match 0 or more times
+        Match 1 or more times
?        Match 0 or 1 times
{n}      Match exactly n times
{n,}     Match at least n times
{n,m}    Match at least n but not more than m times
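
To get a feel for these, it may help to run a few patterns against short strings. The patterns and strings below are invented for this sketch; each row records whether a match is expected, and the loop counts any surprises:

```perl
#!/usr/bin/perl -w
use strict;

my @checks = (
    # pattern         text          should it match?
    [ '^b.t$',        'bat',        1 ],   # . is any one character
    [ '^b.t$',        'boat',       0 ],   # ...but only one
    [ 'left|right',   'turn right', 1 ],   # alternation
    [ '^ab*c$',       'ac',         1 ],   # * allows zero b's
    [ '^ab+c$',       'ac',         0 ],   # + needs at least one
    [ '^ab?c$',       'abbc',       0 ],   # ? allows at most one
    [ '^ab{2}c$',     'abbc',       1 ],   # exactly two
    [ '^ab{2,}c$',    'abbbbc',     1 ],   # two or more
    [ '^ab{1,2}c$',   'abbbc',      0 ],   # one or two, not three
);

my $failures = 0;
for my $check (@checks) {
    my ($pattern, $text, $expected) = @$check;
    my $matched = ($text =~ /$pattern/) ? 1 : 0;
    printf "%-14s %-12s %s\n", "/$pattern/", $text,
        $matched ? "match" : "no match";
    $failures++ if $matched != $expected;
}
print "unexpected results: $failures\n";
```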
 

As an example, let's say that we have a file with a list of names:

Anne Bonney
Bartholomew Roberts
Charles Bellamy
Diego Grillo
Edward Teach
Francois Gautier
George Watling
Henry Every
Israel Hands
John Derdrake
KuoHsing Yeh
...

and we want to replace the first name with 'Captain'. Obviously, we would go through the file with a printing loop and do a substitution if it matched our criteria:

s/^.+ /Captain /;

The caret ('^') matches at the beginning of the line, the ".+" says "any character, repeated 1 or more times", and the space matches a space. Once we find what we're looking for, we're going to replace it with 'Captain' followed by a space - since the string that we're replacing contains one, we'll need to put it back.
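
Here is that substitution applied to a few of the names, as a small sketch. A list in the script stands in for the file, and the '=~' operator (explained under "The Default Buffer/Variable") points the substitution at a named variable instead of the current line:

```perl
#!/usr/bin/perl -w
use strict;

my @names = ("Anne Bonney", "Edward Teach", "Henry Every");

for my $name (@names) {
    # $name is an alias for the list element, so the list itself changes.
    $name =~ s/^.+ /Captain /;
}

print join("\n", @names), "\n";
```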

Let's say that we also knew that somewhere in the file, there are a couple of names that contain apostrophes (Francois L'Ollonais), and we wanted to skip them - or anything else that contained 'non-letter' characters. Let's expand the regex a bit:

s/^[A-Z][a-z]* /Captain /;

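We've used the "character class" specifiers, "[]", to first match one character between 'A' and 'Z' - note that only one character is matched by this mechanism, a very important distinction! - followed by a one-character match of 'a' through 'z' and an asterisk, which, again, says "zero or more of the preceding character". Keep in mind that this pattern examines only the first name: a line such as "Francois L'Ollonais", where the apostrophe is in the surname, would still be changed unless the regex also described the rest of the line.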

Oops, wait! How about "KuoHsing"? The match would fail on the 'H', since upper-case characters were not included in the specified range. OK, we'll modify the regex:

s/^\w* /Captain /;

The '\w' is a "word character" - once again, it matches only one character - that includes 'A-Z', 'a-z', '0-9', and '_'. It is preferable to [A-Za-z0-9_] because, when "use locale" is in effect, it uses the system's locale settings (see "perldoc perllocale") to determine what characters should or should not be part of words - and this is important in languages other than English. As well, '\w' is easier to type than '[A-Za-z0-9_]'.
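
The difference between the last two patterns is easy to see side by side. In the sketch below, the last two names are made up purely to exercise the patterns (one has an apostrophe in the first name, the other has digits):

```perl
#!/usr/bin/perl -w
use strict;

# The last two names are invented test cases, not from the pirate list.
my @names = ("Anne Bonney", "KuoHsing Yeh", "O'Malley Grace", "R2D2 Unit");

my (%by_class, %by_word);
for my $name (@names) {
    $by_class{$name} = ($name =~ /^[A-Z][a-z]* /) ? "yes" : "no";
    $by_word{$name}  = ($name =~ /^\w* /)         ? "yes" : "no";
    printf "%-15s [A-Z][a-z]*: %-3s  \\w*: %s\n",
        $name, $by_class{$name}, $by_word{$name};
}
```

Both patterns reject the apostrophe, but only '\w' accepts the embedded capital and the digits.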

Let's try something a bit different: What if we still wanted to match all the first names, but now, rather than replacing them, we wanted to swap them around with the last names, separate the two with a comma, and precede the last name with the word 'Captain'? With regexes at our command, it's not a problem:

s/^(\w*) (\w*)$/Captain $2, $1/;

Note the parentheses and the "$1" and "$2" variables: the parentheses "capture" the enclosed part of the regex, which we can then refer to via the variables (the first captured piece is $1, the second is $2, and so on.) So, here is the above regex in English:

Starting from the beginning of the line, (begin capture into $1) match any "word character" repeated zero or more times (end capture) and followed by a space, (begin capture into $2) followed by any "word character" repeated zero or more times (end capture) until the end of the line. Return the word 'Captain' followed by a space, which is followed by the value of $2, a comma, a space, and the value of $1.
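
As a quick sketch, here is the swap applied to three of the names mentioned in this article. Because the pattern is anchored at both ends, a name that doesn't fit the "word, space, word" shape is left alone:

```perl
#!/usr/bin/perl -w
use strict;

my @names = ("Anne Bonney", "Edward Teach", "Kathleen O'Hara-Mears");

for my $name (@names) {
    # $1 is the first name, $2 the last; both must consist of \w only.
    $name =~ s/^(\w*) (\w*)$/Captain $2, $1/;
}

print join("\n", @names), "\n";
```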

I'd say that regexes are a very compact way to say all of the above. At times like these, it becomes pretty obvious that Larry Wall is a professional linguist. :)

These are just simple examples of what goes into building a regex. I must admit to cheating a bit: name-parsing is probably one of the biggest challenges out there, and I could have spun these examples out as long as I wanted. Considering that the possibilities include "John deJongh", "Jan M. van de Geijn", "Kathleen O'Hara-Mears", "Siu Tim Au Yeung", "Nang-Soa-Anee Bongoj Niratpattanasai", and "Mjölby J. de Wærn" (remember to use those LOCALE-aware matches, right?), the field is pretty broad and very odd in spots. (Miss Niratpattanasai, after looking at something like "John Smith", would probably agree. :)
 

Here's an important factor to be aware of in the regex mechanism: by default, it does "greedy matching". In other words, given a phrase like

Acciones son amores, no besos ni apachurrones

and a regex like

/A.*es/

it would match the following:

Acciones son amores, no besos ni apachurrones
|___________________________________________|

Hmmm. Everything from the first 'A' (followed by zero or more of any character) to the last 'es'. How can we match just the first instance, then? To counteract the greed, Perl provides a "generosity" modifier to quantifiers such as '*', '+', and '?':

/A.*?es/

Acciones son amores, no besos ni apachurrones
|______|

There. Much better. For future reference, remember: if you're breaking up a string by matching its pieces with a series of regexes, and the last "chunks" are coming up empty, you've probably got a "greed" problem.
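
The two behaviors can be compared directly by capturing what each pattern matched. The parentheses in this sketch are added only to grab the matched text; the patterns are otherwise the ones shown above:

```perl
#!/usr/bin/perl -w
use strict;

my $phrase = "Acciones son amores, no besos ni apachurrones";

# In list context, a match returns the captured text.
my ($greedy) = $phrase =~ /(A.*es)/;     # runs to the last "es"
my ($lazy)   = $phrase =~ /(A.*?es)/;    # stops at the first "es"

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```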
 
 

The Default Buffer/Variable

Some of you, especially those who have done some programming in the past, have probably been curious about some of the code constructs above, like

print if /holiday/;

"Print what if what? Where is the variable that we're checking for the match? Shouldn't it be something like 'if $x == /holiday/', the way it is in the shell?"

I'm glad you asked that question. :)

Perl uses an interesting concept, found in a few other languages, of the default buffer - also referred to as the default variable and the default pattern space. Not surprisingly, it's used in the looping constructs - when we use the "-n/-p" syntax in the Perl invocation, it is the variable used to hold the current line - as well as in substitution and matching, and a number of other places. The '$_' variable is the default for all of the above; when a variable is not specified in a place where you'd expect one, '$_' is usually the "culprit." In fact, '$_' is rather difficult to explain - it turns up in so many places that coming up with an algorithm is seemingly impossible - but it is wonderfully easy and intuitive to use, once you get the idea.

Consider the following:

perl -wne 'if ( $_ =~ /Henry/ ) { print $_; }' pirates

If a line in the "pirates" file, above, matches "Henry", it will be printed. Fine; but now, let's play some amateur "Perl Golf" - that's a contest among Perl hackers to see how many (key)strokes can be taken off a piece of code and still leave it functional.

Since we already know that Perl reads each line into '$_', we'll just get rid of all the explicit declarations of it:

perl -wne 'if ( /Henry/ ) { print; }' pirates

Perl "knows" that we're matching against the default variable, and it "knows" that the "print" statement applies to the same thing. Now, we apply a little Perl idiom:

perl -wne 'print if /Henry/' pirates

Isn't that nice? Perl actually allows you to write out your code with the condition following the action; kinda the way you'd say things in English. Oh, and we've snipped off the semicolon on the end because we don't need it: it's a statement separator, and there's no statement following "/Henry/".

<grin> For those of you playing along at home, try

perl -ne'/Henry/&&print' pirates

It shouldn't be that hard to figure out; the '&&' operator in Perl works the same way as it does in the shell. Perl Golf is fun to play, but be careful: it's easy to write code that will work but will require lots of head-scratching to understand. Don't Do That. I may have to maintain your code tomorrow... just like you may have to maintain mine.
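
To convince yourself that all four versions do the same job, they can be run side by side in a script. In this sketch a "for" loop without a loop variable plays the part of "-n" (it also puts each item into '$_'), and the matches are pushed onto arrays (an assumption made here so the results can be compared) instead of being printed:

```perl
#!/usr/bin/perl -w
use strict;

my @pirates = ("Edward Teach", "Henry Every", "Israel Hands");
my (@explicit, @implicit, @idiom, @golfed);

for (@pirates) {                  # no loop variable, so each name is in $_
    if ( $_ =~ /Henry/ ) { push @explicit, $_; }   # everything spelled out
    if ( /Henry/ )       { push @implicit, $_; }   # match defaults to $_
    push @idiom, $_ if /Henry/;                    # condition after action
    /Henry/ && push @golfed, $_;                   # '&&' short-circuits
}

print "@explicit | @implicit | @idiom | @golfed\n";
```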
 

In the first example, note the "binding operator", '=~', which checks for a match in the supplied variable. This is what you would use if you were matching against a variable other than "$_". There is also a "negative match" operator, '!~', which returns true if the match fails (the inverse of '=~'.)
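
A minimal sketch of both operators at work on a named variable (the variable names are arbitrary):

```perl
#!/usr/bin/perl -w
use strict;

my $captain = "Bartholomew Roberts";     # any variable other than $_

my $is_roberts = ($captain =~ /Roberts/) ? 1 : 0;   # true: pattern found
my $not_teach  = ($captain !~ /Teach/)   ? 1 : 0;   # true: pattern absent

print "matches Roberts: $is_roberts, lacks Teach: $not_teach\n";
```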

Note also that the available modifiers for simple statements, like that above, include not only the "if", but also "unless", "while", "until", and "for". All of these, and more, are coming up in Part 3...
 
 

Ben Okopnik
perl -we '$perl=0;JsP $perl "perl"; $perl->perl(0)'\
 2>&1|perl -ne '{print ((split//)[19,29,20,4,5,1,2,
15,13,14,12,52,5,21,12,52,8,5,14,1,6,37,12,52,75])}'



[1]. And in fact, has one - "Mastering Regular Expressions" by Jeffrey E. F. Friedl is considered to be a reference on the subject. It includes some wonderful examples, and literally teaches the reader to "think in regex".
 

References:

Relevant Perl man pages (available on any pro-Perl-y configured system):

perl      - overview              perlfaq   - Perl FAQ
perltoc   - doc TOC               perldata  - data structures
perlsyn   - syntax                perlop    - operators/precedence
perlrun   - execution             perlfunc  - builtin functions
perltrap  - traps for the unwary  perlstyle - style guide

"perldoc", "perldoc -q" and "perldoc -f"


Copyright © 2001, Ben Okopnik.
Copying license http://www.linuxgazette.net/copying.html
Published in Issue 64 of Linux Gazette, March 2001
