LINUX GAZETTE
...making Linux just a little more fun!
Perl One-Liner of the Month: The Adventure of the Misnamed Files
By Ben Okopnik

A REPORTER'S NOTE

Recently, I became acquainted with a set of documents which journal the adventures and experiences of none other than Woomert Foonly, the world-famous but strangely reticent Hard-Nosed Computer Detective. To the best of my knowledge, the information they contain is authentic. My anonymous correspondent - whom I suspect to be Frink Ooblick, the great man's sidekick and confidant, although my sworn promise forbids me to investigate - had emailed me a short note which engaged my interest, then sent me an encrypted file which took several nights of concerted hacking effort to decrypt (he seems to think that this indicates a sense of humor on his part). This has become the pattern: once in a while, I'll receive a file from a sender whose name matches a complex regular expression (the procmail recipe for this has grown to several pages, and now seems to be mutating on its own). I then have to drop whatever I'm doing and begin hacking at high speed - the encryption method is, in some manner which I've been unable to puzzle out, time-dependent, and becomes much more difficult to break in a few short hours.

In our early exchanges, I had been granted permission to publish this material. My correspondent has stated that he chooses to keep his identity private, and is satisfied to receive no credit for his work. I present it here, though I can't claim authorship, in the hope of casting some light on the work of that great detective whose exploits had until now been shrouded in deepest mystery.

Ben Okopnik
On board S/V "Ulysses", October 10th, 2002


The filesystem was quiet and dark; all the disk writes had been synched, the hard drive had spun down, and the CPU had shifted into low-power mode. Even the usually exuberant Frink seemed subdued on this occasion, and was quietly double-checking their remote-system passwords and permissions, a necessary precaution before they could leave the comfort of their '/home' in the armored SSH transporter.

Woomert, however, felt calm and ready for action. This was where he preferred to operate, in the twilight zone between power modes; in these conditions, even the dreaded Heisenbugs [1] - though his current assignment did not involve anything nearly that dangerous - would be somnolent, and could often be trapped unaware.

His client, severely distraught and sobbing into a lace handkerchief, had confessed that her file naming scheme had gone completely out of control - wild strings had invaded her precious naming convention, formerly so full of sense and intuitively obvious to even the casual user. The employee responsible for this outrage had been severely LARTed [2], but the police detectives had simply shrugged in bafflement, and all other avenues pointed to grim scenarios of manually renaming hundreds, if not thousands, of files. True, the files contained the preferred names enclosed in the HTML '<title>' tags - but the amount of work necessary to do it by hand was staggering. Woomert was her last hope.

 *   *   *   *   *   *   *   *   *   *

Moving quietly, Woomert approached the inode marked "/var/apache/htdocs". Finding it had taken a bit of top-down traversal, but his familiarity with the File::Find module [3] had made short work of that; the client had sobbed out that the rogue file names matched the /^[A-Z][0-9]+\.html?$/ pattern [4] - in other words, they all started with a capital letter followed by one or more digits, and ended with a ".htm" or a ".html" extension. Given that hint, it had taken only seconds to locate the rogues' lair.

As he entered, the disgraceful state of affairs became immediately evident: disreputable-looking filenames hung around on every corner, misbehaving and intimidating the passersby; others, dressed in nothing more than transparent symlinks, leaned out of xterm windows and lewdly propositioned the passing data. Something had to be done, and soon - the newer filenames that came into the area were almost immediately corrupted by the ubiquitous bad examples.

 - "Sheesh, Woomert," whispered Frink, "this looks bad. What are you going to do? There's thousands of them!"

 - "Don't worry, kid." Woomert calmly ambled up to the command line interface, his hat pulled down low against the headlight glare of the heavy HTTP traffic. "I've just downloaded the latest version of Perl. They don't stand a chance." Pulling on his typing gloves, he rapidly executed a one-line command string.


perl -wlne'END{print$n}eof&&$n++;/<title>([^<]+)/i&&$n--' *

The results were astonishing: even as the monitor displayed a large '0', every one of the miscreants suddenly stopped whatever they were doing and whirled around to stare at the two of them. They could obviously sense the sudden danger represented by these two strangers in trenchcoats; the largest of them, an ugly character with "X6664934755666.htm" tattooed on his chest, immediately headed in their direction while reaching into his pocket. His intentions were clearly not related to presenting Woomert and Frink with flowers and the private DSA key to his domain.

 - "Quick, Woomert," cried Frink, "do something! It looks like he's going to throw a Nimda, or even a Code Red!"

Woomert glanced over at his nervous sidekick.

 - "I told you, kid, relax. Number one, we've got Perl..." His lightning-fast fingers drummed out another virtuoso solo on the console:


perl -wlne'/title>([^<]+)/i&&rename$ARGV,"$1.html"' *

 - "...and number two," as the wild scene around them faded, only to reform as a neat, clean neighborhood with neatly arranged files proudly wearing names like 'History 227, Lecture 35: Origins of the Roman Revolution.html', "we're running Linux. Viruses are pretty much someone else's problem."

 *   *   *   *   *   *   *   *   *   *

Later that evening, after they had collected their well-earned fee from the grateful client and were relaxing with a fine high-altitude Lee Shan tea that Woomert had brought back from his recent telnet to the Far East, Frink finally ventured to ask the questions that had been on his mind ever since that fateful showdown.

 - "Woomert, I saw you fire off those command lines, but I couldn't follow what you were doing. I could see the regular expression, and even noticed the implicit loop, but what was all the rest of it?"

 - "Elementary, my dear Frink. If you'll recall the first line..."


perl -wlne'END{print$n}eof&&$n++;/<title>([^<]+)/i&&$n--' *

"...you'll see that I invoked Perl with the following command switches:"
-w    Enable warnings
-l    Enable line-end processing
-n    Implicit non-printing loop
-e    Execute the following commands
"By enabling warnings, I had told Perl to check my syntax, something that should be done every time you run a script. I then specified line-end processing, in effect adding a newline to each printed string. Then, I told it to loop through the contents of each file, and run the string in the single quotes as a script."

"As you had so astutely noted, I had indeed set up a loop. What you may have missed, however, was that there were actually two concurrent loops: I had specified a list of files via the shell filespec of '*', and Perl would read them in, one at a time. It's also important to note that the syntax of the regex inside the quotes which enclose the script looks similar to but is very different from the regex outside - the former is parsed by Perl, using its internal regex engine; the latter is handled by the shell, which uses a far simpler system called 'globbing'."

 - "Wonderful!" Frink was as excited as a young pup on his first hunt. "And what did you do in the script itself?"

"First, I wanted to double-check that my regex actually matched what I thought it should. The easiest way was to count the number of files - I incremented '$n' every time the 'eof' (end-of-file) function returned 'true' - and subtract the number of matches. If the total had been greater than 0, that would have indicated a mismatch somewhere. Fortunately..."

 - "Yes, I remember - it printed a zero."

"Which meant that everything was correct. The 'END{print$n}' statement was executed at the end of the run - note that this is independent of its position in the script, although most people put it at the end. I saved a keystroke by putting it first - a statement that follows a block, as in the case of that 'eof&&$n++', does not require a semicolon. In Perl Golf [5], every stroke matters!"

"Next, let us examine the regex that I'd used; that's the heart of this script."

/         # Begin the regex
<title>   # Match this literal string
([^<]+)   # Capture one or more characters not matching '<' in $1
/i        # End regex, use the "ignore case" modifier

"The '/'s enclose the regex that we're trying to match; that's standard Perl syntax, which you seem to have recognized. See that '+', there? I'm taking advantage of Perl's "greediness" in regex interpretation: '+' means 'one or more of the preceding character or group', but with the implication of 'match as many as possible'. It will grab everything until a literal '<' (beginning of an HTML tag) or the end of the current line. So, every time the pattern matched the line, I updated '$n' by using the logical 'and' coupled with the decrement operator."
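
A reporter's aside: the capture behaviour described here is easy to try on a single line of text. The sketch below is mine rather than Woomert's; the sub name 'title_of' and the sample markup are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the text captured after a "<title>" tag
# on one line, or undef if the line has no such tag.
sub title_of {
    my ($line) = @_;
    return $line =~ /<title>([^<]+)/i ? $1 : undef;
}

# "+" is greedy, so the capture runs up to (but not including) the next "<".
my $closed = title_of('<TITLE>Origins of the Roman Revolution</TITLE>');
print "$closed\n";

# With no "<" left on the line, the capture runs to the end of the string.
my $open = title_of('<title>History 227, Lecture 35');
print "$open\n";
```

Both calls return only the title text; the "/i" modifier is what lets the upper-case tag in the first sample match.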

"As an overview, here is an equivalent script that shows all of the above in a more readable fashion:"


#!/usr/bin/perl -w
while ( <> ){ # Equivalent to "-n"
        $n++ if eof;
        $n-- if /<title>([^<]+)/i;
}
print "$n\n" # The "\n" would have been added by "-l"

"Obviously, this script would be invoked as 'perl <scriptname> *', or simply './scriptname *' if it had been made executable."

"As a final item of note, look at the 'active' statement in the script, the only one that makes any changes or creates any output. It is simply 'print'. In fact, the whole line was a test - I wanted to make certain that everything worked properly before committing actual changes to disk, something I believe to be a wise policy. I could see, from the ugly looks of that crowd, that I would not get two chances at the actual renaming; at least one of them had an 'rm -rf /', and would not have hesitated to use it."

 - "Heavens, Woomert!" Frink's shock was evident in his features. "You must be as brave as a lion, to face something like that."

The famous detective glanced at the shiny stainless-steel and Kevlar "chroot" call that he had extracted from his pocket and smiled.

 - "Well, I had a few tricks held in reserve, anyway. On to the actual renaming, eh? If you remember the expression itself..."


perl -wlne'/title>([^<]+)/i&&rename$ARGV,"$1.html"' *

"...you'll note that much of it resembles the first one; however, there are a few novel features. First, I still used "-l" in the option set, but now the reason was somewhat different: since the captured strings in '$1' were likely to contain a newline ('\n'), we needed a way to get rid of it. Perl is very clever about removing leading and trailing whitespace and handling odd characters when using 'rename', but 'filename\n.html' would cause problems. Fortunately, '-n' also 'auto-chomps' the incoming lines - meaning that it will remove the newline character as the line is read in."

"Next, '$ARGV' is a Perl variable containing the name of the file that is currently being read. Since '$1' held the result of our first capture within the regex (the contents within the first set of parentheses are stored in '$1', the second in '$2', and so on), renaming was an easy task. It would also let us regularize the extensions - 'html' for all of them."

"If I were to spell out the above line in a more conservative - and perhaps more readable - fashion, it would look like this:


#!/usr/bin/perl -w
while ( <> ){
        chomp;                      # Equivalent to "-l"
        if ( /title>([^<]+)/i ){
                rename $ARGV, "$1.html"
        }
}

 - "Since they were bearing down on us, though..."

 - "Precisely; those extra characters could have made the difference between life and death. I must say that I didn't expect them to react so fiercely to a simple match-and-print, but they say that filesystems are getting smarter and smarter - according to a Western guru [6] with whom I once held converse, there were at least five journaling filesystems available for Linux, and I've heard of many FS-related kernel patches since. Fortunately, we were more than equal to the task."
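
A reporter's aside: the test-before-you-commit policy can be carried over to the renaming step itself. The sketch below is my own, not part of the case file. It prints the renames it would perform instead of performing them. The sub name 'planned_name', the refusal of titles containing '/', and the choice to use only the first title line in each file are my assumptions, and differ from the one-liner.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build the new name from one line of HTML, or
# return undef if the line has no usable title.
sub planned_name {
    my ($line) = @_;
    chomp $line;                        # what "-l" does to each input line
    return unless $line =~ /<title>([^<]+)/i;
    my $title = $1;
    return if $title =~ m{/};           # "/" cannot appear in a file name
    return "$title.html";
}

for my $file (@ARGV) {
    open my $fh, '<', $file or next;    # skip unreadable files
    while ( my $line = <$fh> ) {
        my $new = planned_name($line);
        next unless defined $new;
        my $note = ( -e $new ) ? ' (target exists, would be overwritten)' : '';
        print "$file -> $new$note\n";   # report only; nothing is renamed
        last;                           # assumption: first title line wins
    }
    close $fh;
}
```

Once the printed list looks right, the 'print' line can be swapped for a call to 'rename $file, $new'.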

"Now, if you'll be good enough to pass me that Rotterdam redfish paste and that San Francisco sourdough, I'll tell you of the next case that's coming up. Pay attention, young Frink - this promises to be a good one..."
 
 


[1] (From the Jargon File:) A bug that disappears or alters its behavior when one attempts to probe or isolate it.

[2] (From the Jargon File:) Luser Attitude Readjustment Tool (properly applied upside the head of the appropriate clueless person.)

[3] See "perldoc File::Find".
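
As a rough illustration of the kind of search described in the story, here is a minimal File::Find sketch that lists files whose names match the client's pattern. The sub name 'rogue_files' is invented for this example; the story does not show the code Woomert actually ran.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: list every file under $top whose base name matches
# the client's "rogue" pattern.
sub rogue_files {
    my ($top) = @_;
    my @found;
    find(
        sub {
            # Inside the callback, $_ is the base name of the current entry
            # and $File::Find::name is its path, starting from $top.
            push @found, $File::Find::name
                if -f $_ && /^[A-Z][0-9]+\.html?$/;
        },
        $top
    );
    return sort @found;
}

# Search each directory named on the command line.
print "$_\n" for map { rogue_files($_) } @ARGV;
```

Given a directory such as the one in the story, it prints one matching path per line.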

[4] Matching patterns in Perl consist of so-called "regular expressions". For more information on REs, see "perldoc perlre".

[5] Perl Golf is a highly twisted form of Perl programming in which brevity is king, and readability is gleefully thrown out of the nearest window. Woomert is an avid golfer who often produces unreadable (but effective) gibberish in Perl; one-liners (lines of Perl less than 80 characters in length) are often examples of Perl Golf. NOTE: This game is played to impress other Perl hackers, and to produce short but effective command-line tools. Using this in code that others are supposed to work with or code that you write for pay is truly bad form, and can (should!) come back to haunt you.

[6] Per Jim Dennis, 2001. There may be even more today...


Copyright © 2002, Ben Okopnik. Copying license http://www.linuxgazette.net/copying.html
Published in Issue 84 of Linux Gazette, November 2002
