LINUX GAZETTE
...making Linux just a little more fun!
Perl One-Liner of the Month: The Adventure of the Runaway Files
By Ben Okopnik

 - "Well, well - what have we here?"

Woomert Foonly had been working with his collection of rare airplanes, and was concentrating on the finer details of turbocharger gate flows and jet fuel cracking pressures. Nevertheless, the slight noise behind him that heralded an unannounced visitor (Woomert could recognize Frink's step quite well) caused him to instantly spin around and apply a hold from his Pentjak Silat repertoire to the unfortunate sneak, causing the latter to resemble a fancy pretzel (if pretzels could produce choked, squeaking sounds, that is). The question was asked in calm, measured tones, but there was an obvious undertone of "this hold could get much more painful very quickly, so don't waste my time" that changed the helpless squeaking to slightly more useful words.

 - "Ow! I'm - ow! - sorry, Mr. Foonly, but I just had to come see you! I've got this bad problem, and - ow, ow! - I really didn't want anybody to know, and - ouch! - I didn't want to use the front door, 'cause somebody might have spotted me! I didn't mean any - ow! - harm, really!"

Woomert sighed and released his grip, then helped the stranger untangle himself, since he clearly would not be able to, for example, untie his left shoelace from his right wrist - especially since it was tied behind his back. He smiled briefly to himself while working; the old skills were still in shape, and would be there when he really needed them.

 - "Next time, I suggest calling or emailing me ahead of time. The Zigamorph Gang, whom I helped apprehend when I solved the Bank Round-Downs Mystery, is out of prison and threatening various sorts of mayhem; I can handle them and their plotting, but it's just not a smart idea to sneak up on me right now - or at any time. Who are you, anyway?"

The visitor shook himself and made a forlorn attempt at straightening out his rumpled jacket. Since it now resembled a piece of wrung-out laundry, he gave up after a few moments and shook his head mournfully.

 - "Well... my name is Willard Furrfu. You see, Mr. Foonly, I'm working as a data entry operator, but I've been trying to learn some programming skills after work so I can get ahead. I've managed to install a C compiler in my home directory, and I've been experimenting with loops... and I managed to really screw things up. I'm hoping you can help me, because if anybody finds out what happened, I'm toast!"

While Willard was talking, Woomert quickly cleaned up his workbench and closed the plane's cowling. When he was done, he beckoned his guest out of the hangar and into the house. Once inside, he started a pot of tea, then sat down and examined his guest.

 - "Tell me exactly what happened."

 - "Well... I'm not really certain. I wanted to practice some of the stuff I've learned by copying an existing file to a random filename one line at a time; unfortunately, it seems like the function that I wrote looped over the file creation subroutine as well as the line copy function. It took me only a few seconds to realize it and kill the process, but there are now thousands and thousands of files in my home directory where there used to be only fifty or sixty! Worse yet, given the naming scheme for the valid files, it's impossible to tell which ones they are - the names look kinda random in the first place - and I can't even imagine doing this by hand, it's impossible. I don't mind telling you, Mr. Foonly, that I'm in a panic. I tried writing some kind of a function that would loop through and compare each file with every other one in the directory and get rid of the duplicates, but I realized half-way through that, one, I'm not up to that skill level, and two, it adds up to a pretty horrendous number of comparisons overall - I'll never get it done in time. Tomorrow morning, when I'm supposed to enter more data into these files, I'll be in deep, deep trouble - and I'd heard of you and how you've helped people with programming problems before. Please, Mr. Foonly - I don't know what I'll do if you turn me down!"

 - "Hmm. Interesting." Woomert sniffed the brewing tea and closed the lid tightly, then sat down again. "What kind of files are these?"

 - "Text files, all of them."

 - "Are they very large?"

 - "Well, they're all under 100kB, most of them under 50kB. I'd thought of taking one file of each size, but it turns out a number of them are different even though the size is the same."

 - "Do you care what the actual remaining file names are, as long as the files are unique?"

 - "Why, no, not at all - when there are only the original files, I can go through them all in just a few minutes and identify them. Mr Foonly, do you mean that you see a solution to this problem? Is it possible?"

Woomert shrugged.

 - "Let's take a look at it first, shall we? No point in guessing until we have the solid facts in hand. However, it doesn't look all that difficult. You're right in saying that comparing the actual files to each other would be a very long process; tomorrow morning would probably not suffice unless it was a very powerful computer..." At Willard's hangdog look, Woomert went on. "I didn't suppose it was, from the way it sounded. Well, let's give it a shot. How do we get there from here?"

Willard brightened up.

 - "I'd followed a number of your cases in the papers, Mr. Foonly, and knew that you preferred SSH. In fact, I had just convinced our sysadmin to switch to it - we'd been using telnet, and after I showed him some of what you'd said about it (I had to censor it a bit, of course), he became convinced and talked the management into it as well."

 - "Not bad, Willard. You're starting off right - in some ways, anyway. Whatever language you choose to learn, you need to be careful. You never know what the negative effects could be, so until you're at least semi-competent, you need to stay away from live systems. When this is over, I suggest you talk to your sysadmin about setting up a chroot jail, where you can experiment safely without endangering your working environment."

  - "I'll do that, Mr. Foonly, as soon as I get back to the company. Do you think that fixing this will take long?"

 - "Let's see. Go ahead and use that machine over there to log in, and we'll see what it tells us. What do you know - ``ls|wc -l'' says ``27212'', which tells us that's how many files you've got. So far, so good. All right - first of all, what did you call the program that did this?"

 - "Um, ``randfile''. I've still got the source..."

 - "That's good, because we're going to delete it. I'd hate to have you accidentally undo everything after it's fixed! Now, let's see... yep, these look like all text, no problem. Another notch for you, Willard: accurate problem reporting is a good skill to have, and you seem to be doing well. All right then..."


perl -MDigest::MD5=md5 -0we'@a=@ARGV;@h{map{md5($_)}<>}=@a;@b=values%h;print"@b\n"' *

Woomert's fingers flew over the keyboard as he fired off the one-liner. After about a second, he smiled but kept watching the screen - which, after another second or two, printed a list of filenames.

 - "There you are, Willard - a list of unique names. I'm glad your system had the module that I needed - it's a common one, but I wasn't certain. Copy those off to another directory, delete all the others, and copy them back, and you're all done. You could even automate the process by writing..." A mischievous grin flashed over Woomert's face as he paused for a second. "...a program. Well, a one-line shell script, anyway."

 - "That... that's it???" Willard stared in hope and disbelief at the screen where the short list of files beckoned for action. He quickly created a subdirectory in "/tmp", copied the files by carefully using "cp" and backticks around Woomert's script, and scanned them by using "less". When he turned toward Woomert a few seconds later, his face was shining with joy.
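
For readers who want to try the copy step themselves, here is a minimal sketch in Perl. It is not the shell one-liner Woomert hints at (the story never shows it); ``copy_unique'' is a hypothetical helper name, and the sketch digests each file separately rather than relying on the ``-0'' switch, so it should also cope with files that contain nulls. It only copies; deleting anything is left to the reader.

```perl
#!/usr/bin/perl
# Sketch only: copy one file per distinct content into another directory.
# "copy_unique" is a hypothetical helper name, not code from the story.
use strict;
use warnings;
use Digest::MD5;
use File::Copy qw(copy);

sub copy_unique {
    my ($src_dir, $dst_dir) = @_;

    opendir my $dh, $src_dir or die "Can't read $src_dir: $!\n";
    my @names = grep { -f "$src_dir/$_" } readdir $dh;   # plain files only
    closedir $dh;
    @names = sort @names;              # fixed order, so the first name seen wins

    my (%seen, @copied);
    for my $name (@names) {
        open my $fh, '<', "$src_dir/$name" or die "Can't open $name: $!\n";
        binmode $fh;
        my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;

        next if $seen{$digest}++;      # this content was already copied
        copy("$src_dir/$name", "$dst_dir/$name") or die "Can't copy $name: $!\n";
        push @copied, $name;
    }
    return @copied;
}

# Usage: perl copy_unique.pl SOURCE_DIR TARGET_DIR
if (@ARGV == 2) {
    print "$_\n" for copy_unique(@ARGV);
}
```

Run against a scratch copy of the directory first; once the target directory looks right, the originals can be dealt with by hand.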

 - "Mr. Foonly... you've saved me. I promise I'll be far more careful from now on, and I'll talk to our administrator about setting up a - what did you call it, a ``chroot jail''? - anyway, I'm really grateful. How can I ever repay you?"

 - "Well, you could bring me large loads of gold and jewels..." Woomert stopped and laughed at the look of dismay on the young man's face, "just kidding. I have a suggestion for you, though, that you might put some thought into. You seem to have some aptitude for programming - I was just looking at your "randfile.c", and except for the obvious errors, you were doing pretty well. I'd suggest you take a few programming courses at the local vocational school as a start - when you're just starting out, it's difficult to get anywhere, particularly in languages like C and C++ where there are many, many traps and pitfalls for the unwary. They work well for their specific purposes, mind you - but you should have some formal training to understand the background of what you're doing, or you end up with a mess."

 - "A vocational school." Willard seemed struck by the idea. "Say, I never thought of that; I just knew that college was too expensive for me right now, and I wanted to learn somehow. Great idea, Mr. Foonly; I'll run down there and find out what it takes as soon as possible! I'll even put practicing C aside for now, until I do learn some of the background... what about the stuff that you were using? I'd heard of PERL before."

 - "Well, it's not called ``PERL'', since it's not an abbreviation - although some people have come up with back-formations for what it stands for [1]. It's ``Perl'' if you're talking about the language, and ``perl'' for the executable name. Yes, I think that learning Perl would be a very good idea, especially if you're going to back it up with a later study of C; you'll find that it's easy to learn and keep learning, allows you to become competent quickly, and avoids many of the problems of the older languages that have you dealing with abstruse issues like memory management and bad pointers. I'd suggest picking up a good book - be careful, there are many poorly-written books on Perl, but I can definitely recommend ``Learning Perl'' by Randal Schwartz and Tom Phoenix - and studying it. An evening or two of that, and you'll be able to get in trouble even more efficiently than you did with your C program." Woomert grinned at the somewhat woebegone-looking Willard, who finally grinned back.

 - "Well, I've actually read up on it a little bit before, but I'd read all kinds of things on the Net about Perl being hard to read, or hard to understand, so I was a little hesitant about studying it. Actually," Willard looked abashed, "after seeing your code, I know what they mean. Is it always that complicated?"

 - "Not at all. I use these one-liners because I understand Perl well, and because they're not code that I'm leaving for someone else to use. In fact, if you're interested, I can explain what I did and show how it would look in a script."

- "Mr. Foonly, I'd be fascinated. After all, I'm going to be learning this stuff - what better way to start than by hearing you explain it?"

Smiling, Woomert extracted his cell phone from the quick-release waterproof stainless steel holder that he'd recently invented.

"Hold on while I get Frink. He'd like to see this too, I'm sure. Hello, Frink? Got a case here... actually, it's solved already, but you might want to see the method. Ten minutes? See you then." He returned the phone to its holster. "We'll just have some of this excellent brew that I've made up until he gets here. It's a pure, fine-pluck, high-altitude rolled Nepalese tea that's got a wonderful smoky flavor. A cup for you?..."

A bit later, Frink showed up, looking like he'd torn himself away from some project or another. He also looked disappointed, but Woomert immediately forestalled him.

 - "Frink, I know that you strongly prefer to participate in my cases; I do also, since you're now going to be my partner. However, there are times when a case just sneaks up on you and turns into a knotty problem before you can blink, and you have to get things tied up before it loops and replicates itself into some huge number of variables." Both of them glanced over at Willard who was by now unsuccessfully trying to choke down his laughter. "Willard, for example, understands precisely what I mean. Anyway, be assured that I would not have left you out if there was not a time element involved; as it turned out, I was able to solve the problem quickly, but there was always the chance that we'd need every available second. Let me tell you about it and judge for yourself."

A few moments sufficed to explain what had come before, and Frink nodded and smiled at Woomert.

 - "Thanks, Woomert. I was feeling left out, and I appreciate your explaining that. Good communications between partners are important, aren't they? That's a lesson all its own." The two of them grinned at each other before turning to the computer.

 - "Go ahead, Frink. Can you break this one out for Willard? I'll be right here, so if you get stuck, I'll keep it going."

- "All right, then. Let's see." Frink stared at the code on the screen, forehead furrowed in concentration.

perl -MDigest::MD5=md5 -0we'@a=@ARGV;@h{map{md5($_)}<>}=@a;@b=values%h;print"@b\n"' *

 - "All right. ``-MDigest::MD5=md5'' is pretty easy: you're loading the ``Digest::MD5'' module and importing the ``md5'' function from it, just as we've talked about before. ``-we'', we know about - enable warnings and execute what follows as a script. ``-0'', now... ah, I remember - a number as an option is the octal code of the end-of-line definition for the files we're reading in. Oh, I get it! You're effectively disabling the EOL, thus ``slurping'' entire files, one at a time. Right?"
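
An aside for readers who want to see the record separator at work: the sketch below reads the same text with three different values of ``$/''. The helper name ``read_records'' is made up for this illustration, and the text is read from an in-memory filehandle rather than from real files. As far as the perlrun documentation goes, ``-0'' with no digits sets the separator to a null character, while ``-0777'' undefines it altogether.

```perl
#!/usr/bin/perl
# Sketch: how the input record separator $/ changes what one "read" returns.
# "read_records" is a hypothetical helper used only for this illustration.
use strict;
use warnings;

sub read_records {
    my ($text, $separator) = @_;
    local $/ = $separator;             # undef means "no separator": slurp everything
    open my $fh, '<', \$text or die "Can't open in-memory file: $!\n";
    my @records = <$fh>;               # list context: every record at once
    close $fh;
    return @records;
}

my $text = "line one\nline two\nline three\n";

print scalar( read_records($text, "\n") ),  " records with a newline separator\n";
print scalar( read_records($text, "\0") ),  " record with a null separator (plain -0)\n";
print scalar( read_records($text, undef) ), " record with no separator (-0777)\n";
```

With text that contains no nulls, the last two cases behave alike, which is why the one-liner gets away with a bare ``-0'' on text files.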

Woomert silently applauded; Frink grinned and turned back to the screen before him.

 - "Next. You copy @ARGV right at the start - this saves the list of file names so you can re-use them, since @ARGV is going to change as we read in the files. Furthermore, you didn't have to use a BEGIN block to do this since we're not looping the entire script, as we would be with a ``-n'' or a ``-p'' switch. Next... uh, next it gets pretty tricky. I'll admit that you've just lost me, although I can explain what you did further on: you copied the values in the %h hash to an array so you could use Perl's ``pretty print'' mechanism: an array in double-quotes is printed with spaces between the elements, which was what you wanted. The ``\n'' at the end also deserves a comment: normally, you'd use the ``-l'' switch on the command line which would append the EOL to every line that was printed, but you'd redefined EOL as a null, so that wouldn't help - so you had to use the ``\n''. How's that?"
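
The ``pretty print'' behavior Frink describes can be tried in isolation. This is a small sketch with made-up file names; ``spaced'' and ``comma_separated'' are hypothetical helper names. The separator used when an array is interpolated into a double-quoted string lives in the special variable ``$"'', which defaults to a single space.

```perl
#!/usr/bin/perl
# Sketch: an array interpolated in double quotes is joined with $" (a space by default).
use strict;
use warnings;

sub spaced {
    my @items = @_;
    return "@items";                   # elements separated by the default space
}

sub comma_separated {
    my @items = @_;
    local $" = ", ";                   # change the list separator for this scope only
    return "@items";
}

my @names = qw(x7f3a q91bz m44kd);     # made-up file names
print spaced(@names), "\n";            # x7f3a q91bz m44kd
print comma_separated(@names), "\n";   # x7f3a, q91bz, m44kd
```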

 - "Well done, partner. Now, here's the rest of the story - are you following this, Willard? Speak up if you don't understand something. While Frink is ``chanting his beads'', so to speak, and learning in the process, you're our reviewer for this run: if it's not being clearly explained, we'd like to hear from you."

Willard cleared his throat.

 - "Well - actually, I understand it all so far. I'm guessing that a ``module'' is like a C library, and ``Digest::MD5'' probably has to do with, well, generating MD5 sums - I've heard of this but am not really sure of what that means. Other than that, yes, I think I've got it."

Frink spoke up.

 - "An MD5 digest, or sum (sometimes also called a hash), is used as a compact fingerprint for strings, most commonly file contents: identical contents always produce the same digest, and different contents almost never do. If you get a file and its MD5 hash, you can check it using commonly available tools to make sure that the file hasn't changed in any way by generating a new sum from the file and comparing it with the one you've received. In fact, here's a useful little utility that I use to do exactly that, instead of having to visually compare them:

#!/usr/bin/perl
# "md5check" created by Ben Okopnik on Wed Apr 9 21:27:05 EDT 2003
use warnings;
use strict;
use Digest::MD5;

die "Usage: ", $0 =~ /([^\/]+)$/, " <filename> <md5_hex_digest>\n"
    unless @ARGV == 2;

open Fh, shift or die "Can't open: $!\n";
my $d = Digest::MD5 -> new -> addfile( *Fh ) -> hexdigest;

print "MD5 sums ", ($d eq shift) ? "" : "*DO NOT* ", "match.\n"

Makes it a little easier, I think. Anyway, back to Woomert's explanation... I'd like to see how he pulled off this particular trick."
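
To make the idea of a digest concrete, here is a short sketch that works on strings instead of files. ``same_content'' is a hypothetical helper name; ``md5'' and ``md5_hex'' are the functional interface exported by Digest::MD5 (raw 16-byte and 32-digit hexadecimal forms of the same digest).

```perl
#!/usr/bin/perl
# Sketch: equal content gives equal digests; different content (almost certainly) does not.
use strict;
use warnings;
use Digest::MD5 qw(md5 md5_hex);

sub same_content {
    my ($first, $second) = @_;
    return md5($first) eq md5($second);
}

print md5_hex(""), "\n";                   # d41d8cd98f00b204e9800998ecf8427e
print length( md5("any text") ), "\n";     # 16 (raw bytes; md5_hex gives 32 hex digits)
print same_content("data\n", "data\n")  ? "same\n" : "different\n";   # same
print same_content("data\n", "data \n") ? "same\n" : "different\n";   # different
```

Comparing 16-byte digests instead of whole files is what keeps the number of expensive comparisons down: each file is read once, and everything after that is a cheap string comparison.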

Woomert smiled at his partner.

 - "Obviously, you're talking about the ``@h{map{md5($_)}<>}=@a'' bit, right? Yeah, that one is a little complex if you're not used to it. What I did there is use a hash slice to populate %h - it's a neat little idiom to keep in mind. If you think about how a hash is structured:
key1 => value1
key2 => value2
key3 => value3
key4 => value4
key5 => value5
...
you'll see that it's an array of keys which point to an array of values. Consequently, we can treat it as such; as an example, we can create a hash of the alphabet and letters' numerical positions by saying
@alpha{ 1 .. 26 } = "a" .. "z";             # The range operator, '..' generates the two lists
The ``@'' sigil before the hash name simply indicates the context of what is going on; what tells us the type of variable we're using is the curly braces following the variable name, which indicate a hash. If we saw square brackets, we'd know we were dealing with an array slice instead.
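
The alphabet example can be run as it stands. The following sketch wraps it in a hypothetical helper, ``letter_positions'', and adds a slice read and an array slice for comparison.

```perl
#!/usr/bin/perl
# Sketch: filling and reading a hash through slices.
use strict;
use warnings;

sub letter_positions {
    my %alpha;
    @alpha{ 1 .. 26 } = "a" .. "z";    # hash slice: 26 keys assigned 26 values in order
    return %alpha;
}

my %alpha = letter_positions();
print $alpha{1}, $alpha{26}, "\n";     # az
print scalar( keys %alpha ), "\n";     # 26

my @word = @alpha{ 16, 5, 18, 12 };    # a slice can also be read: four values at once
print @word, "\n";                     # perl

my @letters = ( "a" .. "z" );
my @few = @letters[ 0, 1, 2 ];         # square brackets: an array slice instead
print "@few\n";                        # a b c
```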

Still, that doesn't explain everything - so here's the rest of it. Since we're reading in the file contents one large slurp at a time, meaning that we get one entire file's worth when we read the special ``<>'' filehandle, I simply used the map function to do an implicit loop over it - and run the ``md5()'' routine over each of those chunks of text. I would have had to do something very different if these weren't text files - a file that contained a null would have thrown off the count - but they were. My safety margin was in the fact that the ``-w'' switch would warn me if I had an unbalanced hash - which would happen if there was a null anywhere in there. So, I created a hash of keys which were MD5 digests of the file contents, and assigned the array of file names that I'd created earlier as the values. It's important to note that hashes do not store the key-value pairs in the order that they're assigned... but it wasn't a factor here, since we were really dealing with arrays which are stored in order.
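
The pairing of digests with file names can be sketched with in-memory strings standing in for slurped files. ``index_by_digest'' is a hypothetical helper, and the explicit count check is an addition for illustration: it makes the ``one chunk per name'' assumption visible instead of leaving it to a warning.

```perl
#!/usr/bin/perl
# Sketch: digests as keys, names as values, with an explicit count check.
# "index_by_digest" is a hypothetical helper; the contents are in-memory
# strings standing in for slurped files.
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

sub index_by_digest {
    my ($names, $contents) = @_;               # two array references
    die "Got " . @$contents . " chunks for " . @$names . " names\n"
        unless @$contents == @$names;          # a stray null would break this pairing
    my %by_digest;
    @by_digest{ map { md5_hex($_) } @$contents } = @$names;
    return %by_digest;
}

my @names    = qw(f1 f2 f3);
my @contents = ("alpha\n", "beta\n", "alpha\n");
my %index    = index_by_digest(\@names, \@contents);
print scalar( keys %index ), " distinct contents\n";   # 2 distinct contents
```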

Now, Frink, I'll leave this one thing to you. Why did this produce a list of unique file names?"

Frink laughed.

 - "Thanks, Woomert. I actually do know this one. Since a hash's keys are unique - values don't have to be, but keys do - every time that you added a key/value pair where the key already existed in the hash, the old value for that key simply got overwritten. Voila - a unique list. In fact, I can now break all this out in a script... mmm, I'll have to change a few things, since the way you did it is implicit in that hash slice mechanism:
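
The overwriting that Frink describes is easy to check on a toy example. In this sketch, ``last_value_per_key'' is a hypothetical helper and the keys and values are arbitrary words; when a key repeats in a slice assignment, the value assigned last is the one that remains.

```perl
#!/usr/bin/perl
# Sketch: a repeated key keeps only the value assigned last.
use strict;
use warnings;

sub last_value_per_key {
    my ($keys, $values) = @_;          # two array references of equal length
    my %hash;
    @hash{ @$keys } = @$values;        # later pairs overwrite earlier ones
    return %hash;
}

my %h = last_value_per_key( [qw(x y x)], [qw(first second third)] );
print scalar( keys %h ), "\n";         # 2
print $h{x}, "\n";                     # third
print $h{y}, "\n";                     # second
```

So of several identical files, the name that survives is whichever one happened to come last on the command line, which is fine when the names don't matter.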


#!/usr/bin/perl -w
use Digest::MD5 qw/md5/;

{
    local $/;                          # Temporarily undefine EOL
    @n=@ARGV;
    $count = 0;
    while ( <> ){
        $key = md5($_);
        $value = $n[$count++];
        $uniq{ $key } = $value;
    }
}

print"$_ " for values %uniq

After a moment or two, Willard suddenly spoke up.

 - "Say, I think I understand this stuff. Why, that doesn't look complicated at all! I'm not sure about the ``$_'' and the ``$/'' variables, but I'd think I can find out about those - Perl does have good documentation, right?"

Frink and Woomert both laughed, and Frink fielded the question.

 - "The best. In fact, it all comes with Perl - and is augmented with every module you install. It's all available via the ``perldoc'' program; start by reading ``perldoc perldoc'', and you'll never find yourself at a loss for information about Perl."

Somewhat later, after the very grateful Willard had headed for home and (finally) a night of sleep, Frink and Woomert were relaxing with a rare recording of Burundi Ubuhuba nose-singing that was accompanied by a thumb-piano and zither. As usual, the food accompanying the music was tasty and highly appropriate: dinner consisted of curried ingelegde vis (a spicy fish recipe that Woomert had learned at Cape Malay) and futari (squash and yams) on the side, with East African samosa bread and spicy piri-piri sauce for the adventurous. Pickled African peaches wrapped up the menu.

Suddenly, there was a loud jangling noise from the outside, followed by cursing that would blister cheap paint (Woomert had providentially done the house and the out-buildings in a top-grade epoxy, so they weren't affected), and by police sirens shortly thereafter.

 - "Ah." Woomert casually leaned back in his chair, nibbling on one last tasty peach. "That would be the Zigamorphs. Back to prison they go for violating their probation; they had been explicitly told to stay out of my neighborhood."

 - "What... happened, Woomert? It sounded pretty bad."

 - "I knew they'd come calling soon, and had set a trap for them. Just a very basic numerical complement program which would throw a steel-cage exception when it detected a null [2]. One of these days, Frink, the criminals will become intelligent - mark my words, it's a simple matter of selection pressure. Until then, we can all sleep safe in our beds..."

[1] Larry Wall, the creator of Perl, has suggested "Pathologically Eclectic Rubbish Lister" for those who simply can't stand to have Perl not be an acronym. "Practical Extraction and Report Language" has also been suggested for those who have to sell the idea of using it to management, which is usually well-known for its complete lack of a sense of humor.


[2] A zigamorph, according to the Jargon File, is a hex 'FF' character (11111111). A numerical complement of this would, of course, be all zeros - a null.
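
For the curious, the arithmetic in note [2] can be checked in a few lines. This is only a sketch of the bit-flipping, not of Woomert's trap; ``complement_byte'' is a hypothetical helper, and the mask is needed because Perl's ``~'' operator works on a full machine word rather than on a single byte.

```perl
#!/usr/bin/perl
# Sketch: the one's complement of the byte 0xFF, kept to eight bits, is zero.
use strict;
use warnings;

sub complement_byte {
    my ($byte) = @_;
    return ~$byte & 0xFF;              # ~ flips every bit; the mask keeps the low eight
}

printf "%08b\n", 0xFF;                     # 11111111
printf "%08b\n", complement_byte(0xFF);    # 00000000
```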

 

Ben is a Contributing Editor for Linux Gazette and a member of The Answer Gang.

Ben was born in Moscow, Russia in 1962. He became interested in electricity at age six--promptly demonstrating it by sticking a fork into a socket and starting a fire--and has been falling down technological mineshafts ever since. He has been working with computers since the Elder Days, when they had to be built by soldering parts onto printed circuit boards and programs had to fit into 4k of memory. He would gladly pay good money to any psychologist who can cure him of the resulting nightmares.

Ben's subsequent experiences include creating software in nearly a dozen languages, network and database maintenance during the approach of a hurricane, and writing articles for publications ranging from sailing magazines to technological journals. Having recently completed a seven-year Atlantic/Caribbean cruise under sail, he is currently docked in Baltimore, MD, where he works as a technical instructor for Sun Microsystems.

Ben has been working with Linux since 1997, and credits it with his complete loss of interest in waging nuclear warfare on parts of the Pacific Northwest.


Copyright © 2003, Ben Okopnik. Copying license http://www.linuxgazette.net/copying.html
Published in Issue 91 of Linux Gazette, June 2003
