Minimal Perl for Unix and Linux People | Page 4 | WebReference




Perl as a (Better) Find Command

6.3 Finding Files

Perl's facilities for text processing make it a natural choice when you need to select files whose names have particular properties. We'll look at some typical cases next.

6.3.1 Finding files by name matching

One common use of find is to identify pathnames having certain patterns of characters in their final segments, using the -name option. For example, Don is looking for a text file he created with the vi editor a long time ago. After contemplating the many possibilities, he concludes that the filename might have been "letter", or it might have contained "memo", or it might have started with "epistle".

He composes the appropriate find command, and tries it:

Note that it's vital to enclose those alternative -name options joined by -o (or) operators within backslashed parentheses. Unfortunately, due to the way find works, the result of omitting them is an incorrect outcome, rather than an error message.
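A minimal sketch of the grouping issue, assuming a hypothetical scratch directory and sample filenames (none of which come from Don's actual session). It runs the same three -name tests with and without the backslashed parentheses, so the difference can be seen directly:

```shell
# Hypothetical scratch directory with sample files (names are illustrative only).
demo=$(mktemp -d)
touch "$demo/letter" "$demo/staff_memo" "$demo/epistle1" "$demo/notes"

# Grouped: -print applies to whichever of the three -name tests succeeds.
grouped=$(find "$demo" \( -name letter -o -name '*memo*' -o -name 'epistle*' \) -print | sort)

# Ungrouped: the implicit "and" binds more tightly than -o, so -print attaches
# only to the last -name test, and the other matches are silently dropped.
ungrouped=$(find "$demo" -name letter -o -name '*memo*' -o -name 'epistle*' -print | sort)

printf 'grouped:\n%s\n' "$grouped"
printf 'ungrouped:\n%s\n' "$ungrouped"
rm -rf "$demo"
```

With these sample files, the grouped form should list all three matching pathnames, while the ungrouped form lists only the one starting with "epistle", and find reports no error in either case.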

Using a Perl command instead to do the filename matching allows a solution that's less error-prone and more powerful. That's largely because Perl's pattern-matching is based on a powerful regular expression (regex) notation with an intuitive egrep-like syntax, rather than find's more limited filename generation (FNG) notation coupled with a cumbersome syntax.

In addition, Perl uniquely supports the text-file test, which is appropriate to use when you're searching for files created with vi, like the one Don misplaced. Using it eliminates undesirable matches against names of compiled programs, such as the matches with "*memo*" shown in the last two lines of the previous command's output.

Here's Don's improvement on the previous find command, which handles the trickier parts of the problem with Perl:

Notice that Don used the -a and -F options to request the automatic splitting of the incoming pathnames into fields, using "/" as the delimiter. This makes it easier to direct the matching to the final segment of each pathname, to mimic what find's -name option does.

The matching operator is used to scan each pathname's final segment (in $F[-1]) for the exact string "letter", or the substring "memo", or a string starting with "epistle"—with the entire pathname being printed (from $_) for each match.
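A command of the shape just described might look like the following. This is a minimal sketch rather than Don's exact command: the scratch directory, the sample files, and the precise regex are assumptions made for illustration. Only standard Perl features are used: the -n, -l, -a, and -F options, the -T file test, and the @F array.

```shell
# Hypothetical scratch tree; the file names are illustrative only.
demo=$(mktemp -d)
mkdir "$demo/docs"
printf 'Dear Sir,\n'       > "$demo/docs/letter"
printf 'Re: budget\n'      > "$demo/docs/old_memo.txt"
printf 'Greetings\n'       > "$demo/epistle_to_bob"
printf 'Monthly news\n'    > "$demo/newsletter"    # text, but the name does not match
printf '\000\001\002\003'  > "$demo/memo_reader"   # name matches, but the file is not text

# find supplies the pathnames; Perl splits each one on "/" (-a, -F/),
# keeps text files only (-T), and matches against the final segment.
matches=$(find "$demo" -type f -print |
  perl -F/ -lane 'print if -T $_ and $F[-1] =~ /^letter$|memo|^epistle/' |
  sort)

printf '%s\n' "$matches"
rm -rf "$demo"
```

In this sketch, "newsletter" is rejected by the anchored alternative ^letter$, and "memo_reader" is rejected by the -T test even though its name contains "memo".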

In summary, Don's vague recollections about his text-file's name were accurate enough to let him write two kinds of commands to find it. The command using the POSIX find by itself requires a tricky syntax and uses a relatively weak pattern-matching notation, whereas an approach relying primarily on Perl has the benefits of a more powerful matching facility with a familiar egrep-like syntax, and the ability to distinguish text files from nontext files.

We'll next use Perl with a matching operator to select pathnames in another way that POSIX find just can't match.

Finding multi-word filenames

Let's consider the intriguing case of Steffi, who has a lingering thumb. When using her word-processing application, she saves her documents under multi-word filenames, but because of her "thumb issue", those words may be separated by one space or several, depending on how long her thumb lingers on the (automatically repeating) key.

Right now, she needs to rapidly locate a file named "Final Report", or maybe it was "Final report", or possibly "final report", or perhaps even "final Report", or blast it, quite possibly "FINAL REPORT". What's more, because of her thumb issue, she needs to make allowances for various numbers of spaces between the words.

She'll be using the POSIX find command, so to save a lot of redundant typing, she simplifies the solution by initially looking for filenames having only one or two spaces of separation between the required words. She also arranges for uppercase and lowercase variations to be allowed for every character, through the highly effective but egregiously cumbersome method of using a character class for each and every letter. Here's the resulting command:

To handle the case of two spaces between the words, Steffi retyped the first -name line with an extra space between the words to create the second -name line. She needed to match names containing additional spaces as well, but she was already sick of typing by this point and highly motivated to look for an easier solution.
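A find-only command built this way might resemble the sketch below. The scratch directory and sample filenames are hypothetical, and the exact patterns Steffi typed may have differed (for instance, they may have included wildcards).

```shell
# Hypothetical scratch directory; names differ in case and in spacing.
demo=$(mktemp -d)
touch "$demo/Final Report" "$demo/final  REPORT" "$demo/FINAL   report" "$demo/Final Draft"

# One -name test per allowed number of spaces, with a character class
# for every letter to permit either case.
found=$(find "$demo" \( \
    -name '[Ff][Ii][Nn][Aa][Ll] [Rr][Ee][Pp][Oo][Rr][Tt]' -o \
    -name '[Ff][Ii][Nn][Aa][Ll]  [Rr][Ee][Pp][Oo][Rr][Tt]' \
  \) -print | sort)

printf '%s\n' "$found"
rm -rf "$demo"
```

The names with one and two spaces are found, but the one with three spaces is missed, and covering it would take yet another -name line.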

After pleading with a Perlish friend for help, she came up with this alternative:

The -B operator checks that the file named by the current pathname is a binary file (i.e., non-text; see table 6.2), which is appropriate because Steffi's word-processing program saves files in a format of that type. The find command can't test for this property, so Steffi couldn't have been certain of finding the right file types with her solution based entirely on it.

The i modifier on the matching operator requests a case-insensitive match, thereby dispensing with all the [Cc][Aa][Ss][Ee]-variation complexities of the find solution with one keystroke.

The "+" quantifier following the space allows for one or more spaces between the words, accommodating much more extreme cases of thumb-down hysteresis than the more complex but less powerful find version that Steffi initially coded.
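The three ingredients discussed above (the -B test, the i modifier, and the "+" quantifier) can be combined as in the following sketch. It is not necessarily the exact command Steffi ended up with; the scratch directory, the sample files, and the unanchored regex are assumptions made for illustration.

```shell
# Hypothetical scratch directory; NUL bytes make a file look binary to -B.
demo=$(mktemp -d)
printf '\000\001\002\003' > "$demo/Final Report"
printf '\000\001\002\003' > "$demo/FINAL     report"
printf '\000\001\002\003' > "$demo/finalreport"        # no space: should not match
printf 'plain text\n'     > "$demo/final report.txt"   # text file: rejected by -B

# Keep binary files (-B) whose final pathname segment contains "final",
# one or more spaces, then "report", in any mix of upper and lower case.
found=$(find "$demo" -type f -print |
  perl -F/ -lane 'print if -B $_ and $F[-1] =~ /final +report/i' |
  sort)

printf '%s\n' "$found"
rm -rf "$demo"
```

Here a single short regex covers every spacing and capitalization variant, including the five-space name that the two-line find version would have missed.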

In summary, Steffi's problem is more easily solved with help from Perl because the pattern-matching operations can be handled using the more versatile regex notation, case-insensitive matching can be requested, and non-binary files can be excluded from consideration. Moreover, the Perl solution is more complete, because it handles any number of additional spaces between words, and it is more compact than its POSIX find counterpart.

Next, you'll see another way in which Perl's file-finding capabilities exceed those of find.

