Minimal Perl for Unix and Linux People: Part 2/Page 5 | WebReference




Perl as a (Better) Find Command: Part 2

6.5.1 Using Perl for reliable timestamp sorting

A classic problem is that of identifying the most recently modified (i.e., newest) file within a particular branch of the file system, which might reflect the most recent order received, the latest blog uploaded, the last Unix configuration file modified, and so forth. To find the newest file, a knowledgeable Unix programmer might compose a command like the following:
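A pipeline of that kind might be sketched as follows. The function name newest_via_xargs and its directory argument are assumptions made for illustration; the options passed to ls and tail are the ones discussed in the text.

```shell
# Hypothetical wrapper: the function name and the directory argument are
# illustrative assumptions, not part of the original command.
newest_via_xargs() {
    # find emits the pathnames of regular files; xargs hands them to ls,
    # which sorts by modification time, oldest first (-t reversed by -r);
    # tail -1 keeps only the last listing.
    find "$1" -type f -print | xargs ls -lrdt | tail -1
}
```

Calling newest_via_xargs /etc, for instance, would print one long-format listing, which is only trustworthy when ls runs exactly once.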

What does that pipeline do? The find command emits the pathnames of the relevant files; the xargs command submits them as arguments to ls, whose -lrdt options sort the listings in ascending order by modification time; and then the tail -1 command peels off the listing that comes out last, the one for the newest file. At least, you'd expect it to be the pathname of the newest file, on the basis of (dodgy) advice from books or colleagues, or your own experiences with similar commands.

As discussed earlier, it's considered fiendishly clever to use xargs with find instead of an -exec clause, because doing so is guaranteed to minimize the number of processes required to handle all the arguments. In fact, the find | xargs approach is so efficient, and so highly revered in Unix culture, and so impressive to your colleagues, and so, well, cool, that the only bad thing you could possibly say about its use for this task is: It's not guaranteed to produce the correct results!

Why can't it be trusted? Because the ls command isn't guaranteed to sort all the filenames in one batch. That can lead to an incorrect result, because the most recent file from the final batch is always the last one provided as input to tail and therefore the one emitted by the pipeline. Therefore, if so many filenames are presented to xargs that it has to divvy them up for processing by two or more ls commands, there's no guarantee that the file of interest will be processed in the critical final batch and that the correct pathname will emerge from the pipeline.
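The batching effect can be reproduced on a small scale. The sketch below is an illustration only: the temporary directory and filenames are hypothetical, and xargs -n 2 stands in for the much larger argument limit that triggers batching in practice.

```shell
# Demonstration directory and filenames are hypothetical.
dir=$(mktemp -d)
touch -t 202001010000 "$dir/oldest"
touch -t 202101010000 "$dir/older"
touch -t 202201010000 "$dir/newest"

# -n 2 forces xargs to run ls once per two pathnames, mimicking what
# happens when the argument list is too long for a single ls.
# Batch 1 sorts "newest" and "older"; batch 2 contains only "oldest".
wrong=$(printf '%s\n' "$dir/newest" "$dir/older" "$dir/oldest" |
    xargs -n 2 ls -rdt | tail -1)

echo "${wrong##*/}"   # prints "oldest"
rm -rf "$dir"
```

Each ls sorts only its own batch, so tail sees the newest file of the last batch rather than the newest file overall.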

Note that this isn't a criticism of xargs itself, which does an admirable job of running the separate ls commands as efficiently as possible. The problem is that sorting isn't an operation that can be done in piecemeal fashion: all the filenames must be sorted in one batch. For this reason, the find | xargs approach just isn't suited to solving this problem.

The modified solution shown next uses a custom Perl script called most_recent_file instead of xargs, which has two distinct advantages:

Here are the results from using the xargs-based technique shown earlier—and its Perl alternative—for finding the most recently modified file under /etc on my Linux-equipped laptop:

As you can see, the commands identify different files as the newest—and they can't both be right.

The wrong answer is the one produced by the first pipeline, because find generated so many arguments that xargs couldn't present them all to ls in one batch.

In contrast, most_recent_file (shown in Listing 6.1) always produces the correct answer.

Listing 6.1 The most_recent_file script

That script may look intimidating at first, due to its size, but if you look more closely, you'll see that it's mostly comments.

It starts by using the stat function to obtain the file's data. The element at index 9 of the list that stat returns is the time of the file's last modification, expressed as a large integer: the number of seconds that elapsed between a fixed reference point in the past (the epoch, which on Unix systems is the start of 1970, UTC) and that modification.
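To see what that value looks like, the stat call can be tried on a single file from the shell. This is a small sketch; the helper name mtime_of is an assumption introduced here and does not come from the script under discussion.

```shell
# Hypothetical helper: prints the modification time of one file as
# seconds since the epoch, using element 9 of Perl's stat list.
mtime_of() {
    perl -le 'print((stat $ARGV[0])[9])' "$1"
}
```

Running mtime_of on any existing file prints a single integer, and a more recently modified file yields a larger one.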

The rest of the script is devoted to keeping constant track of the most recent modification time seen thus far, along with its associated filename, and then printing the "winning" name after all input has been processed (in the END block). The logic goes like this: If the current file's $mtime value is larger than the largest one seen thus far (stored in $newest), the current filename replaces the earlier one as our latest idea of the one most recently modified.
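The logic just described can be sketched as a shell function wrapping a short Perl program. This is a reconstruction from the description, not a copy of Listing 6.1; wrapping it in a function, and skipping pathnames that stat cannot read, are assumptions added here.

```shell
# A sketch reconstructed from the prose, not the original listing.
most_recent_file() {
    # -n loops over input lines (one pathname each), -l strips and
    # restores newlines, -w enables warnings.
    perl -wnl -e '
        BEGIN { $newest = 0; $name = "" }  # nothing seen yet
        my $mtime = (stat $_)[9];          # modification time in seconds
        next unless defined $mtime;        # assumption: skip unreadable paths
        if ($mtime > $newest) {            # newer than the best so far?
            $newest = $mtime;              # remember its time
            $name   = $_;                  # and its pathname
        }
        END { print $name }                # report the winner after all input
    '
}
```

It would be used as in find /etc -type f -print | most_recent_file. Because a single process reads every pathname, the comparison is never split into batches.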

That's all it takes to write a Perl script that avoids the predilection of the xargs-based solution for identifying the wrong file as most recently modified, when many must be examined.

Next, we'll discuss another limitation of xargs, and how Perl can once again be of assistance. It involves wrangling pathnames that contain whitespace characters, which has historically been a vexing problem for Unix system administrators.


