|  What is That, Anyway?
 Randal L. Schwartz 
              So, you've got a directory full of mixed stuff, or maybe 
              an entire tree of directories. Just what's behind each of those 
              names? Are they directories, symbolic links, or just plain files? 
              And if they're files, are they text files or binary files? 
              And if they're binary files, are they images, executables, 
              or just random garbage? 
              Perl has many built-in operators for getting lists of names 
              easily, and for figuring out what you really have once you know 
              a name. 
              For example, let's find all the subdirectories within the 
              current directory: 
              
             
for my $name (glob '*') {
  next unless -d $name;
  print "one directory is $name\n";
}
 Here, the glob operator expands to all the non-dot-prefixed 
              names within the current directory, and the -d operator returns 
              true for all those names that are directories.
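The same family of file tests can sort every name into the categories from the opening questions. Here is a minimal sketch of that idea. The classify subroutine and its labels are made up for this illustration, and it tests -l first on the assumption that you want to see a symbolic link as a link rather than as whatever it points to:

```perl
use strict;
use warnings;

# Return a short label for a name. The symlink test comes first because
# -d and -f follow symbolic links and would describe the link's target.
sub classify {
  my ($name) = @_;
  return "symlink"    if -l $name;
  return "directory"  if -d $name;
  return "plain file" if -f $name;
  return "other";     # sockets, devices, named pipes, or names that vanished
}

for my $name (glob '*') {
  printf "%-10s %s\n", classify($name), $name;
}
```

Like the loop above, this only sees the non-dot-prefixed names in the current directory.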
              What if we wanted to do this recursively? We need to step outside 
              of the core Perl, but not very far away. A core-included module 
              called File::Find takes care of nearly all of our recursive 
              directory processing problems. Let's find all directories below 
              the current directory:
              
             
use File::Find;
find sub {
  return unless -d $_;
  print "one directory is $File::Find::name\n";
}, ".";
The find subroutine takes a subroutine reference (called 
              a coderef), here provided with the anonymous subroutine constructor. 
              Each name found below . (specified on the last line of this 
              snippet) will trigger an invocation of this subroutine, with $File::Find::name 
              set to the full name, and $_ set to the basename (with the 
              working directory already selected to the directory in which the 
              name is located).
              If you run this, you'll see that the starting directory 
              itself is reported too, under the name of . (dot). File::Find 
              already skips the "dot" and "dot-dot" entries of every directory 
              it reads, so each subdirectory shows up only once, but the top 
              of the search is handed to the subroutine with $_ set to dot. 
              So, how do we eliminate that? Well, just rejecting 
              "dot" and "dot-dot" in the subroutine will do 
              nicely:
              
             
use File::Find;
find sub {
  return if $_ eq "." or $_ eq "..";
  return unless -d $_;
  print "one directory is $File::Find::name\n";
}, ".";
 There. We'll keep moving forward from this as our base, because 
              rejecting the meta-links of dot and dot-dot is generally a useful 
              thing.
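To see what File::Find hands to the subroutine on each call, it can help to print the values side by side. The following is only a sketch: visit_rows is a hypothetical helper name, and $File::Find::dir (the directory that has been selected as the working directory) is a documented File::Find variable that the snippets in this column do not otherwise use.

```perl
use strict;
use warnings;
use File::Find;

# Collect one row per name visited:
#   [ $File::Find::dir, $_, $File::Find::name ]
sub visit_rows {
  my @start = @_;
  my @rows;
  find sub {
    push @rows, [ $File::Find::dir, $_, $File::Find::name ];
  }, @start;
  return @rows;
}

for my $row (visit_rows(".")) {
  printf "dir=%s  basename=%s  full=%s\n", @$row;
}
```

Comparing the three columns shows why the file tests are applied to $_ while the messages print $File::Find::name.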
              What about all the symbolic links? Can we find those? Sure! That's 
              the -l operator:
              
             
use File::Find;
find sub {
  return if $_ eq "." or $_ eq "..";
  return unless -l $_;
  print "one symlink is $File::Find::name\n";
}, ".";
 Cool! But where do they point? That's the readlink 
              operator, as in:
              
             
use File::Find;
find sub {
  return if $_ eq "." or $_ eq "..";
  return unless -l $_;
  my $dest = readlink($_);
  print "one symlink is $File::Find::name, pointing to $dest\n";
}, ".";
 We can skip the -l test, because readlink returns 
              undef for anything that isn't a symlink, as 
              in:
              
             
use File::Find;
my @search = @ARGV;
@search = qw(.) unless @search;
find sub {
  return if $_ eq "." or $_ eq "..";
  return unless defined (my $dest = readlink($_));
  print "one symlink is $File::Find::name, pointing to $dest\n";
}, @search;
 I've also made it simpler to run this on different directories 
              by passing them on the command line.
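Since readlink reports the link text exactly as stored, the destination may be a relative path, and it may not exist at all. A small extension, sketched here under the assumption that you also want to flag dangling links, uses -e, which follows the link. The symlink_report helper name and the "ok"/"dangling" labels are invented for this example.

```perl
use strict;
use warnings;
use File::Find;

# Return [ full name, link text, "ok" or "dangling" ] for every symlink found.
sub symlink_report {
  my @start = @_;
  my @report;
  find sub {
    return unless defined(my $dest = readlink($_));
    # -e follows the link, so it is false when the target is missing.
    my $state = (-e $_) ? "ok" : "dangling";
    push @report, [ $File::Find::name, $dest, $state ];
  }, @start;
  return @report;
}

my @search = @ARGV ? @ARGV : (".");
for my $entry (symlink_report(@search)) {
  printf "%s -> %s (%s)\n", @$entry;
}
```

The -e test works on the bare $_ because find has already changed into the directory that holds the link, so relative destinations resolve correctly.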
              So, what do we have left? We can notice and skip over directories 
              and symbolic links. How about files? Files are where the real action 
              is located. And some of them are text-like, and some of them are 
              binary-like. Although even those lines are blurry: you could argue 
              that XML is really just a text-like binary format, and a Microsoft 
              Word document is clearly text inside a binary-like format.
              But back to what Perl can help with, first. Let's add the 
              -T operator to distinguish those text files:
              
             
use File::Find;
my @search = @ARGV;
@search = qw(.) unless @search;
find sub {
  return if -d $_ or -l $_;
  return unless -T $_;
  print "One text file is $File::Find::name\n";
}, @search;
 And that's pretty cool. Just a list of text files. But this 
              actually doesn't tell us too much. What we might really want 
              is a list of all the Perl scripts. What can tell us that? Well, 
              the UNIX command called file can peer inside the contents 
              of a file to figure out what it is. Let's invoke that on each 
              file:
              
             
use File::Find;
my @search = @ARGV;
@search = qw(.) unless @search;
find sub {
  return if -d $_ or -l $_;
  my $file_said = `file $_`;
  if ($file_said =~ /perl/) {
    print "$File::Find::name: $file_said";
  }
}, @search;
 Hey, look at that. Now we're pulling out just the names that 
              file insists are possibly Perl programs. But this program 
              will slow to a crawl on a large tree. We're reinvoking the 
              file command individually on every file in the tree.
              There are a couple of ways to go from here to speed it up. 
              I could save all the filenames to invoke file once at the 
              end of the program:
              
             
use File::Find;
my @search = @ARGV;
@search = qw(.) unless @search;
my @list;
find sub {
  return if -d $_ or -l $_;
  push @list, $File::Find::name;
}, @search;
for (`file @list`) {
  if (/perl/) {
    print;
  }
}
 And yes, that sped it up considerably, but now we don't get 
              the results until the end of the tree walk, and we'll run into 
              problems if the number of arguments exceeds a comfortable limit 
              for file.
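One way around the argument-count problem is to hand the names to file in batches. This is a rough sketch rather than a tuned solution: batches and file_says are hypothetical helper names, the batch size of 100 is an arbitrary guess, and the list form of a piped open is used so that the names never pass through a shell.

```perl
use strict;
use warnings;
use File::Find;

# Split a list into array references of at most $size items each.
sub batches {
  my ($size, @items) = @_;
  die "batch size must be positive\n" if $size < 1;
  my @out;
  push @out, [ splice @items, 0, $size ] while @items;
  return @out;
}

# Run file(1) on one batch of names and return its output lines.
sub file_says {
  my @names = @_;
  open(my $fh, '-|', 'file', @names) or do {
    warn "cannot run file: $!\n";
    return;
  };
  my @lines = <$fh>;
  close $fh;
  return @lines;
}

my @search = @ARGV ? @ARGV : (".");
my @list;
find sub {
  return if -d $_ or -l $_;
  push @list, $File::Find::name;
}, @search;

for my $batch (batches(100, @list)) {   # 100 is an arbitrary batch size
  print for grep { /perl/ } file_says(@$batch);
}
```

This still waits for the tree walk to finish before printing anything, so it addresses the argument limit but not the delay.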
              But there's another way. Out in the CPAN (at places such 
              as search.cpan.org), we can find the File::MMagic 
              module. This apparently is a Perl module derived from the file 
              command created for the PPT project, which was originally based 
              on code written for Apache to implement the mod_mime_magic module, 
              to emulate the standard file command. Wow. And now I'm 
              going to write a recursive controllable file-like program 
              on top of that. Will the reuse ever stop? (I hope not!)
              So, what we need from this module is the method called checktype_filename, 
              which returns a MIME type (like text/plain or image/jpeg), 
              and perhaps a semicolon and some additional information. So, let's 
              find all the Perl scripts quickly. First, after a little playing 
              around, I see that the string I'm looking for has "executable" 
              followed by a space, then something ending in "perl" followed 
              by a space and then "script". That's a simple regular 
              expression, so I'll add that at the right place:
              
             
use File::Find;
use File::MMagic;
my $mm = File::MMagic->new;
my @search = @ARGV;
@search = qw(.) unless @search;
find sub {
  return if -d $_ or -l $_;
  my $type = $mm->checktype_filename($_);
  return unless $type =~ /executable \S+\/perl script/;
  print "$File::Find::name: $type\n";
}, @search;
 Now I know what programs to look at when I upgrade, to see which 
              modules they all use. (Hmm. Sounds like an idea for another column. 
              I'll note that.)
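If installing a CPAN module isn't an option, a cruder check is possible with core Perl alone: peek at the start of each file for a #! line that mentions perl. To be clear, this is a different and weaker technique than what File::MMagic does, sketched here with a hypothetical has_perl_shebang helper and an arbitrary 256-byte peek. It will miss modules and any script that lacks a #! line.

```perl
use strict;
use warnings;
use File::Find;

# True if the file starts with a #! line that mentions perl.
sub has_perl_shebang {
  my ($path) = @_;
  open(my $fh, '<', $path) or return 0;   # unreadable: treat as "no"
  binmode $fh;
  my $got = read($fh, my $head, 256);     # fixed-size peek, not a whole line
  close $fh;
  return 0 unless $got;                   # empty file or read error
  return $head =~ /\A#![^\n]*\bperl/ ? 1 : 0;
}

my @search = @ARGV ? @ARGV : (".");
find sub {
  return if -d $_ or -l $_;
  print "$File::Find::name\n" if has_perl_shebang($_);
}, @search;
```

Reading a fixed number of bytes keeps the check cheap even on a large binary file that contains no newline at all.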
              And one last fun one. Let's find all the images in the tree, 
              and then call Image::Size (also found in the CPAN) on them 
              to see their respective sizes. Just a few more tweaks:
              
             
use File::Find;
use File::MMagic;
use Image::Size;
my $mm = File::MMagic->new;
my @search = @ARGV;
@search = qw(.) unless @search;
find sub {
  return if -d $_ or -l $_;
  my $type = $mm->checktype_filename($_);
  return unless $type =~ /^image\//;
  print "$File::Find::name: $type: ";
  my ($x, $y, $imgtype) = imgsize($_);
  if (defined $x) {
    print "$imgtype: $x x $y\n";
  } else {
    print "error: $imgtype\n";
  }
}, @search;
 And as it turns out, I could have left the File::MMagic 
              out of this program, since Image::Size can cheerfully inform 
              me when it wasn't called on an image, but you know the old 
              Perl motto: There's More Than One Way To Do It!
              So, next time someone asks you "what do you have?", 
              I hope you can now answer them with a nice short Perl program. Until 
              next time, enjoy!
              Randal L. Schwartz is a two-decade veteran of the software 
              industry -- skilled in software design, system administration, 
              security, technical writing, and training. He has coauthored the 
              "must-have" standards: Programming Perl, Learning 
              Perl, Learning Perl for Win32 Systems, and Effective 
              Perl Programming, as well as writing regular columns for WebTechniques 
              and Unix Review magazines. He's also a frequent contributor 
              to the Perl newsgroups, and has moderated comp.lang.perl.announce 
              since its inception. Since 1985, Randal has owned and operated Stonehenge 
              Consulting Services, Inc.
           |