A Powerful Search Tool for ASCII Files
Koos Pol
              Sometimes the main UNIX principle of combining small tools to 
              accomplish complex tasks just isn't enough. There are times 
              when you just need more. A striking example is the task of searching 
              through ASCII files. This can be any sort of file: C programs, error 
              logs, HTML files, etc. If you need to find a specific string, it 
              is usually sufficient to grep through the file and view the 
              results. Combined with find, you can get a long way. For 
              example, if you need to dig through your HTML files and find the 
              ones that have obsolete links to your old department, you may want 
              to run something like:
              
             
find /usr/local/web -name "*.html" -print |
while read F; do echo "**** $F";
grep "http://intranet.mycompany.com/olddepartment" "$F"; done
If your queries get a bit more complicated, you may still get by using 
            egrep instead of grep, but you will run out of steam 
            very soon. Besides that, you really don't want to learn all the 
            egrep options when they can differ from one operating system to the next. 
            So what's the alternative? This is a perfect challenge for Perl 
            regular expressions: they can be extremely powerful and are the same 
            on every UNIX that runs Perl. So, what if we could rewrite the monster 
            above in something more attractive, such as:  
             
find.pl http://intranet.mycompany.com/olddepartment "/usr/local/web/*.html"
We obviously need to combine find and Perl's regular expressions 
            in one script. If we can do that, we really have a powerful tool for 
            searching files. Here are a few more examples:

Look for all your HTML files with images on remote servers:
              
             
find.pl "<img src=\"((http)|(ftp))" "/usr/local/web/*.html"
You inherited a bunch of Perl scripts and you want a quick view of 
            all the subroutines used:  
             
find.pl -v "sub\s+\w+\s*\{" "*.pl"
You are sifting through some C sources for a bug. It appears it has 
            to do with signals in combination with file operations: 
             
find.pl "signal.*?FILE" "*.c"
This will produce a list of all lines containing the word "signal" 
            and a constant that has "FILE" in its name. However, 
            this list is too long to handle. If we ignore all the FILE_SELECTION 
            matches because they don't seem to be involved, the list becomes 
            much smaller:  
             
-i "\$COPY\W.*\#.*status" "/usr/local/scripts/*.sh"
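The exclusion described above can be written with a negative lookahead, (?!...), one of the regex extensions documented in perlre. The following is a minimal sketch, assuming the constants to skip all start with FILE_SELECTION; the sample lines in @lines are made up for illustration. On the command line, the same pattern would presumably be handed to find.pl in single quotes, for example 'signal.*?FILE(?!_SELECTION)'.

```perl
use strict;
use warnings;

# Hypothetical sample lines standing in for a C source file.
my @lines = (
    "signal(SIGINT, handler); flags = FILE_LOCK;\n",
    "signal(SIGHUP, handler); mode = FILE_SELECTION_MULTI;\n",
    "fprintf(stderr, \"no signal here\");\n",
);

# "signal", then as little as possible, then "FILE" that is
# NOT immediately followed by "_SELECTION".
my $regex = 'signal.*?FILE(?!_SELECTION)';

my @hits = grep { /$regex/ } @lines;
print @hits;    # only the FILE_LOCK line survives
```

A line that mentions both FILE_SELECTION and some other FILE constant would still match, because the lookahead only rejects individual positions, not whole lines.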
Note that I changed from double quotes to single quotes because 
            some shells really don't like the ! on the command line, 
            and I had to prevent the shell from interpreting it.

You may have noticed that we need a few command switches to display 
              only the file name or to display the matching lines as well. We 
              may also want to search case-insensitively. With the Getopt::Std 
              package, this is very easy, and because it is included in the standard 
              Perl distribution, there is nothing extra to install.
              
             
my %opt;        # h=help, i=no case, v=view lines, l=line numbers
my $regex;      # what we're looking for
my $filemask;   # which files
my $dir;        # location
my $filename;   # name of file found
my $line;       # matching line
my $linenr;     # remembers the line number
my $nameonly;   # only print the filename
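To see what Getopt::Std does with the switches, here is a minimal sketch. It fakes a command line by assigning to @ARGV (the arguments are only examples) and then prints what getopts found and what it left behind.

```perl
use strict;
use warnings;
use Getopt::Std;

# Pretend the script was called as:  find.pl -i -v 'sub\s+\w+' '*.pl'
@ARGV = ('-i', '-v', 'sub\s+\w+', '*.pl');

my %opt;
getopts('hlvi', \%opt) or die "Unknown option\n";

# getopts removes the switches it recognised from @ARGV,
# so only the regex and the filemask remain.
print "switches : ", join(',', sort keys %opt), "\n";
print "remaining: @ARGV\n";
```

Each switch that was given ends up as a key in %opt with a true value; switches that were not given are simply absent.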
The start is easy. All we need is find to deliver us a list 
            of files. This list can then be read one by one:  
             
open (LIST, "find \"$dir\" -name \"$filemask\" -print|") or die "Can't run find: $!\n";
while (defined ($filename=<LIST>)) {
    chomp $filename;
    next if (-B $filename);              # don't check binary files
    unless (open (FILE, "<$filename")) {
        warn "Can't open $filename ($!)\n";
        next;                            # skip files we can't read
    }
By now we have the first of our files opened for reading. We can scan 
            it to see if it matches our regex: 
             
    while (defined($line = <FILE>)) {    # read as long as necessary
        if ($line =~ /$regex/) {         # we have a match
If we have reached this point, we have had a match. Now we can print 
            the filename and move on to the next file in the list. Let's 
            see what we have so far: 
             
open (LIST, "find \"$dir\" -name \"$filemask\" -print|") or die "Can't run find: $!\n";
while (defined ($filename=<LIST>)) {
    chomp $filename;
    next if (-B $filename);              # don't check binary files
    unless (open (FILE, "<$filename")) {
        warn "Can't open $filename ($!)\n";
        next;                            # skip files we can't read
    }
    while (defined($line = <FILE>)) {    # read as long as necessary
        if ($line =~ /$regex/) {         # we have a match
            print $filename,"\n";
            last;
        }
    }
    close (FILE);
}
close (LIST);
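The open above builds a shell command string, so a directory or filemask containing a double quote, a backtick, or a $ would be reinterpreted by the shell. One possible alternative, not used in this article's script, is the list form of a piped open (available since Perl 5.8 on UNIX), which hands each argument to find untouched. The sketch below is self-contained: the temporary directory and its file names are invented purely for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical test area: a directory whose name contains double quotes.
my $tmp = tempdir(CLEANUP => 1);
my $dir = "$tmp/my \"web\" dir";
mkdir $dir or die "Can't create $dir ($!)\n";
for my $name ('index.html', 'notes.txt') {
    open(my $fh, '>', "$dir/$name") or die "Can't create $name ($!)\n";
    close($fh);
}

my $filemask = '*.html';
my @found;

# List form: no shell is involved, so no extra quoting is needed.
open(my $list, '-|', 'find', $dir, '-name', $filemask, '-print')
    or die "Can't run find: $!\n";
while (defined(my $filename = <$list>)) {
    chomp $filename;
    push @found, $filename;
}
close($list);

print "$_\n" for @found;
```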
You may have noticed I cheated on the $dir and $filemask. 
            Where did those come from? We used regular expressions! If the given 
            command-line parameter contains a "/", we know that 
            a directory is involved. We just cut the string on the last "/". 
            Everything before the "/" is the directory, and everything 
            after it is the actual filemask: 
             
$dir = '.';    # use current dir if we don't get one
if ($filemask =~ m|/|) {          # directory given
    $filemask =~ m|(^.*)/(.*$)|;  # split the string up...
    ($dir,$filemask) = ($1,$2);   # ...and save both parts
}
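To check how the split behaves, the same logic can be wrapped in a small function and fed a few sample arguments. This is only a sketch: split_mask is a hypothetical helper name, and the sample paths are illustrative.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the split shown above.
sub split_mask {
    my ($filemask) = @_;
    my $dir = '.';                       # default: current directory
    if ($filemask =~ m|^(.*)/(.*)$|) {   # greedy .* runs up to the LAST slash
        ($dir, $filemask) = ($1, $2);
    }
    return ($dir, $filemask);
}

for my $arg ('/usr/local/web/*.html', '*.c', 'src/lib/*.pl') {
    my ($dir, $mask) = split_mask($arg);
    print "$arg: dir=$dir mask=$mask\n";
}
```

Because the first .* is greedy, everything up to the last "/" lands in the directory part, so nested paths such as src/lib/*.pl split into src/lib and *.pl.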
Of course, we want to search case-insensitively as well. Perl accommodates 
            this easily by extending the regular expression syntax. The prefix 
            that makes a regex case-insensitive is (?i). We'll 
            prepend it to the regex if there is a "-i" 
            on the command line: 
             
$opt{i} && ($regex = '(?i)'.$regex);  # case insensitive?
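A quick way to convince yourself that the prefix works is to match the same line with and without it. The pattern and the sample line below are made up for this sketch.

```perl
use strict;
use warnings;

my $regex = 'href="http://intranet';
my $line  = '<A HREF="HTTP://Intranet.mycompany.com/">';

print "plain : ", ($line =~ /$regex/ ? "match" : "no match"), "\n";

$regex = '(?i)' . $regex;      # same effect as the /i modifier
print "(?i)  : ", ($line =~ /$regex/ ? "match" : "no match"), "\n";
```

Since the prefix travels inside the pattern string itself, the matching code does not need a separate branch for the case-insensitive search.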
(You can find all the details on these extensions in the perlre 
            man page.) There is one more hurdle -- if we want to view the 
            matching lines or their line numbers, then we need to modify the logic 
            a bit. When we have a match on our regex, we need to continue 
            reading the file for more matches until the whole file is read. Only 
            then can we skip to the next file. By the way, let's agree on 
            one thing -- if we want to see the line numbers, it is reasonable 
            to display the lines as well, isn't it? There is not much use 
            in displaying the line numbers only. So the filename alone is 
            printed only when neither switch is given: 
             
$nameonly = !($opt{l} || $opt{v}); # only print the filename
Let's go back to the point where we have a match on our regex. Instead 
            of just printing the filename, the loop becomes: 
             
# read as long as necessary
NEXTLINE: while (defined($line = <FILE>)) {
    $linenr++;                        # remember this line number
    if ($line =~ /$regex/) {          # we have a match
If we didn't get a "-v" or "-l" 
            on the command line, it is sufficient to print the filename. There 
            is no need to read the rest of the file. 
             
            if ($nameonly) {           # only print the filename
                print $filename,"\n";
                last NEXTLINE;
If we do have a "-v" or "-l" on 
            the command line, we print a filename that is more visible in the 
            clutter of long screens full of text. Of course, we print the line 
            and, if requested, the line number also. 
             
            } else {
                print "**** $filename *****\n";
                print $opt{l} ? "$linenr: " : "", $line;
We now continue reading the rest of the file for other matches. If 
            we find them, we again print the lines or line numbers. We close off 
            by printing a new line as a separator between files: 
             
                while (defined($line = <FILE>)) {  # read until EOF
                    $linenr++;
                    if ($line =~ /$regex/) {      # more matches
                        print $opt{l} ? "$linenr: " : "", $line;
                    }
                }
                print "\n";
            }
        }
    }
    close (FILE);
}
close (LIST);
Let's give the lost user some helpful messages in case the program 
            gets the wrong parameters. When we stuff that into the script and 
            clean up some things here and there, then this is the final result: 
             
#! /usr/bin/perl -w
use strict;
use Getopt::Std;
my %opt;         # h=help, i=no case, v=view lines, l=line numbers
my $regex;       # what we're looking for
my $filemask;    # which files
my $dir;         # location
my $filename;    # name of file found
my $line;        # matching line
my $linenr;      # remembers the line number
my $nameonly;    # print only the filename
sub usage1 {
    $0 =~ s|^.*/||;  # strip off the path
    print  "Usage: $0 [-h] [-i] [-l] [-v] regex filemask\n";
    exit
}
sub usage2 {
    $0 =~ s|^.*/||;  # strip off the path
    print <<EOF;
Usage: $0 [-h] [-i] [-l] [-v] regex filemask
$0 is a powerful text grepper. It combines find and Perl regular expressions.
Options:
     -h  help screen
     -i  case insensitive matching
     -l  display line numbers on matching lines
     -v  display the lines matching 'regex'
EOF
    exit
}
getopts('hlvi',\%opt);  # process the command line switches
usage2() if (defined $opt{h});  # "HELP ME...!"
if (defined($ARGV[0]) && defined($ARGV[1])) {
    ($regex, $filemask) = ($ARGV[0], $ARGV[1]);
} else {
    usage1();
}
usage1() if (! defined $regex) or (! defined $filemask);
$opt{v} = 1 if $opt{l};               # -l without -v is useless
$nameonly = !$opt{v};                 # no -v or -l: print only the filename
$opt{i} && ($regex = '(?i)'.$regex);  # case insensitive?
$dir = '.';    # use current dir if we don't get one
if ($filemask =~ m|/|) {          # directory given
    $filemask =~ m|(^.*)/(.*$)|;  # split the string up...
    ($dir,$filemask) = ($1,$2);   # ...and save both parts
}
open (LIST, "find \"$dir\" -name \"$filemask\" -print|") or die "Can't run find: $!\n";
while (defined ($filename=<LIST>)) {
    chomp $filename;
    next if (-B $filename);           # don't check binary files
    unless (open (FILE, "<$filename")) {
        warn "Can't open $filename ($!)\n";
        next;                         # skip files we can't read
    }
    $linenr=0;
    # read as long as necessary
    NEXTLINE: while (defined($line = <FILE>)) {
        $linenr++;                    # remember this line number
        if ($line =~ /$regex/) {      # we have a match
            if ($nameonly) {          # just print the filename
                print $filename,"\n";
                last NEXTLINE;
            } else {
                print "**** $filename *****\n";
                print $opt{l} ? "$linenr: " : "", $line;
                while (defined($line = <FILE>)) {  # read until EOF
                    $linenr++;
                    if ($line =~ /$regex/) {       # more matches
                        print $opt{l} ? "$linenr: " : "", $line;
                    }
                }
                print "\n";
            }
        }
    }
    close (FILE);
}
close (LIST);
Koos Pol is a systems administrator with Compuware. He has broad 
            experience in UNIX, Windows, and OS/2 systems. His main responsibilities 
            are providing tools and database support on various platforms to a 
            group of developers. He also provides command-line interfaces and 
            Web interfaces to database backends. He can be reached at: koos_pol@nl.compuware.com.