|  Fuzzy 
              Text Searches with agrep and afind
 Alex Golomshtok and Yefim Nodelman
              In the past few years, the advents of storage and networking technology 
              and the Internet have contributed to a dramatic increase in the 
              amount of information accessible from the desktop. Users are now 
              faced with the problem of information overload -- it can be 
              difficult to locate the resources relevant to their needs. Under 
              these conditions, the ability to effectively search through vast 
              amounts of data becomes a primary requirement for any modern computing 
              environment.
              Most of the popular programming languages now are equipped with 
              text-searching and pattern-matching facilities of varying degrees 
              of sophistication. Most common are the "all-or-none" search 
              facilities that incorporate either exact string matching or wildcard 
              and regular expression matching algorithms. When the precise search 
              string is known in advance, the exact string matching algorithms 
              usually offer the best performance, although the speed of the search 
              is proportional to the length of the search argument and the quality 
              of the matching algorithm. For very short search patterns, for instance, 
              with length of the pattern not exceeding four characters, the brute-force 
              matching algorithms are usually the fastest. For longer search patterns, 
              more sophisticated algorithms (e.g., Rabin-Karp Checksum [1] or 
              Boyer-Moore [2] Algorithm) are known to provide superior performance.
              Regular expressions [3] are often used when an exact search pattern 
              is not known in advance. The power of regular expressions comes 
              from the ability to express a search pattern in generic terms, potentially 
              representing an infinite set of search arguments with a single expression. 
              A pattern like "[A-Za-z]+", for instance, will match any 
              string comprised by the upper and lower letters of the alphabet. 
              In addition to expressing the patterns using generic "meta-characters", 
              regular expression implementations often provide for anchors or 
              ability to control the exact position of a match within the search 
              text, and back-references that allow for easy text replacement and 
              substitution. Regular expressions are the natural solution to many 
              pattern-matching problems, such as looking for files and directories, 
              monitoring and extracting the information from log files, filtering 
              mail messages, etc.
              Despite all their power, exact matching as well as regular expressions 
              are usually an all-or-nothing proposition. These algorithms make 
              it difficult, if not impossible, to account for typos, alternative 
              spellings, spelling errors, and errors originated from data transmission. 
              To solve these types of problems efficiently, you would have to 
              employ an alternative technique known as approximate matching. Approximate 
              string matching is fundamental to modern text processing and is 
              heavily used by such computer applications as word processors or 
              search and indexing engines.
              Based on the needs of a particular application, the actual algorithm 
              behind the approximate matching mechanism may vary. Phonetic algorithms 
              such as soundex [4], for example, allow for matching strings that 
              sound similarly. A typical soundex implementation is simply an indexing 
              system that translates the strings into alphanumeric soundex codes 
              by ignoring the vowels and assigning "weights" to the 
              consonants. Since similar-sounding consonants are assigned the same 
              weight (e.g., "B", "P", "F", and "V" 
              are all assigned "1"), similar-sounding strings are likely 
              to produce equal soundex codes. Soundex is a very popular technology, 
              the most famous application of which was the initiative of the U.S. 
              Bureau of the Census to create an index for individuals listed in 
              the U.S. census records after 1880.
              While being efficient and easy to implement, soundex is limited 
              due to its phonetic nature; different natural languages require 
              different implementations. The German soundex, for instance, usually 
              uses six character codes as opposed to four-character English codes, 
              and the consonants are grouped based on different rules.
              Yet another family of string-matching algorithms, known as fuzzy 
              matching algorithms, takes an entirely different approach of calculating 
              various measures to assess the degree of proximity between strings. 
              There are two major measures that reflect the degree of proximity 
              -- k-mismatches and k-differences. The k-mismatches measure, 
              also known as Hamming distance, expresses the difference between 
              two strings by the number of characters that need to be changed 
              to obtain one from the other, where "k" is the maximum 
              number of differences allowed. For instance, "butter" 
              and "ladder" have a Humming distance of four as four characters 
              need to be replaced to obtain one from the other. The k-differences 
              measure or Levenshtein distance [5] indicates how many "edits" 
              (i.e., insertions, deletions, or substitutions) are necessary to 
              transform one string into the other, where "k" is the 
              maximum number of edits allowed. The Levenshtein distance between 
              "GUMBO" and "GAMBOL", for example, is two because 
              two edits are necessary.
              There are numerous fuzzy matching algorithms that utilize k-mismatches 
              (such as Baeza-Yates-Gonnet [6]), and k-differences (such as Wu-Manber 
              [7]). These algorithms are efficient, versatile, and language independent, 
              which makes them suitable for many text-processing tasks.
              Perl and Fuzzy String Matching
              Perhaps one of the most exciting features of Perl is its unsurpassed 
              support for regular expressions that has been an intrinsic part 
              of the language since its inception. One thing that Perl lacks, 
              however, is built-in fuzzy or approximate text-matching facilities. 
              This niche is filled with Perl extension modules, thanks to the 
              enthusiasm of numerous Perl developers. Currently, the Comprehensive 
              Perl Archive Network (CPAN) (http://www.perl.com/CPAN-local/CPAN.html) 
              offers several modules that perform fuzzy string matching and calculate 
              similarities between strings.
              One of these modules, String::Approx by Jarkko Hietaniemi 
              [8], allows for approximate string matching and substitutions using 
              Wu-Manber k-differences algorithm. This module performs fuzzy matching 
              based on a set of inputs that control such aspects of matching process 
              as maximum degree of proximity, case sensitivity, and more.
              Yet another extension, String::Similarity by Mark Lehmann 
              [9], is designed to calculate the similarity index between two string 
              arguments. This module is based on an O(ND) Difference Algorithm 
              [10], which also calculates the k-differences measure for its input 
              arguments.
              These modules bring the benefits of fuzzy matching to Perl users 
              while encapsulating the complexities of the underlying algorithms 
              and exposing enough functionality to solve nearly any text searching 
              and matching problem.
              afind
              To illustrate how approximate string matching may be applied to 
              everyday computing tasks, we developed several useful utilities 
              that extend the functionality of traditional UNIX programs such 
              as find and grep. The afind script recursively 
              descends the directory hierarchy searching for files, named such 
              that the degree of similarity between the file name and a search 
              string does not exceed a certain percentage. The similarity search 
              limit percentage is controlled via a command-line parameter. The 
              script outputs a list of files that match the criteria, sorted by 
              similarity index (most similar first).
              The script takes several command line options. Option -l 
              specifies the similarity limit as a percentage -- a higher number 
              will force the script to output closer matches. Option -h 
              causes the script to print the usage string and exit immediately. 
              Finally, option -v forces verbose output, which can be very 
              useful for diagnostic purposes. The remaining two arguments are 
              the name of the directory to use as a starting point for the search 
              and a search pattern. The directory name can be omitted, in which 
              case the current working directory will be used as a starting point.
              The following is the complete source code listing for the afind 
              script:
              
             
1  #!/usr/local/bin/perl
2
3  use Getopt::Std;
4  use File::Find;
5  use String::Similarity;
6   
7  $| = 1;
8   
9  $usage = qq(Usage: afind -lvh [directory] search_pattern\n);
10
11  die $usage
12    unless  getopts( qq(l:vh) );
13  die $usage
14     unless @ARGV;
15  die $usage
16     if $opt_h;
17
18  $limit   = ( $opt_l =~ /^(\d{1,2}|100)$/ ) ? ( $opt_l/100.0 ) : 0.75;
19  $dir     = @ARGV > 1 ? shift : q(.);
20  $pattern = shift;
21
22  die qq(Directory $dir doesn't exist!\n)
23     unless -d $dir;
24
25  File::Find::find( \&wanted, $dir );
26
27  foreach $file ( sort { $b->{ sindex } <=> $a->{ sindex } } @files ) {
28     printf "%.2f\t%s\n",$file->{ sindex }, $file->{ name };
29  }
30  exit( 0 );
31
32  sub wanted {
33     my $sindex   = similarity( $pattern, $_, $limit );
34     print qq(Processing $File::Find::name ( $sindex )...\n)
35        if $opt_v && -t;
36     push( @files, { name => $File::Find::name, sindex => $sindex } )
37        if $sindex >= $limit;
38  }
Lines 3 through 5 load several Perl extension modules, used by the 
            script Getopt::Std [11] that facilitates parsing the command-line 
            arguments, File::Find [12] that provides the framework for 
            traversing directory structures and, finally, String::Similarity 
            [9] that provides the functionality for fuzzy pattern matching. Line 
            7 sets a special variable $| to a non-zero value, which turns 
            off the IO buffering and forces the flush after every print 
            statement on currently selected output channel. Line 9 sets up the 
            variable that contains the usage string for afind; this string 
            is output by lines 11 through 16 in case the script is invoked with 
            wrong or insufficient command-line arguments. Lines 11 and 12 call 
            the getopts routine from the Getopt::Std module in order 
            to parse the command-line arguments. This routine takes a single argument 
            -- a string that contains all allowed command-line options, in 
            our case single character options l, v, and h. 
            The switches that take arguments, such as l, are followed by 
            colons. On completion, the getopts routine sets up variables 
            for every command-line argument. For afind, three option variables 
            will be set -- $opt_v, $opt_h, and $opt_l. 
            Because $opt_v and $opt_h do not take arguments, they 
            will be assigned a Boolean true or false value based on whether the 
            command-line switches for these options are supplied. $opt_l, 
            however, if set, will contain the value for the similarity limit parameter. The getopts routine stops processing command-line arguments 
              as soon as it encounters an argument that does not look like a command-line 
              switch (i.e., does not have the dash as a first character). Therefore, 
              after getopts returns, the @ARGV array is expected 
              to contain at least one argument (the search string) because the 
              starting directory is optional. Lines 13 and 14 validate this condition 
              by checking the @ARGV array, which in scalar context evaluates 
              to the number of remaining command-line arguments. Finally, lines 
              15 and 16 print the "usage" string and exit the program 
              if the "help" option is set.
              Lines 18 through 23 perform some rudimentary validity checking 
              of input arguments. Line 18 ensures that the similarity limit value, 
              if supplied, is a valid percentage (i.e., an integer between 0 and 
              100). If no valid value is supplied, the limit is set to a default 
              value of 75% (0.75). Line 19 checks if the starting directory command-line 
              parameter is present, otherwise the default starting point is set 
              to the current working directory. Line 20 simply obtains the value 
              of the command-line search pattern argument and lines 22 and 23 
              validate the starting directory.
              Line 25 invokes the find routine from the File::Find module. 
              This routine traverses the directory hierarchy recursively starting 
              from the directory, identified by the second parameter, the $dir 
              variable. For every directory entry encountered by File::Find, 
              it invokes a user-supplied subroutine; reference to this subroutine 
              is passed to File::Find as a first parameter.
              This subroutine, called wanted (see lines 32 through 38), 
              performs the bulk of the work. Every time it is called, File::Find 
              sets the default input variable $_ to the base name of the 
              directory entry being processed. The subroutine calls "similarity" 
              function, which is part of the String::Similarity module, 
              to calculate the similarity index between the search pattern, contained 
              in the global variable $pattern and $_, which is set 
              to the current file or directory name. The similarity index ($sindex), 
              calculated by the similarity function, may take values between 0 
              and 1, where 1 means that two string arguments are identical. The 
              third parameter to the similarity function ($limit) 
              is optional; it sets the minimum similarity that the input string 
              arguments must satisfy.
              Based on this parameter, the similarity function stops 
              analyzing its input arguments as soon as it determines that the 
              result is likely to fall below the limit. In this case, the returned 
              similarity index will be invalid but lower than the limit. Therefore, 
              setting the limit may significantly boost the performance of the 
              algorithm, especially when comparing long strings. Lines 34 and 
              35 of the wanted subroutine simply check whether the verbose 
              output is requested and output the full name of the current file 
              or directory, contained in the $File::Find::name variable, 
              along with the calculated similarity index. This feature comes handy 
              while debugging the script or checking the results of the similarity 
              search.
              Finally, lines 36 and 37 save the name and the similarity index 
              of those files (and directories) that satisfy search criteria (i.e., 
              have the similarity index that exceeds the search limit). File names 
              and corresponding indexes are saved into an anonymous hash with 
              two keys, name and sindex, and the reference to this 
              hash is pushed into a global @files array.
              Upon the completion of the File::Find::find function, the 
              script iterates through the global @files array that contains 
              one entry for every file or directory that satisfies the search 
              criteria. The array is sorted in descending order by the similarity 
              index using a user-defined anonymous sort subroutine [13]. Each 
              iteration of the foreach loop at line 27 assigns the reference 
              to a hash, containing the current file information to the variable 
              $file. Then, line 28 outputs the full name of the file along 
              with its similarity index. Running afind on the system with 
              the following command:
              
             
afind /etc sesytem
produces the following output:  
             
0.77 /etc/system
While changing the similarity limit from the default of 75% to 50%:  
             
afind -l50 /etc sesytem
yields the following:  
             
0.77 /etc/system
0.62 /etc/setmnt
0.57 /etc/lp/Systems
0.57 /etc/uucp/Systems
agrep The agrep script is an example of another application 
            of an approximate string-matching technique to the well-known computing 
            problems. This script works in a way similar to a traditional UNIX 
            grep utility -- it searches input files for occurrences 
            of a particular string, specified by a command-line argument.  The script requires at least two command-line arguments -- 
              the search pattern and the name of the file to search for occurrences 
              of the pattern. Multiple file names or filenames with wildcards 
              may be supplied. Note that the agrep script relies on a shell 
              to perform the wildcard expansion; if the wildcard arguments are 
              quoted so that the expansion does not take place, the script will 
              process these arguments as exact file names.
              The behavior of the script and the amount of output can be controlled 
              via three command-line switches: -l, which specifies the 
              similarity search limit similar to the parameter of the afind 
              script; -i, which causes the search algorithm to ignore the 
              case of the letters while comparing; and -n, which directs 
              the script to only output the names of the files that contain the 
              search pattern instead of printing the actual matches (similar to 
              the -l option of a traditional grep).
              The following is a complete source code for the agrep script:
              
             
1  #!/usr/local/bin/perl
2
3  use Getopt::Std;
4  use String::Approx 'amatch';
5   
6  $| = 1;
7   
8  $usage = qq(Usage: agrep -linh search_pattern [files...]\n);
9   
10  die $usage
11    unless  getopts( qq(l:inh) );
12  die $usage
13     unless @ARGV;
14  die $usage
15     if $opt_h;
16
17  $limit   = ( $opt_l =~ /^(\d{1,2}|100)$/ ) ?
18     q($opt_l%) : q(15%);
19  $pattern = shift;
20
21  foreach $file ( @ARGV ) {
22     next if -B $file;
23     open( _F, $file ) || do {
24        print STDERR qq(Can't open file $file - $!.\n);
25        next;
26     };
27     @lines = <_F>;
28     foreach $match ( amatch( $pattern,
29        [ $limit, $opt_i && qq(i) ], @lines ) ) {
30        print qq($file: ), $opt_n ? qq(\n) : $match;
31        last
32           if $opt_n;
33     }
34     close( _F );
35  }
Lines 3 through 15 are very similar to the corresponding code of the 
            afind script with one exception -- instead of loading the 
            String::Similarity [10] module, the agrep uses another Perl 
            extension for approximate string matching, String::Approx [9]. 
            This module exposes numerous functions, however, the agrep script 
            will only use one of them -- amatch. The similarity search limit parameter, taken by the String::Aprox 
              module, is different from the corresponding parameter of the String::Similarity 
              module. It is expressed either as an absolute number of allowed 
              "edits" (i.e., k-differences measure) or a percentage 
              value of the number of allowed "edits" relative to the 
              length of the string (i.e., 10% means that every 10th character 
              may be an "edit"). For the sake of consistency, agrep 
              uses the percentage values to control the similarity limit with 
              the default set at 15%.
              Line 21 of the script iterates over the @ARGV array that 
              contains the names of the files to be searched for the occurrences 
              of the pattern. Every iteration of the loop assigns the name of 
              the file from the @ARGV to the variable $file. Line 
              22 skips to the next iteration of the loop if the current file is 
              a binary; for the sake of simplicity we excluded the processing 
              for binary files from the script. Lines 23 through 27 open the current 
              file for read access and load the file data into a global array 
              @lines. Finally, the foreach loop at line 28 invokes 
              the amatch function of the String::Approx module, 
              collects the matches one-by-one into the $match variable, 
              and prints them out.
              The amatch function is the simplest of the functionality, 
              exposed by the String::Approx. It takes a search pattern 
              as its first parameter, then a number of optional modifiers and 
              an optional array of input strings. The third parameter, the array 
              of inputs, may be omitted, in which case, the amatch will 
              assume that the default input $_ contains the string to be 
              matched. In list context, this function returns all input lines 
              that match the pattern; in scalar context, it simply returns a true 
              or false value based on whether any of the inputs match the pattern.
              The behavior of amatch can be controlled via a number of 
              optional modifiers, passed as a second parameter. One of the modifiers 
              is the similarity limit, discussed previously. As mentioned before, 
              the limit here is a total number of "edits" -- insertions, 
              deletions, and substitutions allowed in the search. This, however, 
              can be controlled at a much finer level by supplying explicit numbers 
              for each type of "edit". For instance, "I10%", 
              "D20%", and "S5%" will allow for 10% of insertions, 
              20% of deletions, and 5% of substitutions, respectively. Another 
              modifier, used by agrep, is the case-sensitivity modifier 
              i, which turns off case-sensitive matching. Other modifiers 
              are available. For instance, the exact positions within the input 
              argument where the search should begin and end can be controlled 
              via initial_position and final_position modifiers. 
              Running agrep on the system:
              
             
agrep sysshm /etc/system
produces the following output:  
             
/etc/system: *                exclude: sys/shmsys
/etc/system: set shmsys:shminfo_shmmax=1610612736
/etc/system: forceload: sys/shmsys
Bumping up the similarity limit from the default 15% to 25%:  
             
agrep -l25 sysshm /etc/system
will yield the following:  
             
/etc/system: *ident   "@(#)system    1.15    92/11/14 SMI" /* SV
/etc/system: * root device and root filesystem configuratio
/etc/system: *    the root filesystem) rather than at first
/etc/system: set shmsys:shminfo_shmmax=1610612736
/etc/system: forceload: sys/msgsys
/etc/system: forceload: sys/shmsys
/etc/system: forceload: sys/semsys
...
The output of agrep is not sorted by relevance or similarity 
            index simply because the amatch function does not return the 
            calculated edit distance. However, another function of the String::Approx 
            module, adist, is specifically designed to calculate and return 
            the edit distance between the input string and the pattern. Also, 
            function aslice may be directed to return the calculated edit 
            distance via optional modifier minimal_distance. This function 
            is especially useful as it provides the starting position and the 
            length of the match.  Summary
              Approximate string matching is an extremely powerful technique, 
              quickly gaining popularity within the modern software development 
              community. More and more often we see the new software products, 
              reaping the benefits of those fuzzy searching algorithms, which 
              up until a few years ago were thought of as a subject for a Ph.D. 
              in Computer Science rather than something that had any practical 
              application. Languages like Perl make these algorithms easy to use, 
              which leaves no excuse for not exploring the power of the approximate 
              text matching capabilities.
              We hope that this article stimulated your interest, and we encourage 
              you to become familiar with Perl text-processing extension modules. 
              Apart from the two modules discussed here, CPAN offers some more: 
              stemming and inflection modules, such as Text::Stem and Lingua::EN::Inflect, 
              which may be very useful for spell checking applications; and Text::Soundex, 
              which encapsulates the soundex algorithm implementation and others. 
              Altogether, these modules will enable you to produce powerful and 
              user-friendly software applications.
              References
              1. R. Karp and M. Rabin. Efficient randomized pattern-matching 
              algorithms. IBM Journal of Research and Development, 31:249-260, 
              1987.
              2. R. Boyer and J. Moore. A fast string-searching algorithm. Communications 
              of the ACM, 20:762-772, 1977.
              3. J. Friedl and A. Oram. Mastering Regular Expressions: Powerful 
              Techniques for Perl and Other Tools. O'Reilly & Associates. 
              1997.
              4. D. Knuth. The Art of Computer Programming: Sorting and Searching, 
              Vol. 3. Addison-Wesley, 1998.
              5. V. I. Levenshtein. Binary codes capable of correcting deletions, 
              insertions and reversals. Sov. Phys. Dokl., 6:707-710, 1966.
              6. G. Gonnet and R. Baeza-Yates. Handbook of Algorithms and 
              Data Structures. Addison-Wesley, 1991.
              7. S. Wu and U. Manber. Fast text searching allowing errors. Communications 
              of the ACM, 35:83-91, 1992.
              8. J. Hietaniemi. String::Approx, Latest Version 3.15, 
              Feb, 2000. CPAN directory JHI.
              9. M. Lehmann. String::Similarity, Latest Version 0.02, 
              Nov 2000. CPAN directory MLEHMANN.
              10. E. Myers. An O(ND) Difference Algorithm and its Variations. 
              Algorithmica, Vol 1. No. 2. 251-266, 1986.
              11. G.Sarathy. Getopt::Std, Latest Version - Perl 5.6.1, 
              Jan 2001. CPAN directory GSAR.
              12. G.Sarathy. File::Find, Latest Version - Perl 5.6.1, 
              Nov 2000. CPAN directory GSAR.
              13. L. Wall, T. Christiansen, R. L. Schwartz. Programming Perl. 
              O'Reilly & Associates, 1996.
              Alexander Golomshtok is a professional consultant who, for 
              the last decade, has been hanging around downtown New York developing 
              large-scale software systems and infrastructure solutions for Wall 
              Street firms. He can be reached at: golomshtok_alexander@jpmorgan.com.
              Yefim Nodelman is a seasoned systems administrator with seven 
              years of professional experience in supporting UNIX and Windows 
              NT ystems. He can be reached at: nodelman_yefim@jpmorgan.com.
           |