Poor Man's Search Engine
Br. David Carlson
              What do you do if you want a keyword search engine on your Web 
              site but have no funds for this? Some freeware solutions are available 
              but may also deliver annoying ads. However, if your Web site is 
              not heavily used, you may be able to develop your own "quick 
              and dirty" solution. This article shows how I handled this 
              problem in Linux. The result was an open-source, freeware search 
              engine called QSearch. It consists primarily of a bash shell script 
              and a compiled CGI program written in C++. Although our server uses 
              Red Hat Linux and the Apache Web server, the software may be adaptable 
              to other settings.
              What's the Plan?
I began by looking at the data. A site search engine must allow the user to look up Web pages by specifying one or more keywords
              or phrases. Web pages often use tags such as the following at the 
              top of each file. These tags give a description of the page and 
              keywords by which the page can be retrieved:
              
             
<META NAME="DESCRIPTION" CONTENT="Lab Problem Report Form">
<META NAME="KEYWORDS" CONTENT="Problem Report,Lab Problem,Problem">
A CGI (Common Gateway Interface) script could be used as a search 
            engine, but if it had to search through all of the HTML files on a 
            Web site to find the above type of META tags each time someone did 
            a search, it would probably run too slowly. We may not need blazing 
            speed, but we do want results before the user gives up on our search 
            engine. Greater speed could be obtained by collecting the data from 
            these META tags and saving it all in a single file, which could then 
            be more quickly searched by a CGI script. The following example shows 
            the format that was used for this text file:  
             
/java.html|Java Information#Java#Java Information#CS 310#CS310#
/jobs.html|Computing Career Links#Career Links#Career#Careers#
/itwd/milestones.html|Milestones for Grant Project#Milestones#ITWD#
/carlsond/cs321/web/javascript.html|Notes on JavaScript#JavaScript#
Each line contains the information about a particular Web page. The 
            part before the pipe symbol (|) is the URL, minus the invariant 
            leading section, which was http://cis.stvincent.edu in my case. 
            Between the "|" and the first "#" 
            symbol is the description of the Web page, and between each neighboring 
            pair of "#" symbols, there is a keyword or phrase. 
            This format allows us to later find a particular keyword or phrase 
and to distinguish it from the description and URL.

Thus, I hatched a plan to periodically run a bash script to collect the data and place it in a text file. To do a search, users can access an HTML form where they can fill in their keyword(s). Clicking the submit button on the form sends the desired keyword(s) to a second bash script, the CGI search engine. This second script was later replaced by a compiled program that provides better security and speed. For the moment, however, we will look at the scripts because they show what processing needs to be done.
              The Shell Scripts
              Harvesting the Data
              The getmeta script (Listing 1) gathers the data and saves 
              it in a file called "keywordfile". This script can automatically 
              run every night so that as users add Web pages and adjust the META 
              tag information, the file is kept up to date. (Listings for this 
              article are available from: www.sysadminmag.com.)
              You must decide which Web files to include in the scope of the 
              search engine. The getmeta script uses the find command 
              to scan all files with names of the form *.html that are 
              in the directory whose name is in the TARGET variable. By default, 
              TARGET contains /www. The script also scans all Web files 
              in the directory trees that begin with any of the subdirectories 
              named in the SUBDIRS variable. On this system, I only wanted to 
              descend into the directory trees for a few particular users or projects. 
              This section should obviously be adjusted to suit each particular 
              Web site.
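As a rough illustration of the file discovery (Listing 1 is the authoritative version; the -maxdepth restriction and the temporary file name here are my own reading of it), the scan might look something like this:

TARGET="/www"                            # default top-level Web directory
SUBDIRS="/www/itwd /www/carlsond"        # example subdirectory trees to descend into
TMP=/tmp/getmeta.$$                      # temporary list of filenames (see below)

# Top-level *.html files in $TARGET, then a full descent into each tree in $SUBDIRS.
find "$TARGET" -maxdepth 1 -name '*.html' > "$TMP"
for dir in $SUBDIRS
do
   find "$dir" -name '*.html' >> "$TMP"
done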
              The getmeta script writes the names of the specified Web 
              files to a temporary text file, whose name is stored in the variable 
              "TMP". A file is used because the amount of data might 
              be fairly large if many Web files are processed. The filenames are 
              then read, one at a time, from this temporary file using a loop 
              that has its input redirected to come from the file. The rough outline 
              of this loop is as follows:
              
             
while read filename
do
   Process the filename as desired
done < $TMP
This is a pattern with many useful applications. The script processes 
            each filename by first checking whether users have read access, as 
            there is no sense in scanning a file that users cannot read. The cut 
            command is used to look for the "r" permission in the correct 
            column of a long listing for the file. The column number might need 
to be adjusted to fit with the long listing format on your system.

Next, the script uses grep to get the lines of each file 
              that contain a tag starting with the string "<meta ". 
              The output is piped to another grep, which keeps only those 
              lines that also contain ="keywords". This output 
              is placed in the KEYS variable. If the data in this variable 
              is nonzero in size, a similar grep is used to extract the 
              description string. The data in KEYS is then piped into a 
              cut that extracts the third field where "=" is 
              used as the delimiter. This skips over the NAME= and CONTENT= 
              part of the line, giving the keywords section that follows the second 
              "=". Then the translate command (tr) is 
              used to replace the commas by # symbols and to delete the 
              "> that ends the meta tag line. The # symbols 
              are used to give us the file format already described above for 
              keywordfile. The description data held in the DESCRIP variable 
              is then refined in a similar way.
              Following this, cut is used to skip over the first few 
              characters of the filename. When using the default settings, where 
              all HTML files are under the /www directory, this amounts 
              to skipping over the initial /www and thus keeping the characters 
              in column 5 onward. We do not want the /www as it will not 
              be part of the final URL for this file. The line starting with the 
              echo command is then used to send the modified filename, 
              description, and keywords to another temporary file. Note that the 
              "|" symbol is inserted to separate the filename 
              from the description and that the keywords are surrounded by # 
              symbols.
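Pulling those steps together, the body of the loop amounts to something like the following simplified sketch (Listing 1 is the authoritative version; the name $TMP2 for the second temporary file and the exact permission column are my own choices here):

# Sketch of the per-file processing for one $filename from the loop.
PERM=$(ls -l "$filename" | cut -c8)      # world-read bit; column may differ on your system
if [ "$PERM" = "r" ]
then
   # Keep only the META line that names the keywords.
   KEYS=$(grep -i '<meta ' "$filename" | grep -i '="keywords"')
   if [ -n "$KEYS" ]
   then
      DESCRIP=$(grep -i '<meta ' "$filename" | grep -i '="description"')
      # Field 3 (delimiter =) is the CONTENT value; commas become #, quote marks and > are dropped.
      KEYS=$(echo "$KEYS" | cut -d '=' -f 3 | tr ',' '#' | tr -d '">')
      DESCRIP=$(echo "$DESCRIP" | cut -d '=' -f 3 | tr -d '">')
      # Drop the leading /www (START=5) so that only the URL path remains.
      URLPATH=$(echo "$filename" | cut -c5-)
      # Write one keywordfile line, stripping any stray carriage returns (see below).
      echo "${URLPATH}|${DESCRIP}#${KEYS}#" | tr -d '\015' >> "$TMP2"
   fi
fi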
              The tr -d '\015' may not be needed on some servers. It 
              removes carriage returns that got inserted when users edited their 
Web files from Windows-based machines. This can happen on servers that run the Samba software (http://www.samba.org), which allows a UNIX or Linux machine to imitate an NT server. By removing the 
              carriage returns, we get a proper UNIX text file. At the very end, 
              the getmeta script copies the newly harvested data over the 
              top of any existing keywordfile and removes its temporary files.
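That final step is essentially the following (again using my $TMP2 name for the second temporary file):

cp "$TMP2" "$KEYFILE"                    # replace any old keywordfile with the fresh data
rm -f "$TMP" "$TMP2"                     # clean up the temporary files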
              Matchmaker, Matchmaker
              The search program began as a bash script (Listing 2). It is a 
              CGI script, so your Web server must be configured to allow CGI scripts 
              to run. This script is the program that receives the data from the 
              Web-based form when the submit button is clicked. See Listing 3 
              for the search.html file. It contains the form with three 
              text boxes that the user can fill in so as to specify up to three 
              keywords or phrases for which to search. Actually, the data submitted 
              from the form is sent (as one long URL-encoded string) to the uncgi 
              program, which parses the data and places it into environment variables 
              that start with the characters "WWW_". You can 
              see that uncgi is specified in the following line of the 
              search.html file:
              
             
<FORM METHOD="POST" ACTION="../cgi-bin/uncgi/search">
The CGI script can then access the data in these variables, here named 
            WWW_Key1, WWW_Key2, and WWW_Key3. The uncgi 
            program, normally installed in the cgi-bin directory, thus breaks 
            apart a URL-encoded data string such as the following:  
             
Key1=Java&Key2=Java+script&Key3=VB+script
This is the type of data string sent from the user's Web browser. 
It would be awkward to handle this string directly in the script. Instead, the FORM tag above (with its POST method) specifies that the string is sent to the uncgi program, which conveniently places the data 
            into three separate environment variables that the CGI script can 
            then easily access. It is as if we did the following assignments:  
             
WWW_Key1="Java"
WWW_Key2="Java script"
WWW_Key3="VB script"
The uncgi program is available at a number of Web sites (e.g., 
http://www.prw.net/support/cgi/uncgi.htm).

The first section of the search script sets up some variables 
              and increments a counter to show that the search engine has been 
              accessed one more time. The count itself is kept in a text file 
              that can be examined whenever you want to know how much your search 
              engine has been used. The count could also be displayed on the Web 
              page showing the results of each search, if desired.
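The counter logic itself can be as simple as this sketch (search.count is the count file created during installation, described later):

COUNTFILE="search.count"                 # text file holding the usage count
COUNT=$(cat "$COUNTFILE")                # read the old count
COUNT=$((COUNT + 1))                     # add one for this search
echo "$COUNT" > "$COUNTFILE"             # write it back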
              Next, the script figures out which of the WWW variables contain 
              nothing (zero bytes) with the -z test. The values in the 
              variables are copied as needed so that all three variables contain 
              a value, where duplicates are used to fill in for an empty value. 
              For example, if the user enters C++ in a text box and leaves the 
              other two boxes blank, the script copies C++ into the variables 
              corresponding to these two other boxes.
              The grep -i command is then used to do a case-insensitive 
              search in keywordfile for the keyword in the first WWW variable. 
              The # symbols are used to be sure that any match is for a 
              keyword and not a word that simply appears in a description or URL. 
              The results of this grep are piped into a second grep 
              that looks for the keyword in the second WWW variable. The output 
              of this last command is piped into a third grep that looks 
              for the keyword given by the last WWW variable. Thus, we get just 
              those lines of the keywordfile that contain all three keywords in 
              the keyword section of the line. This data is written to a temporary 
              file.
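In outline, the duplicate filling and the chained greps might look like the following sketch; the exact grep patterns in Listing 2 may differ, and KEYFILE and TMP stand for the keywordfile and the temporary results file:

# Copy a nonempty value over any empty one so that all three variables hold something.
if [ -z "$WWW_Key1" ]; then WWW_Key1="$WWW_Key2"; fi
if [ -z "$WWW_Key1" ]; then WWW_Key1="$WWW_Key3"; fi
if [ -z "$WWW_Key2" ]; then WWW_Key2="$WWW_Key1"; fi
if [ -z "$WWW_Key3" ]; then WWW_Key3="$WWW_Key1"; fi

# Chain three case-insensitive greps; requiring a preceding # keeps each match
# inside the keyword section rather than the URL or description.
grep -i "#[^#|]*$WWW_Key1" "$KEYFILE" | grep -i "#[^#|]*$WWW_Key2" | \
   grep -i "#[^#|]*$WWW_Key3" > "$TMP"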
              The search script now outputs what needs to be sent to the user's 
              Web browser to display a page about any matches that were found. 
              We first print out the string Content-type: text/html, followed 
              by a blank line. Then we output the information about matches, marked 
              up with appropriate HTML tags. To make things easier, the initial 
              lines of HTML are copied from the file named by the HEAD 
              variable.
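That much is only a few lines of script, roughly:

echo "Content-type: text/html"           # required header for a CGI response
echo ""                                  # blank line ends the headers
cat "$HEAD"                              # canned HTML for the top of the results page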
              The -z test is used to see if there was no keyword for 
              which to search. If so, an error message is displayed on the user's 
              Web page. Next, we reuse our favorite loop pattern, with input redirected 
              to come from the temporary file of matches:
              
             
while read item
do
   Process item as need be
done < $TMP
Note that the number of lines (matches) is counted in the MATCHES 
            variable. The filename is extracted from each item (line) by using 
            cut -d "|" -f 1, which obtains the first field that is delimited 
            by the | symbol. The description field is extracted in a similar 
            way, though two uses of cut are needed: one to extract the second 
            field delimited by |, and the other to get the first field 
delimited by #.

The data for each match is then written out as a list item in 
              an ordered list. The URL portion is written out as a clickable link, 
              with TOPURL preceding the filename so as to give a complete 
              URL. The TOPURL variable contains the value ("http://cis.stvincent.edu" 
              on our server) that must precede all Web filenames on the system. 
              Note that if Web files on your system are scattered under users' 
              home directories, then this will not work. The description is written 
              out immediately after the URL. The Web server automatically sends 
              this output to the user's browser.
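As a sketch, the body of that loop might look like this (only item, MATCHES, and TOPURL are names taken from the script; the other names are illustrative):

# Sketch of the per-match processing inside the while loop.
MATCHES=$((MATCHES + 1))                              # count this match
FILE=$(echo "$item" | cut -d '|' -f 1)                # URL path before the |
DESCRIP=$(echo "$item" | cut -d '|' -f 2 | cut -d '#' -f 1)   # description field
# One entry of the ordered list: a clickable link, then the description.
echo "<LI><A HREF=\"$TOPURL$FILE\">$TOPURL$FILE</A> $DESCRIP</LI>"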
              Finally, the search script checks the number of matches to see 
              whether it was zero so that a message can be printed for that special 
              case. Some closing HTML is written out, and the script is done.
              Security Concerns
              Although the above search script worked fine, there were a couple 
              of concerns. CGI scripts are often susceptible to hacking attempts, 
              such as supplying bad input like the following:
              
             
C++ | cat /etc/passwd
A hacker could submit this via the form in search.html with 
            the hope that the | symbol would cause the script to execute 
            the command to cat out the password file. This would show the 
            login IDs for all users, allowing the hacker to try a dictionary attack 
            on user passwords. Although this type of attack did not seem to work 
            on our system, it might be possible for someone to find an attack 
            that worked. Information on setting up the Apache Web server to reduce 
            security problems can be found at the Apache Web site (http://www.apache.org). 
            Information about CGI security problems can be found in "Safer 
            CGI Scripting" by Charles Walker and Larry Bennett, Sys Admin, 
February 2001.

Another concern is that a script is interpreted and runs more 
              slowly than a compiled program. Because our search engine was not 
              being run often and the response time was brief, this was not a 
              big concern. Still, usage might increase in the future, so it could 
              help to have a compiled program as the search engine. Because it 
              is also easier with a compiled program to weed out bad input that 
              might indicate an attack, I decided to switch to a compiled C++ 
              program.
              Using a Compiled CGI Program
              The search.cpp program performs the same overall task as 
              the search CGI script. However, the GetValue function, found 
              in stringhelp.cpp, does some additional processing to reject 
              bad input. This function gets the value of a WWW environment variable. 
              The function is careful not to overflow the Result array 
              when copying characters from the environment variable. It also only 
              copies alphanumeric characters, the + sign, the - 
              sign, the period character, the space character, and the NULL character 
              that marks the end of a string. At the first sign of any other character 
              (such as |, or other hacker favorites), the function quits 
              and returns the empty string in Result. You can adjust the 
              code in GetValue if you want to allow additional characters, 
              but be careful what you allow. The rest of the C++ program will 
              not be examined here as it does the same processing as the old search 
              script.
              Configuration and Installation
              QSearch can be downloaded from my Web site:
              
             
http://cis.stvincent.edu/carlsond/software/software.html
A Readme file is included to explain configuration and installation 
issues. The main steps are covered here.

What do you need in order to use this software? It may be a reasonable 
              choice if you do not have a huge number of Web pages and the search 
              engine will not be heavily used. You need to have the g++ 
              compiler to compile the C++ program. You also need to have the uncgi 
              program installed, and the Web server must be configured to run 
              CGI programs. The software assumes that all Web pages to be searched 
              are located under a common directory, such as the default /www 
              directory. The META tags containing the keywords and descriptions 
              must fit the format mentioned at the start of this article.
              Edit the getmeta script to adjust the following four lines 
              for your situation:
              
             
TARGET="/www"
START=5
SUBDIRS="/www/itwd /www/carlsond /www/carrc /www/morrisoh /www/hicksb"
KEYFILE="/www/cgi-bin/keywordfile"
Note that TARGET should indicate the directory under which 
            all of your Web files are located. The START number must be 
            one more than the number of characters in this TARGET string. 
            SUBDIRS should hold a string containing any subdirectories 
            of the TARGET directory into which you want to descend to find 
            Web files. Finally, KEYFILE should give the location for the 
            keywordfile that getmeta generates, a location within the directory 
for CGI programs.

Edit the search.cpp file and adjust the following lines 
              as needed:
              
             
#define TOPURL     "http://cis.stvincent.edu"
#define HEAD       "head.html"
#define KEYFILE    "keywordfile"
#define COUNTFILE  "search.count"
The first line should give the initial common part of all URLs on 
            your Web site. The second gives the name of the HTML file that is 
            to be displayed as the top part of the search-results Web page. A 
            sample head.html file is supplied. The third and fourth lines 
            probably do not have to be changed, although it is important that 
            KEYFILE contain the exact name for the text file created by the getmeta 
            script. The only other item that you might want to modify is the following 
            line of the file stringhelp.h:  
             
const int StrMax = 800;
This should be set to a reasonable maximum string length. Remember 
            that each line of META tag data is stored in one of these strings. 
            If your Web pages contain META tag lines with long descriptions 
            or long lists of keywords, it is possible that you will need a longer 
StrMax.

Now start the compiler with:
             
g++ search.cpp stringhelp.cpp -o search -s
This should produce an executable search program. Then log in as root 
            and change the permissions and ownership of the getmeta script 
            as follows:  
             
chmod 700 getmeta
chown root.root getmeta
Use root's crontab entry to schedule getmeta to 
            be run once a day:  
             
crontab -e
The crontab entry might look like the following if you want 
            to run getmeta at 5:45 a.m. every day. Adjust the path to getmeta 
            as needed:  
             
45 5 * * * /usr/local/bin/getmeta
Move head.html and your compiled search program to the directory 
            for your CGI programs. Change the permissions and ownership as follows:  
             
chown root.root head.html search
chmod 755 search
chmod 644 head.html
In this same CGI directory, create a file named search.count 
            and place the number 0 into it. This file will contain the count of 
            how many times the search engine has been used. You can create this 
            file as follows:  
             
echo 0 > search.count
chown root.root search.count
chmod 666 search.count
You may be able to be more restrictive here. For example, if your Web server runs as user webmaster, change the last two commands to:
             
chown webmaster.users search.count
chmod 644 search.count
Then ordinary users cannot change the file, although they can read 
it.

Place the search.html file in any directory of Web pages 
              you wish. You can edit this file to add graphics or other enhancements. 
              Adjust ownership and permissions as follows:
              
             
chown root.root search.html
chmod 644 search.html
Add a link to search.html in those Web pages where you wish 
to provide access to the search engine.

You are now ready to take the search engine for a test drive. 
              Run getmeta once manually as root to produce a keywordfile. 
              This can be done by entering ./getmeta while in the directory 
              that contains this script. Then use a Web browser to look at the 
              search.html file and try out the search engine.
              Possible Improvements
              One enhancement is to allow spaces after the commas separating 
              keywords in the META tags. This was not done here in order to keep 
              the programming simple. It might also be desirable to allow more 
              general types of searches than just exact matches for keywords, 
              although that would also complicate the programming. The getmeta 
              script could probably be written more compactly in Perl. You could 
              even write your own search daemon if you know sockets programming. 
              However, such a complex programming project would take us away from 
              the original goal of having a simple, easy-to-produce search engine. 
              Refer to:
              
             
http://cis.stvincent.edu/carlsond/cs330/unix/unix.html
for my UNIX Web page with links to UNIX and Linux information, including 
            tables and examples on writing shell scripts. Refer to:  
             
http://cis.stvincent.edu/carlsond/swdesign/
for my Web pages on C++ programming, which might be helpful in modifying 
the search program.

Br. David Carlson is a Benedictine monk as well as chairperson 
              and associate professor in the Computing & Information Science 
              Department at Saint Vincent College. When his primary jobs allow, 
              he can often be found doing systems administration on his department's 
              Linux server. He can be reached at: carlsond@stvincent.edu.