The Art of Spidering
 Reinhard Erich Voglmaier
              Every Webmaster will encounter a robot sooner or later. He or 
              she will typically find proof of the activities of "strange 
              browsers" in logfiles. So, what is a robot? A robot (also known 
              as a spider) is a procedure that visits Web sites with the objective 
              of gathering information. The robot does not limit itself to getting 
              the information from just one Web page, but also tries to get the 
              links mentioned in this page. If it kept going in this way by following 
              all the links, it would eventually spider the whole Internet. This 
              means the robot needs limits, defined in its configuration file, that 
              tell it where to stop; I'll discuss these later in the article.
              Sometimes the activities of robots are welcome, inasmuch as they 
              provide important information about the content of the spidered 
              site to potential users. Sometimes, however, the visits are not wanted, 
              especially when they begin to occupy a large bandwidth, penalizing 
              the traffic for which the Web site was originally intended. For 
              this reason, there are so-called robot rules that "well-behaving" 
              robots obey. This good behavior is called "netiquette" 
              and must be programmed into the robot or spider.
              In this article, I will explain how to write a spider that completely 
              mirrors a Web site. Note that I use the words spider and robot interchangeably, 
              which is how you will find them used in the literature as well. The first 
              script described in this article copies simple Web pages from a 
              remote site to a local site. I will also integrate a parser that 
              extracts the hyperlinks contained in the copied Web pages and show 
              how to complete this approach with a stack object that handles the 
              download of the documents needed. I will also look at what software 
              exists, because we do not want to reinvent the wheel. These programs 
              are all available from the Perl sites (http://www.perl.org 
              and http://www.perl.com).
              All the examples in this article are written in Perl, which is standard 
              on UNIX operating systems but also available for VMS, Win32, and other 
              architectures. The auxiliary libraries are available from CPAN (http://www.cpan.org/). 
              To locate them, I highly recommend using the search engine located 
              on the server of the University of Winnipeg:
              
             
http://theoryx5.uwinnipeg.ca/CPAN/cpan-search.html
Copying Single Pages
               To copy a single Web page, we must set up a couple of things. 
              We first arrange for a connection to the Web server on which the 
              page we want to copy lives. This may also involve contacting a proxy 
              server, authenticating ourselves to the proxy server, and setting 
              up some parameters for the connection. Fortunately, there is a library 
              (LWP) available on CPAN that does what we need. The script in Listing 
              1 copies a page from a remote Web server to the local file system. 
              (Listings for this article can be found on the Sys Admin Web site 
              at: http://www.sysadminmag.com.)
              The first lines of Listing 1 set up the remote page and the location 
              where the page should go. Line 7 sets up the most important data 
              structure, called the UserAgent. It holds all information about 
              the connection. For now, we'll use the simplest form. The request 
              is very easy to write; we want to "GET the Remote Page", 
              so the request reads:
              
             
HTTP::Request->new('GET', $RemotePage)
Once we've defined a request object, we can fetch the page with 
            the method request(). This method simply copies the RemotePage 
            to the LocalPage.
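              Listing 1 itself lives on the Sys Admin site; a minimal sketch along 
              the same lines (the URL and file names below are only placeholders) 
              might look like this:

#!/usr/bin/perl -w
# Sketch of a Listing 1-style script: copy one remote page to a local file.
use strict;
use LWP::UserAgent;
use HTTP::Request;

my $RemotePage = 'http://www.example.com/index.html';   # page to copy (placeholder)
my $LocalPage  = '/tmp/index.html';                      # where to store it (placeholder)

my $ua  = LWP::UserAgent->new;                           # the UserAgent object
my $req = HTTP::Request->new('GET', $RemotePage);        # the GET request

# Passing a file name as the second argument makes request() save the content there.
my $res = $ua->request($req, $LocalPage);
print $res->is_success ? "Copied $RemotePage\n"
                       : "Failed: " . $res->status_line . "\n";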
Copying Single Pages Realistically
              The previous situation was easy but not very realistic, because 
              there was no firewall. We got the page quickly, and no user credentials 
              were required. Listing 2 shows how to extend the first approach. 
              In lines 15-18, we set up the name of the proxy server our spider 
              should use and the domains that do not need proxying. We define 
              the name of our browser to let the spidered site know what's 
              happening. Finally, we set the time-out after which we will abort 
              the script.
              When we get the page, we check the exit status of the request. 
              We therefore have to use the HTTP::Status library. We don't 
              copy the page to the local file system, but keep it in memory for 
              later use. Along with the status code (I use the code() method), 
              we also get a human-readable explanation of success or failure 
              using the message() method. These are not the only options 
              the UserAgent understands, however. For a complete list, look at 
              the man pages delivered with the UserAgent library.
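              Listing 2 holds the full script; the proxy, agent-name, and time-out 
              settings might be sketched roughly as follows (the proxy host, domain, 
              and agent string are placeholders):

use strict;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Status;                      # symbolic names for the status codes

my $ua = LWP::UserAgent->new;
$ua->proxy(['http', 'ftp'], 'http://proxy.example.com:8080/');  # proxy to use
$ua->no_proxy('example.com');          # domains that do not need proxying
$ua->agent('MySpider/0.1');            # browser name reported to the spidered site
$ua->timeout(30);                      # abort the request after 30 seconds

my $res = $ua->request(HTTP::Request->new('GET', 'http://www.example.com/'));
if ($res->is_success) {
    my $HTMLPage = $res->content;      # keep the page in memory for later use
} else {
    print "Error ", $res->code, ": ", $res->message, "\n";
}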
              Parsing the Web Pages
              In many cases, what we've already done is enough. On my own 
              Web sites, I have a number of pages that contain information mirrored 
              from other Web sites. But because we want to get not only a single 
              Web page, but a whole site, the spider must follow the links contained 
              in the downloaded pages. We need an HTML parser, and we have at 
              least two options. The first option is to use Perl's built-in 
              regular expressions. Because HTML has a well-defined 
              syntax for describing references (essentially, links to other documents, 
              images, or similar objects included in Web pages), this is not very 
              difficult. The second option is to use a link extractor module 
              from CPAN, as I mentioned before.
              I will show how to code the parse process by hand. Remember, 
              the goal is to mirror the Web site on our local file system, so 
              a user clicking on a link on our local server expects the 
              documents referenced in that page (images, for example) to be 
              on the same server. Thus, the parser must transform the links to 
              work on the local system as well. In Perl, this looks like:
              
             
$HTMLPage =~ s/RegularExpression/TransformURL()/eig
The switch "e" means that we want to substitute the 
            regular expression with the return value of the function. The "i" 
            switch ignores case; "g" expresses that we want to 
            have the command executed not just for the first regular expression, 
            but for all of them found in the document.  In this example, we are interested in Links and Images, so we 
              will scan for expressions like:
              
             
<a href="./to_another_document.html" > Click here </a>
<img src="./pictures/One.jpg" .... ...... .....  >
The easiest form of the regular expression to match these links is:  
             
s#<(a href=)"([^"]+)"#TransformURL($1,$2)#eig
s#<(img src=)"([^"]+)"#TransformURL($1,$2)#eig
(If you need a short introduction to regular expressions, I recommend 
            one of the Perl books or sites such as http://www.perl.com/pub/p/Mastering_Regular_Expressions.)
              These two expressions will not find all links. For example, if there 
              is a space between "href" and "=", 
              the link will not be matched, so you will need to put in something like 
              this:
              
             
\s*
which means zero or more spaces. You will need more lines to match 
            background, images, sound, and so on. I recommend beginning with this 
            simple structure and then adding more features in order to catch some 
            obvious syntax errors. In the listings available from www.sysadminmag.com, 
            you will find what I'm using on my site.
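              As a rough sketch of how these pieces fit together (the TransformURL 
              body below is purely illustrative and assumes a particular remote host; 
              the version I actually use is in the listings):

# Illustrative sketch only: rewrite href/src attributes to a local, relative form.
sub TransformURL {
    my ($attribute, $url) = @_;        # e.g. ('a href=', './pictures/One.jpg')
    my $local = $url;
    $local =~ s#^http://www\.example\.com/#/#;   # assumed remote host: make it site-relative
    # ... the real function would also record the URL for later download ...
    return qq{<$attribute"$local"};    # rebuild the matched text with the new URL
}

$HTMLPage =~ s#<(a\s+href\s*=)\s*"([^"]+)"#TransformURL($1,$2)#eig;
$HTMLPage =~ s#<(img\s+src\s*=)\s*"([^"]+)"#TransformURL($1,$2)#eig;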
Memory Structures
              With the previously explained parsing mechanism, we can now get 
              all the pages, images, and other objects referenced in the downloaded 
              pages. If the spidered Web site does not reference external sites, 
              we get the whole Web site. But even if this condition is true, there 
              are still other problems. What if several pages are referenced more 
              than once or if two pages reference each other? We would then simply 
              end up in a loop.
              This means that we need memory structures to keep track of which 
              pages to visit and which pages have already been downloaded. There 
              are four arrays holding the data for the download decisions:

               IncludedURLs 
               ExcludedURLs 
               VisitedURLs 
               ToVisitURLs

              The first two arrays define the name space in which the spider is working. 
            IncludedURLs contains the list of URLs we want to get, and ExcludedURLs 
            lists the pages we don't wish to be considered. The other 
            two arrays contain housekeeping information: the URLs that have already 
            been downloaded (VisitedURLs) and the URLs our spider still 
            needs to download (ToVisitURLs). The last array is a stack filled by the TransformURL 
            function. When filling the array, the TransformURL function consults 
            the other three arrays as well. See Listing 3.
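              Listing 3 holds the real implementation; a simplified sketch of how 
              TransformURL might consult these arrays before queueing a URL (the 
              helper name and the matching logic here are illustrative) could look like:

# Illustrative sketch: decide whether a newly found URL should be queued.
my (@IncludedURLs, @ExcludedURLs, @VisitedURLs, @ToVisitURLs);

sub queue_url {                        # hypothetical helper called from TransformURL
    my ($url) = @_;
    return if grep { $url eq $_ } @VisitedURLs, @ToVisitURLs;    # already seen or queued
    return unless grep { $url =~ /^\Q$_/ } @IncludedURLs;        # outside our name space
    return if grep { $url =~ /^\Q$_/ } @ExcludedURLs;            # explicitly excluded
    push @ToVisitURLs, $url;           # schedule it for download
}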
              The Request Loop and Helpers
              We can now open an HTML page, examine the pages and images to 
              which it is pointing, and retrieve these pages and images. The dynamic 
              part of the spider and the request loop are still missing. Simply 
              written, it looks like this:
              
             
while ( <condition> ) { getPage(); }
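             Fleshed out slightly (pagesLeft() and getPage() are illustrative helper 
             names built on the arrays described above), the loop might read as 
             follows; the condition and the helpers are discussed below:

# Illustrative sketch of the request loop built on the arrays and UserAgent above.
use LWP::UserAgent;
use HTTP::Request;
my $ua = LWP::UserAgent->new;

sub pagesLeft { return scalar @ToVisitURLs }      # any URLs still to be fetched?

sub getPage {
    my $url = shift @ToVisitURLs;                 # take the next URL off the stack
    my $res = $ua->request(HTTP::Request->new('GET', $url));
    return unless $res->is_success;
    push @VisitedURLs, $url;                      # remember that we have been here
    # ... parse $res->content, rewrite the links, queue new URLs, save to disk ...
}

while ( pagesLeft() ) { getPage(); }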
The condition indicates whether we need to get pages from the site. 
            Remember that we put all the pages yet to be visited in an array. 
            The array is initialized with the first page we want to spider. Every 
            page referenced will be put in the array, and after a page has been 
            visited, it will be erased from the array. You can make a procedure that 
            says whether the array is empty or not, which is very handy for controlling 
            other conditions, too. The procedure could also dump out the memory 
            structure if the user wanted to let the spider stop and restart work 
            later.
              We need to continuously transform local and remote URLs from relative 
              to absolute and back. If you put the pages on a local Web server, 
              you must also map between the physical paths under which the downloaded 
              pages are stored and the logical paths that appear in the links. 
              For example, if you have a link such as:
              
             
href="webmaster/Java/Intro.html
this may be stored on your file system as:  
             
/disk1/htdocs/webmaster/Java/Intro.html
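              A sketch of such a mapping (the document root and the helper names are 
              assumptions made for illustration) might be:

# Illustrative sketch: map a logical link path to a physical file path and back.
my $DocumentRoot = '/disk1/htdocs';    # assumed local document root

sub link_to_file {                     # hypothetical helper
    my ($link) = @_;                   # e.g. 'webmaster/Java/Intro.html'
    return "$DocumentRoot/$link";      # '/disk1/htdocs/webmaster/Java/Intro.html'
}

sub file_to_link {                     # hypothetical helper
    my ($file) = @_;
    (my $link = $file) =~ s#^\Q$DocumentRoot\E/##;
    return $link;
}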
For this purpose, I have provided the helper functions in the listings.
              Just Existing Software
              It is always worthwhile to look for ready-to-use software. The 
              first tool I recommend is LWP::RobotUA. This provides the same 
              interface as the normal UserAgent library used before, and also provides methods 
              to let you obey the robot rules, thus allowing you to see whether robots 
              are welcome (it consults the robots.txt file). See:
              
             
http://info.webcrawler.com/mak/projects/robots/robots.html
for more information on robot rules.
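              A minimal, hedged example of using LWP::RobotUA (the robot name, 
              contact address, and delay are placeholders):

use LWP::RobotUA;
use HTTP::Request;

# The constructor takes the robot's name and a contact e-mail address.
my $ua = LWP::RobotUA->new('MySpider/0.1', 'webmaster@example.com');
$ua->delay(1);                         # wait 1 minute between requests to the same host

# LWP::RobotUA is a subclass of LWP::UserAgent, so requests work the same way;
# the robots.txt file is fetched and honored automatically.
my $res = $ua->request(HTTP::Request->new('GET', 'http://www.example.com/'));
print $res->is_success ? "Fetched OK\n" : "Not fetched: " . $res->status_line . "\n";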
              Another application is w3mir, powerful Web mirroring software 
              available at:
              
             
http://langfeldt.net/w3mir/
It can use a configuration file and may be useful for your needs. 
            Furthermore, it is well documented. The following example, taken from 
            the w3mir documentation, retrieves the contents of the Sega site through 
            a proxy, pausing for 30 seconds between each document it copies:
             
w3mir -r -p 30 -P www.foo.org:4321 http://www.sega.com/
Another option is Webmirror, available at:  
             
http://www.math.fu-berlin.de/~leitner/perl/WebMirror-1.0.tar.gz
This program has a good man page and is also easy to use. It can use 
            a configuration file in which you specify the more complicated options.
              Although there is existing software, we often want to achieve 
              "special effects", or we may have special needs that make 
              using our own script (for example, as the front end of a search engine 
              spider) a better choice.
              Limits
              I have described a powerful tool capable of copying an entire 
              Web site to your local system given just the initial URL and some 
              rules about where to stop. Clearly, there is other information you 
              may need -- such as configuration of a proxy server. Also keep 
              in mind that real life is never so easy. Let's look at some 
              examples.
              There are a lot of dynamic Web sites around -- I don't 
              mean sites with a lot of images moving and blinking; I mean sites 
              that are constructed depending on the user input, such as forms. 
              Forms offer many possibilities to send user choices to the Web server. 
              To deal with these cases, you'll need to build more intelligence 
              into your spider. The LWP libraries offer the ability to simulate 
              the click on the Submit button of a form, but need to simulate all 
              choices the user has in order to get the complete picture. If you 
              consider the timetable of a railway company or of "Lufthansa" 
              for example, it will give you an idea of what this could mean. It 
              depends heavily on what you expect from your spider; it may sometimes 
              be convenient to exclude these pages. To exclude them, just do nothing. 
              If there's no link on the pages containing the form, your spider 
              will not follow it.
              Cases where a lot of user input is required will cause complications. 
              In the worst case, your spider will continue to send incomplete 
              data, and the Web server will continue to answer with a page, which 
              is what the spider expects. But, you can guess what will happen. 
              Addressing this possibility requires more intelligent stop clauses, 
              a discussion of which is beyond the scope 
              of this article. Another inconvenience is 
              overly generous servers. Have you ever seen a server that, instead 
              of sending one Web page, sends two or more? These situations will 
              also confuse the scripts described above, which are not 
              designed to handle them.
              It is not enough to have the start URL and the stop conditions. 
              You should also have an understanding of what types of pages 
              your spider will encounter. More importantly, you must have a clear 
              understanding of what your spider is used for and which pages it 
              should leave alone.
              Conclusion
              This article explained how to write a robot to automatically get 
              Web pages from other sites onto your computer. It showed how to 
              get just one page or mirror a whole site. This type of robot is 
              not only useful for downloading or mirroring remote Web sites, but 
              is also handy as the front end for a search engine or as a link-checking 
              system: instead of copying the pages to your local site, you check 
              whether the referenced pages or images exist.
              It is good practice to respect the robot rules and be a fair player 
              on the Internet. Avoid monopolizing or blocking Web sites with too 
              many or too frequent requests.
              This article provided one of the possible approaches and is only 
              an example of how you might proceed in developing a stable application 
              for your needs. If there's already an existing application, 
              you may be able to use or extend it. But in any case, you can learn 
              from it.
              Reinhard Voglmaier studied physics at the University of Munich 
              in Germany and graduated from Max Planck Institute for Astrophysics 
              and Extraterrestrial Physics in Munich. After working in the IT 
              department at the German University of the Army in the field of 
              computer architecture, he was employed as a Specialist for Automation 
              in Honeywell and then as a UNIX Systems Specialist for performance 
              questions in database/network installations in Siemens Nixdorf. 
              Currently, he is the Internet and Intranet Manager at GlaxoWellcome, 
              Italy. He can be reached at: rv33100@GlaxoWellcome.co.uk.