Keeping Your Web Content in Sync
 Adam Olson 
              This article is all about keeping the content in your Web server 
              farm synchronized with rsync. rsync is a very handy program that 
              provides a simple way to mirror content across a number of machines. 
              I'll show how to design a straightforward content push system 
              to keep front-end Web server content synchronized. There are plenty 
              of ways to utilize a program like rsync; this is just one of them. 
              Obtaining and Building rsync 
              The current version of rsync is 2.4.6; it was written by Andrew 
              Tridgell and Paul Mackerras. Download the compressed source 
              tarball from http://rsync.samba.org. I ran the following commands 
              on a system running Solaris 7, and the compilation went smoothly: 
              
             
# gzip -dc rsync-2.4.6.tar.gz | tar xvf -
# cd rsync-2.4.6
# ./configure
# make
# make install
This installs the rsync binary in /usr/local/bin, along with 
            the man pages. Repeat this process on every host involved.
              More on Our Goal
              One example of a cookie-cutter Web tier is a design in which a number 
              of front-end Web servers all serve identical content, and the 
              rest is handled via calls to a back-end database of some kind. Traffic 
              is load balanced across the Web servers using a method such as DNS 
              round robin or, if possible, a hardware solution. Because the Web 
              servers all have the same content tree, using rsync to update 
              these trees from a central distribution point provides a clean 
              and easy way to maintain the content.
              More on rsync
              Why does rsync work so well in this configuration? Here are some 
              of the key factors:
              
              1. You can use ssh as the underlying transport mechanism. 
              This means you get added security without a lot of extra work. ssh 
              handles all of the authentication, which is a lot better than leaving 
              it to a clear-text protocol like rlogin.
              2. Entire filesystems or individual directories can be updated, 
              which makes it easy to mirror your document root and subdirectories 
              to a number of destination hosts.
              3. It preserves symbolic and hard links, ownership, permissions, 
              etc. For example, if rsync is preserving file ownership, the 
              transferred files keep their original UIDs instead of becoming owned 
              by the account initiating the transfer.
              
              rsync also includes an algorithm for determining which portions 
              of a file need to be synchronized, so it can be more efficient 
              over slow transmission lines. Personally, I don't usually benefit 
              from this feature because high-bandwidth paths are increasingly 
              common. As the following example shows, I am more concerned 
              with the act of synchronizing our hosts than with doing 
              it in the most efficient manner. If you are interested in learning 
              more about the rsync algorithm, a detailed description is provided 
              in the distribution.
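To make the idea of "which portions of a file need to be synchronized" concrete, here is a deliberately simplified sketch in shell. It is not rsync's actual algorithm: rsync uses a rolling checksum so that it can match blocks at any offset, whereas this sketch only compares fixed-size blocks at the same offsets in two local files. The function name changed_blocks and the default block size of 1024 bytes are made up for illustration.

```shell
#!/bin/sh
# changed_blocks OLD NEW [BLOCKSIZE]
# Print the index (counting from 0) of each fixed-size block of NEW whose
# checksum differs from the block at the same offset in OLD.
# Simplified illustration only; this is not how rsync itself is implemented.
changed_blocks() {
    old=$1
    new=$2
    bs=${3:-1024}
    size=$(wc -c < "$new")
    blocks=$(( (size + bs - 1) / bs ))   # round up to whole blocks
    i=0
    while [ "$i" -lt "$blocks" ]; do
        # Checksum block i of each file; a block past the end of OLD
        # reads as empty and therefore counts as changed.
        a=$(dd if="$old" bs="$bs" skip="$i" count=1 2>/dev/null | cksum)
        b=$(dd if="$new" bs="$bs" skip="$i" count=1 2>/dev/null | cksum)
        [ "$a" = "$b" ] || echo "$i"
        i=$((i + 1))
    done
}
```

Calling changed_blocks with two versions of the same page lists only the blocks that would have to be sent again; everything else could be reused from the copy already on the destination.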
              Let's Do Some Syncing
              I'll now walk through how to build a basic configuration 
              that can be expanded to support a multitude of hosts. The following 
              is an example of using ssh to transfer the files. You need 
              ssh (http://www.ssh.com) installed on both hosts, 
              or you can use rsh.
              The central distribution point will be located on a host named 
              dev, and our front-end Web server will be on a host named 
              www1. The distribution root on dev will be located 
              at /usr/local/webroot, and the document root on www1 
              will be located at /usr/local/webroot as well.
              The basic command to synchronize www1 to dev looks 
              like this:
              
             
dev# rsync -vrlHpog --delete --rsh=/usr/local/bin/ssh /usr/local/webroot/ www1:/usr/local/webroot/
Here is a breakdown of this command that shows what each part does:  
              
             
               -v -- Run in verbose mode. Displays the files being 
                transferred, as well as statistics on how much data was written, 
                read, and how long it took. 
               -r -- Recurse into directories. 
               -l -- Preserve soft links. 
               -H -- Preserve hard links. 
               -p -- Preserve permissions. 
               -o -- Preserve owner. 
               -g -- Preserve group. 
               --delete -- Delete any files on the 
                destination host that do not exist on the distribution host. 
                Without this option, content that has been removed in a new 
                revision will linger on the front-end Web servers, which 
                could conceivably have bad effects on your application. 
               --rsh=/usr/local/bin/ssh -- The path to ssh. 
               /usr/local/webroot/ -- The local content source 
                directory. 
               www1:/usr/local/webroot/ -- The remote host and 
                its local content document root.
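Because --delete removes files on the destination, it can be worth previewing a push before running it for real. The sketch below assumes your rsync supports the -n (--dry-run) option, which reports what would be transferred or deleted without changing anything; check your man page to confirm. The function name preview_push and the RSYNC variable are illustrative names, not part of rsync.

```shell
#!/bin/sh
# Path to rsync; override RSYNC if it lives elsewhere (illustrative variable).
RSYNC=${RSYNC:-/usr/local/bin/rsync}

# preview_push HOST
# Show what a push to HOST would transfer or delete, without changing anything.
preview_push() {
    "$RSYNC" --dry-run -vrlHpog --delete --rsh=/usr/local/bin/ssh \
        /usr/local/webroot/ "$1:/usr/local/webroot/"
}
```

Running preview_push www1 on dev lists the pending changes for www1. Dropping --dry-run turns it back into the real command shown above.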
              Another argument you may use often is --exclude. For example, 
              adding --exclude="*.log" or --exclude="*.old" would 
              exclude any file ending in .log or .old from being 
              pushed to the front-end Web servers. Log files or backups made while 
              on the development server are of little use when synchronized into 
              production. For a list of all the arguments to rsync, run rsync 
              without any arguments or check out the man page.
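As a sketch of how several --exclude patterns fit into the basic command, the function below adds one --exclude option per pattern it is given. The function name push_excluding and the RSYNC variable are illustrative names; the paths and options are the ones used in the example above.

```shell
#!/bin/sh
# Path to rsync; override RSYNC if it lives elsewhere (illustrative variable).
RSYNC=${RSYNC:-/usr/local/bin/rsync}

# push_excluding HOST [PATTERN...]
# Push the content tree to HOST, skipping files that match any PATTERN.
push_excluding() {
    host=$1
    shift
    # Turn each remaining argument into an --exclude=PATTERN option by
    # appending the option form and dropping the original argument.
    n=$#
    while [ "$n" -gt 0 ]; do
        set -- "$@" "--exclude=$1"
        shift
        n=$((n - 1))
    done
    "$RSYNC" -vrlHpog --delete --rsh=/usr/local/bin/ssh "$@" \
        /usr/local/webroot/ "$host:/usr/local/webroot/"
}
```

For example, push_excluding www1 '*.log' '*.old' pushes to www1 while leaving log files and backups behind. Quoting the patterns keeps the local shell from expanding them before rsync sees them.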
              Sprucing It Up
              Typing the command discussed above works well when you are dealing 
              with only a few front-end Web servers. Even then, it is always easier 
              to write a script to do it for you! I am always happier when I have 
              eliminated repetitious tasks.
              Here is a basic script that gets the job done. If you use RSA 
              authentication in your ssh setup, a useful addition is 
              support for ssh-agent, so that a passphrase only needs to be 
              entered once.
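One way the ssh-agent part might look is sketched here as a small shell wrapper. It assumes an OpenSSH-style ssh-agent that accepts a command to run, and an ssh-add that loads your default key; other ssh implementations may differ. The function name with_agent and the SSH_AGENT and SSH_ADD variables are illustrative names.

```shell
#!/bin/sh
# Commands to use; override these if yours are named or located differently
# (illustrative variables).
SSH_AGENT=${SSH_AGENT:-ssh-agent}
SSH_ADD=${SSH_ADD:-ssh-add}

# with_agent COMMAND [ARGS...]
# Start an agent, load the key once with ssh-add (this is where the
# passphrase prompt appears), then run COMMAND under that agent.
# COMMAND is not run if loading the key fails.
with_agent() {
    "$SSH_AGENT" sh -c '"$0" && exec "$@"' "$SSH_ADD" "$@"
}
```

If the synchronization script were saved as ./sync-content.pl (a hypothetical filename), running with_agent ./sync-content.pl would prompt for the passphrase once and then reuse the key for every server in the list. The basic script itself follows: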
              
             
#!/usr/local/bin/perl
#
# a basic script utilizing rsync that will synchronize
# content to a number of front end servers.
#
# adamo@humboldt1.com 10/31/00
#
use strict;
use warnings;
#### DEFINE ####
# array of servers, add your hosts here.
my @servers = qw(www1 www2 www3 www4 www5 www6 www7 www8);
# distribution directory
my $distdir = "/usr/local/webroot/";
# destination directory
my $destdir = "/usr/local/webroot/";
#### END ####
foreach my $server (@servers) {
  
  print "Initiating content synchronization on $server.\n";
  # List form of system: the arguments go straight to rsync without
  # passing through a shell, so the call can span lines safely.
  system("/usr/local/bin/rsync", "-vrlHpog", "--delete",
         "--rsh=/usr/local/bin/ssh", $distdir, "${server}:${destdir}");
  
    # $? holds the exit status of the rsync run; 0 means success.
    if ($? == 0) {
      print "Content synchronization successful on $server.\n";
    } else {
      print "Content synchronization failed on $server.\n";
    }
}
              Conclusion
              This article covered a relatively painless way to keep the content 
              on your front-end Web servers synchronized. It can also be expanded 
              to synchronize content across a wide range of different services. 
              rsync's seamless integration with ssh and its ability 
              to mirror entire directory trees while keeping permissions and ownership 
              intact make it a good solution to the problem of content management. 
              Adam Olson has helped build a successful ISP (http://www.humboldt1.com), 
              designed and configured portions of the California Power Network 
              while working at MCI WorldCom, and is currently working for a startup 
              in Silicon Valley (http://www.quaartz.com). 
              Adam hopes to be sailing a lot soon. He can be contacted at: adamo@humboldt1.com. 