Keeping Your Web Content in Sync

Adam Olson

This article is all about keeping the content in your Web server farm synchronized with rsync. rsync is a very handy program that provides a simple way to mirror content across a number of machines. I'll show how to design a straightforward content push system to keep front-end Web server content synchronized. There are plenty of ways to utilize a program like rsync; this is just one of them.

Obtaining and Building rsync

The current version of rsync is 2.4.6; rsync was written by Andrew Tridgell and Paul Mackerras. Download the compressed source tarball from http://rsync.samba.org. I ran the following commands on a system running Solaris 7, and the compilation went smoothly:

# gzip -dc rsync-2.4.6.tar.gz | tar xvf -
# cd rsync-2.4.6
# ./configure
# make
# make install

This installs the rsync binary in /usr/local/bin and also installs the man pages. You will need to go through this process on every host involved, because rsync must be present on both the sending and the receiving side.
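
Since each host needs its own copy, a quick check after installation can save confusion later. The following is only a sketch: it assumes the default /usr/local/bin location from the steps above, and the function name check_rsync is made up for illustration.

```shell
#!/bin/sh
# Report whether an rsync binary exists and is executable at the given path.
# The default path assumes the stock "make install" location, /usr/local/bin.
check_rsync() {
  path=${1:-/usr/local/bin/rsync}
  if [ -x "$path" ]; then
    echo "ok: $path"
  else
    echo "missing: $path"
    return 1
  fi
}
```

Running check_rsync on each host, followed by /usr/local/bin/rsync --version, also confirms that all hosts ended up with the same release.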

More on Our Goal

One example of a cookie-cutter Web tier is a design in which a number of front-end Web servers all serve identical content, and everything else is handled via calls to a back-end database of some kind. Traffic is load balanced across the Web servers using a method such as DNS round robin or, if possible, a hardware solution. Because the Web servers all have the same content tree, using rsync to maintain these trees from a central distribution point provides a clean and easy way to maintain the content.

More on rsync

Why does rsync work so well in this configuration? Here are some of the key factors:

1. You can use ssh as the underlying transport mechanism. This means you get added security without a lot of extra work. ssh handles all of the authentication, which is a lot better than leaving it up to a clear-text protocol like rlogin.

2. Entire filesystems or individual directories can be updated, which makes it easy to mirror your document root and its subdirectories to a number of destination hosts.

3. It preserves symbolic and hard links, ownership, permissions, etc. For example, if rsync is preserving file ownership, the UIDs of the transferred files will remain the same instead of being owned by the account initiating the transfer.

rsync also includes an algorithm for determining which portions of a file need to be synchronized, so it can be more efficient over slow transmission lines. Personally, I don't usually benefit from this feature because high-bandwidth paths are increasingly common. As the following example shows, I am more concerned with getting our hosts synchronized than with doing it in the most efficient manner. If you are interested in learning more about the rsync algorithm, a detailed description is provided in the distribution.

Let's Do Some Syncing

I'll now walk through how to build a basic configuration that can be expanded to support a multitude of hosts. The following is an example of using ssh to transfer the files. You need ssh (http://www.ssh.com) installed on both hosts, or you can use rsh.

The central distribution point will be located on a host named dev, and our front-end Web server will be on a host named www1. The distribution root on dev will be located at /usr/local/webroot, and the document root on www1 will be located at /usr/local/webroot as well.

The basic command to synchronize www1 to dev looks like this:

dev# rsync -vrlHpog --delete --rsh=/usr/local/bin/ssh /usr/local/webroot/ www1:/usr/local/webroot/

Here is a breakdown of this command that shows what each part does:

-v -- Run in verbose mode. Displays the files being transferred, as well as statistics on how much data was written, read, and how long it took.
-r -- Recurse into directories.
-l -- Preserve soft links.
-H -- Preserve hard links.
-p -- Preserve permissions.
-o -- Preserve owner.
-g -- Preserve group.
--delete -- Delete any files on the destination host that do not exist on the distribution host. Without this option, files that have been removed from the content in a new revision will linger on the front-end Web servers, which could conceivably have bad effects on your application.
--rsh=/usr/local/bin/ssh -- The path to ssh.
/usr/local/webroot/ -- The local content source directory.
www1:/usr/local/webroot/ -- The remote host and the document root on that host.
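
Because --delete removes files on the destination, it can be worth previewing a push before running it for real. The sketch below builds the same command shown above and, when asked, adds rsync's -n (dry run) option, which makes rsync report what it would do without changing anything. The function name sync_cmd and its "preview" argument are illustrative only and are not part of rsync.

```shell
#!/bin/sh
# Print the rsync command for one host. With "preview" as the second
# argument, add -n so rsync only reports what it would transfer or delete.
sync_cmd() {
  host=$1
  mode=${2:-push}
  opts="-vrlHpog"
  if [ "$mode" = "preview" ]; then
    opts="${opts}n"
  fi
  echo "rsync $opts --delete --rsh=/usr/local/bin/ssh /usr/local/webroot/ $host:/usr/local/webroot/"
}

# Show the preview form of the command for www1.
sync_cmd www1 preview
```

Note that the trailing slash on the source directory matters: it tells rsync to copy the contents of /usr/local/webroot/ rather than creating a webroot directory inside the destination.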

Another argument you may use often is --exclude. For example, adding --exclude="*.log" or --exclude="*.old" would exclude any file ending in .log or .old from being pushed to the front-end Web servers. Log files or backups made while on the development server are of little use when synchronized into production. For a list of all the arguments to rsync, run rsync without any arguments or check out the man page.
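
Several --exclude options can be given on one command line. As a rough sketch, the helper below turns a list of patterns into the corresponding options and prints the resulting command for www1; the name exclude_opts is hypothetical, and patterns that themselves contain single quotes are not handled.

```shell
#!/bin/sh
# Build one --exclude option per pattern. Each pattern is wrapped in
# single quotes in the output so the shell hands it to rsync unexpanded.
exclude_opts() {
  out=""
  for pat in "$@"; do
    out="$out --exclude='$pat'"
  done
  # Drop the leading space before printing.
  printf '%s\n' "${out# }"
}

# Print the full push command with both example patterns excluded.
echo "rsync -vrlHpog --delete $(exclude_opts '*.log' '*.old') --rsh=/usr/local/bin/ssh /usr/local/webroot/ www1:/usr/local/webroot/"
```

Quoting the patterns is the important part: an unquoted *.log would be expanded by the local shell against the current directory before rsync ever sees it.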

Sprucing It Up

Typing the command discussed above works well when you are dealing with only a few front-end Web servers. Even then, it is always easier to write a script to do it for you! I am always happier when I have eliminated repetitious tasks.

Here is a basic script that gets the job done. A useful addition, if you use RSA authentication in your ssh setup, is to add support for ssh-agent so a passphrase only needs to be entered once:

#!/usr/local/bin/perl

#
# a basic script utilizing rsync that will synchronize
# content to a number of front end servers.
#
# adamo@humboldt1.com 10/31/00
#

use strict;
use warnings;

#### DEFINE ####

# array of servers, add your hosts here.
my @servers = qw(www1 www2 www3 www4 www5 www6 www7 www8);

# distribution directory
my $distdir = "/usr/local/webroot/";

# destination directory
my $destdir = "/usr/local/webroot/";

#### END ####

foreach my $server (@servers) {

  print "Initiating content synchronization on $server.\n";

  # Pass the arguments as a list so that no shell is involved and the
  # command does not have to be continued across lines inside a string.
  system("/usr/local/bin/rsync", "-vrlHpog", "--delete",
         "--rsh=/usr/local/bin/ssh", $distdir, "$server:$destdir");

  # $? holds the exit status of rsync; zero means success.
  if ($? == 0) {
    print "Content synchronization successful on $server.\n";
  } else {
    print "Content synchronization failed on $server.\n";
  }
}
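
The ssh-agent addition mentioned before the script can be approximated with a small shell wrapper rather than by changing the Perl code. This is a sketch under a few assumptions: ssh-agent and ssh-add are on the PATH and accept a command to run as arguments (as OpenSSH's ssh-agent does), and the script above has been saved under the hypothetical name sync.pl.

```shell
#!/bin/sh
# Run a command under a temporary ssh-agent. ssh-add prompts for the key
# passphrase once; the agent exits when the command finishes.
with_agent() {
  # The inner "sh" fills $0, so "$@" inside the -c string is the command.
  ssh-agent sh -c 'ssh-add && exec "$@"' sh "$@"
}

# Example (hypothetical script name):
#   with_agent ./sync.pl
```

Because the agent only lives for the duration of the command, the decrypted key is not left available after the push completes.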

Conclusion

This article covered a relatively painless way to keep the content on your front-end Web servers synchronized. The same approach can be expanded to synchronize content across a wide range of other services as well. rsync's seamless integration with ssh and its ability to mirror entire directory trees while keeping permissions and ownership intact make it a good solution to the problem of content management.

Adam Olson has helped build a successful ISP (http://www.humboldt1.com), designed and configured portions of the California Power Network while working at MCI WorldCom, and is currently working for a startup in the Silicon Valley (http://www.quaartz.com). Adam hopes to be sailing a lot soon. He can be contacted at: adamo@humboldt1.com.