Web Hosting: A Migrational Case Study
Ripduman Sohan
              Hosting, the act of providing a service on behalf of an individual 
              or company, is a concept that has been around for as long as the 
              Internet. There are many types of hosting services, including Web, 
              mail, and database hosting. However, the most popular and longest-lived 
              hosting service has been Web site hosting.
              Many organizations, such as universities, commercial companies, 
              and ISPs, provide this essential service for their users or customers. 
Today, the Web is the most popular medium for retrieving information from the Internet. To ensure your material in this information cornucopia is readily available, it's essential to configure your end to handle anything your users may throw at it, without inconveniencing them or creating extra work for you.
In this article, I present a case study: the migration of 203 virtual hosts, many of which had backend databases, from one server to another. The Web server was Apache and the database MySQL, both running on FreeBSD and being transferred to Linux. I intend to show you how simply this can be done and share some of the tricks and pitfalls generally involved in setting up, running, and successfully migrating medium- to large-scale Web sites with this software. I've covered virtual hosting because that's what the original job entailed, and because I wanted to be as thorough as possible. Nevertheless, almost all of the concepts in this article should be adaptable to single Web sites and different software with little or no tweaking.
              The Scenario
The source system was a box running FreeBSD 3 on a Pentium II 266 located in San Francisco. It was connected to the Internet via a 256-Kbps link and was running Apache 1.3.1. Of the 203 virtual hosts, 30 required databases, so it also had MySQL 2.3 installed. The machine setup was incompetent -- so incompetent that the actual MySQL database was available directly off the Web. It also had no backups. The target machine was a brand new, default-installation Red Hat 6.3 machine on a T3 link in New York. I didn't have physical access to either machine and was working over a satellite link with a 700-ms lag.
              The reason for the changeover was twofold. The company was increasingly 
              aware of the insecurity and lack of power of the source machine 
              in relation to their increasing customer base, and they were also 
              getting a better deal with a new co-location provider. My job was 
              to move the whole system, with zero downtime and no loss of client 
              data.
              
            The Move
              Backup
The move started with the most important thing -- a system backup! I couldn't back up the user Web files or databases due to the high load on the system; as soon as I touched any of them, the machine became highly erratic. With 2.3 GB of user data on the system and no local means of backup, I wasn't going to transfer it all over the Internet link either. Therefore, I initially backed up just the httpd.conf file and the password and group files. Before you start a migration of your own, I advise you to check that the latest system backup is valid. You do have one, right?
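For a configuration-only backup like this, a couple of scp invocations from a remote machine are enough. A minimal sketch, assuming the stock FreeBSD locations (your Apache configuration may well live elsewhere):

# Copy just the critical configuration off the loaded source box.
# Paths are FreeBSD defaults and may differ on your system.
scp source:/usr/local/etc/apache/httpd.conf backup/
scp source:/etc/master.passwd source:/etc/group backup/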
              Analysis
              The next move was to download and install analog. This is a very 
              well-known and comprehensive Web log analyzer, available as a package 
              for most platforms. You can get started quickly with the following 
              steps:
              
1. Install analog. Use your package manager -- usually rpm -i analog.rpm on Linux.
2. Edit analog.cfg, usually found at /etc/analog.cfg. Point the LOGFILE section at your Web server's logfile and the OUTFILE section at your desired output filename.
3. Turn on the hourly report by adding the command FULLHOURLY ON to the analog.cfg file.
4. Run the binary, usually /usr/bin/analog.
              
This will create a full hourly breakdown report from your log, which you can view with any browser. I did this to build a time profile, so I would know the best time to log into the machine and copy the data. Most Web servers go through a daily cycle of use, depending on the time zone of their audience, and it's best to work when the load average is lowest to minimize disruption to the system. If you can do the migration with downtime, or without affecting the service, go ahead and skip this step.
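To make the steps concrete, here is a minimal analog.cfg sketch; the two paths are examples and should point at your own log and document tree:

# /etc/analog.cfg -- paths below are examples; adjust to your system
LOGFILE /var/log/httpd/access_log
OUTFILE /var/www/html/report.html
FULLHOURLY ON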
New System Build
After working out my optimal timing, I built the new server. If you choose to run a dedicated machine as a server, a little forethought in design can go a long way toward preventing problems. Your most important resources on a Web server are memory and disk space, so work on maximizing those. Make your Web data partition as large as possible and put it on a separate drive if necessary. Most default Linux server installations come with setups that are not really optimal for Web servers -- do you really need X Windows? What about the GIMP? Get rid of all unnecessary software. This usually frees up to 800 MB and makes software conflicts less likely. On most Red Hat-compatible Linux distributions, you can use the following commands to work with installed packages:
              
              rpm -qa -- Provides the full list of installed packages
              rpm -qi packagename -- Information on a selected package
              rpm -e packagename -- Removes the selected package
              
Next, trim memory usage. Disable (or replace with lighter equivalents) all services that don't have to be running: atd, bind, dhcpd, and Sendmail (replaceable by ssmtp) are the usual candidates. You can usually remove the packages outright or just pull their init scripts out of the startup directory. Also ensure that you have adequate swap space (usually as much as available RAM). Swap space, at least in Linux, comes in two variants: partition or file. Use the partition type -- that way, if your swap space gets corrupted, your filesystem won't be.
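On a Red Hat-style system, the trimming might look like the following sketch; the service names are the usual candidates named above, so check chkconfig --list to see what your machine actually starts:

# Stop unneeded services from starting at boot
chkconfig atd off
chkconfig named off
# Alternatively, move the init symlink out of the runlevel directory
# (the S-number prefix varies by release)
mv /etc/rc.d/rc3.d/S40atd /etc/rc.d/disabled.S40atd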
I then upgraded the "key software" -- the software on which system functionality depends. This is usually Apache and its related support software. It's worth using the latest stable version of your key software that you deem fit for consumption. (My personal method of choosing an Apache version is to use Netcraft to see what a large, busy site such as http://www.slashdot.org is running.) If you have any modules used by Apache (e.g., PHP), it's worth getting their latest versions too. This is also the time to install any third-party software you want to use; ProFTPD (the Professional FTP Daemon) and OpenSSH are popular options in this respect. If you're using a system with a package manager and you don't need any non-standard options (i.e., those requiring a source compile), get the installable package: any issues between that particular OS and the software will already have been worked out by the package maintainers.
              Performance Tuning
              The crux of the matter is configuring your software so it performs 
              well. Although I could write several articles on how to configure 
              your system, I'll only give you the main ideas behind maximizing 
              performance for Apache and, to some extent, the system.
Apache is the machine's interface to the world. Configure it poorly, and your beautiful new server with its oceans of RAM and your T3 link won't be worth anything.
              To configure it well, first, get rid of all unnecessary Apache 
              modules and add any custom ones you do want. You can do this by 
              editing the httpd.conf file and looking for lines similar 
              to:
              
             
LoadModule info_module        modules/mod_info.so
This tells Apache to load the module info_module into memory when it starts. Go through each LoadModule line of the default installation and disable every module you'll never need. (This is done by putting a # sign at the front of the line.) Typical modules rarely used on production Web servers are mod_autoindex (which creates automatic indexes for directories) and libproxy.so (the proxy caching module). You can find the complete description of each standard module in the Apache documentation. Trimming modules minimizes the memory Apache uses, because fewer loaded modules mean less memory allocated to module code. Sometimes, disabling modules can also lead to server speed increases.
If you're using Perl as a scripting language on the server, consider loading mod_perl. This eliminates starting a new instance of Perl every time a script runs, which means better response times from the server.
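As an illustration, the trimmed module section of an Apache 1.3 httpd.conf might look like the sketch below. The module names and paths follow a typical DSO build and may differ on yours; on Apache 1.3, remember to comment out the matching AddModule lines as well:

# Disabled -- not needed on this production server
#LoadModule autoindex_module  modules/mod_autoindex.so
#LoadModule proxy_module      modules/libproxy.so
# Kept -- this server's scripts use mod_perl
LoadModule perl_module        modules/libperl.so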
Next, I always modify the lines StartServers, MinSpareServers, and MaxSpareServers. These lines go together. To understand them, remember that creating a process is quite expensive in terms of time on most operating systems, and in Apache, each process is known as a server. Hence, you'll want to start a reasonable number of processes when the program starts up (the StartServers line), while keeping a reasonable number of idle processes ready to serve incoming requests before any additional ones must be created (the MinSpareServers line). Conversely, you don't want to waste memory on spare processes lying about after they've finished their work (the MaxSpareServers line). I find the values 8, 4, and 10 work well for most setups.
The next lines to modify are MaxClients and MaxRequestsPerChild. MaxClients sets the maximum number of clients that can connect to the server simultaneously. A larger number allows more concurrent connections but degrades performance under load; a smaller number means the opposite. A good compromise is a value of 200. MaxRequestsPerChild sets the number of requests each process can handle before it is forced to die. This prevents errant processes (e.g., one that leaks memory) from hogging system resources. If you're confident everything works well, you can set this value to zero, so that children never expire, for that little extra boost in performance.
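Putting the suggestions above together, the relevant httpd.conf lines would read:

StartServers          8
MinSpareServers       4
MaxSpareServers      10
MaxClients          200
# 0 = children never expire; use only once you trust every module and script
MaxRequestsPerChild   0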
              As a trick, you can use the above parameters to provide a limited 
              service while you perform maintenance or migration work. When I 
              migrated my data, I restarted Apache on the source machine with 
              a single server and a MaxClients of 50. This allowed users 
              to still get (some) service while I had a more usable machine.
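In httpd.conf terms, the limited-service configuration looked something like this (a sketch of the idea rather than the exact file):

# Cut-down settings on the source box for the duration of the copy
StartServers      1
MinSpareServers   1
MaxSpareServers   1
MaxClients       50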
You can also turn off hostname lookups (the HostnameLookups line). This stops Apache from looking up and logging the DNS name of each connecting client, rather than just its IP address. Finally, you should avoid providing server-side includes (.shtml files), because they force Apache to parse each page it sends and make those pages uncacheable.
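The lookup change is a one-liner in httpd.conf:

HostnameLookups Off
# ...and leave server-parsed HTML disabled, i.e., no line like:
# AddHandler server-parsed .shtml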
Regarding the system itself, two things are consequential to performance. The first is the maximum number of open files you can have at any one time. In Linux, you can raise this limit by writing to the file /proc/sys/fs/file-max. The command echo 16384 > /proc/sys/fs/file-max will increase the maximum number of open files to 16384. For ease of administration, you can put this command in one of your startup scripts.
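For example, appending it to rc.local (the usual catch-all startup script on Red Hat-style systems) makes the setting survive a reboot:

# /etc/rc.d/rc.local -- raise the system-wide open-file limit at boot
echo 16384 > /proc/sys/fs/file-max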
              The second thing you can do is to rebuild the kernel, cutting 
              out all the unnecessary drivers. This frees memory and makes the 
              kernel leaner and, therefore, faster. However, if you're working 
              off a remote link, be certain the kernel works before deployment. 
              Having an exact replica locally, in terms of software and hardware, 
              may help here.
              Data Migration
              Data migration is relatively painless if carried out properly. 
              The first thing to do is to make sure all your user accounts have 
              been duplicated with the correct logins and passwords. If you're 
              migrating between homogeneous machines, just copy the relevant password 
              and shadow files across. Otherwise, you may have to manually migrate 
              accounts, but this is dependent on source and destination systems. 
              Most shadow systems now have interchangeable password files, but 
check first. I was lucky: all the account information for the virtual hosts (including passwords) had been assigned by the company, so I made a text file with the user information in it and used the Linux newusers command to create all the new users.
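newusers reads passwd-style lines, one account per line. A small sketch, with hypothetical names and values:

# users.txt -- name:password:uid:gid:gecos:home:shell (all values are examples)
jdoe:Secr3t:1001:1001:Customer Site:/home/jdoe:/bin/false
asmith:Passw0rd:1002:1001:Customer Site:/home/asmith:/bin/false

newusers users.txt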
              Remember when migrating user accounts to make sure that all the 
              accounts on the new server have the same group identifier (GID) 
              and user identifier (UID) as on the old system. This prevents permission 
              and ownership problems when you copy the data across. It may also 
              be beneficial to create separate groups for different account classes. 
              For example, I have separate groups for the Web sites that connect 
              to databases, and for those that don't.
After the users are set up on the system, you need to create or migrate the Apache configuration for each virtual host. An easy way of keeping your virtual host configuration separate is to put it in its own file; for example, the directive Include conf/vhosts.conf in your httpd.conf lets you keep the additional configuration directives (the virtual hosts) in vhosts.conf. This makes a migration easy -- just modify the file for your new configuration and include it in your new setup.
              You must ensure that all virtual hosts have their own transfer 
              and error log files. This is handy for the customers, because it 
              allows them to maintain and analyze their own log information. It's 
              handy for you, because it frees you of the same task.
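A sketch of one entry in such a vhosts.conf, with placeholder names and address, showing the per-host logs:

# conf/vhosts.conf -- pulled in from httpd.conf via: Include conf/vhosts.conf
NameVirtualHost 192.0.2.10

<VirtualHost 192.0.2.10>
    ServerName   www.example.com
    DocumentRoot /home/jdoe/public_html
    TransferLog  logs/example.com-access_log
    ErrorLog     logs/example.com-error_log
</VirtualHost>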
              After this, it's smooth sailing. All that's left is 
              archiving the data off the old server and restoring it on the new 
              one. There are several ways to do this. My favorite method requires 
              both machines to have OpenSSH installed. Then use the following 
              command, carried out in the data directory of the source server:
              
             
tar -cf - * |(ssh -l username destination.host.com tar -xvpf -)
This archives all the data on the source server and unarchives it at the destination, all in one command. Nevertheless, if you feel so inclined, go through the tar, copy, untar cycle instead. I recommend you don't change the Web data directories when moving the files, in case of any hard-coded paths.
Databases should also be migrated at this point. With MySQL, the process is easy. The basic steps are dumping the data to a text file, copying the file to the new server, creating the database on the new server, and importing the data into the database. A quick, typical example is:
              
             
oldserver$ mysqldump dbname > outfile    (dump the database dbname to file outfile)
oldserver$ scp outfile newserver:        (copy outfile to the new server using secure copy)
newserver$ mysqladmin create dbname      (create the database dbname on the new server)
newserver$ mysql dbname < outfile        (import the data from file outfile)
Testing
After migration, you're almost there -- but do things
              work? The last thing you want is to change your DNS entries to point 
              to the new server, or create new DNS entries only to realize that 
              things don't work. However, you can't check to see whether 
              things work unless you move the DNS entries!
              Fortunately, there are a number of solutions to this problem. 
              The most comprehensive one I use is the following:
              
              1. Create a DNS server on an extra machine with fake records that 
              indicate the new server is the Web server for all the virtual hosts 
              you are hosting on it.
              2. Find a set of machines on the same network as the fake DNS 
              server that will be used for testing. Point their primary DNS server 
              to this server.
              3. Surf the virtual host sites to see whether they work.
              
This works because the fake DNS server claims to be authoritative for the virtual hosts' domains and, hence, hands the substitute records to the client machines. The client machines then contact the new server and request data from it. The advantage of this approach is that a number of machines can be used for testing simultaneously.
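Assuming the fake server runs BIND, one zone per hosted domain is enough. A minimal sketch, where the names and the 192.0.2.20 address are placeholders for the new server, loaded from named.conf as the master zone for example.com:

$TTL 3600
@     IN  SOA  ns.example.com. hostmaster.example.com. (
                  2003010101 ; serial
                  3600       ; refresh
                  900        ; retry
                  604800     ; expire
                  3600 )     ; minimum TTL
      IN  NS   ns.example.com.
@     IN  A    192.0.2.20
www   IN  A    192.0.2.20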
A less elaborate but easier option is to have one machine on the network configured with no DNS server but with a modified hosts file (/etc/hosts on Linux) pointing at the virtual hosts. If your server machine itself needs to do some sort of host lookup, this option is very useful, because it allows you to check that the whole system works before migrating any records.
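The hosts file on the test machine is just a few lines (addresses and names are placeholders):

# /etc/hosts on the test machine
192.0.2.20   www.example.com example.com
192.0.2.20   www.example.org example.org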
Once you're satisfied that everything works, change the DNS settings to reflect the new server. The job is now done. Keep the old server active for some time after the changeover, because DNS propagation takes time (usually a few days, but sometimes up to a month). You can then gradually retire it and recycle or destroy it.
              Security
              You can never be too safe on a production server. A few things 
              to check are:
              
              
             
- Don't give users shell access to the system if they don't require it. Make their shells /bin/false.
- If you allow CGI scripts, vet them to ensure they aren't malicious.
- If you use a custom Web-based administration tool, make sure it's secure.
- If the main access to your system is via ftp, consider using ProFTPD. This is an excellent ftp server with lots of security features, including the ability to lock users into their home directories, apply quotas, allow logins without a valid shell, and control the maximum number of concurrent user logins.
- As much as possible, use OpenSSH to access your system. Disable telnet and any other services you don't need.
- Periodically check and update your software for any discovered security vulnerabilities.
- Use TCP Wrappers to control access to services (see the sketch after this list).
- Periodically monitor your log files for any suspicious activity.
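For TCP Wrappers, a default-deny policy is the safest starting point. A sketch, assuming your daemons are run through tcpd or built against libwrap (the network below is a placeholder):

# /etc/hosts.deny -- refuse everything not explicitly allowed
ALL: ALL

# /etc/hosts.allow -- then open only what you need, to whom you need it
sshd: 192.0.2.0/255.255.255.0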
Conclusion
As you know, Web servers are integral but sensitive parts of today's Internet. I hope this article has given you some insight into how important it is to plan ahead when preparing or migrating a Web server. I have also tried to show how simple a setup or migration can be, with little or no downtime, if the planning is done well.
Links
Analog Web log analyzer -- http://www.analog.cx
SSMTP (send-only Sendmail emulator) -- http://rpmfind.net/linux/RPM/contrib/libc6/i386////ssmtp-2.38-1.i386.html
Netcraft -- http://www.netcraft.com
ProFTPD (Professional FTP Daemon) -- http://www.proftpd.net
OpenSSH -- http://www.openssh.com/
Ripduman Sohan is currently finishing a degree in Software Engineering at City University, London. He is originally from Kenya, where he is still based, and has been using and promoting *nix-based systems since he was 14 years old.