Orchestrating a Kinder, Gentler Disaster
 
Dorian Deane 
When a system crashes, the system administrator's goal
is to return 
the system, as closely as possible, to its original
state. Presumably, 
any system administrator worthy of the title will have
the backups 
required to accomplish this, but how quickly? Magnetic
tape, the medium 
of necessity for most system administrators, is notoriously
slow, 
and that means that a single misstep, such as having
to read a tape 
twice because of poor planning, can cost hours of downtime.
Making 
the situation even worse is the time dilation factor:
that well-known 
phenomenon in which time slows in direct proportion
to the number 
of people waiting for you to finish your task. 
The corruption of a single file system is generally
not a disaster 
-- merely a time-consuming annoyance. My definition
of a disaster 
involves the loss of an entire shared disk or even loss
of the server 
itself. Under these circumstances, the system administrator
had better 
be ready to approach the problem creatively. After months
without 
a major crash, getting the right answers to questions
you had thought 
were easy suddenly becomes critical. How many file
systems from 
the newly deceased disk were absolutely necessary to
get users working 
again? Did that disk provide any swap area? Which clients
imported 
files from it? Can the most important partitions be
restored to another 
disk so that you can worry about the repair process
later? Each one 
of these questions can be answered in some way, but
few of them can 
be answered quickly -- unless you have prepared in advance. 
The information most difficult to recapture from tape
is the underlying 
disk structure. If a disk loses its label, the system
administrator 
may be left floundering as he or she tries to make reasonable
guesses 
about what the users need. Recreating the original disk
label is sometimes 
vital, but not always easy (and almost never quick)
given the information 
stored by a typical backup system -- such as one based
on the standard 
dump and restore programs. Getting the swap partition
right is particularly 
important, and no longer simply a matter of using the
swap area that 
came on your preconfigured system. Swap requirements
get bigger every 
year: where it used to be just Lisp developers who sat
outside your 
door whining for a bigger swap partition, lately even
applications 
written entirely in C are requiring hundred-megabyte
swap areas. 
Leaving disk problems alone for a moment, consider what
happens if 
your main disk server starts spitting smoke and fire
(most likely, 
just as you were getting ready to go home for the evening).
If you 
have a spare machine ready and waiting, you are in an
unusual and 
privileged position. The rest of us end up making an
eleventh-hour 
decision to designate a client as the new server, and
suddenly it 
becomes important to know which clients imported which
file systems. 
Intelligent approximations will get you much of the
way home, particularly 
in a homogeneous environment, but the occasional odd
situation can 
cause real trouble. Again, it's trivial to mount tape
after tape and 
read in every fstab and exports file, but this can take
massive amounts of time. 
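If copies of each client's fstab are kept somewhere other than tape, the import question can be answered in seconds. The following is a minimal sketch, not part of dkmap, assuming the classic fstab layout in which an NFS entry names its file system as server:/path and carries the type nfs in the third field. The function name nfs_imports and the saved-copy path in the usage note are hypothetical.

```sh
#!/bin/sh
# nfs_imports FILE
# Print "server remote_path mount_point" for every NFS entry in an
# fstab-format file. Assumed field order: filesystem, mount point, type, ...
nfs_imports() {
    awk '
        $1 ~ /^#/ { next }               # skip comment lines
        $3 == "nfs" {
            n = index($1, ":")           # split "server:/path" at the first colon
            if (n > 0)
                print substr($1, 1, n - 1), substr($1, n + 1), $2
        }
    ' "$1"
}
```

Run against a saved copy, for example nfs_imports /var/adm/maps/client1.fstab, it lists each imported file system on its own line, which is usually enough to decide whether a client can stand in for a failed server.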
The Bourne shell script in Listing 1, which I call dkmap,
addresses 
these problems quite simply. For a network of one or
two systems, 
the script is a mere convenience, giving the system
administrator 
a means to easily automate an already fairly trivial
task. The larger 
the network, the more useful dkmap becomes. dkmap queries
each machine in its host list for various information,
the most important 
being the labels of all attached disks. The other information
it gathers 
can be used as a map of shared file systems in the network.
Answering 
questions such as "Was /usr/local on partition
d or e?" becomes 
easy. This information is on your backup tapes, but
if you ever have 
to stream through a 2-gigabyte tape for nothing more
than a file system 
table, you'll know the meaning of frustration. Some
administrators 
wing it -- they know the systems and can make reasonable
guesses 
-- but many users, particularly application developers,
are picky 
enough that a reasonable guess won't do. 
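The general shape of such a host-by-host query can be sketched as follows. This is an illustration only, not the code in Listing 1: the names HOSTS, RSH, and collect_tables, and the placeholder host names, are assumptions made for the example.

```sh
#!/bin/sh
# Hypothetical names throughout; adjust for your own network.
HOSTS=${HOSTS:-"alpha beta"}     # hosts to query (placeholder names)
RSH=${RSH:-rsh}                  # remote shell command; override for testing

# collect_tables
# For each host, print a header followed by that host's fstab and exports
# files, so the combined output can serve as a printed map of the network.
collect_tables() {
    for host in $HOSTS; do
        echo "=== $host ==="
        for file in /etc/fstab /etc/exports; do
            echo "--- $file"
            # Classic rsh reports only its own failures (for example an
            # unreachable host), not the remote command's exit status.
            $RSH "$host" cat "$file" 2>/dev/null || echo "(unavailable)"
        done
    done
}
```

With the output saved to a file, a question such as which partition held /usr/local becomes a single grep for /usr/local in that file rather than a pass through a tape.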
Parts I and II of dkmap could be simplified, but I find
it 
convenient to have it formatted this way. If your system
configuration 
changes often, you may want to run it from crontab,
once a 
week, sending the output to a printer. You will probably
want to run 
dkmap after hours; it does not run quickly because it
is thorough 
-- rather than assuming that you have all disk types
listed in 
your fstab, it patiently tries each disk type in DISK_LIST
with device identifiers from 0 to 9. One of the more
interesting techniques 
in the script is the method of copying a script to a
remote machine 
and then executing it; though this trick strays from
the goal of simplicity, 
it saves a lot of extra rsh invocations and satisfies a preference
of 
my own by not requiring installation of a separate script
on each 
remote machine. 
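The two ideas in this paragraph, probing every disk type for units 0 through 9 and shipping a script to the remote machine instead of issuing one rsh per command, can be sketched together. This is a hedged illustration rather than the code in Listing 1. DISK_LIST is the name used in the article; RSH, RCP, LABEL_CMD, probe_script, run_remote, and the /tmp file names are assumptions. The default label command is assumed to be SunOS 4's dkinfo; substitute whatever prints a disk label on your system.

```sh
#!/bin/sh
# Illustrative sketch; everything except DISK_LIST is a hypothetical name.
DISK_LIST=${DISK_LIST:-"sd xy"}      # disk type prefixes to try
RSH=${RSH:-rsh}                      # remote shell command
RCP=${RCP:-rcp}                      # remote copy command
LABEL_CMD=${LABEL_CMD:-dkinfo}       # label-printing command (system dependent)

# probe_script
# Write to stdout a small script that tries every disk type in DISK_LIST
# with unit numbers 0 through 9; errors for absent disks are discarded.
probe_script() {
    echo '#!/bin/sh'
    for type in $DISK_LIST; do
        for unit in 0 1 2 3 4 5 6 7 8 9; do
            echo "$LABEL_CMD $type$unit 2>/dev/null"
        done
    done
}

# run_remote HOST
# Copy the generated script to HOST, run it there with a single remote
# shell call, and remove both copies afterwards.
run_remote() {
    host=$1
    local_copy=/tmp/dkmap.$$
    remote_copy=/tmp/dkprobe.$$
    probe_script > "$local_copy" || return 1
    $RCP "$local_copy" "$host:$remote_copy" &&
        $RSH "$host" "sh $remote_copy; rm -f $remote_copy"
    status=$?
    rm -f "$local_copy"
    return $status
}
```

Calling run_remote once per host costs one copy and one remote shell call, however many disks are probed, and nothing needs to be installed on the remote machine beforehand.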
The dkmap script was written to run in a Sun environment.
Changes 
will certainly be needed for it to run
on other systems. 
I have tried to place comments at all points in the
script where I 
suspect there will be portability issues, and in one
instance, I've 
added a case statement to resolve one of the most obvious
problems 
in a heterogeneous network. Error-checking is minimal;
I did not feel 
justified in doubling the length of the script to make
it only a little 
more robust. Use it and enjoy -- I find myself running
to look 
at its printout at least once every couple of months.
 
 
 About the Author
 
Dorian Deane has been a UNIX systems programmer/administrator
for the last five years. He currently works with the
Advanced Decisions 
Systems division of Booz, Allen and Hamilton. You may
contact Dorian 
at ddeane@ads.com. 
 
 