Surviving Large-Scale Distributed Computing Powerdowns
 
Andrew B. Sherman and Yuval Lirov 
Introduction 
A full or partial building powerdown destroys an orderly
universe 
in the data center and stimulates hardware and software
failures. 
Unfortunately, building power outages are a necessary
evil. On one 
hand, computer systems require conditioned and uninterruptible
power 
sources. On the other hand, those very power sources
require periodic 
testing and maintenance. Thus, a series of scheduled
outages is the 
price paid for avoiding random power interruptions.
On the average 
we experience one scheduled powerdown per month. 
In a distributed computing environment, the difficulties
of coping 
with powerdowns are compounded. The distributed environment
for a 
large organization may comprise hundreds of systems
in the data center 
and thousands of systems on desktops. These systems
may be administered 
by a handful of people, each of whom may be responsible
for anywhere 
from 50 to 150 machines. In such an environment, powerdown recovery requires scalable procedures. Other factors that may affect recovery time include time constraints, imperfect configuration management data, a chaotic powerup sequence, disk failures provoked by the power cycle, and disguised network outages.
The focus here is on data center powerdowns, since they
pose the greatest 
threat. Each individual failure of a server can have
an impact on 
a large number of users and pose a substantial business
risk. Against 
that backdrop, failures of client machines are less
important individually, 
although very trying in large numbers. Given these priorities,
recovery 
from a powerdown affecting both the data center and
other business 
areas should proceed in two stages: server recovery and client recovery. The objective is to minimize the time required for the first stage and, as nearly as possible, to guarantee 100 percent server availability.
ICMP echo requests (pings) are typically used 
to find faults after the powerup. However, ping can
only confirm 
that a machine is sufficiently operational to run the
lowest layers 
of the kernel networking code; it cannot determine whether the machine is fully operational. A damaged system that escapes detection (because it still answers pings) can exacerbate confusion and even create cascading faults.
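The gap between those two levels of health can be seen with two quick probes. The snippet below is only an illustration: the host name is a placeholder, and rpcinfo stands in for any higher-level check layered on top of ping.

# A host can answer pings yet still be unable to do useful work.
# "hostA" is a placeholder; the rpcinfo null call probes the RPC layer,
# one step above raw ICMP reachability.
if ping hostA 5 >/dev/null 2>&1; then     # SunOS-style ping, 5-second timeout
    echo "hostA answers ICMP echo"
fi
if rpcinfo -u hostA 100000 >/dev/null 2>&1; then
    echo "hostA answers an RPC null call to the portmapper"
else
    echo "hostA looks alive to ping, but its RPC layer is not responding"
fi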
This article presents a diagnostic methodology that
reduces the time 
needed for the server recovery stage. The methodology
exploits the 
synergy of a few very simple tests. By emphasizing early
fault detection, 
it helps system administrators and operators direct
their effort where 
it is needed. At our site this methodology has eliminated
much of 
the uncertainty that system administrators used to face
the Monday 
morning after a weekend powerdown. 
A Time-Critical Business Environment 
Our organization, Lehman Brothers, is a global financial
services 
firm. Thousands of Sun and Intel-based machines (hundreds
of them 
residing in the data centers) are interconnected by
a global wide 
area network, reaching beyond the New York and New Jersey
locations 
to all domestic and international branch offices. 
For the computer support staff, this means that there
are very few 
times during the week when there isn't a business need
for many of 
the systems. In North America alone, analysts, sales
people, and traders 
use the systems from 7:00 AM to 6:00 PM Monday through Friday, and production batch cycles run from 4:00 PM to 7:00 AM Sunday through Friday. This leaves Saturday as the prime
day for any 
maintenance activity, especially a powerdown that affects
the entire 
data center. 
The primary goal of our recovery procedures is to complete
the first 
recovery phase (100 percent server availability) by
Sunday afternoon, 
in time for the start of business in the Far East and
the start of 
Sunday night production in New York. The secondary goal
is to complete 
the second phase (100 percent of the desktop machines
in operation) 
by the start of business on Monday. 
Problems and Risks 
Our organization experiences an average of one scheduled
powerdown 
per month. Powerdowns affecting distributed systems
are coordinated 
by the Distributed Operations group. This group produces
and runs 
scripts that halt all affected machines and then turn
off power on 
all machines and peripherals. When power is restored,
Operations turns 
on power to all peripherals and processors. Each individual
support 
group is responsible for taking care of details such
as cleanly halting 
database servers prior to the powerdown. 
There are several risks in this procedure, no matter how effectively it is carried out:

- The order in which machines are brought back up is dictated more by geography than by the software and hardware interdependencies. For example, many of our machines are dependent upon a set of AFS servers for software. In an ideal world, those servers would be powered up first.

- Disk failures occur in great numbers at power-cycle time. A basic and unchangeable fact of life of computers is that disks fail. You are inevitably reminded of that fact when you spin up a large number of disks at once.

- In a large data center, it is inevitable that some peripherals will not be turned on during the power-up process.

- How often, if ever, machines are routinely rebooted is a matter that varies widely with local tastes. If your taste is to leave your machines up for a very long time, a powerdown exposes you to hitherto undiscovered bugs in your start-up scripts.

- If you cross-monitor your systems, the entropy in your monitoring system will be rather high during and immediately after the powerup until things stabilize.

- The databases that were shut down so cleanly need to be brought back up.
Bringing It All Back Home 
From experience and from our analysis of the risk factors
we determined 
that the single largest problem was coping with disks
that were off-line 
(due to either component or human failure) when the
system was rebooted. 
This is not surprising given that even in normal circumstances
the 
most common hardware fault is a disk drive failure. 
Facing this fact led us to change our normal reboot procedures, as distinct from the powerup procedures described below. Since trying to boot with dead
disks is a problem 
at any time, our normal bootup procedure (see code in
Listing 1) now 
issues a quick dd command for every disk referenced
in either 
/etc/fstab or in the database configuration files to
ensure 
that each is spinning. If all the disks are okay, pages
and email 
are sent to the SAs for the machine. If there is a disk
error, the 
pages and email list the failed partitions. Later, the
reboot process 
starts the log monitors and cross-monitors (if enabled).
On dataservers 
an additional script starts up the server process and
database monitors, 
and notifies the database administrator that the machine
is up. 
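Listing 1 itself is not reproduced in this section, but the heart of the test is small. The sketch below shows the idea under some simplifying assumptions: only /etc/fstab is scanned (the real script also reads the database configuration files), the mail alias is a placeholder, and paging is omitted.

#!/bin/sh
# Boot-time disk check (sketch): read one block from the raw device
# behind every /etc/fstab entry and report the result to the SAs.
FAILED=""
for DEV in `awk '$1 ~ /^\/dev\// { print $1 }' /etc/fstab`; do
    RAW=`echo $DEV | sed 's;^/dev/;/dev/r;'`    # block device -> raw device
    dd if=$RAW of=/dev/null bs=8k count=1 >/dev/null 2>&1 || FAILED="$FAILED $DEV"
done
if [ -z "$FAILED" ]; then
    echo "`hostname`: all disks spinning" | mail unix-sa        # placeholder alias
else
    echo "`hostname`: disk check failed on:$FAILED" | mail unix-sa
fi

Reading a single block is enough to distinguish a spinning disk from one that is absent, powered off, or dead, and it costs almost nothing on a healthy system.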
This level of testing and notification is adequate for
the occasional 
reboot. However, it is not an effective diagnostic procedure
during 
a powerup. Since there is no particular reboot order,
there is no 
way to predict when certain network-wide services, such
as paging 
and mail, will be available. Further, given the level
of chaos in 
the network, it is best to turn the cross-monitors off
before the 
powerdown and leave them off until the network begins
to approach 
a steady state. 
The ultimate resolution entails performing extraordinary
surveillance 
in an extraordinary situation. On the Saturday night
or Sunday morning 
after a powerdown, we run a script that takes attendance
of all our 
group's servers (see code in Listing 2). While the sample
code given 
is driven by a file containing the hostnames, the script
as implemented 
is driven by a netgroup (which originates from a central
configuration 
management database) that identifies the universe of
servers. In either 
case, for each machine in the list, the script runs the following diagnostic sequence (a sketch of the attendance loop follows this list):

1. Does the machine answer ICMP echo requests? (ping)
2. Is the RPC subsystem responding? (a local utility, rpcping)
3. Are the disks all spinning? (This also tests the functioning of inetd and in.rshd.)
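A minimal sketch of such an attendance loop follows. It is not the production Listing 2: the host file path and the remote diskcheck path are placeholders, and rpcinfo stands in for the local rpcping utility.

#!/bin/sh
# Post-powerdown attendance (sketch): for each server, work up the
# stack: ICMP echo, then the RPC layer, then an rsh-driven disk check.
HOSTLIST=${1:-/usr/local/etc/server-list}         # placeholder path

for HOST in `cat $HOSTLIST`; do
    # 1. ICMP echo (SunOS-style ping with a 5-second timeout)
    if ping $HOST 5 >/dev/null 2>&1; then :; else
        echo "$HOST: no answer to ping"
        continue
    fi
    # 2. RPC subsystem (null call to the portmapper, program 100000)
    if rpcinfo -u $HOST 100000 >/dev/null 2>&1; then :; else
        echo "$HOST: RPC subsystem not responding"
        continue
    fi
    # 3. Disks, inetd, and in.rshd in one shot.  rsh does not pass back
    #    the remote exit status, so the disk check reports by printing
    #    the partitions that failed (and nothing when all is well).
    OUT=`rsh $HOST /usr/local/adm/diskcheck 2>&1`   # placeholder path
    if [ $? -ne 0 ]; then
        echo "$HOST: rsh failed; reboot probably incomplete"
    elif [ -z "$OUT" ]; then
        echo "$HOST: OK"
    else
        echo "$HOST: disk check reports:$OUT"
    fi
done

Working up the stack in this order means that each failure message points at the lowest broken layer, which is usually the most informative place to start.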
The code for the disk check is shown in Listing 3. As
in the routine 
bootup version of the disk check, the script identifies
all disk partitions 
of interest from /etc/fstab and database configuration
files, 
and checks their status with a dd command. In addition
to 
running the disk check, this simple script serves as
an amazingly 
robust test of the health of the system. If some major
piece of the 
reboot procedure has failed, the chances are good that
the attempt 
to run the disk check using rsh will fail. The early
damage 
assessment provided by this diagnostic script gives
system administrators 
and field service engineers a head start on solving software
and hardware 
problems. Once all the servers are operational or under
repair, the 
recovery procedure can focus on the clients and can
begin cataloging 
desktop machines that have not come back up. 
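For completeness, here is a sketch of the kind of remotely invoked disk check the attendance loop above expects (again not the actual Listing 3, and again reading only /etc/fstab): it prints the raw devices that fail to respond and prints nothing when every disk answers.

#!/bin/sh
# Remote disk check (sketch), invoked by the attendance script via rsh:
# read one block from the raw device behind each /etc/fstab entry and
# print the devices that do not respond.
for DEV in `awk '$1 ~ /^\/dev\// { print $1 }' /etc/fstab`; do
    RAW=`echo $DEV | sed 's;^/dev/;/dev/r;'`    # block device -> raw device
    dd if=$RAW of=/dev/null bs=8k count=1 >/dev/null 2>&1 || echo $RAW
done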
Experience 
In the many powerdowns our site has experienced since
we implemented 
these procedures, we have had from 98 to 100 percent
availability 
of the servers by the start of Sunday night production,
and 100 percent 
availability by the start of business Monday morning.
In general our 
problem resolution priorities are: primary production
servers, backup 
production servers, development servers. This has worked
for us, as 
we have not missed the start of production since we
introduced these 
procedures. 
There are a few pitfalls, however. For powerdowns extending
beyond 
the data center, we have seldom achieved 100 percent
client availability 
on Monday morning. One reason for this is that despite
our best efforts 
to dissuade them, users will sometimes have machines
moved without 
notifying the System Administration groups, which means
that there 
are invalid locations in our configuration management
system. It can 
sometimes take days to find a machine on the fixed income
trading 
floor if it is down, the primary user is on vacation,
and it was moved 
without a database update. Furthermore, some machines sit behind locked office doors: depending on when the door was locked, they either could not be cleanly halted before the powerdown or cannot be powered back up afterward, so they will be down or damaged when their users arrive for work.
Finally, we advise any support manager coping with powerdowns
to expect 
the unexpected. For example, it is implicitly assumed
in this discussion 
that once power is restored on Saturday night there
will be no further 
power outage. This is not necessarily a valid assumption:
Among other 
possibilities, human errors by the people doing the power
maintenance 
could lead to unscheduled outages later in the weekend.
In that case, 
there would be no orderly process; rather, the entire
data center 
(or building) would go down in an instant, and come
back all at once. 
In such cases, the same risks appear but the failure
probabilities 
are much higher. 
Acknowledgment 
The authors thank Jeff Borror for inspiration and his
vision of a 
fault-tolerant computing environment.  
 
 About the Authors
 
Yuval Lirov is Vice President of Global UNIX Support
at Lehman 
Brothers. He manages 24x7 administration of systems,
databases, and 
production for 3,000 workstations. A winner of the Innovator '94 award from Software Development Trends magazine and of the Outstanding Contribution '87 award from the American Institute of Aeronautics and Astronautics, he has authored over 70 patents and technical publications in distributed systems management, troubleshooting, and resource allocation.
Andrew Sherman is a manager of Systems Administration
in Lehman 
Brothers' Global UNIX Support department. He is a cum
laude graduate 
of Vassar College in Physics and received a Ph.D. in
Physics from
Rensselaer Polytechnic Institute. Prior to joining Lehman
Brothers, 
he led the Unix Systems Support team at Salomon Brothers
and was a 
major driver of the effort to standardize the system
architecture 
there. 
Drs. Lirov and Sherman are currently editing Mission
Critical 
Systems Management -- an anthology to be published in
1996. 
 
 