A Host Health Probe
 
John Lees

The Department of Computer Science at Michigan State University
operates several hundred computers for its students, staff, and
faculty. The computers are located in three buildings on a large
campus. Although we have a small systems management staff (three
full-time people and about ten half-time graduate students), we are
expected to keep the computers up and running around the clock. We
have developed a number of tools to ease this task. The subject of
this article is a perl script named probe, which checks the health of
each of our systems every morning and mails a status report to the
appropriate system managers.

We operate a total of about 250 computers running the UNIX operating
system. Most are Suns running SunOS 4.1.x or Solaris 2.x, but there
are a few NeXTs, half a dozen DEC Alphas running OSF/1, and an Apple
workgroup server running A/UX. Any tool we use for systems management
has to work in this diverse and ever-changing environment. We've
chosen Larry Wall's perl language because the perl interpreter is easy
to port to new versions of UNIX, is well documented by two O'Reilly
handbooks, and provides a powerful base for writing custom scripts.

The probe script began as a way of checking that the xntpd daemon was
running on all our computers. This daemon keeps the clocks of our
computers synchronized using the Network Time Protocol. We quickly saw
the utility of making a few more simple checks on all our computers
every day, and the probe script grew into its present form.

Our computers are divided into seven NIS (Sun Network Information
System) domains for the purpose of controlling access. Because the
probe script is driven by the NIS netgroup map, it helps to understand
a little about this database and about Sun's NIS (see the sidebar,
"Sun Network Information Systems (NIS)").

The probe script uses the NIS netgroup map to find the host names of
the computers to probe. This adds to the complexity of the script, but
we make too many changes to the netgroup (and hosts) databases for any
other scheme to be practical. A simple list of systems to probe would
have to be updated by hand several times each week.

Sample Output

Listing 1 shows the kind of output generated by the probe script. This
sample has been edited to condense it a little, but it gives the
general idea. The fields for each computer being probed are:

The hostname
Absolute value of the offset, in seconds, between the time on the computer running the probe and the computer being probed
Utilization of the /, /usr, /var, and /home filesystems
System load
Number of users
Status of six selected daemons (lowercase if not running)
Uptime in days
Hostname of the NIS server to which the computer is bound

The probe Script

Listing 2 shows the probe script. I'll briefly discuss the major
sections, but see the comments in the script for details.

The main program sets up global constants and variables, then calls
the getngrp() subroutine to build the global list of NIS domains and
computers within each domain. Using this list, the do_poll()
subroutine probes each computer.

The do_poll() subroutine does most of the work. First it uses the
newping program to determine if the computer is up and usable (newping
is a modified version of a program published in Sys Admin 2.4). It
next determines the time offset between the two computers, then forks
a child process to gather more information. The child uses rsh to run
several commands on the computer being probed. Running these commands
in a separate process decreases the chance of hanging the entire probe
if one of the rsh commands hangs. Finally, all the information is
formatted and displayed using the do_print() subroutine.
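
The fork-and-give-up idea can be sketched in a few lines of modern
perl. This is a minimal illustration under stated assumptions, not the
code from Listing 2: the run_with_timeout name and the timeout values
are hypothetical, and any command can stand in for rsh.

```perl
use strict;
use warnings;

# Run a command in a forked child and give up after $timeout seconds.
# Returns the command's output, or undef if it timed out or the fork failed.
# (Hypothetical helper, not taken from the probe script.)
sub run_with_timeout {
    my ($timeout, @cmd) = @_;

    # A piped open with '-|' forks; it returns 0 in the child, whose
    # STDOUT is connected to the parent's read handle.
    my $pid = open(my $from_child, '-|');
    return undef unless defined $pid;

    if ($pid == 0) {
        # Child: replace ourselves with the command.
        exec(@cmd) or exit 127;
    }

    my $output = '';
    my $ok = eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm($timeout);
        local $/;                  # read everything in one go
        $output = <$from_child>;
        alarm(0);
        1;
    };
    unless ($ok) {
        alarm(0);
        kill('KILL', $pid);        # do not let a hung child linger
        close($from_child);
        return undef;
    }
    close($from_child);
    return $output;
}
```

A call such as run_with_timeout(30, 'rsh', $host, 'uptime') would then
return undef instead of hanging when the remote computer never
answers; the 30-second limit is only an example value.
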

The do_print() subroutine writes each report to stdout, mails it to
the manager of the current NIS domain, or both. When reports are being
mailed, this routine starts a new report each time a new domain
begins.

The getngrp() subroutine reads the NIS netgroup map and builds a
global associative array of all the netgroups and computers. (This
subroutine has been made into a perl library routine, and has found
use in a number of our other local scripts.)
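
As a rough illustration of what such a routine has to do, the sketch
below parses netgroup entries of the usual form, a name followed by
(host,user,domain) triples and the names of other netgroups, into a
hash, and then expands nested netgroups into a host list. The
parse_netgroup and hosts_in names are hypothetical, and feeding the
parser from the output of "ypcat -k netgroup" is an assumption; the
getngrp() subroutine in Listing 2 may work differently.

```perl
use strict;
use warnings;

# Parse lines such as
#   lab_a (alpha,,) (beta,,) other_group
# into a hash: name => { hosts => [...], groups => [...] }.
sub parse_netgroup {
    my (@lines) = @_;
    my %netgroup;
    for my $line (@lines) {
        next if $line =~ /^\s*(#|$)/;        # skip comments and blank lines
        my ($name, $rest) = split ' ', $line, 2;
        $rest = '' unless defined $rest;
        my (@hosts, @groups);
        while ($rest =~ /\(([^)]*)\)|(\S+)/g) {
            my ($triple, $group) = ($1, $2);
            if (defined $triple) {
                # Only the host field of the triple matters here.
                my ($host) = split /,/, $triple;
                next unless defined $host;
                $host =~ s/^\s+|\s+$//g;
                push @hosts, $host if length($host) && $host ne '-';
            } else {
                push @groups, $group;        # reference to another netgroup
            }
        }
        $netgroup{$name} = { hosts => \@hosts, groups => \@groups };
    }
    return %netgroup;
}

# Expand a netgroup, following nested netgroups, into a sorted host list.
sub hosts_in {
    my ($netgroup, $name, $seen) = @_;
    $seen ||= {};
    return () if $seen->{$name}++ || !exists $netgroup->{$name};
    my %hosts = map { $_ => 1 } @{ $netgroup->{$name}{hosts} };
    for my $sub (@{ $netgroup->{$name}{groups} }) {
        $hosts{$_} = 1 for hosts_in($netgroup, $sub, $seen);
    }
    return sort keys %hosts;
}
```

The $seen hash guards against netgroups that refer to each other, which
would otherwise cause endless recursion.
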

Like many system administration tools, the health probe script grew
over time rather than being designed as a whole. If I were to redesign
it, I would probably break it into two pieces, one running on the
master and one running on each computer, to gather the information
more efficiently and to smooth out the differences among the five
different versions of UNIX we have in use.

Using the probe Script

We run the probe script once each day, in the early morning so we can
catch problems before the heavy user load begins. Because we have had
quite a few network problems, we run the probe script indirectly,
using a Bourne shell wrapper. The wrapper (see Listing 3) attempts to
kill the probe script and any hung children if we are having a bad
network day.

The probe script is normally run with no command-line options. This
sends the reports generated to manager@ for each of the top-level
netgroups. We have appropriate mail aliases set up for this. The
complete set of reports is also copied to stdout, which cron then
mails to an appropriate person (me, as I run the probe from my
crontab). A -m option will suppress copying all the reports to stdout,
and a -s option will suppress mailing the individual reports to lab
managers.
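
The two switches could be handled along these lines with the standard
Getopt::Std module. This is a sketch only: the parse_options name and
the to_stdout and to_managers keys are hypothetical, and Listing 2 may
parse its options in another way.

```perl
use strict;
use warnings;
use Getopt::Std;

# Turn the -m and -s switches into two flags.
# (Hypothetical helper, not taken from the probe script.)
sub parse_options {
    my (@args) = @_;
    local @ARGV = @args;              # getopts() works on @ARGV
    my %opt;
    getopts('ms', \%opt) or die "usage: probe [-m] [-s]\n";
    return (
        to_stdout   => !$opt{m},      # -m suppresses the copy on stdout
        to_managers => !$opt{s},      # -s suppresses mail to lab managers
    );
}
```

Calling parse_options(@ARGV) with no switches leaves both flags true,
which matches the default behavior described above.
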

Modifying the probe Script for Your Situation

You will almost certainly need to modify (that sounds better than
"hack," doesn't it?) the probe script to fit your exact mix of
machines. The first thing you have to look at is how the netgroup
database is set up. You must either use a scheme like ours or modify
the getngrp subroutine to work with your scheme. Modifying getngrp
should be easy. You can even replace it with a simple routine to read
a list of hostnames from a file.
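
A replacement of that kind might look like the following sketch. The
read_hosts name and the file format (one hostname per line, with
optional "#" comments) are assumptions made for illustration.

```perl
use strict;
use warnings;

# Read hostnames from a file, one per line.
# Blank lines and anything after a '#' are ignored.
# (Hypothetical helper, not taken from the probe script.)
sub read_hosts {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "cannot open $path: $!\n";
    my @hosts;
    while (my $line = <$fh>) {
        $line =~ s/#.*//;             # strip comments
        $line =~ s/^\s+|\s+$//g;      # strip surrounding whitespace
        push @hosts, $line if length $line;
    }
    close($fh);
    return @hosts;
}
```

The trade-off is the one noted earlier: a plain list is simpler to
read, but it has to be kept up to date by hand.
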

Here are several other modifications you may need or want to make.

Use ping instead of newping, if you have not installed newping and experience no problems with the simple ping shipped with your system.
Check on different filesystems. The probe script checks on /, /usr, /var, and /home. This may not be appropriate for you.
Check for a different set of daemons, perhaps adding a section for a different operating system flavor.

And of course it is up to you whether to use the prun wrapper or run
the probe script directly.
 
 About the Author
 
John Lees has an M.S. in computer science and has worked during the
past twenty years about equally as a teacher, technical writer,
programmer, and system administrator. His computer experience began in
the days of front panels and paper tape, and he doesn't have enough
fingers and toes to count the operating systems he has used. His
love/hate relationship with UNIX dates to early 1985. Currently Mr.
Lees is a systems analyst with the Department of Computer Science, and
manager of the Pattern Recognition and Image Processing Laboratory, at
Michigan State University. He is a member of ACM, Computer
Professionals for Social Responsibility, the IEEE Computer Society,
the Society for Technical Communication, and the TeX Users Group. He
may be contacted at lees@cps.msu.edu.
 
 