A System Load Monitoring Trilogy
 
Leor Zolman 
If you've been following my articles in the past two issues of Sys Admin, you've probably noticed that one of my big concerns as system administrator here at R&D Publications has been to seek out new and useful ways to smooth out the CPU load on our single-CPU Xenix installation.
The overnight and background job spooling utilities described previously allow our users a great degree of direct control over their use of system resources. From time to time, the users must make decisions such as whether to launch a long series of reports in the background or to run them overnight instead. Most of our users, however, are not technical enough to comfortably use the standard UNIX/Xenix diagnostic utilities to get a handle on the system load. Without a tool to translate the load figures spewed by programs such as uptime into plain English, those users would lack the information on which to base job scheduling decisions.
To address this problem, and to assist me in gauging the effects of various efficiency-related system policies and tools, I have developed the set of shell scripts described in this article. The first script, load, provides a single number and an English-language analysis of the current system load for nontechnical users. The second script, a, generates some useful instantaneous statistics for the system administrator's perusal, including the system load, the total number of system jobs, and the average number of jobs per user. The final script, sysload.sh, is a long-term system load tracking facility with automatic periodic averaging. All information processed by these scripts is generated internally using the standard UNIX utilities ps, who, and uptime.
load: Characterizing the Current System Load 
The system command uptime (actually a link to the w command, equivalent to w -t) displays a line of system statistics containing the elapsed time since system boot-up, the current number of users, and the system CPU load (as the number of jobs in the run queue) averaged over the last 1, 5, and 15 minutes. The load script (Listing 1) runs uptime and pipes the output into an awk script to extract the first of the three average load values and display a status report based on that value.
Line 11 extracts the load value based on the number of tokens detected in the uptime output text. The precise format of the line produced by uptime actually varies with the length of time the system has been up. Therefore, the awk script sets the val variable to the value of the third-to-last token. Lines 13-14 then strip the trailing comma.
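The extraction idiom boils down to something like this (a minimal sketch of the approach; the variable handling in Listing 1 itself may differ in detail):

uptime | awk '{
    val = $(NF - 2)                 # third-to-last token: the 1-minute average
    if (val ~ /,$/)                 # strip the trailing comma, if present
        val = substr(val, 1, length(val) - 1)
    print "Current load:", val
}'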
The rest of the script simply displays some text based on the value of val. The text tells a user what impact a CPU-intensive background job is likely to have on system performance at the current load level. The user is then in a better position to weigh the potential performance impact of his/her job against the criticality of that job, and decide whether or not to run the job in the background.
A sample output of the load script is shown in Figure 1. If your computer system's horsepower differs significantly from ours (a 486-33 ISA machine), then you may want to alter the load values hard-coded into the script's comparison lines to better reflect the load characteristics of your particular machine.
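The comparison lines in the body of the awk script take roughly this shape (the thresholds and messages below are illustrative, not the exact values from Listing 1):

if (val + 0 < 1.5)
    print "Load is light; a background job should have little impact."
else if (val + 0 < 3.0)
    print "Load is moderate; a background job will slow things noticeably."
else
    print "Load is heavy; consider running the job overnight instead."

(Adding 0 to val simply forces awk to compare numerically rather than as strings.)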
a: Displaying System and User Process Statistics
One very powerful window into the system process table is the ps command. I wrote the a shell script (Listing 2) to analyze data provided by ps and display a summary containing some basic statistics otherwise difficult to glean from the raw ps output.
When extracting data about user patterns and trends from the system process table, it is useful to first separate the "signal" from the "noise." Therefore, a breaks the list of all system processes down into three categories: root processes, printer processes, and user processes. Root processes (getty, cron, other daemons, etc.) and printer processes (the master scheduler and intermittent printer request handlers) are not large contributors to the system load, and are therefore segregated from explicit user applications when collecting user process data.
The a script recognizes one further dichotomy: shell interpreters are distinguished from other kinds of user processes. Generally, shell processes tend to be dormant while their subprocesses are executing. This is certainly not always the case, so I've included a feature to summarize the user process statistics both with and without shell interpreter instances taken into consideration.
The output from a sample a run is shown in Figure 2. All analysis is performed in lines 18-34. There is some tricky coding involved, so I'll annotate what I've done.
In line 18, the innermost in-line statement

ps -u root

generates a list of all processes owned by root. This list is piped to

wc -l

to produce a single number representing a count of the lines in the ps output. Finally, this number is reduced by 1 (using the expr command) to compensate for the header line produced by ps, and the result is assigned to the rootpros shell variable. The next line repeats the same procedure to count lp processes, and then the sum of the root and lp process counts is assigned to the otherpros variable.
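Put together, the counting idiom looks something like this (a sketch; the quoting in Listing 2 may differ, and the intermediate name lppros is my own):

rootpros=`expr \`ps -u root | wc -l\` - 1`
lppros=`expr \`ps -u lp | wc -l\` - 1`
otherpros=`expr $rootpros + $lppros`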
In line 22, a total system process count is computed by running ps -e, counting the output lines, and subtracting 3 (one for the header line, and two for the processes spawned by the invocation of the a command itself). To get the number of user processes, I subtract the value of otherpros from totpros. The result is assigned to userpros.
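In the same style, those two counts might be computed as (again, only a sketch):

totpros=`expr \`ps -e | wc -l\` - 3`
userpros=`expr $totpros - $otherpros`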
Lines 25-28 count up the number of user shell interpreters currently active, and assign that value to shpros. Since root processes have already been counted up in a class of their own, any shell interpreters owned by root are excluded from the shpros count.
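One way to express that count (a sketch only: it assumes a System V-style ps -ef, and the pattern naively takes any command ending in "sh" for a shell):

shpros=`ps -ef | awk 'NR > 1 && $1 != "root" && $NF ~ /sh$/' | wc -l`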
To calculate the total number of non-shell user processes, the value of shpros is subtracted from userpros and the result is assigned to nonshpros (line 29).
To calculate the processes-per-user averages, it is first necessary to find out how many "distinct" users are currently logged in to the system, since a single user may be logged in on multiple terminals or have several multiscreen sessions active on a single terminal. Line 30 calculates the number of distinct users by listing the user ID of all processes, sorting by the ID, eliminating duplicates, and counting the number of lines in the output. The resulting value is assigned to nusers.
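An equivalent effect can be had with who, which reports one line per login session (a sketch; as described above, Listing 2 actually derives the list from process ownership):

nusers=`who | awk '{ print $1 }' | sort | uniq | wc -l`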
The final calculations in lines 31-34 produce the averages to two decimal places, applying a standard multiplication-and-modulus kludge useful with integer-only math. The integer and fractional portions of the average values are calculated separately.
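The kludge works like this (a sketch with illustrative variable names; note the zero-padding needed when the fractional part comes out to a single digit):

avg100=`expr $userpros \* 100 / $nusers`   # average, scaled up by 100
whole=`expr $avg100 / 100`                 # integer portion
frac=`expr $avg100 % 100`                  # fractional portion, in hundredths
case $frac in                              # pad so that 2.05 doesn't print as 2.5
[0-9]) frac=0$frac ;;
esac
echo "Average processes per user: $whole.$frac"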
sysload.sh: Recording a Periodic System Load History 
The two scripts described above provide instantaneous process information, but contain no provisions for maintaining a history. The last script for this month is a facility for recording long-term process load history information into a set of log files. These files may be inspected periodically in order to seek out cyclical trends or patterns of light and heavy system usage.
sysload.sh (Listing 3) writes to three log files, given the symbolic names DAYLOG, LOADLOG, and AVGLOG. You fill in the actual pathnames for these files in lines 26-28, and the pathnames for the debugging versions in lines 30-32.
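For example (these pathnames are placeholders, not the ones in Listing 3; use whatever locations suit your system):

DAYLOG=/usr/local/lib/sysload/daylog
LOADLOG=/usr/local/lib/sysload/loadlog
AVGLOG=/usr/local/lib/sysload/avglog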
The DAYLOG file is used when the call to sysload.sh has the form:

sysload.sh daily

You decide how often to sample the system load, and create a cron table entry that schedules the above command accordingly.
For example, on our system the script runs every fifteen minutes between 8 A.M. and 5:45 P.M., Monday through Friday. The cron table entry appears as follows:

0,15,30,45 8-17 * * 1-5 /usr/local/sysload.sh daily

where /usr/local is the directory in which the sysload.sh script resides. Figure 3 shows the entire contents of our system's DAYLOG file as I write this. Each one-line entry contains the date, the time, and the system load. In Listing 3, these daily runs are processed in lines 38-50.
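Each daily run amounts to something like the following (a sketch; the date format and line layout in Listing 3 may differ):

load=`uptime | awk '{ v = $(NF - 2); if (v ~ /,$/) v = substr(v, 1, length(v) - 1); print v }'`
echo "`date '+%m/%d/%y %H:%M'` $load" >> $DAYLOG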
After all sampling for the day is complete, sysload.sh must be run one more time with the argument final instead of daily. Several things happen at that point (a sketch of this processing follows the list):
1. The entire contents of DAYLOG are appended onto LOADLOG. LOADLOG thus contains a cumulative record of all daily load samples ever taken.
2. The average load for the day (as per all entries in DAYLOG) is computed, and a line containing this information is appended onto LOADLOG. The same line is also appended onto AVGLOG.
3. On Friday of each week, the five most recent daily averages from AVGLOG are themselves averaged, and a line containing this weekly average is appended onto AVGLOG.
4. The DAYLOG file is deleted, so the next weekday's samples are written to a fresh DAYLOG file.
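In outline, the final run might look like this (a sketch under the same assumptions as above, taking the load value to be the last field of each log line):

cat $DAYLOG >> $LOADLOG                 # step 1: accumulate the day's samples
avg=`awk '{ s += $NF } END { printf "%.2f\n", s / NR }' $DAYLOG`
line="`date '+%m/%d/%y'` daily average: $avg"
echo "$line" >> $LOADLOG                # step 2: log the daily average...
echo "$line" >> $AVGLOG                 # ...in both files
if [ "`date | awk '{ print $1 }'`" = "Fri" ]
then                                    # step 3: weekly average on Fridays
    wavg=`grep 'daily average' $AVGLOG | tail -5 |
        awk '{ s += $NF } END { printf "%.2f\n", s / NR }'`
    echo "week ending `date '+%m/%d/%y'`: weekly average $wavg" >> $AVGLOG
fi
rm $DAYLOG                              # step 4: start a fresh DAYLOG tomorrow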
Our cron table entry for the end-of-day sysload.sh invocation is:

0 18 * * 1-5 /usr/local/sysload.sh final

The last daily run happens at 5:45 P.M., so the final run is scheduled for 6:00 P.M. Figure 4 shows the tail portion of the contents of a representative AVGLOG file.
Conclusion 
These utilities have provided several benefits to me as a system administrator. With the help of the load program, nontechnical users are now confident enough to diagnose aberrant system slowdowns, and often bring such events to my attention before I'm even aware of them. The a program, in conjunction with SCO's vmstat utility, gives me a fairly good, quick map of system utilization at any given moment, and sysload.sh allows me to report long-term system load statistics to management in order to help evaluate hardware and software requirements for the company. I hope the tools prove useful to you in your administration duties as well.
 
Errata 
I recently discovered a bug in one of the Onite system scripts published in the Sys Admin Premiere issue. In isonite.sh (Listing 7, page 24), the script that tells whether a particular job name exists in the overnight queue, the line printed as:

[ -r $SPOOLDIR/$1 ] && exit 0

is bogus. The line should be corrected to read:

[ -f $SPOOLDIR/P$priority/$1 ] && exit 0
 
 
About the Author

Leor Zolman wrote BDS C, the first C compiler targeted exclusively for personal computers. He is currently a system administrator and software developer for R&D Publications, Inc., and a columnist for both The C Users Journal and Windows/DOS Developer's Journal. Leor's first book, Illustrated C, has just been published by R&D. He may be reached in care of R&D Publications, Inc., or via net E-mail as leor@rdpub.com ("...!uunet!rdpub!leor").
 
 
 