|  The 
              AIX Error Logging Facility
 Sandor W. Sklar
              The primary goal of every UNIX systems administrator is to ensure 
              that the systems that they are responsible for are functioning smoothly 
              and with the best performance possible, 100% of the time. File systems 
              running out of space, applications dumping core, and Ethernet adapter 
              failures are just a sample of the types of things that can trip 
              up a system, impacting that goal. Therefore, it is critical that 
              the people responsible for a system are aware of anything that might 
              have an impact on attaining that 100% system availability. 
              One of the things that makes AIX my favorite flavor of UNIX is 
              that, besides all the standard tools, daemons, and configuration 
              files that are present in all flavors of UNIX, IBM has provided 
              a number of enhancements that make the monitoring, reliability, 
              and administration of RS/6000 systems second to none. This article 
              will focus on one of those tools: the error logging facility. I'll 
              show you how the AIX error logging facility works, then I'll 
              present a program I wrote that checks the log for error messages, 
              filters out any error messages you wish to ignore, and sends an 
              email to the systems administrator.
              The Error Logging Subsystem
              On most UNIX systems, information and errors from system events 
              and processes are managed by the syslog daemon (syslogd); 
              depending on settings in the configuration file /etc/syslog.conf, 
              messages are passed from the operating system, daemons, and applications 
              to the console, to log files, or to nowhere at all. AIX includes 
              the syslog daemon, and it is used in the same way that other 
              UNIX-based operating systems use it. In addition to syslog, 
              though, AIX also contains another facility for the management of 
              hardware, operating system, and application messages and errors. 
              This facility, while simple in its operation, provides unique and 
              valuable insight into the health and happiness of an RS/6000 system.
              The AIX error logging facility components are part of the bos.rte 
              and the bos.sysmgt.serv_aid packages, both of which are automatically 
              placed on the system as part of the base operating system installation. 
              Some of these components are shown in Table 1.
              Unlike the syslog daemon, which performs no logging at 
              all in its default configuration as shipped, the error logging facility 
              requires no configuration before it can provide useful information 
              about the system. The errdemon is started during system initialization 
              and continuously monitors the special file /dev/error for 
              new entries sent by either the kernel or by applications. The label 
              of each new entry is checked against the contents of the Error Record 
              Template Repository, and if a match is found, additional information 
              about the system environment or hardware status is added, before 
              the entry is posted to the error log.
              The actual file in which error entries are stored is configurable; 
              the default is /var/adm/ras/errlog. That file is in a binary 
              format and so should never be truncated or zeroed out manually. 
              The errlog file is a circular log, storing as many entries 
              as can fit within its defined size. A memory buffer is set by the 
              errdemon process, and newly arrived entries are put into 
              the buffer before they are written to the log to minimize the possibility 
              of a lost entry. The name and size of the error log file and the 
              size of the memory buffer may be viewed with the errdemon 
              command:
              
             
[aixhost:root:/] # /usr/lib/errdemon -l
Error Log Attributes
--------------------------------------------
Log File                /var/adm/ras/errlog
Log Size                1048576 bytes
Memory Buffer Size      8192 bytes
The parameters displayed may be changed by running the errdemon 
            command with other flags, documented in the errdemon man page. 
            The default sizes and values have always been sufficient on our systems, 
            so I've never had reason to change them.  Due to use of a circular log file, it is not necessary (or even 
              possible) to rotate the error log. Without intervention, errors 
              will remain in the log indefinitely, or until the log fills up with 
              new entries. As shipped, however, the crontab for the root 
              user contains two entries that are executed daily, removing hardware 
              errors that are older than 90 days, and all other errors that are 
              older than 30 days.
              
             
0 11  *  *  * /usr/bin/errclear -d S,O 30
0 12  *  *  * /usr/bin/errclear -d H 90
These entries are commented out on my systems, as I prefer that older 
            errors are removed "naturally", when they are replaced by 
            newer entries.  Viewing Errors
              Although a record of system errors is a good thing (as most sys 
              admins would agree), logs are useless without a way to read them. 
              Because the error log is stored in binary format, it can't 
              be viewed as logs from syslog and other applications are. 
              Fortunately, AIX provides the errpt command for reading the 
              log.
              The errpt command supports a number of optional flags and 
              arguments, each designed to narrow the output to the desired amount. 
              The man page for the errpt command provides detailed usage; 
              Table 2 provides a short summary of the most useful arguments. (Note 
              that all date/time specifications used with the errpt command 
              are in the format of mmddHHMMyy, meaning "month", "day", 
              "hour", "minute", "year"; seconds 
              are not recorded in the error log, and are not specified with any 
              command.)
              Each entry in the AIX error log can be classified in a number 
              of ways; the actual values are determined by the entry in the Error 
              Record Template Repository that corresponds with the entry label 
              as passed to the errdemon from the operating system or an 
              application process. This classification system provides a more 
              fine-grained method of prioritizing the severity of entries than 
              does the syslog method of using a facility and priority code. 
              Output from the errpt command may be confined to the types 
              of entries desired by using a combination of the flags in Table 
              2. Some examples are shown in Table 3.
              Dissecting an Error Log Entry
              Entries in the error log are formatted in a standard layout, defined 
              by their corresponding template. While different types of errors 
              will provide different information, all error log entries follow 
              a basic format. The one-line summary report (generated by the errpt 
              command without using the "-a" flag) contains the 
              fields shown in Table 4:
              Here are several examples of error log entry summaries:
              
             
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
D1A1AE6F   0223070601 I H rmt3           TAPE SIM/MIM RECORD
5DFED6F1   0220054301 I O SYSPFS         UNABLE TO ALLOCATE SPACE
                                         IN FILE SYSTEM
1581762B   0219162801 T H hdisk98        DISK OPERATION ERROR
And here is the full entry of the second error summary above:  
             
LABEL:            JFS_FS_FRAGMENTED
IDENTIFIER:    5DFED6F1
Date/Time:       Tue Feb 20 05:43:35
Sequence Number: 146643
Machine Id:      00018294A400
Node Id:         rescue
Class:           O
Type:            INFO
Resource Name:   SYSPFS
Description
UNABLE TO ALLOCATE SPACE IN FILE SYSTEM
Probable Causes
FILE SYSTEM FREE SPACE FRAGMENTED
      Recommended Actions
      CONSOLIDATE FREE SPACE USING DEFRAGFSMonitoring with errreporterUTILITY
Detail Data
MAJOR/MINOR DEVICE NUMBER
000A 0006
FILE SYSTEM DEVICE AND MOUNT POINT
/dev/hd9var, /var
  Most, if not all systems administrators have had to deal with 
              an "overload" of information. Multiple log files and process 
              outputs must be monitored constantly for signs of trouble or required 
              intervention. This problem is compounded when the administrator 
              is responsible for a number of systems. Various solutions exist, 
              including those built into the logging application (i.e., the use 
              of a loghost for syslog messages), and free third-party solutions 
              to monitor log files and send alerts when something interesting 
              appears. One such tool that we rely on is "swatch", developed 
              and maintained by Todd Atkins. Swatch excels at monitoring log files 
              for lines that match specific regular expressions, and taking action 
              for each matched entry, such as sending an email or running a command.
              For all of the power of swatch, though, I was unable to set up 
              the configuration to perform a specific task: monitoring entries 
              in the AIX error log, ignoring certain specified identifiers, and 
              emailing the full version of the entry to a specified address, with 
              an informative subject line. So, I wrote my own simple program that 
              performs the task I desired. errreporter (Listing 1) is a 
              Perl script runs the errpt command in concurrent mode, checks 
              new entries against a list of identifiers to be ignored, crafts 
              a subject line based upon several fields in the entry, and emails 
              the entire entry to a specified address.
              errreporter can be run from the command line, though I 
              have chosen to have it run automatically at system startup, with 
              the following entry in /etc/inittab (all on a single line, 
              but broken here, for convenience):
              
             
errrptr:2:respawn:/usr/sec/bin/errreporter \
     -f /usr/sec/etc/errreporter.conf /dev/console 2&1
Of course, if you choose to use this script, be sure to set the proper 
            locations in your inittab entry. The system must have Perl 
            installed; Perl is included with AIX as of version 4.3.3, and is available 
            in source and compiled forms from numerous Web sites. It relies only 
            on modules that are included with the base Perl distribution (see 
            Listing 2 for errreporter.conf file).  Although this script perfectly suits my current needs, there are 
              many areas in which it could be expanded upon or improved. For instance, 
              it may be useful to have entries mailed to different addresses, 
              based upon the entry's identifier. Another useful feature would 
              be to incorporate "loghost"-like functionality, so that 
              a program running on a single server can receive error log entries 
              sent by other systems, communicating via sockets à la the 
              syslog "@loghost" method.
              Summary
              The AIX Error Logging Facility can provide insight into the workings 
              of your system that are not available on other UNIX platforms. I 
              find it to be just one of the many advantages of AIX in a production 
              environment, and I hope that I have helped to explain this simple 
              yet powerful tool.
              In this article, I have touched on some of the more commonly used 
              aspects of the Error Logging Facility in AIX. There are numerous 
              other features and capabilities of this subsystem, including the 
              use of the "diag" command for error log analysis 
              and problem determination, the addition of custom error templates, 
              the redirection of error log entries to and from the syslog 
              daemon, and the use of error notification routines in user-developed 
              code to provide notice and error logging to this subsystem. For 
              more information on those topics, and more detail on the items discussed 
              above, please see the documents listed in the References section 
              below.
              References
              The first source to go to for information on the usage of the 
              commands and programs that are part of the Error Logging Facility 
              is the man pages for the errpt, errdemon, errclear, 
              errinstall, errupdate, errlogger, and errdemon 
              commands, and for the errlog subroutine.
              The complete listing of the entries in the Error Template Repository 
              can be generated with the "errpt -t" command, for 
              the one-line summary, or the "errpt -at" command, 
              for the full text of each error template.
              A more in-depth overview of the Error Logging Facility can be 
              found in Chapter 10 of the AIX 4.3 Problem Solving Guide and 
              Reference, available online at:
              
             
http://www.rs6000.ibm.com/doc_link/en_US/a_doc_lib/aixprob/prbslvgd/errlogfac.htm
  Sandor Sklar is a UNIX systems administrator at Standford University. 
            He can be reached at: ssklar@stanford.edu. |