|  Managing Solaris™ with Kstat
 Alexander Golomshtok and Yefim Nodelman
             Rapidly increasing demand for high-performance, mission-critical 
              computer systems, and especially the proliferation of Internet-based 
              business applications, has given birth to thousands of commercial and 
              public-domain performance management solutions. Solaris, being a 
              very mature operating environment with a significant installed base, 
              enjoys the attention of many performance management software vendors 
              and independent developers. Available tools range from very simple 
              standalone programs, designed to monitor just a few aspects of the 
              system's behavior, to very complex distributed systems with 
              built-in troubleshooting, trend analysis, and forecasting capabilities.
              Perhaps one of the best-known commercial performance management 
              solutions for all UNIX flavors is BMC Patrol [1]. Patrol is a multi-tiered 
              system, capable not only of monitoring various aspects of system 
              performance, but also of advanced modeling and impact analysis. Although 
              a good choice for enterprise-wide performance monitoring, Patrol 
              may be overkill for those who just wish to quickly address particular 
              performance concerns -- its distributed architecture, complexity, 
              and especially the licensing costs may be prohibitive to small organizations 
              looking for a comprehensive low-cost and low-maintenance solution. 
              Other vendors, including Sun Microsystems itself, offer sophisticated 
              performance monitoring and management tools for Solaris, but unfortunately, 
              sophistication comes at the price of complexity and high acquisition 
              and maintenance costs [2].
              As a low-cost alternative to commercial performance-monitoring 
              tools, free software enthusiasts have developed quite a few performance-monitoring 
              applications for UNIX and Solaris in particular. Most of these free 
              utilities concentrate on the monitoring aspects of performance management 
              and do not include complex trend analysis and modeling capabilities. 
              One example of such a free performance-monitoring application is 
              a very popular utility by William LeFebvre, called "top" [3]. 
              Top is a standalone program that continuously lists processes with 
              highest CPU consumption percentage and displays other performance-related 
              information, such as some CPU and virtual memory statistics. This 
              handy tool quickly assesses the overall state of the system and 
              does not require complex setup or maintenance. However, its functionality 
              is very limited and does not provide for customization.
              Proctool by Walter Nielsen and Morgan Herrington is another freely 
              available performance-monitoring utility, and a favorite of 
              some UNIX systems administrators [4]. Proctool was originally inspired 
              by "top", but over the years it has evolved into a more sophisticated 
              application with a real Motif-based GUI and capabilities beyond those 
              of "top".
              Overall, most of the commercial as well as free performance-management 
              systems do not possess enough flexibility to satisfy certain custom 
              troubleshooting needs. As a rule, these tools are either too generic 
              and all-encompassing or too simplistic and narrow in scope. A very 
              common requirement, for instance, is the ability to set up custom 
              alerts, which are triggered if a particular combination of performance 
              measures meets or exceeds predefined thresholds. Some commercial 
              and free tools allow for this kind of monitoring; however, this 
              capability usually comes at the price of high complexity. Apart 
              from making the lives of systems administrators miserable, complexity 
              often implies excessive consumption of system resources by the performance-monitoring 
              tool, which severely cripples the performance data-collection process. 
              Even a relatively simple tool such as "top" has a resident 
              set size of 1.5 MB on our dual-CPU Sun ES 450 system and often comes 
              at the top of its own process list, indicating the highest CPU utilization 
              percentage. Other tools, such as BMC Patrol, are even worse. For 
              portability's sake, these tools do not read the kernel performance 
              statistics directly, but use programs such as iostat, netstat, 
              and vmstat to collect the data, thus incurring the overhead 
              of starting additional processes on a subject system. Another drawback 
              is that most of these tools require root access to the system, which 
              makes them simply unusable by those poor people who are not granted 
              the administrative access to their computers.
              Luckily, Adrian Cockcroft of Sun Microsystems and Richard Pettit 
              of Resolute Software developed a revolutionary approach to solving 
              performance-management problems. Instead of building a tool for 
              performance monitoring, they came up with a toolkit, called SE (Symbolic 
              Engine), consisting of a programming language interpreter, a few 
              handy libraries, and a bunch of example scripts, mimicking the functionality 
              of traditional vmstat, iostat, netstat, and 
              other UNIX utilities [5]. The fundamental element of the SE Toolkit 
              (www.setoolkit.com) is the SymbEL programming language, which 
              provides a foundation for building custom performance management 
              tools and utilities. The SE Toolkit is extremely versatile, efficient, 
              and easy to use. For the most part, it does not require root privileges; 
              the data collection algorithms, employed by the SymbEL interpreter, 
              access the kernel statistical data directly, which allows for building 
              very accurate performance monitors. The most attractive feature 
              of the Toolkit is its unlimited flexibility -- any custom tool 
              can be developed using SymbEL and the libraries that come with it. 
              However, to take full advantage of this flexibility, 
              one would have to master the SymbEL programming language.
              Solaris Performance Metrics Interfaces
              As I mentioned previously, reading the kernel statistics in order 
              to collect the performance data is an approach far superior to using 
              existing Solaris programs such as netstat, iostat, 
              sar, and others. Direct access to the kernel data eliminates 
              the need to start additional processes for data collection purposes, 
              thereby greatly reducing the overall resource consumption.
              Solaris 2 exposes numerous interfaces for collecting various performance 
              and status data. One of the oldest interfaces, available since 
              SunOS 4.x, is kvm, which stands for "kernel virtual memory". 
              As the name implies, this interface provides a way of accessing 
              information within the address space of an operating system. Besides 
              reading the kernel virtual memory on a running system, kvm 
              can be used to analyze a dump of a running kernel, which may be 
              a result of a system crash.
              Kernel virtual memory can be accessed by simply reading the /dev/kmem 
              file, which is a character-special file that provides user-level 
              access to the kernel memory image. However, the libkvm library provides 
              a more robust, higher-level interface by encapsulating the direct 
              access to /dev/kmem and /dev/ksyms and simplifying 
              the process of reading the kernel data [6]. Many Solaris performance 
              utilities, such as netstat, utilized and continue to utilize 
              the kvm interface despite its drawbacks, which include a 
              need for root access (netstat is setuid) and lack of thread 
              safety.
              The Kstat (which stands for "kernel statistics") library [7] 
              is a newer interface for performance metrics collection that eliminates 
              some of the disadvantages of kvm. Kstat functions access 
              the data stored within the user-level data structures. These are 
              essentially copies of similar structures within the OS kernel, so 
              that the kernel memory image no longer needs to be scanned directly. 
              Apart from the obvious higher portability, this approach also allows 
              for non-root access to a copy of the kernel data and solves thread-safety 
              problems. Because the Kstat data is stored in the "user space", 
              a user-level application may lock the Kstat structure, thus ensuring 
              that none of the data changes while it is being accessed. The kvm, 
              on the other hand, as a user-level library, has no mechanism for 
              preventing the kernel data structures from being modified by kernel 
              threads while the performance data collection operation is in progress.
              The performance metrics accessible via the Kstat interface are 
              stored in a linked list of structures, often referred to as the "Kstat 
              chain". There are actually two chains -- one stored in 
              the user space (user chain) and another stored in the kernel space 
              (kernel chain). Whenever an application process issues a data collection 
              request through the Kstat library, the library dispatches the ioctl 
              request to a special loadable driver, designed to act as a middleman 
              between the kernel and the user space. The driver then locks a corresponding 
              portion of the kernel Kstat chain and transfers the kernel copy 
              of the data into the user space. This mechanism prevents kernel 
              threads from modifying the Kstat chain while the collection operation 
              is in progress, thus ensuring consistency of the data being read 
              by the user-level process.
              Each node of the Kstat chain (often called simply "Kstat") 
              contains metrics that reflect the operations of a single functional 
              component, such as a disk device or network interface. Each Kstat 
              is generally identified by a unique "path" that consists 
              of three distinct elements:

               Module -- Uniquely identifies the functional area or subsystem. 
                The module name for disk devices, for instance, may be "sd" 
                or "ssd"; for network interfaces it is "tr" 
                (token ring), "le" (Lance Ethernet), etc. 
               Instance number -- Uniquely identifies the instance within 
                the module (e.g., the disk instance number). 
               Name -- Uniquely identifies the functional component. For 
                disk devices and network interfaces, the name is usually a combination 
                of the module and the instance number, such as "sd1" 
                or "hme0".
              Each Kstat data structure contains a common header and a variable 
              data portion. The header houses the Kstat identification information, 
              such as its module, instance, name, and type along with the pointers 
              to the data portion and the next Kstat within the chain. The data 
              portion has a variable structure and may be one of the following:

               Raw -- A chunk of memory that can be cast to an appropriate 
                C structure. An application dealing with raw Kstats must have 
                prior knowledge of which C structure the data portion of the Kstat 
                should be cast to. 
               Named -- An array of name-value pairs. 
               Interrupt -- A C structure containing information about 
                interrupts. 
               IO -- A specific C structure containing I/O statistics, 
                typically for disk devices. 
               Timer -- An array of name-value pairs similar to the named 
                type.
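To make the five layouts concrete, here is a small sketch that maps a ks_type value to a readable label. The numeric values mirror the KSTAT_TYPE_* constants in <sys/kstat.h> as we understand them (the named type being 1); the DEMO_ names and the helper function are our own, hypothetical, and chosen so that the fragment compiles without the Solaris headers.

```c
#include <assert.h>
#include <stddef.h>

/* Local mirror of the KSTAT_TYPE_* values from <sys/kstat.h>.
 * The DEMO_ names are hypothetical; check the numbers against your headers. */
enum demo_kstat_type {
    DEMO_KSTAT_TYPE_RAW   = 0,
    DEMO_KSTAT_TYPE_NAMED = 1,
    DEMO_KSTAT_TYPE_INTR  = 2,
    DEMO_KSTAT_TYPE_IO    = 3,
    DEMO_KSTAT_TYPE_TIMER = 4
};

/* Return a printable label for a ks_type value, or "unknown". */
const char *demo_kstat_type_name(int ks_type)
{
    static const char *const names[] = {
        "raw", "named", "interrupt", "io", "timer"
    };

    /* The table must have exactly one entry per type. */
    assert(sizeof names / sizeof names[0] ==
           (size_t)DEMO_KSTAT_TYPE_TIMER + 1);

    if (ks_type < 0 || ks_type > DEMO_KSTAT_TYPE_TIMER)
        return "unknown";
    return names[ks_type];
}
```

In a real program the argument would be the ks_type field of a kstat_t, and the label could be printed as a fourth column next to the module, instance, and name.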
              The Kstat library (libkstat) provides numerous functions 
              for opening and closing Kstat chains, traversing the Kstat nodes 
              and reading the performance data. The following little program, 
              designed to list all available Kstats on a given system, provides 
              an introductory example of the library usage:
              
             
 1  #include <stdio.h>
 2  #include <kstat.h>
 3   
 4  int
 5  main( int argc, char **argv ) {
 6     kstat_ctl_t *pc;
 7     kstat_t     *pk;
 8     if ( !( pc = kstat_open() ) ) {
 9        perror( "failed to open kstat" );
10        return -1;
11     }
12     printf( "%-10s%-5s%-16s\n", "module", "inst", "name" );
13     for( pk = pc->kc_chain; pk; pk = pk->ks_next )
14        printf( "%-10s%-5d%-16s\n",
15           pk->ks_module, pk->ks_instance, pk->ks_name );
16
17     kstat_close( pc );
18     return 0;
19 }
The program opens the Kstat chain (line 8) by calling the kstat_open 
            function and iterates through the linked list of Kstat nodes (lines 
            13 through 15) printing the module, instance, and name fields for 
            every Kstat. The kstat_open function returns a pointer to the 
            Kstat control structure, which, among other things, contains the pointer 
            to the head of the Kstat linked list (kc_chain). As I mentioned 
            previously, each Kstat contains the pointer to the next element of 
            the list (ks_next), which makes it easy to traverse the chain 
            from the beginning to the end. The program can be compiled with the 
            following command (assuming that the source code of the example above 
            is saved as lkstat.c): 
             
cc -o lkstat lkstat.c -lkstat
Running the resulting lkstat binary on our Sun ES 450 system 
            produces the following output:  
             
module    inst name            
unix      0    kstat_headers   
unix      0    kstat_types     
unix      0    sysinfo         
unix      0    vminfo          
unix      0    vmhatstat       
...
ufs       0    inode_cache     
sd        21   sd21            
sd        3    sd3             
cpu_stat  1    cpu_stat1       
cpu_info  1    cpu_info1       
cpu_info  3    cpu_info3       
cpu_stat  3    cpu_stat3       
...
The Kstat library is widely used by Solaris 2 performance monitoring 
            utilities -- most of the functionality exposed by the SE Toolkit, 
            for instance, is based upon the capabilities of the Kstat library. 
            Even simple programs, such as the well-known uptime, rely on 
            Kstat to obtain statistical information. Listing 1 shows a sample 
            implementation of the uptime utility, designed to further 
            demonstrate the versatility of the Kstat library.
              This program accesses a single Kstat, "unix.0.system_misc", 
              which contains some system usage information, such as the number of 
              clock interrupts since boot time. First, the program obtains 
              some system configuration information (the number of clock interrupts, 
              or ticks, per second) using the sysconf(3C) library function 
              at line 23. Then we open the Kstat chain using the kstat_open 
              function and get the handle of the desired Kstat. This time, instead 
              of iterating through the linked list of Kstats, we use the kstat_lookup 
              function, which takes the module, instance, and name elements of 
              the Kstat path and returns the pointer to the header of the "unix.0.system_misc" 
              Kstat structure (line 28). At line 32, we issue the kstat_read 
              call that signals the Kstat driver to read the kernel data into 
              the user chain. The "unix.0.system_misc" Kstat is of named 
              type, which can be easily checked by examining the ks_type 
              field of the kstat_t structure. It is set to the value of 
              1 (which indicates the named type), so we use the kstat_data_lookup 
              function to look up the values of the variables that we're 
              interested in.
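Conceptually, finding a Kstat by its path amounts to walking the chain and comparing the three path elements. The sketch below illustrates that idea on a simplified stand-in structure; demo_kstat and demo_lookup are hypothetical names, and this is not the actual libkstat implementation of kstat_lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for a Kstat header; NOT the real kstat_t. */
struct demo_kstat {
    const char        *module;
    int                instance;
    const char        *name;
    struct demo_kstat *next;      /* plays the role of ks_next */
};

/* Walk the chain from head and return the first node whose module,
 * instance, and name all match, or NULL when there is no such node. */
struct demo_kstat *demo_lookup(struct demo_kstat *head,
                               const char *module, int instance,
                               const char *name)
{
    struct demo_kstat *k;

    assert(module != NULL && name != NULL);

    for (k = head; k != NULL; k = k->next) {
        if (strcmp(k->module, module) == 0 &&
            k->instance == instance &&
            strcmp(k->name, name) == 0)
            return k;
    }
    return NULL;
}
```

A caller would treat a NULL result the same way a real program treats a failed lookup: report that the requested Kstat does not exist on this system and stop before trying to read it.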
              At lines 36 through 43, we call the kstat_data_lookup function 
              four times to obtain the values of the "clk_intr" variable 
              (which contains the number of clock ticks since the boot), and the 
              values of "avenrun_1min", "avenrun_5min", and 
              "avenrun_15min" variables. These represent the average 
              number of processes on the run queue within the last one, five, 
              and fifteen minutes, respectively. The "avenrun" variables 
              are used to calculate the system load average based on the formula 
              borrowed from the source code of the "top" utility [3]. I 
              believe that the native Solaris 2 uptime program uses the same formula, 
              which simply converts the unsigned long number into a double and 
              divides it by the scaling factor FSCALE, taken from /usr/include/sys/param.h. 
              We then use the value of the "clk_intr" variable to calculate 
              the number of days, hours, minutes, and seconds since the last boot 
              (lines 45 through 48). Finally, we print out the uptime information 
              along with the load average figures and close the Kstat chain. When 
              compiled and run on an ES 450, this program produces the following 
              output:
              
             
Up: 204 day(s) 20 hour(s) 36 minute(s) 4 second(s), load average: 0.02, 0.01, 0.01
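The arithmetic behind such an output line can be sketched in a few lines of C. This is only an illustration of the two conversions described in the text, not the code of Listing 1: the function and structure names are hypothetical, and DEMO_FSCALE assumes that FSCALE is defined as 1 << 8 in sys/param.h, which should be checked on the target system.

```c
#include <assert.h>

/* Assumption: FSCALE is (1 << 8) in /usr/include/sys/param.h. */
#define DEMO_FSCALE 256

struct demo_uptime {
    unsigned long days, hours, minutes, seconds;
};

/* Convert a clock tick count (clk_intr) into days, hours, minutes, and
 * seconds, given the number of ticks per second reported by sysconf(). */
struct demo_uptime demo_split_uptime(unsigned long clk_intr, long hz)
{
    struct demo_uptime u;
    unsigned long secs;

    assert(hz > 0);

    secs      = clk_intr / (unsigned long)hz;
    u.days    = secs / 86400UL;
    u.hours   = (secs % 86400UL) / 3600UL;
    u.minutes = (secs % 3600UL) / 60UL;
    u.seconds = secs % 60UL;
    return u;
}

/* Scale a raw avenrun_* value down to a conventional load average. */
double demo_load_average(unsigned long avenrun)
{
    return (double)avenrun / DEMO_FSCALE;
}
```

Assuming 100 ticks per second, demo_split_uptime(1769976400UL, 100) yields 204 days, 20 hours, 36 minutes, and 4 seconds, which matches the sample line above.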
The output of our uptime program is quite similar to that of the standard uptime 
            command; however, to save space, we excluded the code for calculating 
            the number of users on the system. The number of users can easily 
            be determined by reading the user and accounting information via the 
            utmpx(4) interface.
              Kstat and Perl
              Although the Kstat programming model is fairly simple, it still 
              requires extensive C programming skills, which may scare away even 
              the most experienced systems administrators. One major flaw of the 
              Kstat interface is the necessity to program around five different 
              types of Kstats. This makes the process of reading the performance 
              metrics inconsistent and may lead to obscure errors. This may not 
              be an issue for small and simple programs, such as our uptime 
              utility. However, developing an equivalent of vmstat, for 
              instance, would require access to a few different Kstat structures 
              of different types, which can easily lead to complex, convoluted, 
              and impossible to debug code. The SymbEL programming language of 
              SE Toolkit [5] takes a much more consistent approach by allowing 
              the developer to read values of any Kstat variables in a uniform 
              manner regardless of the Kstat type. Unfortunately, this consistency 
              comes at a price -- one would have to learn SymbEL.
              As usual, CPAN [8] offers a Perl extension module that enables 
              anybody with basic Perl programming skills to take full advantage 
              of the Kstat interface. This module, called Solaris::Kstat [9], 
              provides uniform access to Kstat data via a tied hash interface, so 
              that any Kstat variable can be read simply by using its module, instance, 
              and name as hash keys.
              To demonstrate the advantages of the Perl-based approach and provide 
              grounds for comparison, the uptime program was converted 
              into a Perl script (Listing 2).
              The first noticeable difference is that the Perl 
              script is almost half the size of the corresponding C version. 
              Also, no function calls are required to navigate through the Kstat 
              chain; instead, we simply read the values of the already familiar "clk_intr" 
              and "avenrun" variables from a hash, using the module, 
              instance, name, and variable names as hash keys (lines 9 through 
              15). The rest of the program remains the same, and it produces exactly 
              the same output as the C version of uptime. Clearly, the 
              Perl-based approach would appeal to administrators and developers 
              in search of custom performance-monitoring utilities.
              As mentioned previously, the main advantage of using the Kstat 
              interface is the ability to quickly develop custom performance-monitoring 
              scripts that check one or two very specific aspects of the system's 
              behavior and can be tailored to the needs of a particular environment. 
              While configuring database and file servers, for instance, one would 
              have to make sure that the load is spread evenly across all available 
              disk devices, hence the need to monitor for slow or overloaded disks. 
              Inspired by one of the example scripts that come with the SE Toolkit [10], 
              we created another program that detects disks whose response times 
              and utilization percentages exceed given thresholds (Listing 3).
              This program, called "slowdisk", takes three command-line 
              parameters: -i, which specifies the sleep interval between 
              taking the snapshots of the performance metrics; -s, which 
              sets the threshold for the service time; and -b, which specifies 
              the threshold for the utilization percentage. Besides using the 
              Solaris::Kstat module, this program loads the Solaris::MapDev 
              extension, which is also a part of the Solaris bundle by Alan Burlison 
              [9]. Solaris::MapDev is designed to provide the mapping between 
              the instance names used by the Kstat interface (e.g., "sd1") 
              and conventional device names (e.g., "c0t0d0"). We use 
              the get_inst_names function of Solaris::MapDev 
              to obtain the instance names for all disk devices on the system 
              (line 14). Since the function returns not only disk, but also tape, 
              floppy, CD-ROM, and other instance names, we apply a grep 
              filter to select only those that start with "sd" or "ssd" 
              (i.e., internal or storage array disks).
              The performance metrics, exposed via the Kstat interface, are 
              usually either running totals or instantaneous values. Thus, to 
              assess the performance characteristics of a particular disk, we 
              will have to take periodic snapshots of these values and then calculate 
              the averages over a time interval. For these reasons, we save the 
              initial values of performance metrics for each disk device using 
              the foreach loop at lines 16 through 21. This loop invokes 
              the Solaris::Kstat update function, which calls the kstat_chain_update 
              function of libkstat. This is necessary to synchronize the 
              user and kernel chains, because once in a while, the kernel would 
              modify its linked list of Kstats by adding new or removing old nodes. 
              Once the chain is synchronized, we parse the disk instance name 
              to obtain the module name ("sd" or "ssd") and 
              the instance number needed to read the data from the Kstat hash 
              (line 18). We then save the snapshot of Kstat values for a given 
              disk in a hash, pointed to by the $prev hash reference, using 
              the disk instance name as a key (line 19). Finally, we record the 
              time of a snapshot by reading the value of the "clk_intr" 
              variable.
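The step that splits a disk instance name into a module name and an instance number is easy to express in C as well. The sketch below assumes that the instance number is simply the run of trailing digits, as in "sd21" or "ssd3"; demo_split_instance is a hypothetical helper, not part of libkstat or Solaris::MapDev.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Split an instance name such as "ssd12" into module "ssd" and instance 12.
 * Returns 0 on success, or -1 when the name has no module part, has no
 * trailing digits, or the module does not fit into the supplied buffer. */
int demo_split_instance(const char *inst_name,
                        char *module, size_t module_len, int *instance)
{
    size_t len, digits_at;

    assert(inst_name != NULL && module != NULL && instance != NULL);

    len = strlen(inst_name);
    digits_at = len;

    /* Scan backwards over the trailing digits. */
    while (digits_at > 0 &&
           isdigit((unsigned char)inst_name[digits_at - 1]))
        digits_at--;

    if (digits_at == 0 || digits_at == len || digits_at >= module_len)
        return -1;

    memcpy(module, inst_name, digits_at);
    module[digits_at] = '\0';
    *instance = atoi(inst_name + digits_at);
    return 0;
}
```

The resulting pair, together with the instance name itself, forms the module, instance, and name path needed to address the disk's Kstat.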
              Having saved the initial state of our performance metrics, we 
              enter the main loop at line 23. At first the execution of the program 
              is suspended using the sleep function (line 24) with the 
              interval parameter, controlled by an optional command-line argument 
              -i. In case the command-line argument is not supplied, the 
              program sleeps for 5 seconds. When it wakes up, the new snapshot 
              of the Kstat performance metrics is taken and recorded -- this 
              time into a different hash, pointed to by $curr hash reference 
              (lines 25 through 31). Now that we have both current and previous 
              snapshots of the performance metrics, we can compare them and calculate 
              the average figures. The foreach loop at line 32 once again 
              iterates over each disk device, first calculating the elapsed time 
              between snapshots (line 33). At lines 34 through 37, we calculate 
              the average number of completed reads per second ($rps) and writes 
              per second ($wps) using the values of "reads" and "writes" 
              Kstat variables and elapsed time between snapshots, calculated in 
              the previous step. We then figure out the high-resolution time interval 
              ($hr_time) by calculating the difference between the current and 
              the previous values of the "wlastupdate" Kstat variable 
              that contains the time of the last update to the wait queue (lines 
              39 through 41).
              In case the current and the previous values of "wlastupdate" 
              are the same, we use the default high-resolution time interval of 
              1 ns (line 41). At lines 44 through 47, we compute the average busy 
              wait ($avw) and busy run ($avr) queue lengths using the values of the 
              "wlentime" and "rlentime" variables, which contain 
              the sum of the queue lengths multiplied by time at that length, 
              for the wait and run queues respectively. Finally, we compute the average 
              wait ($avwait) and average service ($avserv) times (line 
              49) using the previously calculated values of average busy wait and 
              busy run queue length and the total number of completed reads/writes 
              per second, calculated at line 42 as a sum of reads per second ($rps) 
              and writes per second ($wps). Then we can calculate the total response 
              or residency time ($svc_t at line 55) as a sum of average wait and 
              average service times, as well as average run percent ($r_pct at 
              line 53). This is the difference between the current and previous 
              times spent running ("rtime") divided by the high-resolution 
              time and is expressed as a percentage.
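For readers who prefer C to Perl, the calculation just described can be sketched as follows. This is our reading of the text, not a transcription of Listing 3: the structures are simplified stand-ins rather than the real kstat_io_t, the time-based counters are assumed to be in nanoseconds, and the factor of 1000 that expresses the wait and service times in milliseconds is an assumption.

```c
#include <assert.h>

/* Stand-in for the disk counters used in the text; field names follow the
 * Kstat variable names. Time-based fields are assumed to be nanoseconds. */
struct demo_disk_snap {
    unsigned long reads, writes;   /* completed operations, running totals */
    long long     wlentime;        /* wait queue length * time */
    long long     rlentime;        /* run queue length * time */
    long long     rtime;           /* cumulative time spent running */
    long long     wlastupdate;     /* time of last wait queue update */
};

struct demo_disk_stats {
    double rps, wps;               /* reads and writes per second */
    double avwait_ms, avserv_ms;   /* average wait and service times */
    double svc_t_ms;               /* total response (residency) time */
    double r_pct;                  /* average run percentage */
};

/* Derive per-interval averages from two snapshots taken elapsed_sec apart. */
struct demo_disk_stats demo_disk_delta(const struct demo_disk_snap *prev,
                                       const struct demo_disk_snap *curr,
                                       double elapsed_sec)
{
    struct demo_disk_stats s;
    double hr_time, avw, avr, tps;

    assert(prev != 0 && curr != 0 && elapsed_sec > 0.0);

    s.rps = (double)(curr->reads  - prev->reads)  / elapsed_sec;
    s.wps = (double)(curr->writes - prev->writes) / elapsed_sec;

    /* High-resolution interval; fall back to 1 ns if nothing changed. */
    hr_time = (double)(curr->wlastupdate - prev->wlastupdate);
    if (hr_time <= 0.0)
        hr_time = 1.0;

    /* Average queue lengths over the interval. */
    avw = (double)(curr->wlentime - prev->wlentime) / hr_time;
    avr = (double)(curr->rlentime - prev->rlentime) / hr_time;

    /* Per-operation times in milliseconds (assumed unit). */
    tps = s.rps + s.wps;
    s.avwait_ms = tps > 0.0 ? avw * 1000.0 / tps : 0.0;
    s.avserv_ms = tps > 0.0 ? avr * 1000.0 / tps : 0.0;
    s.svc_t_ms  = s.avwait_ms + s.avserv_ms;

    s.r_pct = (double)(curr->rtime - prev->rtime) * 100.0 / hr_time;
    return s;
}
```

As a worked example, 100 reads and 150 writes over 5 seconds give 50 operations per second; an average run queue length of 2.0 then corresponds to 40 ms of service time, and 4 seconds of run time out of 5 corresponds to an 80% run percentage.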
              Finally, we compare the newly calculated values against the thresholds, 
              specified by the command-line arguments (line 59), print out the 
              timestamp, disk name, and calculated values in case these thresholds 
              are met or exceeded (lines 57 through 58), then save the current 
              snapshot values in the $prev hash for subsequent iterations 
              (line 60). Note that we must convert the disk instance names into 
              conventional device names, which is done by the inst_to_dev 
              function of the Solaris::MapDev module at line 58. We also 
              apply some default threshold values in case the command-line arguments 
              are omitted -- 50 ms for total residency time and 20% for the 
              average run percentage (line 59). These threshold values are exactly 
              the same as those used by the virtual_adrian_lite.se script 
              [10] of SE Toolkit to detect slow disks [11, 12].
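The threshold test itself is tiny. The sketch below assumes that a disk is reported only when both figures meet or exceed their thresholds, which is how we read the description above; the names are hypothetical, and the default values are the 50 ms and 20% quoted in the text.

```c
#include <assert.h>

/* Defaults quoted in the text: 50 ms residency time, 20% run percentage. */
#define DEMO_DEFAULT_SVC_T_MS 50.0
#define DEMO_DEFAULT_BUSY_PCT 20.0

/* Return 1 when both the total response time and the run percentage
 * meet or exceed their thresholds, 0 otherwise. Requiring both
 * conditions at once is our assumption. */
int demo_is_slow_disk(double svc_t_ms, double busy_pct,
                      double svc_thresh_ms, double busy_thresh_pct)
{
    assert(svc_thresh_ms >= 0.0 && busy_thresh_pct >= 0.0);

    return svc_t_ms >= svc_thresh_ms && busy_pct >= busy_thresh_pct;
}
```

With the defaults, a disk showing a 51.11 ms response time at 73.16% utilization is flagged, while one at 49.9 ms is not, however busy it is.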
              When run on the ES 450 while heavy compress jobs are hitting 
              one of the file systems, the script yields the following results:
              
             
slow disk detected: 12:47:41     /dev/dsk/c0t0d0     51.11    73.16
slow disk detected: 12:50:37     /dev/dsk/c0t0d0     51.03    71.33
slow disk detected: 12:52:58     /dev/dsk/c0t0d0     51.32    72.10
slow disk detected: 12:54:08     /dev/dsk/c0t0d0     60.43    73.16
slow disk detected: 12:55:54     /dev/dsk/c0t0d0     60.82    68.77
              Conclusion
              Solaris::Kstat is one of the handiest and most exciting 
              Perl modules that CPAN has to offer. It is not, however, without 
              a flaw -- its programming model, although fairly portable and 
              easy to use, is relatively low level. To produce, for example, an 
              equivalent of the vmstat utility, one would have to possess 
              a deep knowledge of the system's internals as virtual memory 
              metrics are spread across multiple unrelated Kstat nodes. Also, 
              as noted, Solaris::Kstat programming requires the developer 
              to perform all computations to obtain the average figures used for 
              performance monitoring, which can get very involved. The SE Toolkit, 
              on the other hand, significantly simplifies programming by providing 
              high-level wrappers, or classes, such as vmstat, iostat, 
              netstat, and others, to encapsulate all the calculations 
              needed to obtain complete virtual memory, IO, or network metrics. 
              It is fairly trivial to create similar wrappers for the Solaris::Kstat 
              module. However, that is left as an exercise for the reader.
              Another problem is lack of reliable documentation for Kstats. 
              Solaris man pages are very sketchy and incomplete, and Adrian Cockcroft's 
              Sun Performance and Tuning [2] still remains the most comprehensive 
              source of information. To gain an intimate familiarity with the 
              subject, I strongly encourage readers to examine the source code 
              of the example scripts and classes of SE Toolkit. The SymbEL programming 
              language of SE Toolkit is quite similar to C, so most of the example 
              scripts can easily be understood and deciphered.
              References
              1. BMC Software. PATROL for Performance Management and Prediction. 
              http://www.bmc.com/products/esm/perfpred.html.
              2. Adrian Cockcroft, Richard Pettit. Sun Performance and Tuning, 
              2nd edition. Sun Microsystems Press, 1998. pp 26-37.
              3. William LeFebvre. UNIX Top. http://www.groupsys.com/top.
              4. Walter Nielsen, Morgan Herrington. Proctool. ftp://opcom.sun.ca/pub/binaries/proctool.
              5. Adrian Cockcroft, Richard Pettit. Sun Performance and Tuning, 
              2nd edition. Sun Microsystems Press, 1998. pp 449-556.
              6. Adrian Cockcroft, Richard Pettit. Sun Performance and Tuning, 
              2nd edition. Sun Microsystems Press, 1998. pp 373-386.
              7. Kstat (3K) manpage. Sun Solaris 2.
              8. CPAN -- Comprehensive Perl Archive. www.perl.com/CPAN-local.
              9. Alan Burlison. CPAN Directory ABURLISON. Latest release: Solaris-0.05a.tar.gz 
              10/2/1999.
              10. Adrian Cockcroft, Richard Pettit. virtual_adrian_lite.se. 
              RICHPse/examples.
              11. Adrian Cockcroft. System Performance Monitoring. Sun World 
              Online, 09/05/1995.
              12. Adrian Cockcroft. Clarifying Disk measurements and terminology. 
              UNIX Insider, 09/01/1997.
              Alexander Golomshtok is a professional consultant who, for 
              the last decade, has been hanging around downtown New York developing 
              large-scale software systems and infrastructure solutions for Wall 
              Street firms. He can be reached at: golomshtok_alexander@jpmorgan.com.
              Yefim Nodelman is a seasoned systems administrator with more 
              than seven years of professional experience in supporting large 
              UNIX and Windows installations.
           |