USENET ELM: A Case Study in Portability between UNIX Systems
 
Sydney S. Weinstein 
The diversity of UNIX systems requires "Universal
UNIX Applications" 
to be as portable as possible. The attempt to keep one
such application 
-- USENET Elm -- portable as both UNIX and C have evolved
has 
required constant effort and provides a useful case
study of UNIX 
portability issues. 
Dave Taylor wrote Elm in the mid-1980s while he was
working at Hewlett-Packard, 
then in 1987 released it, with HP's blessings, to the
USENET community. 
Like much freely distributable UNIX software, Elm is
released as source 
code compiled by the user or system administrator. Thus
portability 
of the system at the source code level is mandatory. 
Elm, in the UNIX vernacular, is a Mail User Agent (MUA).
It displays 
the contents of a mailbox or folder (sequential text
file containing 
mail messages), allows display of individual mail messages
from the 
mailbox, accepts replies to those messages, and allows
for generation 
of new messages for the Mail Transport Agent (MTA) to
deliver. Elm 
does not deliver the messages; instead, it passes them
to the MTA, 
which handles the routing and delivery. 
Early UNIX MUAs were line-oriented, as the standard
terminal in use 
was a hard-copy printing terminal. With the switch to
CRT-based terminals,
UNIX applications moved from a line orientation to a screen orientation.
As one
of the early screen-oriented MUAs, Elm incorporated
the best features 
of the line-oriented MUAs available in the mid-1980s
and extended 
the concept to a full-screen, menu-driven system. Designed
to be simple 
to use and "intuitive," yet not so restrictive
as to frustrate 
sophisticated users, Elm is currently used by approximately
250,000 
individual users, on over 20,000 systems. 
Original Elm Environment 
Elm was initially developed with HP-UX, a port of the
AT&T System 
V.2 version of UNIX. These systems used a K&R-style
C compiler (ANSI 
C was not yet a glint in someone's eye). Elm was coded
in the "loose" 
style common to software not intended to be ported between
very diverse 
systems. 
AT&T System V.2 Dependencies 
Hewlett-Packard based HP-UX on the Motorola MC680x0
family of processors. 
Processors in this family share certain characteristics: 
 
32-bit word length
32-bit integer length (int type)
32-bit argument passing (all arguments less than 32 bits long are
converted to 32-bit values when placed on the stack as arguments to functions)
32-bit pointer length
Large linear addressing space with no segmentation 
The common length for the pointer data type, argument
passing, and the int data type allowed for some very
loose 
programming practices, the most common being to intermix
the int 
and pointer data types freely, on the assumption that
an int 
can always hold a pointer value. The common length also
means that 
an integer/character argument passed to a function could
always be 
considered as an int. Casting arguments to convert the
types 
explicitly was not necessary. 
The large linear addressing space allowed large buffers
to be placed 
on the stack and used to hold data values without concern
for overflow. 
If overflow appeared likely, the size of the buffer
could be increased 
-- there was plenty of room. 
Because AT&T UNIX System V.2 limited filenames to
fourteen characters, 
the individual elements of a full path name (the filenames)
were short 
and the space reserved to hold path names was also very
small. In 
addition, Elm used the C library provided with this
UNIX since, at 
the time, no other version of UNIX had a different C
library. 
HP Function Keys 
The original Elm was developed on and hard-coded to
support Hewlett-Packard 
terminals. These terminals used their own keyboard layout
with their 
own set of function keys. They also allowed for labeling
the function 
keys on the screen directly above the keys themselves.
Since the HP 
method is not an industry standard, the decision to
hard-code support 
for HP terminals rather than use the termcap function key
fields 
has created even greater portability problems. 
Dave's Own Curses 
A common library package called curses generally performs
screen 
updating in UNIX programs. Dave Taylor, Elm's creator,
implemented 
his own, simpler, version of the curses package. He
handled 
only the low-level terminal control routines, such as
cursor move, 
up-line, down-line, and clear screen and left all the
actual screen 
intelligence to his display routines. Its limited interaction
with 
the curses package makes Elm very portable to other
systems. At the 
same time, however, the code's low-level nature makes
it very difficult 
to modify the screen code or add features. Instead of
hiding the screen 
intelligence in the curses routines, Dave distributed
it throughout 
many modules. 
Dave's curses package did make use of UNIX's underlying
terminal capability 
database. He used the calls from the older termcap system
instead 
of the newer System V.2 terminfo system. The termcap/terminfo
database tells applications programs how to perform
a common set of 
functions on many different types of terminals. It allows
UNIX tasks 
to be portable between terminal types. 
In general, if you are writing a "universal UNIX
application," 
you can best achieve portability by using the system
configuration 
libraries, such as termcap/terminfo. Use of these facilities
makes your program immediately portable to all systems
and equipment 
to which anyone has ported those facilities. In the
case of termcap/terminfo, 
your screen-oriented program can immediately function
on whatever 
types of CRT terminal are in use. 
Porting to BSD-Type Systems 
Elm's first major port was from the HP-UX version of
AT&T UNIX System 
V.2 to the other major variant of UNIX, the Berkeley
Software Distribution 
(BSD). This was (and still is) a logical first major
port, especially 
since one of the major UNIX minicomputers in the mid-1980s
was the 
DEC VAX. The University of California at Berkeley had
ported an earlier 
version of UNIX to the VAX and added support for demand-paged
virtual
memory and extended networking. This version became
known as BSD UNIX. 
The DEC VAX is very similar to the MC680x0 family. Both
share the 
32-bit features and large linear addressing space listed
earlier, 
but the DEC VAX orders its bytes in the reverse order
of the MC680x0. 
Since each processor is internally consistent, this
difference becomes 
significant only if a memory region is addressed as
two different 
data types. In the case, say, of a memory area addressed
both as a 
text string and as an integer value, the integer value
0x41424344 
(1,094,861,636) would be ABCD on the MC680x0 family
and 
DCBA on the VAX family. 
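The effect can be sketched in a few lines of C. This is only an illustration: the helper names are hypothetical, and it assumes a compiler that provides the fixed-width uint32_t type from <stdint.h>, which the K&R compilers of the period did not.

```c
#include <stdint.h>
#include <string.h>

/* Copy the four bytes of value, in memory order, into out as a string. */
static void bytes_as_text(uint32_t value, char out[5])
{
    memcpy(out, &value, 4);     /* view the integer's storage as characters */
    out[4] = '\0';
}

/* Return 1 on a little-endian host (VAX-style byte order),
 * 0 on a big-endian host (MC680x0-style byte order). */
static int host_is_little_endian(void)
{
    uint32_t probe = 1;
    unsigned char first;

    memcpy(&first, &probe, 1);  /* lowest-addressed byte of the integer */
    return first == 1;
}
```

Calling bytes_as_text(0x41424344, buf) yields the two spellings described above, depending on which kind of host runs it.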
For purposes of portability, it is necessary to make
sure no data 
structure refers to the same area of memory with two
different fundamental 
types. All strings must be passed as string pointers.
The short cut 
of placing a couple of characters into an int and passing
the 
int will no longer work: the characters would come out
backwards 
on the VAX family. In addition, code must examine union
data 
structures to see which fundamental type is being used.
Further, if 
the union is used to overlay two fundamental data types,
the 
code must take into account the byte ordering of the
system on which 
it is running. 
Failure to implement these subtle coding changes will
not cause compiler 
errors or link problems; instead, the result will be
strange behavior 
at execution time. The program could crash with an invalid
pointer, 
for example, or it could get a cursor movement string
out of sequence 
and scramble the display. These types of problems are
very difficult 
to track down. 
BSD 4.2/4.3 vs. AT&T System V.2 
For the application programmer, the major differences
between the 
BSD UNIX family and the AT&T UNIX family reside
in the #include 
files and C runtime libraries. Each team developed its
own runtime 
library, with the result that similar routines have
different names. 
Also, identical data structures ended up in different
#include 
files. The differences show up most notably in the string
and memory 
manipulation functions (see Table 1). In particular,
the source and destination arguments
to memcpy are in the reverse order from the
same arguments to bcopy.
Not only are the string routines defined differently,
but the header 
files that declare them have subtly different names.
The AT&T UNIX 
name is <string.h>, while under BSD UNIX, it's
<strings.h>. 
As a further complication, some routines exist in only
one of the 
systems. Note that memset is generic, and the 0 used
to initialize the block of memory is passed as an argument.
bzero, 
on the other hand, can only set a block of memory to
zero. Several 
other string functions included with System V.2
do not exist 
on early, or "pure," BSD systems. These include
most of the 
library routines that start with the prefix str, as
documented 
on the string(3) manual page. These routines, at least,
will 
show up as missing header files at compile time or undefined
externals 
at link time, making these types of problems much easier
to track 
down. 
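One way to absorb these naming differences is a small compatibility layer. The sketch below maps BSD-style calls onto the System V names; the compat_ prefix is hypothetical, and it assumes an ANSI C library that supplies memmove (used because bcopy tolerates overlapping blocks).

```c
#include <string.h>

/* Hypothetical compatibility layer: BSD-style calls expressed with the
 * System V / ANSI routines. Note the swapped source and destination. */
#define compat_bcopy(src, dst, n)  ((void) memmove((dst), (src), (n)))
#define compat_bzero(dst, n)       ((void) memset((dst), 0, (n)))
/* bcmp only reports equal (0) or different (non-zero). */
#define compat_bcmp(a, b, n)       (memcmp((a), (b), (n)) != 0)
#define compat_index(s, c)         strchr((s), (c))
#define compat_rindex(s, c)        strrchr((s), (c))
```

Code written against the BSD names can then be built on a System V library by changing only this header.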
Only rarely are functions with the same name in both
versions used 
for different purposes. However, many similar commands
take different 
arguments in the two versions, affecting shell scripts
and spawned 
commands. 
Long vs. Short Filenames 
One of the more annoying differences between the older
AT&T UNIX versions 
and the BSD versions is the AT&T 14-character filename
limit. This 
difference normally creates problems when porting from
BSD to AT&T 
(if filenames are longer than 14 characters), but can
also cause difficulties 
when porting in the opposite direction. Usually, in
this case, the 
problem deals with buffer lengths. Most programs written
for systems 
without the flex-file names (the name for the longer
file names used 
in BSD systems) leave relatively short buffers for path
names. With 
the longer filenames these buffers often overflow, causing
name truncation 
or, worse, other data items on the stack to be overwritten. 
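A defensive way to build path names is to check the length before trusting the buffer. The following is a minimal sketch, not Elm's code: join_path is a hypothetical helper, and it relies on snprintf, which arrived with C99 and was not available on the systems discussed here.

```c
#include <stdio.h>

/* Join dir and name into buf as "dir/name".
 * Returns 0 on success, -1 if the result would not fit in buflen bytes. */
static int join_path(char *buf, size_t buflen, const char *dir, const char *name)
{
    int needed = snprintf(buf, buflen, "%s/%s", dir, name);

    if (needed < 0 || (size_t) needed >= buflen)
        return -1;              /* truncated: caller must not use buf */
    return 0;
}
```

Reporting the failure, rather than silently truncating, avoids both of the symptoms described above.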
Since the filenames are of different lengths, it follows
that the 
directory structures must also differ. For this reason,
the directory 
access functions differ in the data types of their arguments.
This 
difference can also result in programs that compile
correctly but 
do not produce the expected results. Symptoms include
directory listings 
within the program that appear to be missing files or
that show garbage 
filenames or the inability of the program to find files
in the directory. 
Mailbox Locking 
Another component of UNIX that was not yet standard
when the AT&T 
and BSD split occurred was file locking, and both versions
developed 
their own method of handling interlocks to prevent two
processes from 
writing to the same file. The original mail systems
created a semaphore 
file in the mail spool directory to indicate their locking
of the 
spool file. This scheme worked well for local systems,
but required 
that the mail user agent and the mail transport agent
have permission 
to create files in the spool directory. The steps in
locking of this 
type are: 
1. Attempt to create a file of the name LCK..name
   in the spool directory.
2. If the create succeeds, you have locked the file.
3. If the create fails, then someone else has locked
   the file already. If the iteration limit has not been
   exceeded, sleep for a short duration, then return to
   step one to try again.
4. If the iteration limit has been exceeded, report the
   error to the user and, optionally, just ignore the lockfile.
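The steps above can be sketched as follows. This is an illustration under assumptions, not Elm's implementation: the function names are hypothetical, and the exclusive create uses open with O_CREAT | O_EXCL, which older systems may not have offered.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to create lockpath exclusively, retrying up to max_tries times.
 * Returns 0 when the lock is obtained, -1 on a hard error or when the
 * iteration limit is exceeded. */
static int acquire_lockfile(const char *lockpath, int max_tries, unsigned sleep_secs)
{
    int tries;

    for (tries = 0; tries < max_tries; tries++) {
        int fd = open(lockpath, O_WRONLY | O_CREAT | O_EXCL, 0444);

        if (fd >= 0) {
            close(fd);
            return 0;               /* create succeeded: we hold the lock */
        }
        if (errno != EEXIST)
            return -1;              /* unrelated failure, e.g. no permission */
        if (tries + 1 < max_tries)
            sleep(sleep_secs);      /* someone else holds it: wait, retry */
    }
    return -1;                      /* iteration limit exceeded */
}

/* Remove the lock file when the mailbox update is finished. */
static int release_lockfile(const char *lockpath)
{
    return unlink(lockpath);
}
```

The caller decides what to do when -1 comes back, including the optional choice of ignoring the lock.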
Later revisions of this method placed the process id
(PID) of the owning process in the lock file. When the
create 
failed, the file could be opened for read and a system
call would 
determine if the lock was stale (the process that owned
it no longer 
existed). If the lock was stale, it would be removed
and the locking 
process would be repeated. 
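A stale-lock check of this kind might look like the sketch below. The article does not name the system call involved; the use of kill with signal 0 is an assumption, as are the function name and the convention that the PID is stored as decimal text.

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/* Return 1 if the PID recorded in lockpath no longer exists, 0 if the
 * owner still appears to be alive, -1 if the file cannot be read or parsed. */
static int lock_is_stale(const char *lockpath)
{
    FILE *fp = fopen(lockpath, "r");
    long pid;
    int got;

    if (fp == NULL)
        return -1;
    got = fscanf(fp, "%ld", &pid);
    fclose(fp);
    if (got != 1 || pid <= 0)
        return -1;
    /* Signal 0 performs the existence check without delivering a signal. */
    if (kill((pid_t) pid, 0) == -1 && errno == ESRCH)
        return 1;
    return 0;
}
```

Only a result of 1 justifies removing the lock file and repeating the locking process.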
AT&T System V.2 used this revised method for mailbox
locking. BSD 
systems started with this locking protocol, but due
to atomic file 
creation problems with NFS (Network File System), switched
to locking 
the file only using the kernel file locking system call.
Newer UNIX 
System V.4 systems use a system call for locking the
file that is 
different from that used by the older BSD systems. 
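A portable program can hide the choice of kernel locking call behind one function. The sketch below assumes the BSD call is flock and the System V call is fcntl record locking; the symbol USE_FLOCK_LOCKING and the function name are hypothetical.

```c
#include <fcntl.h>
#include <unistd.h>
#ifdef USE_FLOCK_LOCKING            /* hypothetical configuration symbol */
#include <sys/file.h>
#endif

/* Request an exclusive, non-blocking lock on the whole file.
 * Returns 0 on success, -1 if the lock could not be obtained. */
static int lock_whole_file(int fd)
{
#ifdef USE_FLOCK_LOCKING
    return flock(fd, LOCK_EX | LOCK_NB);
#else
    struct flock fl;

    fl.l_type = F_WRLCK;            /* exclusive (write) lock */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                   /* zero length means "to end of file" */
    return fcntl(fd, F_SETLK, &fl);
#endif
}
```

The rest of the program calls lock_whole_file and never needs to know which mechanism the system provides.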
Using the wrong locking technique for the system results
in a window 
of time where two tasks can write to the mailbox. This
can cause garbled 
messages, lost messages, or truncated mailboxes. If
your program opens 
a file for writing, you must consider how file locking
is performed 
on all systems to which your application will be ported.
 
Changes in the Port 
No method exists for writing a single set of code that
can handle 
both the System V.2 and the BSD versions of UNIX. However,
the #ifdef 
command of the C preprocessor makes it possible to integrate
both 
versions into the same source files. Elm used this method
to provide 
a single version of source code for both systems. The
initial #ifdef 
symbol was BSD and was passed to the C compiler via
the Makefile. 
ifdefs then handled the code required for the different
serial 
communications system calls (setting up the serial
line communications 
modes), different string routines, and different header
files. In 
addition, this port revealed some of the weaknesses
regarding buffer 
sizes mentioned earlier. During the port, all the buffer
sizes were 
adjusted to fit the needs of the larger of the two systems. 
Elm did not run into any problems with byte ordering
at this stage 
of the port. However, byte order did become a problem
once it became 
possible to share the Elm alias database between NFS-linked
systems. 
A surprise arose in the different implementations
of the 
<ctype.h> macros for character manipulation. The
standard System 
V.2 macros toupper and tolower, which convert a character's
case, would change only lower- or upper-case characters,
respectively. 
If the character passed to the macro was not the appropriate
case, 
no change was made. For example, in the statement 
 
c = tolower('a');
 
under System V.2, c would contain the lower case 
letter `a'. Under BSD, the macro is implemented as 
 
#define tolower(c) ((c) - 'A' + 'a') 
 
This macro turns the lower case a (0x61) 
into a SOH code with the eighth bit set (0x81). The
two macros 
had to be redefined as follows to make the code compatible
for both 
System V.2 and BSD: 
 
#define	tolower(c) (isupper(c) ? ((c) - 'A' + 'a') : (c))
 
The isupper macro now protects the code, preventing
translation of all but upper-case letters. However,
this redefinition 
is still not fully portable. It assumes that lower-
and upper-case 
letters are always the same distance apart in the character
set as 
the upper and lower case 'a'. This is true for ASCII,
but not 
for all character sets. 
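Where an ANSI C library is available, the character-set assumption can be avoided by letting the library do the conversion. The following is a sketch of that alternative, not the change made in Elm; safe_tolower is a hypothetical name.

```c
#include <ctype.h>

/* Lower-case c only when it is an upper-case letter; leave every other
 * value alone. As a function, it evaluates its argument exactly once,
 * unlike the macro versions. */
static int safe_tolower(int c)
{
    return isupper((unsigned char) c) ? tolower((unsigned char) c) : c;
}
```

Because the distance between cases is never computed by hand, the routine does not depend on ASCII ordering.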
Heterogeneous System File Sharing 
The next big portability hurdle for Elm came when systems
were linked 
together via NFS into one common disk cluster. NFS allowed
many different 
types of systems -- even non-UNIX systems -- to share
disk partitions, 
and many sites mounted the users' home directories via
NFS. Elm, which 
uses a file for global aliases, then also needed to
access the private
alias data across the NFS file system. Since
the system where 
the file resided and the system running Elm were not
necessarily of 
the same type, byte order immediately became an issue.
Big vs. Little Endian 
The battle over the order by which to number a word's
bits and bytes 
has often been compared to the wars waged by the Lilliputians
of Gulliver's 
Travels over such issues as which end of the egg should
be eaten 
first, the little or the big end. Networking forced
UNIX to rise above 
this war and declare a truce, or at least a translator. 
Since all networks need multibyte addresses to identify
all of the 
hosts and circuits, these addresses must share a common
byte order. 
Communication becomes impossible if a single machine
is known as node 
0x1234 on one system and node 0x3412 on others. The
solution is to 
pass bytes over the network in network byte order. For
TCP/IP 
networks, specifications issued by the Network Information
Center 
document this order. Several macros (see Table 2) assist
the C programmer 
in placing the bytes in that order (each routine converts
one item 
into the proper byte ordering). Elm was adapted to store
its alias 
tables using these routines, with the result that the
table appears 
the same whether the machine accessing it is "little-endian"
or "big-endian." Users whose home directory
is cross-mounted 
via NFS can access their private alias table regardless
of which type 
of system they are on. In addition, the global or master
alias table 
can also be shared across systems. 
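A minimal sketch of storing a value in network byte order follows. It assumes the conversion routines referred to are the htonl/ntohl family from the BSD networking library; put_u32 and get_u32 are hypothetical helpers and do not reflect Elm's actual alias file layout.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value into buf in network byte order. */
static void put_u32(unsigned char *buf, uint32_t host_value)
{
    uint32_t net = htonl(host_value);   /* host order -> network order */

    memcpy(buf, &net, sizeof net);
}

/* Read back a 32-bit value that was stored in network byte order. */
static uint32_t get_u32(const unsigned char *buf)
{
    uint32_t net;

    memcpy(&net, buf, sizeof net);
    return ntohl(net);                  /* network order -> host order */
}
```

A record written with put_u32 on one machine reads back correctly with get_u32 on a machine of either byte order.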
NFS Locking 
NFS added a degree of portability to Elm, but it also
brought problems. 
File locking, already discussed in the section on mailbox
locking, 
was late to be standardized under UNIX. The multiple
locking methods 
require portable C programs to adapt their locking methods
to each 
system's standard. NFS makes that situation a bit worse.
Since NFS 
is stateless, cross-system locking cannot be defined
using the standard 
method (lockf or flock) for NFS-mounted file systems.
To work around the problem where remote programs access
files via 
NFS, some systems use a special daemon, rpc.lockd, to
perform 
the locks locally on the system where the files actually
reside. This 
requires the portable C program to have yet another
method of locking 
files. At present (versions 2.3 and 2.4), Elm does not use the
lock daemon. 
Coping with System Differences 
As the prior sections demonstrate, many of the modifications
required 
for portability between UNIX versions, or for that matter,
between 
UNIX and other operating systems, require changes to
the code for 
each system type. Yet, to maintain several versions
of the same file, 
one for each different standard, would be impractical
and would lead 
to problems such as inconsistent code, wasted space,
and a complicated 
makefile procedure. 
Fortunately, C provides a construct to handle these
differences with 
a single source file. 
The C preprocessor has three commands -- #if, #ifdef,
and #ifndef -- that do much of the work in creating
portable 
programs. 
#if tells the preprocessor to emit the lines following
the
command, up to the next #else or #endif, only if
the expression on the #if line is true. Each symbol
in the expression
is evaluated based on its value at that point in the
file. These are
symbols, not variables, so each must be set to a value
using a #define
statement or the -Dsymbol=value argument on the compiler
command line.
#ifdef tells the preprocessor to emit the lines following
the
command until it reaches an #else or #endif if the symbol
named on the #ifdef line has been defined. It does not matter
what value 
the symbol has. The symbol can be defined by a #define
statement, 
by the -Dsymbol argument to the compiler command line
with 
or without a value, or could have been predefined within
the preprocessor 
itself. System manufacturers generally predefine a symbol
within their 
C preprocessor to identify the system. This symbol is
intended to 
delimit code that must differ for their system. 
#ifndef tells the preprocessor to emit the lines following
the command until it reaches an #else or #endif if the
symbol has not been defined. The symbol can either never
have been 
defined or have been cleared by an #undef command. 
In all three cases, the C preprocessor will emit the
lines following 
the command if the condition is met, causing the compiler
to compile 
the lines on later passes. If the condition is not met,
the C preprocessor 
just outputs a blank line for each line being skipped.
When the #else 
command is reached, if there is one, the action is reversed.
In any 
case, the if condition ends at the #endif command, which
is required. 
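The three commands can be seen together in a short sketch. MAX_PATH_BUF, PATH_BUF_KIND, and SYSTEM_FAMILY are hypothetical symbols chosen for illustration; BSD stands for a symbol that would normally be supplied with -DBSD on the compiler command line.

```c
/* #ifndef: supply a default only when nothing else has defined the symbol. */
#ifndef MAX_PATH_BUF
#define MAX_PATH_BUF 256
#endif

/* #if: the choice depends on the symbol's value. */
#if MAX_PATH_BUF >= 1024
#define PATH_BUF_KIND "large"
#else
#define PATH_BUF_KIND "small"
#endif

/* #ifdef: the choice depends only on whether the symbol is defined. */
#ifdef BSD
#define SYSTEM_FAMILY "BSD"
#else
#define SYSTEM_FAMILY "System V"
#endif

static const char *path_buf_kind(void) { return PATH_BUF_KIND; }
static const char *system_family(void) { return SYSTEM_FAMILY; }
```

Compiling with -DMAX_PATH_BUF=1024 or -DBSD changes which lines reach the compiler without any change to the source file.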
The conditions can be nested in such a way that a check
for one symbol 
is conditional on the preceding check for another. However,
portability 
requires that you nest statements in a way that all
C compilers will 
understand. For ease of readability, it is often useful
to indent 
nested ifdefs as 
 
#ifdef CONDITION1
    #ifdef CONDITION2
    #endif CONDITION2
#else CONDITION1
    #ifndef CONDITION3
    #endif !CONDITION3
#endif CONDITION1
 
Two aspects of this construct can create problems for
some compilers. First, many C preprocessors require
that the # 
character be in the first column of the line. And, second,
many do 
not allow symbols on the #else and #endif lines. To
ensure portability, type the lines as follows 
 
#ifdef CONDITION1
#	ifdef CONDITION2
#	endif /* CONDITION2 */
#else /* CONDITION1 */
#	ifndef CONDITION3
#	endif /* !CONDITION3 */
#endif /* CONDITION1 */ 
 
Since ifdefs are often nested to many levels and 
the #else or #endif might not be close to the command
which it affects, placing the condition name as a comment
on the #else 
and #endif lines helps to clarify the structure. 
Elm has always based its system portability changes
on ifdefs, 
and as the number grew, the comments were added to make
the range 
of each ifdef more apparent. However, this proliferation
of 
ifdefs leads to the next problem, what is the proper
condition 
to use? 
How to Use #ifdef 
When Elm was first ported, all of the changes required
for the BSD 
version were grouped under the symbol BSD. This led
to code 
fragments like 
 
#ifdef BSD
#	define strchr index
#	define strrchr rindex
#	include <sys/pwd.h>
#	undef tolower
#	undef toupper
#else
#	include <pwd.h>
#endif 
 
Such constructs allow compiling the BSD version with
just the symbol -DBSD added to the CFLAGS= line of the
makefile. Problems arose, however, as Elm was ported
to systems that 
were hybrids of the pure System V.2/V.3 and BSD 4.2/4.3
versions. 
No longer were all of these changes required all of
the time. 
A better approach is to define a symbol for each portability
change 
itself, rather than for the system as a whole, and to
name each
symbol as closely as possible after the condition it tests.
If the 
previous code fragment had been written as 
 
#ifdef HAS_INDEX
#	define strchr index
#	define strrchr rindex
#endif
#ifdef PWDINSYS
#	include <sys/pwd.h>
#else
#	include <pwd.h>
#endif
#ifdef TOLOWER_MACRO
#	undef tolower
#	undef toupper
#endif 
 
then, as the different operating system versions required
different combinations of changes, the CFLAGS= line
could be 
changed as needed. If the CFLAGS= line in the makefile
becomes 
too complicated, then in one global header file, included
first in 
all modules, a code sequence similar to 
 
#ifdef ATT_SVR2
#	undef HAS_INDEX
#	undef PWDINSYS
#	undef TOLOWER_MACRO
#endif
#ifdef SUNOS_41
#	define HAS_INDEX
#	define PWDINSYS
#	define TOLOWER_MACRO
#endif
#ifdef HPUX_8
#	define HAS_INDEX
#	undef PWDINSYS
#	undef TOLOWER_MACRO
#endif 
 
could handle each of the combinations with only a single
flag on the CFLAGS= line of the makefile. 
Using this type of code sequence in the include file,
porting to a 
new operating system would only require listing the
features the system 
supports. Of course, any new quirks of that operating
system would 
generate new names and changes to the code in the rest
of the program. 
But still, the makefile would require only the name
of the version 
on its CFLAGS= line. 
A side effect of this change is that there are now many,
if not hundreds, 
of symbols created to ensure the widest portability,
and it becomes 
very difficult to determine the proper values for a
new operating 
system version/port for each of these symbols. But with
proper coding 
style, help is on the way later in this article in the
section on 
Metaconfig. 
The Merge of System V and BSD 
The merger of the System V and BSD standards into the
new System V 
Release 4 standard has really complicated the
choice of ifdef. 
Besides changing the location of many #include files,
this 
standard splits into separate conditions many of the
old combinations 
of things that used to go together as a single ifdef.
In particular, 
SVR4 supports many items using both styles, and which
style is better
varies from item to item.
Elm used to group most of the BSD compatibility changes
together. 
Now that SVR4 has most of those items within the System
V defines, 
these ifdefs had to restrict their range once again,
making 
it all the more important to choose the ifdef symbol
to cover 
as little as possible -- preferably just the single
change required 
for the port. Then, when the underlying operating system
changes, 
at worst the symbols will simply need to be defined/undefined
to adapt. 
Metaconfig and Configure 
Larry Wall has written many programs for C programmers
and has shared 
them with the USENET community. All of the programs
run on many different 
types of UNIX operating systems. To simplify porting,
Larry wrote 
a shell script called Configure for his rn program
(a USENET 
network news reader) that tried to determine automatically
the values 
needed for the various ifdef symbols. Where the script
could 
not determine the answer automatically, it would ask
for "local 
preference" items. To automatically configure the
software, you 
just typed Configure at the shell prompt. 
The Configure script would identify the location of
needed commands 
and libraries, check the contents of those libraries
to determine 
which functions were available, and ask the user for
local preference 
items. From these, an #include file was built and included
into each source file. The header file contained the
results of the 
program and function checks as #define SYMBOL or #undef
SYMBOL lines. It also included the preference items
as #define 
PREFERENCE_SYMBOL value lines. 
Coding the program to take advantage of Configure's
symbols allowed 
immediate configuration at the source level. However,
writing the 
Configure script by hand for each new program was tedious.
Since most 
of it was boilerplate, and whole sections could be used
by many different 
programs, the script was a perfect candidate for a tool
that would generate
it automatically. Since Larry was working on a very
large program 
with many portability changes, he used the program as
both the reason 
to develop the tool and as a method of developing it.
The program 
was Perl, and the tool he developed is Metaconfig. 
Metaconfig is a large Perl script that scans a list
of files, called 
a manifest, looking for all symbols used on #if type
lines in the .c, .h, and .y files, and all shell 
variables used in the .SH files. These symbols form
the wanted 
list. Using these symbols, Metaconfig then searches
a library of shell 
script fragments, called units, for those units that
define 
the symbols on the wanted list. Each of the units also
lists the other 
units it requires, if any. All of these units are then
combined in 
an order to satisfy the dependencies, and placed with
a common start 
and end code to form the shell script Configure. 
Since the units are common and reusable, a library of
units was quickly 
developed that Metaconfig can use for other programs.
Each unit is 
placed in a file named by combining the primary symbol
name with a 
.U suffix. These units form the master library used
by Metaconfig. 
Each program also has a local library of units which
are similar to 
the master units, but incorporate changes to the master
library equivalent 
unit. The local override units are given the same name
as the master 
library unit they replace. When Metaconfig is run, it
generates a 
message specifying which local units will override the
equivalent 
units from the master library. 
In addition to the override units, the local library
includes units 
that are specific to a program and not considered useful
to other 
programs. These custom unit files are also named by
combining the 
primary symbol name and a .U suffix. 
Metaconfig units and the symbols they define fall into
three categories: 
1. Symbols that are automatically determined by the Configure
   script and cannot be overridden by the user.
2. Symbols that are automatically determined by the Configure
   script, but can also be overridden by the user. The automatically
   determined value becomes the default value the first time the
   script is run. The answer given the last time the script was run
   is the default value for each subsequent time the Configure
   script is executed.
3. Symbols that are local preference items. No automatic
   value is possible. Sometimes the unit's code specifies a
   suggested value for a default value the first time Configure
   is run. Configure uses the answer from the prior run as the
   default for each subsequent run. 
An example of the first case would be to check for certain
functions in the C library. Configure automatically
determines what 
C functions exist in the libraries chosen to link the
application. 
This list is available via a shell function and is used
to define 
symbols based on the availability of individual functions. 
Listing 1 shows d_strcspn.U, a unit from Elm's local
Metaconfig 
library, which checks the existence of certain C functions.
The lines 
preceded by a ? are control lines for Metaconfig. 
RCS-type lines are comment lines for use by the Revision
Control System and
contain version tracking information.
MAKE-type lines contain a list of shell symbols defined
in this unit, followed 
by a colon (:), and then the list of symbols/functions this unit
requires to be already
defined. This second list is the dependency list. The d_strcspn.U
unit defines two shell
symbols, d_strspn and d_strcspn, and requires that the shell
symbols and libc already
be defined. The first symbol before the colon is the primary
symbol. The unit's filename
must match this symbol with a .U suffix.
The second MAKE line defines the types of operations
the dependency
makefile requires for this unit (the definition of these types
is too long to be
included here, but is explained in the Metaconfig documentation).
S-type lines are extracted to form documentation
on the shell symbols
available in the different unit files. The metaconfig source
includes a program
that automatically extracts these lines from all of the
units to produce
a document on the available symbols.
C-type lines function similarly to S-type 
lines, but for symbols defined for use in C code rather
than in shell 
scripts. Once again, the Metaconfig source includes
a program that 
automatically extracts these lines and forms a document
on all of 
the available C preprocessor symbols.
H-type lines are used by Metaconfig to automatically
generate the configuration include file. 
The remainder of the lines comprise the shell script
fragment. In 
the simple example in Listing 1, the shell script uses
a fragment 
of shell code that is contained in the shell variable
inlibc. 
The libc unit defines this variable, thus the libc dependency
on the first MAKE-type line. The inlibc function searches
the name list from the C libraries to see whether the
symbol in the 
shell variable $1 exists. If it does, the symbol in
$2 
is set to define. If not, the symbol in $2 is set to
undef. The set command on the line preceding the inlibc
call initializes $1 and $2. Using the value just 
set into the symbol d_strspn, the ?H-type lines will
automatically produce a #define or #undef for the symbol
STRSPN. The C code can then use the line #ifdef STRSPN
when it needs to call the strspn C library function,
and provide 
alternate code following a #else line. 
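In C source, that pattern might look like the following sketch. STRSPN is the symbol named above; elm_strspn and the fallback body are hypothetical and are not taken from Elm's source.

```c
#include <string.h>

#ifdef STRSPN
/* Configure found strspn in the C library: use it directly. */
#define elm_strspn(s, accept) strspn((s), (accept))
#else
/* Fallback: length of the initial run of s that consists only of
 * characters found in accept. */
static size_t elm_strspn(const char *s, const char *accept)
{
    size_t n = 0;

    /* The explicit '\0' test is needed because strchr also matches
     * the terminating null of accept. */
    while (s[n] != '\0' && strchr(accept, s[n]) != NULL)
        n++;
    return n;
}
#endif
```

The rest of the program calls elm_strspn everywhere and is unaffected by which branch Configure selected.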
d_internet.U (Listing 2) provides an example of the
second 
type of Metaconfig unit, one that allows the user to
input a value 
to override the default. The header lines are the same,
but the shell 
script fragment is a bit more complicated. The first
section uses 
the case construct to set the default value for the
d_internet 
symbol based on the value in the shell variable d_internet
from the prior run. If the d_internet variable is empty,
or 
not one of the strings define or undef, the default
value is set based on some conditions the shell script
can check on 
its own. In this case, those symbols are set by other
units or by 
shell code directly in this unit. The middle section
echoes a message 
that explains the meaning of the symbol the user is
about to define. 
The script then asks the question, presenting the default
answer to 
the user. Lastly, the result the user types is checked
to see how 
to define the shell symbol d_internet. 
The last type of Metaconfig unit is used to define a
user choice or 
local preference. The unit for these looks almost identical
to the 
unit shown in Listing 2. The only difference is in how
the default 
value is set when there is no prior answer to use. While
d_internet.U 
used a value determined by the Configure script as the
default, this 
local preference unit uses a hard-coded default directly
in the shell 
fragment. Of course, it is still preferable to remember
the answer 
from the last Configure run and use that as the default
whenever possible. 
Just as C files can directly include the .h file written
by 
the Configure script, shell scripts and other non-C
files can use 
the shell variables in the file config.sh created by
the Configure 
script to adapt to the results of the Configure run.
The Configure 
script executes all files ending in .SH in the manifest
to 
produce the appropriate adapted file. Listing 3 shows
an extract from 
the makefile prototype, Makefile.SH, in Elm's master
directory. 
The .SH files are broken into three sections. The first
section, 
which runs up to the echo statement, locates the config.sh
file, which contains all the answers obtained by the
Configure script. 
Configure then reads this file into the current shell.
The second 
section uses the shell variables to modify the lines
with the results 
of the config.sh just read. The last section just adds
the 
remainder of the file that does not need the variables
substituted. 
In the listing, the line [...] indicates that lines
were deleted 
from this example. The actual makefile is much larger. 
By coding your program to take advantage of the existing
library of 
units, you can achieve instant portability between most
UNIX operating 
systems with Metaconfig. In addition, by allowing for
local preferences, 
Metaconfig provides an easy means of customizing the
distribution. 
International Portability 
The upcoming 2.4 version of Elm tackles a totally new
problem -- 
international portability. The ASCII character set,
which most UNIX 
systems use, takes advantage of the English language's
26-character 
alphabet to be a seven-bit code, with the eighth bit
within the eight-bit 
byte used for parity. On most UNIX systems, internally,
the eighth 
bit is always zero, cleared by the istrip terminal control
parameter. 
Eight-Bit Clean 
For languages with alphabets of more than 26 characters,
the eighth 
bit is used to extend the character set to support additional
characters. 
Any program destined for international consumption,
then, must be 
eight-bit clean, which means that you do not alter or
clear 
the eighth bit of any character value, and you do not
depend on all 
character values to be positive when viewed as signed
characters. 
The international standard treats all characters as
unsigned quantities. 
Using the eighth bit to extend the character set also
changes the 
definition of an alphabetic character. It is no longer
valid to consider 
the ranges `A'-`Z' and `a'-`z' as the only 
alphabetical characters. All checks for the type of
character should 
use the macros defined in <ctype.h>. It is the
system's responsibility 
to have the proper values in this file and its associated
modules 
in the C library to support the local character set.
Because Elm has 
always been eight-bit clean and has always used the
macros instead 
of direct comparisons, version 2.4 required no changes
in these areas. 
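The two rules combine in practice roughly as follows. This is a sketch only: count_alpha is a hypothetical helper, not a function from Elm, and it assumes nothing beyond the standard <ctype.h> interface.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* count_alpha is a hypothetical helper, not a function from Elm. */
size_t count_alpha(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* Cast through unsigned char: on systems where plain char is signed,
         * a byte with the eighth bit set would otherwise reach isalpha() as
         * a negative value, which the ctype macros do not accept. */
        if (isalpha((unsigned char)*s))
            n++;
    }
    return n;
}
```

Because the test is delegated to isalpha, the same code counts accented letters once the system's locale tables classify them as alphabetic; a comparison against `A'-`Z' never would.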
It's worth noting that some character sets are too large
even for 
eight bits (the Japanese Kanji character set, for example,
uses 16-bit
characters). For purposes of international portability,
your program 
should not assume an eight-bit character type. 
NLS and Message Catalogs 
Changes to messages, prompts, and commands from English
to the local 
language represent the most significant challenge in
internationalization. 
Since most programmers do not speak all the languages
needed to please 
all of the potential users of their programs, how do
you solve this 
problem? 
The solution uses the concept of Native Language System
(NLS) support. 
The X/Open standards committee, a group of computer
companies, produced 
an NLS usable for UNIX that provides several components: 
 
LOCALE functions for setting the desired character 
set and language characteristics, including bit length,
collation 
sequence, and character attributes.
System error messages in each of the locally supported
language sets.
Message catalog support. 
The LOCALE subsystem tells the C runtime library 
which character set is in use. The user typically names
the desired 
character set in an environment variable. The locale
functions read 
the variable and set up the appropriate structures and
collation lists. 
ctype.h macros use these character attributes to determine
the class of each character. The collating sequence
allows the extended 
characters to be sorted in appropriate order, rather
than be grouped 
at the end due to the unused portion of the character-set
code space. 
The user also sets the language for system error messages
in an environment variable. The locale functions initialize
the 
syserror structure with the messages in the appropriate
language. 
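A program typically opts in to this behavior once at startup. The sketch below uses only the standard setlocale and strcoll calls; the helper names use_environment_locale and compare_names are hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* use_environment_locale and compare_names are hypothetical helper names. */

/* Adopt the locale named in the user's environment; returns 1 on success. */
int use_environment_locale(void)
{
    /* An empty locale name asks the library to consult the environment. */
    return setlocale(LC_ALL, "") != NULL;
}

/* Compare two strings using the collating sequence of the current locale. */
int compare_names(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcoll(a, b);
}
```

Sorting with compare_names rather than strcmp is what lets extended characters fall in their proper alphabetical position instead of after `z'.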
The most important change is support for message catalogs.
Because 
most C programs, including Elm until the 2.4 release,
code their messages 
directly into the source, a single compiled version
cannot output 
different messages based on the language desired. Rather
than requiring 
that messages for every supported language be coded
directly into 
the program, the NLS solution gives the user the ability to
define new message 
catalogs that include the text of all of the messages,
translated 
by the user, into the chosen language. For example,
for the command that
scans a message for calendar entries, Elm would display
a message on
the screen using the C code fragment 
 
PutLine0(LINES-3, strlen(Prompt),
"Scan message for calendar entries...");
 
This fragment, in English only, places the message at
the bottom of the screen. A message catalog function,
however, obtains 
the message from a file based on its message number.
The file can 
be translated into any language so that the program
can automatically 
speak that language. Recoding the example using the
message catalog 
functions yields 
 
PutLine0(LINES-3,
strlen(Prompt),
catgets(elm_msg_cat, ElmSet, ElmScanForCalendar,
"Scan message for calendar entries..."));
 
The function catgets reads the message catalog 
and loads into memory all the messages from the set
ElmSet, 
if they are not already in memory. It then returns the
text string 
of the message ElmScanForCalendar. If no message catalog
is open on the descriptor elm_msg_cat, or there is no set
ElmSet 
or no message ElmScanForCalendar, the string contained
in the 
call is returned as the default answer. 
The function that opens the message catalog, catopen(),
uses 
the language environment variable to select the correct
file from 
the application program's set of message catalogs, each
of which contains 
the application's messages in a single language. The
program that 
compiles the messages into the file also produces a
C header file 
that defines the set and message number symbols. 
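The open, look up, and close sequence can be sketched as follows, using the X/Open catopen, catgets, and catclose calls from <nl_types.h>. The helper names open_catalog, lookup_message, and close_catalog are hypothetical, as is any catalog name passed to them; this is not how Elm's source is organized.

```c
#include <assert.h>
#include <nl_types.h>
#include <stddef.h>

/* open_catalog, lookup_message, and close_catalog are hypothetical helper
 * names; the catalog name passed in is whatever the application chooses. */

/* Open a message catalog; returns (nl_catd)-1 if none can be opened. */
nl_catd open_catalog(const char *name)
{
    assert(name != NULL);
    /* With oflag 0, the LANG environment variable selects the language. */
    return catopen(name, 0);
}

/* Fetch a message, falling back to the built-in English text. */
const char *lookup_message(nl_catd cat, int set_id, int msg_id,
                           const char *fallback)
{
    assert(fallback != NULL);
    if (cat == (nl_catd)-1)
        return fallback;          /* no catalog open: use the default */
    return catgets(cat, set_id, msg_id, fallback);
}

/* Close the catalog if it was opened successfully. */
void close_catalog(nl_catd cat)
{
    if (cat != (nl_catd)-1)
        catclose(cat);
}
```

Keeping the English text as the fallback argument means the program still produces sensible output on a system where no catalog has been installed.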
Because word order rules and conventions vary among
languages, a straightforward 
string replacement mechanism would produce garbled messages.
Where 
an English message reads "6 messages received,"
for example, 
the message in another language might read "received
6 messages." 
In C, the printf function converts the numbers into
text strings 
and builds simpler strings into complete messages. If
the string "messages,"
or its translation, is in the variable msgs,
and the
string "received" is in the variable rcvd, then the message
could
be output with the printf statement 
 
printf("%d %s %s\n", num_msgs, msgs, rcvd);
 
Since the arguments are passed in order on the stack,
the printf function just uses them in order to fulfill
its 
format string. To turn that message into "received
6 messages," 
printf must access the arguments on the stack in a different
order. NLS provides for this ability with an extension
to the printf 
function. If a conversion specification in the format string
contains an integer followed by a $
character, that integer is interpreted as the ordinal
of the argument
on the stack to use for that conversion. The same
string would 
then be printed as 
 
printf("%1$d %2$s %3$s\n", num_msgs, msgs,
rcvd);
 
It then becomes easy to turn the message around to say
"received 6 messages" using 
 
printf("%3$s %1$d %2$s\n", num_msgs, msgs,
rcvd);
 
Once again, the different format strings for these last
two printf statements would be obtained from the message
catalog 
using the catgets() function. The final printf statement
would read 
 
printf(catgets(elm_msg_cat, ElmSet,
ElmMessagesReceived, "%d %s %s\n"),
num_msgs, msgs, rcvd); 
 
In addition, the values for the variables msgs 
and rcvd can also be obtained from the message catalog.
 
The English version does not need the $ notation as
the arguments 
are used in their natural order. The translations in
the message catalog 
would use the $ notation as needed. 
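The effect of swapping format strings can be tried in isolation. In this sketch the format string is passed in as a parameter, standing in for one fetched with catgets; format_count is a hypothetical helper, and the code assumes a printf family that supports the positional notation (it is an X/Open and POSIX feature, not part of ISO C).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* format_count is a hypothetical helper. The format string stands in for
 * one fetched from a message catalog; it must consume an int and two
 * strings, in whatever order the translation needs. */
int format_count(char *buf, size_t size, const char *fmt,
                 int num_msgs, const char *msgs, const char *rcvd)
{
    assert(buf != NULL && fmt != NULL);
    /* Positional (n$) conversions are an X/Open extension, now in POSIX;
     * they are not part of ISO C, so this needs a POSIX printf family. */
    return snprintf(buf, size, fmt, num_msgs, msgs, rcvd);
}
```

Called with 6, "messages", and "received", the format "%1$d %2$s %3$s" yields "6 messages received", while "%3$s %1$d %2$s" yields "received 6 messages"; the C call itself never changes.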
The problem remains of writing for an operating system
whose vendor 
doesn't support NLS. Several freely distributable programs
provide 
NLS support, including new versions of the printf family
of 
functions. Elm, with release 2.4, will include one such
program so 
that users whose systems don't support NLS will still
be able to compile 
new message catalogs for the language of their choice. 
Future Portability Issues 
Up to this time, Elm has supported only electronic mail
interchange 
using UNIX-based messaging systems. These systems use
the RFC-822 
standard to format messages. A newer, international
standard, entitled 
X.400, has been approved by the CCITT (the international
telecommunications standards
body). This standard allows for a hierarchical address
to any place 
in the world, on any computer system. And, unlike RFC-822,
it has 
a companion standard, X.500, similar to the telephone
directory white 
pages. The X.500 standard allows distributed directory
services, which 
means that knowing only a name, one could look up the
electronic mail 
address. Elm must eventually evolve beyond its purely
UNIX mail roots 
and handle X.400 messaging systems directly, instead
of behind an 
RFC-822-to-X.400 gateway. 
The change in the UNIX market from character-based
terminals to
bit-mapped terminals running Graphical User Interfaces
(GUIs) also
has implications for Elm's development. Both
major GUI
standards, OpenLook and OSF/Motif, use the X Window
System. Future
versions of Elm will have to support these as well as
the traditional 
character-based interfaces. A complete redesign of Elm's
user interface 
-- to replace menus with buttons and add support for
sliders and 
multiple windows -- will be required. 
These and other changes will wait for a rewrite after
2.4 is released. 
Like all programs that have evolved through a long development,
Elm 
at some point will need to be rewritten totally to clean
up convoluted 
code and remove some of the past assumptions. Such a
rewrite provides 
the best opportunity to consider the portability issues
that created 
problems in the past and to design in ways of handling
them.  
 
 About the Author
 
Sydney S. Weinstein, CDP, CCP is a consultant, columnist,
lecturer, author,
professor, and president of Datacomp Systems, Inc.,
a consulting and contract
programming firm specializing in databases, data presentation
and windowing,
transaction processing, networking, testing and test
suites, and device
management for UNIX and MS-DOS. He can be contacted
care of Datacomp Systems,
Inc., 3837 Byron Road, Huntingdon Valley, PA 19006-2320
or via electronic
mail on the Internet/Usenet mailbox syd@DSI.COM (dsinc!syd
for those who
cannot do Internet addressing). 
 
 
 |