Safer CGI Scripting

Charles Walker and Larry Bennett

The CGI is the simplest and by far the most common way of providing Web pages with dynamic content. Essentially, the CGI (Common Gateway Interface) is a way for the Web server to invoke a program to generate HTML that gets sent back to the Web browser, rather than simply serving up a static HTML file. Without the CGI and other similar dynamic content schemes, many things would be impossible on the Web -- stock trading and booking of vacations, for example, and just about anything requiring input from users. The Web would still be simply a mechanism for downloading static documents. Figure 1 shows how CGI scripts fit into the picture.

These programs invoked by the Web server are called CGI scripts. The name of the program is sent by the Web browser in the URL, followed by arguments to the CGI script. The Web server sets up the CGI script's environment so that it can access the arguments, then starts the CGI script. The CGI script then runs, does whatever the programmer coded, and writes its output to stdout. The Web server redirects stdout back to the Web browser that sent the request.

With static HTML, the Web server simply sends the requested HTML file back to the user's Web browser, which then interprets the HTML, formats it, and displays it. Take this URL for example:

 http://www.trionetworks.com/hypertrak/techwhite.htm

This causes the server www.trionetworks.com to send the content of the file /hypertrak/techwhite.htm back to the Web browser. Look at the following URL for an example of how a CGI script might be invoked:

 http://www.trionetworks.com/cgi-bin/hmshow.cgi?func=showlist&rt=all&set=all

On this Web server, the directory cgi-bin has been configured to contain CGI scripts rather than static HTML. The Web server invokes the program hmshow.cgi and lets it generate the HTML to send back to the browser, rather than sending the contents of a static file. The characters following hmshow.cgi in this URL are known as the "query string". The query string always starts with a "?", which is followed by the parameters to be passed to the CGI script and their values. In this case, the parameters are called func, rt, and set, and their values are showlist, all, and all, respectively. The Web server stores the query string (without the leading "?") in the environment variable QUERY_STRING before invoking the CGI script. The CGI script parses the contents of QUERY_STRING and takes appropriate action.
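As a rough illustration of that parsing step, here is a minimal sketch in C. The function name get_param and its return convention are made up for this example; a real script would more likely use an existing CGI library. It copies the raw value of one parameter into a caller-supplied buffer and refuses values that do not fit.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the raw (still percent-encoded) value of
 * parameter `name` from a query string such as "func=showlist&rt=all"
 * into `out`. Returns 0 on success, -1 if the parameter is missing or
 * its value does not fit in `outsize` bytes.
 */
int get_param(const char *query, const char *name, char *out, size_t outsize)
{
    size_t namelen = strlen(name);
    const char *p = query;

    while (p != NULL && *p != '\0') {
        const char *end = strchr(p, '&');           /* end of this pair */
        size_t pairlen = end ? (size_t)(end - p) : strlen(p);

        if (pairlen > namelen && strncmp(p, name, namelen) == 0 &&
            p[namelen] == '=') {
            size_t vallen = pairlen - namelen - 1;
            if (vallen >= outsize)
                return -1;                          /* would overflow out */
            memcpy(out, p + namelen + 1, vallen);
            out[vallen] = '\0';
            return 0;
        }
        p = end ? end + 1 : NULL;                   /* next pair, if any */
    }
    return -1;
}
```

In a CGI program the first argument would typically be getenv ("QUERY_STRING"), which may be NULL; the loop simply falls through to the error return in that case.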

A CGI script can be written in almost any programming language that will not get in the way of checking the values of environment variables, reading from stdin and writing to stdout. The majority are written in Perl, but they can also be written in C or the shell scripting language of your choice.

What Bad Things Can a CGI Script Do?

Passing Unchecked User Input to a Subshell

A CGI script can, intentionally or otherwise, do anything that the user it runs as can do. Typically, CGI scripts run as the same user as the Web server. On most UNIX systems, the Apache Web server is used and by default, Apache runs as user "nobody". By convention, "nobody" is a user for unprivileged operations. Some may think that something running as nobody could not do much to compromise a Web server, but there are many ways security can be compromised.

There are many files sprinkled around the typical UNIX system that are readable by all users, but that you probably don't want made public. A prime example is /etc/passwd. This file contains a list of all users on your system, and if you are not using a shadow password file, also contains the encrypted forms of all of your users' passwords. If a hostile party can manage to get a copy of /etc/passwd, you are wide open to a password guessing attack, even if you use a shadow password file, because the attacker now knows every account name. If you don't use a shadow password file, you will be easy prey for a dictionary attack, whereby a program encrypts a long list of words and compares them against encrypted passwords.

It is easy to write a CGI script that is vulnerable to malicious query string contents, which can make a CGI script do things it was never intended to do (e.g., sending a file on the server back to a hacker's Web browser). A classic CGI security problem occurs when a CGI script starts a shell and passes it data from the query string without carefully checking the query string contents.

Listing 1 contains a CGI script that appears relatively innocuous -- it provides a way for somebody to run a whois query from his or her Web browser. For example, this HTTP request will return information about IP address 207.46.130.45:

GET /cgi-bin/buggy.cgi?parm=207.46.130.45

However, this script will do much more, if you know how to ask. Notice that this CGI script uses the backticks (`) operator on the next to last line to run the whois command. This causes Perl to start a shell (i.e., /bin/sh) and have it execute the command inside the backticks. Now imagine what would happen if you had sent the following HTTP request:

GET /cgi-bin/buggy.cgi?parm=207.46.130.45;%20cat%20/etc/passwd

The script would still run a whois query. However, look at what has been added onto the end of the request. Most shells use ";" to separate commands. A percent sign (%) followed by two hexadecimal digits is the way that a character's hexadecimal code is sent in a URL. Since it is in the query string, the CGI script must substitute the equivalent character. In this case, the character equivalent to %20 is a space. After the CGI script parses the query string, it will pass the following to the subshell that it spawns:
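The %XX substitution described above can be sketched in C as follows. The function name url_decode and the decision to reject an encoded NUL byte are choices made for this illustration, not something taken from the listings.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: decode %XX escapes from `src` into `dst`.
 * Returns 0 on success, -1 on a malformed escape, an encoded NUL byte,
 * or if the result does not fit in `dstsize` bytes.
 */
int url_decode(const char *src, char *dst, size_t dstsize)
{
    size_t n = 0;

    if (dstsize == 0)
        return -1;
    while (*src != '\0') {
        char c = *src++;

        if (c == '%') {
            char hex[3];

            if (!isxdigit((unsigned char)src[0]) ||
                !isxdigit((unsigned char)src[1]))
                return -1;          /* "%" not followed by two hex digits */
            hex[0] = src[0];
            hex[1] = src[1];
            hex[2] = '\0';
            c = (char)strtol(hex, NULL, 16);
            if (c == '\0')
                return -1;          /* %00 would silently cut the string short */
            src += 2;
        }
        if (n + 1 >= dstsize)
            return -1;              /* no room for c plus the terminator */
        dst[n++] = c;
    }
    dst[n] = '\0';
    return 0;
}
```

Decoding is not validation. Fed the request above, this function faithfully produces "207.46.130.45; cat /etc/passwd", which is exactly the string that must not reach a shell unchecked.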

whois 207.46.130.45; cat /etc/passwd

Because the CGI script naively passes all of its parameters to the shell without checking them, the contents of /etc/passwd would be sent back to the hacker's Web browser. The hacker could then commence guessing passwords. This CGI script can also execute any other command that the Web server can execute, simply by tacking the commands onto the end of the query string. It would be equally easy to have the same bug in a CGI program written in C, which uses the system () call to run whois.

Of course, there are plenty of other things that this outwardly innocent-looking CGI script could be coaxed into doing -- like emailing the password file or any other file readable by the Web server to a mailing list.

Fixing this Vulnerability

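CGI scripts should not start a subshell if there is a way around it. When system () or exec () is given a single command string, Perl hands it to the shell only if the string contains shell metacharacters; since that is exactly what a malicious query string supplies, treat every one of these calls as a potential subshell. In Perl, a subshell can be started in any of the following ways: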

1. Using the backtick operator, as demonstrated above

2. Opening up a pipe to a program. For example, the code fragment below would cause the process /bin/lpr (a program that submits print requests) to be started, and would cause anything that the Perl script writes to SPOOLER to be redirected into the standard input of /bin/lpr:

open (SPOOLER, "| /bin/lpr 2>/dev/null");

3. Using system () to start a program:

system ("/bin/ps -ef");

4. Using exec () to start a program:

exec ("/bin/ps -ef");

In C, a subshell is started in the following ways:

1. Using system ()

2. Using popen ()
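One way around a subshell in C is to start the program directly with fork () and execv (), so that no shell ever parses the arguments. The sketch below assumes a POSIX system; the function name run_without_shell is invented for this example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run the program at `path` with the argument
 * vector `argv`, without involving /bin/sh. Returns the child's exit
 * status, or -1 on error or abnormal termination.
 */
int run_without_shell(const char *path, char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;                  /* fork failed */
    if (pid == 0) {
        execv(path, argv);          /* absolute path: PATH is not searched */
        _exit(127);                 /* only reached if execv failed */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With an argument vector such as { "whois", value, NULL }, a value of "207.46.130.45; cat /etc/passwd" reaches whois as one odd-looking argument instead of being split into two commands. The location of the whois binary is system dependent, and the value should still be validated, since a leading "-" could be read as an option. A CGI program would also want to call fflush (stdout) before forking so that buffered output is not written twice.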

If a subshell must be started, then the following precautions should be followed:

1. Parse all input. Determine which set of characters is valid for the particular type of input token you are expecting, and allow ONLY those characters. Either remove or escape other characters. It is not as simple as scanning a piece of data for shell metacharacters and rejecting anything that contains a metacharacter, for two reasons:

It is easy to forget one metacharacter and let it slip through.

Some characters that are shell metacharacters may be valid in some positions in the input. For example, ";" is a valid character to appear in a file name. But ";" is also a command separator. Suppose a CGI script is displaying the contents of a file on behalf of a user, with the file name coming from the user. The script has a line of code like this:

`/bin/cat $filename`

$filename is the name that came from the user. As we have previously seen, a hacker could insert a ";" into the file name with a command after it, thereby causing the shell to execute the command. The hacker could send in a "file name" that looks like this:

myfile;ls

This would cause the file myfile to be displayed, followed by a directory listing. Suppose that you must allow ";" in the file name because it is a valid file name character. Before it is passed to the shell, the script should change $filename to this:

myfile\;ls

This would cause the file myfile;ls, if there is such a file, to be displayed. Be careful -- the hacker may know that you are escaping metacharacters, and he may have already sent in a file name that looks like this:

myfile\;ls

If the CGI script simply sticks another "\" before the ";", then what gets passed to the shell is:

myfile\\;ls

This would cause the file myfile\ (if it exists) to be displayed, followed by the directory listing that you didn't want the hacker to see.

As you can see, there are all sorts of games that can be played with metacharacters, and having anything other than a blanket ban on metacharacters in input can be tricky and hard to test. It is a matter of balancing flexibility versus security. Even when a CGI script is checking its input, it is important to remember that all programs have bugs and the more complex a program is, the more bugs it is likely to have. Complex logic to check input is more likely to have bugs than simple logic to check input, and exploiting bugs is what hackers do.

This is a list of metacharacters for various shells:

;<>*|`&$!#()[]{}:'"/^\n\r

If any of these are in data passed to subshells, make sure that they are properly escaped and make sure that all possibilities are tested.

2. Specify the absolute filename of any commands, so that the PATH environment variable will not be used to find the command. Also ensure that the PATH environment variable is set to a known value. It should contain only directories that are writable solely by the owner of the directory. The reason it is important to set PATH to a known, good value, even if your CGI script does not use PATH to find commands, is that your script might start a command that relies on PATH.

The risk in relying on PATH to find commands is that a hacker could have modified PATH to include a particular directory. That directory could contain a malicious script placed there by the hacker.

If a command that your script starts relies on PATH, the danger is mitigated by allowing only directories that are solely writable by their owners. That will reduce the risk of executing a malicious script.
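Both precautions can be sketched in C. The first function accepts a value only if every character comes from an explicit allowed set; the second replaces the inherited PATH with a fixed one. The function names and the particular PATH value are assumptions for illustration; choose directories that suit your own system.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return 1 if `s` is non-empty, at most `maxlen`
 * characters long, and made up only of characters listed in `allowed`;
 * return 0 otherwise.
 */
int is_whitelisted(const char *s, const char *allowed, size_t maxlen)
{
    size_t len;

    if (s == NULL)
        return 0;
    len = strlen(s);
    if (len == 0 || len > maxlen)
        return 0;
    return strspn(s, allowed) == len;   /* every character is in the set */
}

/*
 * Hypothetical helper: replace whatever PATH was inherited with a short,
 * fixed list. "/bin:/usr/bin" is an assumed value, not a recommendation.
 */
int set_safe_path(void)
{
    return setenv("PATH", "/bin:/usr/bin", 1);
}
```

For the whois example, a call such as is_whitelisted (value, "0123456789.", 15) lets a dotted IP address through and rejects anything containing ";" or a space, without the script ever needing a list of metacharacters.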

If CGI scripts are coded in Perl, a further measure that can (and should) be taken is to turn on "taint" checking. Taint checking is a feature of Perl that forces a program to check untrusted input and environment variables. To turn on taint checking in Perl 5, change the CGI script to add the -T option to the invocation of Perl, as shown below:

#!/usr/bin/perl -T

If only this change and no other changes are made to the CGI script shown in Listing 1, the script shown dies with the following message:

Insecure dependency in `` while running with -T switch.

This is because the scalar $parm is considered "tainted".

The principle of taint checking is that all data from outside the program, or derived from data outside the program, must be "laundered" before the data is used in such a way that it could affect something outside your program. Until data is laundered, it is considered tainted. Attempting to use tainted data in any command that invokes a shell, or in any command that modifies files, directories, or processes, will cause the program to die.

To launder tainted data, the program must perform a regular expression match. It must then derive the new value of the data from subpattern variables set by the regular expression match. Let's attempt to untaint the data in our example CGI script. After splitting $key and $val, insert the code from Listing 2, which will guarantee that $val has only certain characters.

When a hacker attempts to add an additional shell command onto the end of the whois request, the script will detect it and terminate. Remember that although taint checking makes you check your input, it doesn't enforce the quality of the checking. The quality of the checking is up to you.

Now let's attempt to run this script with good input and see what happens. This time, it dies with another error:

Insecure $ENV{'PATH'} while running with -T switch

The problem this time is that our script is still running with the PATH environment variable that it inherited from the Web server, which has an unknown value and is considered tainted. As previously mentioned, it is bad practice to leave PATH at anything other than a known value that contains only directories writable solely by their owners. Listing 3 shows the fixed script. This version of the script does not pass extra commands to the shell, and will only execute the intended programs.

Imagine what could have happened if the Web server had been running as root. The CGI script we started with could have been hijacked to do anything, such as emailing the shadow password file to somebody, or trashing an entire file system. Never run the Web server as root. It is usually necessary to start the Web server as root so that it can open the HTTP port, but it should be configured to change to another user, such as nobody, after it is finished with initialization. In fact, it is not a bad idea to set up a user specifically for running the Web server, because there are often other services that run as nobody.

Other Problems with Bad Input Data

Suppose you have a form with three radio buttons on it. At the bottom of the form there is a "submit" button that, when clicked, causes a CGI script to run. These radio buttons select a text file on the server, which the CGI script will write to the user's browser.

The three possible files that can be selected by this set of radio buttons are "file1", "file2", and "file3". Here is a possible implementation of the CGI code:

 # Since this CGI script is outputting plain text, not HTML, tell 
 # the browser to expect plain text.
 print "Content-type: text/plain\n\n";
 
 # Write the contents of $radiobutton to stdout. Value should be 
 # file1, file2 or file3 since input could only have come from our 
 # form. There's no need to check it -- since all our users are 
 # nice people :-)
 open (FD, $radiobutton);
 print <FD>;

The problem here is that perhaps the form was not used to send the form input. It is trivial for a hacker to display the source HTML for a form and determine what variables the CGI script is expecting in the form data. From there it is not too much more work for the hacker to manually generate an HTTP request using telnet or a simple HTTP client of the hacker's own creation to send a request to the server containing bad form data. A hacker might have guessed (or known, if you are using a public domain CGI script) that the script you are running to handle the form input has this vulnerability. Instead, this piece of code should do something like this:

 
 if ($radiobutton =~ m/^(file1|file2|file3)$/)
 {
     # Tell the browser to expect plain text.
     print "Content-type: text/plain\n\n";

     # $1 can only be one of the three names matched above.
     open (FD, $1);
     print <FD>;
 }
 else
 {
     # Either set $radiobutton to some default value and process or
     # die with error.
 }
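The same idea carries over to a CGI program written in C. In this sketch (the names allowed_files and select_file are invented for the example) the user's value is only ever compared against a fixed table, and the program goes on to use the table's own string rather than the bytes that arrived in the request.

```c
#include <stddef.h>
#include <string.h>

/* The only values the form is supposed to send. */
static const char *const allowed_files[] = { "file1", "file2", "file3" };

/*
 * Hypothetical helper: return the table entry equal to `choice`, or NULL
 * if there is none. The caller opens the returned string, never the
 * user's own input.
 */
const char *select_file(const char *choice)
{
    size_t i;

    if (choice == NULL)
        return NULL;
    for (i = 0; i < sizeof allowed_files / sizeof allowed_files[0]; i++) {
        if (strcmp(choice, allowed_files[i]) == 0)
            return allowed_files[i];
    }
    return NULL;
}
```

A NULL return corresponds to the else branch above: fall back to a default or report an error.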

CGI scripts also need to take precautions with plain text input. Consider a system in which users can enter plain text data into a form. Suppose there is a CGI script that handles the form input and saves it in a database verbatim, and another CGI script that retrieves this "plain text" from the database and displays it. The retrieval CGI script might have a section of code that looks like this:

 print "<html><title>Here is the text you entered</title><body>";
 print "$userdata\n";
 print "</body></html>";

If $userdata is something like "Hi Fred", then there is no problem. But suppose that when the form data was saved in the database, it contained something like:

 <!--#include file="/etc/passwd" -->

If server-side includes were turned on in the server, the page would display the contents of the password file.

There are all kinds of nasty variations on this theme. Something like the following could have been inserted to execute a command to attempt to delete all files on the server:

 <!--#exec cmd="cd /; rm -rf" -->

A hacker could even insert HTML designed to blend into the Web site being attacked, complete with a link to a rogue Web site where users might be prompted to enter credit card data for the hacker to steal.

To fix this, the CGI script that handles the input should check text input for "<" and ">" characters, which could be used to form HTML tags, and change those characters to something else, such as &lt; for < and &gt; for >. Additionally, if server-side includes are enabled, it may be worth turning them off if they are not necessary.
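A minimal sketch of that substitution in C follows. The function name html_escape is made up for this example, and escaping "&" and the double quote in addition to "<" and ">" is an extra step added here, not something the text above requires.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy `src` to `dst`, replacing <, >, & and " with
 * HTML entities. Returns 0 on success, -1 if the result does not fit in
 * `dstsize` bytes.
 */
int html_escape(const char *src, char *dst, size_t dstsize)
{
    size_t n = 0;

    if (dstsize == 0)
        return -1;
    for (; *src != '\0'; src++) {
        char one[2] = { *src, '\0' };   /* the character itself, as a string */
        const char *rep;
        size_t len;

        switch (*src) {
        case '<':  rep = "&lt;";   break;
        case '>':  rep = "&gt;";   break;
        case '&':  rep = "&amp;";  break;
        case '"':  rep = "&quot;"; break;
        default:   rep = one;      break;
        }
        len = strlen(rep);
        if (n + len + 1 > dstsize)
            return -1;                  /* no room for rep plus terminator */
        memcpy(dst + n, rep, len);
        n += len;
    }
    dst[n] = '\0';
    return 0;
}
```

Because an escaped string can be several times longer than the original, the output buffer has to be sized for the worst case, or the error return has to be handled.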

Buffer Overflows

A major source of vulnerabilities in C and other compiled languages has been incorrect assumptions about the size of input to the program. Here is an example in C showing how a buffer overflow could occur:

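 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 
 char query_string_copy [256];
 
 int main (int argc, char *argv [])
 {
     char *qs;
 
     qs = getenv ("QUERY_STRING");
 
     /* BUG: no check that qs is non-NULL or that it fits in 256 bytes. */
     strcpy (query_string_copy, qs);
 
     return 0;
 }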

This piece of code gets a pointer to the query string in the environment and makes its own copy of it. However, the buffer that is to receive the copy is only 256 bytes long. If the query string (including the null terminator) is more than 256 bytes long, strcpy will blindly do what it is told and scribble all over whatever comes after query_string_copy in memory.

The CGI program may merely crash in a situation like this. However, CGI scripts that are open source and that have bugs like these become easy for dedicated hackers to exploit. A classic form of exploitation of buffer overflows is for the hacker to discover a place in a CGI script where input is not properly length-checked. Then the hacker can design an input string that is intended to overflow the buffer and overlay something specific, such as a return address to the calling function. Once this has happened, the hacker has effectively hijacked the CGI script. The hacker could make the CGI script pass control to some code supplied by the hacker, which could then do just about anything (e.g., deleting files, opening up an xterm on the hacker's host, etc.).

Fixing this Vulnerability

When writing CGI code in C, always check the size of all input data and ensure that buffers are never overrun. Avoid the use of the following C library functions, which copy into a destination buffer and do not take a destination length argument or, on some systems, are themselves vulnerable to overflowing of internal buffers:

 gets (), strcpy (), strcat (), sprintf (), 
 fscanf (), scanf (), sscanf (), vsprintf (), 
 realpath (), getopt (), getpass (), streadd (), 
 strecpy (), strtrns ()
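One possible replacement for the strcpy () call in the example above is a copy routine that measures the input first and refuses anything that does not fit. The name copy_checked is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy `src` into `dst` only if it fits, terminator
 * included. Returns 0 on success, -1 if `src` is NULL or too long.
 */
int copy_checked(char *dst, size_t dstsize, const char *src)
{
    size_t len;

    if (src == NULL || dstsize == 0)
        return -1;
    len = strlen(src);
    if (len >= dstsize)
        return -1;              /* refuse rather than truncate or overflow */
    memcpy(dst, src, len + 1);  /* len + 1 copies the null terminator too */
    return 0;
}
```

The earlier example would then call copy_checked (query_string_copy, sizeof query_string_copy, getenv ("QUERY_STRING")) and treat a -1 return as a bad request. Rejecting oversized input outright is a design choice in this sketch; silently truncating it can create problems of its own.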

If you use an ANSI C compiler, use function prototypes to ensure that the types of the arguments passed to functions match what the functions expect. If you don't use prototypes, it's very easy to have a type mismatch and never know it.

Besides this, it is a matter of fixing compiler warnings, careful inspection, testing, and debugging.

Other Security Gotchas

Sometimes programs such as shells and interpreters that are designed to run other programs are located in places where they can be invoked by a request to the Web server. For example, in Windows environments, the Perl interpreter (PERL.EXE) may be located in the cgi-bin directory. This is extremely dangerous, because it allows anyone to run arbitrary commands on the server. Do not do it! No program that you don't want the whole world to be able to invoke should be in any directory that is defined to the Web server as a CGI directory.

Be careful with temporary files because they could disclose information about the CGI script, the configuration of the server, or confidential information about users. If a CGI script has to create temporary files, those files should be created with the most restrictive permissions possible. If no other users need to read or write to the file, don't give them permission to. If there is no need to have the file stay around after the CGI script is no longer running, make sure it gets deleted before the script terminates. If possible, create temporary files in directories that are readable and writable only by the user that the CGI script runs as.

Also, beware of temporary files that text editors and other development tools might leave in a CGI directory. A temporary file created by an editor and left in a CGI directory could enable hackers to run old versions of CGI scripts or get the source code.

Likewise, core files can also disclose information that could be useful to somebody trying to compromise a system. Maybe a hacker has found a way to make a CGI script core dump, and the hacker knows that the CGI script has some confidential information in variables. The hacker could feed the CGI script input to make it core dump, and then get a copy of the dump. If a CGI script is written in C, then when it is in production, use the setrlimit () system call to limit the size of the core file to 0.
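The setrlimit () call mentioned above might be used along these lines. This is a sketch assuming a POSIX system, and the function name disable_core_dumps is made up for the example.

```c
#include <sys/resource.h>

/*
 * Hypothetical helper: set both the soft and hard core-file size limits
 * to 0 so that no core file is written. Returns 0 on success.
 */
int disable_core_dumps(void)
{
    struct rlimit rl;

    rl.rlim_cur = 0;
    rl.rlim_max = 0;
    return setrlimit(RLIMIT_CORE, &rl);
}
```

Calling this early in main (), before any input is read, means a crash triggered by hostile input leaves nothing behind to collect.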

SUID and SGID CGI Scripts

In UNIX systems, there is a bit in the file permissions called SUID. When the SUID bit is set in a command's file permissions, the program runs with the permissions of the owner of the file, rather than the permissions of the user that started it. Likewise, there is a SGID bit in the file permissions that causes the file to run with the permissions of the group associated with the file. Typically, SUID is used when the script or program needs to be superuser (i.e., root). A well-behaved SUID program gives up its extra privileges as soon as possible.

It can be dangerous to have SUID or SGID CGI scripts, so their use should be avoided if at all possible. If it is necessary for a CGI script to do something with more privilege than the Web server, take these steps to limit the possible security exposure:

1. Do not just make it SUID root. Is there, or could there be, another account that has sufficient privileges but is not superuser? It is better not to run as root if not absolutely necessary.

2. Do not write a SUID CGI script in a shell scripting language (csh, ksh, etc.). There are too many possible security problems.

3. Make sure that the CGI script gives up its extra privileges except when it needs them, by setting its effective user ID to the real user ID.

Note that if Perl 5 is used, taint checking is automatically turned on when the script is SUID or SGID.
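For a SUID or SGID program written in C, step 3 above might look something like the following sketch. The function name is hypothetical. Note that on systems with saved user IDs the privilege can be regained later, which is what "except when it needs them" relies on; giving it up for good would call for setgid () and setuid () instead.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: set the effective group and user IDs back to the
 * real ones. The group is changed first, while the process still has the
 * privilege to do so. Returns 0 on success, -1 on failure.
 */
int drop_effective_privileges(void)
{
    if (setegid(getgid()) != 0)
        return -1;
    if (seteuid(getuid()) != 0)
        return -1;
    return 0;
}
```

The return value should always be checked: a program that carries on after a failed drop is still running with its extra privileges.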

Putting a CGI Script in Its Own Sandbox

In an environment in which there are multiple authors of CGI scripts (e.g., a server that is hosting multiple Web sites), it is sometimes advantageous to run CGI scripts as the user who is responsible for the CGI script, not as the Web server. This is done with a piece of software called a CGI wrapper.

A commonly used CGI wrapper is called CGIWrap, and it is available from http://www.umr.edu/~cgiwrap. CGIWrap is a SUID CGI script that executes other CGI scripts as the user who owns the file, rather than the Web server. It will run under just about any UNIX-based Web server. Typically, the Webmaster develops a policy that all users' CGI scripts must run under CGIWrap. Users put their CGI scripts in a directory under their home directories, and CGIWrap executes the scripts from there.

As an example of how CGIWrap might be used, suppose that a server runs two Web sites, one owned by user Bob and one owned by user Joe. Bob wants to have some CGI scripts, so he creates a directory called public_html/cgi-bin under his home directory /home/Bob. Joe puts the CGI scripts for his Web site in /home/Joe/public_html/cgi-bin. The executable for CGIWrap goes in the Web server's main cgi-bin directory and is SUID as root. CGIWrap runs all user scripts.

CGIWrap causes Bob's CGI scripts to run under the permissions of user Bob and Joe's scripts to run as user Joe. Bob's CGI scripts can, if carelessly coded, trash anything writable by Bob, just as Joe's CGI scripts can trash Joe's data. However, unless Bob has given Joe write permission to his files, Joe's CGI scripts cannot trash Bob's data.

There are other CGI wrappers besides CGIWrap. Another commonly used one is suEXEC, which comes with the Apache Web server. suEXEC operates on the same general principles as other CGI wrappers, but it is designed to take advantage of Apache's implementation and can only be used with Apache.

In UNIX systems, there is a facility called chroot, which gives a program its own root file system and prevents it from accessing anything outside of it. For example, if chroot was used to change a program's root file system to /home/Joe, and that program tried to open /etc/hosts, then it would actually open /home/Joe/etc/hosts. Once chroot is done, it cannot be undone. Any programs started by a program running in a chroot environment inherit the parent's chroot environment. In effect, the program is locked in a cage that it cannot break out of. This is a very good way of further restricting the potential damage that untrusted CGI scripts can do.

There is another CGI wrapper called sbox, from http://stein.cshl.org/software/sbox, that makes use of chroot to restrict the environment in which CGI scripts run. If Joe's CGI scripts are always started in a chroot environment with /home/Joe as the root file system, then it is impossible for Joe's CGI scripts to even attempt to access anything outside of /home/Joe.

However, using chroot to restrict CGI scripts can involve a lot of work. All files that are needed in order for the CGI scripts to run (e.g., shared libraries, the Perl interpreter, and various configuration files) must exist within the restricted area. This means that directories such as /usr, /tmp, /dev, /etc, and others will have to be created within the chroot environment. These directories will have to be populated with the subsets of the real directories' files that are needed to support the programs that run under the chroot environment.
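At the system-call level, a wrapper confining a script might do something like the sketch below. This is an illustration only, not how sbox is actually implemented, and the function name enter_jail is made up. The chroot () call normally requires superuser privilege.

```c
#include <unistd.h>

/*
 * Hypothetical helper: confine the calling process to the directory tree
 * under `dir`. Returns 0 on success, -1 on failure.
 */
int enter_jail(const char *dir)
{
    if (chdir(dir) != 0)        /* move inside the new root first */
        return -1;
    if (chroot(dir) != 0)
        return -1;
    return chdir("/");          /* "/" now refers to `dir` */
}
```

A wrapper would normally follow this by giving up root privileges before starting the script, because a process that is still running as root has ways of escaping a chroot environment.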

The use of CGI wrappers makes users accountable for the actions of their individual scripts, rather than leaving an amorphous mass of scripts, for which various users are responsible, all running as "nobody". CGI wrappers are not a security panacea, however. All of the nasty things that CGI scripts running as "nobody" can do can also be done by CGI scripts running as any other user. There are a lot of world-readable files on the typical UNIX system that you don't want anybody with a Web browser to access.

Developing a CGI Security Strategy

There are obviously many security issues that a Webmaster must consider, and high on the list should be the security issues associated with CGI scripts.

Taking Responsibility

The Webmaster must ensure that all CGI scripts placed on any Web server have been through a process to find and fix security holes. Some of the items that should be on the CGI script security checklist include:

Is all input parsed to ensure that the input is not going to make the CGI script do something unexpected? Is the CGI script eliminating or escaping shell metacharacters if the data is going to be passed to a subshell? Is all form input being checked to ensure that all values are legal? Is text input being examined for malicious HTML tags?
Is the CGI script starting subshells? If so, why? Is there a way to accomplish the same thing without starting a subshell?
Is the CGI script relying on possibly insecure environment variables such as PATH?
If the CGI script is written in C, or another language that doesn't support safe string and array handling, is there any case in which input could cause the CGI script to write past the end of a buffer or array?
If the CGI script is written in Perl, is taint checking being used?
Is the CGI script SUID or SGID? If so, does it really need to be? If it is running as the superuser, does it really need that much privilege? Could a less privileged user be set up? Does the CGI script give up its extra privileges when no longer needed?
Are there any programs or files in CGI directories that don't need to be there or should not be there, such as shells and interpreters?

Language Considerations

One very important thing to consider is what programming languages will be allowed for CGI scripts. Perl has the best security features, but it is an interpreted language and therefore inherently slower than a compiled language like C. If the job can be done quickly enough by a Perl CGI script, it's probably better to go with Perl; otherwise, use C. Never allow CGI scripts to be written in shell scripting languages. There are too many potential security problems.

Using Other People's Code

Using code that you have downloaded from the Internet is fine. In fact, it has its advantages. A CGI script that has been used previously by lots of other people probably has fewer bugs than one that somebody has just cooked up. However, you need to be cautious. When you get CGI scripts off the Internet, make sure you check on what bug fixes might be available. Use the most current, stable version. Read through the fix history to make sure you have the latest applicable security bug fixes. Downloaded code should be checked as rigorously as any script written in house. If it is written in a compiled language like C, do not just download a binary and install it, even if you've seen the source code. How do you know that the binary matches the source?

Conclusion

CGI security is a difficult and complex subject to tackle. There are many variables, involving the CGI script itself, its environment, the Web server, the operating system, and whatever input all the millions of users might throw at a CGI script. However, it is still extremely important to come to grips with CGI security. Not doing so could be disastrous.

Charles Walker is a computer consultant specializing in IP based protocols. Originally from the U.S., he currently lives in London. He can be contacted at: chw@trionetworks.com.

Larry Bennett is a networking consultant specializing in security and performance. He is based near London and can be contacted at: larry.bennett@trionetworks.com.