
An older Option in C for find

Larry Reznick

Why a C Program?

UNIX comes with so many utilities that a lot of work can be done by shell scripts that use those utilities -- and shell scripts can usually be put together faster than C programs. Sometimes, however, when the utility that does exactly what you want simply does not exist, you just have to write a C program.

A client of mine is putting together a system that will receive files from customers transmitted by uucp. These files will come in daily from all over the country and will contain various transaction records that need further processing locally. At least once a day, we want cron to wake up on my client's machine and move the files out of the uucppublic directory into a directory where the additional processing can be done. However, new files can come in literally at any time, including the time that the cron job wants to move the files away. The file currently being transmitted by uucp must not be moved away while it is still being uploaded, but all prior files will qualify.

My first thought was to use the find utility, knowing that it has a bunch of interesting options for qualifying files and then emitting the names of those that qualify. None of the timestamp comparisons (such as -atime) works with arguments of minutes, only days. find has a -newer option that allows a specific file's timestamp to be the base time, so that all files newer than that time will qualify. What we really needed was an -older option, since we didn't want to take the file being uploaded currently, but did want all files prior to that one. Using ! -newer might do the trick if I could touch a file with the appropriate timestamp. However, find has no -older option that would work with a specific amount of time in minutes, so I decided to write one.

I wanted this program to act as if I had given a find command that had this unsupported syntax:

find dirname -older num_minutes -print

where dirname was the name of the directory containing files I cared about, and num_minutes was a number, not a filename. If any file in that directory was older than the specified minutes, the pathname for that file would be printed out. Since most of the customer files would take only a couple of minutes maximum to transmit at 2,400 bps, and a few might take as long as 5 to 10 minutes, files older than 15 minutes would qualify. Anything more recent than that would be picked up next time around.

How older.c Works

older.c, shown in Listing 1, meets these requirements. To make it more generally useful, the program accepts a number of minutes and a set of directories on the command line. While this particular application would work with a 15-minute timeframe and only with a specific directory, I was sure we would find other uses that would have different requirements. If no time is given, the program defaults to 15 minutes, and if no directories are named, it defaults to the current directory.

Looking at Listing 1, start with the FEBDAYS() macro. This implements the rule that every year divisible by 4 is a leap year, except the century years, which must also be divisible by 400. The mod (%) operator gives the remainder after a division, so if the result of a mod operation is zero, the value is evenly divisible by the divisor. C implements short-circuiting for the Boolean logic operators and (&&) and or (||). This means that if the total truth can be determined by the first part of the operation, the second part is not evaluated. If the first part of an && operation is FALSE, the whole thing is FALSE. If the first part of an || operation is TRUE, the whole thing is TRUE. In the opposite cases, the second part must be evaluated to be certain. So, if the year is 2000, which is evenly divisible by 400, this routine will set February to 29 days without checking further. A year that is not a century year is recognized as a leap year if it is evenly divisible by 4.

Instead of using the % operator for the division by 4, the macro uses a shortcut: whenever the divisor is a power of 2 and the value is not negative, the binary AND operator (&) can be used with a value one less than the divisor (here, 3) to get the remainder. The quotient, if needed, can be delivered by using a shift to the right (>>) instead. The shift and binary operators are traditionally faster than the mod and division operators.
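Listing 1 is not reproduced here, so the following is only a minimal sketch of what a macro along the lines of FEBDAYS() might look like under the rule just described. The exact form in the listing may differ, and the helper names rem4() and quot4() are hypothetical, added only to show the power-of-2 shortcut. The sketch assumes a non-negative year.

```c
/* Hypothetical reconstruction of a FEBDAYS()-style macro; this is not
 * the author's listing. Assumes y is a non-negative int.
 * The % 400 test comes first, so a year such as 2000 short-circuits
 * the || and the remaining tests are never evaluated. */
#define FEBDAYS(y) \
    ((((y) % 400 == 0) || ((((y) & 3) == 0) && ((y) % 100 != 0))) ? 29 : 28)

/* For a non-negative n and a power-of-2 divisor (4 here),
 * n & 3 gives the same result as n % 4 ... */
static int rem4(int n)
{
    return n & 3;
}

/* ... and n >> 2 gives the same result as n / 4. */
static int quot4(int n)
{
    return n >> 2;
}
```

With these definitions, FEBDAYS(2000) and FEBDAYS(1996) evaluate to 29, while FEBDAYS(1900) and FEBDAYS(1995) evaluate to 28.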

The variable progname is made global for error handling. If the program reports an error, the program's name will be part of that error message. Since any of the functions might deliver an error message, but only main() can know the program name from the command line, using a global variable to hold the name eliminates the need to pass it around to all the functions as a parameter. So, the first thing that main() does is grab argv[0], the program's name, and put it into that variable. No matter how many other arguments will be given on the command line, even the wrong number, argv[0] will be present.

The next step in main() is to check the command line argument list. The total number of arguments on the command line is in argc, the arguments themselves being in argv[]. For this program, no arguments need be given, and there are no hyphenated options. So, if a hyphen is the first character of the first argument, the program takes it as a request for help and outputs a Usage message.

If no arguments other than the program's name are given, argc will be 1 and the default time of 15 minutes will be taken off the current system time. Otherwise, the command line has the number of minutes to be taken off. The program will presume that only an int capacity will be needed. (On most UNIX systems, an int is the same size as long, so this can be quite a number of minutes. If a long versus short is strictly required, one of them should be used, not int. While the int is theoretically the most efficient type for the system [this is a religious issue and different compiler writers will disagree for the same system], it is also the least portable type since the ANSI C Standard allows it to be the same size as short or long, or somewhere in between.) If an int is 16 bits, the maximum value of a signed int is 32,767 minutes, which amounts to over 22 days. If it is 32 bits, the maximum signed value is 2,147,483,647 minutes, which amounts to almost 4,083 years! Either way is sufficient for this program's needs.

If a minutes argument is given on the command line, directory names may also be given (directory names may not be given without a minutes argument). If no directories are named, the current directory is used. If directories are named, a loop runs through them one at a time. If an error occurs, the program quits the loop.
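The command-line rules described so far can be sketched roughly as follows. This is an illustration under assumptions, not the code from Listing 1: the function name parse_args(), its return convention, the DEFAULT_MINUTES constant, and the wording of the Usage message are all hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

#define DEFAULT_MINUTES 15      /* the default age named in the article */

const char *progname;           /* global so any function can report errors */

/* Hypothetical helper. Returns 0 on success, or -1 after printing a
 * Usage message when the first argument starts with a hyphen.
 * On success, *minutes holds the age threshold and *first_dir is the
 * index of the first directory argument (equal to argc if none given,
 * in which case the caller would fall back to the current directory). */
int parse_args(int argc, char *argv[], int *minutes, int *first_dir)
{
    progname = argv[0];         /* always present, whatever else is given */
    *minutes = DEFAULT_MINUTES;
    *first_dir = argc;

    if (argc > 1) {
        if (argv[1][0] == '-') {
            fprintf(stderr, "Usage: %s [minutes [directory ...]]\n",
                    progname);
            return -1;
        }
        *minutes = atoi(argv[1]);
        *first_dir = 2;         /* directories, if any, follow the minutes */
    }
    return 0;
}
```

A caller would loop from *first_dir up to argc, or use "." when the two are equal.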

The reduce_time() function takes the passed number of minutes off the current time. While the ANSI C Standard gave us a lot of flexibility in working with calendar and clock times, it did not provide date arithmetic functions. The closest it came to that was the difftime() function, which takes two time_t values and subtracts them, giving the difference as a type double.

It is extremely important to avoid the trap of assuming that time_t values count seconds. The ANSI C Standard requires only that time_t be an arithmetic type capable of representing times; it does not specify the encoding. While a count of seconds may be the case for many compilers, some might use a different unit or representation. A specific calendar date and time can be converted into a time_t by the mktime() function, but a specific amount of time cannot portably be added to or subtracted from a time_t value, since there is no guarantee that the time_t is a number of seconds. Moreover, even where compilers do deliver a time_t as a number of seconds elapsed from an epoch, you cannot assume that all will use the same epoch. difftime() allows you to handle these differences.

While it would be easy to multiply minutes by 60 and take the resulting seconds off the current time represented as a time_t value to get the starting time for the timestamp comparisons, to do so would risk making the code nonportable. The only truly portable solution is to go through the struct tm data type and muck around with the various parts of the calendar and clock.

Therefore, I take the current time() and plug that into the localtime() function, which translates the time_t value into calendar and clock information for the local timezone. The minutes and hours can be adjusted easily, and the days will be whatever is left over if enough minutes were given. I take the minutes left over after removing whole hours (0 to 59) and subtract those from the current time's minutes. A negative result means that the time crossed backward into the previous hour, so I add 60 back into the minutes and subtract one from the hour. I do the same thing with the hours, except this time a negative result means a cross back into the previous day.
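The borrowing described above might be sketched like this. The function name reduce_clock() and its interface are hypothetical, chosen for illustration; Listing 1 may organize the same arithmetic differently. The sketch assumes a non-negative number of minutes.

```c
#include <time.h>

/* Hypothetical helper, not the author's listing: subtract `minutes`
 * (assumed >= 0) from the clock fields of *t, borrowing from the hour
 * and the day as needed. Returns the number of whole days that still
 * have to be taken off the calendar date. */
long reduce_clock(struct tm *t, long minutes)
{
    long hours = minutes / 60;
    long days = hours / 24;

    t->tm_min -= (int)(minutes % 60);
    if (t->tm_min < 0) {        /* crossed back into the previous hour */
        t->tm_min += 60;
        t->tm_hour--;
    }

    t->tm_hour -= (int)(hours % 24);
    if (t->tm_hour < 0) {       /* crossed back into the previous day */
        t->tm_hour += 24;
        days++;
    }
    return days;
}
```

For example, starting from 00:05 and removing 15 minutes leaves the clock at 23:50 and reports one day still to be removed from the date.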

These steps may seem to be a lot of trouble, but if the current time is just a few minutes after midnight, the subtraction will have to deal with a day on the calendar. The real problem is in the lack of standard functions for doing date arithmetic. Maybe the ANSI committee will do something about this the next time around. While accounting requirements of, say, 30-, 60-, and 90-day aging or more are usually met by adding 1, 2, 3, or more to the month number rather than using a strict number of days, other applications might need to be more precise. The Standard does help in one respect: it does not restrict the members of the struct tm passed to mktime() to their normal ranges. Given more seconds, minutes, hours, days, or months than is reasonable, or even a negative value, a conforming mktime() converts the number to the correct calendar amount and hands back the adjusted time_t value with leap years, timezones, and so forth accounted for. reduce_time() does not rely on that behavior; it does the calendar arithmetic explicitly.

Once the problem has been reduced to a specific number of days by which the calendar should be adjusted, a loop is needed to work within the days of each month. Leap day fluctuation is accounted for by adjusting February's days (day[1]). If the number of days to be removed from the date is not less than the day of the month, I reduce that number of days by the day of the month; since this brings the calendar date back to the end of the previous month, I reduce the month number. If reducing the month number requires it, I reduce the year also and recalculate February's days. Regardless, I take the number of days in this new month as the day of the month and repeat the operation until the number of days to be taken off becomes less than the value of the day of the month. At this point, I take that number of days off, and the correct date of the adjusted month (in the adjusted year if needed) is delivered. By building this directly into the struct tm, the result can be passed directly to mktime(), and reduce_time() returns the time_t value that mktime() delivers.
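One way to read the loop just described is sketched below. It is an interpretation, not Listing 1: the function name reduce_date() is hypothetical, the FEBDAYS() definition is a guess at the macro's form, and the loop condition assumes the day of the month must stay at 1 or above.

```c
#include <time.h>

/* Hypothetical form of the leap-year macro; assumes a non-negative year. */
#define FEBDAYS(y) \
    ((((y) % 400 == 0) || ((((y) & 3) == 0) && ((y) % 100 != 0))) ? 29 : 28)

/* Hypothetical helper: move the calendar date in *t back by `days`
 * (assumed >= 0), one month at a time. tm_mon is 0 to 11 and tm_year
 * counts years since 1900, as in the standard struct tm. */
void reduce_date(struct tm *t, long days)
{
    int day[12] = { 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 };

    day[1] = FEBDAYS(t->tm_year + 1900);

    while (days >= t->tm_mday) {
        days -= t->tm_mday;             /* use up the rest of this month */
        if (--t->tm_mon < 0) {          /* backed up past January */
            t->tm_mon = 11;
            t->tm_year--;
            day[1] = FEBDAYS(t->tm_year + 1900);
        }
        t->tm_mday = day[t->tm_mon];    /* last day of the previous month */
    }
    t->tm_mday -= (int)days;            /* what is left fits in this month */
}
```

Taking one day off March 1, 2000 lands on February 29, 2000, and taking ten days off January 5, 2001 lands on December 26, 2000.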

Two interesting side effects result from this. First, since subtracting a negative number is equivalent to adding, a negative number of minutes will add minutes to the current time. While not useful for this particular application, since it works with file timestamps, this capability could be handy for other programs using this function. Second, since mktime() takes the timezone and Daylight Saving Time into consideration, the result will be plus or minus an hour depending on whether the new time has crossed over one of the DST boundary dates. This program is calculating an absolute time in minutes without regard to adjustment of clocks made at DST boundary dates, so the hour lost or subsequently recovered will show up in the difftime() between the starting time and the new reduced time. Nevertheless, the result is a correct absolute number of minutes prior to the current time.

One remaining issue about the reduce_time() function would be to eliminate its association with the current time. Instead of calculating the now variable from the time() function, you could pass it into reduce_time() as a parameter, also named now. With that, a specific number of minutes can be removed from (or added to by using a negative number of minutes) any time.
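A parameterized version along those lines might look like the sketch below. To keep it short, it does not repeat the month-by-month loop; instead it leans on the ANSI C rule that the struct tm members handed to mktime() are not restricted to their normal ranges, so a conforming library normalizes an out-of-range tm_min. The function name minutes_before() is hypothetical, and setting tm_isdst to -1 is a choice made here, not something taken from Listing 1.

```c
#include <time.h>

/* Hypothetical variant of reduce_time(): take `minutes` off the time
 * passed in as `now` (a negative value adds minutes instead).
 * Returns (time_t)-1 if the time cannot be converted. */
time_t minutes_before(time_t now, int minutes)
{
    struct tm *lt = localtime(&now);
    struct tm t;

    if (lt == NULL)
        return (time_t)-1;

    t = *lt;                /* copy: localtime() may reuse its buffer */
    t.tm_min -= minutes;    /* may now be far outside 0 to 59 */
    t.tm_isdst = -1;        /* let mktime() decide whether DST applies */
    return mktime(&t);      /* normalizes the fields and converts */
}
```

A call such as minutes_before(time(NULL), 15) would give the 15-minute cutoff used in this application. Near a DST boundary the difftime() between the two values can still differ from the requested minutes by an hour.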

Finding the Target Files

The show_files() function takes a directory name and a starting time. It comes up with every filename in the specified directory and checks each file's timestamp against the starting time. Reading the filenames from a directory is no more complicated than reading records from a sequential data file. The directory is opened with the opendir() function, the filenames are delivered with the readdir() function in a structure, and the directory is closed with the closedir() function.
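As a minimal sketch of that open, read, close pattern, the following function prints every name readdir() delivers and returns how many there were. The function name list_dir() is hypothetical and this is not show_files() itself; it only shows the directory-reading calls in isolation.

```c
#include <dirent.h>
#include <stdio.h>

/* Hypothetical helper: print each name in dirname, one per line.
 * Returns the number of names read, or -1 if the directory cannot
 * be opened. */
long list_dir(const char *dirname)
{
    DIR *dp = opendir(dirname);
    struct dirent *entry;
    long count = 0;

    if (dp == NULL) {
        perror(dirname);
        return -1;
    }
    while ((entry = readdir(dp)) != NULL) {   /* NULL marks the end */
        puts(entry->d_name);
        count++;
    }
    closedir(dp);
    return count;
}
```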

The function takes the given directory name, opens the directory, and copies the name into the pathname[] variable. A trailing slash is concatenated to it and the null terminator is replaced to make it a regular string again. Notice that strlen() is used to figure the subscript of the terminator. That information gets translated into a direct placing of the / character on top of the terminator without having to use strcat(), which would make yet another pass through the string to find that terminator. (The dirlen variable needs to be increased to represent the adding of that / character.) Since the length of the pathname part containing the directory's name is known, every filename within that directory can be appended to the same path, once the name is discovered. All you have to do is keep track of where the pathname part ends -- and that is what dirlen is for.
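The dirlen technique can be sketched as two small functions. The names start_path() and set_name() and the PATHSIZE constant are hypothetical, and the length checks are an addition made here for safety; they are not claimed to be in Listing 1.

```c
#include <string.h>

#define PATHSIZE 256    /* buffer size assumed for illustration */

/* Hypothetical helper: copy dirname into pathname[] and place a '/'
 * directly on top of the terminator, using the length strlen()
 * already found. Returns the new dirlen (including the slash),
 * or 0 if the directory name will not fit. */
size_t start_path(char pathname[PATHSIZE], const char *dirname)
{
    size_t dirlen = strlen(dirname);

    if (dirlen + 2 > PATHSIZE)      /* room for '/' and the terminator */
        return 0;
    strcpy(pathname, dirname);
    pathname[dirlen++] = '/';       /* overwrite the old terminator */
    pathname[dirlen] = '\0';        /* and put a new one after it */
    return dirlen;
}

/* Hypothetical helper: append a filename at the remembered position,
 * replacing whatever name was there before. Returns 0 on success,
 * -1 if the result would overflow the buffer. */
int set_name(char pathname[PATHSIZE], size_t dirlen, const char *name)
{
    if (dirlen + strlen(name) + 1 > PATHSIZE)
        return -1;
    strcpy(pathname + dirlen, name);
    return 0;
}
```

Calling start_path() once per directory and set_name() once per filename reuses the same buffer without rescanning the directory part of the string.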

The pathname[] variable is set to 256 characters, allowing the full pathname to be no more than 255 characters. Since BSD and SVR4 allow 255-character filenames, the path added to that would exceed this buffer, so this is not a particularly safe strategy. Still, this method should work with most pathnames, and serves to keep the example simple. A more robust solution would allocate the buffer from the heap and allow it to grow on demand. You might want to challenge yourself to rewrite it that way.

The readdir() loop checks the file's name to see if the first character is a dot (.). The readdir() function delivers every name, including the directory names . and .., and the hidden files beginning with a dot. The program should ignore such files, so, if the name does not begin with a period, the program appends it to the pathname[] variable and passes the result to the stat() function. This handy function reads the inode information for that file, delivering all sorts of useful facts about the file, including the time of the last modification (st_mtime). That time is a time_t type, so it can be plugged directly into the difftime() function.

The difftime() function delivers the difference between two time_t values in seconds, represented as a double type. Treating time as an increasing value, regardless of the form of that value, difftime() subtracts the second argument from the first. With start_time as the first argument and the file's modification time as the second, a result greater than zero means the file's modification timestamp is older than the start_time, and so the file's name is printed.
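The timestamp test might be sketched as follows. The function name is_older() and its return convention are hypothetical; the argument order to difftime() is the one implied by the description above, with the starting time first and the file's st_mtime second.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical helper: returns 1 if the file's last modification is
 * older than start_time, 0 if it is not, and -1 if stat() fails. */
int is_older(const char *path, time_t start_time)
{
    struct stat sb;

    if (stat(path, &sb) != 0)
        return -1;

    /* difftime(a, b) is a minus b in seconds, so a positive result
     * means the modification time precedes start_time. */
    return difftime(start_time, sb.st_mtime) > 0.0 ? 1 : 0;
}
```

In the readdir() loop, each assembled pathname would be passed to a test like this one, and the name printed when it returns 1.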

Conclusion

The older program emits full pathnames as the -print option might do in the find program. Since we have started using it in shell scripts, we have found additional uses for it. I hope you'll find it equally handy.

About the Author

Larry Reznick has been programming professionally since 1978. He is currently working on systems programming in UNIX and DOS. He teaches C language courses at American River College in Sacramento and is the owner of Rezolution Technical Books. He can be reached via email at: rezbook!reznick@csusac.ecs.csus.edu.