Reading Beyond a Bad Header with tar
 
Ben Reaves 
Introduction 
Recently a colleague returned to the US carrying an
Exabyte tape containing 
several hundred megabytes of software and data representing
over one 
year's work here. Less than halfway through the tape,
tar found 
an unreadable header and refused to read anything beyond
that. 
The files had already been deleted from the disk, but
I did have a 
copy of his Exabyte tape here. Sure enough, I got the
same error and 
couldn't read past it. I went to work on the problem
and came up with 
the software described in this article. We were able
to read all files 
on that tape. 
The Solution 
The solution consists basically of three steps: 
1. Find the bytes where the bad data appears on 
the tape; 
2. Read the tape, skipping those bytes; 
3. Run regular UNIX tar x on the result. 
The first step is done in a UNIX command like this: 
 
dd ibs=10240 if=/dev/rst0 | tartt > file 
 
Note that the ibs block size should be the same 
as the block size the tape was written with; for tar
the default 
is 10240 (or 20b). The second and third steps are done
in a UNIX pipe 
like this: 
 
dd ibs=10240 if=/dev/rst0 | passPart file | tar xvf - 
 
passPart takes as its argument the file produced in step
1. It is a simple filter that skips over the bad bytes. 
Step 1 is the heart of the solution: it performs the
function of tar 
t, but in addition to each file name, it lists the start
and end 
byte number of each file's information on the tape.
Also, if it finds 
a bad header, it reports the start and end byte of that
header, searches 
for the next valid header, and continues from there.
Because of its 
similarity to tar t, I call this program tartt. It is
shown in Listing 1. Figure 1 shows an
example of its
output. The passPart.c 
program is presented in Listing 2. I separate the functions
of tartt 
and passPart because tartt can be used by itself to
list the complete contents of a tape that has a bad
segment in it. 
I first describe tartt (with line numbers keyed to Listing 1)
and its output, then discuss passPart.c, which is a relatively
simple program. Throughout, a "tape" refers to one "tar file"
and a "file" refers to one individual file that was archived
on that tape: thus tar t takes one tape as input and gives a
list of files as output. 
tartt 
Lines 13-20 of tartt constitute the main program and
show the 
basic flow of this software. It searches for a valid
header and reads 
it, writes one line of information based on that header,
then skips 
over the number of blocks determined by that header.
findHeader() 
calls readHeader() repeatedly until a valid header is
found; 
writeInformation() simply writes a line of information
on stdout, 
and skipOverData() skips over the input data to the
place that 
the next valid header is expected to be. findHeader()
and readHeader() 
do most of the work of this software. 
Lines 26-41 are straight from the online man 5 tar documentation
-- they describe the layout of the data in a tar header,
which is 512 bytes long and appears before each file.
Thus, for example, 
a 1000-byte file takes 512x3 = 512+1000+24 = 1536 bytes
of 
tape: 512 for the tar header, 1000 for the file, and
24 to round up 
to the next multiple of 512. 
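The arithmetic above can be sketched in C as follows. This is a minimal illustration, not the code from Listing 1; the names TAR_BLOCK, data_blocks(), and bytes_on_tape() are hypothetical.

```c
#include <assert.h>

#define TAR_BLOCK 512L  /* size of a tar header and of each data block */

/* Number of 512-byte data blocks needed to hold a file of `size` bytes. */
long data_blocks(long size)
{
    assert(size >= 0);
    return (size + TAR_BLOCK - 1) / TAR_BLOCK;  /* round up */
}

/* Total bytes one archived file occupies: one header block plus padded data. */
long bytes_on_tape(long size)
{
    return TAR_BLOCK * (1 + data_blocks(size));
}
```

For the 1000-byte file in the example, data_blocks(1000) is 2 and bytes_on_tape(1000) is 1536.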
Lines 49-103 fill that structure with data from the
header. This section 
of code also does validity checking on the header, looking
for end-of-file, 
zero-length name, nonprintable or blank characters in
the filename, 
too long a filename, or an improperly zero-filled filename
field. 
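The filename checks listed above might look roughly like the following sketch. It is not taken from Listing 1: the function name name_field_ok() is hypothetical, the 100-byte field width comes from the usual tar header layout, and treating a name with no terminating NUL as "too long" is an assumption made here for simplicity.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define NAME_LEN 100  /* width of the name field in a tar header */

/* Return 1 if the name field looks plausible, 0 otherwise. */
int name_field_ok(const char name[NAME_LEN])
{
    size_t i, len = 0;

    assert(name != NULL);
    while (len < NAME_LEN && name[len] != '\0')
        len++;
    if (len == 0)          /* zero-length name */
        return 0;
    if (len == NAME_LEN)   /* no terminating NUL: treated here as too long */
        return 0;
    for (i = 0; i < len; i++)
        if (!isgraph((unsigned char)name[i]))  /* nonprintable or blank */
            return 0;
    for (i = len; i < NAME_LEN; i++)
        if (name[i] != '\0')                   /* field not zero-filled */
            return 0;
    return 1;
}
```

Rejecting blanks is stricter than tar itself requires, but it matches the checks described above and helps reject file data that merely happens to sit where a header is expected.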
Lines 105-136 check the validity of the checksum of
the header. This 
module was written more by experience and by looking
at valid tar 
headers than by looking at man 5 tar, where the information
was insufficient for writing this module. For example,
the checksum 
should be read by %7o, not %8o as the online documentation
implies (though does not clearly state).  
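As an illustration of the checksum test, here is a minimal sketch that is independent of Listing 1. It assumes the conventional tar header layout, with the 8-byte chksum field at offset 148, and the usual rule that the chksum field itself is counted as eight blanks when the sum is formed. Summing the bytes as unsigned values is also an assumption. The names compute_checksum() and checksum_ok() are hypothetical; the %7o conversion follows the observation above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define HDR_LEN     512  /* size of a tar header block */
#define CHKSUM_OFF  148  /* offset of the chksum field */
#define CHKSUM_LEN  8    /* width of the chksum field */

/* Sum all header bytes, counting the chksum field itself as blanks. */
unsigned int compute_checksum(const unsigned char hdr[HDR_LEN])
{
    unsigned int sum = 0;
    int i;

    assert(hdr != NULL);
    for (i = 0; i < HDR_LEN; i++) {
        if (i >= CHKSUM_OFF && i < CHKSUM_OFF + CHKSUM_LEN)
            sum += ' ';
        else
            sum += hdr[i];
    }
    return sum;
}

/* Return 1 if the stored octal checksum matches the computed one. */
int checksum_ok(const unsigned char hdr[HDR_LEN])
{
    char field[CHKSUM_LEN + 1];
    unsigned int stored;

    memcpy(field, hdr + CHKSUM_OFF, CHKSUM_LEN);
    field[CHKSUM_LEN] = '\0';
    if (sscanf(field, "%7o", &stored) != 1)  /* no octal digits found */
        return 0;
    return stored == compute_checksum(hdr);
}
```

A block of all zero bytes fails this test because its chksum field contains no octal digits, so the null headers at the end of an archive are not mistaken for valid ones.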
If no error is found, nBlocks keeps the value set in
readHeader(): the number of 512-byte blocks that the file
occupies on the tape, which determines where the next tar
header is expected. If an error is found, a line beginning
with the word "HEADER" is printed on stdout and a "continue"
statement is executed; this forces readHeader() to be called
repeatedly until a valid header is found. Only then does
findHeader() return. 
Lines 138-143 print a line on stdout consisting of three
items: the byte number where the valid header starts (counting
from 0 at the start of the tape), the byte number where the
next header should start, and the name of the file. 
Lines 145-154 simply skip the next nBlocks 512-byte-blocks
of input data, where nBlocks is the number of 512-byte
blocks 
that the file is expected to occupy on the tape, according
to the 
file's header on the tape. 
When tar t is run on a small example with a corrupted
header, 
the output is 
 
1 /h/ben/work/
2 /h/ben/work/fullmeeting.txt
3 /h/ben/work/nc
4 tar: directory checksum error (3370 != 3250) 
 
This means that there was a checksum error on the fourth
file, and there's no way of knowing what was past it.
When tartt 
is run on the same example, the output is as shown in
Figure 1. 
The first three lines show the same information as tar
t does, 
with byte numbers. The fourth line reports the checksum
error. At 
that point, tar t gave up, but tartt doesn't. 
In lines 5 and 6, tartt searches for a valid header,
and finally 
finds one at line 7, byte 39936. Thus, there is a bad
header and possibly 
a file from byte 38400 through 39935 of that tape. 
Lines 7, 8, and 9 show files that tar t completely missed.
In this example, that's only three files, but in my
colleague's case, 
it was thousands of files, hundreds of megabytes: months
of work to 
regenerate. 
Lines 10 and 11 show the two null headers that, according
to the tar 
specification, signify the end of the tape (the tar
archive 
file, to be specific). tartt ignores these, just in
case there 
might be some valid data past the null headers. It reads
until it 
can read no more: at the EOF marker on the tape, which
stops the reading 
at the device driver level. 
passPart.c 
The information generated by tartt and shown in Figure
1 suggests 
that if it were possible to skip over bytes 38400 through
39935, it 
should also be possible to run tar x to extract all
files from 
the tape with no problem. That is precisely what passPart.c,
in Listing 2, does: it looks at the output of tartt,
decides 
which parts of the corrupted tape to block and which
to pass, and 
passes them. 
passPart.c uses two input streams, one for reading the
tartt 
output and stdin for reading the corrupted tape; and
two output 
streams, one for tar x to read and stderr for debug
output. 
Lines 3-15 show the simple main program, which calls
two subroutines: 
one to read the file specified in argv[1] and make the
list 
of bytes to skip, and one to pass the appropriate bytes
from stdin 
to stdout. Lines 10-13 just print the skip list for debugging,
so that you can confirm the list was made properly (this simple
version does no error checking). Note that this output must be
directed to stderr, not stdout, because stdout carries the data
that tar x reads.  
The makeSkipList() module is written in readable, though
perhaps 
slow, style because the output of tartt is usually not
too 
long: a few thousand lines at most. Basically, it just
looks at the 
first letter of each output line to determine whether
it describes 
a good header or a bad header. From this information
it creates a 
list of integers corresponding to the byte numbers to
skip: start 
contains the first byte to skip, and end contains the
first 
byte to not skip. 
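One way to build such a list is sketched below. It is not the makeSkipList() of Listing 2, and it takes a slightly different route: rather than interpreting the bad-header lines, it ignores any line beginning with 'H' and records a skip range wherever a good header starts later than the previous good entry said the next header should start. The sketch assumes each good line has the form "start-byte next-header-byte name" in decimal, separated by whitespace; that exact format, the function name make_skip_list(), and its parameters are assumptions.

```c
#include <assert.h>
#include <stdio.h>

/* Read tartt-style lines from `fp` and fill start[]/end[] with the byte
 * ranges to skip. Returns the number of ranges found (at most `max`). */
int make_skip_list(FILE *fp, long start[], long end[], int max)
{
    char line[1024];
    long hdr_start, next_hdr;
    long expected = 0;   /* where the next good header should begin */
    int n = 0;

    assert(fp != NULL);
    while (fgets(line, sizeof line, fp) != NULL) {
        if (line[0] == 'H')      /* bad-header report: nothing to record */
            continue;
        if (sscanf(line, "%ld %ld", &hdr_start, &next_hdr) != 2)
            continue;            /* not in the assumed format */
        if (hdr_start > expected && n < max) {
            start[n] = expected;     /* first byte to skip */
            end[n] = hdr_start;      /* first byte to not skip */
            n++;
        }
        expected = next_hdr;
    }
    return n;
}
```

With the example discussed earlier, a good entry that ends at byte 38400 followed by a good entry that starts at byte 39936 yields a single range with start 38400 and end 39936.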
The passBytes() module is written to be fast, because
the amount 
of data it must process is typically hundreds of megabytes.
Its function 
is straightforward: it passes the stdin stream to stdout,
or blocks it, depending on the byte number from the
list of start 
and end points. 
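The filtering step can be sketched as follows. This is a hedged illustration rather than the passBytes() of Listing 2: the name pass_bytes(), its parameters, and the 10240-byte buffer are assumptions, and the ranges are assumed to be sorted and non-overlapping. As in the text, start[i] is the first byte to skip and end[i] is the first byte to not skip.

```c
#include <assert.h>
#include <stdio.h>

/* Copy `in` to `out`, dropping the bytes in [start[i], end[i]) for each of
 * the n ranges. Returns the number of bytes written. */
long pass_bytes(FILE *in, FILE *out,
                const long start[], const long end[], int n)
{
    static char buf[10240];
    long pos = 0;       /* tape offset of buf[0] */
    long written = 0;
    int r = 0;          /* index of the next range that may still apply */
    size_t got;

    assert(in != NULL && out != NULL);
    while ((got = fread(buf, 1, sizeof buf, in)) > 0) {
        long chunk_end = pos + (long)got;
        long cur = pos;         /* next tape offset to consider */

        while (cur < chunk_end) {
            long stop;

            while (r < n && end[r] <= cur)   /* ranges already behind us */
                r++;
            if (r < n && start[r] <= cur) {
                /* inside a skip range: jump to its end or the chunk end */
                cur = end[r] < chunk_end ? end[r] : chunk_end;
                continue;
            }
            /* pass bytes up to the next skip range or the chunk end */
            stop = (r < n && start[r] < chunk_end) ? start[r] : chunk_end;
            written += (long)fwrite(buf + (cur - pos), 1,
                                    (size_t)(stop - cur), out);
            cur = stop;
        }
        pos = chunk_end;
    }
    return written;
}
```

Working on whole buffers instead of single characters keeps the per-byte cost low, which matters when the input runs to hundreds of megabytes.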
Conclusion 
The software described here is a relatively simple set
of tools to 
recover all files from a tar-format tape that contains
bad 
headers. It does not, of course, catch all types of header
errors -- for example, if a number of bytes that is not a
multiple of 512 has been mistakenly inserted into or deleted
from the tape, these tools cannot recover the files that
follow.  
However, the code could be rewritten to do just that,
by having it 
read and verify a header based on a moving window of
length 512 bytes, 
shifting one byte for each iteration. This would be
extremely slow 
and, in my experience, this is usually unnecessary:
most errors are 
due to substitution, not deletion or insertion. A moving
window has 
its own problems: if the tape contains a file which
is itself a tar-format 
archive, the "moving window" algorithm will
become confused. 
And if it does hundreds of millions of comparisons,
the chance of 
a nonsensical header mistakenly passing the readHeader()
and 
findHeader() tests increases. 
Now what about the files that were skipped because of
their bad headers? 
I will leave that as an exercise for the reader.  
 
 About the Author
 
Ben Reaves received a BSEE degree from the University
of Southern 
California in 1981 and an MSEE from Stanford University
in 1983. He 
was a Research Engineer and System Administrator for
Speech Technology 
Laboratory in Santa Barbara, California from 1985 to
1987 and now 
works on location at Matsushita Electric Industrial
Company's Central 
Research Laboratory near Osaka, Japan. 
 
 