Recovering backups from differently formatted CD-ROMs

Over the past 20 years, I have built up a collection of backup CD-ROMs. There is quite large variation in their formats due to the different platforms I was using at various times and the development of the standards. I have plain ISO-9660 media, some using Unix Rock Ridge extensions, and others using Windows Joliet extensions. Most of my media are multi-session because I thought it might be useful to be able to add some data later on.

Getting data from these media isn't actually very difficult in itself. As long as there are no read errors, the file contents can be retrieved using any system.

However, the problem starts when one would like to restore correct file names, directory hierarchies and properties such as ownership, read/write/execute permissions or hard links.

Note for the future: even if it's slightly less convenient, it's actually a good idea to put archives on CDs (or backups in general)... much easier to read later on than some silly file system that you have to find the right mount options for.

Image creation

After a day of fiddling, here's a workflow that I settled on:

I use Harald Bögeholz's H2cdimage to read the contents of the CD into an .ISO file. This tool can be run on multiple machines using different drives which increases the chances of recovering data from defective media.

Unfortunately, H2cdimage doesn't create a .CUE file with the track layout information which appears to be needed to correctly mount multi-session CDs. I use IsoBuster which creates beautifully annotated .CUE files that also show the linear block address (LBA) of sessions which will be needed for mounting.

FILE "CD.iso" BINARY

REM ORIGINAL MEDIA-TYPE: CD

  REM SESSION 01        ; Not supported by other applications (*)
    TRACK 01 MODE2/2352
      INDEX 01 00:00:00
      REM MSF: 00:00:00 = LBA: 0

  REM RUN-OUT  18:12:70 ; Not supported by other applications (*)
  REM LEAD-OUT 18:12:72 ; Not supported by other applications (*)
  REM SESSION 02        ; Not supported by other applications (*)
    TRACK 02 MODE2/2352
      INDEX 01 20:44:72
      REM MSF: 20:44:72 = LBA: 93372

REM (*) SESSION directives are unfortunately not properly supported
REM     'out there'.  IsoBuster however supports them !

IsoBuster also provides »managed image files« which are based on the same idea of using different hardware to read as much data as possible from damaged discs, but I have not tried that.

File extraction

In order to extract the files, here are the things I've tried:

7zip

The simplest and most pleasant way is to just use

$ 7z x CD.iso
This works great for plain ISO9660 files and even deals very well with Joliet extensions, but when I tried it on a multi-session Rock Ridge disc it only saw the first session, so I'm not sure whether it would have dealt correctly with the RR_MOVED folder, for example.

IsoBuster

IsoBuster can extract data from any session and even allows specifying the filesystem character set which can be a problem for incorrectly burned media. For example, one of my ISO9660 discs uses the ancient DOS codepage 437, and one of my Rock Ridge discs uses ISO8859-1 filenames. I have no idea whether this is correct or not (since I burned it myself), but the names definitely have to be converted before being stored on a modern filesystem.

However, if there are any hard linked files on the medium, IsoBuster will extract them multiple times and the information which files are identical is lost in the process. Additionally, the resulting folder might be much bigger than the original CD image.

Also, IsoBuster is unable to place files that were moved to the RR_MOVED folder back into the correct place in the hierarchy. As a reminder, here's the deal with RR_MOVED:

If mkisofs is creating a filesystem image with Rock Ridge attributes and the directory nesting level of the source directory tree is too much for ISO-9660, mkisofs will do deep directory relocation. This results in a directory called RR_MOVED in the root directory of the CD. You cannot avoid this directory in the directory tree that is visible with ISO-9660 but it it automatically hidden in the Rock Ridge tree.

Lastly, since it runs on Windows IsoBuster cannot extract some files from Rock Ridge discs. For example, I have one file that contains a ':' colon (created by CVSup back in the day) which Windows does not allow in file names.

mount

The only way I found to correctly restore file hierarchies from Rock Ridge discs is to use Unix. You could even say that's fair enough since those discs were created on Unix in the first place, and contain Unix files.

But there's also an advantage over the 7z or IsoBuster for pure ISO9660 filesystems: filenames are presented in lower-case.

If you need to read later sessions of a multi-session disc it's not sufficient to just mount -o loop CD.iso /mnt/iso it. Apparently, there is information on the disc that allows mount to pick the latest session as advertised, but it's missing from the image file and mount itself does not understand .CUE files and needs a little help. This is where the LBA from IsoBuster comes into play:

# mount -o loop,sbsector=93372 CD.iso /mnt/iso

Now you can use

$ cd /mnt/iso
$ pax -r -w -p p . /.../dest
to copy the files to the new backup medium.

Unfortunately, Linux's mount only supports character set conversions via the iocharset option for Joliet extensions. If you're stuck with a Rock Ridge or ISO9660 disc in a different encoding, you're going to need a script to rename those files manually after copying

#!/bin/sh

# uncomment for debugging
#ECHO=echo

if [ $# -lt 2 ]
then
	cat >&2 <<EOF
usage: $0 from to <file> ...

	Convert file name encodings. Use "iconv -l" for supported encodings.
EOF
	exit 1
fi

from="$1"
shift
to="$1"
shift

for p in "$@"
do
	f="${p##*/}"
	d="${p%"$f"}"
	t=$(echo "$f" | iconv -f "$from" -t "$to")
	if [ "$f" != "$t" ]
	then
		echo "$d$f -> $t"
		$ECHO mv -i "$d$f" "$d$t"
	fi
done

This can be run on the entire tree using, for example

$ find . -depth -exec /.../fconv cp437 utf8 {} \+
The -depth primitive is important in case directories are renamed, after which subsequent commands won't find the remaining files to rename.

As a final step, don't forget to mark the extracted images and files read-only just to prevent stupid mistakes.

$ chmod -R a-w *

Removing duplicates

When working with backups it's common to get some duplication of files. To deal with this, I looked at the fslint suite (which includes a GUI), rmlint which is extremely complex, fdupes which didn't seem very encouraging, duff which seemed promising but was a bit slower than rdfind that I eventually settled on.

My requirement was to find duplicate files in two (or more) given directories and delete duplicates in all but one of them, which was supposed to remain untouched. That is, even if the first directory contained duplicates, I didn't want to remove them. While rmlint supports this out of the box using »tagging,« it wasn't available on my platform and so I wrote a small script to process the rdfind output.

#!/bin/sh

# uncomment for debugging
#ECHO=echo

if [ $# -lt 2 ]
then
	cat >&2 <<EOF
usage: $0 <dir> ...

	Find duplicate files in the named directories. Files from the first
	named directory are never deleted even if they are duplicates.
EOF
	exit 1
fi

rdfind -checksum sha1 -outputname dedup.$$ "$@"
cat dedup.$$ | \
	grep -v '^#' | \
	grep -v '^DUPTYPE_FIRST_OCCURRENCE' | \
	cut -d' ' -f8- | \
	grep -v "^$1/" | \
while read f
do
	$ECHO rm "$f"
done
rm dedup.$$

Comments