Recovering backups from differently formatted CD-ROMs

Over the past 20 years, I have built up a collection of backup CD-ROMs. Their formats vary considerably, owing to the different platforms I used at various times and to the evolution of the standards. I have plain ISO-9660 media, some using the Unix Rock Ridge extensions, and others using the Windows Joliet extensions. Most of my media are multi-session because I thought it might be useful to be able to add some data later on.

Getting data from these media isn't actually very difficult in itself. As long as there are no read errors, the file contents can be retrieved using any system.

However, the problem starts when one would like to restore correct file names, directory hierarchies and properties such as ownership, read/write/execute permissions or hard links.

Note for the future: even if it's slightly less convenient, it's actually a good idea to put archive files on CDs (or on backup media in general). They are much easier to read later on than some silly file system that you have to find the right mount options for.

Image creation

After a day of fiddling, here's a workflow that I settled on:

I use Harald Bögeholz's H2cdimage to read the contents of the CD into an .ISO file. This tool can be run on multiple machines using different drives which increases the chances of recovering data from defective media.

Unfortunately, H2cdimage doesn't create a .CUE file with the track layout information, which appears to be needed to mount multi-session CDs correctly. For that I use IsoBuster, which creates beautifully annotated .CUE files that also show the logical block address (LBA) of each session; the LBA will be needed for mounting.

FILE "CD.iso" BINARY

REM ORIGINAL MEDIA-TYPE: CD

  REM SESSION 01        ; Not supported by other applications (*)
    TRACK 01 MODE2/2352
      INDEX 01 00:00:00
      REM MSF: 00:00:00 = LBA: 0

  REM RUN-OUT  18:12:70 ; Not supported by other applications (*)
  REM LEAD-OUT 18:12:72 ; Not supported by other applications (*)
  REM SESSION 02        ; Not supported by other applications (*)
    TRACK 02 MODE2/2352
      INDEX 01 20:44:72
      REM MSF: 20:44:72 = LBA: 93372

REM (*) SESSION directives are unfortunately not properly supported
REM     'out there'.  IsoBuster however supports them !

IsoBuster also provides »managed image files« which are based on the same idea of using different hardware to read as much data as possible from damaged discs, but I have not tried that.

File extraction

In order to extract the files, here are the things I've tried:

7zip

The simplest and most pleasant way is to just use

$ 7z x CD.iso

This works great for plain ISO9660 images and even deals very well with Joliet extensions. But when I tried it on a multi-session Rock Ridge disc it only saw the first session, so I don't know whether it would have handled the RR_MOVED folder correctly, for example.

IsoBuster

IsoBuster can extract data from any session and even allows specifying the filesystem character set which can be a problem for incorrectly burned media. For example, one of my ISO9660 discs uses the ancient DOS codepage 437, and one of my Rock Ridge discs uses ISO8859-1 filenames. I have no idea whether this is correct or not (since I burned it myself), but the names definitely have to be converted before being stored on a modern filesystem.
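
What »converting the names« means at the byte level can be tried out with iconv. This is a small sketch, assuming an iconv that knows the CP437 and UTF-8 encodings (check with iconv -l). The helper name to_utf8 and the sample file name are mine, not taken from any of the discs.

```shell
#!/bin/sh
# Re-encode a string ($2) from the given source encoding ($1) into UTF-8.
to_utf8() {
	printf '%s' "$2" | iconv -f "$1" -t UTF-8
}

# Byte 0x84 (octal 204) is "ä" in DOS codepage 437 ...
name=$(printf 'B\204r.txt')
# ... and comes out as the two-byte UTF-8 sequence 0xC3 0xA4.
to_utf8 CP437 "$name"
echo
```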

However, if there are any hard-linked files on the medium, IsoBuster will extract them multiple times, and the information about which files are identical is lost in the process. As a result, the extracted folder can be much bigger than the original CD image.

Also, IsoBuster is unable to place files that were moved to the RR_MOVED folder back into the correct place in the hierarchy. As a reminder, here's the deal with RR_MOVED:

If mkisofs is creating a filesystem image with Rock Ridge attributes and the directory nesting level of the source directory tree is too much for ISO-9660, mkisofs will do deep directory relocation. This results in a directory called RR_MOVED in the root directory of the CD. You cannot avoid this directory in the directory tree that is visible with ISO-9660 but it is automatically hidden in the Rock Ridge tree.

Lastly, since it runs on Windows, IsoBuster cannot extract some files from Rock Ridge discs. For example, I have one file whose name contains a colon (created by CVSup back in the day), which Windows does not allow in file names.

mount

The only way I found to correctly restore file hierarchies from Rock Ridge discs is to use Unix. You could even say that's fair enough since those discs were created on Unix in the first place, and contain Unix files.

But there's also an advantage over 7z or IsoBuster for pure ISO9660 filesystems: filenames are presented in lower case.

If you need to read later sessions of a multi-session disc, it's not sufficient to just mount -o loop CD.iso /mnt/iso. Apparently, there is information on the disc that allows mount to pick the latest session as advertised, but it's missing from the image file. mount itself does not understand .CUE files and needs a little help. This is where the LBA from IsoBuster comes into play:

# mount -o loop,sbsector=93372 CD.iso /mnt/iso

Now you can use

$ cd /mnt/iso
$ pax -r -w -p p . /.../dest

to copy the files to the new backup medium.
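
Rather than copying the sector number by hand, it can be pulled out of the cue sheet. This sketch relies on the REM … LBA: comments in the format IsoBuster wrote above, which other tools will not necessarily emit. The file name CD.cue and the function name are assumptions.

```shell
#!/bin/sh
# Print the LBA of the last session listed in an IsoBuster cue sheet,
# taken from its "REM MSF: mm:ss:ff = LBA: n" comment lines.
# The trailing .* also swallows a CR if the file has Windows line endings.
last_session_lba() {
	sed -n 's/.*LBA: *\([0-9][0-9]*\).*/\1/p' "$1" | tail -n 1
}

# Usage (as root), assuming the cue sheet sits next to the image:
# mount -o loop,sbsector="$(last_session_lba CD.cue)" CD.iso /mnt/iso
```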

Unfortunately, Linux's mount only supports character set conversions via the iocharset option for Joliet extensions. If you're stuck with a Rock Ridge or ISO9660 disc in a different encoding, you're going to need a script to rename those files after copying:

#!/bin/sh

# uncomment for debugging
#ECHO=echo

if [ $# -lt 2 ]
then
	cat >&2 <<EOF
usage: $0 from to <file> ...

	Convert file name encodings. Use "iconv -l" for supported encodings.
EOF
	exit 1
fi

from="$1"
shift
to="$1"
shift

for p in "$@"
do
	f="${p##*/}"
	d="${p%"$f"}"
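	# printf avoids echo's backslash/option handling; skip names iconv cannot convert
	t=$(printf '%s' "$f" | iconv -f "$from" -t "$to") || continue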
	if [ "$f" != "$t" ]
	then
		echo "$d$f -> $t"
		$ECHO mv -i "$d$f" "$d$t"
	fi
done

This can be run on the entire tree using, for example

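$ find . -depth -exec /.../fconv cp437 utf8 {} \+

The -depth primary is important when directories get renamed: it makes find process a directory's contents before the directory itself. Without it, a directory could be renamed first, and subsequent commands would no longer find the remaining files under the old path.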
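
The effect of -depth is easy to see on a throwaway tree. The sketch below (directory and file names invented) lists a small hierarchy in the order in which find would hand the paths to the rename script.

```shell
#!/bin/sh
# Build a scratch tree and list it depth-first: every directory shows up
# only after everything inside it.
tmp=$(mktemp -d)
mkdir -p "$tmp/dir/sub"
: > "$tmp/dir/sub/file"
order=$(cd "$tmp" && find . -depth)
printf '%s\n' "$order"
rm -rf "$tmp"
```

Here ./dir/sub/file is listed before ./dir/sub, which in turn comes before ./dir, so a file has already been renamed by the time its parent directory gets a new name.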

As a final step, don't forget to mark the extracted images and files read-only just to prevent stupid mistakes.

$ chmod -R a-w *

Removing duplicates

When working with backups it's common to end up with duplicate files. To deal with this, I looked at several tools: the fslint suite (which includes a GUI); rmlint, which is extremely complex; fdupes, which didn't seem very encouraging; and duff, which seemed promising but was a bit slower than rdfind, which I eventually settled on.

My requirement was to find duplicate files in two (or more) given directories and delete duplicates in all but one of them, which was supposed to remain untouched. That is, even if the first directory contained duplicates, I didn't want to remove them. While rmlint supports this out of the box using »tagging,« it wasn't available on my platform and so I wrote a small script to process the rdfind output.

#!/bin/sh

# uncomment for debugging
#ECHO=echo

if [ $# -lt 2 ]
then
	cat >&2 <<EOF
usage: $0 <dir> ...

	Find duplicate files in the named directories. Files from the first
	named directory are never deleted even if they are duplicates.
EOF
	exit 1
fi

rdfind -checksum sha1 -outputname dedup.$$ "$@"
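# keep only duplicates (not first occurrences) that lie outside the first directory
grep -v '^#' dedup.$$ | \
	grep -v '^DUPTYPE_FIRST_OCCURRENCE' | \
	cut -d' ' -f8- | \
	grep -v "^$1/" | \
while IFS= read -r f	# IFS= and -r keep odd file names intact
do
	$ECHO rm "$f"
done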
rm dedup.$$
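
To see which lines of the rdfind listing survive the filter chain, it can be fed a hand-written sample. The listing below is made up, but as far as I can tell it follows the column layout that rdfind names in the header of its results file (duptype, id, depth, size, device, inode, priority, name). The function name deletable_duplicates is hypothetical.

```shell
#!/bin/sh
# Print the paths the script above would delete, given an rdfind result
# listing on stdin and the directory to protect as $1.
deletable_duplicates() {
	keep=$1
	grep -v '^#' |
	grep -v '^DUPTYPE_FIRST_OCCURRENCE' |
	cut -d' ' -f8- |
	grep -v "^$keep/"
}

# Made-up sample: one original, one copy inside the protected directory,
# one copy in another directory. Only the last one is printed.
deletable_duplicates keep <<'EOF'
# duptype id depth size device inode priority name
DUPTYPE_FIRST_OCCURRENCE 1 0 5 2049 101 1 keep/a.txt
DUPTYPE_WITHIN_SAME_TREE -1 0 5 2049 102 1 keep/copy of a.txt
DUPTYPE_OUTSIDE_TREE -1 0 5 2049 201 2 other/a.txt
EOF
```

Because cut takes everything from the eighth field onwards, names containing spaces come through in one piece.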

C++14 best practices

I just watched Herb Sutter's CppCon 2014 talk and felt the need to note down a couple of things for myself:

  • Call functions with raw references or pointers to entities that live at least as long as your scope.
  • Don't overuse pass-by-value, even for parameters you're going to copy unconditionally: a combination of copy and move assignment performs better in important cases, because a parameter that was copied unconditionally cannot take advantage of existing capacity at the final destination. Prefer
        void set_name(std::string const& name);
        void set_name(std::string&& name);
    
    to this
        void set_name(std::string name);
    
  • Assume you have a class hierarchy C0 <| C1 <| … and corresponding factory function templates. If you'd like to keep an arbitrary instance around for the duration of a scope while erasing its type, you might think this requires heap allocation. But that's not true: as long as the factory returns by value, a const reference to the base type will do just fine!
    {
        C0 const& c0 = cFactory(…);
        /* … */
    }
    The result of the factory is stored in a temporary and its lifetime is determined by the lifetime of the reference. The idea came from Andrei Alexandrescu's ScopeGuard.

Mercurial EolExtension is broken beyond repair...

I have an admittedly complex setup with EOL conversion because development is on a Windows host where some programs require CRLFs.

But even so, I've never had any trouble with git in similar circumstances, and this happens multiple times per week: at some point during MQ patch refreshment, EolExtension gets confused and reports »abort: inconsistent newline style in ...«. Subsequently, hg stat takes forever (once) and hg diff shows all files as added with Unix line endings, even though the actual physical files have Windows line endings just like before.

While trying to get all the files to show up as unmodified again, the following happens.

And the real fun part: because the files are actually still using the correct line endings on disk, hg revert is completely ineffective! How is that even possible?!

Because it's a really large repo, I don't really fancy going via hg up -C null since that takes forever. But in the end it was the only way to remedy the »situation«.

And don't even get me started on the observed fact that hg stat does caching – why else would it take so long only the first time after EolExtension screws up? I'll never use Mercurial on any project if I get a choice, and that's especially true on Windows where according to some it works so much better than git.

Doing some more investigating, it turns out that the problematic files show up with missing data in hg debugstate:

n   0         -1 unset               browser/components/places/tests/unit/test_txnGUIDs.js
n   0         -1 unset               browser/components/search/test/browser_415700.js
n   0         -1 unset               browser/components/sessionstore/test/browser/browser_346337_sample.html

This suggests that running hg debugrebuildstate -r tip might help to fix the problem. After this, hg status takes a long time again, but the correct file size still isn't set:

n 644         -1 1970-01-01 01:00:00 browser/components/places/tests/unit/test_txnGUIDs.js
n 644         -1 1970-01-01 01:00:00 browser/components/search/test/browser_415700.js
n 644         -1 1970-01-01 01:00:00 browser/components/sessionstore/test/browser/browser_346337_sample.html

After going through the lengthy hg co -C null and hg update -C tip exercise, the state of the above files looks like this:

n 644       4030 2016-07-16 02:48:59 browser/components/places/tests/unit/test_txnGUIDs.js
n 644       3729 2016-07-16 02:49:00 browser/components/search/test/browser_415700.js
n 644       1049 2016-07-16 02:49:00 browser/components/sessionstore/test/browser/browser_346337_sample.html
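
Judging from the listings above, a broken entry is one whose size column reads -1. A small filter makes it easier to check whether any such entries remain in a large repo. It assumes the column layout shown here (state, mode, size, then time and path), and the function name is mine.

```shell
#!/bin/sh
# Keep only those "hg debugstate" lines whose third column (size) is -1.
unset_size_entries() {
	awk '$3 == -1'
}

# Usage inside a repository:
# hg debugstate | unset_size_entries
```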

Reloading services and kexts using AppleScript

This weekend, I had to try to convince a slightly reluctant (no-name) Bluetooth mouse to work on an Apple laptop. It tends to stop reacting, especially after a sleep-wakeup cycle, and my theory is that it confuses OS X's Bluetooth driver to the point where the driver stops functioning.

As a small test, I decided to try to unload both the Bluetooth daemon blued and the IOBluetoothHIDDriver when the problem occurs and see if that fixes it. For this purpose, I came up with the following one-click experiment, my very first AppleScript.

set bluedController to serviceController for "com.apple.blued" given defaults:"/Library/Preferences/com.apple.Bluetooth", preference:"ControllerPowerState"
set driverController to kextController for "com.apple.driver.IOBluetoothHIDDriver"

restart({bluedController, driverController})

-- Shut down a list of services, waiting for each to stop, and restart them in reverse order
on restart(controllers)
	repeat with controller in controllers
		repeat
			tell controller
				shutdown()
				if not isRunning() then exit repeat
			end tell
			display dialog "Waiting for " & controller's name & " to stop" giving up after 1
		end repeat
	end repeat
	repeat with controller in reverse of controllers
		repeat
			tell controller
				start()
				if isRunning() then exit repeat
			end tell
			display dialog "Waiting for " & controller's name & " to start" giving up after 1
		end repeat
	end repeat
end restart

-- Return a script object that controls a launch service using its associated preference.
on serviceController for serviceLabel given defaults:prefsFile, preference:prefName
	script controller
		property name : "service " & last item of (wordList of serviceLabel at ".")
		on launchCtl(command)
			-- launchctl only knows about blued if run as admin
			do shell script "launchctl " & ¬
				quoted form of command & " " & ¬
				quoted form of serviceLabel ¬
				with administrator privileges
		end launchCtl
		on setPref(value)
			-- write the service's preference (0 = off, 1 = on)
			do shell script "defaults write " & ¬
				quoted form of prefsFile & " " & ¬
				quoted form of prefName & " " & ¬
				quoted form of (value as text) ¬
				with administrator privileges
		end setPref
		on isRunning()
			return launchCtl("list") contains "\"PID\" = "
		end isRunning
		on shutdown()
			setPref(0)
			launchCtl("stop")
		end shutdown
		on start()
			setPref(1)
			launchCtl("start")
		end start
	end script
	return controller
end serviceController

-- Return a script object that loads and unloads a kernel extension.
on kextController for bundleIdentifier
	script controller
		property name : "kernel extension " & last item of (wordList of bundleIdentifier at ".")
		on isRunning()
			do shell script "kextstat -l -b " & ¬
				quoted form of bundleIdentifier
			return the result contains bundleIdentifier
		end isRunning
		on shutdown()
			try
				do shell script "kextunload -b " & ¬
					quoted form of bundleIdentifier ¬
					with administrator privileges
			end try
		end shutdown
		on start()
			do shell script "kextload -b " & ¬
				quoted form of bundleIdentifier ¬
				with administrator privileges
		end start
	end script
	return controller
end kextController

-- Split a string at the given delimiters, restoring the previous delimiters afterwards.
on wordList of theWords at delimiters
	set oldDelimiters to text item delimiters
	set text item delimiters to delimiters
	try
		set res to text items of theWords
		set text item delimiters to oldDelimiters
		return res
	on error m number n
		set text item delimiters to oldDelimiters
		error m number n
	end try
end wordList