Comparing JPEGs

6:41pm, 11th May 2008

Sometimes hard disks break - this is why we have backups. Sometimes parts of hard disks become silently corrupted and you don’t find out until months or years later when you come to view an old photo and find the top 20% intact followed by a field of grey - this is why we have incremental backups.

To compare today’s version of a file to last year’s version you might do something like this:

$ sha256sum newcopy.jpg oldcopy.jpg
ce5d948cf9bfe9d3709cef57016b4e1c6db5391b9efca4c1ef958fe557d81e1b  newcopy.jpg
56bb613661ebb20ab212f748caff2023f80f9854ada17c7001af4a1f9d30b6aa  oldcopy.jpg

The different checksums reveal differences between the files that a comparison of file sizes might miss. But checksumming JPEGs is complicated by software that modifies the EXIF header. Digikam, for example, will change the image header whenever you correct the timestamp, do a rotation, add a comment or tag, or basically do any metadata operation. The image itself might never be altered, but the JPEG file can go through many iterations. If your incremental backups are perfect, you should still be able to find a backed-up copy of what today’s file should be, but unless you’re using some kind of wicked insane ZFS system, your last backup will have a non-zero age.
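If you have more than a couple of files to check, the checksum comparison above can be wrapped in a small shell function. This is only a rough sketch: same_bytes is a name I made up for illustration, and it assumes sha256sum is on your PATH. Like the manual check, it only tells you whether the whole files match, metadata included.

```shell
# Compare two files byte-for-byte via their SHA-256 digests.
# same_bytes is a hypothetical helper name, not part of any tool.
same_bytes() {
    for f in "$1" "$2"; do
        [ -r "$f" ] || { echo "cannot read: $f" >&2; return 2; }
    done
    # Read from stdin so the output holds the digest without the filename.
    sum_a=$(sha256sum < "$1" | cut -d' ' -f1)
    sum_b=$(sha256sum < "$2" | cut -d' ' -f1)
    if [ "$sum_a" = "$sum_b" ]; then
        echo "identical"
    else
        echo "different"
        return 1
    fi
}
```

Call it as same_bytes newcopy.jpg oldcopy.jpg. The exit status is 0 for a match, 1 for a mismatch, and 2 if a file can't be read.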

In my particular case, I suspected a photo had become corrupted, but couldn’t prove it: past versions from my rdiff-backup snapshots had different checksums at different points in time, reflecting metadata changes made by successive versions of Digikam and occasional tagging operations. Because the file had legitimately changed, a checksum mismatch alone couldn’t prove the difference was due to corruption.

jhead to the rescue!

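jhead is a JPEG metadata utility (and an Ubuntu package of the same name). Use it like this to strip all metadata out of a JPEG file, leaving only the pure image. Note that it rewrites the files in place, so run it on copies rather than on your originals or your backups: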

$ jhead -purejpg newcopy.jpg oldcopy.jpg
Modified: newcopy.jpg
Modified: oldcopy.jpg
$ sha256sum newcopy.jpg oldcopy.jpg
42f61a8e8d78153a41cc9c7fcb20055ed3e403c88366395a577bea64166218a9  newcopy.jpg
42f61a8e8d78153a41cc9c7fcb20055ed3e403c88366395a577bea64166218a9  oldcopy.jpg

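The matching checksums show that the image data in both copies is byte-for-byte identical, so every difference between the files was in the metadata. As it turned out, the image was not corrupted at all! Yay!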
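Since jhead -purejpg modifies the files you give it, here is a rough sketch of the same check done on throwaway copies, so the originals and backups stay untouched. The function name same_image and the temporary file names are my own inventions for illustration, and it assumes jhead, sha256sum and mktemp are all installed.

```shell
# Compare only the image data of two JPEGs, leaving the originals untouched.
# same_image is a hypothetical helper name; jhead and sha256sum are assumed
# to be installed and on the PATH.
same_image() {
    command -v jhead >/dev/null || { echo "jhead not found" >&2; return 2; }
    for f in "$1" "$2"; do
        [ -r "$f" ] || { echo "cannot read: $f" >&2; return 2; }
    done
    work=$(mktemp -d) || return 2
    # jhead -purejpg rewrites files in place, so strip copies instead.
    cp -- "$1" "$work/a.jpg" && cp -- "$2" "$work/b.jpg" \
        || { rm -rf "$work"; return 2; }
    jhead -purejpg "$work/a.jpg" "$work/b.jpg" >/dev/null \
        || { rm -rf "$work"; return 2; }
    # Hash the stripped copies; reading from stdin omits the filename.
    sum_a=$(sha256sum < "$work/a.jpg" | cut -d' ' -f1)
    sum_b=$(sha256sum < "$work/b.jpg" | cut -d' ' -f1)
    rm -rf "$work"
    if [ "$sum_a" = "$sum_b" ]; then
        echo "same image data"
    else
        echo "image data differs"
        return 1
    fi
}
```

Run it as same_image newcopy.jpg oldcopy.jpg. A mismatch here is a much stronger hint of real corruption than a mismatch on the raw files, because metadata edits have been taken out of the picture.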

