In computer forensics, it is not uncommon to come across a corrupt file. If you can repair the corruption, then you can access the data. Unfortunately, some file formats are more difficult to repair than others. For example, a plain text file may simply require identifying a splice around missing data. In contrast, graphic file formats (and audio and video, for that matter) can be extremely complicated.
One of the challenges from this year's DC3 Forensic Challenge concerned corrupt header reconstruction. The DC3 provided GIF, JPEG, and BMP files that each had some form of header corruption. The goal was to rebuild the header and create a valid image. Earlier this month, I was told by the DC3 that I had successfully recovered each file. In fact, they compared my recovered files with the original (pre-corruption) files and they turned out to be identical. My secret was not to find "a" header that worked. Rather, I recreated the headers based on the remainder of the file's data.
Back in 2005, I created a tool for analyzing image format structures. The program, imgana, dissects GIFs and JPEGs. Keep in mind, this program does not view pictures. Instead, it identifies the file structure, meta information, and abnormalities.
Generalized GIF Structure
The GIF format is relatively simple. The big parts are the header, global and optional local color tables (GCT and LCT), and the compressed data segments. The catch with a GIF is that a single corrupt byte can (and likely will) damage everything after it.
Knowing this, if you have a corrupt GIF header, then you just need to replace the header structure. Since the header defines the GCT size, that field must match the GCT you actually see in the file. The most difficult part is identifying the image size. However, if you can decode the compressed data, then you can identify likely sizes. For example, if it decodes 100 pixels, then the resolution is either 1x100, 2x50, 4x25, 5x20, 10x10, 20x5, 25x4, 50x2, or 100x1.
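The size guessing above is just a list of factor pairs. Here is a minimal sketch, assuming you already know how many pixels the image data decodes to; the function name `candidate_dimensions` is made up for illustration and is not part of imgana.

```python
import math

def candidate_dimensions(pixel_count):
    """Return every (width, height) pair whose product equals pixel_count."""
    # Collect the divisors up to the square root, then mirror them.
    small = [w for w in range(1, math.isqrt(pixel_count) + 1)
             if pixel_count % w == 0]
    widths = sorted(set(small + [pixel_count // w for w in small]))
    return [(w, pixel_count // w) for w in widths]

print(candidate_dimensions(100))
# [(1, 100), (2, 50), (4, 25), (5, 20), (10, 10), (20, 5), (25, 4), (50, 2), (100, 1)]
```

In practice most of these candidates can be thrown out by eye (a 1x100 image is rarely what you are looking for), which leaves a short list to try.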
With GIF, every segment is a different size. The header consists of 13 bytes. The header defines the size of the GCT, which comes next. Finally, there are the data blocks. Each block has a type, size, and actual data. For example, a type 0x21 subtype 0xfe is a text comment, and Netscape extensions use subtype 0xff. Following all of this are a graphics control block, image block, and image data. (And multiple blocks for animated GIFs.)
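To make the 13-byte header concrete, here is a rough sketch of reading one and of rebuilding one from a known image size and GCT size. The field layout follows my reading of the GIF89a layout (6-byte signature, little-endian width and height, a packed flags byte, background index, aspect ratio). The function names are hypothetical, and the choices of "GIF89a" and of a color-resolution field equal to the table-size field are assumptions for illustration, not something recovered from a real file.

```python
import struct

def parse_gif_header(data):
    """Decode the 13-byte GIF header and report the GCT size it declares."""
    if len(data) < 13:
        raise ValueError("need at least 13 bytes")
    signature, width, height, packed, background, aspect = struct.unpack(
        "<6sHHBBB", data[:13])
    if signature not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF signature")
    has_gct = bool(packed & 0x80)          # top bit: a GCT follows the header
    # The low three bits hold N, and the table has 2**(N+1) RGB entries.
    gct_entries = 2 ** ((packed & 0x07) + 1) if has_gct else 0
    return {"width": width, "height": height, "has_gct": has_gct,
            "gct_entries": gct_entries, "gct_bytes": 3 * gct_entries}

def build_gif_header(width, height, gct_entries, background=0):
    """Rebuild a header for a file whose GCT has gct_entries colors."""
    if gct_entries not in (2, 4, 8, 16, 32, 64, 128, 256):
        raise ValueError("GCT size must be a power of two from 2 to 256")
    size_field = gct_entries.bit_length() - 2   # 2 -> 0, 256 -> 7
    # Assumption: color resolution (bits 4-6) mirrors the table size field.
    packed = 0x80 | (size_field << 4) | size_field
    return struct.pack("<6sHHBBB", b"GIF89a", width, height,
                       packed, background, 0)

header = build_gif_header(10, 10, 256)
print(parse_gif_header(header)["gct_bytes"])   # 768
```

A rebuilt header is only a candidate: the width and height still have to agree with the amount of pixel data that actually decodes.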
Generalized SWF Format
GIF requires you to decode all of the different fields in order to check the structure. In contrast, the SWF format (used by Flash applications) is much simpler. It consists of a series of tag-length-data blocks. Even if you don't know what the tag means, you can still read the length and know how many data bytes are associated with the tag. Parsing an SWF and checking the structure is relatively simple.
With SWF, some of the data sets are further associated with subtag-sublength-subdata sets. The total size of the subsets should be equal to the full set size. (Macromedia wasn't checking this boundary condition and that led to some exploits back in December 2000.) However, even if you don't recognize the tag and subsets, you can still know how many bytes are in the outer tag.
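Walking the outer tags can be sketched in a few lines. This assumes the file header has already been skipped and the body is uncompressed, and it relies on my understanding of the published SWF tag header: a little-endian 16-bit value with the tag code in the upper 10 bits and the length in the lower 6, where a length of 0x3f means a 32-bit length follows. The function name and the sample tag codes 777 and 500 are invented for the example.

```python
import struct

def walk_swf_tags(tag_stream):
    """Yield (tag_code, length, payload) for each tag in a raw tag stream."""
    offset = 0
    while offset + 2 <= len(tag_stream):
        (code_and_length,) = struct.unpack_from("<H", tag_stream, offset)
        offset += 2
        tag_code = code_and_length >> 6
        length = code_and_length & 0x3F
        if length == 0x3F:                      # long form: real length follows
            if offset + 4 > len(tag_stream):
                raise ValueError("truncated long tag header")
            (length,) = struct.unpack_from("<I", tag_stream, offset)
            offset += 4
        if offset + length > len(tag_stream):   # the boundary check
            raise ValueError("tag %d runs past the end of the data" % tag_code)
        yield tag_code, length, tag_stream[offset:offset + length]
        offset += length
        if tag_code == 0:                       # End tag
            break

# A made-up stream: an unknown short tag, an unknown long tag,
# a zero-length tag (code 1), and the End tag.
stream = (struct.pack("<H", (777 << 6) | 3) + b"abc"
          + struct.pack("<HI", (500 << 6) | 0x3F, 70) + bytes(70)
          + struct.pack("<H", 1 << 6)
          + struct.pack("<H", 0))
print([(code, size) for code, size, _ in walk_swf_tags(stream)])
# [(777, 3), (500, 70), (1, 0), (0, 0)]
```

Notice that the walker never needs to know what tag 777 or 500 means. The same bounds check applied to the sub-blocks inside a payload is the boundary condition mentioned above.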
Generalized JPEG Format
Sadly, JPEG really looks like a format designed by committee (because it was). It is one of the most idiotic file formats I have ever come across.
What brings about this rant: Recently I have been rebuilding imgana to look for more structural information buried in JPEGs. There is a public program called Exiftool that extracts meta information from GIF, JPEG, and other formats. If you just need to view the meta information, then this is absolutely the best program I have come across. However, it doesn't do everything I need, so I originally built imgana and now I am revising my tool.
Here is the quick list of things I dislike about JPEGs:
- Endian. Most file formats either use big endian or little endian. In big endian, the highest-order byte comes first; little endian starts with the lowest-order byte. For JPEGs, the designers couldn't make up their minds. The image may specify big or little endian -- so the JPEG program must support both.
But even more bothersome: JPEG does not call it "big or little endian". They call it "MM" for Motorola (because Motorola processors use big endian), or "II" for Intel (because they usually use little endian). Why don't they also have "AA" for AMD or "HP" for Hewlett-Packard? Choosing the endian based on a company shows nothing more than a political influence.
Note to the JPEG committee: Next time you revise the format, just choose an endian. Either is fine, but please make a decision.
- Inconsistent blocking. JPEG tries to use a tag-length-data format. However, they don't stick with it. Most of the meta data uses tag-length-data, but then they forget it for the image data stream (which flows more like the GIF data without the blocking).
- Inconsistent subtypes. Within the JPEG file are optional application data areas. Some use big endian, while others allow selecting the endian. Yes: Even if the meta data uses Intel-endian, other application data may use Motorola-endian.
Similarly inconsistent, some meta data uses a tag-length-data structure, while others use fixed or dynamic data divisions. And even the different application data areas are not self-contained. For example, the Flashpix multi-part sections (denoted by "FPXR") work together and are not independent... even though they are listed in independent application sections.
- Overly complicated additions. With regards to Flashpix, Phil Harvey -- the author of Exiftool -- said it best:
The FlashPix file format, introduced in 1996, was developed by Kodak, Hewlett-Packard and Microsoft. Internally the FPX file structure mimics that of an old DOS disk with fixed-sized "sectors" (usually 512 bytes) and a "file allocation table" (FAT). No wonder the format never became popular.
There is no standardized format for application data. Every application type is totally different, and most are overly complex.
- Wasted data space. Alright, so meta data like "camera model" and "was a flash used?" is stored in tag-length-data blocks. But JPEG isn't even consistent with the data! If the data fits in 4 bytes, then the next 4 bytes contain the actual data. Otherwise the next 4 bytes contain an offset to the data.
Some meta data consists of 1-2 bytes of information. This means, there are 2-3 bytes of wasted space. And that offset value? It is from the start of the application data marker. This is fine if you are memory mapping the data, but horrible if you are streaming the information. I still don't understand why JPEG thought this was better than just using tag-length-data and actually having the data in the data area...
The confusing use of offsets is bound to cause problems. For example, the Canon EOS-1D Mark II digital camera is supposed to have a 'make' (tag 0x010f) of "Canon" and a 'model' (0x0110) of "Canon EOS-1D Mark II". However, the camera specifies the wrong offset. The make is "Canon " (the first part of the model's string and not null-terminated), model is "EOS-1D Mark II" (with 6 bytes of crap after it), and there is an unused string "Canon" in the data area. (Here is an example camera image. The offset data contains "Canon\0Canon EOS-1D Mark II\0". The first null-terminated "Canon" string is unused; the make and model are both wrong by six bytes.)
Again, "Dear JPEG Committee", next time either always use offsets or never use offsets. Don't mix and match. "Consistent" is better than "better".
Oh, and those data lengths? They count the two length bytes themselves (the thing you already read), so be sure to subtract 2 bytes to get the real data length. Considering the number of unused bytes in a JPEG, this is an ideal format for hiding steganographic information -- without disturbing the image itself.
- Multiple images. Finally, a JPEG file may contain multiple JPEG images. One is the big file that you see, another is an optional thumbnail, and then different application data sections may include their own thumbnail or preview images. Moreover, each of those sub-images could include other sub-images. Want to really hide data? Place an image in a JPEG in a JPEG in a JPEG. I don't know any tool (besides my own) that can easily extract this. (And yes, I have already come across a few examples.)
Of course, I could understand it if every image were a different size. However, I have observed that most preview and thumbnail images are the same size -- the exact same data. Since there is no data reuse between application sections, a single JPEG may (and usually does) contain multiple copies of the same image.
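The inconsistent blocking from the list above shows up as soon as you try to walk a JPEG. Here is a minimal sketch, assuming the usual marker layout: 0xFF plus a marker byte, then a big-endian 16-bit length that includes its own two bytes, except for a handful of standalone markers that have no length at all. The walk has to stop at the start-of-scan segment because the image data after it carries no length. The function name and the tiny sample are invented, and this is not how imgana does it.

```python
import struct

# Markers with no length field: TEM, RST0-RST7, SOI, EOI.
STANDALONE = {0x01, 0xD8, 0xD9} | set(range(0xD0, 0xD8))

def walk_jpeg_segments(data):
    """Return ([(marker, payload_length), ...], offset_of_scan_data)."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker")
    segments = [(0xD8, 0)]
    offset = 2
    while offset + 2 <= len(data):
        if data[offset] != 0xFF:
            raise ValueError("expected a marker at offset %d" % offset)
        marker = data[offset + 1]
        if marker == 0xFF:                  # padding byte before a marker
            offset += 1
            continue
        if marker in STANDALONE:
            segments.append((marker, 0))
            offset += 2
            if marker == 0xD9:              # EOI
                break
            continue
        if offset + 4 > len(data):
            raise ValueError("truncated segment header")
        (length,) = struct.unpack_from(">H", data, offset + 2)
        if length < 2 or offset + 2 + length > len(data):
            raise ValueError("bad length for marker 0x%02x" % marker)
        segments.append((marker, length - 2))   # drop the 2 length bytes
        offset += 2 + length
        if marker == 0xDA:                  # SOS: unblocked scan data follows
            break
    return segments, offset

# Made-up sample: SOI, a 4-byte APP1 payload, a 1-byte SOS header, scan bytes.
sample = (b"\xff\xd8"
          + b"\xff\xe1" + struct.pack(">H", 2 + 4) + b"Exif"
          + b"\xff\xda" + struct.pack(">H", 2 + 1) + b"\x00"
          + b"\x12\x34\xff\xd9")
segments, scan_start = walk_jpeg_segments(sample)
print(segments, scan_start)
# [(216, 0), (225, 4), (218, 1)] 15
```

Everything from `scan_start` onward has to be scanned byte by byte to find the next marker, which is exactly the part that does not follow tag-length-data.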
The JPEG file format uses inconsistent internal structures and confusing offsets, and it lacks basic decisions such as endianness. There is also a significant amount of wasted space -- either from unused bytes or information redundancy.
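The endian and offset complaints can both be seen in one short sketch of reading the meta data directory. It assumes the TIFF-style layout as I understand it: a 2-byte "II" or "MM" byte-order mark, the number 42, an offset to the directory, and then 12-byte entries (tag, type, count, and a 4-byte field that is either the value or an offset). Offsets are taken here as relative to the byte-order mark. The function names, the type-size table, and the sample data are all illustrative assumptions.

```python
import struct

# Bytes per value for the common field types (BYTE, ASCII, SHORT, LONG, ...).
TYPE_SIZES = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8, 7: 1, 9: 4, 10: 8}

def read_ifd_entries(tiff):
    """Return [(tag, stored_inline, raw_bytes), ...] for the first directory."""
    order = {b"II": "<", b"MM": ">"}.get(tiff[:2])   # Intel or Motorola
    if order is None:
        raise ValueError("unknown byte-order mark")
    magic, ifd_offset = struct.unpack_from(order + "HI", tiff, 2)
    if magic != 42:
        raise ValueError("bad TIFF magic number")
    (count,) = struct.unpack_from(order + "H", tiff, ifd_offset)
    entries = []
    for i in range(count):
        entry = ifd_offset + 2 + 12 * i
        tag, kind, n = struct.unpack_from(order + "HHI", tiff, entry)
        size = TYPE_SIZES.get(kind, 1) * n
        inline = size <= 4
        if inline:                       # value lives in the 4-byte field
            start = entry + 8
        else:                            # field holds an offset to the value
            (start,) = struct.unpack_from(order + "I", tiff, entry + 8)
        entries.append((tag, inline, tiff[start:start + size]))
    return entries

def build_sample(order):
    """Made-up block: 'make' stored by offset, 'orientation' stored inline."""
    mark = b"II" if order == "<" else b"MM"
    make = struct.pack(order + "HHII", 0x010F, 2, 6, 38)
    orient = struct.pack(order + "HHIH", 0x0112, 3, 1, 1) + b"\x00\x00"
    return (mark + struct.pack(order + "HI", 42, 8)
            + struct.pack(order + "H", 2) + make + orient
            + struct.pack(order + "I", 0) + b"Canon\x00")

for order in ("<", ">"):
    print(read_ifd_entries(build_sample(order)))
```

The 2-byte orientation value sits in a 4-byte field with two bytes left over, and the same tag decodes differently depending on the byte-order mark, so a reader has to carry the `order` prefix through every single field.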
Keep in mind, none of these issues are associated with the actual compression and image storage algorithms. GIF's use of dynamic tables is very creative and efficient, and JPEG's frequency-based compression is just cool. However, in hindsight the algorithm information could have been stored in less complex and more efficient formats.
Having said all of that, I cannot wait to look at PNGs...
(if memory serves, PNG is relatively simple)