Identifying Image Format from the First Few “Magic” Bytes in C++
All popular image file formats (JPEG, PNG, GIF, etc.) can be identified from the first few bytes in the file. This is a good thing, because you cannot always trust file name extensions to be correct, and because images these days are often transferred in other ways – via HTTP or embedded in other documents – where the image data may not have a file name.
A function to identify the most common formats is easy to write. First let's define an enumeration for all the file types we will support:
enum ImageFileType
{
    IMAGE_FILE_JPG,      // joint photographic experts group - .jpeg or .jpg
    IMAGE_FILE_PNG,      // portable network graphics
    IMAGE_FILE_GIF,      // graphics interchange format
    IMAGE_FILE_TIFF,     // tagged image file format
    IMAGE_FILE_BMP,      // Microsoft bitmap format
    IMAGE_FILE_WEBP,     // Google WebP format, a type of .riff file
    IMAGE_FILE_ICO,      // Microsoft icon format
    IMAGE_FILE_INVALID,  // unidentified image types.
};
And now the image type detection function:
#include <cstdint>
#include <cstring>

// u8 and u32 are assumed to be the usual fixed-width aliases.
typedef std::uint8_t  u8;
typedef std::uint32_t u32;

ImageFileType getImageTypeByMagic( const u8* data, u32 len )
{
    if ( len < 16 ) return IMAGE_FILE_INVALID;

    // .jpg:  FF D8 FF
    // .png:  89 50 4E 47 0D 0A 1A 0A
    // .gif:  GIF87a
    //        GIF89a
    // .tiff: 49 49 2A 00
    //        4D 4D 00 2A
    // .bmp:  BM
    // .webp: RIFF ???? WEBP
    // .ico:  00 00 01 00
    //        00 00 02 00 ( cursor files )

    // memcmp, not strncmp: several signatures contain 00 bytes, and
    // strncmp stops comparing at the first NUL it finds.
    switch ( data[0] )
    {
        case 0xFF:
            return ( !std::memcmp( data, "\xFF\xD8\xFF", 3 )) ?
                IMAGE_FILE_JPG : IMAGE_FILE_INVALID;

        case 0x89:
            return ( !std::memcmp( data, "\x89\x50\x4E\x47\x0D\x0A\x1A\x0A", 8 )) ?
                IMAGE_FILE_PNG : IMAGE_FILE_INVALID;

        case 'G':
            return ( !std::memcmp( data, "GIF87a", 6 ) ||
                     !std::memcmp( data, "GIF89a", 6 ) ) ?
                IMAGE_FILE_GIF : IMAGE_FILE_INVALID;

        case 'I':
            return ( !std::memcmp( data, "\x49\x49\x2A\x00", 4 )) ?
                IMAGE_FILE_TIFF : IMAGE_FILE_INVALID;

        case 'M':
            return ( !std::memcmp( data, "\x4D\x4D\x00\x2A", 4 )) ?
                IMAGE_FILE_TIFF : IMAGE_FILE_INVALID;

        case 'B':
            return ( data[1] == 'M' ) ?
                IMAGE_FILE_BMP : IMAGE_FILE_INVALID;

        case 'R':
            if ( std::memcmp( data, "RIFF", 4 ))
                return IMAGE_FILE_INVALID;
            if ( std::memcmp( data + 8, "WEBP", 4 ))
                return IMAGE_FILE_INVALID;
            return IMAGE_FILE_WEBP;

        case '\0':
            if ( !std::memcmp( data, "\x00\x00\x01\x00", 4 ))
                return IMAGE_FILE_ICO;
            if ( !std::memcmp( data, "\x00\x00\x02\x00", 4 ))
                return IMAGE_FILE_ICO;
            return IMAGE_FILE_INVALID;

        default:
            return IMAGE_FILE_INVALID;
    }
}
JPEG
Like a lot of digital image formats, JPEG consists of a container format (JFIF), and then a codec format (JPEG proper). In theory the JFIF container can hold images encoded with other codecs.
All JFIF containers start with these three bytes:
FF D8 FF
In practice it is enough to detect just those. If you want to be more stringent and also check the identifier of the first application segment, you can match one of the following, where ?? can be any value:
FF D8 FF E0 ?? ?? 4A 46 49 46 00 // "JFIF"
FF D8 FF E1 ?? ?? 45 78 69 66 00 // "Exif"
Other sequences after FF D8 FF are possible too, since the fourth byte only names the segment that follows. Many of these come from digital cameras and other proprietary software, so the stricter test can reject files that still decode as JPEG.
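The stricter test can be sketched as follows. The enum and function names are hypothetical. The sketch assumes that an E0 segment carries the identifier "JFIF" and an E1 segment carries "Exif", each followed by a 00 byte, and that the two ?? bytes are the segment length.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical result type for the stricter JPEG check.
enum JpegFlavor { JPEG_NOT_JPEG, JPEG_JFIF, JPEG_EXIF, JPEG_OTHER };

JpegFlavor classifyJpegHeader(const unsigned char* data, std::size_t len)
{
    // The loose test: FF D8 FF.
    if (len < 3 || data[0] != 0xFF || data[1] != 0xD8 || data[2] != 0xFF)
        return JPEG_NOT_JPEG;

    // Too short to inspect the identifier; report it as unclassified.
    if (len < 11)
        return JPEG_OTHER;

    // Bytes 4-5 are the ?? ?? length bytes; the identifier starts at offset 6.
    if (data[3] == 0xE0 && std::memcmp(data + 6, "JFIF\0", 5) == 0)
        return JPEG_JFIF;
    if (data[3] == 0xE1 && std::memcmp(data + 6, "Exif\0", 5) == 0)
        return JPEG_EXIF;

    return JPEG_OTHER;
}
```

Returning a separate JPEG_OTHER value, rather than rejecting the file, leaves it to the caller to decide how strict to be.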
PNG – Portable Network Graphics
The PNG specification simply lists the following 8 bytes as the file signature:
89 50 4E 47 0D 0A 1A 0A
Bytes 1-3 are the string “PNG”, followed by a CR-LF sequence, a control-Z character, and a final LF.
GIF – CompuServe Graphics Interchange Format
These files simply start with one of two identifying strings: “GIF87a” or “GIF89a”. Both formats are in common use. GIF87a was the original format. GIF89a is an improved format that adds animation and transparency.
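Since the two signatures differ only in the version digits, a caller that cares about the version can get it from the same comparison. A minimal sketch, with gifVersion as a hypothetical name:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: returns 87 or 89 for the two GIF signatures,
// or 0 when the data does not start with either of them.
int gifVersion(const unsigned char* data, std::size_t len)
{
    if (len < 6)
        return 0;
    if (std::memcmp(data, "GIF87a", 6) == 0)
        return 87;
    if (std::memcmp(data, "GIF89a", 6) == 0)
        return 89;
    return 0;
}
```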
TIFF – Tag Image File Format
TIFF is one of those formats that consists of a container that can hold one or more images stored using some other encoding. TIFF images can even contain other image formats – like JPEG.
There is a “little-endian” and a “big-endian” version of the format with different signatures.
To detect if a file is a TIFF container, check the first 4 bytes:
49 49 2A 00 // little endian
4D 4D 00 2A // big endian
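The two signatures are really the same header in two byte orders: “II” or “MM” names the byte order, and the next two bytes are the number 42 stored as a 16-bit integer in that order. The sketch below decodes it that way; the enum and function names are hypothetical.

```cpp
#include <cstddef>

// Hypothetical result type: which byte order the TIFF header declares.
enum TiffOrder { TIFF_ORDER_NONE, TIFF_ORDER_LITTLE, TIFF_ORDER_BIG };

TiffOrder tiffByteOrder(const unsigned char* data, std::size_t len)
{
    if (len < 4)
        return TIFF_ORDER_NONE;

    if (data[0] == 'I' && data[1] == 'I') {
        // Little endian: low byte first, so 2A 00 decodes to 42.
        const unsigned magic = data[2] | (static_cast<unsigned>(data[3]) << 8);
        return magic == 42 ? TIFF_ORDER_LITTLE : TIFF_ORDER_NONE;
    }
    if (data[0] == 'M' && data[1] == 'M') {
        // Big endian: high byte first, so 00 2A decodes to 42.
        const unsigned magic = (static_cast<unsigned>(data[2]) << 8) | data[3];
        return magic == 42 ? TIFF_ORDER_BIG : TIFF_ORDER_NONE;
    }
    return TIFF_ORDER_NONE;
}
```

Knowing the byte order is useful beyond detection, because every multi-byte field later in the file has to be read in the same order.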
BMP
A .bmp file starts with the two bytes “BM” (42 4D).
WebP
WebP files are technically RIFF files. RIFF is a container format like TIFF. A WebP file is a RIFF file whose form type, stored right after the 4-byte size field, is “WEBP”.
To detect it, first check that the first 4 bytes are “RIFF”, and then that bytes 8-11 are “WEBP”. Bytes 4-7 hold a size and vary from file to file.
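The check can be sketched on its own as below. In a RIFF header, bytes 4-7 are a little-endian 32-bit size counting everything after those first 8 bytes, so the sketch returns it as well. The function name and the optional output parameter are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns true when the data starts with a RIFF header
// whose form type is WEBP. If riffSize is not null, it receives the size
// field from bytes 4-7, decoded as a little-endian 32-bit integer.
bool parseWebpHeader(const unsigned char* data, std::size_t len, std::uint32_t* riffSize)
{
    if (len < 12)
        return false;
    if (std::memcmp(data, "RIFF", 4) != 0)
        return false;
    if (std::memcmp(data + 8, "WEBP", 4) != 0)
        return false;

    if (riffSize) {
        *riffSize = static_cast<std::uint32_t>(data[4])
                  | (static_cast<std::uint32_t>(data[5]) << 8)
                  | (static_cast<std::uint32_t>(data[6]) << 16)
                  | (static_cast<std::uint32_t>(data[7]) << 24);
    }
    return true;
}
```

Checking the form type matters because other formats, such as WAV and AVI, also start with “RIFF”.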
ICO
The ICO format was designed by Microsoft to contain icons and cursors. It has two variants: icon images (.ico) and cursor images (.cur). Except for the header, both formats are identical.
The file signatures are:
00 00 01 00 // .ico format
00 00 02 00 // .cur format
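Both signatures contain 00 bytes, so they have to be compared byte by byte or with memcmp; string functions such as strncmp stop at the first NUL. The sketch below reads the header as three little-endian 16-bit fields (reserved, type, image count), which is the usual description of the ICO directory header. The names are hypothetical.

```cpp
#include <cstddef>

// Hypothetical result type distinguishing the two variants.
enum IconKind { ICON_KIND_NONE, ICON_KIND_ICO, ICON_KIND_CUR };

IconKind classifyIconHeader(const unsigned char* data, std::size_t len, unsigned* imageCount)
{
    if (len < 6)
        return ICON_KIND_NONE;

    // Bytes 0-1: reserved, always 00 00.
    if (data[0] != 0x00 || data[1] != 0x00)
        return ICON_KIND_NONE;

    // Bytes 2-3: type, 1 for .ico and 2 for .cur.
    const unsigned type = data[2] | (static_cast<unsigned>(data[3]) << 8);
    if (type != 1 && type != 2)
        return ICON_KIND_NONE;

    // Bytes 4-5: number of images stored in the file.
    if (imageCount)
        *imageCount = data[4] | (static_cast<unsigned>(data[5]) << 8);

    return type == 1 ? ICON_KIND_ICO : ICON_KIND_CUR;
}
```

Because the signature is so short and mostly zeros, a caller might also treat an image count of 0 as suspicious; that extra rule is a heuristic and is left out of the sketch.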
Resources:
File Signatures Table: Gary Kessler’s list of magic bytes at the beginning of many popular file types.
List of file signatures: Wikipedia’s list of magic bytes, less complete than Kessler’s list.
JPEG: Wikipedia entry about JPEG.
CCITT Recommendation T.81: The original specification for JPEG.
JFIF, JPEG File Interchange Format, Version 1.02: The Library of Congress’ reference on JPEG.
Portable Network Graphics (PNG) Specification and Extensions: The Libpng website has a section for PNG specifications.
GIF: Wikipedia article about the GIF format.
Graphics Interchange Format (GIF) Specification: The original 1987 GIF specification.
Graphics Interchange Format version 89a: The original specification for GIF89a.
Tagged Image File Format: Wikipedia article about the TIFF format.
BMP file format: Wikipedia article about the BMP format.
WebP – A new image format for the Web: Google’s page about WebP.
WebP: Wikipedia article about the WebP format.
Multimedia Programming Interface and Data Specifications 1.0: Microsoft’s original RIFF specification, the container format for WebP.
ICO (file format): Wikipedia article about the ICO format.