Packing Data Files into Compiled Executables

May 31, 2013 · by rafael · in Programming

Have you ever wanted to distribute a compiled binary that included data files packed into the executable file?

Embedding a Data File Before Compilation

You can do this before compilation by converting the file's bytes into C source (an array of byte literals) and then compiling that array into the program as a statically allocated buffer.

For example, you could take a text file that looks something like this:

[STD_STYLE]
<style type="text/css">
   body {
      text-align:center;
      padding:0;
      margin:1px 0 0;
      font: normal 12px Arial, Helvetica, Sans-serif;
      color:#333;
   }

   .... etc ...

      </div>
   </body>
</html>

And use a tool to convert it to C code:

unsigned int std_tpl_len = 7807;
static unsigned char std_tpl_buf[] = {
   0x5b, 0x53, 0x54, 0x44, 0x5f, 0x53, 0x54, 0x59, 
   0x4c, 0x45, 0x5d, 0xa, 0x3c, 0x73, 0x74, 0x79, 

   ... etc ...

   0x62, 0x6f, 0x64, 0x79, 0x3e, 0xa, 0x3c, 0x2f, 
   0x68, 0x74, 0x6d, 0x6c, 0x3e, 0xa, 0xa, 0x0, 
};
const char* std_tpl = (const char*)&std_tpl_buf[0];

Then in your program, instead of calling fopen() and reading the file, you use the statically allocated buffer (std_tpl in this case) exactly as you would a buffer that had been read in from a file.
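As a rough sketch of what that looks like in practice, the snippet below stands in a tiny hand-written buffer for the generated one and scans it the same way it would scan a buffer filled from a file. The names demo_tpl_buf, demo_tpl_len, demo_tpl and countLines are made up for this illustration.

```cpp
#include <cassert>
#include <cstring>

// Stand-in for generated output: the bytes of "[A]\nhi\n" plus a trailing NUL.
// As in the generated code, the length does not count the NUL.
unsigned int demo_tpl_len = 7;
static unsigned char demo_tpl_buf[] = {
   0x5b, 0x41, 0x5d, 0xa, 0x68, 0x69, 0xa, 0x0,
};
const char* demo_tpl = (const char*)&demo_tpl_buf[0];

// Counts newline characters in a buffer. It does not care whether the
// buffer came from a file or was compiled into the program.
unsigned int countLines( const char* buf, unsigned int len )
{
   assert( buf != 0 );
   unsigned int lines = 0;
   for ( unsigned int i = 0; i < len; i++ )
      if ( buf[i] == '\n' ) lines++;
   return lines;
}
```

Calling countLines( demo_tpl, demo_tpl_len ) walks the embedded bytes directly, with no file I/O and no allocation. Because the buffer ends in a NUL, it can also be handed to string functions such as strlen().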

The function to convert a file into C code looks something like this:

// u32 and readFromFile() come from the author's own library. readFromFile()
// is assumed to return a NUL-terminated buffer and to count the NUL in srcLen.
void convertToC( const char* fileName )
{
   u32 srcLen;
   char* srcBuf = readFromFile( fileName, srcLen );

   // generate a symbol name from the file name, without overrunning tmp
   char tmp[1024];
   char* ptmp = &tmp[0];
   const char* symp = fileName;
   while ( *symp && ptmp < &tmp[sizeof( tmp ) - 1] )
   {
      *ptmp = ( isalnum( (unsigned char)*symp ) ? *symp : '_' );
      ptmp++;
      symp++;
   }
   *ptmp = '\0';
   ptmp = &tmp[0];

   // We're just going to dump it to the screen. 
   // But you will want to write this to a file.
   cout << "unsigned int "  << ptmp << "_len = " << srcLen-1 << ";\n"
        << "static unsigned char " << ptmp << "_buf[] = {\n";

   // iterate over file buf and generate hex output, 8 bytes per line:
   cout << hex;
   for ( u32 i = 0; i < srcLen; i++ )
   {
      if ( !( i % 8 )) cout << ( i ? "\n   " : "   " );
      // go through unsigned char so bytes >= 0x80 are not sign-extended
      cout << "0x" << (u32)(unsigned char)srcBuf[i] << ", ";
   } 
   cout << dec << "\n};\nconst char* " << ptmp 
        << " = (const char*)&" << ptmp << "_buf[0];\n";
}
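Since the function above leans on helpers that are not shown (u32, readFromFile), here is a minimal sketch of the same idea using only the standard library. It works on data already held in a std::string and returns the generated text instead of printing it. The names makeSymbol and toCSource are invented for this sketch, and it does not handle file names that start with a digit.

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>

// Replaces every non-alphanumeric character in the file name with '_'.
std::string makeSymbol( const std::string& fileName )
{
   std::string sym;
   for ( std::size_t i = 0; i < fileName.size(); i++ )
   {
      unsigned char c = static_cast<unsigned char>( fileName[i] );
      sym += std::isalnum( c ) ? fileName[i] : '_';
   }
   return sym;
}

// Emits <sym>_len, <sym>_buf and <sym>, appending a NUL terminator to the
// array that is not counted in <sym>_len.
std::string toCSource( const std::string& fileName, const std::string& data )
{
   const std::string sym = makeSymbol( fileName );
   std::ostringstream os;
   os << "unsigned int " << sym << "_len = " << data.size() << ";\n"
      << "static unsigned char " << sym << "_buf[] = {\n" << std::hex;
   for ( std::size_t i = 0; i <= data.size(); i++ )
   {
      // The extra iteration (i == size) writes the terminating 0x0.
      unsigned int byte = ( i < data.size() )
         ? static_cast<unsigned char>( data[i] ) : 0u;
      if ( i % 8 == 0 ) os << ( i ? "\n   " : "   " );
      os << "0x" << byte << ", ";
   }
   os << std::dec << "\n};\nconst char* " << sym
      << " = (const char*)&" << sym << "_buf[0];\n";
   return os.str();
}
```

Returning a string keeps the generator easy to test and leaves it up to the caller whether the result goes to the screen or into a .c file that the build picks up.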

Embedding Data After Compilation

But what if you want to embed the data files into your binary after compilation? This is useful, for example, if you want to embed a license file into each binary, or if you have different data for each user. That way you don't have to send the data and the binary as separate files for each installation; each user simply receives a custom binary.

An easy trick is to append the data files to the end of the binary. Windows, Linux and OS X executables all tolerate this: the loader only maps what the executable's headers describe, so the trailing bytes do not affect how the binary itself runs.

The trick is to finish the file with a small trailer: a magic value, plus the offset into the binary where the packed data payload starts. If the magic value is present in the last bytes of the executable, your program knows that the data from that offset up to the trailer is packed data.

Packing data into an executable looks like this:

// the magic bytes 4B 4A 41 4B spell "KJAK" in ASCII
#define PACK_MAGIC 0x4B4A414B

// an object that holds the magic value.
class PackEnd
{
public:
   PackEnd( u32 offset ) { mMagic = PACK_MAGIC; mOffset = offset; }
   PackEnd() { ; }
   u32 mMagic;
   u32 mOffset;
};

void packDataFiles( const char* fileName, const char* execName )
{
   // read in the binary executable - error handling removed.
   u32 execLen;
   const char* execBuf = readFile( execName, execLen );

   // examine the last 8 bytes of the binary, if it is at least that long.
   if ( execLen >= sizeof( PackEnd ))
   {
      const PackEnd* end = 
         (const PackEnd*)( execBuf + ( execLen - sizeof( PackEnd )));

      // If this is packed data, remove it. We just have to set the end of
      // the binary back to the true end before the payload.
      if ( end->mMagic == PACK_MAGIC )
         execLen = end->mOffset;
   }

   // KxSerialObj, is a buffer container, that we will be writing data into.
   KxSerialObj serObj;

   // write the executable out.
   serObj.write( execBuf, execLen );

   // read in the payload file.
   u32 fileLen;
   const char* fileBuf = readFile( fileName, fileLen );

   // append it to the binary
   serObj.write( fileBuf, fileLen );

   // write out the pack structure: the magic, then the payload offset,
   // in the same order as the members of PackEnd.
   serObj << (u32)PACK_MAGIC << execLen;

   // We're using the current name of the executable to write to.
   // but you should probably write to a modified version of the exe
   // name. e.g. myapp_new.exe
   writeFile( execName, serObj.getBuffer(), serObj.getLength() );
}
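The listing above depends on helpers that are not shown (readFile, writeFile, KxSerialObj). As a sketch of the same steps on plain in-memory buffers, something like the following should work. Trailer, kMagic and packBuffer are names made up here, and the sketch assumes the file is packed and read on machines with the same byte order.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical trailer mirroring PackEnd: magic value, then payload offset.
struct Trailer
{
   std::uint32_t magic;
   std::uint32_t offset;
};

const std::uint32_t kMagic = 0x4B4A414B;

// Returns exec + payload + trailer. If exec already ends in a trailer, the
// old payload is dropped first, so packing twice does not stack payloads.
std::vector<char> packBuffer( std::vector<char> exec,
                              const std::vector<char>& payload )
{
   if ( exec.size() >= sizeof( Trailer ))
   {
      Trailer old;
      // memcpy avoids reading through a possibly misaligned pointer.
      std::memcpy( &old, exec.data() + exec.size() - sizeof( Trailer ),
                   sizeof( Trailer ));
      if ( old.magic == kMagic && old.offset <= exec.size() - sizeof( Trailer ))
         exec.resize( old.offset );
   }

   Trailer end = { kMagic, static_cast<std::uint32_t>( exec.size() ) };
   assert( end.offset == exec.size() );   // the image must fit in 32 bits

   exec.insert( exec.end(), payload.begin(), payload.end() );
   const char* raw = reinterpret_cast<const char*>( &end );
   exec.insert( exec.end(), raw, raw + sizeof( Trailer ));
   return exec;
}
```

Reading the executable into a std::vector<char>, calling packBuffer, and writing the result to a new file gives the same layout as above: original image, payload, then the 8-byte trailer.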

Reading it while you are running is just a matter of first getting the file name of the running executable, reading it into memory, and extracting a pointer into the data payload.

To find the executable file name on Windows:

const char* getExecutableName()
{
   static char buf[MAX_PATH] = { '\0' };
   if ( buf[0] == '\0')
     GetModuleFileName( NULL, buf, MAX_PATH );
   return buf;
}

Or you could do it like this:

// requires <windows.h> and <tlhelp32.h>
const char* getExecutableName()
{
   static char buf[MAX_PATH] = { '\0' };
   if ( buf[0] == '\0')
   {
      HANDLE hSnapshot = ::CreateToolhelp32Snapshot( 
         TH32CS_SNAPMODULE, GetCurrentProcessId() );
      if ( hSnapshot == INVALID_HANDLE_VALUE )
         return buf;

      MODULEENTRY32 me32 = {0}; 
      me32.dwSize = sizeof(MODULEENTRY32); 

      if ( Module32First( hSnapshot, &me32 ))
         strcpy( buf, me32.szExePath );
      CloseHandle(hSnapshot);
   }
   return buf;
}

On POSIX-style systems such as Linux and OS X, you get the executable name like this:

#include <unistd.h>
#include <limits.h>   // PATH_MAX (MAX_PATH is Windows-only)

#ifdef DARWIN
#include <sys/param.h>
#include <mach-o/dyld.h>
#endif // DARWIN

const char* getExecutableName()
{
   static char buf[PATH_MAX] = { '\0' };
   if ( !buf[0] )
   {
#ifdef DARWIN
      // size goes in as the buffer capacity; the call fails (non-zero)
      // if the path does not fit.
      uint32_t size = sizeof( buf );
      if ( _NSGetExecutablePath( buf, &size ) != 0 )
         buf[0] = '\0';
#else //!DARWIN
      // on linux, you can get a symlink directly to the binary 
      // through the /proc directory. readlink() does not NUL-terminate.
      ssize_t len = readlink( "/proc/self/exe", buf, sizeof( buf ) - 1 );
      buf[( len == -1 ) ? 0 : len] = '\0';
#endif //!DARWIN
   }
   return buf;
}

So getting a pointer to the packed data looks something like this:

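const char* getPackDataFile( KxSymbol fileName )
{
   // fileName is unused here: this simple example packs a single payload.
   const char* execFileName = getExecutableName();

   u32 execLen;
   const char* execFile = readFile( execFileName, execLen );

   const PackEnd* end = 
      (const PackEnd*)( execFile + ( execLen - sizeof( PackEnd )));
   if ( end->mMagic != PACK_MAGIC )
      return NULL;   // nothing is packed into this executable

   // the payload runs from mOffset up to the PackEnd trailer
   return ( execFile + end->mOffset );   
}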
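One detail the function above leaves to the caller is the payload's length: the payload is followed by the trailer, not by a NUL, so its size has to be computed as the file length minus the trailer size minus the offset. A minimal sketch of a reader that returns both, working on a buffer that already holds the executable's bytes, might look like this. Trailer, kMagic and findPayload are illustrative names, not part of the code above.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical trailer mirroring PackEnd: magic value, then payload offset.
struct Trailer
{
   std::uint32_t magic;
   std::uint32_t offset;
};

const std::uint32_t kMagic = 0x4B4A414B;

// Returns a pointer to the payload inside image and stores its length in
// payloadLen, or returns NULL when no valid trailer is present.
const char* findPayload( const std::vector<char>& image,
                         std::size_t& payloadLen )
{
   payloadLen = 0;
   if ( image.size() < sizeof( Trailer ))
      return NULL;

   const std::size_t trailerPos = image.size() - sizeof( Trailer );
   Trailer end;
   std::memcpy( &end, image.data() + trailerPos, sizeof( Trailer ));

   // Reject a missing magic value or an offset that points past the trailer.
   if ( end.magic != kMagic || end.offset > trailerPos )
      return NULL;

   payloadLen = trailerPos - end.offset;
   return image.data() + end.offset;
}
```

The bounds check on the offset matters because the last eight bytes of an unpacked binary could, in principle, happen to contain the magic value.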

This is a very simple example. For a real application you might want to consider multiple files with file names, etc. An easy way to do this would be to zip all the files you want to pack into a zip file, and then use an unzip library to read files out of the zip file.

Tags: c++, hacks, linux, osx, win32

2 Responses

SteveB · September 25, 2016 at 16:18:28 · →

This is a really interesting post. I am curious on the limitations of using a static buffer vs linking an .obj using ld.exe or similar and retrieving via extern char.
1. rafael · October 7, 2016 at 10:52:40 · →
  
  Interesting idea… but I’m not sure how to do that directly. The data that you link in would have to be packed into an object file – so you’d have to pack the data into an ELF file. I don’t know an easier way to do that than to transform it into text and compile it.
