Detecting Crashes in Multi-Threaded Programs

Detecting crashes in multi-threaded systems can be difficult. The crashes are often hard to reproduce, and the faulty operation may have happened many calls before the crash, so you don’t get a stack trace from the moment the fault actually occurred.

We’ll describe a simple method using tryLock() to find the code that is corrupting a shared data structure.

A typical problem is a shared resource leak. You have carefully protected a system with mutexes, but as the system grows more complex it is no longer clear whether all shared state is still properly protected. A structure that you thought was well protected is occasionally corrupted when threading load is high, because somewhere shared state is being accessed without protection.

(You can read more about good practice for using mutexes in multi-threaded programs.)

Use assert() to Detect Corrupted Data

If you program defensively, using assertions, you can often find the data structure that is being corrupted. You should be in the habit of asserting on the value of anything you are sure about. In multi-threaded code it is important to assert() even on things that would be obvious in single-threaded code, for example asserting shortly after you set a variable that it still holds the value you expect. If shared state is not protected, another thread may have modified the value between the time you wrote it and the time you read it back.
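
As a minimal sketch of that habit, the snippet below writes a value and immediately asserts that it reads back unchanged. The Account type and the deposit() function are hypothetical names invented for this illustration; they are not part of the system discussed in this article. Note that an optimizing compiler is allowed to assume there is no data race and remove the check, so this kind of assertion is most useful in unoptimized debug builds.

```cpp
#include <cassert>

// Hypothetical shared structure, used only for this illustration.
struct Account
{
   int balance = 0;
};

// Adds amount to the balance and checks that the write "stuck".
int deposit( Account& account, int amount )
{
   const int expected = account.balance + amount;
   account.balance = expected;

   // Always true in single-threaded code. If another thread writes to
   // account.balance without synchronization, this can fire close to
   // the point where the interference happened.
   assert( account.balance == expected );
   return account.balance;
}
```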

If your code is well protected with assertions you should be able to track down the data structure that is being corrupted shortly after the operation that corrupted it.

Once you know the data that gets corrupted you have to find the code that is corrupting it.

Use tryLock() to Detect Errors in the Code

tryLock() is a mutex function that attempts to lock a mutex and returns false if the mutex is busy, that is, if another thread has already locked it. So you add a new mutex to guard the data that is getting corrupted, but instead of locking that mutex you assert on tryLock().

This of course will not really protect that data. The tryLock() will not actually lock the mutex if it is busy. But you are expecting that this data structure was already being protected by some other mutex. What this will do is assert any time the real mutex fails to protect this data. That is, if despite your efforts with the original mutex, two threads still manage to access the same structure – tryLock() will fail and trigger your assert().

The great thing about this is that you will get an assert during the unsafe operation – not many calls later when the damage is detected.
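
The behaviour described above can be sketched with the standard library’s std::mutex, whose try_lock() plays the role of tryLock() here. This is an assumption made to keep the example self-contained: KxMutex is not shown in this section, and AccessCheck, enteredFromOtherThread() and checkedOperation() are hypothetical names. One caveat: the C++ standard allows std::mutex::try_lock() to fail spuriously, so a checker built on it could in principle raise a false alarm.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical checker: wraps a mutex that is only ever try-locked.
class AccessCheck
{
public:
   // Returns true if nobody else is currently inside the checked region.
   bool enter() { return mGuard.try_lock(); }

   // Must be called by the same thread that got true from enter().
   void leave() { mGuard.unlock(); }

private:
   std::mutex mGuard;
};

// Tries to enter from a second thread and reports whether it succeeded.
bool enteredFromOtherThread( AccessCheck& check )
{
   bool entered = false;
   std::thread other( [&]
   {
      entered = check.enter();
      if ( entered )
         check.leave();
   } );
   other.join();   // join() makes the write to entered visible here
   return entered;
}

// Debug check in the style of this article: fails if another thread
// is already inside the checked region.
void checkedOperation( AccessCheck& check )
{
   const bool exclusive = check.enter();
   assert( exclusive );   // fires during the overlapping access
   // ... work on the guarded data would go here ...
   if ( exclusive )
      check.leave();
}
```

If the calling thread has already called enter() successfully, enteredFromOtherThread() returns false without blocking. That false result is what the assert() turns into an immediate failure, with the offending call still on the stack.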

Let’s look at an example:

class SharedSubsidiaryObject; // defined below

class SharedObject
{
public:
   void doStuff()
   {
      mMutex.lock();
      // ... complex operations on mObjects ...
      mMutex.unlock();
   }

   void doMoreStuff()
   {
      mMutex.lock();
      // ... other operations on mObjects ...
      mMutex.unlock();
   }

   // The lookup is protected, but the returned pointer is used by the
   // caller after mMutex has been released.
   SharedSubsidiaryObject* getObject( Handle h )
   {
      mMutex.lock();
      SharedSubsidiaryObject* obj = mObjects.get( h );
      mMutex.unlock();
      return obj;
   }

private:
   Container<SharedSubsidiaryObject> mObjects;
   KxMutex mMutex;
};

class SharedSubsidiaryObject
{
public:
   bool isValid(); // returns true if this object's internal state is not corrupt
   void doStuff()
   {
      assert( isValid() );
      // ... make changes ...
      assert( isValid() );
   }
};

So you’ve designed this system such that only one thread can access the system at a time. You’re pretty sure the system can’t get corrupted. And you’re pretty sure that calls to get the contained SharedSubsidiaryObject only happen at times when the object being returned is not being modified in another thread.

Yet under threaded operations sometimes one of the assertions in SharedSubsidiaryObject::doStuff() is being triggered. Great. You’ve found the data whose state is not being protected – it is the contents of this object. Now you need to find the faulty code.

So you add a mutex to SharedSubsidiaryObject, but you only call tryLock() on that mutex:

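class SharedSubsidiaryObject
{
public:
   bool isValid(); // returns true if this object's internal state is not corrupt
   void doStuff()
   {
      // Fails immediately if another thread is already inside doStuff().
      assert( mMutex.tryLock() );
      assert( isValid() );
      // ... make changes ...
      assert( isValid() );
      // Inside assert() so that it only runs when tryLock() did;
      // this relies on unlock() returning true.
      assert( mMutex.unlock() );
   }

private:
   KxMutex mMutex; // never locked for real, only try-locked in assertions
};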

Say, for example, that some function outside the system is holding a pointer to a SharedSubsidiaryObject longer than expected and calling doStuff() on it at an unexpected time, so that sometimes two threads are using that object at once. The corruption doesn’t happen often, as will usually be the case.

As soon as that condition happens, even if no actual corruption happens, you will be alerted to the existence of the shared resource leak, and get a stack trace for the exact code that is causing it.

One aspect of this that might be confusing is why we’re asserting on the unlock() call since it should always succeed.

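This is because code inside an assertion only runs when assertions are enabled, typically in debug builds. tryLock() is only called in those builds, so that is the only time we want to unlock the mutex. Note that this relies on unlock() returning true on success; a function returning void cannot be wrapped in assert().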
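
Pairing two assert() calls by hand works, but an early return or an exception between them would leave the mutex locked in debug builds. One way to get the same debug-only behaviour is a small scope guard, sketched here with std::mutex standing in for KxMutex. ScopedAccessAssert, ASSERT_EXCLUSIVE and Counter are hypothetical names, not part of the original classes.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical RAII helper: asserts exclusive access for one scope.
class ScopedAccessAssert
{
public:
   explicit ScopedAccessAssert( std::mutex& m )
      : mMutex( m ), mOwned( m.try_lock() )
   {
      assert( mOwned && "two threads are inside the same object" );
   }

   // Runs on every exit path, so early returns cannot leave the mutex held.
   ~ScopedAccessAssert()
   {
      if ( mOwned )
         mMutex.unlock();
   }

   ScopedAccessAssert( const ScopedAccessAssert& ) = delete;
   ScopedAccessAssert& operator=( const ScopedAccessAssert& ) = delete;

private:
   std::mutex& mMutex;
   bool mOwned;
};

// Compiles to nothing when assertions are disabled, like assert() itself.
#ifndef NDEBUG
#define ASSERT_EXCLUSIVE( m ) ScopedAccessAssert exclusiveCheck_( m )
#else
#define ASSERT_EXCLUSIVE( m ) ( (void)0 )
#endif

// Hypothetical object using the check.
struct Counter
{
   int value = 0;
   std::mutex debugMutex;

   int increment()
   {
      ASSERT_EXCLUSIVE( debugMutex );
      if ( value < 0 )
         return value;   // early return: the guard still releases
      return ++value;
   }
};
```

In a release build the macro expands to an empty statement, so increment() pays nothing for the check, which matches the intent of wrapping both calls in assert().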

Adding Redundant Mutexes is a Very Bad Idea

You may be wondering why we don’t just put a real mutex in SharedSubsidiaryObject and be done with it. If there are synchronization issues with that object it will probably solve them.

Mutexes are necessary to protect shared state in multi-threaded programs – but they are also the primary source of inefficiency. Any time you have to synchronize shared data structures you by necessity limit the amount of parallelism that is possible.

Adding unnecessary mutexes will usually significantly reduce the amount of parallelism that is possible. In a perfectly optimized multi-threaded program you use just the synchronization methods that are necessary and no more.

This is why it is much better to identify the synchronization issues you have and design a specific solution for them than to defensively add extra mutexes everywhere you think there might be an issue.
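
As one illustration of a targeted fix for the leak in the example above, the owner can run the caller’s operation while its existing mutex is held instead of handing out a raw pointer. This is only a sketch under assumptions: Owner and Item are hypothetical stand-ins for SharedObject and SharedSubsidiaryObject, std::map and std::mutex replace Container and KxMutex, and whether this shape fits depends on how callers actually use the object.

```cpp
#include <cassert>
#include <map>
#include <mutex>

// Hypothetical stand-in for SharedSubsidiaryObject.
struct Item
{
   int value = 0;
   bool isValid() const { return value >= 0; }
   void doStuff()
   {
      assert( isValid() );
      ++value;
      assert( isValid() );
   }
};

// Hypothetical stand-in for SharedObject.
class Owner
{
public:
   // Runs fn on the item while the owner's existing mutex is held,
   // so no pointer to the item escapes the protected region.
   template <typename Fn>
   void withObject( int handle, Fn fn )
   {
      std::lock_guard<std::mutex> lock( mMutex );
      fn( mObjects[handle] );   // creates the item if it does not exist yet
   }

private:
   std::map<int, Item> mObjects;
   std::mutex mMutex;
};
```

No second mutex is added, so no extra serialization is introduced. The trade-off is that fn must not call back into Owner, because std::mutex is not recursive and the call would deadlock.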

Implementing tryLock() Portably

In an earlier article I went over how to implement mutexes portably on Windows, Linux and OS X. Here we’ll go over how to add tryLock() to those classes.

It’s pretty easy, as you just have to call down to standard functions. On Windows, TryEnterCriticalSection() returns nonzero when the lock was acquired:

bool KxMutex::tryLock_i()
{
   return ( TryEnterCriticalSection( (CRITICAL_SECTION*)this) ? true : false );
}

OS X and Linux both use pthreads, so the implementation is the same. pthread_mutex_trylock() returns 0 on success, hence the inverted test:

bool KxMutex::tryLock_i()
{
   return ( pthread_mutex_trylock( (pthread_mutex_t*)this ) ? false : true );
}
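
To show how the two variants fit together, here is a self-contained sketch of a mutex class with both implementations selected at compile time. PortableMutex and checkedSection() are hypothetical names; the real KxMutex is not shown in this section and, judging by the casts above, stores the native object differently. One portability difference to keep in mind: a Windows critical section is recursive, so TryEnterCriticalSection() also succeeds when the calling thread already owns it, whereas a default pthread mutex typically reports busy in that case.

```cpp
#include <cassert>

#ifdef _WIN32
#include <windows.h>
typedef CRITICAL_SECTION NativeMutex;
#else
#include <pthread.h>
typedef pthread_mutex_t NativeMutex;
#endif

// Hypothetical stand-in for KxMutex.
class PortableMutex
{
public:
   PortableMutex()
   {
#ifdef _WIN32
      InitializeCriticalSection( &mNative );
#else
      pthread_mutex_init( &mNative, nullptr );
#endif
   }

   ~PortableMutex()
   {
#ifdef _WIN32
      DeleteCriticalSection( &mNative );
#else
      pthread_mutex_destroy( &mNative );
#endif
   }

   PortableMutex( const PortableMutex& ) = delete;
   PortableMutex& operator=( const PortableMutex& ) = delete;

   // Returns true if the lock was acquired, false if it was busy.
   bool tryLock()
   {
#ifdef _WIN32
      return TryEnterCriticalSection( &mNative ) != 0;
#else
      return pthread_mutex_trylock( &mNative ) == 0;
#endif
   }

   // Returns true on success so that it can be wrapped in assert().
   bool unlock()
   {
#ifdef _WIN32
      LeaveCriticalSection( &mNative );
      return true;
#else
      return pthread_mutex_unlock( &mNative ) == 0;
#endif
   }

private:
   NativeMutex mNative;
};

// Debug-only check in the style of this article.
void checkedSection( PortableMutex& m )
{
   (void)m;   // avoids an unused-parameter warning when asserts are off
   assert( m.tryLock() );
   // ... work on the guarded data would go here ...
   assert( m.unlock() );
}
```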