
The Hidden Bugs of Distributed Systems

November 14, 2024 · 7 min read

Exciting! Your team has a new project that includes user event processing, webhooks, time series analysis, a statistics dashboard, AI model training, and many more buzzwords that make this project the coolest in the world.

One of the talented engineers designed it, and they’ve done an amazing job! The diagram looks great, and you understand exactly what goes where.

You start coding like a ninja! 🥷 Every PR gets approved, and there is 100% code coverage, including E2E and load testing. You configure metrics, monitors, and tracing.

You go to production, and everything is great; some early-bird customers start using the feature and love it! 🚀

Then, the first whale comes 🐳. The system starts to slow down, and you see some errors in the logs. You dive into the code, and you see that the system is not handling the load as expected. On-call starts feeling like a stressful duty, and no one understands what's happening. You reread the code, check the CPU and memory, and review other metrics. You test it, and everything works like a charm.

Deadlocks

You dive deeper into the logs and see a DEADLOCK message from the DB. The stack trace leads you to:

await db.transaction(async (txn) => {
  await Promise.all([
    itemModel.update(itemData, txn),
    itemRefModel.update(refData, txn),
  ]);
});

This causes a deadlock when two events arrive for the same item: one acquires the lock on the item table while the other acquires the lock on the itemRef table. The first event then waits for the lock held by the second, and vice versa.

This is a less common race condition that becomes more frequent when you handle thousands of concurrent events per second. I have another post about deadlocks; you can read it here.

The solution is simple: run the updates in the same order every time:

await db.transaction(async (txn) => {
  await itemModel.update(itemData, txn);
  await itemRefModel.update(refData, txn);
});

Idempotent operations

A few minutes after you move the ticket to "Done", you see the log:

"Can't delete item because it was not found"

This happens because the "deleted" event arrived twice. You should make the operation idempotent by simply ignoring the event if the item does not exist:

async function handler(event) {
  if (event.action === 'deleted') {
    try {
      await itemModel.delete(event.item.id)
    } catch (e) {
      if (e.name === 'ItemNotFound') {
        logger.warn(`Item ${event.item.id} not found, ignoring the deletion event`)
      } else {
        throw e
      }
    }
  }
}
NOTE
You should not silently ignore errors in synchronous APIs. This is an example of asynchronous event processing, where the event has already been accepted and there is no caller to return an error to.

Unordered events

You fix it and think you have solved the issue. But the next day, another issue occurs: the customer says that an entry that should not exist appears in the dashboard, and you see the same log message as before:

"Can't delete item because it was not found"

After some investigation, you find out that the "deleted" event arrived before the "created" event, even though the "created" event was initiated first.

There are multiple ways to solve this issue; one of them is to store the deletion event as a marker. When handling the "created" event, check whether the item already exists. This also ensures idempotency for the "created" event:

async function handler(event) {
  if (event.action === 'deleted') {
    try {
      await itemModel.update({ deleted: true }, event.item.id)
    } catch (e) {
      if (e.name === 'ItemNotFound') {
        logger.warn(`Item ${event.item.id} not found, assuming that the creation event will arrive soon`)
        await itemModel.create({ id: event.item.id, deleted: true })
      } else {
        throw e
      }
    }
  } else if (event.action === 'created') {
    try {
      await itemModel.create(event.item)
    } catch (e) {
      if (e.name === 'ItemAlreadyExists') {
        logger.warn(`Item ${event.item.id} already exists, ignoring the creation event`)
      } else {
        throw e
      }
    }
  }
}

Dirty reads

It is not over yet... This time, the customer says that events arrive with a one-hour delay, which is unexpected. The cause is incorrect use of the cache:

async function collect() {
  const cacheKey = 'collectionMetadata'
  let collectionMetadata = await cache.get(cacheKey)
  if (collectionMetadata == null) {
    collectionMetadata = await collectionMetadataModel.get()
    await cache.set(cacheKey, collectionMetadata, { ttl: Date.now() + 1 * HOUR })
  }
  const response = await fetch(collectionMetadata.cursor)
  const data = await response.json()
  for (const event of data.events) {
    await handler(event)
  }
  await collectionMetadataModel.updateCursor(data.cursor)
}

Do you see the problem? The collectionMetadata is fetched from the cache, but the cursor is updated only in the DB, so for up to an hour the cache keeps serving a stale cursor and the same events are fetched again. You should update (or invalidate) the cache after the cursor is updated.
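
A minimal sketch of that fix, reusing the same cache API and models as above: write the fresh metadata back to the cache (or simply delete the key) right after the cursor is persisted.

async function collect() {
  const cacheKey = 'collectionMetadata'
  let collectionMetadata = await cache.get(cacheKey)
  if (collectionMetadata == null) {
    collectionMetadata = await collectionMetadataModel.get()
    await cache.set(cacheKey, collectionMetadata, { ttl: Date.now() + 1 * HOUR })
  }
  const response = await fetch(collectionMetadata.cursor)
  const data = await response.json()
  for (const event of data.events) {
    await handler(event)
  }
  // Persist the new cursor, then refresh the cached copy so the next run
  // does not read a stale cursor and fetch the same events again
  await collectionMetadataModel.updateCursor(data.cursor)
  collectionMetadata.cursor = data.cursor
  await cache.set(cacheKey, collectionMetadata, { ttl: Date.now() + 1 * HOUR })
}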

Thundering herd

The customer is finally happy, but not for long: once every hour, the system goes down for a few seconds. You check the logs and see that the cache expires every hour, which causes many requests to the API at the same time:

async function handler(event) {
  const cacheKey = 'cachedItem:' + event.item.id
  let cachedItem = await cache.get(cacheKey)
  if (cachedItem == null) {
    // cachedItem = Fetch item from the API
    await cache.set(cacheKey, cachedItem, { ttl: Date.now() + 1 * HOUR })
  }
  // ...
}

This item has many events, and many items usually arrive together. Once every hour, all of these cache entries expire at the same time, and a burst of requests hits the API. You can add a random value so the effective TTL is different for each item:

async function handler(event) {
  const cacheKey = 'cachedItem:' + event.item.id
  let cachedItem = await cache.get(cacheKey)
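  // Refresh up to 10 minutes early, at a random point, so entries that were cached together don't all expire together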
  if (cachedItem == null || Date.now() > cachedItem.ttl - Math.random() * 10 * MINUTE) {
    // cachedItem = Fetch item from the API
    cachedItem.ttl = Date.now() + 1 * HOUR
    await cache.set(cacheKey, cachedItem, { ttl: cachedItem.ttl })
  }
  // ...
}
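
An alternative sketch with the same effect is to add the jitter when the entry is written, so each key simply gets a slightly different TTL (using the same hypothetical cache API and time constants as above):

async function handler(event) {
  const cacheKey = 'cachedItem:' + event.item.id
  let cachedItem = await cache.get(cacheKey)
  if (cachedItem == null) {
    // cachedItem = Fetch item from the API
    // Spread expirations across a 10-minute window instead of on the exact hour
    const jitter = Math.random() * 10 * MINUTE
    await cache.set(cacheKey, cachedItem, { ttl: Date.now() + 1 * HOUR + jitter })
  }
  // ...
}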

Starvation

That is satisfying! After all these bug fixes, the system finally operates smoothly, and the performance looks great! But now, some small customers complain that their events are processed too slowly. You see that most of the events come from a few big customers, and the small customers are waiting in the queue.

There are many ways to solve this issue:

  1. Rate limiting - Limit the number of events per customer. (see the sketch after this list)
  2. Priority queue - Separate queues based on the customer size.
  3. Multi-tenancy - Have dedicated resources for each customer. (or at least for the big ones)
  4. Auto-scaling - Scale the system to handle events faster, giving more time to small customers.
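
As an illustration of option 1, here is a minimal sketch of a per-customer fixed-window counter kept in memory; the window size, budget, and allowEvent name are assumptions for the example, and a real system would typically keep these counters in Redis or in the queueing layer:

const buckets = new Map()
const MAX_EVENTS_PER_WINDOW = 1000 // assumed per-customer budget

function allowEvent(customerId) {
  const now = Date.now()
  let bucket = buckets.get(customerId)
  // Start a new counting window for this customer if none exists or the previous one ended
  if (bucket == null || now - bucket.windowStart >= 1 * MINUTE) {
    bucket = { windowStart: now, count: 0 }
    buckets.set(customerId, bucket)
  }
  bucket.count += 1
  return bucket.count <= MAX_EVENTS_PER_WINDOW
}

Events that exceed the budget can be re-queued with a delay, so a whale-size customer is throttled instead of starving everyone else.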

Conclusion

These kinds of issues occur because of a gap: the high-level design is not detailed enough to cover such edge cases, and reading code line by line is too deterministic. Thinking about what happens at every line of your code while another concurrent event could be at any other line is complicated, but it is a muscle that every developer can train, and usually, once you start thinking this way, you can’t stop.

Over the years, I found a pattern in how I think when working in such systems:

  1. Assume events arrive in any order - Events can arrive in any order, and you should handle each permutation correctly. Go for solutions that do not rely on timing or timestamps, like using sequentially ordered IDs. (see the sketch after this list)
  2. Prefer idempotent operations - If you can make your operation idempotent, you can retry it without worrying about the state of the system.
  3. Dirty reads - Assume that the entity state has changed since you fetched it. A few ms after you fetch, another event causes an update, which makes your local memory stale.
  4. Thundering herd - Using TTL somewhere (cache, lock, rate limiting)? Use a random value to prevent all requests from expiring together.
  5. Be fair - Limited resources and naive algorithms can cause small users to get too little processing time. Use rate limiting, priority queues, multi-tenancy, or other solutions.
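
For the first point, one sketch of relying on sequence numbers instead of timestamps is to keep the last applied sequence on the entity and skip events that carry an older one; itemModel.get and the sequence field are assumptions for the example:

async function applyUpdate(event) {
  const item = await itemModel.get(event.item.id)
  // An event with a sequence number we have already passed is outdated and skipped
  if (item != null && event.sequence <= item.sequence) {
    logger.warn(`Ignoring outdated event ${event.sequence} for item ${event.item.id}`)
    return
  }
  await itemModel.update({ ...event.item, sequence: event.sequence }, event.item.id)
}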

Of course, there are so many more things to consider (integrity, security, infrastructure, networking, cost, and more), but these are the most common issues I see in code.

In conclusion, building distributed systems is not easy but not impossible. Consider edge cases and how your code can fail. Understand the underlying system and how it interacts with your code. Monitor your system and understand what’s happening. Once you have the knowledge and the skill of thinking about your code in a distributed way, you can prevent such bugs and build a system that can handle whale-size customers 🐳.