
Your Worst-case Serverless Scenario Part III: The Invisible Process

Niels van Bree

In this third and final installment of “Your Worst-case Serverless Scenario” we will talk about a nasty ‘invisible process’ in DynamoDB and briefly discuss partition key design and table indexes. We will close the case by summarizing our lessons learned. If you haven’t read the first two parts of this series, we highly recommend checking them out, since this article builds directly on them.

The Invisible Process

Last but not least, we have the issue of the origin table still experiencing problems after everything had been cleared up. More specifically, if you tried to write more items to the table even a day later, you would still end up with timeout errors. However, reading from the table worked just fine. The strange thing was: none of the metrics indicated that write requests were being throttled at that moment, or that any process was running at all. The number of items in the table was no longer increasing, and auto scaling had brought the read and write capacities back down to their normal levels. As we already knew from another case, this had to do with updating the index. When you write an item to a table, that item is also written to all of the table’s indexes. For this to happen, a read on the base table and a write on each target index are needed.
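The little that DynamoDB does expose about index state can be read from a DescribeTable response: a newly created GSI reports an IndexStatus of CREATING and a Backfilling flag while it is being populated. Ordinary write propagation lag, like in our case, is not exposed at all. A minimal sketch of extracting that information (the table and index names are hypothetical; the response shape follows boto3’s describe_table):

```python
# Sketch: inspect GSI status from a DescribeTable response.
# Note: the "Backfilling" flag only appears while a *newly created*
# index is being populated; write propagation lag on an existing
# index (our scenario) has no metric or status field at all.

def index_statuses(table_description: dict) -> list:
    """Extract (name, status, backfilling) for each GSI."""
    table = table_description["Table"]
    return [
        (
            gsi["IndexName"],
            gsi["IndexStatus"],
            gsi.get("Backfilling", False),
        )
        for gsi in table.get("GlobalSecondaryIndexes", [])
    ]

# Trimmed example of what boto3's
# dynamodb.describe_table(TableName="origin-table") returns
# (names are illustrative):
sample = {
    "Table": {
        "TableName": "origin-table",
        "GlobalSecondaryIndexes": [
            {"IndexName": "byCustomer", "IndexStatus": "ACTIVE"},
        ],
    }
}

print(index_statuses(sample))  # [('byCustomer', 'ACTIVE', False)]
```

In our incident this check would have reported ACTIVE the whole time, which is exactly what made the process invisible.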

Our auto scaling options were set separately for read and write capacity, with the same settings applied to all Global Secondary Indexes (we had just one index). That meant that during the massive burst of writes on the base table, the write capacities of the base table and the target index skyrocketed, but the read capacity remained the same. In other words: the items could initially be written to the base table, but the index then couldn’t keep up, because even though the index’s write capacity had scaled up, the base table’s read capacity was too low. To make things even worse, the upscaled write capacity turned out to be insufficient as well, because the partition key of the index was very badly chosen. In this particular case, all the written items had different partition key values on the base table, but the same partition key value on the index: in our case, 56 million items sharing a single value! According to the official AWS documentation on designing partition keys to distribute your workload evenly:
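The underlying reason the capacities can drift apart like this is that table and index read/write capacity are four independent scaling targets in Application Auto Scaling. A sketch of the four dimensions involved (table and index names are hypothetical, and the capacity bounds are placeholders):

```python
# Sketch: a DynamoDB table with one GSI has four *independent*
# Application Auto Scaling targets. Scaling one of them (e.g. index
# writes) does nothing for another (e.g. base-table reads), which is
# how our index fell behind. Names and bounds are illustrative.

SCALABLE_DIMENSIONS = [
    ("table/origin-table", "dynamodb:table:ReadCapacityUnits"),
    ("table/origin-table", "dynamodb:table:WriteCapacityUnits"),
    ("table/origin-table/index/byCustomer", "dynamodb:index:ReadCapacityUnits"),
    ("table/origin-table/index/byCustomer", "dynamodb:index:WriteCapacityUnits"),
]

def scaling_target(resource_id: str, dimension: str,
                   min_cap: int = 5, max_cap: int = 1000) -> dict:
    """Build kwargs for application-autoscaling register_scalable_target."""
    return {
        "ServiceNamespace": "dynamodb",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_cap,
        "MaxCapacity": max_cap,
    }

targets = [scaling_target(r, d) for r, d in SCALABLE_DIMENSIONS]
# With boto3, each dict would be passed to:
#   boto3.client("application-autoscaling").register_scalable_target(**t)
```

The point is not the API call itself but the shape of the configuration: four separate targets means four separate decisions, and forgetting to scale one of them leaves a bottleneck.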

“The partition key portion of a table’s primary key determines the logical partitions in which a table’s data is stored. This in turn affects the underlying physical partitions. Provisioned I/O capacity for the table is divided evenly among these physical partitions. Therefore, a partition key design that doesn’t distribute I/O requests evenly can create “hot” partitions that result in throttling and use your provisioned I/O capacity inefficiently.”
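A common way to avoid such a hot partition, taken from the same best-practice guide, is write sharding: appending a deterministic suffix to the partition key value so that writes spread over several logical partitions, at the cost of readers having to query all shards and merge. A minimal sketch (the shard count and key format are illustrative):

```python
import hashlib

# Sketch: spread writes that would all share one partition key value
# over N "shards" by appending a deterministic suffix derived from the
# item id. Readers query all N shard keys and merge the results.
# Shard count and key format are illustrative choices.

N_SHARDS = 10

def sharded_partition_key(base_value: str, item_id: str) -> str:
    """Derive a stable shard suffix from the item id."""
    digest = hashlib.sha256(item_id.encode()).digest()
    shard = digest[0] % N_SHARDS
    return f"{base_value}#{shard}"

def all_shard_keys(base_value: str) -> list:
    """Every key a reader must query to cover the logical partition."""
    return [f"{base_value}#{s}" for s in range(N_SHARDS)]

# The same item id always lands on the same shard, so lookups by id
# still work without scanning all shards:
key = sharded_partition_key("ORDERS-2020-03", "item-123")
assert key == sharded_partition_key("ORDERS-2020-03", "item-123")
assert key in all_shard_keys("ORDERS-2020-03")
```

Had our index key been sharded like this, the 56 million items would have been spread over several partitions instead of hammering a single one.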

In short: a very badly chosen partition key on the index, combined with the separation of read and write capacity auto scaling, led to an index that simply couldn’t keep up with all the writes being performed on the base table. The nasty side effect was that in the background, the index was still being filled bit by bit, but there isn’t any metric that shows you how that process is faring. If anyone has an idea if and how we could further inspect and/or influence this “invisible process”, please let us know in the comments, because that would be very helpful in getting a better understanding of DynamoDB’s background processes.

Throttled write requests on the index of the origin table at the same time as writes were being performed on the base table. After that, everything seemed to go back to no throttling at all, since this metric and all others related to the index remained ‘calm’: there was nothing to see. Yet on the 3rd of March there was clearly still a process running, because writes were no longer possible.

Our best solution to clean up this mess was simply to take a snapshot of the table from right before all of this started and restore the table to that state.
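With point-in-time recovery enabled on the source table, such a restore is a single API call that lands the data in a new table. A hedged sketch of the parameters involved (table names and the timestamp are illustrative, not the actual values from our incident):

```python
from datetime import datetime, timezone

# Sketch: restoring a table to a moment just before the incident using
# DynamoDB point-in-time recovery (PITR must already be enabled on the
# source table). The restore always produces a *new* table; names and
# the timestamp here are illustrative.

restore_kwargs = {
    "SourceTableName": "origin-table",
    "TargetTableName": "origin-table-restored",
    "RestoreDateTime": datetime(2020, 3, 1, 0, 0, tzinfo=timezone.utc),
}

# With boto3 this would be executed as:
#   boto3.client("dynamodb").restore_table_to_point_in_time(**restore_kwargs)
```

Note that provisioned capacity settings, auto scaling policies and CloudWatch alarms are not carried over by a restore and have to be reconfigured on the new table.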

The Lessons We Have Learned

Although we experienced quite a few problems when all of this happened, it was also the perfect case to learn from. It’s good to close this case by summing up some of the key things we have learned from it.

  • Knowledge + experience is power: Looking back at everything that went wrong here, some things had to do with a simple lack of knowledge or experience of the inner workings of various AWS services, such as the retry behavior of asynchronous Lambda invocations and the ‘invisible process’ of a DynamoDB table. We notice that the more experienced we are, the fewer unpleasant surprises we run into. However, keeping up to date and double-checking the AWS documentation is necessary if we want to stay on top of things.
  • Avoid Lambda chains as much as possible: If something goes wrong somewhere along the chain, the consequences could be catastrophic. Whenever we think we need this pattern again, we should consider redesigning the architecture.
  • Use the tools given to you: In addition to the point above: simply making full use of Lambda’s maximum execution time would already have removed the need for a chain. And even if more executions had been needed, the chain would have been much shorter and caused far less trouble.
  • Better error handling + logging: Tests are there to prevent foreseen errors, but what about unforeseen ones? This case once again showed us that good error handling is key. Wherever errors can occur, make sure there’s logic that catches the error, stops or alters further execution to prevent more errors from arising, and logs it to a place where we can inspect exactly what went wrong.
  • Think better about table partition design: A poorly chosen partition design can cause a bottleneck in your (DynamoDB) table. Next time when we set up a table, we should definitely take this into consideration and follow best practices, especially on tables where we expect a large amount of items to be inserted.
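To make the error handling point concrete, here is a minimal sketch of a Lambda-style handler that catches unforeseen errors, logs enough context to reconstruct the failure, and returns instead of re-raising, so an asynchronous retry or downstream chain invocation is not triggered again. The handler name and payload shape are hypothetical:

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Sketch: a Lambda-style handler that catches unforeseen errors,
# logs the full event plus stack trace, and stops instead of
# re-raising (which would trigger async retries and keep a chain
# alive). Handler and payload shape are illustrative.

def handler(event, context=None):
    try:
        payload = json.loads(event["body"])
        # ... the actual work would go here ...
        return {"statusCode": 200, "body": json.dumps({"ok": True})}
    except Exception:
        # logger.exception records the traceback; including the event
        # makes the failure reproducible when inspecting the logs later.
        logger.exception("handler failed; event=%s", json.dumps(event))
        return {"statusCode": 500, "body": json.dumps({"ok": False})}

print(handler({"body": "not json"})["statusCode"])  # 500
```

Whether to swallow or re-raise depends on the invocation model: for synchronous calls re-raising surfaces the error to the caller, while for asynchronous invocations returning cleanly is what prevents the retry storm we described in the earlier parts.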

And with the biggest lessons learned summarized, that closes this case. We strive to never experience problems on such a large scale again, and we hope you never have to. Altogether, this case has been rich in learning moments for us and makes us better serverless developers in the long run. We hope that you enjoyed the articles, and if you’re left with any questions or remarks, feel free to ask them.


Your Worst-case Serverless Scenario Part III: The Invisible Process was originally published in Levarne Cloud Software Services on Medium, where people are continuing the conversation by highlighting and responding to this story.