In this third and final installment of “Your Worst-case Serverless Scenario” we talk about a nasty ‘invisible process’ in DynamoDB and briefly discuss partition design and table indexes. We will close the case by summarizing our lessons learned. If you haven’t read the first two parts of this series, we highly recommend checking them out, since this article builds on them.
Last but not least, there was the issue of the origin table still experiencing problems after everything had been cleared up. More specifically, even a day later, trying to write more items to the table would still end in timeout errors, while reading from the table worked just fine. The strange thing was: none of the metrics indicated that write requests were being throttled or that any process was running at all. The number of items in the table didn’t increase anymore either, and auto scaling had brought read and write capacity back down to their normal levels. As we knew from another case, this had to do with updating the index. When you write an item to a table, that item is also written to every index on the table. For this to happen, a read on the base table and a write on each target index are needed.
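To get a feel for this write amplification, here is a minimal sketch of the capacity math. `put_item_wcus` is a hypothetical helper, and it simplifies reality: it assumes every index projects the whole item, whereas actual index cost depends on the projected size (and can double when index key attributes change). The 1 WCU per 1 KB rounding is DynamoDB’s standard-write pricing rule.

```python
import math

def put_item_wcus(item_size_bytes: int, gsi_count: int) -> int:
    """Rough WCU estimate for one PutItem on a table with gsi_count GSIs."""
    # DynamoDB charges 1 WCU per 1 KB of item size, rounded up.
    per_write = math.ceil(item_size_bytes / 1024)
    # Simplification: assume each GSI write costs the same as the base write.
    return per_write * (1 + gsi_count)

print(put_item_wcus(2500, 0))  # 3 -> base table only
print(put_item_wcus(2500, 1))  # 6 -> the same write with one GSI
```

So even one index doubles the effective write cost of every item, which matters when millions of items arrive at once.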
Our auto scaling options were configured separately for read and write capacity, with the same settings applied to all Global Secondary Indexes (we had just one). During the massive burst of writes on the base table, the write capacities of the base table and the target index therefore skyrocketed, but the read capacity stayed the same. In other words: the items could initially be written to the base table, but the index then fell behind, because even though the index’s write capacity had scaled up, the base table’s read capacity was too low. To make things even worse, the upscaled write capacity turned out to be insufficient as well, because the partition key of the index was very badly chosen. In this particular case, all the written items had distinct partition key values on the base table, but the very same partition key value on the index: in our case, 56 million items sharing a single value. According to the official AWS documentation on designing partition keys to distribute your workload evenly:
“The partition key portion of a table’s primary key determines the logical partitions in which a table’s data is stored. This in turn affects the underlying physical partitions. Provisioned I/O capacity for the table is divided evenly among these physical partitions. Therefore, a partition key design that doesn’t distribute I/O requests evenly can create “hot” partitions that result in throttling and use your provisioned I/O capacity inefficiently.”
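The effect the documentation describes is easy to simulate. The snippet below uses a plain hash as a stand-in for DynamoDB’s internal partitioning (the real function and partition count are not public; `PARTITIONS = 8` is an arbitrary assumption): distinct key values spread writes across all partitions, while one shared value funnels everything into a single hot partition.

```python
import hashlib
from collections import Counter

PARTITIONS = 8  # hypothetical number of physical partitions

def partition_for(key: str) -> int:
    # Stand-in for DynamoDB's internal hash: the partition key's hash
    # alone decides which physical partition receives the item.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % PARTITIONS

# Distinct partition key values (our base table) spread the load ...
base = Counter(partition_for(f"order-{i}") for i in range(10_000))
# ... while a single shared value (our index) hits one partition only.
index = Counter(partition_for("shared-value") for _ in range(10_000))

print(len(base), len(index))  # 8 1 -> every partition used vs. one hot partition
```

Since provisioned throughput is divided evenly across partitions, that one hot partition only ever sees a fraction of the capacity you pay for, no matter how high auto scaling pushes the table’s totals.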
In short: a very badly chosen partition key on the index, combined with read and write capacity auto scaling operating independently, led to an index that simply couldn’t keep up with all the writes being performed on the base table. The nasty side effect was that the index was still being filled bit by bit in the background, yet there is no metric that shows how this process is faring. If anyone knows whether and how this “invisible process” can be further inspected or influenced, please let us know in the comments; that would be very helpful in getting a better understanding of DynamoDB’s background processes.
Our best solution to clean up this mess was simply to restore the table from a snapshot taken just before all of this started.
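The article doesn’t say which restore mechanism was used; assuming point-in-time recovery, a minimal sketch of such a restore with boto3 might look like the following. The table names and timestamp are made up, and the snippet only assembles the request parameters so it runs without AWS credentials:

```python
from datetime import datetime, timezone

# Hypothetical names and timestamp: restore the table to its state
# just before the incident began.
restore_params = {
    "SourceTableName": "orders",
    "TargetTableName": "orders-restored",
    "RestoreDateTime": datetime(2019, 6, 1, 12, 0, tzinfo=timezone.utc),
}
# With point-in-time recovery enabled on the source table, these are the
# keyword arguments for
#   boto3.client("dynamodb").restore_table_to_point_in_time(**restore_params)
print(sorted(restore_params))
```

Note that DynamoDB always restores into a new table (here the assumed name `orders-restored`), so the application has to be repointed afterwards; an on-demand backup restored via `restore_table_from_backup` works the same way.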
Although we experienced quite a few problems when all of this happened, it was also the perfect case to learn from. I think it’s good to close this case by summing up some of the key things we have learned from it.
And with the biggest lessons learned summarized, that closes this case. We strive never to experience problems on such a large scale again, and we hope you never have to either. Altogether, this case has been rich in learning moments for us and will make us better Serverless developers in the long run. We hope you enjoyed the articles; if you’re left with any questions or remarks, feel free to ask them.
Your Worst-case Serverless Scenario Part III: The Invisible Process was originally published in Levarne Cloud Software Services on Medium.