
Your Worst-case Serverless Scenario Part I: Invocation Hell

Niels van Bree


How one mistake turned 100k records into 56 million records, caused multiple parts of a production environment to fail for hours, and what we learned from it.

Serverless applications have been booming over the last few years. At Levarne, we’re serverless advocates too, having experienced the great advantages firsthand. Since it’s an ever-developing field, there is still much to learn every single day. And sometimes, you have to learn the hard way. That’s exactly what happened to us a few weeks ago.

Although it was very painful in the moment, we learned a ton from this case in terms of architectural design and the inner workings of a few AWS services. In this three-part series, we will tell you what problems we faced, what the causes were, how we solved them and what we learned along the way. If you read until the end, you will learn the same lessons we did, without having to experience that same painful moment.

In this first part, we will focus on the problems we faced, explain how we worked through them and cover some interesting technical details of AWS Lambda invocations.

“Houston, We Have A Problem…”

This story started a few weeks back, at the beginning of March 2020, when we received a phone call from one of our customers. According to them, multiple parts of the application had suddenly stopped working. They gave some examples of things that went wrong to help us with the investigation. Because the errors were business-critical, we immediately started investigating what exactly was going on. First, we verified the customer’s claims by looking at services like AWS CloudWatch, X-Ray, Lambda and DynamoDB. While working through the issues, we found some more problems related to the same root cause. Below, we have summarised all of these issues to give you a complete picture of everything we were facing.

Issues We Initially Found

  • One part of the application always failed. This had to do with a DynamoDB table that couldn’t be read anymore. Since that table normally doesn’t receive many reads, finding the cause of this problem would be easy through X-Ray and CloudWatch.
  • Multiple parts of the application sometimes failed. We couldn’t see right away why this was happening or find a connection between the services that were failing. Finding the root cause of this problem would probably take more time.

Issues We Found Later On

  • One DynamoDB table, let’s call it the origin table, should contain around 100k records, but somehow consisted of almost 56 million records! That’s a staggering amount compared to what it should be. Although we had already found the piece of code that caused the previous problems and knew it was responsible for this figure as well, we didn’t understand this one right away. A more thorough look at the offending code and a discussion amongst our engineers was needed.
  • This same origin table was readable again after everything cooled down, but was not writable anymore. We had seen behavior like this before on a DynamoDB table, but it normally resolved a lot sooner. The enormous amount of almost 56 million records was most likely causing the extra delay.

Let’s Start Easy: A Simple Coding Mistake

Luckily for us, we had a good starting point with the first issue we described: one part of the application always failed. Since we had already observed that this was caused by a DynamoDB table receiving far more reads than usual, we could look up through X-Ray which Lambda function was performing these reads. This Lambda function was essentially responsible for inserting a variable number of items into a database, based on a user’s input (it could be 1 item, but it could go up to roughly 250k items). Because this Lambda can be invoked through API Gateway and might run longer than API Gateway’s integration timeout of roughly 30 seconds, we had designed this Lambda to invoke itself if needed. However, on startup, it needed to read some data from a DynamoDB table. This read was unnecessary after the function invoked itself, but due to a coding mistake, it was still being performed on every self-invocation. You may already be thinking: “but hey, just one extra read? Does this really cause such a big problem?” No, it doesn’t, but it turned out that the function was being invoked far more times than it should have been. We’ll get into this in more detail in the next section, but for now it’s good to know that this happened.

To resolve this issue, we fixed the coding mistake by checking for an event path, which is only present when the function is invoked through API Gateway; in this setup, the API Gateway invocation is always the first one in the chain.
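As an illustration, here is a minimal sketch of that pattern in a Node.js (TypeScript) handler. It is not our actual code: names like insert-items-function, origin-table and insertChunk are hypothetical, and the point is only to show how the presence of event.path can distinguish the initial API Gateway invocation from later self-invocations.

```typescript
import { DynamoDB, Lambda } from "aws-sdk";

const dynamo = new DynamoDB.DocumentClient();
const lambda = new Lambda();

// Hypothetical stand-in for the real batch-insert logic: writes one chunk
// of items and returns whatever still needs to be written.
async function insertChunk(items: unknown[]): Promise<unknown[]> {
  // ... BatchWrite a slice of `items` to the target table here ...
  return items.slice(25);
}

export const handler = async (event: any) => {
  // Only API Gateway proxy events carry a `path`, so its presence tells us
  // this is the first invocation in the chain.
  const isFirstInvocation = typeof event.path === "string";

  if (isFirstInvocation) {
    // The startup read that should happen once per user request. The bug was
    // that it also ran on every self-invocation.
    await dynamo
      .get({ TableName: "origin-table", Key: { id: "some-config-id" } }) // hypothetical key
      .promise();
  }

  const items: unknown[] = isFirstInvocation
    ? JSON.parse(event.body ?? "[]")
    : event.items ?? [];

  const remaining = await insertChunk(items);

  if (remaining.length > 0) {
    // Hand the rest to a fresh execution. `path` is not forwarded, so the
    // next invocation skips the startup read.
    await lambda
      .invoke({
        FunctionName: process.env.AWS_LAMBDA_FUNCTION_NAME ?? "insert-items-function",
        InvocationType: "Event",
        Payload: JSON.stringify({ items: remaining }),
      })
      .promise();
  }

  return { statusCode: 202, body: JSON.stringify({ accepted: items.length }) };
};
```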

In case you were wondering: our DynamoDB table was using provisioned capacity with auto scaling, but because of the excessive number of invocations of this function, it hit its maximum provisioned capacity fairly quickly.
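For context, a table set up that way looks roughly like the following AWS CDK (TypeScript) sketch. The numbers are made up; the relevant point is that auto scaling only grows read capacity up to maxCapacity, so a runaway caller can still exhaust it.

```typescript
import { App, Stack } from "aws-cdk-lib";
import * as dynamodb from "aws-cdk-lib/aws-dynamodb";

const app = new App();
const stack = new Stack(app, "OriginTableStack");

const table = new dynamodb.Table(stack, "OriginTable", {
  partitionKey: { name: "id", type: dynamodb.AttributeType.STRING },
  billingMode: dynamodb.BillingMode.PROVISIONED,
  readCapacity: 5,
  writeCapacity: 5,
});

// Auto scaling adjusts capacity between min and max based on utilisation,
// but it cannot go beyond maxCapacity, and it reacts with some delay.
table
  .autoScaleReadCapacity({ minCapacity: 5, maxCapacity: 100 })
  .scaleOnUtilization({ targetUtilizationPercent: 70 });
```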

(Concurrent) Invocation Hell

Handling the first issue immediately gave us a good look at the broader picture of what was causing all this trouble. Remember that our second issue was that multiple parts of the application sometimes failed and that we couldn’t find a relation between these parts. Now that we knew that the same Lambda function behind the previous issue was being invoked far more often than it should be, we suspected that this was causing problems with invoking other Lambda functions, because they would most likely hit the concurrent Lambda execution limit. Because this limit is regional and per AWS account, once it’s hit, no more Lambda execution contexts can be spun up unless existing execution contexts are removed. As it turned out, this limit was indeed hit that day: not only was this Lambda function being invoked more frequently, it also kept its execution contexts alive longer than usual because of DynamoDB timeouts. As for why the timeouts occurred, we’ll get into that in more detail in the next sections.
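One way to keep a single noisy function from draining the shared regional pool is to give it reserved concurrency. This wasn’t part of our fix at the time, but as an illustration, the cap can be set with a single SDK call (the function name and limit below are hypothetical):

```typescript
import { Lambda } from "aws-sdk";

const lambda = new Lambda();

// Reserve (and thereby cap) 50 concurrent executions for one function, so a
// runaway chain of self-invocations can never starve the rest of the account.
async function capConcurrency(functionName: string): Promise<void> {
  await lambda
    .putFunctionConcurrency({
      FunctionName: functionName,
      ReservedConcurrentExecutions: 50,
    })
    .promise();
}

capConcurrency("insert-items-function").catch(console.error);
```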

Confusion And Cooling Down

Okay, good, so we found the cause. How do we fix it? Well, initially we were quite confused: even if these Lambdas had been invoked more than they should have, they were now hitting a limit and erroring, so according to our code the chain should stop. But clearly that wasn’t happening, because we kept seeing more invocations of this function, and according to the logs they were not directly invoked by a user. Since the business was losing money every minute this problem continued and this particular function wasn’t business-critical, we considered removing it from the stack entirely and redeploying it later. However, shortly after that, the number of invocations cooled down and it seemed as if the chain of invocations had come to an end.

More Confusion And Heating Up (again)

After a call with the customer, it looked like everything was back to normal, so at least the pressure of the customer losing money was gone and we could focus on why exactly the function had been invoked more often than we expected, and on redesigning it in a better way. Nonetheless, not much later we received another phone call from our customer, stating that they were experiencing the same problems again. But wait… the function hadn’t been invoked again by the customer; it was somehow repeatedly invoking itself, and in even higher volumes!

Okay, so… how was this possible? The chain seemed to have cooled down at first, but now it looked like another chain had started, while the logs clearly showed that no user action had been taken. Levarne’s most experienced Cloud Architect was able to point out that the function was invoking itself with the Event InvocationType, and that with this type of invocation, Lambda doesn’t wait for a response but handles the invocation itself. If an error is thrown during the invocation, Lambda puts it on a queue and retries it after a while, up to 2 times. And of course, because the retries happen after some period of time and the functions had errored at roughly the same time during the first wave, the second wave was caused by all these retries happening almost simultaneously. Realising this, we had no other option than to remove the Lambda function from the stack until we found a long-term solution to this problem.
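Lambda’s retry behaviour for asynchronous invocations has been configurable per function since late 2019. We’ll get to how we eventually dealt with this in the next parts, but as a sketch, dialing the retries down to zero looks like this (again with a hypothetical function name):

```typescript
import { Lambda } from "aws-sdk";

const lambda = new Lambda();

// Asynchronous ("Event") invocations that fail are retried by Lambda up to
// two times by default. Setting MaximumRetryAttempts to 0 switches that off,
// which would have prevented a delayed second wave like the one we saw.
async function disableAsyncRetries(functionName: string): Promise<void> {
  await lambda
    .putFunctionEventInvokeConfig({
      FunctionName: functionName,
      MaximumRetryAttempts: 0,
    })
    .promise();
}

disableAsyncRetries("insert-items-function").catch(console.error);
```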

That’s enough information for now. In part II, we will explain how 100k records turned into 56 million records, diving deeper into the Event InvocationType and retry patterns from this article and adding some intriguing DynamoDB behaviour as well.

Visuals in this post have been made by David Kempenaar.