My last couple posts have been about developing and designing cloud based applications. In the What Are Microservices blog post I wrote about the importance of having small shippable applications that can easily scale depending on demand and are decoupled from other services to allow for faster development and independent deployments. In A Tangled Mess we explored how to decouple the systems even more by publishing events as they occur and having other systems react in appropriate ways.

So this is great, right? We have independent systems with independent data flows so we’re all good, done, no problems. Wrong.


Often times we will want to know the current state of a system but in a distributed architecture with distributed data how can we know if the state of something is correct? What happens if something fails? How do you rollback?

In the good old days in our handy dandy Monolith we would say, “Easy, just roll back your transaction”. But you can’t do that on a distributed system who’s in charge of the roll back? How do you communicate that something went wrong? It’s not as easy as it sounds and requires a lot of coordination amongst all of your systems. In a cloud architecture there aren’t transactions because often there is more than one database. So you need to figure out how to control the dataflow or how to make it correct itself if something goes wrong.

Two Phase Commit

The first approach that many can think of is Two Phase Commit where a calling service will make several requests to various services saying, “are you okay if I do this?” and wait for the responses to say, “yeah that works for me” (voting phase) then based on the responses makes a decision to commit call to the other services. In this scenario the first phase of commit prepares all of the other services to commit. This can be done by locking up resources or creating a “pending” phase on their systems. The second phase is the actual commit based on the caller’s request.

The calling method is a coordinator of all of the other systems in order to create the appropriate data and controls the flow. This is a fairly tightly coupled system and requires a lot of coordination.

A simple example of this would be to text each one of your friends individually to find out if they would be okay seeing The Foo Movie. Each would return a response saying yes or no. If everyone says yes then you text back each one of your friends and tell them to meet you at the movies at 8 and we’ll be seeing The Foo Movie.

Obviously this is time consuming and everyone is waiting for the actual response. What happens when one friend doesn’t want to see The Foo Movie because it stars Dennis Quaid, and he doesn’t like Dennis Quaid (I mean who does) and he’d much rather see The Bar Movie starring his talented brother Randy. Then the process starts all over again or you decide that no one is going to see a movie.

Or what if you, the coordinator, loses the phone in the middle of this process? People may not get the confirmation message and delay all their plans for the evening because they haven’t heard from you.

From a technology standpoint this process obviously doesn’t scale. In the worst case scenario you are making O(N^2) messages and have a single point of failure in the coordinator. The time it actually takes to complete the message is based on the slowest service in the chain which could delay the whole system.


So let's take a different perspective on how to handle the overall state by revisiting an Event Driven Approach. When an event is sent a service can either act or not act on it. If you are a coordinating system you may want to post an event back to let the calling service know you got the message and have acted on it. This action can either be a success or a failure and the calling service can determine what it wants to do next.

A Saga can work in a similar way. The definition of a saga is a series of requests (T) paired with compensating requests (C). A request should always have a compensating request that essentially undoes the previous request. So if I make a call to a service and it begins a saga that has chain of requests T1, T2, T3….TN it will have a compensating request series of C1, C2, C3 … CN such that if one request fails we can unroll the other actions by executing the corresponding compensating request.

A naive approach would be to imagine Service A calling Service B that calls Service C and C throws a 500 to which B rolls back whatever it did and A does the same.

However if you look from an Event Driven approach all of these calls can be made asynchronously through an event. If an event were to fail the saga originator can request rollbacks on all of the other systems or act however it wants. In the same way the consuming applications can decide to act or not act on the rollback or handle each state in a different way.

For example, if a service announces a new account creation and an email service notices the event and sends a welcome email only to find out that it needs to rollback because the payment wasn’t processed it can’t take back the email. Instead it can act differently maybe sending a follow up email about the failed payment and therefore the account isn’t active.

This is obviously more complicated because you are creating extra methods to reverse work that has already been done which in some cases actually can’t be undone (sending an email). You also have the situation of creating a bad state by having the original caller going down in the middle of all of these requests meaning that a rollback was never called. Or if one of the calling systems fails it may not receive the failure message. However this can be resolved by ensuring a message is delivered no matter what (unlimited retries).

The trade off is the same as all other microservice design patterns in which you have complex designs to reduce coupling. In the end you can have faster final state that are independent of other systems.

Bringing it All Together

The biggest obstacle to overcome in a cloud based system is the inevitability of failure. Systems need to be built in ways that expect to fail and in the same way we need to make sure that we can make sure our data is in a state that we want. In order for this to be successful we need the data to be “self-healing” and in doing so allow the systems to work independently. This is obviously more complex and requires planning, thought, and proper documentation.

Implementing this saga process can be done in many different ways. It could be done through a series of REST calls where the calling function understands if a rollback is needed based on responses from the systems. This would be different than the two phase commit example in that no downstream system is in a pending state but just knows how to rollback if need be.

Alternatively if you are using a serverless approach you may develop a “state-machine” where on a failure another serverless function is called that triggers a rollback of the systems. In the above saga definitions of T requests and C compensating requests as serverless functions and if Tn fails it will call C(n-1) to begin the undoing process.

If you want to move to an event driven approach you can do as we described above and publish events that notify all systems of a failure and can begin to undo whatever work needs to be undone. The complication here is making sure that the system knows it should rollback from failure and not just call out another event that other systems may want to react to (e.g. delete because of failure versus a regular delete).

Again, this all seems like a layer of added complexity in an already complex system. However, cloud based systems need to have a level of complexity so they can function independently. By functioning independently we get resilient systems. Removing a single point of failure is key to a successful cloud based system.