From Akka in Action, Second Edition by Francisco Lopez-Sancho 

This article discusses fault tolerance.


Take 35% off Akka in Action, Second Edition by entering fccabraham2 into the discount code box at checkout at manning.com.


What fault tolerance is (and what it isn’t)

Let’s start with a definition of what we mean when we say a system is fault tolerant, and why you’d write code to embrace the notion of failure. In an ideal world, a system is always available and can guarantee that every action it undertakes succeeds. The only two paths to this ideal are using components that can never fail or accounting for every possible fault by providing a recovery action that is itself assured of success. In most architectures, what you have instead is a catch-all mechanism that terminates as soon as an uncaught failure arises. Even if an application attempts to provide recovery strategies, testing them is hard, and being sure that the recovery strategies themselves work adds another layer of complexity. In the procedural world, each attempt to do something returns a code that is checked against a list of possible faults. Exception handling has become a fixture of modern languages, promising a less onerous path to providing the various required means of recovery. Although it has succeeded in yielding code that doesn’t need fault checks on every line, the propagation of faults to ready handlers hasn’t significantly improved.

The idea of a system which is free of faults sounds great in theory, but the sad fact is that building one which is also highly available and distributed is impossible for any non-trivial system. The main reason is that large parts of any non-trivial system aren’t under your control, and these parts can break. Then there’s the prevalent problem of responsibility: as collaborators interact, often using shared components, it’s not clear who is responsible for which possible faults. A good example of a potentially unavailable resource is the network: it can go away at any time or be partly available, and if you want to continue operating, you’ll have to find some other way to keep communicating, or perhaps disable communication for some time. You might depend on third-party services that can misbehave, fail, or be sporadically unavailable. The servers your software runs on can fail or be unavailable, or even experience total hardware failure. You obviously can’t magically make a server rise from its ashes or automatically fix a broken disk to guarantee writing to it. This is why “let it crash” was born in the rack-and-stack world of the telcos, where failed machines were common enough to make availability goals impossible without a plan that accounted for them.

Because you can’t prevent all failures from happening, you’ll have to be prepared to adopt a strategy, keeping the following in mind:

  • Things break. The system needs to be fault tolerant to stay available and continue to run. Recoverable faults shouldn’t trigger catastrophic failures.
  • In some cases, it’s acceptable if only the most important features of the system stay available as long as possible, while the failing parts are stopped and cut off from the system so they can’t interfere with the rest of it and produce unpredictable results.
  • In other cases, certain components are important enough that they need active backups (probably on a different server or using different resources) that can kick in when the main component fails, to quickly remedy the unavailability.
  • A failure in certain parts of the system shouldn’t crash the entire system, and you need a way to isolate particular failures that you can deal with later.

The Akka toolkit doesn’t include a fault tolerance silver bullet. You’ll still need to handle specific failures, but you can do it in a cleaner, more application-specific way. The Akka features described in table 1 enable you to build the fault tolerant behavior you need.

Table 1. Available fault avoidance strategies

  • Fault containment or isolation: A fault should be contained within a part of the system and not escalate to a total crash.
  • Structure: Isolating a faulty component means that some structure needs to exist to isolate it from the rest of the system; the system needs a defined structure in which active parts can be isolated.
  • Redundancy: A backup component should be able to take over when a component fails.
  • Replacement: If a faulty component can be isolated, you can also replace it in the structure. The other parts of the system should be able to communicate with the replacement just as they did with the failed component.
  • Reboot: If a component gets into an incorrect state, you need the ability to get it back to a defined initial state. The incorrect state might be the reason for the fault, and it might not be possible to predict all the incorrect states the component can get into because of dependencies outside your control.
  • Component lifecycle: A faulty component needs to be isolated, and if it can’t recover, it should be terminated and removed from the system or re-initialized with a correct starting state. A defined lifecycle needs to exist to start, restart, and terminate the component.
  • Suspend: When a component fails, you’d like all calls to it to be suspended until the component is fixed or replaced, so that when it is, the new component can continue the work without missing a beat. The call that was being handled at the time of failure should not disappear either: it could be critical to your recovery, and it might contain information that is critical to understanding why the component failed. You might want to retry the call once you’re sure there was another reason for the fault.
  • Separation of concerns: It helps greatly if the fault-recovery code can be separated from the normal processing code. Fault recovery is a cross-cutting concern in the normal flow. A clear separation between normal flow and recovery flow simplifies the work that needs to be done, and changing the way the application recovers from faults is simpler if you’ve achieved this clean separation.

“But wait a minute,” you might say, “Why can’t we use plain old objects and exceptions to recover from failures?” Normally, exceptions are used to back out of a series of actions to prevent an inconsistent state, rather than to recover from a failure in the sense we’ve discussed thus far. Let’s see how hard it is to add fault recovery using exception handling and plain old objects in the next section.

Note: In this article we aren’t covering the tooling for recovering from catastrophic events like lightning strikes. If a computer bursts into flames, you need to have clustering in place.

Plain old objects and exceptions

Let’s look at an example of an application that receives logs from multiple threads, “parses” interesting information out of the files into row objects, and writes these rows into some database. A file watcher process keeps track of added files and notifies a number of threads, which process the new files. Figure 1 gives an overview of the application and highlights the part that we’ll zoom in on (“in scope”).


Figure 1. Process logs application


If the database connection breaks, we want to be able to create a new connection to another database and continue writing, instead of backing out. If the connection starts to malfunction, we might want to shut it down so that no part of the application uses it anymore. In some cases, we’ll want to reboot the connection, hopefully to get rid of some temporary bad state in it. Pseudo code is used to illustrate where the potential problem areas are. We’ll look at the case where we want to get a new connection to the same database using standard exception handling.

First, we set up all objects which are used from the threads. After setup, they’ll be used to process the new files that the file watcher finds. We set up a database writer that uses a connection. Figure 2 shows how the writer is created.


Figure 2. Create a writer


The dependencies for the writer are passed to the constructor as you’d expect. The database factory settings, including the different URLs, are passed in from the thread that creates the writer. Next, we set up some log processors; each gets a reference to a writer to store rows, as shown in figure 3.


Figure 3. Create log processor

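To make the setup concrete, here is a minimal sketch in plain Scala of the objects the figures describe. The names and signatures (Row, Connection, DatabaseFactory, DbWriter, LogProcessor) are assumptions for illustration, not the book’s exact pseudo code:

```scala
import scala.io.Source

// Illustrative domain types, assumed for this sketch
final case class Row(time: Long, message: String)

final class DbBrokenConnectionException(msg: String) extends RuntimeException(msg)

class Connection(url: String) {
  // Writes a row; may throw DbBrokenConnectionException if the database is unreachable
  def write(row: Row): Unit = ???
}

class DatabaseFactory(urls: List[String]) {
  def createConnection(): Connection = new Connection(urls.head)
}

// The writer's dependencies are passed to the constructor
class DbWriter(connection: Connection) {
  def write(row: Row): Unit = connection.write(row)
}

// Each log processor gets a reference to a writer to store rows
class LogProcessor(writer: DbWriter) {
  def process(file: Source): Unit =
    file.getLines().foreach { line =>
      writer.write(parse(line)) // an exception thrown here travels up the call stack
    }

  private def parse(line: String): Row = ???
}
```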

Figure 4 shows how the objects call each other in this example application.

The flow shown in figure 3 gets called from many threads to simultaneously process files found by the file watcher. Figure 4 shows a call stack where a DbBrokenConnectionException is thrown, which indicates that we should switch to another connection. The details of every method are omitted; the diagram only shows where an object eventually calls another object.


Figure 4. Call stack diagram


Instead of throwing the exception up the stack, we’d like to recover from the DbBrokenConnectionException and replace the broken connection with a working one. The first problem we face is that it’s hard to add the code to recover the connection in a way that doesn’t break the design. Also, we don’t have enough information to re-create the connection: we don’t know which lines in the file have already been processed successfully and which line was being processed when the exception occurred. Figure 5 shows how difficult it is to communicate failures between different parts of the call stack.


Figure 5. Call stack as you process log files


Making both the processed lines and the connection information available to all objects breaks our simple design and violates basic best practices like encapsulation, inversion of control, and single responsibility, to name a few. (Good luck at the next code peer review with your clean-coding colleagues!) We want the faulty component replaced. Adding recovery code directly into the exception handling entangles the functionality of processing log files with database connection recovery logic. Even if we find a spot to re-create the connection, we’d have to be careful that other threads don’t get to use the faulty connection while we’re replacing it with a new one, because otherwise some rows could be lost.
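To see that entanglement on the page, here is a hedged sketch, reusing the illustrative types from the earlier listing, of what recovery-in-place would look like if we grafted it onto the log processor:

```scala
import scala.io.Source

// Sketch only: recovery logic grafted into the processing code. Database-connection
// concerns now leak into the processor, and the mutable writer field needs locking
// as soon as several threads share one instance.
class RecoveringLogProcessor(dbFactory: DatabaseFactory) {
  private var writer = new DbWriter(dbFactory.createConnection())

  def process(file: Source): Unit =
    file.getLines().foreach { line =>
      val row = parse(line)
      try writer.write(row)
      catch {
        case _: DbBrokenConnectionException =>
          // Replace the broken connection and retry the current row; other threads
          // holding a reference to the old writer still have to find out somehow
          writer = new DbWriter(dbFactory.createConnection())
          writer.write(row)
      }
    }

  private def parse(line: String): Row = ???
}
```

Even in this tiny sketch the processor now knows about connection factories, retries, and shared mutable state, none of which has anything to do with parsing log lines.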

Also, communicating exceptions between threads isn’t a standard feature; you’ll have to build this yourself, which isn’t a trivial thing to do. Let’s look at the fault tolerance requirements to see if this approach even stands a chance:

  • Fault isolation—Isolation is made difficult by the fact that many threads can throw exceptions at the same time. You’ll have to add some kind of locking mechanism. It’s hard to remove the faulty connection from the chain of objects: the application needs to be rewritten to get this to work. There’s no standard support for cutting off use of the connection in the future; this needs to be built into the objects manually, with some level of indirection.
  • Structure—The structure that exists between objects is simple and direct. Every object possibly refers to other objects forming a graph; it isn’t possible to replace an object in the graph at runtime. You need to create a more involved structure yourself (again, with a level of indirection between the objects).
  • Redundancy—When an exception is thrown, it goes up the call stack. You might miss the context for making the decision of which redundant component to use, or lose the context of which input data to continue with, as seen in the preceding example.
  • Replacement—No default strategy is in place to replace an object in a call stack; you’ll have to find a way to do it yourself. Dependency injection frameworks provide some features for this, but if any object refers directly to the old instance instead of through the level of indirection, you’re in trouble. If you intend to change an object in place, you’d better make sure it works for multithreaded access.
  • Reboot—Similar to replacement, getting an object back to an initial state isn’t automatically supported and takes another level of indirection that you’ll have to build. All the dependencies of the object must be reintroduced as well. If these dependencies also need to be rebooted (let’s say the log processor can also throw some recoverable error), things can get quite complicated with regard to ordering.
  • Component lifecycle—An object only exists from the moment it’s constructed until it’s garbage collected and removed from memory. Any other lifecycle mechanism is something you’ll have to build yourself.
  • Suspend—The input data or some of its context is lost or unavailable when you catch an exception and throw it up the stack. You’ll have to build something yourself to buffer the incoming calls as long as the error is unresolved. If the code is called from many threads, you’ll need to add locks to prevent multiple exceptions from happening at the same time. And you’ll need to find a way to store the associated input data so you can retry later.
  • Separation of concerns—The exception-handling code is interwoven with the processing code and can’t be defined independently of the processing code.

This doesn’t look promising: getting everything to work correctly is going to be complex and a real pain. It looks like some fundamental features are missing for adding fault tolerance to our application in an easy way:

  • Re-creating objects and their dependencies and replacing these in the application structure isn’t available as a first-class feature.
  • Objects communicate with each other directly, and it’s hard to isolate them.
  • The fault-recovery code and the functional code are tangled up with each other.

Luckily we have a simpler solution. You’ve already seen some of the actor features that can help simplify these problems. Actors can be (re-)created from the functions that define their Behavior, and they communicate through actor references instead of direct references. In the next section, we look at how actors provide a way to untangle the functional code from the fault-recovery code, and how the actor lifecycle makes it possible to suspend and restart actors (without invoking the wrath of the concurrency gods) in the course of recovering from faults.
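As a reminder of what that looks like, here is a minimal Akka Typed sketch (the names are illustrative, not the book’s code) of the writer expressed as a Behavior; the rest of the system only ever holds an ActorRef to it:

```scala
import akka.actor.typed.scaladsl.Behaviors
import akka.actor.typed.{ActorSystem, Behavior}

object DbWriter {
  final case class Line(text: String)

  // The behavior is produced by a function, so a fresh actor instance can
  // always be (re-)created from it later
  def apply(): Behavior[Line] =
    Behaviors.receiveMessage { line =>
      // parse(line.text) and write the resulting row to the database would go here
      Behaviors.same
    }
}

object Main extends App {
  // Callers only ever see an ActorRef[DbWriter.Line], never the instance behind it
  val writer: ActorSystem[DbWriter.Line] = ActorSystem(DbWriter(), "db-writer")
  writer ! DbWriter.Line("ERROR 2015-10-21 connection timed out")
}
```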

Wrap it up and Let it crash

In the previous section, we showed that building a fault-tolerant application with plain old objects and exception handling is a complex task. Let’s look at how actors simplify this task. What should happen when an Actor processes a message and encounters an exception? We already discussed why we don’t want to graft recovery code into the operational flow, and catching the exception where the business logic resides isn’t an option.

Instead of using one flow to handle both normal code and recovery code, an Akka Behavior provides composition that lets you create and then mix two different flows: one for the normal logic and one for the fault recovery logic. The normal flow consists of actors that handle normal messages, the happy path; the recovery flow consists of another behavior that wraps the normal one and deals only with exceptions. Figure 6 shows a supervisor (supervising behavior) concerned only with recovery actions and an actor taking care of the domain logic. The two are then mixed by wrapping the actor’s behavior inside the supervising one.


Figure 6. Normal and recovery flow


Instead of catching exceptions in an actor, we’ll let the actor crash. The actor code for handling messages contains only happy-path processing logic and no error handling or fault recovery logic, and it’s effectively independent of the recovery process, which keeps things much clearer. The mailbox for a failing actor is suspended until the supervising behavior in the recovery flow has decided what to do with the exception. Messages can still be sent to it, though; they’ll be stashed.

A supervisor doesn’t simply “catch exceptions”; rather, it decides what should happen to the failing actors that it supervises, based on the exception. The supervising behavior doesn’t try to fix the actor or its state. It renders a judgment on how to recover and then triggers the corresponding strategy. The supervisor has three options when deciding what to do with the actor (a short code sketch follows the list):

  • Restart—The actor must be re-created. After it’s restarted (or rebooted, if you will), the actor continues to process messages. Because the rest of the application uses an ActorRef[T] to communicate with the actor, it keeps using the same reference, and underneath it, the same mailbox, from which the new actor instance automatically keeps reading the next messages.
  • Resume—The same actor instance should continue to process messages; the exception that crashes the actor is only logged by the supervisor.
  • Stop—The actor must be terminated. It no longer takes part in processing messages.
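A rough sketch of how this looks in code, assuming the DbWriter behavior and DbBrokenConnectionException from the earlier sketches: the recovery flow is just another behavior wrapped around the normal one.

```scala
import akka.actor.typed.{Behavior, SupervisorStrategy}
import akka.actor.typed.scaladsl.Behaviors

object SupervisedDbWriter {
  // Normal flow: DbWriter() contains only happy-path logic, no try/catch.
  // Recovery flow: the wrapping behavior decides what happens on a failure.
  def apply(): Behavior[DbWriter.Line] =
    Behaviors
      .supervise(DbWriter())
      .onFailure[DbBrokenConnectionException](SupervisorStrategy.restart)
  // Swapping in SupervisorStrategy.resume or SupervisorStrategy.stop expresses the
  // other two options; without any wrapper, a typed actor is stopped when it fails.
}
```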

We’ll need to take some special steps to recover the failed message, which we’ll discuss in detail later when we talk about how to implement a restart. Suffice it to say that in most cases, you don’t want to reprocess a message, because it probably caused the error in the first place. An example of this is the LogProcessor encountering a corrupt file or an unparsable line in it: reprocessing corrupt files could result in what’s called a poisoned mailbox—no other message will ever get processed because the corrupting message fails over and over again. For this reason, Akka chooses not to keep the failing message in the mailbox after an exception is raised, but there’s a way to do this yourself if you’re absolutely sure that the message didn’t cause the error, which we’ll discuss later.

Let’s recap the benefits of the let-it-crash approach:

  • Fault isolation—A supervisor can decide to terminate an actor. The actor is removed from the actor system.
  • Structure—The actor system hierarchy of actor references makes it possible to replace actor instances without other actors being affected.
  • Redundancy—An actor can be replaced by another. In the example of the broken database connection, the fresh actor instance could connect to a different database. The supervisor could also decide to stop the faulty actor and create another kind of actor instead. Another option is to route messages in a load-balanced fashion to many actors.
  • Replacement—An actor can always be re-created from its Behavior. A supervisor can decide to replace a faulty actor instance with a fresh one, without having to know any of the details for re-creating the actor.
  • Reboot—This can be done through a restart.
  • Component lifecycle—An actor is an active component. It can be started, stopped, and restarted. In the next section, we’ll go into the details of how the actor goes through its lifecycle.
  • Suspend—When an actor crashes, its mailbox is suspended until the supervisor decides what should happen with the actor.
  • Separation of concerns—The normal actor message-processing flow and the supervision fault-recovery flow are orthogonal; they can be defined and can evolve completely independently of each other.

Bear in mind that the most dangerous actors (actors that are most likely to crash) should be as low down the hierarchy as possible. Faults that occur far down the hierarchy can be handled or monitored by more actors than a fault that occurs high up in the hierarchy. When a fault occurs in the top level of the actor system, it could restart all the top-level actors or even shut down the actor system.
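Here is a hedged sketch of that idea, reusing the illustrative behaviors from above: the risky, database-facing actor sits at the bottom of the hierarchy as a supervised child, while the parent above it holds almost no logic and is therefore unlikely to crash.

```scala
import akka.actor.typed.{ActorSystem, Behavior, SupervisorStrategy}
import akka.actor.typed.scaladsl.Behaviors

object LogProcessingGuardian {
  def apply(): Behavior[Nothing] =
    Behaviors.setup[Nothing] { context =>
      // The dangerous, database-facing actor lives low in the hierarchy,
      // wrapped in its own supervision strategy
      val writer = context.spawn(
        Behaviors
          .supervise(DbWriter())
          .onFailure[DbBrokenConnectionException](SupervisorStrategy.restart),
        "db-writer"
      )
      writer ! DbWriter.Line("a parsed line to store")
      // The guardian itself holds almost no state or logic, so it's unlikely to fail
      Behaviors.empty
    }
}

object HierarchyMain extends App {
  ActorSystem[Nothing](LogProcessingGuardian(), "log-processing")
}
```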

If you want to learn more about the book, check it out on Manning’s liveBook platform here.