By Anthony Brown

Akka.Net provides several implementations of the core routing logic to allow for a variety of techniques for distributing messages through the router to its routees. These routers allow for a wide variety of behaviours which ultimately allow us to continue to build applications that remain responsive even under intense load. In this article, we’ll look at the routers included with Akka.Net and the advantages they provide.

Save 37% off Reactive Applications with Akka.NET with code fccbrown at manning.com.

 

Akka.Net routers can distribute a single message to several routees associated with the router, but this router only allows us to broadcast a message to the intended targets. In certain circumstances, this proves to be beneficial, such as those in which we want to parallelise workloads. In the case of a distributed load testing system, we want to ensure that we’re able to perform as many operations simultaneously as possible.

Random routing

The simplest possible approach to distributing a message through a router to a single node is to choose one of the routees randomly and then send the message through to it. Assuming the random number generator used is truly random, given enough time, it’ll generate a random distribution of targets which are chosen to receive messages.

Figure 1 – The random router selects a routee randomly for each message

To create a random router, we need to provide the number of routees it should create, like a broadcast router. The core difference between the two is that the random router only forwards the message onto a single routee, which is chosen at random. To create this type of router in code, we need to create an instance of the RandomPool class which we pass to the router configuration in Props.

Figure 2

The router can be configured using HOCON and the configuration is mostly the same as the broadcast pool, with the exception that we specify the name for the round robin pool being used. Once again, to configure this we need to supply the number of routees to be created by the router pool.

 

akka.actor.deployment {
  /smsgateway {
    router = random-pool
    nr-of-instances = 5
  }
}

 

Once this is deployed, we can see the effect on message distribution. If our random number generator is truly random, we should see all routees receive an equal number of messages. In the diagram below, we can see the number of messages each of the five routees received, if we sent a total of 10,000 messages. Whilst all routees received a similar number of messages, there’s still a difference of several hundred between the lowest number of messages processed and the highest number of messages processed. For small messages this effect may be negligible, but when dealing with messages which rely on large amounts of processing, we may be left with long time differences between the smallest queue finishing and the longest queue finishing.

Figure 3 – The random router may not evenly distribute messages to every routee

The random router is an incredibly simple router which can effectively distribute the messages it receives to many routees. Despite the simple nature of the router, it still manages to generate an even distribution of messages across the routees, but it’s ultimately dependent upon the performance of the platform’s random number generator. Because the random number generator generates a random sequence of numbers, it may be the case that only one number is generated for an extended period-of-time, putting a bottleneck on that individual routee, whilst the other routees have empty queues.

Round robin routing

The random router ultimately depends on the underlying random number generator’s performance when it comes to generating random numbers, but the potential for a bottleneck to occur exists when only one routee is receiving most of the messages. To counter this difficulty, we need a way to ensure that an even distribution of messages is spread across the routees of a router. For example, in the case where we have three routees, we want to ensure that nine messages are distributed, each in an even order. In the diagram below, we can see the order of messages in the mailbox for each routee. We see that the router sends the first message to the first routee, the second message to the second routee and the third message to the third routee before it wraps around and sends the fourth message to the first routee.

Figure 4 – The round robin router chooses the next routee as each message arrives and then starts again from the beginning

We can create a round robin pool by specifying the number of routees to be used by the router. When using the code-based approach, we need to create an instance of the RoundRobinPool class to provide the number of routees.

Figure 5

We can also choose to create the router using HOCON by specifying the number of routees to use within the router, as well as providing the name of the router type to use.

 

akka.actor.deployment {
  /smsgateway {
    router = round-robin-pool
    nr-of-instances = 5
  }
}

 

The round robin approach to message routing is incredibly simple and ensures that we get an even distribution of messages to all routees associated with a router. It’s particularly effective where we have a stateless actor that we use to communicate with an external service and we want to scale it out to multiple instances. By using a round robin router, we get a consistent throughput, assuming all routees can process a message at a similar rate.

This router is still only a best-effort router and we still might encounter problems when using it that prevent the round robin router from enabling a truly even distribution.

Shortest mailbox queue

Internally within Akka.Net every actor has a mailbox. Once we send a message to an actor, it gets appended to the end of the queue to be picked up at some point in the future by the processing component of an actor. Because an actor can only process a single message at a time, if it’s slow to process an individual message then the message queue can start to grow quickly. If we were to use this slow actor alongside a fast actor, the faster actor completes its messages and runs out of work to do whilst the slow actor might still have a large number of messages left to process.

Figure 6 – The shortest mailbox router routes messages to the routee with the shortest mailbox

To counter this problem, we can choose to use the shortest mailbox queue based router within Akka.Net. The concept behind this actor is explained within the name; when a message is sent through the router, it consults the routees to see which has the shortest message queue. The message is then automatically forwarded onto the actor with the smallest mailbox. This ensures that the actors which are processing messages more quickly can receive more messages than slower running actors.

To use a shortest mailbox router, we need to once again specify the number of routees to use within the router. It automatically handles creating these actors and processing based on mailbox size. To create the router in code, we need to create an instance of the SmallestMailboxPool class which is becomes the router instance to the Props for an actor.

Figure 7

The router can equally be created through a HOCON configuration, where we need to supply the name of the router, in this case within HOCON, its smallest-mailbox-pool, along with the number of instances to use.

akka.actor.deployment {
  /smssender {
    router = smallest-mailbox-pool
    nr-of-instances = 5
  }
}

The smallest mailbox router is useful in cases where we don’t necessarily know in advance how long it’s going to take to process a message. If we have a message with a payload that leads to a larger amount of work than others, the smallest mailbox router is an ideal candidate for reducing the impact of the message on subsequent messages. It’s far from a silver bullet, because we don’t know how long it’ll take to process a message until it’s processed. If we enqueue a message after a message that takes a long time to process, the enqueued message encounters a delay before it reaches the processing stage.

Consistent hashing

All the routers we’ve seen relied on selecting a random routee to process the message, but sometimes we want to ensure that messages with a common trait always get sent to the same target routee. Let’s take an example of building a simple key-value based datastore which persists its data to the filesystem upon which it’s running. In this case, the common trait between each message is the key that identifies the item in the database.

A similar scenario is when a unique means of identifying the target actor through the use of a sensor identification string is used to directly route new messages straight through to the actor. This allows us to have every sensor running in parallel without concern over whether we’d encounter any bottlenecks. Given these benefits, it might seem to be a natural fit because it ensures that we’re able to perform concurrent operations across large numbers of keys simultaneously, but it also comes with some downsides. The most notable of which is the need to ensure we’ve created an instance of an actor at the provided name before we’re able to communicate with it. Every time a request comes in for a given key, wemust ensure that an actor exists at that key’s address before sending a message to it. If the key doesn’t exist we’d need to create an actor instance and then send it a message. This then adds a lot of overhead to every request and introduces a significant amount of latency, which can ultimately lead to applications becoming unresponsive. We also need to have one actor per key and store that actor in memory, with relatively little overhead on an actor. It starts to add up at larger scales if we’re implementing a key value store, and we might see millions or even billions of keys and actors in memory, which then presents us with problems. The final difficulty is ensuring the isolation of actors; we saw that every actor should keep its state internal and not share it with any other actors. Because the data being persisted to disk for an actor is, by association, part of the actor’s internal state, it can’t share it with other actors in the system. This can impose a lot of resultant pressure on the file system where we end up with one file per key, and the potential for millions of tiny files on the file system.

Ultimately, we want to create a means of having a single actor responsible for a select portion of the available keys and repeat this with each other instance being responsible for a different portion of the key space. In order to do this, we could store each of the keyspace regions in a specific location which allows the router to automatically route the message based on what the lookup table says. This approach leads to a lot of required co-ordination to update the lookup table whenever a new key value pair’s added. Ideally we want a completely stateless router that allows us to select a given routee based on the message in a repeatable manner.


Figure 8 – The consistent hashing router directs every message with the same properties to the same actor

A simple approach to this is to calculate the hash of the message or a specific property within the message. This hash can be used to calculate the target routee for the message. A hash is calculated by passing the message through a hash function whose sole responsibility’s to map data down to a fixed size from an arbitrary length. In Akka.Net, the hash function maps the message property from whatever it’s data size is, which in the case of a key value datastore is a string for the key with variable length, down to a fixed length, which is the number of routees available to the router.


Figure 9

This forms the basis of the consistent hashing router within Akka.Net which computes a hash for the message before using the hash as a means of deciding which routee to send the message. We can create a consistent hashing router in code by creating an instance of the ConsistentHashingPool class. To do this, we need to specify the number of routees to use.

Figure 10

We also need to choose the correct property on our message such that all messages with that property in common reach the correct target. We have three possible options for how to manage that within Akka.Net. The least intrusive way, to either the routees or the message, is to create a hash mapping delegate at the point when we create the router. This takes in the message and returns the property to use as the thing to hash. If we’ve got a message defined for our key value datastore, then we create the actor instance and we can supply a hash mapping delegate by using the WithHashMapping method on the router, which supplies a new router with the mapping applied. In the code below, we have a common interface for the database-related operations which allows us to retrieve the key out of the key value pair easily. We use this interface as a means of retrieving the key for the given key value pair.

Figure 11

Because our messages use a common interface, we could also use IConsistentHashable interface directly, which allows us to specify what the hash key should be. The router checks to see if the message implements this interface and if it does, it uses this interface to retrieve the hash key. Using this approach relies on us editing the messages to rely upon an underlying implementation detail and may not be a valid option dependent upon where the messages originate from.

Figure 12

The final option is to wrap the messages sent to the router in an envelope that provides the internal message along with the hash. The router uses the hash to direct the message before forwarding on the original message stripped out of the envelope. To use the envelope, we create an instance of the ConsistentHashableEnvelope with the message and the hash key to use.

Figure 13

Once we’re able to calculate a hash for a given message, we can choose the routee to which we should send the message. Due to our hash function, we have a known fixed size for the possible output values, for example, it might generate a number in the range between 0 and 255 inclusive. We create a circle and place these possible values around the edge of it. We use the same hash function used for mapping keys to map the node identifers onto the ring. Given that we created our router with three routees which act as nodes in the consistent hashing router, we can position them onto the ring. The diagram below shows an example where we’ve placed a node on the ring. That node’s responsible for the hash values we encounter as we move clockwise around the ring until we reach the next node.

Figure 14  In a consistent hashing router, routees are placed around a ring and each routee’s responsible for a portion of the ring

 

Whilst consistent hashing helps us choose a target for a message in a simplistic way, we can also see some disadvantages using it. The most notable problem is caused by hashing each node. Some nodes are inevitably responsible for more keys than others, creating the potential for an increased load on one of the routees. In order to counter this problem, the consistent hashing router allows us to specify the number of virtual nodes per routee. This allows us to have five routees with one node each, or we can create five nodes with ten nodes. This provides us with a total of fifty nodes around the ring, which helps us ensure there’s a more even distribution of keys around the nodes. It’s important to note that the new nodes are entirely virtual and are only for calculations; we still have only five routees created at any time. By default, Akka.Net uses a value of ten for the virtual nodes factor, but if we want to, we’re able to change this by supplying a different value in the constructor when we create the router, as shown in the code below, where we use a virtual nodes factor of twenty.

Figure 15

As is the case with most routers, we’re able to create the router using the HOCON configuration rather than relying on storing these values in code. The values we need to set are the router name, the number of routees, as well as the virtual nodes factor. We’re still left needing to handle how to retrieve the hashing property with code using any of the techniques we saw above.

Figure 16

Whilst the router pool supports automatic resizing in the same way as other routers, we need to be more careful when using it. If we’re to add another node, it’d affect which node was queried for a given key. This change would mean we’re left in a situation where historical data isn’t able to be retrieved from the routees. Auto-resizing with the router is only useful if the routees are completely stateless. If the routees are stateful then auto resizing and the consistent hashing router should be avoided.

Consistent hashing routers are a great means of ensuring an even distribution of keys are handled by the routees. The consistent hashing based approach to stateful distribution is useful in many large projects, including several distributed NoSQL databases such as Amazon’s DynamoDB and Basho’s Riak, as well as the internals of some big data processing tools such as Hadoop’s MapReduce. By using it within Akka.Net, we’re able to ensure the router distributes messages with a common trait to the same actor repeatably, with minimal overhead required and no co-ordination within the router.

Scatter gather first completed

Most routers don’t prevent us from returning data, but they’re designed for scenarios where we’re likely to dispatch work without a result being returned. From time-to-time we want to use a request response based model to ensure that we’re able to get data out of a service. If we use this request-response based approach, our aim should be that the latency between sending a message and returning a value in response to this message is as low as possible.

With a single destination choice router, such as the round robin router, if the routee has a long queue we suffer from a long delay in reaching the processing stage. But if we choose a short queue, which happens to have several large messages within it, we’re still left with longer latencies. The easiest way to ensure the shortest possible latency is by sending the request to the potential candidates capable of processing the message. Whichever gets to the message first can send the response back to the sender. In order to model this, we could use the broadcast router to send the message to all routees.

By doing this we’re left with the problem of the routees replying to the actor awaiting a response. Because we’re interested in getting a result back, we want to minimise latency; the minimum latency is the first response it gets back. We want to ignore the other messages and ensure that it doesn’t end up filling the message queue of the original requesting actor. We need to ignore every subsequent message which is returned in response to a given message once we’ve received a first result.

Akka.Net provides a router designed specifically for this purpose, in the form of the scatter gather first completed router. The aim of this router is to distribute a message to its routees and wait for the first result. Once it receives that first result, it sends it to the original actor asking for a response. Every message it receives from it’s routees in response to its original request are ignored. This ensures that we get the shortest possible latency from the collection of routees, whilst also preventing the original requester from getting flooded with the same response multiple times. If no reply is received from the routees within a given timespan, it automatically sends a Failure message back to the original sender to notify them of the timeout.

Figure 17 – The scatter gather router broadcasts the message to every routee and then returns the first response

An example of a potential usage of this is in cases where we might have a database with a number of replicas, each of the routees is responsible for communicating with a single replica. When we want to retrieve a value from the database we query the actor which executes the request. We can get back the response from the first replica which replied and return that to the original actor requesting the data. This allows us to focus on using a database with the lowest latency.

As we’re dealing with actors with independent configurations, we’ll create a group router using the ScatterGatherFirstCompletedGroup class. We’re able to specify the maximum timeout before a timeout failure response is sent back to the original sender. In this case, we specify that if we don’t receive a response within 200 milliseconds from one of the database servers that we send a timeout failure.

Figure 18

We’re also able to create the router in configuration as well. Below, we create the same scatter gather router using HOCON. We specify the type of router to create as well as the standard number of routees. We also specify the timeout period in HOCON. When using times in HOCON, we’re able to specify the time through the use of units. In the example below we specify that we should wait 0.2 seconds. We can use other suffixes to represent other units of time such as minutes and milliseconds.

Figure 19

The scatter gather first complete router proves to be an incredibly useful router in cases where we want to minimise latency of request and response based message passing. In this case, despite the fact that we’re distributing the work to as many routees as possible, we only need to worry about a single message being returned from the router.

Tail chopping router

Within asynchronous systems, there’s always the potential for a message to be processed at a slower rate by one actor than another. This could be caused by any number of possibilities ranging from the hardware it’s running on to operating systems, in addition to transient issues any external services may be experiencing. Whilst these slowdowns are relatively infrequent, it ultimately propagates the issue onto the user and they are delayed waiting for the result of their request. When we graph out the latency across numerous requests, it typically looks like the graph below. We have a small number of requests which execute and return almost instantaneously, and most the requests are around the median latency, with a long tail of high latency.

Figure 20  Typically, latencies follow a bell curve with a majority of responses having a common time with some arriving earlier and some arriving later

 

To ensure users enjoy a responsive application, we need to try to minimise the effects of the long tail latencies which we frequently see. To achieve this, we need to try to prevent the seemingly random slowdowns that system components might experience. The scatter gather first complete router helped make this a possibility thanks to the distribution of messages to all routees. If one of its routees was experiencing a slowdown, then it won’t cause a significant degradation of service because other routees will pick up the message and try to process the message. If it manages to process it sooner than the original, its response is forwarded to the original sender. Using this approach can cause a significant amount of redundant work even if we manage to return a result quickly. This has the potential to lead to further problems down the line due to the extra work we’re forcing ourselves to perform regardless of status. Using the scatter gather approach, we assume the worst possible scenario will occur, which is that the original response takes longer is tolerable. This isn’t likely because most responses complete within a respectable amount of time. The aim is to shorten the tail of the graph as much as possible, which is frequently much less than 1% of the total requests.

To get around this, we should only retry the work if we believe we’re going to encounter a scenario which is likely to end up forming part of the graph’s tail. This is the aim of the tail chopping router, to significantly shorten the tail of the graph. It does this by combining a number of components. The router first selects a routee at random to send the message to, and if it doesn’t receive a response within a certain amount of time, it sends the message to another routee before it awaits a response from it. Once it receives a response, it forwards the message onto the original sender and ignores all subsequent responses, but if it doesn’t receive a message before a certain period, sends a Failure message to the original sender. The tail chopping router works on the basis that there’s a high probability that one of the other workers can process the message faster than the chosen routees.

Figure 21  The tail chopping router first sends a request to a random routee and starts a timer, awaiting a response before the timer expires

Figure 22  If no response is received then a second routee is chosen at random and the timer is reset, when a message is eventually received, it’s sent back to the original asker

 

The tail chopping router requires a little more configuration to get working than most routers. We can create a tail chopping router by creating an instance of the TailChoppingPool. As with others, we need to specify the number of routees to use and much like the scatter gather router, we need to specify the overall timeout before we deem the request has failed. We also need to specify that we should attempt to contact the next routee after a specific period of time known as the interval. The example below shows an example where we create a tail chopping router with five routees and a timeout of 1.5 seconds. We also say that we should forward the message onto a second routee after 200 milliseconds if no response is returned from the contacted routees.

Figure 23

We can create this router using a HOCON based configuration where we follow a similar pattern to the scatter gather router. Many of the configuration variables used are the same, notably the number of routees and the maximum timeout, but we also supply the tail-chopping-router.interval to specify the time between multiple routee calls.

Figure 24

The tail chopping router approach allows us to perform redundant work when we believe we can get a response quicker from another target is used by several distributed NoSQL databases to help reduce the tail end of response latencies. Due to the simplicity of Akka.Net’s routers, it ends up being easy to implement in your applications to try and ensure that your users aren’t left waiting for a response for too long. It has downsides though, the most important of which is that you’re aware of the expected latency of the target routees. Without an understanding of the latency it’s likely that any configuration values supplied for the interval period won’t be effective at reducing the latency tail.

Routing strategies summary

Whilst it may seem like there’s an abundance of routers within Akka.Net, many of them are tailored to a specific set of situations and designed to ensure that your applications are responsive when faced with an increased load. The routers Akka.Net provides can use the resizer functionality to automatically resize once the application sees a certain sustained level of load upon it.

 

That’s all for this article. For more, download the free first chapter of Reactive Applications in Akka.NET.