From Chaos Engineering by Mikolaj Pawlikowski
This article covers
● Setting up the VM to run through the accompanying code
● Basics of Linux forensics – why did my process die?
● Blast radius
Take 40% off Chaos Engineering by entering fccpawlikowski into the discount code box at checkout at manning.com.
I’m going to take you step-by-step through building your first chaos experiment using nothing more than a few lines of bash. I’ll also use the occasion to introduce and illustrate new concepts like blast radius.
First let’s set up the workspace.
DEFINITIONS I’ll bet you’re wondering what a blast radius is. Let me explain. Much like an explosive, a software component that goes wrong can break the other things it connects to. We speak of a blast radius to describe the maximum set of things that can be affected by something going wrong.
Setup: working with the code in this article
I care about your learning process. To make sure that all the relevant resources and tools are available to you immediately, I’m providing a virtual machine image that you can download, import and run on any host capable of running VirtualBox (https://www.virtualbox.org/wiki/Downloads). Throughout this article, I’m going to assume that you’re executing the code provided in the VM. This way you won’t have to worry about installing the various tools on your PC. It also allows us to be more playful inside the VM than we could be on your host OS.
Before you get started, you need to import the virtual machine image into VirtualBox. To do that, you’ll complete the following steps:
- Download the VM image:
  - Go to https://github.com/seeker89/chaos-engineering-in-action
  - Click on “releases”
  - Find the latest release
  - Download, verify and decompress the VM archive following the release notes
- Install VirtualBox by following the instructions at https://www.virtualbox.org/wiki/Downloads
- Import the VM image into VirtualBox:
  - In VirtualBox, click File -> Import Appliance
  - Pick the VM image file you downloaded and unarchived
  - Follow the wizard until completion
- Configure the VM to your taste (and resources):
  - In VirtualBox, right-click your new VM and choose Settings
  - In General -> Advanced -> Shared Clipboard, choose Bidirectional
  - In System -> Motherboard, choose 4096 MB of Base Memory
  - In Display -> Video Memory, choose at least 64 MB
  - In Display -> Remote Display, uncheck Enable Server
  - In Display -> Graphics Controller, choose what VirtualBox recommends
- Start the VM and log in:
  - The username and password are both ‘chaos’
This is it! The VM comes with all the source code needed and all the tools pre-installed. The versions of the tools match the ones I used in the text of this article. All of the source code, including the code used to prebuild the VM, can be found at https://github.com/seeker89/chaos-engineering-in-action. Once you’ve completed these steps, you should be able to follow everything. If you find any issues, feel free to create an issue on that GitHub repo. Let’s get to the meat of it by introducing an ironically realistic scenario!
A client is having trouble with their latest product: their early clients are complaining that the product sometimes doesn’t work, but when the engineers test it, it seems fine. As a growing authority in the chaos engineering community, you agree to help them track down and fix the issue they’re facing. Challenge accepted.
This is a more common scenario than any chaos engineer wants to admit. Something is not working, the existing testing methods don’t find anything and the clock is ticking. In the ideal world, we’d proactively think about and prevent situations like this, but in the real world, we’ll often face ongoing problems. To give you the right tools to cope, I wanted to start you off with a scenario of the latter category.
In this kind of situation, we’ll typically have at least two pieces of information to work with: the overall architecture and the application logs. Let’s start by taking a look at the architecture of the FizzBuzz service. As figure 1 illustrates, it consists of a load balancer (nginx) and two identical copies of an API server (implemented in Python). When a client makes a request through their internet browser (1), the request is received by the load balancer. The load balancer is configured to route incoming traffic to any instance which is up and running (2). If the instance the load balancer chooses becomes unavailable (3), the load balancer is configured to re-transmit the request to the other instance (4). Finally, the load balancer returns the response provided by the API server instance to the client (5), and the internal failure is transparent to the user.
Figure 1. FizzBuzz as a Service technical architecture
The other element at our disposal is the logs. A relevant sample of the logs looks like this (similar lines appear multiple times):
[14658.582809] ERROR: FizzBuzz API instance exiting, exit code 143
[14658.582809] Restarting
[14658.582813] FizzBuzz API version 0.0.7 is up and running.
Although it’s a little bit light on details, it provides us with some valuable clues about what’s going on: we can see that their API server instances are restarted and we can also see something called an exit code. These restarts are a good starting point to design a chaos experiment around. But before we do that, it’s important that you know how to read exit codes like these and use them to understand what happened to a process before it died. With the Criminal Minds theme in the background, let’s take a look at the basics of Linux forensics.
Linux forensics 101
When doing chaos engineering, we often find ourselves trying to understand why a program died. It can feel like playing the detective solving mysteries in a popular crime TV series. Let’s put the detective hat on and solve a case!
In the scenario above, what’s at our disposal amounts to a black box program that we saw die, and we want to figure out why. What do you do, and how do you check what happened? In this section we’ll cover exit codes and killing processes, both manually through the kill command and by the Out Of Memory (OOM) Killer, the part of Linux responsible for killing processes when the system runs low on memory. This prepares you to deal with processes dying in real life. Let’s begin with the exit codes.
DEFINITIONS In software engineering, we often refer to systems which are “opaque” to us (we can only see their inputs and outputs, but have no insight into their inner workings) as black boxes. The opposite of a black box is sometimes called a white box. You might have heard about the bright orange recording devices installed on airplanes. They’re also often referred to as black boxes, because they are designed to prevent tampering, despite their real color. When practicing chaos engineering, we’re often able to operate on entire systems or system components as black boxes.
When dealing with a black box piece of code, the first thing you might want to think about is running the program and seeing what happens. Unless it’s supposed to rotate the nuclear plant access codes, running it is probably a good idea. To show you what that could look like, I wrote a program which dies for you. Let’s warm up by running it and investigating what happens. From the provided VM, open a new bash session and start a mysterious program by running this command:
Notice that it exits immediately and prints an error message like this:
Floating point exception (core dumped)
It was kind enough to tell us why it died: something to do with a floating-point arithmetic error. That’s great for the human eye, but Linux provides a better mechanism for understanding what happened to the program. When a process terminates, it returns a number to inform the user whether it succeeded or not. This number is called an exit code. In bash, the exit code of the last command is stored in the special parameter $?, so we can check it by running the following command at the prompt:

echo $?
In this case, you’ll see the following output:

136
It means that the last program executed exited with code 136. Many (though not all) Unix commands return 0 when the command succeeds and 1 when it fails. Some use different return codes to differentiate between specific errors. Bash has a fairly compact convention on exit codes that I encourage you to take a look at. The codes in the range 128-192 decode as 128 + n, where n is the number of the kill signal. In this example, the exit code is 136, which corresponds to 128 + 8, meaning that the program received kill signal number eight, which is SIGFPE. That signal is sent to a program when it tries to execute an erroneous arithmetic operation. Don’t worry, you don’t have to remember all the kill signal numbers by heart. You can see them with their corresponding numbers by running kill -L at the command prompt. Note that some of the exit codes differ between bash and other shells.
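Bash can even do the 128 + n arithmetic for you: passing an exit status above 128 to the kill -l builtin prints the name of the signal it corresponds to. A quick sketch (assuming bash on Linux):

```shell
# Translate an exit status back into a signal name with bash's kill builtin.
# 136 - 128 = 8, and signal number 8 is SIGFPE (floating point exception).
kill -l 136   # prints: FPE
```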
Remember that a program can return any exit code, sometimes by mistake. But assuming that it gives us a meaningful exit code, we know where to start debugging and life tends to be good. The program did something wrong, it died, the cold kernel justice was served. But what happens if you suspect it was a murder?
If you’re curious about the various signals you can send (for example via the kill command), you can list them easily by running the following command in your terminal:

kill -L

You’ll see output similar to the following:
 1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL       5) SIGTRAP
 6) SIGABRT      7) SIGBUS       8) SIGFPE       9) SIGKILL     10) SIGUSR1
11) SIGSEGV     12) SIGUSR2     13) SIGPIPE     14) SIGALRM     15) SIGTERM
16) SIGSTKFLT   17) SIGCHLD     18) SIGCONT     19) SIGSTOP     20) SIGTSTP
21) SIGTTIN     22) SIGTTOU     23) SIGURG      24) SIGXCPU     25) SIGXFSZ
26) SIGVTALRM   27) SIGPROF     28) SIGWINCH    29) SIGIO       30) SIGPWR
31) SIGSYS      34) SIGRTMIN    35) SIGRTMIN+1  36) SIGRTMIN+2  37) SIGRTMIN+3
38) SIGRTMIN+4  39) SIGRTMIN+5  40) SIGRTMIN+6  41) SIGRTMIN+7  42) SIGRTMIN+8
43) SIGRTMIN+9  44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9  56) SIGRTMAX-8  57) SIGRTMAX-7
58) SIGRTMAX-6  59) SIGRTMAX-5  60) SIGRTMAX-4  61) SIGRTMAX-3  62) SIGRTMAX-2
63) SIGRTMAX-1  64) SIGRTMAX
To show you how to explicitly kill processes, let’s play both the good cop and the bad cop. Open two bash sessions in two terminal windows. In the first one, run the following command to start a long-running process:

sleep 3600
Much like it says on the tin, the sleep command blocks for the specified number of seconds, simulating a long-running process. Your prompt blocks as it waits for the command to finish. To confirm it’s running, in the second terminal run the following command to list the running processes (the f flag visually shows parent-child relationships between processes):

ps f

In the following output, you can see sleep 3600 as a child of the other bash process:
  PID TTY      STAT   TIME COMMAND
 4214 pts/1    Ss     0:00 bash
 4262 pts/1    R+     0:00  \_ ps f
 2430 pts/0    Ss     0:00 bash
 4261 pts/0    S+     0:00  \_ sleep 3600
Now, still in the second terminal, let’s commit a process crime and kill our poor sleep process:

pkill sleep
Notice the sleep process die in the first terminal. It prints this output, and the prompt becomes available again:

Terminated
This is useful to see, but most of the time the processes we care about die when we’re not looking at them, and we’ll be interested in gathering as much information about the circumstances of their death as possible. This is when the exit codes we covered come in handy. We can verify what exit code the sleep process returned before dying by using the familiar command:

echo $?
The exit code is 143. Similar to 136 before, it corresponds to 128 + 15, or SIGTERM, the default signal sent by the kill command. This is the same code that was displayed in the FizzBuzz logs, giving us an indication that their processes were being killed. This is an ‘aha’ moment: the first piece of our puzzle!
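If you’d rather not juggle two terminals, the same 128 + 15 result can be reproduced in a single bash snippet using standard job control. This is a minimal sketch, not taken from the book’s VM:

```shell
# Start a long-running process in the background, kill it with the default
# SIGTERM, and reap it with wait, which reports the process's exit code.
sleep 3600 &
pid=$!            # PID of the background sleep
kill "$pid"       # default signal: SIGTERM (15)
wait "$pid"       # wait's exit status is sleep's exit code
echo $?           # prints: 143 (128 + 15)
```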
If we chose a different signal, we’d see a different exit code. To illustrate that, start the sleep process again from the first terminal by running the same command:

sleep 3600
To send a KILL signal, run the following command from the second terminal:
pkill -9 sleep
This results in a different exit code. To see what it is, run this command from the first terminal, the one in which the process died:

echo $?
You’ll see the following output:

137
As you might expect, the exit code is 137, or 128 + 9. Note that there’s nothing preventing us from using kill -8 and getting the same exit code as in the previous example, where we had an arithmetic error in the program. All of this is a convention, but most popular tooling follows it.
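A practical difference between the two signals is worth a quick demonstration: a process can trap SIGTERM and shut down cleanly, while SIGKILL can’t be caught at all. Here’s a small sketch (the echoed messages are just for illustration):

```shell
# A child process that traps SIGTERM and converts it into a clean exit.
# Using `sleep 5 & wait` keeps the child responsive to the trap while sleeping.
bash -c 'trap "echo caught SIGTERM; exit 0" TERM; sleep 5 & wait' &
pid=$!
sleep 1                # give the child a moment to install its trap
kill "$pid"            # default SIGTERM; the trap fires
wait "$pid"
echo "exit code: $?"   # prints: exit code: 0 (the trap exited cleanly)
```

Had we sent kill -9 instead, no trap could run, and the exit code would have been 137.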
So now we’ve covered another popular way for a process to die: by an explicit signal. It might be an administrator issuing a command, the kernel detecting an erroneous operation, or some kind of daemon managing the process. Of the latter category, an interesting example is the Out Of Memory (OOM) Killer. Let’s take a look at the mighty OOM Killer.
Out Of Memory (OOM) Killer
The OOM Killer can be a surprise the first time you learn about it. If you haven’t yet, I’d like you to experience it firsthand. Let’s start with a little mystery to solve. To illustrate what OOM is all about, run the following program I’ve prepared for you from the command line:
Can you find out what the program is doing? Where would you start? The source code is in the same folder as the executable, but stay with me for a few minutes before you read it. Let’s try to first approach it as a black box.
After a minute or two of running it, you might notice your VM getting a little sluggish, which is a good hint to check the memory utilization. You can do that by running the top command, as follows. Note the -n1 flag to print one output and exit, rather than update continuously, and -o+%MEM to sort the processes by their memory utilization:

top -n1 -o+%MEM
Your output is similar to the following:
top - 21:35:49 up 4:21, 1 user, load average: 0.55, 0.46, 0.49
Tasks: 175 total, 3 running, 172 sleeping, 0 stopped, 0 zombie
%Cpu(s): 11.8 us, 29.4 sy, 0.0 ni, 35.3 id, 20.6 wa, 0.0 hi, 2.9 si, 0.0 st
MiB Mem : 3942.4 total, 98.9 free, 3745.5 used, 98.0 buff/cache    #A
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 5.3 avail Mem

  PID USER  PR NI    VIRT    RES   SHR S %CPU %MEM   TIME+ COMMAND
 5451 chaos 20  0 3017292   2.9g     0 S  0.0 74.7 0:07.95 mystery001    #B
 5375 chaos 20  0 3319204 301960 50504 S 29.4  7.5 0:06.65 gnome-shell
 1458 chaos 20  0  471964 110628 44780 S  0.0  2.7 0:42.32 Xorg
(...)
#A Available memory at around 100MB
#B Memory usage (RES and %MEM) and the name of mystery001 process in bold font
We can see that mystery001 is using 2.9 GB of memory, almost three-quarters of what my VM has, and the available memory hovers around 100 MB. Your top might even start dying on you or struggling to allocate memory. Unless you’re busy encoding videos or maxing out a game, that’s rarely a good sign. As you’re trying to figure out what’s going on, if my timing is any good, you should see your process die in the prompt:

Killed
A murder! But what happened, and who killed it? The title of this section is a little bit of a giveaway; let’s check the kernel log for some clues. To do that, we can use dmesg, a Linux utility which displays kernel messages. Let’s search for our mystery001 by running the following in a terminal:

dmesg | grep -i mystery001
You’ll see something similar to the following output. As you read through these lines, the plot thickens: something called oom_reaper killed your mysterious process.
[14658.582932] Out of memory: Kill process 5451 (mystery001) score 758 or sacrifice child
[14658.582939] Killed process 5451 (mystery001) total-vm:3058268kB, anon-rss:3055776kB, file-rss:4kB, shmem-rss:0kB
[14658.644154] oom_reaper: reaped process 5451 (mystery001), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
What is it, and why is it claiming rights to your processes? If we browse through dmesg a bit more, we see a little more information about what OOM did, including the list of processes it evaluated before sacrificing our program on the altar of RAM.
Here’s an example, shortened for brevity. Notice the oom_score_adj column, which displays the scores of various processes from the OOM Killer’s point of view (I bolded the name for easier reading):
[14658.582809] Tasks state (memory values in pages):
[14658.582809] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
(...)
[14658.582912] [   5451]  1000  5451   764567   763945  6164480        0             0 mystery001
(...)
[14658.582932] Out of memory: Kill process 5451 (mystery001) score 758 or sacrifice child
[14658.582939] Killed process 5451 (mystery001) total-vm:3058268kB, anon-rss:3055776kB, file-rss:4kB, shmem-rss:0kB
[14658.644154] oom_reaper: reaped process 5451 (mystery001), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
The OOM Killer is one of the more interesting (and controversial) memory management features in the Linux kernel. Under low memory conditions, it kicks in and tries to figure out which processes to kill in order to reclaim some memory and regain some stability for the system. It uses heuristics (including niceness, how recent the process is, and how much memory it uses; see https://linux-mm.org/OOM_Killer for more details) to score each process and pick the unlucky winner. If you’re interested in how it came to be and why it was implemented the way it was, the best article on this subject that I know of can be found at https://lwn.net/Articles/317814/.
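You can peek at the score the OOM Killer currently assigns to any process through the /proc filesystem. A quick sketch (standard Linux paths; the values will differ on your machine):

```shell
# Every Linux process exposes its current OOM "badness" score, plus a
# user-settable adjustment (-1000 = never kill, +1000 = kill first).
cat /proc/self/oom_score       # heuristic score of this very shell
cat /proc/self/oom_score_adj   # the adjustment, 0 by default
```

With root privileges, writing to oom_score_adj is a common way to protect critical processes (or volunteer expendable ones) when memory runs out.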
This is the third popular reason for processes to die, one which often surprises newcomers. In the FizzBuzz logs sample, we know that the exit code we saw could be a result of either an explicit kill command or perhaps the OOM Killer. Unfortunately, unlike other exit codes which have a well-defined meaning, the one we saw in the logs sample doesn’t help us conclude the exact reason for the processes dying. Fortunately, chaos engineering allows us to make progress regardless of that. Let’s go ahead and get busy applying some chaos engineering!
OOM Killer settings
The OOM Killer behavior can be tweaked via flags exposed by the kernel. From the kernel documentation, https://www.kernel.org/doc/Documentation/sysctl/vm.txt:
==============================================================
oom_kill_allocating_task

This enables or disables killing the OOM-triggering task in out-of-memory situations.

If this is set to zero, the OOM killer scans through the entire tasklist and selects a task based on heuristics to kill. This normally selects a rogue memory-hogging task that frees up a large amount of memory when killed.

If this is set to non-zero, the OOM killer kills the task that triggered the out-of-memory condition. This avoids the expensive tasklist scan.

If panic_on_oom is selected, it takes precedence over whatever value is used in oom_kill_allocating_task.

The default value is 0.
The second flag, oom_dump_tasks, dumps extra information when killing a process for easier debugging. In the provided VM, based off Ubuntu Disco Dingo, the two flags default to 0 and 1 respectively, meaning that the OOM Killer uses its heuristics to pick the victim and then dumps extra information when killing processes. If you want to check the settings on your system, you can run the following commands:
cat /proc/sys/vm/oom_kill_allocating_task cat /proc/sys/vm/oom_dump_tasks
That’s all for this article. If you want to learn more, you can preview the book’s contents on our browser-based liveBook platform here.