From Chaos Engineering by Mikolaj Pawlikowski
This article covers
● Setting up the VM to run through the accompanying code
● Basics of Linux forensics – why did my process die?
● Blast radius
Take 40% off Chaos Engineering by entering fccpawlikowski into the discount code box at checkout at manning.com.
I’m going to take you step-by-step through building your first chaos experiment using nothing more than a few lines of bash. I’ll also use the occasion to introduce and illustrate new concepts like blast radius.
First let’s set up the workspace.
DEFINITIONS I’ll bet you’re wondering what a blast radius is. Let me explain. Much like an explosive, a software component that goes wrong can break the other things it connects to. We speak of a blast radius to describe the maximum set of things that can be affected by something going wrong.
Setup: working with the code in this article
I care about your learning process. To make sure that all the relevant resources and tools are available to you immediately, I’m providing a virtual machine image that you can download, import and run on any host capable of running VirtualBox (https://www.virtualbox.org/wiki/Downloads). Throughout this article, I’m going to assume that you’re executing the code provided in the VM. This way you won’t have to worry about installing the various tools on your PC. It also allows us to be more playful inside the VM than we could be on your host OS.
Before you get started, you need to import the virtual machine image into VirtualBox. To do that, you’ll complete the following steps:
- Download the VM image:
  - Go to https://github.com/seeker89/chaos-engineering-in-action
  - Click on “releases”
  - Find the latest release
  - Download, verify and decompress the VM archive following the release notes
- Install VirtualBox by following the instructions at https://www.virtualbox.org/wiki/Downloads
- Import the VM image into VirtualBox:
  - In VirtualBox, click File -> Import Appliance
  - Pick the VM image file you downloaded and unarchived
  - Follow the wizard until completion
- Configure the VM to your taste (and resources):
  - In VirtualBox, right-click your new VM and choose Settings
  - In General -> Advanced -> Shared Clipboard, choose Bidirectional
  - In System -> Motherboard, choose 4096 MB of Base Memory
  - In Display -> Video Memory, choose at least 64 MB
  - In Display -> Remote Display, uncheck Enable Server
  - In Display -> Graphics Controller, choose what VirtualBox recommends
- Start the VM and log in:
  - The username and password are both ‘chaos’
This is it! The VM comes with all the source code needed and all the tools pre-installed. The versions of the tools match the ones I used in the text of this article. All of the source code, including the code used to prebuild the VM, can be found at https://github.com/seeker89/chaos-engineering-in-action. Once you’ve completed these steps, you should be able to follow everything. If you find any issues, feel free to create an issue on that GitHub repo. Let’s get to the meat of it by introducing an ironically realistic scenario!
A client is having trouble with their latest product: their early clients are complaining that the product sometimes doesn’t work, but when the engineers test it, it seems fine. As a growing authority in the chaos engineering community, you agree to help them track down and fix the issue they’re facing. Challenge accepted.
This is a more common scenario than any chaos engineer wants to admit. Something is not working, the existing testing methods don’t find anything and the clock is ticking. In the ideal world, we’d proactively think about and prevent situations like this, but in the real world, we’ll often face ongoing problems. To give you the right tools to cope, I wanted to start you off with a scenario of the latter category.
In this kind of situation, we’ll typically have at least two pieces of information to work with: the overall architecture and the application logs. Let’s start by taking a look at the architecture of the FizzBuzz service. As figure 1 illustrates, it consists of a load balancer (nginx) and two identical copies of an API server (implemented in Python). When a client makes a request through their internet browser (1), the request is received by the load balancer. The load balancer is configured to route incoming traffic to any instance which is up and running (2). If the instance the load balancer chooses becomes unavailable (3), the load balancer is configured to re-transmit the request to the other instance (4). Finally, the load balancer returns the response provided by the API server instance to the client (5), and the internal failure is transparent to the user.
Figure 1. FizzBuzz as a Service technical architecture
The other element at our disposal is the logs. A relevant sample of the logs looks like this (similar lines appear multiple times):
[14658.582809] ERROR: FizzBuzz API instance exiting, exit code 143
[14658.582809] Restarting
[14658.582813] FizzBuzz API version 0.0.7 is up and running.
Although it’s a little bit light on details, it provides us with some valuable clues about what’s going on: we can see that their API server instances are restarted and we can also see something called an exit code. These restarts are a good starting point to design a chaos experiment around. But before we do that, it’s important that you know how to read exit codes like these and use them to understand what happened to a process before it died. With the Criminal Minds theme in the background, let’s take a look at the basics of Linux forensics.
Linux forensics 101
When doing chaos engineering, we often find ourselves trying to understand why a program died. It can feel like playing the detective solving mysteries in a popular crime TV series. Let’s put the detective hat on and solve a case!
In the scenario above, what’s at our disposal amounts to a black box program that we saw die, and we want to figure out why. What do you do, and how do you check what happened? In this section we’ll cover exit codes and killing processes, both manually through the kill command and by the Out Of Memory (OOM) Killer, the part of Linux responsible for killing processes when the system runs low on memory. This prepares you to deal with processes dying in real life. Let’s begin with the exit codes.
DEFINITIONS In software engineering, we often refer to systems which are “opaque” to us (we can only see their inputs and outputs, but have no insight into their inner workings) as black boxes. The opposite of a black box is sometimes called a white box. You might have heard about the bright orange recording devices installed on airplanes. They’re also often referred to as black boxes, because they are designed to prevent tampering, despite their real color. When practicing chaos engineering, we’re often able to operate on entire systems or system components as black boxes.
When dealing with a black box piece of code, the first thing you might want to think about is running the program and seeing what happens. Unless it’s supposed to rotate the nuclear plant access codes, running it is probably a good idea. To show you what that could look like, I wrote a program which dies for you. Let’s warm up by running it and investigating what happens. From the provided VM, open a new bash session and start a mysterious program by running this command:
Notice that it exits immediately and prints an error message like this:
Floating point exception (core dumped)
It was kind enough to tell us why it died: something to do with a floating-point arithmetic error. That’s great for the human eye, but Linux provides a better mechanism for understanding what happened to the program. When a process terminates, it returns a number to inform the user whether it succeeded or not. This number is called an exit code. In bash, the exit code of the last command is stored in the special parameter $?, so we can check it by running the following command at the prompt:

echo $?
In this case, you’ll see the following output:

136
It means that the last program executed exited with code 136. Many (though not all) Unix commands return 0 when the command succeeds and 1 when it fails. Some use different return codes to differentiate between specific errors. Bash has a fairly compact convention on exit codes that I encourage you to take a look at. The codes in the range 128-192 decode as 128 + n, where n is the number of the kill signal. In this example, the exit code is 136, which corresponds to 128 + 8, meaning that the program received kill signal number eight, which is SIGFPE. That signal is sent to a program when it tries to execute an erroneous arithmetic operation. Don’t worry, you don’t have to remember all the kill signal numbers by heart. You can see them with their corresponding numbers by running kill -L at the command prompt. Note that some of the exit codes differ between bash and other shells.
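Bash can even do the 128 + n arithmetic for you: passing an exit status above 128 to the kill -l builtin prints the name of the signal it corresponds to. A quick sketch (assuming bash on Linux):

```shell
# Translate an exit status back into a signal name with bash's kill builtin.
# 136 - 128 = 8, and signal number 8 is SIGFPE (floating point exception).
kill -l 136   # prints: FPE
```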
Remember that a program can return any exit code, sometimes by mistake. But assuming that it gives us a meaningful exit code, we know where to start debugging and life tends to be good. The program did something wrong, it died, the cold kernel justice was served. But what happens if you suspect it was a murder?
If you’re curious about the various signals you can send (for example via the kill command), you can list them easily by running the following command in your terminal:

kill -L

You’ll see output similar to the following:
 1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL       5) SIGTRAP
 6) SIGABRT      7) SIGBUS       8) SIGFPE       9) SIGKILL     10) SIGUSR1
11) SIGSEGV     12) SIGUSR2     13) SIGPIPE     14) SIGALRM     15) SIGTERM
16) SIGSTKFLT   17) SIGCHLD     18) SIGCONT     19) SIGSTOP     20) SIGTSTP
21) SIGTTIN     22) SIGTTOU     23) SIGURG      24) SIGXCPU     25) SIGXFSZ
26) SIGVTALRM   27) SIGPROF     28) SIGWINCH    29) SIGIO       30) SIGPWR
31) SIGSYS      34) SIGRTMIN    35) SIGRTMIN+1  36) SIGRTMIN+2  37) SIGRTMIN+3
38) SIGRTMIN+4  39) SIGRTMIN+5  40) SIGRTMIN+6  41) SIGRTMIN+7  42) SIGRTMIN+8
43) SIGRTMIN+9  44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9  56) SIGRTMAX-8  57) SIGRTMAX-7
58) SIGRTMAX-6  59) SIGRTMAX-5  60) SIGRTMAX-4  61) SIGRTMAX-3  62) SIGRTMAX-2
63) SIGRTMAX-1  64) SIGRTMAX
To show you how to explicitly kill processes, let’s play both the good cop and the bad cop. Open two bash sessions in two terminal windows. In the first one, run the following command to start a long-running process:

sleep 3600
Much like it says on the tin, the sleep command blocks for the specified number of seconds, simulating a long-running process. Your prompt blocks as it waits for the command to finish. To confirm it’s running, in the second terminal run the following command to list the running processes (the f flag visually shows parent-child relationships between processes):

ps f

In the following output, you can see sleep 3600 as a child of the other bash process:
  PID TTY      STAT   TIME COMMAND
 4214 pts/1    Ss     0:00 bash
 4262 pts/1    R+     0:00  \_ ps f
 2430 pts/0    Ss     0:00 bash
 4261 pts/0    S+     0:00  \_ sleep 3600
Now, still in the second terminal, let’s commit a process crime and kill our poor sleep process:

pkill sleep
Notice the sleep process die in the first terminal. It prints this output, and the prompt becomes available again:

Terminated
This is useful to see, but most of the time the processes we care about die when we’re not looking at them, and we’ll be interested in gathering as much information about the circumstances of their death as possible. This is when the exit codes we covered come in handy. We can verify what exit code the sleep process returned before dying by using the familiar command:

echo $?
The exit code is 143. Similar to 136 before, it corresponds to 128 + 15, or SIGTERM, the default signal sent by the kill command. This is the same code that was displayed in the FizzBuzz logs, giving us an indication that their processes were being killed. This is an ‘aha’ moment: the first piece of our puzzle!
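If you’d rather not juggle two terminals, the same 128 + 15 result can be reproduced in a single bash snippet using standard job control. This is a minimal sketch, not taken from the book’s VM:

```shell
# Start a long-running process in the background, kill it with the default
# SIGTERM, and reap it with wait, which reports the process's exit code.
sleep 3600 &
pid=$!            # PID of the background sleep
kill "$pid"       # default signal: SIGTERM (15)
wait "$pid"       # wait's exit status is sleep's exit code
echo $?           # prints: 143 (128 + 15)
```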
If we chose a different signal, we’d see a different exit code. To illustrate that, start the sleep process again from the first terminal by running the same command:

sleep 3600
To send a KILL signal, run the following command from the second terminal:
pkill -9 sleep
This results in a different exit code. To see what it is, run this command from the first terminal, the one in which the process died:

echo $?
You’ll see the following output:

137
As you might expect, the exit code is 137, or 128 + 9. Note that there’s nothing preventing us from using kill -8 and getting the same exit code as in the previous example, where we had an arithmetic error in the program. All of this is a convention, but most popular tooling follows it.
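A practical difference between the two signals is worth a quick demonstration: a process can trap SIGTERM and shut down cleanly, while SIGKILL can’t be caught at all. Here’s a small sketch (the echoed messages are just for illustration):

```shell
# A child process that traps SIGTERM and converts it into a clean exit.
# Using `sleep 5 & wait` keeps the child responsive to the trap while sleeping.
bash -c 'trap "echo caught SIGTERM; exit 0" TERM; sleep 5 & wait' &
pid=$!
sleep 1                # give the child a moment to install its trap
kill "$pid"            # default SIGTERM; the trap fires
wait "$pid"
echo "exit code: $?"   # prints: exit code: 0 (the trap exited cleanly)
```

Had we sent kill -9 instead, no trap could run, and the exit code would have been 137.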
So now we’ve covered another popular way for a process to die: by an explicit signal. It might be an administrator issuing a command, the kernel detecting an erroneous operation, or some kind of daemon managing the process. Of the latter category, an interesting example is the Out Of Memory (OOM) Killer. Let’s take a look at the mighty OOM Killer.
Out Of Memory (OOM) Killer
The OOM Killer can be a surprise the first time you learn about it. If you haven’t yet, I’d like you to experience it firsthand. Let’s start with a little mystery to solve. To illustrate what OOM is all about, run the following program I’ve prepared for you from the command line:
Can you find out what the program is doing? Where would you start? The source code is in the same folder as the executable, but stay with me for a few minutes before you read it. Let’s try to first approach it as a black box.
After a minute or two of running it, you might notice your VM getting a little sluggish, which is a good hint to check the memory utilization. You can do that by running the top command, as follows. Note the -n1 flag to print one output and exit, rather than update continuously, and -o+%MEM to sort the processes by their memory utilization:

top -n1 -o+%MEM
Your output is similar to the following:
top - 21:35:49 up 4:21, 1 user, load average: 0.55, 0.46, 0.49
Tasks: 175 total, 3 running, 172 sleeping, 0 stopped, 0 zombie
%Cpu(s): 11.8 us, 29.4 sy, 0.0 ni, 35.3 id, 20.6 wa, 0.0 hi, 2.9 si, 0.0 st
MiB Mem : 3942.4 total, 98.9 free, 3745.5 used, 98.0 buff/cache    #A
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 5.3 avail Mem

  PID USER  PR NI    VIRT    RES   SHR S %CPU %MEM   TIME+ COMMAND
 5451 chaos 20  0 3017292   2.9g     0 S  0.0 74.7 0:07.95 mystery001    #B
 5375 chaos 20  0 3319204 301960 50504 S 29.4  7.5 0:06.65 gnome-shell
 1458 chaos 20  0  471964 110628 44780 S  0.0  2.7 0:42.32 Xorg
(...)
#A Available memory at around 100MB
#B Memory usage (RES and %MEM) and the name of mystery001 process in bold font
We can see that mystery001 is using 2.9 GB of memory, almost three-quarters of what my VM has, and the available memory hovers around 100 MB. Your top might even start dying on you or struggling to allocate memory. Unless you’re busy encoding videos or maxing out a game, that’s rarely a good sign. As you’re trying to figure out what’s going on, if my timing is any good, you should see your process die in the prompt:

Killed
A murder! But what happened, and who killed it? The title of this section is a little bit of a giveaway; let’s check the kernel log for some clues. To do that, we can use dmesg, a Linux utility which displays kernel messages. Let’s search for our mystery001 by running the following in a terminal:

dmesg | grep -i mystery001
You’ll see something similar to the following output. As you read through these lines, the plot thickens: something called oom_reaper killed your mysterious process.
[14658.582932] Out of memory: Kill process 5451 (mystery001) score 758 or sacrifice child
[14658.582939] Killed process 5451 (mystery001) total-vm:3058268kB, anon-rss:3055776kB, file-rss:4kB, shmem-rss:0kB
[14658.644154] oom_reaper: reaped process 5451 (mystery001), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
What is it, and why is it claiming rights to your processes? If we browse through dmesg a bit more, we see a little more information about what OOM did, including the list of processes it evaluated before sacrificing our program on the altar of RAM.
Here’s an example, shortened for brevity. Notice the oom_score_adj column, which displays the scores of various processes from the OOM Killer’s point of view (I bolded the name for easier reading):
[14658.582809] Tasks state (memory values in pages):
[14658.582809] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
(...)
[14658.582912] [   5451]  1000  5451   764567   763945  6164480        0             0 mystery001
(...)
[14658.582932] Out of memory: Kill process 5451 (mystery001) score 758 or sacrifice child
[14658.582939] Killed process 5451 (mystery001) total-vm:3058268kB, anon-rss:3055776kB, file-rss:4kB, shmem-rss:0kB
[14658.644154] oom_reaper: reaped process 5451 (mystery001), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
The OOM Killer is one of the more interesting (and controversial) memory management features in the Linux kernel. Under low memory conditions, it kicks in and tries to figure out which processes to kill in order to reclaim some memory and regain some stability for the system. It uses heuristics (including niceness, how recent the process is, and how much memory it uses; see https://linux-mm.org/OOM_Killer for more details) to score each process and pick the unlucky winner. If you’re interested in how it came to be and why it was implemented the way it was, the best article on this subject that I know of can be found at https://lwn.net/Articles/317814/.
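You can peek at the score the OOM Killer currently assigns to any process through the /proc filesystem. A quick sketch (standard Linux paths; the values will differ on your machine):

```shell
# Every Linux process exposes its current OOM "badness" score, plus a
# user-settable adjustment (-1000 = never kill, +1000 = kill first).
cat /proc/self/oom_score       # heuristic score of this very shell
cat /proc/self/oom_score_adj   # the adjustment, 0 by default
```

With root privileges, writing to oom_score_adj is a common way to protect critical processes (or volunteer expendable ones) when memory runs out.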
This is the third popular reason for processes to die, one which often surprises newcomers. In the FizzBuzz logs sample, we know that the exit code we saw could be a result of either an explicit kill command or perhaps the OOM Killer. Unfortunately, unlike other exit codes which have a well-defined meaning, the one we saw in the logs sample doesn’t help us conclude the exact reason for the processes dying. Fortunately, chaos engineering allows us to make progress regardless of that. Let’s go ahead and get busy applying some chaos engineering!
OOM Killer settings
The OOM Killer behavior can be tweaked via flags exposed by the kernel. From the kernel documentation, https://www.kernel.org/doc/Documentation/sysctl/vm.txt:
==============================================================
oom_kill_allocating_task

This enables or disables killing the OOM-triggering task in out-of-memory situations.

If this is set to zero, the OOM killer scans through the entire tasklist and selects a task based on heuristics to kill. This normally selects a rogue memory-hogging task that frees up a large amount of memory when killed.

If this is set to non-zero, the OOM killer kills the task that triggered the out-of-memory condition. This avoids the expensive tasklist scan.

If panic_on_oom is selected, it takes precedence over whatever value is used in oom_kill_allocating_task.

The default value is 0.
The second flag, oom_dump_tasks, dumps extra information when killing a process for easier debugging. In the provided VM, based off Ubuntu Disco Dingo, the two flags default to 0 and 1 respectively, meaning that the OOM Killer uses its heuristics to pick the victim and then dumps extra information when killing processes. If you want to check the settings on your system, you can run the following commands:
cat /proc/sys/vm/oom_kill_allocating_task cat /proc/sys/vm/oom_dump_tasks
That’s all for this article. If you want to learn more, you can preview the book’s contents on our browser-based liveBook platform here.