From Operations Anti-Patterns, DevOps Solutions by Jeffery Smith
This article covers:
• Longer release cycles and their impact on the team’s deployment confidence
• Automation techniques for deployments
• The value of code deployment artifacts
• Feature flags for releasing incomplete code
Patrick runs the product organization at the FunCo Company. One day his phone rings and it’s Jaheim from the sales department. Jaheim has been working his way into a sales meeting with Quantisys, a large player in the market. He finally got a chance to demo the software in front of the Quantisys senior leadership team and they absolutely fell in love with it, but like all dream deals, there was a catch: the software needed to integrate with Quantisys’ current billing system. Jaheim knows that billing integrations are high on the list of features the development team is working on. He hopes to get the integration Quantisys needs prioritized and implemented quickly enough to salvage the deal.
Patrick listens to Jaheim’s story and agrees that it’s a tremendous opportunity. He can rearrange priorities to land such a large customer, but there’s one catch. Even if the feature could be completed in two weeks’ time, the product operates on a quarterly release cycle. It’s February and the next release isn’t until mid-April. Patrick goes back and works with the development team to see if they can figure out a way to do a targeted release with this feature set. Unfortunately, there’s a slew of commits to the code base that are difficult to untangle with any level of safety. Jaheim takes the April date back to the customer, but they can’t wait that long to make a decision on their software; in April the feature would be brand new and might suffer from any number of bugs, delivery delays and general compatibility issues. The customer opts for another solution, Jaheim misses out on closing a great deal and FunCo loses what surely would have been a significant contributor to revenue.
Sometimes you can draw a direct line between technical operations and potential revenue and sales opportunities. When you think about the business of software, new features function like inventory. In a non-software business, a company deals with inventory on its shelves. Every moment that inventory is sitting represents unrealized revenue to the organization. The company had to pay to acquire the inventory, and it has to pay to store it and to track it. That inventory also represents potential risk. What if you never sell the excess inventory? Does the inventory lose value over time? For example, let’s say your warehouse stocks the hottest Christmas present this season. If you don’t sell that stock, there’s little chance that it’ll fetch the same price next season when a hot new toy dominates children’s imaginations.
A similar situation occurs when it comes to features in software. In order to acquire these features, software developers must pour time and energy into bringing them to reality. Those developer hours have a cost, not only in hours worked, but also in the form of lost opportunities. A developer focused on one feature isn’t working on another. This is often referred to as “opportunity cost”. But a feature doesn’t begin to create value for the organization the moment it’s complete. It only begins to build value for the company once it’s been deployed and released to users, whether it’s to all users or a specific subset. If the new features can’t be delivered to a customer in a timely fashion, value was created, but it couldn’t be captured by the company. The longer the software sits in a repository unused, the greater the chance it becomes waste and misses the market it was originally intended for. Think of it like this: if my software had a feature that integrated with Google Reader, that might be value that my customers want to get their hands on. In fact, it might be the feature that causes people to sign up for my product. But even though the feature was finished in January, I wasn’t able to release it until April due to issues with the release process. In a shocking turn, Google announces that they’ll be shutting down Google Reader in early July.
Your knee-jerk reaction might be that the development team can’t predict the future! Google Reader shutting down was an unforeseen event and was outside the control of the delivery team. This is true, but the event reduced the amount of time that the feature could provide value for the company. If the team could have released the feature when it was ready in January, it would have had roughly six months to attract customers to the platform based on that feature. Instead, the feature was useful for only three months before it became completely irrelevant.
The example above has plenty of room for “whataboutism” (What about X? What about Y?). You can argue about whether the idea for the feature was a good one to begin with, but that isn’t the point. The point is that the feature was complete but couldn’t begin generating value for the organization until it was practically too late. Besides this business risk of wasted code value, a slow release process has other negative effects:
- Release cramming – The release process is painful enough that teams avoid doing it regularly. This leads to larger and larger releases, making the release riskier.
- Rushed features – Larger releases mean less frequent releases. Teams might find themselves rushing a feature to ensure they can make it for the next release cycle.
- Crushing change control – When you have larger releases, they become riskier. The more risk, the bigger the impact of failure. The larger the impact of failure, the tighter the process becomes with additional layers of approvals and change control heaped on top.
This article focuses on the deployment process and how you can help to reduce the fear and risk of deployment inside your teams. Like most things in DevOps, one of the keys to doing this is applying as much automation as the process can stand.
The layers of a deployment
When I think of a deployment, I think of it in layers. Once you start looking at it in terms of multiple pieces of a deployment, you can begin to see that there are multiple places where you can make a deployment and a rollback easier. In many organizations, particularly large ones, the thought is that there are a series of separate deployments that happen, but these different parts are all part of the same deployment! A deployment of code does you no good if the database change hasn’t been applied.
As illustrated in figure 1, the deployment can be seen as a few layers:
- Functionality deployment is the process of enabling new features across the application.
- Fleet deployment is the process of getting the artifact deployment done across many servers.
- Artifact deployment is the process of getting new code installed on a single server.
- Database deployment is the process of applying any changes that need to happen to the database as part of the new code deployment.
Figure 1. The layers of deployments in an application.
It’s also valuable to think of these as separate concepts because in a lot of instances deployment processes have blended the steps together. If you haven’t thought about how you can roll back a single feature, the feature deployment and rollback become part of the artifact or fleet rollback. And if you don’t think of these as all part of supporting the same deployment, then you create a world where the release process has to cross many different teams, with coordinated actions that don’t take into account the details of the other processes.
With the mindset of these being parts of a whole, I tend to look at the database deployment as being the first part of the deployment process. This is due to the sensitivity of the database, as well as the fact that it’s a shared piece of infrastructure. Every application server in the pool uses the same database, and ensuring that its deployment goes well should be a top concern. It’s also possible that your new version of code is dependent on database changes being performed prior to the code starting up. Imagine if your new code is expecting a table, but that table doesn’t exist yet.
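One way to guard against that last scenario is a pre-flight check that refuses to start new code until the tables it expects are present. The following is a minimal sketch using Python’s built-in sqlite3 module. The table names and the `missing_tables` helper are hypothetical, and a different database engine would need its own catalog query.

```python
import sqlite3


def missing_tables(conn, required):
    """Return the required table names that do not exist in the database yet."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    existing = {name for (name,) in rows}
    return sorted(set(required) - existing)


# Hypothetical schema: the currently running release only uses `customers`.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

# The new code also expects `billing_integrations`.
required = ["customers", "billing_integrations"]
before_migration = missing_tables(conn, required)

# The database deployment runs first...
conn.execute(
    "CREATE TABLE billing_integrations ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER, provider TEXT)"
)
# ...and only then is it safe to start the new code.
after_migration = missing_tables(conn, required)

print(before_migration)
print(after_migration)
```

A deployment script could run a check like this before the artifact deployment and stop early if anything is missing, instead of letting the new code fail at startup.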
Second, you have your artifact deployment. The deployment artifact is the logical unit of the deployment that concerns itself with getting new code on to a running server. The artifact deployment doesn’t concern itself with coordinating with other nodes or handling end-user traffic gracefully. Its primary focus is getting new code on to a server.
Third, you have the fleet deployment. This is where you concern yourself with performing the artifact deployment across your entire fleet of target servers. At this layer you begin to concern yourself with things like user traffic and load balancer routing. There’s a lot more coordination that needs to happen across servers because you need to ensure that there’s still enough available capacity to continue to serve users.
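The capacity concern at the fleet layer can be sketched as a rolling deployment that upgrades servers in batches and never lets the number of in-service servers fall below a floor. This is an illustration only: the server names, the `min_in_service` parameter and the `deploy_artifact` callback are assumptions, and the load balancer is simulated with a plain set.

```python
def rolling_deploy(servers, deploy_artifact, min_in_service):
    """Deploy to servers in batches, never dropping below min_in_service."""
    batch_size = len(servers) - min_in_service
    if batch_size < 1:
        raise ValueError("not enough spare capacity to take any server out")

    in_service = set(servers)
    batches = []
    for start in range(0, len(servers), batch_size):
        batch = servers[start:start + batch_size]
        # Take the batch out of the load balancer pool (simulated here).
        in_service -= set(batch)
        # Sanity check: the remaining servers can still carry user traffic.
        assert len(in_service) >= min_in_service
        for server in batch:
            deploy_artifact(server)  # the single-server artifact deployment
        # Put the upgraded servers back before touching the next batch.
        in_service |= set(batch)
        batches.append(batch)
    return batches


deployed = []
fleet = ["app1", "app2", "app3", "app4", "app5"]
batches = rolling_deploy(fleet, deployed.append, min_in_service=3)
print(batches)
```

Note how the artifact deployment stays a single-server concern passed in as a function, while the fleet layer only decides which servers to touch and when.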
Lastly are feature deployments. It’s common to think that a feature deployment and a code deployment are one and the same, but as discussed previously, there’s a bit of nuance to it. A feature might not be deployed to users at the exact same time as the code that provides it is deployed. A feature which is hidden behind a feature flag allows us to separate those two ideas, albeit one must happen before the other. (The code deployment containing the feature has to go first or else there’s no feature to enable.) But if you think of the feature deployment as separate, it means you can begin to think about a rollback of a feature which doesn’t necessarily involve the rollback of the code that provides it.
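To make the separation concrete, here is a minimal feature flag sketch. The `FeatureFlags` class, the `billing_integration` flag name and the `invoice_customer` function are all hypothetical; real flag systems usually store flags in a shared service or database so that every server sees the same state.

```python
class FeatureFlags:
    """In-memory flag store; a real system would persist flags somewhere shared."""

    def __init__(self):
        self._enabled = {}

    def enable(self, name):
        self._enabled[name] = True

    def disable(self, name):
        self._enabled[name] = False

    def is_enabled(self, name):
        # Unknown flags default to off, so newly deployed code stays hidden.
        return self._enabled.get(name, False)


flags = FeatureFlags()


def invoice_customer(customer, amount):
    # The new code path is deployed but only runs when the flag is on.
    if flags.is_enabled("billing_integration"):
        return f"sent {amount} for {customer} to external billing system"
    return f"created internal invoice of {amount} for {customer}"


after_code_deploy = invoice_customer("Quantisys", 100)
flags.enable("billing_integration")    # feature deployment
after_feature_deploy = invoice_customer("Quantisys", 100)
flags.disable("billing_integration")   # feature rollback, no code rollback
after_feature_rollback = invoice_customer("Quantisys", 100)
```

The same artifact is running in all three calls. Only the flag changes, which is what allows a feature to be rolled back without touching the artifact or fleet layers.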
As you think about the deployment and rollback process, you should be thinking about it in the context of these various phases of a deployment. This helps you to create isolation for failure and recovery purposes, to ensure that everything doesn’t instantly resort to an entire rollback of the fleet. Sometimes there’s a more localized solution. I’ll talk about that in the next section.
Making deployments routine affairs
The deployment process details the steps necessary to get your code into a state where your customers can use it. This involves moving code from some development or staging environment onto the production servers where customers interact with it. It can also include database-level activities to ensure that the code and the database schema are compatible with each other. Up until now I’ve described the code and the features it provides as one and the same. Deploying new code means deploying new features, but in this article the delivery of new code can be decoupled from the delivery of new features, in much the same way that a database schema change can be performed in phases versus a single big bang approach. One of the ways you make an activity routine is to do it in a controlled environment beforehand. The more you can make the environment like the real thing, the better and more comfortable you become. This all starts with our pre-production environments.
Accurate pre-production environments
Accuracy in pre-production environments is key to the confidence they inspire in subsequent steps throughout the process. Let me start by stating my definition of a pre-production environment. It is any application environment which isn’t production. Not all pre-production environments are created equal. You might have a pre-production environment where the nodes are extremely small, and the database is a smaller subset of production data, but the way the environment is configured should be the same, minus any performance-specific tuning. One thing that might differ is the number of database connection threads, because the size of your database servers will likely be radically different between production and pre-production. If you’re running Apache HTTPD 2.4.3 in production, you should be running that in pre-production as well. If you’re forcing all traffic to be TLS based in production, you should force all traffic to be TLS based in pre-production.
The more confident you are that an environment mimics production, the more confidence you have in the deployment process. Unfortunately, staging environments are often targets for cost-cutting measures. Reproducing an entire production environment can become prohibitively expensive, and people begin to ask, “What is the minimum we can do and still get value?” The staging environment becomes a shell of what’s in production, and not only with regard to hardware performance. The staging environment might be a shrunken version of the current production infrastructure, where what are eight or nine distinct application servers in production get boiled down to a single server running eight or nine different applications as separate processes. It is better than nothing but doesn’t even remotely reflect the realities of production. The focus should be on ensuring the environments are architecturally the same. By that I mean the patterns for how services are delivered should be replicated in pre-production environments, even if the size and count of servers differ. The closer you get to production in your development process, the more similar these environments should become.
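The rule of “same configuration, minus performance tuning” can be checked mechanically. The sketch below compares two environment descriptions and reports every difference that isn’t on an explicit allow-list of tuning settings. The setting names and values are invented for illustration, and how you collect them from real servers is left open.

```python
def config_drift(production, pre_production, allowed_to_differ):
    """Return settings that differ between environments and are not tuning knobs."""
    drift = {}
    for key in sorted(set(production) | set(pre_production)):
        if key in allowed_to_differ:
            continue
        # A setting missing from one environment shows up as None.
        prod_value = production.get(key)
        pre_value = pre_production.get(key)
        if prod_value != pre_value:
            drift[key] = (prod_value, pre_value)
    return drift


# Hypothetical environment descriptions.
production = {
    "httpd_version": "2.4.3",
    "force_tls": True,
    "db_connection_threads": 200,
}
staging = {
    "httpd_version": "2.4.1",
    "force_tls": False,
    "db_connection_threads": 20,
}

drift = config_drift(
    production, staging, allowed_to_differ={"db_connection_threads"}
)
print(drift)
```

Here the connection thread count is accepted as a sizing difference, while the web server version and the TLS setting are flagged as drift that undermines confidence in the environment.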
The minor differences in environments can quickly begin to add up. How do you test a rolling deployment in this environment? What if there’s an accidental assumption on the part of development that a file is accessible by two different applications? For example, let’s say your product has an application server for processing web requests and then a background processing server. During the development process, someone makes the mistake of creating a file that needs to be accessible by both the application server and the background processing server. This passes through local development without any alarms because all the processes exist on a developer’s workstation. It moves to the staging environment where, again, the different application processes exist on the same physical machine. Finally, you make the move to production and suddenly everything falls apart because the application server can access the file, but the background processing server can’t because it’s on a separate physical host!
Another example of this happening is when assumptions are made about network boundaries. You have a process that normally doesn’t connect to the database server, but for this new feature it must establish a connection. Again, you fly through local development. Staging isn’t a problem because, again, all the application components live on the same machine. The chance of encountering any sort of network or firewall rule is zero. You don’t find this out until you’ve hit production and suddenly everyone is confused when this background processing job suddenly won’t start after the new deployment because it can’t connect to the database server.
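When staging does run components on separate hosts, an assumption like this can be surfaced before production with a simple pre-flight connectivity check run from the host in question. The sketch below uses only the standard library. The function names and the idea of a `dependencies` mapping are assumptions made for illustration.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def unreachable_dependencies(dependencies, timeout=2.0):
    """Return the names of dependencies this host cannot reach."""
    return [
        name
        for name, (host, port) in dependencies.items()
        if not can_connect(host, port, timeout)
    ]
```

Run from the background processing host with a mapping such as `{"database": (db_host, db_port)}`, a non-empty result points at a firewall or routing problem before the new job ever tries to start. The check only proves that a TCP connection opens; it says nothing about credentials or permissions.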
The key to solving this is to make the staging environment mimic production as closely as possible in terms of architecture and topology. If there are multiple instances of an application service on separate machines in production, the same pattern should be replicated in the staging environment, albeit on a smaller scale in terms of the number and power of the hosts. This allows you to create a more accurate reflection of what the application will encounter and behave like in the production environment. Notice I said “more” accurate and not “completely” accurate. Even if you size your environment and match your environments to be the exact same specs as production, it will never truly mimic production. Knowing this is also a major part of making deployments routine affairs.
Staging is never exactly like production
In a lot of organizations there’s a ton of energy and effort put into making the staging environment behave exactly like production. But the unspoken truth is that production and staging are almost never alike. The biggest things missing from staging environments are the things that often cause the most problems in a system: users and concurrency.
Users are finicky creatures. They always do something the system’s designers didn’t expect. The gamut ranges from using a feature in an unexpected way to deliberately malicious behavior. I’ve seen users trying to automate their own workflows, resulting in completely nonsensical behavior at an inhuman pace and frequency. I’ve seen users trying to upload massive movie files to an application which is expecting Word documents. Users try to accomplish something unique and interesting in your system and if you don’t have real users in your staging environment, you need to be prepared for a plethora of scenarios that your system wasn’t tested against. Once you encounter one of those scenarios, you can add test cases to your regression suite, but know that this discover, repair, verify cycle is something that you’ll constantly experience in your platform. Leaning into that reality instead of constantly fighting it is the only way you’re going to make sense of why your staging environment doesn’t reveal failures in the test cycle.
Many companies try to solve this by using synthetic transactions to generate user activity.
Synthetic transactions are a great option and help make the environment more production-like, but don’t be fooled into thinking that this is a panacea for your testing woes. The same problem exists: you can’t possibly brainstorm all the potential things that an end user is going to do. You can make a best effort to catch all the cases and continuously add to the list of synthetic tests, but you’ll always be chasing user behavior. Couple this with the fact that applications are seldom finished: you’ll constantly be adding new functionality to an application, functionality which is ripe to be disrupted and/or abused by your end user population. I’m not suggesting synthetic transactions aren’t worth the effort; I’m merely preparing you for the reality of their limitations.
Concurrency is another issue that can be difficult to simulate in staging environments. By concurrency, I mean multiple activities taking place at the same time on a system: an ad-hoc report executing at the same time as a large data import, or hundreds of users all attempting to access a dashboard whose response time has increased by a single second. When doing testing it’s easy to make the mistake of testing in isolation. Testing the performance of an endpoint with a single user may yield different results than when there are tens, hundreds, or thousands of users competing for that same resource. Resource contention, database locking, semaphore access and cache invalidation rates all compound to create a different performance profile than the one you saw in testing. I can’t count how many times I’ve encountered a database query that ran relatively quickly in the staging environment, but the minute you run the same query in production, it’s competing with other queries which are hell-bent on monopolizing the database cache. The query that was being served from memory in staging must go to disk in production. Now your 2 ms query has ballooned to 50 ms, which can have a rippling impact throughout the system depending on resource demands.
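The cache effect described above can be reproduced with a toy model. The sketch below uses a small least-recently-used cache as a stand-in for a database page cache and counts how many reads have to go to “disk”. The cache size, page names and workloads are invented, and real database caches are far more sophisticated than this.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache standing in for a database page cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()

    def read(self, page):
        """Return True on a cache hit, False when the page must come from disk."""
        if page in self.pages:
            self.pages.move_to_end(page)
            return True
        self.pages[page] = True
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used page
        return False


def run_report(cache, pages):
    """Read every page the query needs and count how many were cache misses."""
    return sum(0 if cache.read(page) else 1 for page in pages)


report_pages = [f"report-{i}" for i in range(4)]

# Staging: the report query has the cache to itself.
staging = LRUCache(capacity=8)
run_report(staging, report_pages)  # first run warms the cache
staging_misses = run_report(staging, report_pages)

# Production: a large import touches many other pages between report runs.
production = LRUCache(capacity=8)
run_report(production, report_pages)
for i in range(20):
    production.read(f"import-{i}")
production_misses = run_report(production, report_pages)

print(staging_misses, production_misses)
```

The query and the cache are identical in both cases. The only difference is the competing workload, which is exactly the part that testing in isolation leaves out.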
You can try to mimic concurrency in your staging environments by doing synthetic transactions and ensuring that your staging environment is doing all the same background processing, scheduled tasks and so on, but it’s always an imperfect process. Sometimes a third-party system accesses your application haphazardly. Or there might be some background processing which is difficult to replicate because of the interaction it has with an external system. Despite our best efforts, generating concurrency runs into similar hurdles as simulating users. It’ll be a constant fight as the platform evolves. This is work which is never done. Despite that reality, it’s still a worthwhile effort, but it never answers the question “How do we make sure this never happens again?” That question is a symptom of not understanding the complexity in the systems being built and the endless number of possible scenarios that can be combined to create an incident.
In many organizations, deployments are big productions. Communication is sent out weeks ahead of time, and staff members are alerted and at the ready in case something goes wrong. The team might even create a maintenance window for the release to work without the prying eyes of customers looking to access the system. Sometimes a deployment process is detailed in pages upon pages of Word documents, with screenshots highlighting all the necessary steps that need to be taken. If you’re lucky, the document might even highlight steps that can be taken in the event something goes wrong, but those documents are rarely of any use due to the number of different ways that a deployment process can break down. Why is the deployment process fragile and feared? In a word, variability.
When you deal with software and computers, one of the silent partners in everything you do is predictability. If you can predict how a system is going to behave, you have much higher confidence in having that system perform more and more tasks. Predictability builds confidence and confidence leads to a faster cadence. Where does this variability come from? It comes from how rehearsed the process is. For example, the reason many of us have staging environments isn’t only for testing, but for the sort of dry-run rehearsal prior to the real thing. As with a stage play, repetition builds confidence. A theater group that has rehearsed a show many times feels much more comfortable on opening night than a theater group that practiced the show one time in a rough mockup of what the stage will look like for the final show. In case you missed it, this is a metaphor for many lower environments and their management.