When A Typo Can Bring Down Your Entire System, You've Got To Know Which Preventative Measures To Take
A nightmare scenario for any IT department is the critical failure of a production system, particularly during rollout of a new feature. An error during an upgrade can take an entire division of a company offline for employees and clients, sometimes resulting in serious profit loss. Even minor mistakes can cause unnecessary headaches for end-users and IT staff. Preparing and implementing a comprehensive deployment strategy should be a mission-critical initiative for any corporation that performs their own internal development and maintenance.
Even experienced companies aren’t immune to deployment problems. Consider the latest outage event for Amazon Web Services (AWS) that occurred earlier this year. The AWS team published an explanation shortly after resolving the error, stating root cause was a simple typo in executing a deployment script. This typo rendered the AWS S3 service (a cloud storage solution used by many websites for content hosting) in the US-EAST-1 region inaccessible, affecting dozens of websites and web-based applications across the Internet. Multimedia publication site The Verge reported apps such as Trello, Quora, IFTTT and GroupMe all experienced some level of outage, ranging from a mere loss of displayed images to complete site downtime. Ironically, even isitdownrightnow.com, the website that checks if a site is down, also had hosting and response issues. This isn’t to say that the event is a perfect example of poor deployment practice; in fact it’s far from it. It goes to show that even for an organization as practiced and efficient as the AWS team, something simple and easy-to-miss can occur when implementing a production-level change.
In general, the following high-level approach should be taken when standardizing production deployment practices:
1. Consider impact to integrated applications. Analyze connections between internal applications and determine acceptable downtime periods. For example, perhaps there is a continuous syncing process running between a directory and a database. Failing to account for downtime on both endpoints can result in errors or data going out-of-date. A complicated or sensitive system that requires continuous uptime may need a multi-step process involving temporary switchovers to secondary or disaster recovery environments.
2. Plot rollback steps. This step is essential no matter the deployment type: always take backups of data, applications or file systems that could be affected prior to starting a production change! Most importantly, the actual process of a system restore must be planned and tested. The last thing any company wants is to run into an error while pushing a change to a production system, attempt a rollback, then learn at that moment that the restoration process doesn’t actually work in this situation.
3. Test and automate as much of the deployment process as possible. Every manual action taken during deployment increases the risk of improper code promotion, loss of user information, or worse, unintended data changes on a large scale -- remember, a single typo caused region-wide impact for AWS S3! Any automated script must also be painstakingly scoped to follow the security principle of “least privilege”. Simply put, don’t let the process have more power or access than it needs for the situation. That single typo in the AWS example wouldn’t have been as catastrophic if the script being utilized didn’t have the ability to shut down more than the desired set of servers. Lastly, all deployment scripts must be tested in lower-level environments prior to being used in production. The best way to mitigate risk of data corruption and potential loss of revenue when making updates to production systems is to test the process in a “production-like” environment beforehand*. At the very least, deployment scripts should be run in smaller-scale environments more than once, under different circumstances, to catch any potential bugs and crashes.
Of course, not all risk is avoidable. Software and development practices change constantly and nobody’s perfect. Companies that lack manpower or “know how” to accomplish the work described above should outsource to experience professionals. The teams at Hub City Media, for example, have handled large and small scale production deployments, ranging from major upgrades of identity governance infrastructure, to network-wide implementations of single sign-on and federation, to migration and consolidation of dozens of directory systems. Any company implementing their own DevOps processes should take note of their existing infrastructure needs and differences between internal applications, prepare contingency plans for system restoration and rollbacks and, most importantly, test deployment processes to catch bugs before they make it to production.
*In all fairness, it’s probably fairly difficult for AWS to simulate their S3 production servers, but most companies also aren’t providing cloud storage services for entire regions.
Nick T. is a Senior Systems Engineer at Hub City Media. He is eagerly awaiting the rise of the machines and extends an early welcome to the new robot overlords.
"Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region." Amazon Web Services, Inc. Amazon, 02 Mar. 2017. Web. 01 Apr. 2017.
Kastrenakes, Jacob. "Amazon's Web Servers Are down and It's Causing Trouble across the Internet." The Verge. The Verge, 28 Feb. 2017. Web. 01 Apr. 2017.