3 Steps to Business Continuity and Disaster Recovery After a Ransomware Attack
The depth and sophistication of ransomware attacks on companies lately have many IT organizations scrambling for security solutions. A recent study by Recorded Future — a security firm that tracks ransomware attacks — estimated that last year, there was one ransomware attack every eight minutes, totaling 65,000 successful ransomware attacks. 1
Most of the focus on IT security threats revolves around monitoring and prevention. The IT security software space is growing rapidly. A recent survey of over 3,200 business and technology executives from Global Digital Trust 2021 finds that business leaders are working with business teams to strengthen the resilience of the organization as a whole. More than half (55%) of these companies intend to increase their cyber security budgets, with half stating that privacy and security will be incorporated into every business decision and plan. 2
Any IT professional who has spent time in information security will tell you that protecting networks is a never-ending game of cat and mouse in which attackers typically have the upper hand. 3
Cyber security company Veritas states that an organization is hit with an attack every 11 seconds. 4 Many IT organizations overfocus on prevention and typically have a fairly defensible incident response for one-off attacks, such as a virus on the network. However, when it comes to ransomware, natural disasters, or other catastrophic events, very few organizations invest in recovery.
Most IT organizations have disaster recovery solutions in place. These solutions consist of a combination of software and process to recover systems and data. However, many of these solutions are rarely tested and limited in scope to just IT’s processes to bring servers back online. 5
In my 25-year career, I presided over one disaster recovery incident. While we survived the incident and brought systems online, it came at a significant cost. Our systems were offline for 5 days, and after 30 days of tedious work we were able to do only a partial restore of data. Because we had not run periodic “Game Day Simulations” (discussed later), our backups were incomplete, leading to a permanent loss of 2 weeks of historical (and important) data.
Shortly afterward, I was fortunate enough to be charged with developing a Business Process Continuity (BPC) plan. A BPC plan is a comprehensive document that incorporates disaster recovery (DR). Often called BPC/DR, these plans view IT system availability in the context of keeping a business running. A solid BPC/DR plan provides a clear path to a coordinated recovery of IT systems and ensures continued business operations. 6
The following sections describe the 3 critical steps any company should take to implement their own BPC/DR plan.
Develop a Business Continuity Plan Document
The first step in recovery from a ransomware attack or any disaster is for IT organizations to develop a Business Process Continuity and Disaster Recovery Plan (BPC/DR) document. This document is the critical first step that outlines the who, what, where, and when of responding to a disaster event. Drafting a BPC/DR plan is not difficult — it just takes time, organization, and attention to detail.
Many templates and how-to guides for writing a BPC/DR document are available online. In its simplest form, a BPC/DR document typically has the following sections:
- Address and contact information for all physical locations
- A table of insurance policy numbers and insurer contact information
- An employee notification and communication tree
- Designated representatives to speak on behalf of the company
- A list of all IT services ranked by importance to business continuity
- A Service Recovery Template for each IT service consisting of:
  - Server information - Hardware, OS, IP, DNS Name, Location
  - Application information - Software version, access instructions
  - Recovery Time Objective (RTO) - the maximum acceptable time to restore the service
  - Recovery Point Objective (RPO) - the maximum amount of data, measured by time, that can be lost
  - Set of steps to be taken by IT staff to restore the service
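To make the Service Recovery Template concrete, here is a minimal Python sketch of one way to capture it as structured data and to rank services by importance. The class name, field names, and the sample services are all hypothetical and chosen only for illustration; a real plan would carry whatever fields the organization needs.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class ServiceRecoveryTemplate:
    """One entry per IT service in the BPC/DR document (illustrative fields)."""
    name: str
    priority: int              # 1 = most important to business continuity
    dns_name: str
    ip_address: str
    location: str
    software_version: str
    rto: timedelta             # Recovery Time Objective
    rpo: timedelta             # Recovery Point Objective
    restore_steps: list[str] = field(default_factory=list)


def recovery_order(services):
    """Return services sorted so the most business-critical come first."""
    return sorted(services, key=lambda s: s.priority)


# Hypothetical entries; every value below is made up for illustration.
services = [
    ServiceRecoveryTemplate(
        name="intranet-wiki", priority=3, dns_name="wiki.example.com",
        ip_address="10.0.2.15", location="DC-East", software_version="1.0",
        rto=timedelta(hours=48), rpo=timedelta(hours=24),
        restore_steps=["Provision VM", "Restore database backup", "Verify login"],
    ),
    ServiceRecoveryTemplate(
        name="order-processing", priority=1, dns_name="orders.example.com",
        ip_address="10.0.1.10", location="DC-East", software_version="4.2",
        rto=timedelta(hours=4), rpo=timedelta(minutes=15),
        restore_steps=["Fail over database", "Start application", "Run smoke test"],
    ),
]

for svc in recovery_order(services):
    print(f"{svc.priority}. {svc.name}: RTO {svc.rto}, RPO {svc.rpo}")
```

Keeping the templates in a structured form like this, rather than only in prose, makes it easier to sort services by priority and to check the RTO and RPO values during later testing.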
The part of this plan that takes the most work is developing the employee notification and communication tree. The tree defines the chain of notification throughout the organization: it names the managers responsible for a specific set of information and sets out a clear process for communicating that information through the company. One of the biggest challenges in dealing with a BPC/DR event is correct information flow. It is critical that the identified managers are trained on the notification tree and on the specific information they should share down the tree with their reports.
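A notification tree can be sketched as a simple mapping from each manager to the people they are responsible for informing. The following Python example is only an illustration: the roles in the tree and the function name are hypothetical, and a real tree would hold names and contact details.

```python
from collections import deque

# Hypothetical notification tree: each role maps to the people it must notify.
NOTIFICATION_TREE = {
    "CEO": ["CIO", "COO"],
    "CIO": ["IT Operations Manager", "Security Manager"],
    "COO": ["Facilities Manager"],
    "IT Operations Manager": [],
    "Security Manager": [],
    "Facilities Manager": [],
}


def notification_order(tree, root):
    """Walk the tree level by level, returning (notifier, recipient) pairs."""
    calls = []
    queue = deque([root])
    while queue:
        manager = queue.popleft()
        for report in tree.get(manager, []):
            calls.append((manager, report))
            queue.append(report)
    return calls


for notifier, recipient in notification_order(NOTIFICATION_TREE, "CEO"):
    print(f"{notifier} notifies {recipient}")
```

Walking the tree level by level mirrors how information should flow in an incident: each manager hears the message from exactly one person before passing it on to their own reports.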
Implement Disaster Recovery Automation
The second step in BPC/DR is to implement an automated failover software solution. Disaster Recovery projects are on every IT team’s backlog of items that “need to get done”. However, higher-priority projects always trump DR projects, which only builds more technical debt for the company.
Some IT organizations have bent a backup and restore software solution into a minimally viable DR solution. This very common approach consists of a hodgepodge of backup software, scripts that replicate between datacenters, and a runbook on how to restore data. These solutions “check the boxes” in many ways on a DR plan, but everyone on the IT team fears the day they would actually be called on to execute it. While IT automation has been around forever, recent gains in automation development make it a very viable solution for a BPC/DR plan. Specifically, Red Hat Ansible Automation has developed quite an ecosystem of vendor- and user-contributed content that makes it an extremely viable solution to be the “brains” of a BPC/DR solution.
By design, Ansible abstracts all of the automation functions out of the automation process via Ansible content and collections. Between Ansible Tower, Ansible Galaxy, and Red Hat Ansible Automation Hub, Red Hat provides prebuilt certified automation functionality to support data center stalwart vendors including Microsoft, F5, Juniper, Cisco, NetApp, Palo Alto Networks, and VMware. Using a simple language, system administrators and developers alike simply “assemble” automation via Ansible Playbooks.
In the simplest terms, a BPC/DR solution using Ansible Tower will accomplish the following:
- Fail over the primary Ansible Tower to the DR Ansible Tower
- Automate the reconfiguration of perimeter network firewall, routing, switching, and naming services
- Re-configure warm stateful standby servers and applications with the correct production network and naming services
- Instantiate new instances of stateless servers and applications
- Perform health checks across all systems across the DR site
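The ordering of the steps above matters: each one depends on the one before it. The Python sketch below illustrates that sequencing only. It is not Ansible code and does not call any Ansible Tower API; the `stub` function stands in for whatever automation a team actually launches at each step, and all names are hypothetical.

```python
from typing import Callable, List, Tuple

# Each step is a (description, action) pair; an action returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_failover(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failure, so later steps
    never act on a half-configured DR site. Returns the completed steps."""
    completed = []
    for description, action in steps:
        if not action():
            print(f"FAILED: {description}; halting failover")
            break
        print(f"ok: {description}")
        completed.append(description)
    return completed


def stub() -> bool:
    """Stand-in for real automation; always reports success."""
    return True


# Assumed mapping of the five steps to placeholder actions.
DR_STEPS: List[Step] = [
    ("Fail over primary Ansible Tower to DR Ansible Tower", stub),
    ("Reconfigure perimeter firewall, routing, switching, and naming", stub),
    ("Reconfigure warm stateful standby servers and applications", stub),
    ("Instantiate stateless servers and applications", stub),
    ("Run health checks across the DR site", stub),
]

completed = run_failover(DR_STEPS)
print(f"{len(completed)} of {len(DR_STEPS)} steps completed")
```

Halting at the first failed step is a deliberate choice in this sketch: bringing up applications on top of a network that did not reconfigure correctly tends to make recovery harder to diagnose.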
The ideal scenario is one in which an IT organization has already made an investment in Ansible Tower in its production infrastructure and has all or a subset of installation, configuration, orchestration, and day 2 operations automated across all devices. The greater the existing investment in production Ansible, the easier the BPC/DR strategy is to attain.
Execute Game Day Simulations
Game Day simulations test the BPC/DR process in a controlled environment. These simulations are absolutely critical for obvious reasons, not the least of which is to ensure that the business can follow the BPC/DR plan and the IT team’s configuration of the DR solution actually works. In addition to the technical execution of the DR solution, the IT team also measures whether or not the RTO and RPO objectives in the BPC/DR plan are accurate and attainable.
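Checking RTO and RPO after a simulation comes down to simple time arithmetic. The sketch below, in Python, shows one way to do it; the function name, its parameters, and the sample timestamps are hypothetical and exist only to illustrate the comparison.

```python
from datetime import datetime, timedelta


def evaluate_game_day(outage_start, service_restored, last_good_backup, rto, rpo):
    """Compare what a simulation achieved against the plan's objectives."""
    recovery_time = service_restored - outage_start      # measured against RTO
    data_loss_window = outage_start - last_good_backup   # measured against RPO
    return {
        "recovery_time": recovery_time,
        "data_loss_window": data_loss_window,
        "rto_met": recovery_time <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Made-up timestamps for a single simulated outage.
result = evaluate_game_day(
    outage_start=datetime(2024, 1, 10, 9, 0),
    service_restored=datetime(2024, 1, 10, 15, 30),
    last_good_backup=datetime(2024, 1, 9, 22, 0),
    rto=timedelta(hours=4),
    rpo=timedelta(hours=12),
)

for key, value in result.items():
    print(f"{key}: {value}")
```

In this made-up run the service took 6.5 hours to restore against a 4-hour RTO, so the RTO was missed, while the 11-hour gap since the last good backup fits within the 12-hour RPO. A result like this tells the team either to speed up the restore or to revise the objective in the plan.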
The intensity of Game Day simulations is ultimately up to each IT department. At the simple end, a Game Day simulation can be a “Paper Simulation”, which, as the name implies, is a whiteboard exercise in which teams walk through and discuss the steps they would take to implement the BPC/DR plan. Beyond paper Game Day simulations, IT organizations can plan for real simulations involving the failover of production applications. For a simple reference on how to run a Game Day simulation, check out this procedure from the AWS Well-Architected Framework documentation.
Whether paper or real, Game Day simulations will provide the required experience and confidence for IT teams to execute business continuity and disaster recovery when catastrophic events like ransomware attacks happen.
Conclusion
Implementing a Business Continuity and Disaster Recovery plan is not difficult. While many businesses and IT teams have all the required budget, time, and expertise, there needs to be focus. By broadening information security initiatives to treat recovery as equal in importance to prevention, companies will be much better positioned to withstand high-profile ransomware attacks and natural disasters. Developing a plan, leveraging ubiquitous automation tools like Red Hat Ansible, and running Game Day simulations greatly increase the chances of a company recovering from such events within hours instead of days or weeks. If you are feeling overwhelmed and unsure where to start, Stone Door Group offers our Disaster Recovery with Ansible Tower Accelerator℠, which is a turnkey software and services offering that delivers all the required automation frameworks to successfully recover from ransomware attacks. To get an overview of our solution, register here for a live virtual demonstration of using Ansible Tower for disaster recovery.
About the Author
Darren Hoch is a Managing Partner for Stone Door Group, an IBM and Red Hat Apex partner that specializes in enterprise DevOps cloud engineering solutions. Stone Door Group helps enterprises of all sizes with their digital transformation initiatives. To talk to Darren and learn more about how Stone Door Group can help, drop us a line at: letsdothis@stonedoorgroup.com.
Sources: