Trilogix Cloud

Cloud Management Systems and Disaster Recovery Best Practices


A key objective of a cloud management system, indeed of a Cloud platform in general, is to provide the customer a portal (usually web-based) to order cloud services, track billing, auditing of Cloud resources and API usage; and automatically provision services that they order. Sophisticated cloud management systems will not only provision services based on customer orders, but also can automatically update network and datacenter monitoring and management systems whenever a new VM or software application is created.



Modern datacenters and particularly cloud services require careful consideration and flexible options for redundancy, high availability, continuity of operations, and disaster recovery, including:


Redundancy is achieved through a combination of hardware and/or soft- ware with the goal of ensuring continuous operation even after a failure. Should the primary component fail for any reason, the secondary systems are already online and take over seamlessly.

Examples of redundancy are multiple power and cooling modules within a server, a RAID-enabled disk system, or a secondary network switch running in standby mode to take over if the primary network switch fails.

High availability

High availability (HA) is the concept of maximizing system uptime to achieve as close to 100% availability as possible. HA is often measured by how much time the system is online versus unscheduled outages—usually shown as a percentage of uptime over a period of time. Goals for cloud providers and customers consuming cloud services are often in the range of 99.95% uptime per year. The SLA will determine what the cloud provider is guaranteeing and what outages, such as routine maintenance, fall out- side of the uptime calculation.

If you expect 99.99% uptime, this means four minutes of maximum outage per month. Many VMs, OSs, and applications will take longer than this just to boot up so HA configurations are necessary to achieve higher uptime requirements.

You can achieve the highest possible availability through various networking, application, and redundant server techniques, such as the following:

  • Secondary systems (e.g., physical or VMs) running in parallel to the primary systems—these redundant servers are fully booted and running all applications—ready to assume the role of the primary server if it were to fail. The failover from primary to secondary is instantaneous, and causes no outages nor does it have an impact on the customer.
  • Using network load balancers in front of servers or applications. The load balancer will send users or traffic to multiple servers to maximize performance by splitting the workload across all available servers. The servers that are fed by the load balancer might be a series of frontend web or application servers. More advanced load-balancing systems can even sense slow performance of their downstream servers and rebalance traffic to other servers to maintain a performance SLA (not just an availability SLA).
  • Deploy clustered servers that both share storage and applications, but can take over for one another if one fails. These servers are aware of each other’s status, often sending a heartbeat or “are you OK?” traffic to each other to ensure everything is online.
  • Applications specifically designed for the cloud normally have resiliency built in. This means that the applications are deployed using multiple replicas or instances across multiple servers or VMs; the application continues to service end users even if one or more servers fail.

The below figure is an example of an HA scenario. In this example, a VM has failed and secondary VMs are running and ready to immediately take over operations. This configuration has two redundant servers—one in the same datacenter on a separate server blade and another in the secondary datacenter. Failing-over to a server within the same datacenter is ideal and the least likely to impact customers. The redundant servers in the secondary datacenter can take over primary operations should multiple servers or the entire primary datacenter experience an outage.

high availability failover

Figure  Example of HA: a failover

Continuity of operations

Continuity of operations (CoO) is the concept of offering services even after a significant failure or disaster. In the real world, CoO refers to a broader range of keeping your entire service online after a significant failure or disaster. This would typically involve failing-over to a secondary datacenter should the primary datacenter become unavailable or involved in a disaster. The network infrastructure, server farms, storage, and applications at the secondary datacenter are roughly the same as those in the primary, and most important, the data from the primary datacenter is always being replicated to the secondary.

Disaster recovery

Similar to continuity of operations, disaster recovery (DR) is both a technology issue and a logistics challenge. Usually clients will have two or more datacenters so that you can failover between the primary and secondary in the event of a disaster. If your cloud system design is more advanced, you might have three or more datacenters, all peers of one another, with none of them acting as the “prime” center. If an outage occurs at one, the others immediately take over the customer load without a hiccup (essentially, load balancing between datacenters).

A DR plan is similar to a CoO plan, but with one important addition. As a result of your CoO plan and technologies, you can continue to service your customers or internal end users. A DR plan also includes how to rebuild the datacenter, server farms, storage, network, or any portion that was damaged by the disaster event. In the case of a total datacenter loss, the DR plan might contain strategies to build a new datacenter in another location, or lease space from an existing datacenter and purchase all new equipment. The recovery, in this worst-case scenario, might take several months to resume normal operations.

The figure below presents a scenario in which the primary datacenter has failed, lost connectivity, or is otherwise entirely unavailable to handle production customers. Due to the replication of data (through the SAN in this exam- ple) and redundant servers, all applications and operations are now shifted to the secondary datacenter. A proper CoO plan not only allows for the fail- over from the primary to secondary datacenter(s), but also documents a plan to either switch back all operations to the first datacenter after the problem has been resolved. Finally, as previously stated, the CoO plan should include procedures to make the secondary datacenter the new permanent primary datacenter should the first (failed) datacenter be unrecoverable.

disaster recovery scenarios

Leave a Comment

Your email address will not be published. Required fields are marked *