In today’s cloud-focused world, many technologies and products are available for building a solid disaster recovery (DR) solution. For private clouds based on VMware vSphere, you could consider VMware Site Recovery Manager (SRM).
Many existing DR systems are inadequate for reasons such as:
- Little or no orchestration
- DR testing that is too painful or inadequate
- Recovery point objectives (RPO) and recovery time objectives (RTO) that are not met
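To make the last point concrete, here is a minimal Python sketch of checking a DR test result against its objectives. RPO (recovery point objective) bounds how much data you can afford to lose, and RTO (recovery time objective) bounds how long recovery may take. The `DrTestResult` record, the `unmet_objectives` helper, and all the figures are hypothetical illustrations, not part of any VMware product.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class DrTestResult:
    """Hypothetical record of one DR test for one application."""
    app: str
    data_loss: timedelta      # age of the newest recovered data (measured against RPO)
    recovery_time: timedelta  # time until the service was restored (measured against RTO)


def unmet_objectives(result, rpo_target, rto_target):
    """Return the names of the objectives that the test result missed."""
    missed = []
    if result.data_loss > rpo_target:
        missed.append("RPO")
    if result.recovery_time > rto_target:
        missed.append("RTO")
    return missed


# Made-up example: 30 minutes of data lost, 5 hours to recover.
result = DrTestResult("payroll", timedelta(minutes=30), timedelta(hours=5))
print(unmet_objectives(result, rpo_target=timedelta(minutes=15),
                       rto_target=timedelta(hours=4)))
```

With these made-up numbers, both objectives are missed, which is the kind of finding a DR test report should surface.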
Your solution should address your unique use cases, goals, and requirements. For example, do you need DR only in the event of a full data center failure, or also for the failure of a key application or part of a data center? Here are some common challenges:
- Accommodating complex, sensitive applications
- Establishing a production-ready recovery site
- Disaster mitigation, failback, and non-disruptive DR testing
- Monitoring and DR test reporting
Your first step is to create a conceptual design of your anticipated solution, which you can provide to potential vendors who want to pitch you a DR solution. The design captures your goals, requirements, constraints, personas, and operations. It may also contain a diagram, like this example.
|TIP: Be sure to include your requirements for DR testing in your conceptual design. The requirements probably involve non-disruptive testing, specific personas, specific application-level testing, specific test criteria, and frequency of tests.|
VMware Site Recovery Manager (SRM) Solution
SRM is an extension to VMware vCenter that provides DR orchestration, site migration and non-disruptive testing. It is integrated with the vSphere Client. SRM can be a key component in a DR solution.
|TIP: An SRM-based DR solution may or may not be the right fit for you. Use your conceptual design to evaluate how a potential SRM solution addresses all your requirements.|
SRM integrates with supported storage array-based replication and VMware vSphere Replication to automate the process of DR.
In addition to DR, other SRM use cases include migration, DR testing, disaster avoidance, failback, and upgrade / patching.
New features in SRM 8.1 and 8.3 include:
- HTML 5 interface (Clarity UI)
- Flexibility with SRM and vCenter Server versions. See the SRM Compatibility Matrices for details.
- Increased maximums
- SRM Virtual Appliance option
- Support for VVOLs
SRM is deployed in a paired configuration, with an SRM Server, vCenter Server, and ESXi hosts at each site. SRM utilizes vSphere Replication, array-based replication, or stretched storage for transferring data between sites. The choice between replication types may be based on many factors.
|TIP: Many SRM-based DR solutions include both array-based replication and vSphere Replication. They may also include application-based replication, where failover may be managed outside of SRM.|
Array-based replication requires Storage Replication Adapters (SRAs) to be installed on the SRM servers. The SRAs enable SRM to manage (monitor, stop, start, and reverse) the replication and to manage storage-based snapshots.
vSphere Replication, which replicates VMs and their virtual disks, involves virtual appliances at each site running the vSphere Replication Management Service (VRMS) and the vSphere Replication Service (VRS).
The following diagram illustrates the key components in an SRM-based DR solution.
In vSphere Replication, the hypervisor uses a VR agent and a vSCSI filter to capture writes to a VM’s virtual disk, compress the data, and ship it over a designated network to a VRS service at the recovery site. The VRS service sends the compressed data to an ESXi host via NFC (network file copy), and that host decompresses the data and writes it to disk.
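The capture, compress, ship, and apply flow described above can be illustrated with a small Python sketch. This is a conceptual toy, not how vSphere Replication is implemented: the function names are hypothetical, a `bytearray` stands in for the replica virtual disk, and `zlib` stands in for whatever compression the product actually uses.

```python
import zlib


def capture_and_compress(dirty_blocks):
    """Source side: compress each changed block before it crosses the network.

    dirty_blocks is a list of (offset, data) pairs for writes captured on the
    protected VM's disk (hypothetical representation).
    """
    return [(offset, zlib.compress(data)) for offset, data in dirty_blocks]


def decompress_and_apply(replica_disk, shipped_blocks):
    """Target side: decompress each block and write it at the same offset."""
    for offset, payload in shipped_blocks:
        data = zlib.decompress(payload)
        replica_disk[offset:offset + len(data)] = data


# Made-up example: two 4-byte writes replicated to a 16-byte replica disk.
replica = bytearray(16)
writes = [(0, b"ABCD"), (8, b"WXYZ")]
decompress_and_apply(replica, capture_and_compress(writes))
print(bytes(replica))
```

Only the changed blocks travel between sites, and they travel compressed; the rest of the replica disk is left untouched.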
By associating multiple SRM servers with a single vCenter Server, you can use SRM to accommodate a shared recovery site or multiple peer sites.
You can use SRM to set mappings for resources, folders, and networks. SRM uses these mappings to place recovered VMs properly. The sample screenshot shows a common approach, where the source and target networks are mapped using identical names.
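Conceptually, each mapping is a lookup table from a protected-site object to a recovery-site object. The Python sketch below models that idea with plain dictionaries; the object names and the `place_recovered_vm` helper are hypothetical and do not reflect the SRM API.

```python
# Hypothetical inventory mappings: protected-site name -> recovery-site name.
# The network mappings use identical names on both sides, as in the common
# approach described above.
mappings = {
    "network": {"Prod-Web": "Prod-Web", "Prod-DB": "Prod-DB"},
    "folder": {"Finance": "Finance-Recovered"},
    "resource_pool": {"Tier1": "DR-Tier1"},
}


def place_recovered_vm(vm, mappings):
    """Translate a VM's protected-site placement into recovery-site placement."""
    placement = {"name": vm["name"]}
    for kind, table in mappings.items():
        source = vm[kind]
        if source not in table:
            # An unmapped object means the VM cannot be placed automatically.
            raise LookupError(f"{vm['name']}: no {kind} mapping for {source!r}")
        placement[kind] = table[source]
    return placement


vm = {"name": "web01", "network": "Prod-Web",
      "folder": "Finance", "resource_pool": "Tier1"}
print(place_recovered_vm(vm, mappings))
```

Raising an error for an unmapped object is a design choice in this sketch; it mirrors the point that a missing mapping is something you want to discover before a real failover, not during one.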
You can configure VM protection groups to use in recovery plans. Within a recovery plan, you can use priority groups and dependencies to establish a startup order.
The example recovery plan begins by pre-synchronizing storage, shutting down VMs, and synchronizing storage again. Part of the primary data center may still be active, so the plan attempts a final data synchronization and a graceful VM shutdown if it can. Either way, it moves on to steps at the recovery site, such as bringing hosts out of standby, suspending non-critical VMs, preparing virtual disks, and starting VMs by priority group.
You use the Run button to perform actual failovers or migrations. You use the Test button to run the plan in a non-disruptive manner, using dedicated networks and (VM- or storage-based) snapshots. When you are finished with a DR test, you run Cleanup.
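To show how priority groups and dependencies combine into a startup order, here is a minimal Python sketch. It assumes that lower priority numbers start first and that a dependency is honored only between VMs in the same priority group; check the SRM documentation for the exact rules in your version. The VM names and the `startup_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical recovery plan: VM name -> priority group and dependencies.
vms = {
    "ad01":  {"priority": 1, "depends_on": []},
    "db01":  {"priority": 1, "depends_on": ["ad01"]},
    "app01": {"priority": 2, "depends_on": []},
    "web01": {"priority": 3, "depends_on": []},
    "web02": {"priority": 3, "depends_on": ["web01"]},
}


def startup_order(vms):
    """Order VMs by priority group, then by dependencies within each group."""
    order = []
    for priority in sorted({vm["priority"] for vm in vms.values()}):
        group = {name for name, vm in vms.items() if vm["priority"] == priority}
        # Map each VM to the VMs it waits for, ignoring cross-group dependencies
        # (assumption: those are already satisfied by the priority ordering).
        graph = {
            name: [dep for dep in vms[name]["depends_on"] if dep in group]
            for name in sorted(group)
        }
        order.extend(TopologicalSorter(graph).static_order())
    return order


print(startup_order(vms))  # ['ad01', 'db01', 'app01', 'web01', 'web02']
```

A circular dependency inside a group makes `TopologicalSorter` raise `graphlib.CycleError`, which is a reasonable way for a sketch like this to flag a plan that can never start.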
You may want to consider one of these next steps.
- Review SRM product details: https://www.vmware.com/products/site-recovery-manager.html
- VMware Hands-on Lab: HOL-2005-01-SDC
- Proof of Concept Testing: https://storagehub.vmware.com/t/site-recovery-manager-3/srm-evaluation-guide/
- VMware Professional Services: https://www.vmware.com/professional-services.html
- Reach out to me on Twitter: @johnnyadavis
Conceptual Design: https://vloreblog.com/2020/04/17/conceptual-designs/
SRM 8.1 Technical Overview: https://bit.ly/2O8l7Op