Disaster Recovery Drills for Consul on ECS
How to verify that Consul is in a normal state
- The nodes query returns the expected number of nodes.
- Insert a value on the first node, then check that the same value can be read from the second node (both checks are sketched below).
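A minimal sketch of these two checks, assuming each node runs a local Consul agent exposing the default HTTP API on port 8500; the node addresses, the expected node count, and the test key are placeholders for this drill environment.

```python
import requests

NODE1 = "http://10.0.0.11:8500"   # placeholder: HTTP API address of the first server
NODE2 = "http://10.0.0.12:8500"   # placeholder: HTTP API address of the second server
EXPECTED_NODES = 3                # placeholder: number of nodes the cluster should report

def check_node_count(agent_url: str) -> None:
    """Query the catalog and confirm the expected number of nodes is registered."""
    nodes = requests.get(f"{agent_url}/v1/catalog/nodes").json()
    assert len(nodes) == EXPECTED_NODES, f"expected {EXPECTED_NODES} nodes, got {len(nodes)}"

def check_kv_replication() -> None:
    """Write a key through the first node and read the same value back through the second."""
    requests.put(f"{NODE1}/v1/kv/dr-drill/ping", data=b"pong").raise_for_status()
    value = requests.get(f"{NODE2}/v1/kv/dr-drill/ping?raw").text
    assert value == "pong", f"unexpected value: {value!r}"

if __name__ == "__main__":
    check_node_count(NODE1)
    check_kv_replication()
    print("cluster looks healthy")
```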
Drill 1: single node failure
- Stop an EC2 instance and verify that the Auto Scaling group kicks in and spawns a new instance.
- Verify that the new instance joins the existing cluster and the cluster reaches a stable state.
- Go inside the container and verify that the new Consul instance runs the correct version, to cover the rolling-upgrade case (see the version check sketched below).
- Restart the instance stopped in the first step; if it joins the new stable cluster, force-remove it afterwards. Otherwise, force-remove it directly.
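A sketch of the post-replacement checks for this drill, assuming the HTTP API is reachable on a live server agent; `EXPECTED_VERSION` and `OLD_NODE_NAME` are placeholders. The version is read from the build tag each member advertises, and the stopped node is removed through the agent force-leave endpoint.

```python
import requests

AGENT = "http://127.0.0.1:8500"      # placeholder: any live server agent
EXPECTED_VERSION = "1.15"            # placeholder: version expected after the rolling upgrade
OLD_NODE_NAME = "consul-server-old"  # placeholder: node name of the instance stopped earlier

def check_member_versions() -> None:
    """Confirm every alive member advertises the expected Consul build."""
    members = requests.get(f"{AGENT}/v1/agent/members").json()
    for member in members:
        if member["Status"] == 1:  # 1 == alive
            build = member["Tags"].get("build", "")
            assert build.startswith(EXPECTED_VERSION), f"{member['Name']} runs {build}"

def force_remove_old_node() -> None:
    """Force-remove the old node once the cluster is stable again."""
    requests.put(f"{AGENT}/v1/agent/force-leave/{OLD_NODE_NAME}").raise_for_status()
```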
Drill 2: loss of quorum - multiple instances down
- Reboot all server instances.
- ECS should kick in so that all servers restart and rejoin the same cluster successfully (the readiness check sketched below can confirm this).
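A readiness check for this drill, sketched under the assumption that the HTTP API is reachable on one of the rebooted servers; `EXPECTED_PEERS` is a placeholder for the server count.

```python
import time
import requests

AGENT = "http://127.0.0.1:8500"  # placeholder: any rebooted server agent
EXPECTED_PEERS = 3               # placeholder: number of Consul servers

def wait_for_quorum(timeout: int = 300) -> None:
    """Poll until a leader is elected and all raft peers have rejoined."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            leader = requests.get(f"{AGENT}/v1/status/leader").json()
            peers = requests.get(f"{AGENT}/v1/status/peers").json()
            if leader and len(peers) == EXPECTED_PEERS:
                print(f"leader={leader}, peers={peers}")
                return
        except requests.ConnectionError:
            pass  # the agent may still be coming back up
        time.sleep(5)
    raise TimeoutError("cluster did not stabilise within the timeout")

if __name__ == "__main__":
    wait_for_quorum()
```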
Drill 3: loss of quorum - complete outage
- Stop all but one server. The ASG will kick in and spawn new instances, but the cluster has lost quorum and is in a corrupted state, so any rejoin will fail.
- Stop all newly spawned instances except the original surviving one.
- Open question: change -bootstrap-expect? Note that for the initial bootstrap the -bootstrap-expect flag is definitely required, and the ASG desired capacity also needs to be taken down.
- Go to the -data-dir of each server and add a raft/peers.json file inside the raft/ directory. Find the server ID in the node-id file of the data directory. This file should include ONLY the surviving node (see the sketch after this list).
- Reboot the only surviving server. Make sure the log shows that peers.json was read correctly.
- Go into the new instances and manually join each node to the cluster.
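A sketch of the peers.json step, assuming Raft protocol version 3 (the entry format with id/address/non_voter) and placeholder values for the data directory and the surviving server's RPC address.

```python
import json
import os

DATA_DIR = "/consul/data"            # placeholder: value passed to -data-dir
SERVER_RPC_ADDR = "10.0.0.11:8300"   # placeholder: surviving server's Raft/RPC address

# The server ID lives in the node-id file at the root of the data directory.
with open(os.path.join(DATA_DIR, "node-id")) as f:
    node_id = f.read().strip()

# peers.json must list ONLY the surviving node (Raft protocol v3 entry format).
peers = [{"id": node_id, "address": SERVER_RPC_ADDR, "non_voter": False}]

with open(os.path.join(DATA_DIR, "raft", "peers.json"), "w") as f:
    json.dump(peers, f, indent=2)
```

After the surviving server is rebooted and its log confirms that peers.json was loaded, the new instances can be joined manually from inside their containers, e.g. with `consul join <surviving-server-ip>`.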