DR Drills with AWS Aurora

Objectives

Verify that our services running on Aurora still perform within SLA when the Aurora service is degraded.
Build tools and procedures for such drills, so that we can repeat them against other services on different platforms.

Note

All drills should run under our performance-testing load, i.e., with new traffic arriving throughout.
Upon an injected failure, Aurora just restarts in place instead of failing over. This means the service waits until the master recovers.
Between AZs, the replica lag is around 20 ms. However, this is only lag in the replica's cache: because Aurora uses a shared-disk design, the data on storage is always consistent.
Failover typically takes 60-120 seconds. This means most connections will time out during failover.
Aurora can also simulate disk failure and disk congestion, but drilling on these brings debatable additional value until we gain more experience with them.

Crash dispatcher on writer

master: create db instance
read replica: Read replica has been disconnected from master. Restarting Mysql => create db instance

Crash instance on writer

master: DB instance restarted
read replica: DB instance restarted

Crash node on writer

master: DB instance restarted
read replica: Read replica has fallen behind the master too much. Restarting Mysql => DB instance restarted
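
The three crash tests above map onto Aurora MySQL's fault injection query ALTER SYSTEM CRASH, which accepts INSTANCE, DISPATCHER, or NODE. A minimal sketch of how a drill tool might issue it is shown here. The function names are hypothetical, and `connection` is assumed to be any open Python DB-API connection to the writer (the driver is not specified in these notes).

```python
# Crash types accepted by Aurora MySQL's ALTER SYSTEM CRASH fault injection query.
CRASH_TYPES = ("INSTANCE", "DISPATCHER", "NODE")


def crash_statement(crash_type: str = "INSTANCE") -> str:
    """Return the fault injection SQL for the given crash type."""
    normalized = crash_type.strip().upper()
    if normalized not in CRASH_TYPES:
        raise ValueError(f"unknown crash type: {crash_type!r}")
    return f"ALTER SYSTEM CRASH {normalized};"


def inject_crash(connection, crash_type: str = "INSTANCE") -> None:
    """Run the crash statement on an open DB-API connection (assumed: the writer).

    The connection is likely to drop while the instance restarts, so callers
    should be prepared for a driver-specific connection error at this point.
    """
    cursor = connection.cursor()
    cursor.execute(crash_statement(crash_type))
```

Keeping the statement builder separate from the execution step lets the tool log or dry-run the exact SQL before anything is crashed.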

Failover

old master:

a. Started cross AZ failover to DB instance
b. A new writer was promoted. Restarting database as a reader.
c. DB instance restarted
d. Completed failover to DB instance

new master:

a. Started cross AZ failover to DB instance
b. DB instance shutdown
c. DB instance restarted
d. Completed failover to DB instance
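
The event sequences above could be checked automatically after each drill. The sketch below assumes a boto3 RDS client is passed in as `rds_client`, that the instance events begin with the text recorded above (real messages may carry a suffix such as the instance name), and that one page of `describe_events` results is enough. The function and constant names are hypothetical.

```python
# Expected event prefixes, copied from the sequences recorded in these notes.
EXPECTED_OLD_MASTER_EVENTS = [
    "Started cross AZ failover to DB instance",
    "A new writer was promoted. Restarting database as a reader.",
    "DB instance restarted",
    "Completed failover to DB instance",
]
EXPECTED_NEW_MASTER_EVENTS = [
    "Started cross AZ failover to DB instance",
    "DB instance shutdown",
    "DB instance restarted",
    "Completed failover to DB instance",
]


def fetch_event_messages(rds_client, instance_id, minutes=30):
    """Return event messages for one instance, oldest first (pagination ignored)."""
    response = rds_client.describe_events(
        SourceIdentifier=instance_id,
        SourceType="db-instance",
        Duration=minutes,
    )
    events = sorted(response["Events"], key=lambda event: event["Date"])
    return [event["Message"] for event in events]


def events_in_order(observed, expected_prefixes):
    """True if each expected prefix matches some observed message, in order.

    Unrelated events in between are tolerated; only the relative order of the
    expected ones matters.
    """
    remaining = iter(observed)
    return all(
        any(message.startswith(prefix) for message in remaining)
        for prefix in expected_prefixes
    )
```

A drill report could then call `events_in_order(fetch_event_messages(rds_client, old_master_id), EXPECTED_OLD_MASTER_EVENTS)` and flag the run if it returns False.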

Schedule

These drills should happen once a month and before each major version release.
Each drill should start during low-traffic hours, e.g., 2 a.m. local time.

Drill 1: Failover

Ensure traffic is flowing through our service, either through traffic replication or a load-testing tool.
Fail over the current writer to a reader in a different AZ.
During the failover, the service health check should remain OK the whole time.
During the failover, write failures are expected, but read failures should not happen.
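
As a sketch of the failover step in Drill 1, the code below picks a reader in a different AZ from the current writer and triggers the failover. It assumes a boto3 RDS client is passed in as `rds_client`; the function names and the cluster identifier are placeholders, not existing tooling.

```python
def pick_failover_target(rds_client, cluster_id):
    """Return the identifier of a reader that lives in a different AZ than the writer."""
    cluster = rds_client.describe_db_clusters(
        DBClusterIdentifier=cluster_id
    )["DBClusters"][0]
    members = cluster["DBClusterMembers"]
    writer_id = next(
        m["DBInstanceIdentifier"] for m in members if m["IsClusterWriter"]
    )
    reader_ids = [
        m["DBInstanceIdentifier"] for m in members if not m["IsClusterWriter"]
    ]

    def az_of(instance_id):
        response = rds_client.describe_db_instances(DBInstanceIdentifier=instance_id)
        return response["DBInstances"][0]["AvailabilityZone"]

    writer_az = az_of(writer_id)
    for reader_id in reader_ids:
        if az_of(reader_id) != writer_az:
            return reader_id
    raise RuntimeError("no reader found in a different AZ than the writer")


def run_failover_drill(rds_client, cluster_id):
    """Fail the cluster over to a cross-AZ reader and return the chosen target."""
    target = pick_failover_target(rds_client, cluster_id)
    rds_client.failover_db_cluster(
        DBClusterIdentifier=cluster_id,
        TargetDBInstanceIdentifier=target,
    )
    return target
```

Passing the client in, rather than creating it inside the functions, keeps the region and credentials outside the drill logic and makes a dry run with a stub client straightforward.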

Drill 2: Deleting a read replica

Ensure traffic is flowing through our service, either through traffic replication or a load-testing tool.
Ensure we have at least 2 healthy Aurora instances running.
Pick a read replica and delete it.
During the deletion, the service health check should remain OK the whole time.
During the deletion, neither write nor read failures should happen.
Create a new read replica off the current writer.
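
The precheck, deletion, and re-creation steps of Drill 2 might be scripted roughly as follows. This is a sketch that assumes a boto3 RDS client is passed in as `rds_client`, treats an instance as healthy when its status is "available", and gives the replacement a new identifier so there is no need to wait for the deletion to finish. The function names are hypothetical.

```python
def delete_one_reader(rds_client, cluster_id):
    """Delete one healthy reader and return its description for later re-creation."""
    cluster = rds_client.describe_db_clusters(
        DBClusterIdentifier=cluster_id
    )["DBClusters"][0]

    healthy = []
    for member in cluster["DBClusterMembers"]:
        instance = rds_client.describe_db_instances(
            DBInstanceIdentifier=member["DBInstanceIdentifier"]
        )["DBInstances"][0]
        # Assumption: "available" is what this drill counts as healthy.
        if instance["DBInstanceStatus"] == "available":
            healthy.append((member["IsClusterWriter"], instance))

    if len(healthy) < 2:
        raise RuntimeError("need at least 2 healthy instances before the drill")
    readers = [instance for is_writer, instance in healthy if not is_writer]
    if not readers:
        raise RuntimeError("no healthy reader to delete")

    victim = readers[0]
    rds_client.delete_db_instance(
        DBInstanceIdentifier=victim["DBInstanceIdentifier"]
    )
    return victim


def recreate_reader(rds_client, cluster_id, deleted, new_instance_id):
    """Add a new instance to the cluster, copying class and engine from the deleted one."""
    rds_client.create_db_instance(
        DBInstanceIdentifier=new_instance_id,
        DBClusterIdentifier=cluster_id,
        DBInstanceClass=deleted["DBInstanceClass"],
        Engine=deleted["Engine"],
    )
```

The health checks and read/write probes should keep running between the two calls, since that window is what the drill is measuring.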

Drill 3: Deleting the current writer

Ensure traffic is flowing through our service, either through traffic replication or a load-testing tool.
Ensure we have at least 2 healthy Aurora instances running.
Pick the current writer and delete it.
During the failover, the service health check should remain OK the whole time.
During the failover, write failures are expected, but read failures should not happen.
Create a new read replica off the new writer.
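
The pass criteria used across the drills (health check always OK, reads never failing, writes allowed to fail only when a failover is involved) can be evaluated mechanically. The sketch below is one possible shape for that check. The three probe callables are assumptions supplied by the caller, each expected to raise an exception on failure, and all names are hypothetical.

```python
import time
from dataclasses import dataclass


@dataclass
class Sample:
    """Outcome of one round of probes taken during a drill."""
    health_ok: bool
    read_ok: bool
    write_ok: bool


def _succeeds(probe):
    """Run a probe and report whether it completed without raising."""
    try:
        probe()
        return True
    except Exception:
        return False


def collect_samples(probe_health, probe_read, probe_write, count,
                    interval_seconds=1.0, sleep=time.sleep):
    """Run the three probes `count` times, pausing between rounds."""
    samples = []
    for _ in range(count):
        samples.append(Sample(
            health_ok=_succeeds(probe_health),
            read_ok=_succeeds(probe_read),
            write_ok=_succeeds(probe_write),
        ))
        sleep(interval_seconds)
    return samples


def find_violations(samples, writes_may_fail):
    """Return a description of every sample that breaks the drill criteria."""
    violations = []
    for index, sample in enumerate(samples):
        if not sample.health_ok:
            violations.append(f"sample {index}: health check failed")
        if not sample.read_ok:
            violations.append(f"sample {index}: read failed")
        if not sample.write_ok and not writes_may_fail:
            violations.append(f"sample {index}: write failed")
    return violations
```

For Drill 1 and Drill 3 the check would run with `writes_may_fail=True`; for Drill 2 it would run with `writes_may_fail=False`. An empty list means the drill passed.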