Product engineering team
- Anyone in the product team can request resources.
- Use the guidelines below when requesting infra resources.
- Any resource request above the guidelines needs to be backed by data or concrete reasons. Gut feeling is not acceptable.
- The infra team will challenge the product team on costs.
- Three strikes for over-provisioning: upon the third strike, the product team must write a post mortem on why over-provisioning happens so often and publish it to all tech teams.
- If the product team cannot convince the infra team, escalate to the team lead and then to the senior tech lead.
Infra engineering team
- The infra team is responsible for challenging the product team to bring costs down.
- Every successful challenge and the potential cost saved should be logged as part of the team's KPIs. These numbers will be published to management monthly or quarterly.
Guidelines
- Overall, if the cost is the same, prefer fewer but bigger instances, e.g., provision one c4.8xlarge instead of two c4.4xlarge.
  - Reduces the probability that at least one instance fails (see the sketch below).
  - Reduces the noisy-neighbor problem commonly experienced in virtualized environments.
  - Reduces per-host monitoring overhead, in terms of both effort and money.
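
A minimal sketch of the failure-probability point above: with n independent instances and per-instance failure probability p, the chance that at least one fails is 1 - (1 - p)^n, so it grows with n. The 2% figure below is made up for illustration.

```python
def prob_any_failure(p: float, n: int) -> float:
    """Probability that at least one of n independent instances fails,
    given a per-instance failure probability p."""
    return 1 - (1 - p) ** n

p = 0.02  # illustrative per-instance failure probability, not a measured number
print(prob_any_failure(p, 1))  # one c4.8xlarge  -> 0.02
print(prob_any_failure(p, 2))  # two c4.4xlarge  -> 0.0396
```
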
- Compute instances for k8s
  - Instance type defaults to c5.9xlarge.
  - Any request for an instance smaller than c5.4xlarge should be extremely rare, e.g., once per month.
  - Number of replicas:
    - Each Tomcat instance should handle between 200 and 2,000 RPS.
    - This means it is very unlikely that you need more than 20 replicas for a single service.
    - Use Little's law to estimate; the rule of thumb is 1k RPS per replica (see the sketch after this list).
    - Each Kafka consumer should be able to handle at least 200 msg/sec. In general, consumption is not the bottleneck; the data sink is.
    - For OLTP workloads, it is highly unlikely that you need more than 3 consumers.
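
A minimal sketch of the replica estimate above, with made-up traffic numbers. The 1k RPS per replica constant is the rule of thumb from this guideline; the 200-thread figure in the final comment is Tomcat's default maxThreads, which this guideline does not itself specify.

```python
import math

RPS_PER_REPLICA = 1_000  # rule of thumb from this guideline

def replicas_rule_of_thumb(peak_rps: float) -> int:
    """Replica count from the 1k RPS per replica rule, rounded up."""
    return max(1, math.ceil(peak_rps / RPS_PER_REPLICA))

def in_flight_requests(peak_rps: float, avg_latency_s: float) -> float:
    """Little's law: L = lambda * W. The average number of requests in
    flight equals the arrival rate (RPS) times the average latency."""
    return peak_rps * avg_latency_s

# Illustrative inputs: 5k RPS peak at 50 ms average latency.
print(replicas_rule_of_thumb(5_000))    # -> 5 replicas
print(in_flight_requests(5_000, 0.05))  # -> 250 requests in flight
# 250 in flight across 5 replicas is 50 concurrent requests each,
# well within Tomcat's default 200 worker threads.
```
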
- RDS/Aurora
  - Over-provisioning the write DB is OK, because it is hard to scale up later.
  - For high-traffic OLTP services (> 1,000 RPS peak), r5.4xlarge is more than enough.
  - For internal OLTP services (< 100 RPS), m5.2xlarge is more than enough.
  - Read DBs are needed only for high-traffic OLTP services; at most 3 read DBs are enough.
  - If the suggested DB specification is not enough, then 80% of the time it is a code or architecture problem (see the sketch below).
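
A minimal sketch that encodes the write-DB sizing rules above as a lookup. The function name is hypothetical, and the guideline says nothing about the 100-1,000 RPS middle band, so defaulting that band to m5.2xlarge below is an assumption, not policy.

```python
def suggest_write_db_instance(peak_rps: float) -> str:
    """Map peak RPS to the write-DB instance class from this guideline."""
    if peak_rps > 1_000:
        return "r5.4xlarge"  # high-traffic OLTP service
    # Internal OLTP (< 100 RPS) per the guideline; the 100-1,000 RPS
    # band is not covered, so falling through here is an assumption.
    return "m5.2xlarge"

print(suggest_write_db_instance(5_000))  # -> r5.4xlarge
print(suggest_write_db_instance(80))     # -> m5.2xlarge
```
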
- Redis
  - One r5.4xlarge should be enough for almost all our use cases, i.e., up to 20k RPS.
  - If Redis is expected to handle no more than 200 RPS, then we don't really need Redis; just use the DB.
  - A replication factor of 2 is enough in 80% of cases, although 3 is OK (see the sketch below).
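
A minimal sketch of the Redis decision above. The function name is hypothetical, and the "escalate" branch for traffic beyond the ~20k RPS single-node figure is my reading rather than stated policy.

```python
def redis_plan(peak_rps: float) -> dict:
    """Apply the Redis guidance: skip Redis at low traffic, otherwise
    one r5.4xlarge with replication factor 2."""
    if peak_rps <= 200:
        return {"use_redis": False, "note": "just use the DB"}
    if peak_rps > 20_000:
        # Beyond the single-node figure in the guideline; needs a real discussion.
        return {"use_redis": True, "note": "escalate to the infra team"}
    return {"use_redis": True, "instance": "r5.4xlarge", "replication_factor": 2}

print(redis_plan(150))    # -> {'use_redis': False, 'note': 'just use the DB'}
print(redis_plan(5_000))  # -> {'use_redis': True, 'instance': 'r5.4xlarge', ...}
```
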
- Elasticsearch
  - Keep between 10% and 30% of storage free. More than 30% free means over-provisioning (see the sketch below).
  - For log analysis, default to m5.4xlarge with 2 TB of storage per node.
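
A minimal sketch of the free-storage check above; the thresholds come straight from the 10%-30% band, while the function name and example numbers are made up.

```python
def storage_status(free_bytes: int, total_bytes: int) -> str:
    """Classify a node's free storage against the 10%-30% target band."""
    free_pct = 100.0 * free_bytes / total_bytes
    if free_pct < 10:
        return f"{free_pct:.0f}% free: under-provisioned, add storage"
    if free_pct > 30:
        return f"{free_pct:.0f}% free: over-provisioned"
    return f"{free_pct:.0f}% free: within the target band"

# Illustrative: 0.9 TB free on a 2 TB log-analysis node -> 45% free.
print(storage_status(900 * 10**9, 2 * 10**12))  # -> over-provisioned
```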