Process for infra cost control

Product engineering team

Anyone in the product team can request resources.
Use the guidelines below to request infra resources.
Any resource request higher than the guidelines needs to be backed by data or concrete reasons. Gut feeling is not acceptable.
The infra team will challenge the product team on costs.
- Three strikes for over-provisioning. Upon the third strike, the product team must write a post-mortem explaining why over-provisioning happens so often and publish it to all tech teams.
If the product team is not able to convince the infra team, escalate to the team lead and then to the senior tech lead.

Infra engineering team

The infra team is responsible for challenging the product team to bring down the cost.
Every successful challenge, together with the estimated cost saved, should be logged as part of the KPI. This number will be reported to management monthly or quarterly.
Use the guidelines below to request infra resources.
Any resource request higher than the guidelines needs to be backed by data or concrete reasons. Gut feeling is not acceptable.
If the product team is not able to convince the infra team, escalate to the team lead and then to the senior tech lead.

Guidelines

Overall, if the cost is the same, prefer fewer but bigger instances, e.g., provision one c4.8xlarge instead of two c4.4xlarge.
- Reduces the probability of an instance failure, since there are fewer instances that can fail
- Reduces the noisy neighbor problem commonly experienced in virtualized environments
- Reduces the per-host monitoring overhead, in both effort and money
Compute instances for k8s
- The instance type defaults to c5.9xlarge.
- Requests for anything smaller than c5.4xlarge should be extremely rare, e.g., once per month.
Number of replicas:
- Each Tomcat instance should handle between 200 and 2000 RPS.
  - This means it is very unlikely that you need more than 20 replicas for a single service.
  - Use Little’s law (requests in flight = arrival rate × average latency) to estimate. The rule of thumb is 1k RPS per replica.
- Each Kafka consumer should be able to handle at least 200 msg/sec. In general, consumption is not the bottleneck; the data sink is.
  - For OLTP workloads, it is highly unlikely that you need more than 3 consumers.
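As a rough illustration of the replica estimate above, the sketch below applies Little’s law and compares the result with the 1k RPS rule of thumb. The function names, the 200 worker threads per replica, and the 50% target utilization are assumptions made for this example, not values from the guideline; substitute measured numbers from your own service.

```python
import math


def replicas_by_littles_law(peak_rps, avg_latency_ms,
                            threads_per_replica=200,   # assumed worker pool size
                            target_utilization=0.5):   # assumed headroom target
    """Estimate replicas from Little's law: in-flight = arrival rate * latency."""
    in_flight = peak_rps * avg_latency_ms / 1000
    usable_threads = threads_per_replica * target_utilization
    return max(1, math.ceil(in_flight / usable_threads))


def replicas_by_rule_of_thumb(peak_rps, rps_per_replica=1000):
    """Estimate replicas from the guideline's 1k RPS per replica rule of thumb."""
    return max(1, math.ceil(peak_rps / rps_per_replica))


if __name__ == "__main__":
    # Hypothetical service: 10k RPS at peak, 50 ms average latency.
    print(replicas_by_littles_law(10_000, 50))
    print(replicas_by_rule_of_thumb(10_000))
```

If the two estimates differ a lot, that is usually a sign that the latency or thread assumptions deserve a closer look before the request is submitted.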
RDS/Aurora
- Over-provisioning the write DB is OK, because it is hard to scale up.
  - For high-traffic OLTP services (> 1000 RPS peak), r5.4xlarge is more than enough.
  - For internal OLTP services (less than 100 RPS), m5.2xlarge is more than enough.
- A read DB is needed only for high-traffic OLTP services; 3 read DBs at most are enough.
- If the suggested DB specification is not enough, then in 80% of cases it is a code or architecture problem.
Redis
- One r5.4xlarge should be enough for almost all our use cases, i.e., 20k RPS.
- If the Redis instance is expected to handle no more than 200 RPS, then we don’t really need Redis. Just use the DB.
- A replication factor of 2 is enough in 80% of cases, although 3 is OK.
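The Redis thresholds above can be read as a simple decision rule. The sketch below is one possible encoding. The function name and the returned labels are hypothetical, and treating 20k RPS as the upper bound for a single r5.4xlarge is an interpretation of the guideline rather than a measured limit.

```python
def redis_recommendation(expected_peak_rps):
    """Map an expected peak RPS to the guideline's Redis recommendation."""
    if expected_peak_rps <= 200:
        # Guideline: at this load Redis is not needed, use the DB directly.
        return "use-db"
    if expected_peak_rps <= 20_000:
        # Guideline: one r5.4xlarge covers almost all use cases.
        return "single-r5.4xlarge"
    # Beyond the guideline (assumed handling): back the request with data.
    return "needs-review"


if __name__ == "__main__":
    for rps in (150, 5_000, 50_000):
        print(rps, redis_recommendation(rps))
```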
Elasticsearch
- Keep between 10% and 30% of storage free. More than 30% free means over-provisioning.
- For log analysis, default to m5.4xlarge with 2 TB of storage per node.
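The free-storage rule above can be checked with a few lines of arithmetic. The sketch below is a minimal example. The function name and labels are hypothetical, and the "below-guideline" label for less than 10% free is an assumption, since the guideline only names the over-provisioned case.

```python
def classify_free_storage(used_gb, total_gb):
    """Classify a node's free storage against the 10% to 30% guideline."""
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    free_ratio = (total_gb - used_gb) / total_gb
    if free_ratio > 0.30:
        return "over-provisioned"
    if free_ratio < 0.10:
        # Assumed label: the guideline does not name this case explicitly.
        return "below-guideline"
    return "within-guideline"


if __name__ == "__main__":
    # Hypothetical 2 TB node with 1.6 TB used, i.e. 20% free.
    print(classify_free_storage(1600, 2000))
```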