
For interviewer

  • Identify the workflow: can you learn tools or methods from the candidate?
  • How do they deal with the problems that are bound to happen in the work?
  • Alternatively, “Describe your process for [a major part of the job]”

What’s the most important part of getting your job done

The most important part is to get stakeholders, most likely from different teams, to agree on the context of the problem and to document that agreement. The agreement needs to be revisited once every two weeks.

This includes:

  • Goals and measurements, i.e., what external impact we can have after we finish this project. The numbers should be meaningful to the external user/customer, not internal metrics within the team carrying out the work
  • Roles of stakeholders: who is the DRI, who can veto, who is kept informed, and whose sign-offs (ideally no more than 3) are needed
  • Non-goals: scope that was explicitly rejected, so that we can revisit it later
  • Milestones and revisit interval: every 1-2 weeks

Describe your process for handling performance problems

Case 1: slow query

  • API timeout
  • Check the slow query log and identify slow queries from long-running processes
    • Re-run the query in prod or a repro environment to reproduce it; pay special attention to WHERE conditions not covered by the index, to make sure it doesn’t scan too many rows or pick a bad query plan (see the EXPLAIN sketch after this list)
  • Informed stakeholders and applied the fix within 30 minutes. Documented the RCA the following morning
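
A minimal sketch of the re-run step, assuming a MySQL-style database reached over JDBC; the connection URL, credentials, table, and query are hypothetical placeholders, not the actual incident’s.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class SlowQueryCheck {
      public static void main(String[] args) throws Exception {
          String url = "jdbc:mysql://repro-db:3306/shop"; // hypothetical repro environment
          try (Connection conn = DriverManager.getConnection(url, "reader", "secret");
               Statement st = conn.createStatement();
               // EXPLAIN shows the chosen index and the estimated rows without executing the query
               ResultSet rs = st.executeQuery(
                   "EXPLAIN SELECT * FROM orders WHERE customer_id = 42 AND status = 'PAID'")) {
              while (rs.next()) {
                  // 'key' is the index actually used; 'rows' is the estimated scan size;
                  // a NULL key or a huge row estimate points at a missing or uncovered index
                  System.out.printf("key=%s rows=%s type=%s%n",
                          rs.getString("key"), rs.getString("rows"), rs.getString("type"));
              }
          }
      }
  }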

Case 2: timeout too long

  • In preparation for the big campaign, I was inspecting modules and found one module’s p99/max latency spiking in production
  • Check DB QPS and latency: normal
    • Check the external API: far slower than expected, and the main API didn’t return 408 or 504 (see the timeout sketch after this list)
    • Check past load-testing results: without this external API call, everything was within expectation
    • Discussed with the module’s owner and confirmed it is tech debt
  • Do nothing until the next release cycle, because this one has no customer impact
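
A sketch of the kind of fix this tech debt usually calls for: an explicit deadline on the external call, so a slow dependency fails fast instead of silently inflating p99. The module’s actual stack isn’t recorded here; this assumes Java 11+ java.net.http.HttpClient and a hypothetical endpoint.

  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;
  import java.time.Duration;

  public class ExternalApiCall {
      // Shared client with a bounded connect timeout
      private static final HttpClient CLIENT = HttpClient.newBuilder()
              .connectTimeout(Duration.ofSeconds(2))
              .build();

      public static String fetchQuote() throws Exception {
          HttpRequest request = HttpRequest.newBuilder()
                  .uri(URI.create("https://partner.example.com/quote")) // hypothetical external API
                  .timeout(Duration.ofSeconds(3)) // per-request deadline
                  .GET()
                  .build();
          // Throws java.net.http.HttpTimeoutException once the deadline passes,
          // which the caller can translate into a 504 instead of hanging
          return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
      }
  }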

Case 3: connection pool size too small

  • API timeout
  • Check the chained API: its latency is good and its request count is very low
  • Suspect a client-side problem. Check the HttpClient docs and find the common pitfall (see the pool-sizing sketch after this list)
  • Informed stakeholders within a day and documented the RCA
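
The notes don’t say which HttpClient this was. A classic pitfall in Apache HttpClient 4.x is the tiny default pool (20 connections total, 2 per route), which produces exactly this signature: client-side timeouts while the downstream service reports good latency and few requests. A hedged sketch of the fix under that assumption, with illustrative pool sizes:

  import org.apache.http.impl.client.CloseableHttpClient;
  import org.apache.http.impl.client.HttpClients;
  import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

  public class PooledClientFactory {
      public static CloseableHttpClient create() {
          PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
          cm.setMaxTotal(200);          // pool-wide ceiling (illustrative number)
          cm.setDefaultMaxPerRoute(50); // per-host ceiling (illustrative number)
          return HttpClients.custom()
                  .setConnectionManager(cm)
                  .build();
      }
  }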

Describe your process for handling SEV 2 and above

Case 1: user didn’t see the payment on

  • Unable to reproduce in other environments. Directly affecting 200 transactions per second
  • Add metrics at every single level and release them to narrow down the code (see the instrumentation sketch after this list)
    • Read the source code and draw a graph to understand the flow
  • Found the explanation and mitigation; released the degraded but correct experience in 3 hours
    • The complete fix took two weeks to release, because we would not see such traffic again soon and people were tired after the big campaign
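
A minimal sketch of “metrics at every level”, assuming a Java service instrumented with Micrometer; the flow and stage names are hypothetical. One timer per hop makes the slow or lossy stage stand out on the dashboards once the instrumented build is live.

  import io.micrometer.core.instrument.MeterRegistry;
  import io.micrometer.core.instrument.Timer;
  import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

  public class PaymentFlowMetrics {
      private final Timer validateTimer;
      private final Timer chargeTimer;
      private final Timer notifyTimer;

      public PaymentFlowMetrics(MeterRegistry registry) {
          // One timer per stage of the flow, so dashboards show exactly which hop degrades
          validateTimer = Timer.builder("payment.validate").register(registry);
          chargeTimer = Timer.builder("payment.charge").register(registry);
          notifyTimer = Timer.builder("payment.notify").register(registry);
      }

      public void process(String orderId) {
          validateTimer.record(() -> validate(orderId));
          chargeTimer.record(() -> charge(orderId));
          notifyTimer.record(() -> notifyUser(orderId));
      }

      private void validate(String orderId) { /* existing logic */ }
      private void charge(String orderId) { /* existing logic */ }
      private void notifyUser(String orderId) { /* existing logic */ }

      public static void main(String[] args) {
          new PaymentFlowMetrics(new SimpleMeterRegistry()).process("order-123");
      }
  }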

Case 2: periodic latency spikes

  • A critical service’s API latency spiked every day, and the k8s health check failed
  • Check DB statistics to find the queries with the longest total time (see the sketch after this list)
    • Found that it came from a DB co-hosted on the same server
  • Provided mitigation by turning off the cron job
    • Complete fix delivered in two weeks with proper DB separation and indexes
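
A sketch of the “longest total time” check, assuming MySQL with performance_schema enabled; host, credentials, and limits are placeholders. Ordering by total time (calls times latency) surfaces frequent-but-slow statements, not just the single worst run.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class TopQueriesByTotalTime {
      public static void main(String[] args) throws Exception {
          String url = "jdbc:mysql://prod-db:3306/performance_schema"; // hypothetical host
          String sql =
              "SELECT digest_text, count_star, sum_timer_wait / 1e12 AS total_seconds " +
              "FROM events_statements_summary_by_digest " +
              "ORDER BY sum_timer_wait DESC LIMIT 10"; // timer values are in picoseconds
          try (Connection conn = DriverManager.getConnection(url, "reader", "secret");
               Statement st = conn.createStatement();
               ResultSet rs = st.executeQuery(sql)) {
              while (rs.next()) {
                  System.out.printf("%.1fs over %d calls: %s%n",
                          rs.getDouble("total_seconds"),
                          rs.getLong("count_star"),
                          rs.getString("digest_text"));
              }
          }
      }
  }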

Case 3: Running out of auto-increment IDs

  • 36 hours before going live, we were unable to execute transactions (a headroom-check sketch follows this list)
  • Asked support for a mitigation and a test case. Tested them independently, both on our side and on the 3rd party’s
  • Contacted the sign-off owners to explain my proposal and its risks. Got approval to go ahead as planned
  • Went as planned without incident
    • Hotfix released in 3 months. Upgraded in 5 months
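
A preventive headroom check that would have flagged the exhaustion earlier, assuming MySQL with signed INT auto-increment keys; the schema name and alert threshold are illustrative, and this is not the vendor mitigation described above.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.PreparedStatement;
  import java.sql.ResultSet;

  public class AutoIncrementHeadroom {
      public static void main(String[] args) throws Exception {
          String url = "jdbc:mysql://prod-db:3306/information_schema"; // hypothetical host
          String sql = "SELECT table_name, auto_increment FROM information_schema.tables " +
                       "WHERE table_schema = ? AND auto_increment IS NOT NULL";
          long intMax = 2_147_483_647L; // signed INT ceiling; BIGINT keys have far more room
          try (Connection conn = DriverManager.getConnection(url, "reader", "secret");
               PreparedStatement ps = conn.prepareStatement(sql)) {
              ps.setString(1, "shop"); // hypothetical schema
              try (ResultSet rs = ps.executeQuery()) {
                  while (rs.next()) {
                      double used = 100.0 * rs.getLong("auto_increment") / intMax;
                      if (used > 80.0) { // alert threshold is illustrative
                          System.out.printf("%s has used %.1f%% of its INT id space%n",
                                  rs.getString("table_name"), used);
                      }
                  }
              }
          }
      }
  }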

Describe your process for handling disagreement with 360

Case 1 & 2: db selection - round 1 & 2

  • Ran into a performance problem with the existing DB
  • Documented all pros and cons for discussion with stakeholders
    • Got overruled. Registered my objection and proposed a PoC
    • The decision was made without consulting me
    • Provided a migration plan to commit to making it succeed
  • Ran into the performance problem again after 6 months. Unable to handle it even after upgrading to the highest spec
    • Listed the different options with the new context
    • Built a PoC for every option
    • Stakeholders picked an option and went ahead
  • Stable, incident-free deployment

Case 3: the performance impact of count(*)

  • 500 error due to failed health checks during load testing
  • Explained why the current solution won’t work and offered a sample solution (see the sketch below). The other team’s owner didn’t commit to fixing it
  • Raised it to my chain of command, forwarded the solution, and was told not to worry about it after that
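
A sketch of the kind of sample solution offered, assuming the failing health check was issuing SELECT COUNT(*) against a large InnoDB table, which scans an entire index and collapses under load-test traffic; table, schema, and connection details are hypothetical.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.PreparedStatement;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class CheapHealthCheck {
      // Option 1: estimated row count from table statistics - approximate, but O(1)
      static long estimatedRows(Connection conn, String schema, String table) throws Exception {
          String sql = "SELECT table_rows FROM information_schema.tables " +
                       "WHERE table_schema = ? AND table_name = ?";
          try (PreparedStatement ps = conn.prepareStatement(sql)) {
              ps.setString(1, schema);
              ps.setString(2, table);
              try (ResultSet rs = ps.executeQuery()) {
                  return rs.next() ? rs.getLong("table_rows") : 0L;
              }
          }
      }

      // Option 2: bounded liveness probe - reads at most one row regardless of table size
      static boolean hasAnyRow(Connection conn) throws Exception {
          try (Statement st = conn.createStatement();
               ResultSet rs = st.executeQuery("SELECT 1 FROM events LIMIT 1")) { // hypothetical table
              return rs.next();
          }
      }

      public static void main(String[] args) throws Exception {
          String url = "jdbc:mysql://prod-db:3306/shop"; // hypothetical host and schema
          try (Connection conn = DriverManager.getConnection(url, "reader", "secret")) {
              System.out.println("estimated rows: " + estimatedRows(conn, "shop", "events"));
              System.out.println("has data: " + hasAnyRow(conn));
          }
      }
  }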