On MySQL InnoDB buffer pool

2019-11-08 00:00:00 +0000

InnoDB buffer pool

  • One LRU list split into a New (young) and an Old sublist, New:Old = 5:3 by default. Both store index and data pages
  • The New sublist mostly holds pages explicitly hit by queries
  • The Old sublist mostly holds read-ahead pages. Every newly read page starts at the head of the Old sublist, and moves to the New sublist once it is hit again
  • Aurora defaults to 75% of node’s physical memory
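The sublist mechanics above can be sketched as a toy model (Python purely for illustration; the 3/8 old fraction mirrors the default split, but the class and field names are made up and this is far simpler than InnoDB's actual implementation):

```python
from collections import OrderedDict

class MidpointLRU:
    """Toy model of InnoDB's midpoint LRU: one list, two sublists.

    Newly read pages enter the head of the Old sublist; a second
    access promotes them to the head of the New sublist, so one-off
    read-ahead pages cannot flush the hot working set."""
    def __init__(self, capacity, old_fraction=3 / 8):  # New:Old = 5:3
        self.capacity = capacity
        self.new = OrderedDict()  # hot pages, most recent at front
        self.old = OrderedDict()  # read-ahead / first-touch pages

    def access(self, page_id):
        if page_id in self.new:            # already hot: move to head
            self.new.move_to_end(page_id, last=False)
        elif page_id in self.old:          # second touch: promote
            del self.old[page_id]
            self.new[page_id] = True
            self.new.move_to_end(page_id, last=False)
        else:                              # first touch: head of Old
            self.old[page_id] = True
            self.old.move_to_end(page_id, last=False)
            self._evict()

    def _evict(self):
        while len(self.new) + len(self.old) > self.capacity:
            victim = self.old if self.old else self.new
            victim.popitem(last=True)      # drop from the tail

lru = MidpointLRU(capacity=4)
for p in [1, 2, 3, 1]:   # page 1 is touched twice, so it is promoted
    lru.access(p)
```

After this sequence, page 1 sits in the New sublist while the once-touched pages 2 and 3 stay in the Old sublist.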

Change buffer

  • Not available in Aurora
  • Uses buffer pool space; defaults to 25% of the buffer pool, max 50%
  • Caches non-clustered index changes, and merges them only when the affected index page is loaded into the buffer pool
  • The motivation is that modifications to secondary indexes are often very random access, yet without the change buffer we would have to read those pages into the buffer pool just to apply them

Be Professional

2019-11-07 00:00:00 +0000

How to act in scenarios

  • You are in a position of authority and need to issue an unpopular decision from your supervisor
    • just issue the decision, especially for ones applying to everyone and/or when the decision is from your supervisor
  • Sudden bad news/crisis. Options:
    1. From competitors: find their weaknesses to counter with, or escalate it into a common/general problem to shift focus
    2. From the public/users: be transparent about everything known
    3. Product issue: stay silent, fall back, keep a low profile
    4. Note these often appear together. Identify which is the main one
  • Authority issues a decision you deemed incorrect
    • Once the boss made the decision, it is our job to fulfill the vision within the boss’s framework. Don’t expect to influence the boss too much on the decision. In general, do not try to change people
    • You may disagree, but have to commit
  • Rejected by the authority in front of your subordinates
    • Know that the authority often has a different context - thus, complaining only shows that you are narrow-minded
    • One way is to not talk about that, and let time dilute the effect
  • Sudden incidents need to address
    • If the incident is external, focus on it, since it brings more impact. If it is internal, emphasizing it will give a (false) impression of internal disorganization
  • Got blamed for things out of your scope
    • Explain that it is not within your scope
    • Offer your solution - put more emphasis on what you can do to help solve the problem
  • Commanded by the boss’s boss directly
    • Should obey the command
    • Explain current context including boss’s direction
    • Let the boss know the change and command as soon as time permits
  • Someone wants to gossip with you
    • Be very explicit that you are not interested
    • Change topic
    • If it keeps happening, report to their manager

REMEMBER

  • Shut up on issues related to money and people management - that is the management’s job
  • The important thing is to bring value to the boss. Do not threaten to resign or actively reject tasks
  • Don’t mention things you hear in different locations, at least don’t frame as such
  • NEVER complain about your company or your team’s tasks in front of your subordinates; stop the complaints at you! However, it is OK to complain to your boss!
  • Corrective/negative feedback should be executed by the direct supervisor, even if it is from upper management

On motivating the team

2019-11-06 00:00:00 +0000

Training

  • Training is one of the highest-leverage activities you can provide to a team. Do it yourself, or have the best employees in each area run trainings
  • Change how people think by letting them practice different behavior over time
  • Mentorship is a strong reason an employee stays. A cheap, effective, but time-consuming motivator
  • No need to be formal, but need to be intentional
  • In training, spend more time on practice/repetition, and less on content

Finding interesting work for the team

  • If you want to be able to find interesting work and also work on important things, you generally have to go find the interesting important things yourself. This requires that you to talk to a lot of people and listen to their problems, and then place a bet on a solution to one of these problems that will actually both be feasible but will also be seen as important. Your manager might help identify people that you could talk to, but you must take responsibility for doing the legwork and making the final choice in problems to address.
  • A critical feature of effective complex organisations is that they make people do all the jobs.
  • Most companies design jobs and then slot people into them. Our best managers sometimes do the opposite: When they find talented people, they’re open to creating jobs around them.
  • The best go out of their way to help people do work they enjoy — even if it means rotating them out of roles where they’re excelling.
  • In the first week on the job, managers sit down with their new hires and ask them about their favorite projects they’ve done, the moments when they’ve felt most energized at work, the times when they’ve found themselves totally immersed in a state of flow, and the passions they have outside their jobs.
  • Build a searchable database of experts. The goal is to put employees’ strengths on display so that people know whom to contact.
  • Let them experiment. Let them obsess. Let them scratch that itch. If there is no project on their plate that you know is engaging them, create time for them to explore whatever they want to obsess about. I absolutely guarantee there is an investigation somehow related to their work that they are dying to tinker with.

Perks

  • Many companies have over-invested in hygiene factors. High pay, tons of perks, and over-investment in superficial aspects of “company culture” represent the wrong end of the 80-20 rule for impacting the motivation of your people.
  • Perks can be expensive, but they’re easy. Once you add a perk, it just continues to be a line item on your budget. There’s little you have to think about.
  • The next time you think about adding a perk, challenge yourself how beneficial it will really be for your company. Avoid the trap of choosing perks rather than the hard work of improving your managers
  • Talk about great managers as your company’s No. 1 benefit. Manager quality is the single best predictor of whether an employee will stay or leave. Good managers also cause their reports to have higher performance scores.
  • People refer friends because it’s a great place to work, not (primarily) for the referral bonus.
  • Non cash rewards, whether they are experiences (dinner for two, a trip) or gifts (a new phone) trigger an emotional response. When people are surveyed they say they prefer cash awards, but they report higher levels of happiness when receiving experiential awards.

  • Rather than just look at how to engage employees, it’s also helpful to look at how to lose them, and do the opposite.
  • Spread mundane, chore tasks evenly between teams
  • Stress is stress, fatigue on one task spills over into another
  • we need to be mindful that what is going on in our players’ minds, and in their personal lives away from the training ground, plays a huge part in what we get from them on a day-to-day basis

How to win friends and influence people

2019-11-05 00:00:00 +0000

  • Have to be interested in people when face-to-face
  • Start with your own shortcomings and the counterpart’s superiority before criticizing
  • Quick feedback loop!
  • Specific details when you compliment
  • Give people a fine reputation they live up to. Use titles and responsibilities as toys for people that they want to own
  • Always make people happy doing what you suggest. What benefit will they gain from doing what you suggest?
  • Rejecting people is OK, but see if you can provide an alternative solution at the same time
  • Avoid insults, ridicule, and criticism in most cases - they are just not productive. (On a side note, I see so many people violating this by trying to be funny)
  • Use the person’s first name as much as you can
  • Do not tell other people you are wiser than them
  • Don’t teach people; guide them to find it within themselves
  • Presenting facts in a vivid, dramatic way with multiple senses

Guideline for interviewer/hiring

2019-11-04 00:00:00 +0000

  • Secretary Problem. Picking a candidate is inherently hard; expect > 30% not-so-great hires!
  • Culture fit evaluations can invite bias into your process if interviewers see it as simply a way to assess a candidate’s likeability, rather than how they align with your company’s core values.
  • You want people who will fit in and flourish at your company while, at the same time, challenging the outlook and ideas of your employees.
  • Never ever hire a person you feel is overly ego-driven.
  • Overall manager should personally vet the new hires
  • Who to hire/promote etc. should be decided by a committee with believability-weighted decision making
    • How the decision is made should be transparent, but not the individual situation
  • Better to miss a great hire than to make a bad hire. Veto rights should be given to multiple people
  • First impressions are biased and carry too much weight
  • Use structured interview questions or sample work, and score on consistent rubrics, e.g., “Tell me about a time when you [outcome that is needed for this role]”. Avoid unstructured interviews if you can - people over-estimate themselves at evaluating other people
  • Interviewing skill is not evenly distributed; ask the best interviewers to make this a bigger part of their job
  • Have at least two interviewers assess each value or key competency
  • Expert interviewers are often the reason a candidate joins a company
  • Team should feel the alignment of the tech perspective with the candidate
  • Candidate can be a leader or follower regardless of current status and self-esteem
  • Don’t sell the position with cool tech. Instead, look for people who want ownership and are interested in the business and the customer
  • Easier to hire 90th percentile than training avg person to 90th percentile
    • “Instead of 150 new people, are you sure you don’t want 75 people whom you pay twice as much because they have twice as much experience and can be higher performers?”
  • Use closed tech questions to filter out obvious nos
    • Ask why a certain tech is needed, and then dive deep into details to probe the max depth

How do managers and ICs get stuck

2019-11-04 00:00:00 +0000

How do managers get stuck

  • Need to attend to details. Communicate in a timely manner what is accomplished, setbacks, challenges, and decisions made
    • Push it before being asked, and provide details if necessary
  • Professional verbal, written, and body communication in front of the senior management. Can you present to the peers of senior management alone?
  • When you are asked to do something by your manager. Either do it or say you don’t want to do it. Just don’t fail to do it

How do ICs get stuck

Common causes (phrased as the advice to counter them):

  • Don’t try to be 100% prepared before starting the project
  • Timebox the research, especially prototyping
  • Limit refactoring
  • Don’t take on helping others while you have assigned tasks
  • Don’t jump on fires when not on-call

Common symptoms of difficulty

  • Finishing the last 20% of the project
  • Starting a project from scratch
  • Project planning
  • Working with an unfamiliar lib/system/codebase
  • Dealing with surprises and setbacks
  • Dealing with bureaucracy or external partners
  • Deploying to prod

What's the most important part of getting your job done?

2019-11-04 00:00:00 +0000

Source

For interviewer

  • Identify workflow, are you able to learn tools/methods from the candidate?
  • How do they deal with problems that are bound to happen in the work
  • Alternatively, “Describe your process for [a major part of the job]

What’s the most important part of getting your job done

The most important part is to get stakeholders, most likely from different teams, to agree on the context of the problem and document this agreement. Such agreement needs to be revisited once every two weeks

This includes:

  • Goals and measurements, i.e., what external impact we will have after we finish this project. This number should be meaningful to the external user/customer, not an internal metric of the team carrying out the task
  • Roles of stakeholders: who is the DRI, who can veto, who can inform, whose signoffs (ideally no more than 3) are needed
  • Non-goals: clearly rejected scope, so that we can revisit it later
  • Milestones and revisit interval: every 1-2 weeks

Describe your process for handling performance problems

Case 1: slow query

  • API timeout
  • Check the slow query log; identify slow queries via long-running processes
    • Re-run the query on prod or a repro env to reproduce it; especially check WHERE conditions not covered by the index, to make sure it doesn’t scan too many rows or pick a bad query plan
  • Informed stakeholders and applied the fix in 30 mins. Documented the RCA the following morning

Case 2: timeout too long

  • In preparation for the big campaign, I was inspecting modules when one module’s p99/max spiked on production
  • Check db qps and latency - Normal
    • Check external API - way longer than expected, and the main API didn’t return 408 or 504
    • Check the past load testing result, without this external API call all within the expectation
    • Discussed with the owner of the module, confirmed it is a tech debt
  • Do nothing until the next release cycle because this one has no customer impact

Case 3: connection pool size too small

  • API timeout
  • Check the chained API, the latency is good, and request count really low
  • Suspect client side problem. Check the HttpClient docs, find the common problem
  • Informed stakeholders in a day and documented the RCA

Describe your process for handling SEV 2 and above

Case 1: user didn’t see the payment on

  • Unable to reproduce in other environments. Directly affecting 200 txns per sec
  • Add metrics at every single level, and release it to narrow down the code
    • Read source code and draw graph to understand the flow
  • Found the explanation and mitigation; released the degraded but correct experience in 3 hours
    • The complete fix took two weeks to release, because we would not see such traffic again soon, and people were tired after the big campaign

Case 2: periodically latency spike

  • A critical service’s API latency spikes every day, and k8s health check failed
  • Check db queries to find the ones with the longest total time
    • Found that they came from a db co-hosted on the same server
  • Provide mitigation by turning off the cron job
    • Complete fix provided in two weeks with proper db separation and indexes

Case 3: Run out of auto incremental id

  • 36 hours before switching online, unable to execute txn
  • Ask support for mitigation and test case. Tested independently both us and 3rd party
  • Contact sign off people to explain my proposal and the risk of doing so. Got approval to go ahead as planned
  • Goes as planned without incident
    • Hot fix released in 3 months. Upgraded in 5 months

Describe your process for handling disagreement with 360

Case 1 & 2: db selection - round 1 & 2

  • Run into performance problem with existing db
  • Document all pros and cons for discussion with stakeholders
    • Got overruled. Stated my objection and proposed a PoC
    • Decision made without consulting me
    • Provide migration plan to commit the success
  • Ran into the performance problem again after 6 months. Unable to handle it even after upgrading to the highest spec
    • List different options with new context
    • PoCed with all options
    • Stakeholder picked the option and went ahead
  • Stable, incident-free deployment

Case 3: the performance impact of count(*)

  • 500 error due to failed health checks during load testing
  • Explained why the current solution won’t work. Offered a sample solution. The other team’s owner didn’t commit to fixing it.
  • Raised to my chain of command, forwarded solution, and told not to worry about it after that

What happens when JVM creates a new object

2019-11-01 00:00:00 +0000

Sequence of actions

  • Every time we call new, a new object is created. Without new, autoboxing may reuse cached values from the constant pool
  • Load the class via class loading if not done yet
  • Allocate the memory
    • When the heap memory is fragmented, often a result of CMS, maintain a free list of which segments are available
    • When the heap memory is not fragmented, i.e., with Serial and ParNew GC, put used memory on one end and unused on the other, with a pointer as the separator (bump-the-pointer)
    • To preserve thread safety of memory allocation
      • TLAB: reserve a segment for each thread in Eden
      • When TLAB is exhausted, use CAS+retry
    • May maintain a separate handle pool, and the stack just points to the handle, so that after GC, no need to change the reference on the stack. The tradeoff is that any access to the data needs to go through 2 pointers
  • Init 0 value
  • Set object header
  • Execute <init>()
    • Parent class first. Instance initializer blocks first, then the constructor
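The allocation steps above can be modeled roughly as follows (a Python sketch for intuition only; HotSpot does this in C++ with real CAS instructions, and all class and field names here are invented):

```python
import threading

class Heap:
    """Toy bump-the-pointer allocator: used memory sits below `top`,
    free memory above it."""
    def __init__(self, size):
        self.size = size
        self.top = 0
        self.lock = threading.Lock()  # stands in for CAS + retry

    def alloc(self, n):
        with self.lock:               # the JVM would CAS the top pointer
            if self.top + n > self.size:
                raise MemoryError("would trigger GC here")
            start = self.top
            self.top += n
            return start

class TLAB:
    """Thread-local allocation buffer: a segment of Eden reserved for
    one thread, which it bumps without any synchronization."""
    def __init__(self, heap, size):
        self.heap, self.size = heap, size
        self.start = heap.alloc(size)   # reserve the segment up front
        self.cursor = self.start

    def alloc(self, n):
        if self.cursor + n > self.start + self.size:  # TLAB exhausted
            return self.heap.alloc(n)   # fall back to the shared CAS path
        addr = self.cursor
        self.cursor += n
        return addr

heap = Heap(1024)
tlab = TLAB(heap, size=128)   # 128 bytes of Eden reserved for this thread
a = tlab.alloc(16)            # lock-free fast path
b = tlab.alloc(16)
```

The design point: the common case (small object, TLAB has room) touches no shared state at all; only TLAB refills and large objects pay for synchronization.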

What Makes an Effective Executive

2019-10-30 00:00:00 +0000

Source

LISTEN FIRST, SPEAK LAST

Ask “What needs to be done?”

  • NOT “What do I want to do?”. Not asking it almost always guarantees a wrong decision
  • Focus on ONLY one urgent task at a time
    • After it is done, re-evaluate priorities instead of moving on to the second most urgent

Develop action plans

  • Action plan states intention but not commitment. Even if it is written, should expect to revisit and revise often
  • Organizations are time-wasters in nature
  • Most likely you don’t stick with the original plan, but without planning, you become a prisoner of events. Also need regular check-in to figure out which ones truly matter

Responsible for decision

  • A decision must name all stakeholders and their precise roles in the decision
    • Also a deadline must be included
  • Hiring/promoting is often the most difficult decision. Expect 1/3 success rate, and revisit people decision after 6 months.
    • Even if it is a failure, it is the exec’s mistake, not the failed person, BUT this person has to be removed - offer them to go back to the original level
  • On weak areas, delegate

Focus on opportunities rather than problems

  • How can we exploit this change for the company?
  • List and deal with opportunities first, problems next
  • Put the best people on opportunities rather than problems.

Run productive meetings

  • Make sure all participants agree on why the meeting is happening
  • Don’t sign up for committee meetings
  • Don’t brainstorm together. Brainstorm offline, and use meeting to review and eval the ideas. People should feel free to criticize DURING the meeting
  • MAKE SURE decide what kind of meeting it will be, each needs different preparation and results
    • Preparing a statement or announcement: one member comes up with a draft before the meeting. At the end of the meeting, a pre-appointed member distributes the final text
    • Announcement, e.g., org changes: the announcement and discussion about it only
    • One member reports: nothing but the report should be discussed
    • A few member reports: No discussion, or questions for clarification only, or discussion where all participants ask questions. In this case, distribute report before the meeting, and timebox it (e.g., 15 mins)
    • Informing the exec: the exec should listen and ask questions; he should sum up but not make the presentation
    • Meeting just to be in exec’s presence: lunch/dinner format ONLY
  • Designate someone to take notes in the meeting.
  • Sloan’s meeting method
    • Announce the purpose
    • Listen and only clarify points
    • Sum up what is to be included in the memo at the end of each discussion point or end of meeting
    • Immediately follow with an after-meeting memo, which summarizes discussions, conclusions, and action items/assignments/deadlines. This memo is cced to every participant
  • Too many meetings means the responsibility is too decentralized. In a lean org, people should be able to move around without collision and without explaining their work all the time. Otherwise, the org is overstaffed

Common Redis Problems

2019-10-29 00:00:00 +0000

  • Sentinel checks master/slave liveness, and will perform the failover switch
    • Sentinel itself runs with at least one master and one slave (outside the redis master-slaves). Clients connect to the sentinel instead of the main nodes directly
  • Cluster is for sharding
  • How is the master elected?
  • How does redis cluster know which shard to use? Why 16k slots?
    • Every node maintains its slots and their corresponding key ranges in a bitmap. Fewer slots compress better
    • More slots would mean bigger heartbeat messages, which is why the count is capped at 16k

Skip list

  • Redis uses a skip list instead of an RBT to implement sorted sets
  • It needs to support interval search and scan, which is hard to do with an RBT
  • Easier to implement than an RBT
  • Each item has probability p of also existing in the next level of the linked list. Therefore, we need to scan at most 1/p items at each level to detect the segment that contains the item
  • Each level is p times the size of the previous level, so the number of linked lists is log_{1/p}(n). Total search cost is the number of linked lists * expected items scanned in each linked list
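A minimal skip list illustrating these points (a sketch, not Redis's zskiplist; Redis adds backward pointers and spans for rank queries):

```python
import random

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level   # one forward pointer per level

class SkipList:
    """Minimal skip list: each node appears in level i+1 with
    probability P, giving ~log_{1/P}(n) levels and ~1/P nodes
    scanned per level, hence O(log n) expected search."""
    P = 0.5
    MAX_LEVEL = 16

    def __init__(self):
        self.head = Node(None, self.MAX_LEVEL)  # sentinel node
        self.level = 1

    def _random_level(self):
        lvl = 1
        while random.random() < self.P and lvl < self.MAX_LEVEL:
            lvl += 1
        return lvl

    def insert(self, key):
        update = [self.head] * self.MAX_LEVEL
        x = self.head
        for i in range(self.level - 1, -1, -1):  # walk top level down
            while x.forward[i] and x.forward[i].key < key:
                x = x.forward[i]
            update[i] = x
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        node = Node(key, lvl)
        for i in range(lvl):                     # splice in per level
            node.forward[i] = update[i].forward[i]
            update[i].forward[i] = node

    def range_search(self, lo, hi):
        """Interval scan -- the operation that is awkward on an RBT."""
        x = self.head
        for i in range(self.level - 1, -1, -1):
            while x.forward[i] and x.forward[i].key < lo:
                x = x.forward[i]
        x = x.forward[0]
        out = []
        while x and x.key <= hi:
            out.append(x.key)
            x = x.forward[0]
        return out

sl = SkipList()
for k in [30, 10, 50, 20, 40]:
    sl.insert(k)
```

Once the search lands at the left edge of the interval, the scan is a plain bottom-level linked-list walk, which is exactly why range queries are cheap here.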

Eviction

  • Since 3.0, redis maintains a candidate pool of size 16 holding the top-K least recently visited entries seen so far. When eviction is needed, just purge from the candidate pool
  • Strict LRU needs a linked list, which introduces concurrency and size issues. That is why redis uses a 24-bit timestamp to implement approximate LRU
  • The expiration cycle continues if the ratio of expired keys sampled is > 1/4
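The sampling-plus-pool idea can be sketched like this (toy Python; the pool size follows the note above, the sample count and function names are made up, and real Redis stores the clock in each object header rather than a dict):

```python
import random

POOL_SIZE = 16
SAMPLES = 5   # sample a handful of keys per eviction attempt

def evict_one(keys, pool):
    """Approximate LRU: sample some keys, keep a small pool of the
    least recently used candidates seen so far, evict the oldest.

    `keys` maps key -> last-access timestamp (standing in for the
    24-bit clock Redis keeps per object)."""
    sampled = random.sample(list(keys), min(SAMPLES, len(keys)))
    for k in sampled:
        if k not in pool:
            pool.append(k)
    pool.sort(key=lambda k: keys[k])   # oldest access time first
    del pool[POOL_SIZE:]               # keep only the 16 best candidates
    victim = pool.pop(0)               # least recently used so far
    del keys[victim]
    return victim

random.seed(0)
keys = {f"k{i}": i for i in range(100)}   # k0 has the oldest timestamp
pool = []
victim = evict_one(keys, pool)
```

The pool is what makes this "approximate LRU" converge: candidates survive across eviction rounds, so repeated sampling gradually concentrates on genuinely cold keys without any global linked list.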

Defend against a single, super hot key’s expiry

  • If we want to update from the client side, i.e., don’t want a separate process to update cache entries, options:
    • Set a mutex key (SETNX), load the db, and set the cache. Otherwise, retry the whole get-cache method
    • Alternatively, store a logical timeout1 inside the value itself; when we read the cache and find timeout1 already passed, we extend timeout1 and reset it in the cache. This way, other readers still think it is safe to use the cache
    • Or do not set an expiry time at all, but store the logical timeout inside the value; if we find it is about to expire, use a background thread to reconstruct the value asynchronously
  • Still need to combine with rate-limiting + active standby to defend in depth.
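The SETNX-mutex option can be sketched as follows (an in-memory dict stands in for Redis so the sketch is self-contained; `load_db`, the TTLs, and the key names are all hypothetical):

```python
import time

store = {}  # stands in for Redis: key -> (value, expire_at)

def setnx(key, value, ttl):
    """SET key value NX EX ttl: returns True only for the first caller."""
    now = time.time()
    if key in store and store[key][1] > now:
        return False
    store[key] = (value, now + ttl)
    return True

def get(key):
    entry = store.get(key)
    if entry and entry[1] > time.time():
        return entry[0]
    return None

def load_db():                 # hypothetical slow DB read
    return "fresh value"

def get_with_mutex(key, retries=3):
    """On a cache miss, only the winner of the mutex key rebuilds the
    cache; everyone else briefly waits, retries, and picks up the
    rebuilt value instead of stampeding the DB."""
    for _ in range(retries):
        value = get(key)
        if value is not None:
            return value
        if setnx("mutex:" + key, 1, ttl=10):   # we won the rebuild
            value = load_db()
            store[key] = (value, time.time() + 300)
            store.pop("mutex:" + key, None)    # release the mutex
            return value
        time.sleep(0.01)                       # lost the race: retry
    return None

v = get_with_mutex("hot_key")
```

The mutex key's own TTL is the safety net: if the rebuilding process dies, the lock expires and another caller can rebuild.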

HA

  • Sentinel is for HA with master-slave replication. Note redis cluster is for sharding + replication, which also supports master-slave failover.
    • Sentinel and redis cluster should not be mixed
  • RDB
    • BGSAVE: fork a child process to create the file. The forked process uses CoW, i.e., it shares pages with the parent process until they are modified
    • This is mainly used for hourly backup/restore
  • AOF
    • forks a child process to rewrite the old AOF
    • write commands to appendonly.aof
    • Replays AOF when the server reboots
  • The RDB file goes to the replica first, and then the AOF log is replayed

System design: point system for ecommerce

2019-10-29 00:00:00 +0000

All e-commerce order systems follow very similar patterns

Requirements

  • Points gained via activities on the site
  • use points to redeem gifts

Data structure

  • Account table to keep track of points of the user
  • Need a txn table to track redemption history (user_id, product_id, used_points)
  • Storage and shipping module
    • type: is it purchase or point?
    • txn id: the txn record of the redemption history
    • Product id, cnt,….etc
    • tracking number, for 3rd party courier

Idempotency

  • Get token from service with TTL
    • On submit, try deletion. If does not work, reject the request
  • Optimistic lock: version col on the DB row
  • Some people propose a distributed lock too, but I don’t see the point in it
  • API needs to require both source and seq as params
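The optimistic-lock option can be sketched as follows (an in-memory object stands in for the DB row; in SQL this would be an UPDATE with `WHERE id = ? AND version = ?`, and all names here are illustrative):

```python
class AccountRow:
    """Stand-in for a DB row carrying a version column."""
    def __init__(self, points):
        self.points = points
        self.version = 0

def redeem(row, cost, read_version):
    """Compare-and-set on the version column: the update only takes
    effect if nobody modified the row since we read it, so a retried
    duplicate submit with a stale version is rejected."""
    if row.version != read_version:   # WHERE version = ? matched 0 rows
        return False
    if row.points < cost:
        return False
    row.points -= cost
    row.version += 1
    return True

row = AccountRow(points=100)
v = row.version
assert redeem(row, 30, v)   # first submission succeeds
ok = redeem(row, 30, v)     # duplicate retry carries the stale version
```

The second call fails without touching the balance, which is exactly the idempotency we want from a retried redemption request.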

General e-commerce components

  • Catalog service for products
  • Shopping carts
  • Order system
    • Order itself
    • Async processing: coupon, push notification
    • Query
    • Return/Refund
    • Integration with 3rd party courier
  • Payment system
    • Often calls back to the order system to ACK
  • Delivery/Storage system

Order phases

  • Before generating order: txn systems. Online/sync
    • shopping carts - can stay in cache
    • order details
    • fee calculation
  • Generate order: order system. Online/sync
    • balance
    • point and coupon
    • stock freeze
    • generate order and item
  • After generation: fulfillment. Offline/async
    • review
    • delivery

System design: social network

2019-10-28 00:00:00 +0000

Data structure: relationship graph

  • As the number of n-to-n relationships grows, a relational and then a graph model becomes more appropriate.
  • Note that Vs and Es don’t need to be of homogeneous types at all, e.g., a user can like a picture
  • vertices:
    1. uid
    2. metadata
    3. set of incoming edges
    4. set of outgoing edges
    5. what type of vertex it is
  • edges
    1. uid
    2. metadata
    3. marks what kind of relationship this edge represents
    4. both ends of the edge

This separation of model means we can store Vs and Es into different tables/DBs

Requirement: Daily system notification

Display the notification only the first time the user logs in each day.

Design:

  • Maintain a list of notifications. Since these are admin messages, the volume will be no more than ~1k per day, so an HA DB with read replicas is enough
    • We can even cache the most recent 10 days of notifications in memory, as long as they are immutable
    • Thanks to epoll, each server can hold at least 20k connections; for 200 mil users, this means 10k servers are more than enough to handle the peak
  • Client side, keep track of the last login time; if the date is different, ask the server side to compute the list of notifications to push
  • Server side, use rules to compute notification
    1. created since the last poll
    2. within the scope of the notification’s target
    3. may fetch from the most recent, taking only the top k
    4. update the last poll time after push notification is successful

Requirement: Unread group chat for group member

Design:

  • Maintain a group message table (group_id, message seq or message_created_at) -> message_id
    • Assuming each user is rate-limited to 5 sec between messages, writes are only 200/sec for a 1k-member group
  • Maintain a user group membership table (user_id, group_id) -> last_polled_time
  • Client side polls with its locally saved last poll time. Note that the client side may need to dedup messages in case of retries

Requirement: can publish feed and read feeds

Also supports the pagination on the feeds page

Design:

  • May use separate tables to store user -> following and user -> follower relationships for write and read broadcast
  • The data structure will be (user_id, following_id) -> metadata
  • May also need a table for (user_id) -> follower_count/following_count

Do we need to persist read feed on publish?

  • Need a background job to populate to all followers. Latency concern to see popular user’s post
  • Write amplification - imagine a celebrity: one write amplifies to tens of millions
  • So read-feed-on-publish is not feasible
  • What if I want to see the historical posts of a person I just followed in my feed? This means every time I follow and un-follow, I need to mark the currently persisted feed as outdated and populate the delta

Can we generate read feed on read? Possible for the first page

  • Get the top K most recently updated posters by scatter-gather or a join
  • Get the top K posts from each of them and then sort at most k^2 posts in memory
  • BUT, if we include pagination in the picture, scrolling means true scatter-gather, which may cause a read amplification problem. So generating the feed on read may not be feasible if we are strict on the pagination semantics
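The first-page case above is essentially a lazy k-way merge over each followee's newest-first post list, which can be sketched as (post lists, names, and timestamps are all illustrative):

```python
import heapq
from itertools import islice

def first_page(posts_by_user, k):
    """Scatter-gather for page one: take the first k posts from each
    followee's newest-first list and k-way merge them. heapq.merge is
    lazy, so we materialize at most k results."""
    streams = [
        [(-ts, user, post) for ts, post in posts[:k]]  # negate: newest first
        for user, posts in posts_by_user.items()
    ]
    merged = heapq.merge(*streams)           # lazy k-way merge
    return [(user, post) for _, user, post in islice(merged, k)]

posts_by_user = {                 # (timestamp, post_id), newest first
    "alice": [(105, "a2"), (100, "a1")],
    "bob":   [(103, "b1")],
    "carol": [(110, "c1"), (101, "c0")],
}
page = first_page(posts_by_user, k=3)
```

This is also where the pagination objection bites: page two would need each stream to resume past an arbitrary cursor, which forces the full scatter-gather again.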

How about generating persisted feed on demand?

  • This is possible. Consider 100 mil MAU, each person reading 200 feeds per day: that is 20B records per day; at 128 bytes each, that is roughly 2.5 TB per day, not entirely unreasonable
  • Again, this means we need to keep track of deltas of the user’s recently added/removed posters, so we can update the feed

Requirement: User can like a post and undo it

We need to display

  1. like counts
  2. How many among your friends liked the post

Design:

  • maintain a (post_id, user_id) -> metadata table
  • a post_id -> liked_cnt table is debatable. Most likely we need it. We can also use a HyperLogLog data structure, which has roughly 2% error with ~1.5 KB of memory even for counts > 1e9
  • Similarly, we can use a post_id -> liked-user_id bloom filter table to speed up scatter-gather. Less than 10 bits per element for a 1% false-positive rate
  • Whether to pre-compute the liked count is very similar to the post feed discussion. Each page shows no more than 10 new posts, so computing in memory is good enough
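A toy bloom filter showing the bits-per-element trade-off mentioned above (a sketch built on a stdlib hash; production systems would use faster non-cryptographic hashes, and the sizing constants here are illustrative):

```python
import hashlib

class BloomFilter:
    """Toy bloom filter: ~10 bits per expected element with several
    hash functions gives roughly a 1% false-positive rate."""
    def __init__(self, expected_items, bits_per_item=10, hashes=7):
        self.m = expected_items * bits_per_item   # total bits
        self.k = hashes
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Derive k independent positions by salting one hash function
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        """False means definitely absent; True may be a false positive."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

liked = BloomFilter(expected_items=1000)
liked.add("user:42")
```

For the scatter-gather use case, a negative answer lets us skip a post entirely; only positives need the exact (post_id, user_id) lookup.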

Requirement: post can be nested replied, and replies sorted by number of likes

That is, show the most popular replies under the post.

This means we need to store a tree like structure in DB

  • Each reply needs to keep track of
    • root post
    • parent reply
    • number of likes
    • who liked them
    • Position among its siblings
    • How many children it has
  • Need a separate OLAP pipeline to generate this top liked reply tree

Requirement: User searches keywords

  • Upon posting, send the tweet to the indexer
  • A separate search service to handle the request
  • It chose to store the home timeline of the follower in the cache via a fanout service, to optimize for the read performance
    • By a redis list
  • It adds the tweet to the search index service
  • Avoid fanning out for highly followed users
  • Abstract follower/following relationship into a user graph service

System design: search engine

2019-10-28 00:00:00 +0000

Crawler

  • Spider crawls web to create build data (BD)
  • Estimate 1T new pages generated per day, e.g., 10 pages for everyone in the world
    • At ~100 KB each, that is 100 PB per day
  • Need to store:
  • URL - which can be in a trie to reduce the char space
  • last_visited_at
  • hash
  • Links to crawl, separately from the crawled links
  • May separate into day, hour, and month dbs; change only the hour db, and divide & conquer the search result
  • Seed URLs can be populated by hand
  • DNS caching needs to be tuned or enhanced, since the crawler discovers a lot of new URLs

How much can a trie save?

  • Assume all URLs are of length up to l, with evenly distributed chars. Storing them flat costs 26 + 26^2 + … = O(l * 26^l) chars
  • In a trie, each node has up to 26 children, so it stores O(26^l) nodes => we saved a factor of l
  • The intuition being that no matter which way we estimate, we only need space to keep track of the unique markers (distinct prefixes)

Page indexer

  • Indexer reads/receives URL from the crawler, and build inverted index by visiting the URL
  • Child links from the page are needed for PageRank
  • Choose to snapshot the pages visited, including the links and snippets. This will power the document service when the search indexer returns the URLs
    • May build a forward index of keywords and prefixes for each document
    • For quick filtering, can use a bloom filter for the keywords and prefixes

Search indexer

  • Analyze the search keyword

Search flow

  • Goes through search indexer to limit the documents to search
  • Ranker will score and sort results and return the first page result
  • Skip lists are the most common approach to the ordered linked-list union/intersection problem. O(log n) perf.
  • Cache popular search results. Reading 1MB sequentially takes 250 microsec from memory, 1ms from SSD
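The posting-list combination step can be sketched with a skip-ahead intersection (here binary search stands in for skip pointers; the doc ids and term names are made up):

```python
from bisect import bisect_left

def intersect(a, b):
    """Intersect two sorted posting (doc-id) lists. Instead of walking
    the longer list one element at a time, binary-search ahead -- a
    stand-in for skip pointers -- which pays off when one list is much
    shorter than the other."""
    if len(a) > len(b):
        a, b = b, a                   # iterate over the shorter list
    out, lo = [], 0
    for doc in a:
        lo = bisect_left(b, doc, lo)  # jump ahead in the longer list
        if lo == len(b):
            break                     # longer list exhausted
        if b[lo] == doc:
            out.append(doc)
    return out

docs_foo = [1, 3, 7, 9, 42, 100]      # postings for term "foo"
docs_bar = [3, 9, 10, 42, 77]         # postings for term "bar"
both = intersect(docs_foo, docs_bar)
```

Because `lo` only moves forward, each list is scanned at most once; the binary search just shortens the scan, which is the same intuition as skip-list forward pointers.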

System design: tiny url

2019-10-25 00:00:00 +0000

Generate new URL:

  • A brute force hash won’t work because we want to keep the URL short
  • hash of (snowflake id + salt) - need to do base conversion, e.g., from decimal to base-62
    • Similarly, we can pre-generate short URLs by hashing an incremental id, and assign them to incoming long URL requests
  • MurmurHash
  • Brute force picking 7 random chars may cause multiple rerolls when the space is close to full
  • Limit it to 255 varchars. Note 62^7 is already about 3.5 trillion combinations
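
The decimal-to-base-62 conversion is plain repeated division (the alphabet ordering below is an arbitrary choice):

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base62(n):
    # convert a non-negative integer (e.g., a snowflake id) to base-62
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

# 7 base-62 chars cover 62**7, about 3.5 trillion ids
```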

Need to persist:

  • past long -> short mappings for lookup
  • past short mappings for conflict check
  • We can also solve conflicts with a unique index on the short url
  • Bloom filter to detect obviously new ones - a 1-billion-bit filter consumes only about 125MB
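
Standard Bloom filter sizing formulas back up the capacity math (a sketch; the 1% false-positive target below is an assumption):

```python
import math

def bloom_bits(n_items, fp_rate):
    # optimal bit count: m = -n * ln(p) / (ln 2)^2
    return math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2)

def bloom_hashes(m_bits, n_items):
    # optimal hash function count: k = (m / n) * ln 2
    return max(1, round(m_bits / n_items * math.log(2)))

# a plain 10^9-bit set is 10^9 / 8 bytes = 125 MB, as noted above
```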

The server can return either 301 (permanent redirect) or 302 (temporary redirect). The difference is whether the client will cache the redirect result

Handling edits and deletes

System design: design a rpc framework

2019-10-25 00:00:00 +0000

Check eureka for design details

Service provider registers itself on the config center on start

Service consumer fetches (and caches) endpoint info

Message format

  • interface
  • method
  • paramType
  • paramVal
  • requestId - client identity + sequence number, so that multiple threads can issue the same method concurrently
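
The fields above might be serialized like this (a JSON sketch; field names follow the list, the client id scheme is an assumption):

```python
import itertools
import json

_seq = itertools.count(1)

def make_request(interface, method, param_types, param_vals, client_id="client-1"):
    # requestId combines client identity and a sequence number, so that
    # concurrent calls to the same method can be matched to their responses
    return json.dumps({
        "interface": interface,
        "method": method,
        "paramType": param_types,
        "paramVal": param_vals,
        "requestId": f"{client_id}-{next(_seq)}",
    })
```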

Registry:

  • Class
  • hostName
  • ip
  • port

Use ZK

  • For each method of each interface, register an ephemeral node; use the method id as the key, with the service list as the data

System design: flash sale

2019-10-24 00:00:00 +0000

General workflow

  • Dedup multiple requests from the same user
    • unique token for each submission request on the frontend
    • redis for consistency check, with a 5 min expire time
    • Use DEL to check if it is a dup or not. A successful DEL means it is the first time
    • API needs to accept source + seq at the same time, and they should form a unique index
    • Can also use standard distributed lock
  • check stock. The user is redirected to a waiting page instead of the sales page now
    • optimistic lock/or distributed lock. Note that pessimistic lock has bad performance
    • stock count cached in redis, 5 min expiry time. Note that we can set the count slightly higher than the actual stock, and reject the out-of-stock ones at the next level
    • MQ
  • deduct stock
    • If we create the order first and deduct later, people may create many orders without paying
  • create order
    • Cancel order can be done via MQ
    • Status is polled by the waiting page
  • pay
    • If no payment by a certain time, release the stock
    • If we deduct only after payment, a user may place an order but then be unable to pay because stock ran out
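
The DEL-based dedup check can be sketched like this (a dict stands in for Redis; in production this would be SET with a 5-minute expiry followed by DEL):

```python
# dict stands in for Redis here; real keys would carry a 5-minute TTL
pending_tokens = {}

def issue_token(token):
    # frontend gets a unique token per submission attempt
    pending_tokens[token] = 1

def try_submit(token):
    # a successful delete means this is the first submission;
    # a second attempt finds nothing to delete and is rejected as a dup
    return pending_tokens.pop(token, None) is not None
```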

Rate limiting

  • Reduce the traffic layer by layer. Needs to be done on both producer and consumer
    • ngx_http_limit_conn_module, ngx_http_limit_req_module
    • interceptor at the service + centralized redis counter
  • Often need to RL at dynamic hotspots
  • Tools:
    1. Token bucket - tokens drip in at a constant rate; no producer thread is needed, since from the last refill timestamp we can calculate how many tokens to add. When the bucket is full, excess tokens (not packets) are discarded. Requests can drain the bucket at once, so the goal is to limit the AVERAGE rate while accepting bursts up to the bucket capacity. Good for blocking use cases
    2. Leaky bucket - unlike token bucket, the output drains at a fixed rate, and packets are discarded when the bucket is full. Good for blocking use cases. Lower resource utilization than token bucket, i.e., it caps the max output rate; the goal is to smooth the output, so bursts are not allowed
    3. Sliding time window - slide the window end, delete obsolete entries, and then update the counter - but may still see peak traffic within a REALLY small sub-window inside the window. Good for rejecting-style rate limiting. Simple
  • When the rate limit is hit, ways to handle the overflow:
    1. direct reject
    2. add to a blocking Q
    3. log + warning
  • To test:
    • redirect prod traffic to a small subset of service instances, e.g., even only 1 instance, and see if the rate limiter is in effect
    • the shape of the TPS graph should be more reasonable
  • Can be deployed at API gateway level or a separate RPC service
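
A lazily-refilled token bucket needs no producer thread - the time elapsed since the last call tells us how many tokens to add (a sketch):

```python
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # max burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        # refill from elapsed time instead of a background producer
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```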

The Standard of Code Review

2019-10-24 00:00:00 +0000

  • Approve the CL as long as it improves the overall health of the code base, even if it is not perfect.
    • If it is hard to get a CL approved, people will be less motivated to make improvements
    • Don’t require the author to polish every piece of the CL before approval
    • This is the senior principle among all of the code review guidelines.
  • Just keep in mind that if your comment is purely educational, but not critical to meeting the standards described in this document, prefix it with “Nit: “ or otherwise indicate that it’s not mandatory for the author to resolve it in this CL.
  • Any purely style point (whitespace, etc.) that is not in the style guide is a matter of personal preference. The style should be consistent with what is there. If there is no previous style, accept the author’s.
  • If the author can demonstrate (either through data or based on solid engineering principles) that several approaches are equally valid, then the reviewer should accept the preference of the author. Otherwise the choice is dictated by standard principles of software design.
  • You can validate the CL if you want—the time when it’s most important for a reviewer to check a CL’s behavior is when it has a user-facing impact, such as a UI change. It’s hard to understand how some changes will impact a user when you’re just reading the code. For changes like that, you can have the developer give you a demo of the functionality if it’s too inconvenient to patch in the CL and try it yourself.
  • Encourage developers to solve the problem they know needs to be solved now, not the problem that the developer speculates might need to be solved in the future. The future problem should be solved once it arrives and you can see its actual shape and requirements in the physical universe.
  • Compliment people in CR, especially a good answer
  • At Google, we optimize for the speed at which a team of developers can produce a product together, as opposed to optimizing for the speed at which an individual developer can write code.
  • If you are in the middle of a focused task, such as writing code, don’t interrupt yourself to do a code review.
  • to be sure that you are always making comments about the code and never making comments about the developer. You don’t always have to follow this practice, but you should definitely use it when saying something that might otherwise be upsetting or contentious. For example:
    • Bad: “Why did you use threads here when there’s obviously no benefit to be gained from concurrency?”
    • Good: “The concurrency model here is adding complexity to the system without any actual performance benefit that I can see. Because there’s no performance benefit, it’s best for this code to be single-threaded instead of using multiple threads.”
  • If you ask a developer to explain a piece of code that you don’t understand, that should usually result in them rewriting the code more clearly. Occasionally, adding a comment in the code is also an appropriate response, as long as it’s not just explaining overly complex code.
  • Explanations written only in the code review tool are not helpful to future code readers. They are acceptable only in a few circumstances, such as when you are reviewing an area you are not very familiar with and the developer explains something that normal readers of the code would have already known.
  • Experience shows that as more time passes after a developer writes the original CL, the less likely this clean up is to happen. In fact, usually unless the developer does the clean up immediately after the present CL, it never happens. This isn’t because developers are irresponsible, but because they have a lot of work to do and the cleanup gets lost or forgotten in the press of other work. Thus, it is usually best to insist that the developer clean up their CL now, before the code is in the codebase and “done.” Letting people “clean things up later” is a common way for codebase to degenerate.
  • If you previously had fairly lax code reviews and you switch to having strict reviews, some developers will complain very loudly. Improving the speed of your code reviews usually causes these complaints to fade away.
  • Sometimes it can take months for these complaints to fade away, but eventually developers tend to see the value of strict code reviews as they see what great code they help generate. Sometimes the loudest protesters even become your strongest supporters once something happens that causes them to really see the value you’re adding by being strict.
  • Priorities in order: message format/schema, tests, interface, implementation
  • Claims about speed must be backed by micro-benchmarks
  • Reviewer is the owner of the code they are reviewing
  • Favor the author’s approach if the author can prove it by qualitative or quantitative proofs
  • If something is off track early, send feedback immediately
  • Review CLs as soon as you are not in the middle of a focused task, and within 1 business day in any case.
  • Most deadlines are soft. Favor code quality over meeting soft deadline. Hard deadline include
    • Contract
    • Market share speed
    • Missing an important conference (debatable)
  • Can reject CL just because it is too large.
    • Leave refactoring in a separate CL whenever possible.
    • Small cleanup can be in the same CL
    • May do a refactor CL before the implementation
  • If a review is not constructive or polite, explain in person, or use a private email to explain in a kind way what you wish could have been done differently

Common phrases in toxic comments

Excluding second person pronouns, which are the most common symptom.

  • it is
  • that is
  • going to
  • trying to
  • to do
  • do not
  • it is not
  • this is
  • in the
  • to be
  • have to
  • there is no
  • to do with
  • the problem is
  • want to
  • of these
  • of our
  • is to
  • if we
  • depend on
  • we use
  • the cl
  • this case
  • is of
  • the project
  • the linter
  • read the
  • as long as
  • just wanted to
  • we don’t want

Push back: the perception of unnecessary interpersonal conflict while a reviewer is blocking a change request

  • Rounds of a review
  • Active reviewing time
    • not wall-clock time of start-to-finish reviews, but based on time specifically spent working on the code review and related actions.
  • Active shepherding time
    • time the author spent actively viewing, responding to reviewer comments, or working on the selected CR, between requesting the code review and actually merging the change into the code base.
  • To flag potential push back
    • 48 minutes for active reviewing time
    • 112 minutes for shepherding time
    • 9 batches of comments for number of rounds of review
  • Compared to very small code changes (less than 10 lines), changes that were between 250-999 lines long were 4.8 times more likely for authors to feel more change was requested than necessary and changes that were between 10-49 lines were 4.7 times more likely to have authors report they felt negatively about submitting a similar change in the future
  • The number of reviewers is not predictive of any of our push back feelings.
  • Some of the most commonly mentioned associated factors were delays (65), references to code style (46) and documentation (39), the tone in which feedback was delivered (43), familiarity with the codebase (35), and the CR being part of a readability review (34). The most common emergent themes which we classified as consequences were inefficient use of time (31), negative emotions (29), and offline conversations (28).

Code review at Google

  • Expectations for code review at Google do not center around problem solving. Reviewing was introduced at Google to ensure code readability and maintainability. Today’s developers also perceive this educational aspect, in addition to maintaining norms, tracking history, gate-keeping, and accident prevention. Defect finding is welcomed but not the only focus.
    • This does not align with the previous idea that CR is a group problem solving activity
  • Usually, only one reviewer is required to satisfy the aforementioned requirements of ownership and readability.
    • This does not align with previous idea that two reviewers find an optimal number of defects
  • Over 80% of all changes involve at most one iteration of resolving comments.
  • We find that the median developer authors about 3 changes a week, and 80 percent of authors make fewer than 7 changes a week. Similarly, the median for changes reviewed by developers per week is 4, and 80 percent of reviewers review fewer than 10 changes a week. In terms of speed, we find that developers have to wait for initial feedback on their change a median time of under an hour for small changes and about 5 hours for very large changes. The overall (all code sizes) median latency for the entire review process is under 4 hours. This is significantly lower than the median time to approval reported by Rigby and Bird [33], which is 17.5 hours for AMD, 15.7 hours for Chrome OS and 14.7, 19.8, and 18.9 hours for the three Microsoft projects. Another study found the median time to approval at Microsoft to be 24 hours
  • At Google, over 35% of the changes under consideration modify only a single file and about 90% modify fewer than 10 files. Over 10% of changes modify only a single line of code, and the median number of lines modified is 24. The median change size is significantly lower than reported by Rigby and Bird for companies such as AMD (44 lines), Lucent (263 lines), and Bing, Office and SQL Server at Microsoft (somewhere between those boundaries), but in line for change sizes in open source projects
  • fewer than 25% of changes have more than one reviewer, and over 99% have at most five reviewers with a median reviewer count of 1. Larger changes tend to have more reviewers on average. However, even very large changes on average require fewer than two reviewers.
  • Developers spend an average of 3.2 (median 2.6 hours a week) reviewing changes. This is low compared to the 6.4 hours/week of self-reported time for OSS projects
  • The median change size in OSS projects varies from 11 to 32 lines changed
  • During the week, 70% of changes are committed less than 24 hours after they are mailed out for an initial review.

References

Scalable microservice checklist for engineers

2019-10-17 00:00:00 +0000

This is based on the internal guide I wrote for Paypay teams. I took out data that is specific to our environment. Almost every item has a SEV 2 behind it.

Target audience

Backend engineers developing Java microservices

Java

  • GC’s stop the world (STW) is a common source of CPU spikes and latency spikes.
    • Most backend services are IO intensive instead of CPU intensive. In such cases, the CPU usage of the process should not exceed 50% of the cores.
    • Use metrics to check both young GC and old GC times and frequency. Frequent young GC is OK. Any sign of old GC requires attention
  • STW also means the timer you set at the java side could include the GC time on top of the actual execution time. Therefore, make sure you track client side AND server side processing time at the same time
  • Heap size rule of thumb: try to keep it between 4 and 8G
    • Rationale: Less than 4G, the survivor region may be too small and gets promoted to old generation too easily. Therefore, it is more likely to trigger old GC
    • Rationale: Greater than 8G, the full GC STW may be too long
  • JVM settings
    • Same -Xmx and -Xms
    • For OLTP systems, use a bigger new gen
    • 256MB is more than enough for metaspace
    • 1M max stack size is more than enough

Network

  • Make sure you understand Little’s law, and use that to estimate your connection pool size and task queue size
    • Pool’s init/min/max size should be constant, so we can warm up the pool on start
    • On db/server side, prefer smaller connection pool size. Higher than (number of cores * 3) is very debatable.
      • Rationale: your cores will be saturated with (number of cores * 3) workers already. Adding more workers will only add the context switch cost without adding breathing room for the CPU
    • Set max task queue size. Higher than (num of cores * 10) is very debatable. When the queue is full, prefer abort or discard policy.
      • Rationale: The system can not process fast enough already, adding in the task queue will not help with the task processing time
  • Do NOT make any remote/inter-process calls when you are inside a mutex. 95% of chance such code is wrong
  • The network latency between nodes in our DC is about 1-2 ms. Do not issue sequential network call, as the latency will add up quickly
  • Low timeout so we can fail early, which also protects the downstream system. Timeout should be no more than 2s on UX-impacting calls (may even consider 1s)
  • Retry no more than 3 times with random jitters to avoid thundering herd problem
  • Because of retries, the timeout of each call needs to be 1/2 or 1/3 of the client’s read timeout. Otherwise, retries have no effect because the client has already timed out
  • A timeout normally just kills the current request connection. Often the client side needs to do additional things to trigger cleanup of long running actions. The key point: don’t expect the server side to do cleanup without explicit instruction from the client side
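
Little’s law (L = λW) gives a first-cut pool size: in-flight requests = arrival rate × time each request spends in the system. The numbers below are illustrative assumptions:

```python
def littles_law_pool_size(arrival_rate_per_s, avg_latency_s):
    # Little's law: concurrent in-flight work = arrival rate * avg time in system
    return arrival_rate_per_s * avg_latency_s

# e.g., 500 req/s hitting the DB at 40 ms average needs about 20 connections,
# before adding headroom for latency spikes
estimate = littles_law_pool_size(500, 0.04)
```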

Operation

  • Health check
    • The thread answering health checks is often different from the worker threads. To prevent a half-dead process where worker threads silently keep doing work, shut down the whole process upon detecting a health check failure
    • ELB’s default check is a 10s timeout, and it registers the instance as failed after 3 misses. This is too generous if the load balancer is within 100 KM of the instances, i.e., failures may not be detected fast enough
    • In the same-region scenario, consider a health check timeout of 1s, with 3 as the failure threshold and 2 as the success threshold

RDS

  • Memory should be able to hold all indices. You can estimate the size of indices by (number of rows * avg size of primary keys)
  • Try not to use incremental id for PK on big tables, use snowflake/uid generator if possible.
    • Rationale: strict incremental id generation is NOT supported in most distributed systems. We would have to use snowflake for id generation when we migrate off RDS anyway.
  • Make sure you add quotes when filtering by varchar type. Otherwise, the index on that varchar type will not be hit
  • Solving deadlock:
    • On Repeatable-Read isolation level, a common source is the gap lock introduced by batch inserts
    • Check the execution plan of the conflicting queries. Most likely one is holding gap lock or table lock by mistake
  • Solving slow queries
    • Check query plan by explain analyze. If there is no “range” step, something is wrong.
    • Note that the “index” access type is still a full scan - it scans the entire index
    • The most reliable fix is to add an index hint. With it, you don’t even need to worry about the outdated table statistics.
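
The index-memory estimate from the first bullet of this section, as rough arithmetic (row count, key sizes, and the per-entry model are all illustrative assumptions):

```python
def index_memory_bytes(rows, avg_pk_bytes, secondary_key_bytes=()):
    # primary (clustered) index key space: rows * avg PK size;
    # each secondary index entry stores its own key plus the PK
    primary = rows * avg_pk_bytes
    secondary = sum(rows * (k + avg_pk_bytes) for k in secondary_key_bytes)
    return primary + secondary

# e.g., 100M rows, 8-byte PK, two secondary indexes with 16-byte keys
estimate = index_memory_bytes(100_000_000, 8, (16, 16))
```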

Redis

  • You don’t need redis, if
    • Your distributed lock sees no more than 100 requests per second - you can just implement the lock in the DB
      • Rationale: trade off between code complexity vs architecture complexity
    • 80% data hits less than 50k entries - you can just use local in-memory cache
      • Rationale: 50k * 1K per entry = 50M memory only
  • Redis is single threaded. This means any slower command will block everyone else.
    • For non k-v look up case, make sure your collection size is less than 200 items. Otherwise, you need to redesign your data structure
      • Rationale: avoid the big key problem

Kafka

  • Common kafka use cases:
    • Uses event sourcing or an event-driven model, or
    • Have to buffer requests to process asynchronously
  • People often use Kafka just to partition workloads. Explore other options before opting for kafka. Rationales:
    • Your scaling factor is limited by the number of topic partitions, which is very hard to scale up
    • Very hard to deploy 50+ consumer pods on the cluster, because each Spring pod is a separate process consuming > 1G memory
    • Your true bottleneck is almost always the datasink anyway
  • Most producer/consumer logics are at least once, so idempotency should be in your logic all the time
    • Kafka does have exactly once processing, but that works only when both source and sink are kafka topics
    • Most likely you need producer ack = 1
  • If you do DB write and kafka write in the same API request, you will need a recon job to recon inconsistencies. Rationales:
    • The DB write and kafka write are not in the same transaction. So all-or-nothing semantics can’t be guaranteed
    • Reliable message delivery is possible, but not trivial. The effort/reward trade off in our experience is not worth it.
  • Replication factor 3 is enough. You can even set it to 2 if losing data is acceptable, e.g., low importance logs
    • Rationale: Kafka deployment is evenly spread across 3 AZs. Replication factor 3 is enough to counter 1 AZ failure

Spring

  • Tomcat + JVM idle will consume almost 1G ram. This, combined with the process context switch costs, means that we prefer bigger but fewer pods
  • Spring AOP annotation works only in a @Bean, and only when it is called from another class. We had many cases where @Async, @Transactional annotations have no effect because of these two problems
  • Try to limit the @Transactional scope as much as you can, this means
    • Highly debatable to put @Transactional on the class level. Prefer method level annotation
    • It is normal to create a method just to open the transaction.
    • Rationale: reduce the DB transaction duration as much as we can
  • If you use @Async, make sure you override the Executor bean
    • Rationale: the default implementation is SimpleAsyncTaskExecutor, which creates a new thread for every task with no upper bound
  • For OLTP APIs affecting UX, no timeout > 5s, ideally no more than 2s
    • Rationale: if one API can’t finish under 2s, that means some systems are under stress or faulty already. We fail early to prevent cascading failures across services
  • Prefer native query to JPA/hibernate ORM. Rationale:
    • Often you need to specify index hints
    • Generated query is too long to fit into slow query log. What makes it worse is that the slow query log often truncates the WHERE condition
  • Do not include inter-service call inside transaction. Rationale:
    • Reduce DB txn duration as much as we can. This is a common source of deadlock
    • It gives people a false sense of transaction between services when there is none. You have to implement TCC/Saga yourself to have the transaction-ish semantics
  • All calls to third party APIs must be protected by a circuit breaker
    • Rationale: Fail fast, so the worker thread on our tomcat can be freed quickly

Debugging JVM memory issues

2019-10-10 00:00:00 +0000

Is the survivor space enough?

If not, then after a minor GC the objects can’t fit into the survivor space and have to be promoted to the old gen directly. Consider making the new gen 50% bigger than the old gen

Which GC to use?

ParNew for new gen, CMS for old gen.

Metaspace full

  • commonly caused by too many proxies generated by cglib
  • By default the size is about 20M, and on start this size is very likely to trigger metaspace resizes, which lead to GCs. Consider setting metaspace to min=max=256MB

Native memory usage too high

e.g., off heap memory such as DirectByteBuffer

Where are strings stored?

  • Constant strings are stored in string constant pool, which is on the heap. It is implemented as a hashtable of pointers to the strings on the heap
  • Other strings created by ctor just stay on heap as usual

Tools

  1. jmap -heap - check new and old gen sizes
  2. jmap -histo:live - what is alive, and is it too much?
  3. /proc/${PID}/fd, /proc/${PID}/task - check FD usage and thread count
  4. pstree, ss - check process creation and network connection counts

Who’s Got the Monkey

2019-10-09 00:00:00 +0000

Source

  • The time imposed by the supervisor and peers is hard to cut down. Therefore, we have to control how much time we spend on subordinates, and many people spend way more than they prefer or even realize
  • Progress reports should come from the subordinates. This also means a manager should try not to “get back to you later” - it is now or never. In case a decision does take time, make that decision in a scheduled meeting with the owner together, ideally face-to-face.
    • At the end of such meeting, the problem owner (not the manager) and next meeting time must be explicitly set
    • This meeting should not take > 15 mins. Longer than that means the subordinate is not prepared
    • The manager should NEVER make a decision on a subordinate’s problem alone!

On a side note, build a universal accountability culture to shorten the request-response time and feedback loop, i.e., everyone should be able to hold each other accountable