Scalable microservice checklist for engineers
This is based on the internal guide I wrote for PayPay teams. I removed data specific to our environment. Almost every item here has a SEV 2 behind it.
Target audience
Backend engineers developing Java microservices
Java
- GC’s stop-the-world (STW) pauses are a common source of CPU spikes and latency spikes.
- Most backend services are IO intensive rather than CPU intensive. In such cases, the CPU usage of the process should not exceed 50% of the cores.
- Use metrics to check both young GC and old GC times and frequency. Frequent young GC is OK. Any sign of old GC requires attention
- STW also means a timer you set on the Java side may include GC time on top of the actual execution time. Therefore, make sure you track client-side AND server-side processing time at the same time
- Heap size rule of thumb: try to keep it between 4-8G
- Rationale: below 4G, the survivor region may be too small, so objects get promoted to the old generation too easily, making old GC more likely
- Rationale: above 8G, the full GC STW pause may be too long
- JVM settings
- Same -Xmx and -Xms
- For OLTP systems, prefer a bigger new gen
- 256MB is more than enough for metaspace
- 1M max stack size is more than enough
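Putting those rules of thumb together, a starting point might look like this (illustrative sizes, not a drop-in config; tune against your own GC metrics):

```
# same -Xms/-Xmx, bigger new gen for OLTP, fixed metaspace, 1M stacks
java -Xms6g -Xmx6g -Xmn3g \
     -XX:MetaspaceSize=256m -XX:MaxMetaspaceSize=256m \
     -Xss1m -jar app.jar
```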
Network
- Make sure you understand Little’s law, and use it to estimate your connection pool size and task queue size (see the worked example at the end of this section)
- Pool’s init/min/max size should be constant, so we can warm up the pool on start
- On db/server side, prefer smaller connection pool size. Higher than (number of cores * 3) is very debatable.
- Rationale: your cores will be saturated with (number of cores * 3) workers already. Adding more workers only adds context switch cost without giving the CPU any breathing room
- Set a max task queue size. Higher than (number of cores * 10) is very debatable. When the queue is full, prefer an abort or discard policy.
- Rationale: the system cannot process fast enough already; piling tasks into the queue will not help the task processing time
- Do NOT make any remote/inter-process calls while inside a mutex. There is a 95% chance such code is wrong
- The network latency between nodes in our DC is about 1-2 ms. Do not issue sequential network calls, as the latency adds up quickly
- Use low timeouts so we can fail early, which also protects the downstream system. The timeout should be no more than 2s on UX-impacting calls (may even consider 1s)
- Retry no more than 3 times, with random jitter to avoid the thundering herd problem
- Because of retries, the timeout of each call needs to be 1/2 or 1/3 of the client’s read timeout. Otherwise, the retry has no effect because the client has already timed out
- A timeout normally just kills the current request connection. Often the client side needs to do additional work to trigger cleanup of a long-running action. The key point: don’t expect the server side to clean up without explicit instruction from the client side
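A worked example of the Little’s law sizing mentioned at the top of this list (the numbers are illustrative):

```java
// Little's law: L = lambda * W
double lambda = 400.0;        // arrival rate: 400 requests/second
double w = 0.050;             // average time each request holds a connection: 50 ms
double inFlight = lambda * w; // ~20 connections in flight on average
int poolSize = (int) Math.ceil(inFlight * 1.2); // small headroom; keep init = min = max to warm up
```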
Operation
- Health check
- The thread answering the health check is often different from the worker threads. To prevent "ninja" work done by worker threads in a half-dead process, make sure you shut down the whole process upon detecting a health check failure
- ELB’s default health check is a 10s timeout, and it registers the instance as failed after 3 attempts. This is too generous if the load balancer is within 100 km of the instances, i.e., failures may not be detected soon enough
- For the same-region scenario, consider a health check timeout of 1s, with 3 as the failure threshold and 2 as the success threshold
RDS
- Memory should be able to hold all indices. You can estimate the size of the indices by (number of rows * avg size of primary keys)
- Try not to use an incremental id as the PK on big tables; use a snowflake/uid generator if possible.
- Rationale: strictly incremental id generation is NOT supported in most distributed systems. We will have to use snowflake for id generation when we migrate off RDS anyway.
- Make sure you add quotes when filtering on a varchar column, e.g., `WHERE phone = '12345'` rather than `WHERE phone = 12345`. Otherwise the implicit cast means the index on that varchar column will not be hit
- Solving deadlock:
- On Repeatable-Read isolation level, a common source is the gap lock introduced by batch inserts
- Check the execution plan of the conflicting queries. Most likely one is holding a gap lock or table lock by mistake
- Solving slow queries
- Check the query plan with `EXPLAIN ANALYZE`. If there is no “range” step, something is wrong.
- Note that an “index” type step is still a full table scan
- The most reliable fix is to add an index hint. With it, you don’t even need to worry about outdated table statistics.
Redis
- You don’t need Redis if:
- Your distributed lock takes no more than 100 RPS - you can just implement the lock in the DB (see the sketch after this list)
- Rationale: trade-off between code complexity vs architecture complexity
- 80% of your data access hits fewer than 50k entries - you can just use a local in-memory cache
- Rationale: 50k entries * 1KB per entry = only 50MB of memory
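A minimal sketch of the DB-backed lock mentioned above, assuming MySQL and a hypothetical `locks(name PRIMARY KEY, owner, expires_at)` table; expiry cleanup and retries are left out:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public final class DbLock {
    // Acquire by inserting a row; the PRIMARY KEY guarantees a single owner.
    public boolean tryAcquire(Connection conn, String name, String owner) throws SQLException {
        String sql = "INSERT INTO locks (name, owner, expires_at) "
                   + "VALUES (?, ?, NOW() + INTERVAL 30 SECOND)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, name);
            ps.setString(2, owner);
            return ps.executeUpdate() == 1; // we now hold the lock
        } catch (SQLException duplicateKey) { // real code should check the vendor error code
            return false;                    // someone else holds it
        }
    }

    // Release by deleting only our own row.
    public void release(Connection conn, String name, String owner) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "DELETE FROM locks WHERE name = ? AND owner = ?")) {
            ps.setString(1, name);
            ps.setString(2, owner);
            ps.executeUpdate();
        }
    }
}
```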
- Redis is single threaded. This means any slow command will block everyone else.
- For the non k-v lookup case, make sure your collection size is less than 200 items. Otherwise, you need to redesign your data structure
- Rationale: avoid the big key problem
Kafka
- Common kafka use cases:
- Use event sourcing or an event-driven model, or
- Have to buffer requests to process asynchronously
- People often use Kafka just to partition workloads. Explore other options before opting for kafka. Rationales:
- Your scaling factor is limited by the number of topic partitions, which is very hard to scale up
- Very hard to deploy 50+ consumer pods on the cluster, because each Spring pod is a separate process consuming > 1G memory
- Your true bottleneck is almost always the datasink anyway
- Most producer/consumer logic is at-least-once, so idempotency should be built into your logic all the time
- Kafka does have exactly-once processing, but that works only when both source and sink are kafka topics
- Most likely you need producer ack = 1 (see the config sketch at the end of this section)
- If you do a DB write and a kafka write in the same API request, you will need a recon job to reconcile inconsistencies. Rationales:
- The DB write and kafka write are not in the same transaction, so all-or-nothing semantics can’t be guaranteed
- Reliable message delivery is possible, but not trivial. The effort/reward trade off in our experience is not worth it.
- Replication factor 3 is enough. You can even set it to 2 if losing data is acceptable, e.g., low importance logs
- Rationale: Kafka deployment is evenly spread across 3 AZs. Replication factor 3 is enough to counter 1 AZ failure
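For reference, the producer ack setting above maps to config like this (a sketch; the broker address and serializers are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
props.put(ProducerConfig.ACKS_CONFIG, "1"); // leader-only ack, as recommended above
props.put(ProducerConfig.RETRIES_CONFIG, 3);
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
```

And remember: with at-least-once delivery, the consumer side must stay idempotent regardless of this setting.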
Spring
- Tomcat + JVM consume almost 1G of RAM when idle. This, combined with process context switch costs, means we prefer bigger but fewer pods
- Spring AOP annotations work only on a Spring bean, and only when the annotated method is called from another class. We had many cases where @Async or @Transactional had no effect because of these two problems (see the sketches at the end of this section)
- Try to limit the @Transactional scope as much as you can. This means:
- It is highly debatable to put @Transactional on the class level. Prefer method-level annotations
- It is normal to create a method just to open the transaction.
- Rationale: reduce the DB transaction duration as much as we can
- If you use @Async, make sure you override the Executor bean
- Rationale: the default implementation is `SimpleAsyncTaskExecutor`, which may create an unlimited number of worker threads (see the override sketch at the end of this section)
- For OLTP APIs affecting UX, no timeout > 5s, ideally no more than 2s
- Rationale: if one API can’t finish under 2s, some system is under stress or faulty already. We fail early to prevent cascading failures across services
- Prefer native query to JPA/hibernate ORM. Rationale:
- Often you need to specify index hints
- The generated query is often too long to fit into the slow query log. What makes it worse is that the slow query log often truncates the WHERE condition
- Do not include inter-service call inside transaction. Rationale:
- Reduce DB txn duration as much as we can. This is a common source of deadlock
- It gives people a false sense of transaction between services when there is none. You have to implement TCC/Saga yourself to have the transaction-ish semantics
- All calls to third party APIs must be protected by a circuit breaker
- Rationale: Fail fast, so the worker thread on our tomcat can be freed quickly
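Two sketches for the pitfalls above. First, the self-invocation trap (`OrderService` and `Order` are made-up names): calling an annotated method from within the same class bypasses the Spring proxy, so the advice never runs.

```java
import java.util.List;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class OrderService {

    @Transactional
    public void save(Order order) { /* repository calls... */ }

    public void saveAll(List<Order> orders) {
        for (Order order : orders) {
            save(order); // self-invocation: bypasses the proxy, so @Transactional has NO effect here
        }
    }
}
```

Second, a bounded Executor override for @Async, sized loosely after the "cores * 10" queue rule from the Network section (all numbers are illustrative):

```java
import java.util.concurrent.Executor;
import java.util.concurrent.ThreadPoolExecutor;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
@EnableAsync
public class AsyncConfig {

    @Bean
    public Executor taskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(8);
        executor.setMaxPoolSize(8);
        executor.setQueueCapacity(80); // bounded queue: fail fast instead of queueing forever
        executor.setRejectedExecutionHandler(new ThreadPoolExecutor.AbortPolicy());
        executor.initialize();
        return executor;
    }
}
```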
Debugging JVM memory issues
Is the survivor space big enough?
If not, then after a minor GC the surviving objects can’t fit into the survivor space and will be promoted to the old gen directly. Consider making the new gen 50% bigger than the old gen
Which GC to use?
ParNew for new gen, CMS for old gen.
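On a pre-JDK 9 JVM, that combination maps to flags like the following (a sketch; pair them with your heap sizing flags):

```
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC
```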
Metaspace full
- Commonly caused by too many proxies generated by cglib
- By default the initial size is about 20M; on start this is very likely to trigger metaspace resizes, which lead to GCs. Consider setting metaspace size to min = max = 256MB
Native memory usage too high
e.g., off heap memory such as DirectByteBuffer
Where are strings stored?
- Constant strings are stored in string constant pool, which is on the heap. It is implemented as a hashtable of pointers to the strings on the heap
- Other strings created by the constructor just stay on the heap as usual
Tools
- `jmap -heap` - check new and old gen sizes
- `jmap -histo:live` - what is alive, and is it too much?
- `/proc/${PID}/fd`, `/proc/${PID}/task` - check FD usage and thread count
- `pstree`, `ss` - check process creation and the number of network connections
Who’s Got the Monkey
- The time imposed by your supervisor and peers is hard to cut down. Therefore, we have to control how much time we spend on subordinates, and many people spend way more than they prefer or even realize
- Progress reports should come from the subordinates. This also means a manager should try not to “get back to you later” - it is now or never. In case a decision does take time, make that decision in a scheduled meeting with the owner together, ideally face-to-face.
- At the end of such a meeting, the problem owner (not the manager) and the next meeting time must be explicitly set
- This meeting should not take more than 15 mins. Longer than that means the subordinate is not prepared
- The manager should NEVER make a decision on a subordinate’s problem alone!
On a side note, build a universal accountability culture to shorten the request-response time and feedback loop, i.e., everyone should be able to hold each other accountable
Outage caused by Log4j's Async Appender
In `AsyncAppender.append`, if the event fails to be moved to the in-memory buffer:

```java
if (blocking) {
    if (AbstractLogger.getRecursionDepth() > 1) { // LOG4J2-1518, LOG4J2-2031
        // If queue is full AND we are in a recursive call, call appender directly to prevent deadlock
        AsyncQueueFullMessageUtil.logWarningToStatusLogger();
        logMessageInCurrentThread(logEvent);
    } else {
        // delegate to the event router (which may discard, enqueue and block, or log in current thread)
        final EventRoute route = asyncQueueFullPolicy.getRoute(thread.getId(), memento.getLevel());
        route.logMessage(this, memento);
    }
} else {
    error("Appender " + getName() + " is unable to write primary appenders. queue is full");
    logToErrorAppenderIfNecessary(false, memento);
}
```
On default settings:
- The buffer is an `ArrayBlockingQueue`
- `blocking` is set to true
- It uses `DefaultAsyncQueueFullPolicy`, which most likely returns `EventRoute.ENQUEUE`, which in turn calls `BlockingQueue.put(logEvent)`

This means that under default settings, if the buffer remains full for a long time, e.g., the consumer is unable to keep up, your main logging thread will block on that `put` call, and in turn the Tomcat worker thread pool will be exhausted!
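One possible mitigation is to turn off blocking, so a full queue takes the `error(...)` branch above and routes to the error appender instead of stalling the caller. A sketch of the log4j2 XML (appender names and sizes are illustrative; dropping log events is the trade-off):

```xml
<Appenders>
  <Console name="Console" target="SYSTEM_OUT"/>
  <!-- blocking="false": on a full queue, route to the error appender instead of blocking -->
  <Async name="Async" blocking="false" bufferSize="1024">
    <AppenderRef ref="Console"/>
  </Async>
</Appenders>
```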
On AbstractQueuedSynchronizer
State
- Internally, AQS maintains a `volatile int state`
- state = 0 means free, 1 means acquired, and > 1 is the number of reentrant acquisitions
- state is updated via `unsafe.compareAndSwapInt`
Node and CLH lock
In a CLH lock, the acquiring thread spins only on a local variable, polling the previous node’s status; once it discovers that the prev node has released the lock, it stops spinning
```java
static final class Node {
    volatile int waitStatus; // controls blocking
    volatile Node prev;
    volatile Node next;
    volatile Thread thread;
    Node nextWaiter; // uses a sentinel value to mark whether the mode is shared or exclusive
}
```
- If `prev` is cancelled, this node can use the prev node’s status
- The head inside AQS is a sentinel node
- A custom synchronizer needs to implement how `state` is acquired and released; queue maintenance is done by AQS already
- waitStatus values:
  - 1 - cancelled
  - -1 - signal, i.e., the current node needs to wake up its `next` when it releases or cancels
  - -2 - condition, waiting to be woken up by a condition
  - -3 - propagate, for shared locks
Acquire
- You need to implement your own `tryAcquire` or `tryAcquireShared`; AQS supplies the queueing and parking (see the sketch below)
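For example, a minimal non-reentrant mutex only has to define how `state` is acquired and released; everything else comes from AQS (a sketch, not production code):

```java
import java.util.concurrent.locks.AbstractQueuedSynchronizer;

public class SimpleMutex {
    private static class Sync extends AbstractQueuedSynchronizer {
        @Override
        protected boolean tryAcquire(int arg) {
            return compareAndSetState(0, 1); // 0 = free, 1 = acquired
        }
        @Override
        protected boolean tryRelease(int arg) {
            setState(0); // we are the exclusive owner, so a plain write is fine
            return true;
        }
    }

    private final Sync sync = new Sync();
    public void lock()   { sync.acquire(1); }
    public void unlock() { sync.release(1); }
}
```

The AQS internals that maintain the wait queue: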
```java
private Node enq(final Node node) {
for (;;) {
Node t = tail;
if (t == null) { // Must initialize
if (compareAndSetHead(new Node()))
tail = head;
} else {
node.prev = t;
if (compareAndSetTail(t, node)) {
t.next = node;
return t;
}
}
}
}
private Node addWaiter(Node mode) {
Node node = new Node(Thread.currentThread(), mode);
// Try the fast path of enq; backup to full enq on failure
Node pred = tail;
if (pred != null) {
node.prev = pred;
if (compareAndSetTail(pred, node)) { //uses unsafe
pred.next = node;
return node;
}
}
enq(node);
return node;
}
private static boolean shouldParkAfterFailedAcquire(Node pred, Node node) {
int ws = pred.waitStatus;
if (ws == Node.SIGNAL)
/*
* This node has already set status asking a release
* to signal it, so it can safely park.
*/
return true;
if (ws > 0) {
/*
* Predecessor was cancelled. Skip over predecessors and
* indicate retry.
*/
do {
node.prev = pred = pred.prev;
} while (pred.waitStatus > 0);
pred.next = node;
} else {
/*
* waitStatus must be 0 or PROPAGATE. Indicate that we
* need a signal, but don't park yet. Caller will need to
* retry to make sure it cannot acquire before parking.
*/
compareAndSetWaitStatus(pred, ws, Node.SIGNAL);
}
return false;
}
private final boolean parkAndCheckInterrupt() {
    LockSupport.park(this); // to wake the thread up, use unpark() or interrupt()
    return Thread.interrupted(); // interruptions during park() are swallowed; a self-interrupt is added only after the resource is acquired
}
final boolean acquireQueued(final Node node, int arg) {
boolean failed = true;
try {
boolean interrupted = false;
for (;;) {
final Node p = node.predecessor();
if (p == head && tryAcquire(arg)) {//it is possible to get lock now, e.g., head has been released
setHead(node);
p.next = null; // help GC
failed = false;
return interrupted;
}
if (shouldParkAfterFailedAcquire(p, node) &&
parkAndCheckInterrupt())
interrupted = true;
}
} finally {
if (failed)
cancelAcquire(node);
}
}
public void acquire(int arg) {
if(!tryAcquire(arg) && acquireQueued(addWaiter(Node.EXCLUSIVE), arg))
        selfInterrupt(); // interrupts during the wait were swallowed; self-interrupt here to restore the interrupt status
}
public void acquireShared(int arg){
if(tryAcquireShared(arg) < 0)
doAcquireShared(arg);
}
```
- Difference between fair and non-fair mode: in non-fair mode, the thread CASes to acquire first and enqueues only on failure; it does not check whether the CLH queue is empty
- Shared lock: when a node becomes the head, it wakes up the next thread; upon release, a shared lock wakes up other threads regardless of state, while an exclusive lock wakes up the next only when state = 0
- For an exclusive lock, each node CAS-checks whether the previous node is the head; if so, it tries to acquire the lock
- CountDownLatch: state is set to N; each countDown() CASes state down by 1; when state = 0, the waiting (e.g., main) thread is unparked
On ThreadLocal
```java
static class ThreadLocalMap {
    private Entry[] table;
    private static final int INITIAL_CAPACITY = 16;

    static class Entry extends WeakReference<ThreadLocal<?>> {
        Object value;
        Entry(ThreadLocal<?> k, Object v) { // v is what the calling code put in
            super(k); // weak ref means the entry's key may become null
            value = v;
        }
    }
    // on get and set, will expungeStaleEntry
}

public class Thread implements Runnable {
    ThreadLocal.ThreadLocalMap threadLocals = null;
    ThreadLocal.ThreadLocalMap inheritableThreadLocals = null;
}

public class ThreadLocal<T> {
    public void set(T value) {
        Thread t = Thread.currentThread();
        ThreadLocalMap map = t.threadLocals;
        if (map != null)
            map.set(this, value);
        else
            createMap(t, value);
    }
}
```
- Standard technique: replace `% pow(2, n)` with `& (pow(2, n) - 1)`, and keep the size of the collection always 2^n
- The table uses linear probing to handle hash conflicts; note that probing will remove entries whose key is null. This is part of the reason why the load factor is 2/3, lower than that of HashMap
- `ThreadLocal.ThreadLocalMap` is a member variable of `Thread`, so we can access it via `Thread.currentThread()`
- The weak reference to the `ThreadLocal` means we will often have an `Entry` with key = null. That is why during get() and set() the code tries to remove stale entries; otherwise, there is no way for outside code to break the strong reference to the `value` object by this time. This is a common source of memory “leaks” and is also the reason it is recommended to call `ThreadLocal.remove()` in a finally clause after processing is done (see the sketch below).
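A sketch of that remove()-in-finally pattern (the SimpleDateFormat use case is just an illustration):

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class DateFormatter {
    private static final ThreadLocal<SimpleDateFormat> FMT =
            ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd"));

    public static String format(Date d) {
        try {
            return FMT.get().format(d);
        } finally {
            FMT.remove(); // drop the entry so the value can be collected even on pooled threads
        }
    }
}
```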
On Java thread state
- Java threads:
- New: by `new Thread()`
- Runnable: after `Thread.start()`
- Waiting: stays in the wait set; the thread previously called one of:
  - `Thread.join()` - needs that thread to finish
  - `Object.wait()` - needs `Object.notify()/notifyAll()` to become runnable
  - `LockSupport.park()`
- Timed_waiting:
  - `Thread.sleep(long)`
  - `Object.wait(timeout)`
  - `Thread.join(timeout)`
  - `LockSupport.parkUntil(long)`
- Blocked: needs to acquire a (monitor) lock but is unable to. Stays in the entry set.
- OS threads:
- new, ready, running, waiting, terminated
- A Java thread has no running state - only runnable, which includes both the ready and running states in the OS sense. Normally one thread executes for about 10ms on the CPU at a time. During blocking IO, e.g., `socket.accept()`, the Java thread is still RUNNABLE rather than BLOCKED or WAITING, but at the OS layer the OS thread is WAITING.
On Spring AOP
Terms
- Join point: A point during the execution of a program, such as the execution of a method or the handling of an exception.
- Advice: the action taken by an aspect at a particular join point, i.e., the actual code; think of it as a method of the aspect. Spring AOP currently supports only method execution join points (advising the execution of methods on Spring beans).
- Introduction: (Also known as an inter-type declaration). Declaring additional methods or fields on behalf of a type. Spring AOP allows us to introduce new interfaces (and a corresponding implementation) to any proxied object.
- Aspect: advice + introduction
cglib vs dynamic proxy
- JDK dynamic proxies work only for classes that implement an interface, and as the name suggests, they are implemented with the proxy pattern/composition. This means that to trigger the annotation, the method has to be called from outside the class. Implemented via reflection
- cglib doesn’t have this restriction, because it is implemented via inheritance. This also means the annotation works only for methods that can be overridden, i.e., non-final and non-private ones
On Java NIO
NIO model: Multiple channel -> one selector -> SelectionKey queue -> thread pool
- A Channel is roughly an abstraction of a file descriptor. It is bi-directional and can asynchronously read from and write to a buffer
- SocketChannel is commonly used in TCP client
- ServerSocketChannel in handling TCP request from outside
- DatagramChannel to handle UDP client and server
- One selector monitors multiple channels, and we poll the selector directly
- Buffers are commonly memory mapped so we don’t have to copy from kernel space to user space
- SelectionKey class states
- Connect – when a client attempts to connect to the server. Represented by SelectionKey.OP_CONNECT
- Accept – when the server accepts a connection from a client. Represented by SelectionKey.OP_ACCEPT
- Read – when the server is ready to read from the channel. Represented by SelectionKey.OP_READ
- Write – when the server is ready to write to the channel. Represented by SelectionKey.OP_WRITE
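A condensed sketch of the NIO model above (single selector; the thread pool hand-off and error handling are omitted):

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

Selector selector = Selector.open();
ServerSocketChannel server = ServerSocketChannel.open();
server.bind(new InetSocketAddress(8080));
server.configureBlocking(false);
server.register(selector, SelectionKey.OP_ACCEPT);

while (true) {
    selector.select(); // blocks until at least one channel is ready
    Iterator<SelectionKey> it = selector.selectedKeys().iterator();
    while (it.hasNext()) {
        SelectionKey key = it.next();
        it.remove();
        if (key.isAcceptable()) { // new TCP connection
            SocketChannel client = server.accept();
            client.configureBlocking(false);
            client.register(selector, SelectionKey.OP_READ);
        } else if (key.isReadable()) { // channel has data
            ByteBuffer buf = ByteBuffer.allocate(1024);
            ((SocketChannel) key.channel()).read(buf);
            // hand the buffer off to the worker thread pool here
        }
    }
}
```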
NIO itself is a case of the reactor model: multiple clients -> one acceptor -> multiple channels -> one reactor -> multiple handlers

select() copies the fd set from user space to kernel space, and tests in kernel mode whether each request is ready.
- Blocks when there is no data
- Like poll(), it requires the user to iterate over all channels again after it reports that some channels are ready

In level-triggered epoll(), all threads will wake up and try accept() - but only one can get it!

Unlike select() and poll(), epoll shares its fd set between user space and kernel space - that copy is saved
- epoll returns only the requests with data. It reorders fds, putting all fds with data at the front, and then returns
My interview questions for SQL DBA candidates
Expectations
- Software engineers are responsible for
- Design and performance of queries and tables
- Basic DB topology design, including the placement of DB servers
- Monitoring and alerting metrics from the application code, but not DB
- DBAs are responsible for
- Monitoring and alerting metrics from DB layer
- Backup and restore db from/to different servers
- Failover/failback drill
- A DBA should be able to answer at minimum 80% of the questions I posted here
- Communication-wise, see if they are aware of and follow the STAR principle
- Culture wise, see if they demonstrate
- Ownership
- Disagree and commit
- Dive deep
Questions
- Give two cases of deadlocks you have seen on the DB, and explain how you solved them.
- Be able to explain isolation levels
- Be able to explain the differences between lock types
- Be able to explain ways to detect and fix deadlock cases
- Give two cases of slow queries you have seen on the DB, and explain how you solved them
- Show the use of slow query log, query plan, and unix performance debugging commands
- Monitoring and alerting should be in place to discover a slow query within 5 mins for OLTP, or high resource usage for OLAP
- Be able to explain the common patterns of setting up indices and identify when the indices are too many or too few
- Give two cases of replication lag you have seen on the DB, and explain how you solved them
- Be able to explain the replication lag threshold they set, which should be no more than 10 minutes
- Be able to explain the HA and DR setup of the replication tool, and the operational excellence needed to ensure the HA
- Be able to give the verification procedure after the replication is fixed
- What is the backup/restore DB speed I can expect?
- Be able to dig for more context without being prompted. Red flag if they jump into an estimate immediately
- Be able to give the verification procedure after the restoration is done
- Be able to explain common parameters of the dump/restore tool
- Design or explain DR drills for the DBs you are responsible for
- Knows the meaning of RTO, RPO, MTTR. Red flag if not able to specify them before jumping into drill details
- The DR drills should cover multiple cases, from the simplest to a complete outage. At least 3 cases should be covered
How tidb-operator constructs an AWS EKS cluster
Context: we want to provision multiple tidb clusters inside the same EKS cluster
Terraform
- Uses different worker groups to host different components, each worker group with specific taint and node label at the same time, e.g.,
--register-with-taints=dedicated=pd:NoSchedule --node-labels=dedicated=pd
- Each worker group maps to an ASG, and we can then add tags to the ASG itself. Note that by default the monitor group doesn’t have extra tags
- The user data of each worker group just configures the docker group and formats and mounts the nvme disk
- Uses a null resource provisioner to apply the following k8s manifests in order: CRD, local volume provisioner, gp2 storage class, tiller RBAC
- inits helm and waits until it completes
- releases a tidb-operator via helm chart of the same name to tidb-admin namespace
- releases a tidb-cluster via helm chart of the same name, to tidb namespace. Note that this chart accepts data.template_file.tidb_cluster_values as params
- Then the null resource provisioner will keep polling via kubectl until pd, tidb, and tikv are ready
- Note that elb and network access rules are not provisioned explicitly here
- The helm release for tidb-cluster is done by a tf module, where `tidb-cluster-values.yaml.tpl` provides params to:

```hcl
resource "helm_release" "tidb-cluster" {
  depends_on = ["helm_release.tidb-operator"]
  name       = "tidb-cluster-${var.cluster_name}"
  namespace  = "tidb"
  chart      = "${path.module}/charts/tidb-cluster"
  values     = ["${data.template_file.tidb_cluster_values.rendered}"]
}
```
Tidb cluster helm chart
- Check `pkg/apis/pingcap.com/v1alpha1/types.go` for the template values; important ones include:

```go
// TidbCluster is the control script's spec
type TidbCluster struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata"`

	// Spec defines the behavior of a tidb cluster
	Spec TidbClusterSpec `json:"spec"`

	// Most recently observed status of the tidb cluster
	Status TidbClusterStatus `json:"status"`
}

type TidbClusterSpec struct {
	SchedulerName   string              `json:"schedulerName,omitempty"`
	PD              PDSpec              `json:"pd,omitempty"`
	TiDB            TiDBSpec            `json:"tidb,omitempty"`
	TiKV            TiKVSpec            `json:"tikv,omitempty"`
	TiKVPromGateway TiKVPromGatewaySpec `json:"tikvPromGateway,omitempty"`

	// Services list non-headless services type used in TidbCluster
	Services        []Service                            `json:"services,omitempty"`
	PVReclaimPolicy corev1.PersistentVolumeReclaimPolicy `json:"pvReclaimPolicy,omitempty"`
	Timezone        string                               `json:"timezone,omitempty"`
}
```
Peak Performance: Elevate Your Game, Avoid Burnout, and Thrive with the New Science of Success
My reading notes
- The team falls back to the lowest common denominator in terms of attitude. It is important to purge the bad apples!
- Do not use outcome as the driver of bettering yourself
- Turn OFF social media while focusing. Even a phone on the desk will distract us - stow it away!
- Make the stress phase about 5% above your current level. More and you get anxious; less and you get bored
- Social media and TV are NOT good relaxation techniques. You need to zone out and check out, instead of doing endorphin-inducing activities.
- For knowledge workers, exercising is a good way to relax.
- Social interaction with friends is also a good way, but not with your colleagues
- Walking breaks every hour
- Deep breathing may not always work; instead, view stress as a challenge rather than a threat.
- The less we think about ourselves, the better we perform
- When your mind is not at peace or you are in a bad mood, it is hard to think creatively. Therefore, be mindful of what is happening in other people’s lives outside work - it affects how they react too.
- Separating work and life’s influence on each other is impossible
- Even top performers are not able to focus for more than 2 hours in one sitting
- Napping cannot make up for the lack of REM sleep, but a short nap (15-20 mins) is a good way to replenish energy during a long and intense day
- Making decisions takes energy, even the slightest ones
- Discomfort is a natural part of stress. Learn to separate yourself from it
- If you check your phone every couple of minutes, then you need a break!
- Fatigue spills over tasks
Product engineer's guide to the tidb migration
Shared by MySQL and TiDB
- 90% of the time you should not use a plain SELECT (no FOR UPDATE) inside a transaction in either MySQL or TiDB. Most likely you need SELECT FOR UPDATE (SFU)
- SFU is a common solution to defend against write skew and the lost update problem.
- No nested transactions or remote calls inside a transaction
- Do not use foreign keys in the DB even if they are supported
- Concurrent insertions into an auto-increment ID column may produce ID values that do not appear in increasing order
Unique to TiDB
- The default isolation level is Snapshot Isolation (SI), which is similar to MySQL’s Repeatable Read (RR). It is stronger than Read Committed (RC), but not exactly the same as RR
- Auto-increment IDs guarantee uniqueness but not continuity. HIGHLY recommend moving from auto-increment IDs to an ID generator such as snowflake.
- Hot key-range problem in the kv storage during batch writes. Note that for int/bigint types, the key in tikv is the primary key
- SFU uses optimistic locking. This means we need to catch optimistic lock exceptions from the mysql client explicitly.
Unique to MySQL
- Phantom reads are possible at Repeatable Read but not under SI
How percolator/tidb transaction works
Uses a timestamp oracle (TO) to allocate monotonically increasing timestamps. Every txn contacts the TO twice: once for start_ts at the beginning of prewrite, and once for commit_ts at the beginning of commit
Each logical k-v has 3 column families (CF):
- Data: key_startTs -> value. Stores the actual data
- Lock: key -> meta. Stores txn lock info
- Write: key_commitTs -> startTs. The value committed at key_commitTs is stored at (Data)key_startTs
Sequence of actions
Initially Bob has $10, and Joe has $2
So the k-vs at the start of the transaction are:
(Data)Bob_5: 10
(Data)Bob_6: null
(Lock)Bob_5 : null
(Lock)Bob_6 : null
(Write)Bob_5: null
(Write)Bob_6: data@5
(Data)Joe_5: 2
(Data)Joe_6: null
(Lock)Joe_5 : null
(Lock)Joe_6 : null
(Write)Joe_5: null
(Write)Joe_6: data@5
Note that the write keys at version 6 point to the data key at version 5. This is the end state of the previous transaction
Now Bob wants to transfer $7 to Joe
- Gets start_ts = 7 from the TO, and writes (Data)Bob_7: 3
- Writes (Lock)Bob_7: Primary. This marks this key as THE primary lock of the whole txn, regardless of how many keys are changed in this txn
- Writes (Data)Joe_7: 9, and (Lock)Joe_7: Primary@Bob. The new lock entry marks the secondary lock, which is used to look up the primary lock. Now the prewrite step is done!
- Gets commit_ts = 8 from the TO, and writes (Write)Bob_8: data@7
- Deletes (Lock)Bob_7
- At this stage we consider the txn complete already, even though we have not cleaned up Joe’s entries yet. In tidb, the steps below are processed asynchronously. See twoPhaseCommitter.doActionOnKeys
- Checks (Lock)Joe_7, finds it points to (Lock)Bob_7, which has been deleted already, so we know the txn is complete, and deletes (Lock)Joe_7
- Writes (Write)Joe_8: data@7
Final shape
(Data)Bob_5: 10
(Data)Bob_6: null
(Data)Bob_7: 3
(Data)Bob_8: null
(Lock)Bob_5 : null
(Lock)Bob_6 : null
(Lock)Bob_7 : null
(Lock)Bob_8 : null
(Write)Bob_5: null
(Write)Bob_6: data@5
(Write)Bob_7: null
(Write)Bob_8: data@7
(Data)Joe_5: 2
(Data)Joe_6: null
(Data)Joe_7: 9
(Data)Joe_8: null
(Lock)Joe_5 : null
(Lock)Joe_6 : null
(Lock)Joe_7 : null
(Lock)Joe_8 : null
(Write)Joe_5: null
(Write)Joe_6: data@5
(Write)Joe_7: null
(Write)Joe_8: data@7
Design Problem: Pagination
Problem
- composite PK for message (msg_id, order_id, post_id)
- We want to paginate by time
- data is sharded across DBs
- partition key is id % shards to ensure even distribution among shards
- We want to be able to jump to a SPECIFIC page with a known page size.
Of course a complete merged view can do that, but can we do better?
100% accurate: next page only
- From a product PoV, allow only “next page”, instead of jumping to a specific page.
- For every page we fetch, we remember the timestamp at the end of the current page
- When we look for the next page, we use that last timestamp to pull a full page from each shard, and aggregate them in memory (see the sketch below)
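A sketch of this scheme (`ShardDao`, `Message`, and `pageAfter` are hypothetical names):

```java
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

List<Message> nextPage(Timestamp lastSeen, int pageSize, List<ShardDao> shards) {
    List<Message> candidates = new ArrayList<>();
    for (ShardDao shard : shards) {
        // per shard: SELECT ... WHERE created_at > ? ORDER BY created_at LIMIT pageSize
        candidates.addAll(shard.pageAfter(lastSeen, pageSize));
    }
    candidates.sort(Comparator.comparing(Message::createdAt));
    // keep the first pageSize rows; the last row's created_at becomes the cursor for the next call
    return candidates.subList(0, Math.min(pageSize, candidates.size()));
}
```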
Mostly accurate: guess the shard
- We just assume the data is evenly distributed
- Therefore, for each specific page we can compute the offset and page size exactly. Both will be the original value divided by the number of shards
100% accurate: good perf but more complicated
- Similar to the guess-the-shard method, pull pages from the sharded DBs based on guessed offsets but the real page size; call the returned pages RS
- Find the min_time across all pages returned
- Query again with the time between min_time and the max time returned by each page in RS
- Find the virtual offset of min_time in each shard’s page; now we know the exact offset of min_time from the global PoV. Note that min_time’s global offset will be less than the target offset for sure.
- Now, with the global view, we have enough information in memory to calculate the real offset of each record
Kelly criterion & Zipf's law
Kelly criterion
A more general problem relevant for investment decisions is the following:
- The probability of success is p.
- If you succeed, the value of your investment increases from 1 to 1+b.
- If you fail (for which the probability is q=1-p) the value of your investment decreases from 1 to 1-a. (Note that the previous description above assumes that a is 1).
The fraction to bet is p/a - q/b. For example, with p = 0.6, q = 0.4 and even odds (a = b = 1), you should bet 0.6 - 0.4 = 20% of your bankroll.
Zipf’s law
The frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
Only 135 vocabulary items are needed to account for half of the Brown Corpus. Note that this is somewhat related to the Pareto principle, i.e., the 80/20 rule.
The same relationship occurs in many other rankings unrelated to language, such as the population ranks of cities in various countries, corporation sizes, income rankings, ranks of number of people watching the same TV channel
Zipf’s law is most easily observed by plotting the data on a log-log graph, with the axes being log (rank order) and log (frequency)
Secretary Problem
The purpose here is not to show that this strategy is perfect. It just shows the inherent complexity in hiring. Whatever scheme you devise, it is mostly luck that filters out the “maybe-no”s. However, in many places the bar happens to be much higher than what the actual job needs. Thus, whatever hazing ritual is in place appears to show good results, and we have to accept that as part of life and play by the rules.
When N is known
Note the differences from real life:
- If you reject, the candidate will not appear again
- Decisions must be made on the spot
- The interviewer will rank all applicants against each other
If N is known, and we are filling only 1 position
- The optimal strategy is to reject first n/e (about 37%) of candidates, and then pick the first candidate “better” than all previous ones
- The probability of picking the “best” candidate converges to 1/e (37%) too. Such a result shows the inherent complexity of hiring
- A variant exists for picking k out of a pool of n; the cut-off threshold is about 0.25
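A quick Monte Carlo check of the 1/e rule (my own illustrative simulation, not from the original analysis):

```java
import java.util.Random;

public class SecretarySim {
    public static void main(String[] args) {
        int n = 100, trials = 100_000, wins = 0;
        int cutoff = (int) (n / Math.E); // reject the first ~37%
        Random rnd = new Random();
        for (int t = 0; t < trials; t++) {
            double[] scores = rnd.doubles(n).toArray();
            double bestSeen = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < cutoff; i++) bestSeen = Math.max(bestSeen, scores[i]);
            int picked = -1;
            for (int i = cutoff; i < n; i++) {
                if (scores[i] > bestSeen) { picked = i; break; } // first better than all seen
            }
            int bestIdx = 0;
            for (int i = 1; i < n; i++) if (scores[i] > scores[bestIdx]) bestIdx = i;
            if (picked == bestIdx) wins++;
        }
        System.out.printf("win rate = %.3f (theory: 1/e = 0.368)%n", (double) wins / trials);
    }
}
```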
When N is unknown, 1/e strategy
But the timeframe T is known, and candidates appear following the same (not necessarily uniform) distribution
- Then we pick the moment where the cumulative probability that the best candidate has already appeared is 1/e
- There is a 1/e chance you will get the best candidate, and a 1/e chance you pick no one AT ALL
Common TCP and HTTP problems
Terms
- C: client
- S: server
- A: active
- P: passive
- ssthresh: slow-start threshold
- FR: fast retransmit
- FCQ: full connection q
TCP
- Why do we need 3 network calls during the init handshake, and 4 to terminate?
- S needs to ensure that C got the ACK
- During termination, the server cannot merge the second FIN with its ACK because it may still need to send remaining data. The first FIN only ensures that C has no remaining data.
- What are possible reasons for many TIME_WAIT? Why wait 2 MSL?
- TIME_WAIT is the A side’s (normally the client) state after it receives FIN from the P side (normally the server) and replies with ACK
- The client has to wait 2 * MSL so that the server’s resent FIN will time out; 2 * MSL is the time for the server to try the FIN twice (the first being the normal try). Otherwise, the server’s FIN may leak into the next session
- Normally MSL is between 30s to 2 mins
- What are possible reasons for many CLOSE_WAIT?
- CLOSE_WAIT is a state for P (normally the server): after A sends its FIN, P ACKs it (with K+1 as the ACK number) but has not yet sent its own FIN. It is needed so that P can still deliver data that was in flight when A sent its FIN
- A common cause is the application not closing the connection on its side (while the client side may have closed it due to long idleness)
- TCP congestion control, how is it done? How does tcp decide there is congestion?
- TCP Tahoe: the window doubles each round until we reach ssthresh; after that we send just 1 more than before
- To avoid the stampede effect, once congestion is detected (via packet loss), we fast-retransmit and halve ssthresh
- On fast retransmit, when the receiver gets a later frame, it just ACKs the last one it is still expecting
- Note that different algorithms set different initial window sizes
- Suppose A and B established a connection with no data sent, and then B restarts. What is the state of the connection on A? How do we get rid of / move on from this state?
- A will remain ESTABLISHED; the connection is recycled by the OS after a timeout
- Half connection queue vs full connection queue
- The FCQ is for connections in the ESTABLISHED state
- If you see many SYN-RCVD connections, it may be a sign of a SYN flood DDoS. Such connections are in the half-open state inside the half-connection queue
- When C starts the first handshake but S is unable to process it, what does S return?
- S will resend SYN+ACK periodically, forcing the client to think its original ACK was lost, so it just retries.
- Why does TCP have the sticky packet problem, but UDP does not?
- The UDP header has 16 bits specifying the datagram’s length, but TCP doesn’t have that
- That is why netty uses various frame decoders (see the sketch below)
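For example, Netty’s length-field framing restores message boundaries on top of TCP’s byte stream (a fragment; the parameter values are illustrative):

```java
// inside a ChannelInitializer's initChannel(ch): ChannelPipeline pipeline = ch.pipeline();
// outbound: prepend a 4-byte length field; inbound: split the stream on that field and strip it
pipeline.addLast(new LengthFieldPrepender(4));
pipeline.addLast(new LengthFieldBasedFrameDecoder(1_048_576, 0, 4, 0, 4));
```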
HTTP
- keep-alive: reuses the underlying TCP connection between requests. Needs a timeout so the server can release it.
- The client side may need a cleanup thread with a shorter timeout. Otherwise, the server may send a FIN right when the new request comes
- How to handle the cross-domain problem?
- Resend vs redirect?
How to deal with credit-stealing management
- Fight only the most important battles. Not getting all the credit you deserve is a fact of working in a team.
- Do NOT act on impulse. Think about how this will affect you in the next 2-5 years. Fight only over things with long-term consequences
- Credit stealing, e.g., the “I” tone, may be unintended or even well-intended.
- He may want to take ownership of the whole thing - both success and failure
- He may want to attach his name to get more buy-ins
- As self-protection, create paper trails and involve other people early
- During 1-on-1s, raise that you want more visibility and want chances to present ideas as your own.
- Make sure you remain neutral and non-confrontational.
- Stick with positive sandwich!
- “Tell her how proud you are of other team members for their specific contributions, and how those are helping her. Over time it will help her see more of the benefits of giving credit; how savvy that makes you look; and how indispensable a team is to her own success.” (against an insecure manager)
- Praise the manager in public if he is seeking visibility, while slipping in your own contribution (the “I” tone is acceptable here)
- Set yourself up as the information source. Let other people know that they can find details from you
- If a hiring manager asks why you left, answer honestly, but emphasize fairness and cooperation, and that your actions demonstrate ethics and personal direction
- Plan the idea out as if it already has buy-in, and present the detailed solution. (Similar to the foot-in-the-door approach?)
- Prepare additional data not included in the presentation, and slip it in during the presentation. The goal is to establish yourself as the expert
Process for infra cost control
Product engineering team
- Anyone in the product team can request resources.
- Use the guideline below to request infra resource.
- Any resource request higher than the guideline needs to be backed by data or concrete reasons. Gut feeling is not acceptable.
- The infra team will challenge the product team on costs.
- Three strikes for overprovisioning. Upon the third strike, the product team must provide a post mortem on why over-provisioning happens so often and publish it to all tech teams
- If the product team is not able to convince the infra team, escalate to the team lead and then the senior tech lead
Infra engineering team
- The infra team is responsible for challenging the product team to bring down the cost.
- Every successful challenge and potentially saved cost should be logged and recorded as part of the KPI. This number will be published to the management monthly or quarterly.
- Use the guideline below to request infra resource.
- Any resource request higher than the guideline needs to be backed by data or concrete reasons. Gut feeling is not acceptable
- If the product team is not able to convince the infra team, escalate to the team lead and then the senior tech lead
Guidelines
- Overall, if cost is the same, prefer bigger but fewer instances, e.g., provision one c4.8xlarge instead of two c4.4xlarge.
- Reduces the probability that any given instance fails
- Reduces the noisy neighbor problem commonly experienced in virtualized environments
- Reduces the per-host monitoring cost, both effort- and money-wise.
- Computing instances for k8s
- Instance type defaults to c5.9xlarge.
- Any instance request less than c5.4xlarge should be extremely rare, e.g., once per month.
- Number of replicas:
- Each tomcat instance should handle between 200 to 2000 RPS.
- This means it is very unlikely you need more than 20 replicas for a single service
- Use Little’s law to estimate. Rule of thumb is 1k RPS per replica
- Each kafka consumer should be able to handle at least 200 msg/sec. In general the consumption is not the bottleneck; the data sink is.
- For OLTP workloads, it is highly unlikely you need more than 3 consumers.
- RDS/Aurora
- Overprovisioning write db is OK, because it is hard to scale up.
- For high traffic OLTP services (> 1000 rps peak), r5.4xlarge is more than enough
- For internal services OLTP (less than 100 rps), m5.2xlarge is more than enough
- Read db is needed only for high traffic OLTP services, max 3 read dbs are enough.
- If the suggested DB specification is not enough, then 80% of the time it is a code or architecture problem
- Redis
- One r5.4xlarge should be enough for almost all our use cases, i.e., 20k rps.
- If the Redis is expected to handle no more than 200 RPS, then you don’t really need Redis. Just use the DB
- Replication factor 2 is enough in 80% cases, although 3 is OK
- Elasticsearch
- Keep between 10% to 30% free storage size. Higher than 30% means over-provision
- For log analysis, defaults to m5.4xlarge, with 2TB storage per node