Scalable microservice checklist for engineers
This is based on the internal guide I wrote for PayPay teams. I removed data specific to our environment. Almost every item here has a SEV 2 behind it.
Target audience
Backend engineers developing Java microservices
Java
- GC’s stop-the-world (STW) pauses are a common source of CPU spikes and latency spikes.
- Most backend services are IO intensive rather than CPU intensive. In such cases, the CPU usage of the process should not exceed 50% of the cores.
- Use metrics to check both young GC and old GC times and frequency. Frequent young GC is OK. Any sign of old GC requires attention
- STW also means a timer you set on the Java side may include GC time on top of the actual execution time. Therefore, make sure you track client-side AND server-side processing time at the same time
- Heap size rule of thumb: try to keep it between 4-8G
- Rationale: below 4G, the survivor region may be too small, so objects get promoted to the old generation too easily, making old GC more likely
- Rationale: above 8G, the full GC STW pause may be too long
- JVM settings
- Same -Xmx and -Xms
- For OLTP systems, prefer a bigger new gen
- 256MB is more than enough for metaspace
- 1M max stack size is more than enough
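Putting those rules of thumb together, a starting point might look like this (illustrative sizes, not a drop-in config; tune against your own GC metrics):

```
# same -Xms/-Xmx, bigger new gen for OLTP, fixed metaspace, 1M stacks
java -Xms6g -Xmx6g -Xmn3g \
     -XX:MetaspaceSize=256m -XX:MaxMetaspaceSize=256m \
     -Xss1m -jar app.jar
```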
Network
- Make sure you understand Little’s law, and use it to estimate your connection pool size and task queue size (see the worked example at the end of this section)
- Pool’s init/min/max size should be constant, so we can warm up the pool on start
- On db/server side, prefer smaller connection pool size. Higher than (number of cores * 3) is very debatable.
- Rationale: your cores will be saturated with (number of cores * 3) workers already. Adding more workers only adds context switch cost without giving the CPU any breathing room
- Set a max task queue size. Higher than (number of cores * 10) is very debatable. When the queue is full, prefer an abort or discard policy.
- Rationale: the system cannot process fast enough already; piling tasks into the queue will not help the task processing time
- Do NOT make any remote/inter-process calls while inside a mutex. There is a 95% chance such code is wrong
- The network latency between nodes in our DC is about 1-2 ms. Do not issue sequential network calls, as the latency adds up quickly
- Use low timeouts so we can fail early, which also protects the downstream system. The timeout should be no more than 2s on UX-impacting calls (may even consider 1s)
- Retry no more than 3 times, with random jitter to avoid the thundering herd problem
- Because of retries, the timeout of each call needs to be 1/2 or 1/3 of the client’s read timeout. Otherwise, the retry has no effect because the client has already timed out
- A timeout normally just kills the current request connection. Often the client side needs to do additional work to trigger cleanup of a long-running action. The key point: don’t expect the server side to clean up without explicit instruction from the client side
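A worked example of the Little’s law sizing mentioned at the top of this list (the numbers are illustrative):

```java
// Little's law: L = lambda * W
double lambda = 400.0;        // arrival rate: 400 requests/second
double w = 0.050;             // average time each request holds a connection: 50 ms
double inFlight = lambda * w; // ~20 connections in flight on average
int poolSize = (int) Math.ceil(inFlight * 1.2); // small headroom; keep init = min = max to warm up
```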
Operation
- Health check
- The thread answering the health check is often different from the worker threads. To prevent "ninja" work done by worker threads in a half-dead process, make sure you shut down the whole process upon detecting a health check failure
- ELB’s default health check is a 10s timeout, and it registers the instance as failed after 3 attempts. This is too generous if the load balancer is within 100 km of the instances, i.e., failures may not be detected soon enough
- For the same-region scenario, consider a health check timeout of 1s, with 3 as the failure threshold and 2 as the success threshold
RDS
- Memory should be able to hold all indices. You can estimate the size of the indices by (number of rows * avg size of primary keys)
- Try not to use an incremental id as the PK on big tables; use a snowflake/uid generator if possible.
- Rationale: strictly incremental id generation is NOT supported in most distributed systems. We will have to use snowflake for id generation when we migrate off RDS anyway.
- Make sure you add quotes when filtering on a varchar column, e.g., `WHERE phone = '12345'` rather than `WHERE phone = 12345`. Otherwise the implicit cast means the index on that varchar column will not be hit
- Solving deadlock:
- On Repeatable-Read isolation level, a common source is the gap lock introduced by batch inserts
- Check the execution plan of the conflicting queries. Most likely one is holding a gap lock or table lock by mistake
- Solving slow queries
- Check the query plan with `EXPLAIN ANALYZE`. If there is no “range” step, something is wrong.
- Note that an “index” type step is still a full table scan
- The most reliable fix is to add an index hint. With it, you don’t even need to worry about outdated table statistics.
Redis
- You don’t need Redis if:
- Your distributed lock takes no more than 100 RPS - you can just implement the lock in the DB (see the sketch after this list)
- Rationale: trade-off between code complexity vs architecture complexity
- 80% of your data access hits fewer than 50k entries - you can just use a local in-memory cache
- Rationale: 50k entries * 1KB per entry = only 50MB of memory
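A minimal sketch of the DB-backed lock mentioned above, assuming MySQL and a hypothetical `locks(name PRIMARY KEY, owner, expires_at)` table; expiry cleanup and retries are left out:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public final class DbLock {
    // Acquire by inserting a row; the PRIMARY KEY guarantees a single owner.
    public boolean tryAcquire(Connection conn, String name, String owner) throws SQLException {
        String sql = "INSERT INTO locks (name, owner, expires_at) "
                   + "VALUES (?, ?, NOW() + INTERVAL 30 SECOND)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, name);
            ps.setString(2, owner);
            return ps.executeUpdate() == 1; // we now hold the lock
        } catch (SQLException duplicateKey) { // real code should check the vendor error code
            return false;                    // someone else holds it
        }
    }

    // Release by deleting only our own row.
    public void release(Connection conn, String name, String owner) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "DELETE FROM locks WHERE name = ? AND owner = ?")) {
            ps.setString(1, name);
            ps.setString(2, owner);
            ps.executeUpdate();
        }
    }
}
```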
- Redis is single threaded. This means any slow command will block everyone else.
- For the non k-v lookup case, make sure your collection size is less than 200 items. Otherwise, you need to redesign your data structure
- Rationale: avoid the big key problem
Kafka
- Common kafka use cases:
- Use event sourcing or an event-driven model, or
- Have to buffer requests to process asynchronously
- People often use Kafka just to partition workloads. Explore other options before opting for kafka. Rationales:
- Your scaling factor is limited by the number of topic partitions, which is very hard to scale up
- Very hard to deploy 50+ consumer pods on the cluster, because each Spring pod is a separate process consuming > 1G memory
- Your true bottleneck is almost always the datasink anyway
- Most producer/consumer logic is at-least-once, so idempotency should be built into your logic all the time
- Kafka does have exactly-once processing, but that works only when both source and sink are kafka topics
- Most likely you need producer ack = 1 (see the config sketch at the end of this section)
- If you do a DB write and a kafka write in the same API request, you will need a recon job to reconcile inconsistencies. Rationales:
- The DB write and kafka write are not in the same transaction, so all-or-nothing semantics can’t be guaranteed
- Reliable message delivery is possible, but not trivial. The effort/reward trade off in our experience is not worth it.
- Replication factor 3 is enough. You can even set it to 2 if losing data is acceptable, e.g., low importance logs
- Rationale: Kafka deployment is evenly spread across 3 AZs. Replication factor 3 is enough to counter 1 AZ failure
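For reference, the producer ack setting above maps to config like this (a sketch; the broker address and serializers are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
props.put(ProducerConfig.ACKS_CONFIG, "1"); // leader-only ack, as recommended above
props.put(ProducerConfig.RETRIES_CONFIG, 3);
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
```

And remember: with at-least-once delivery, the consumer side must stay idempotent regardless of this setting.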
Spring
- Tomcat + JVM consume almost 1G of RAM when idle. This, combined with process context switch costs, means we prefer bigger but fewer pods
- Spring AOP annotations work only on a Spring bean, and only when the annotated method is called from another class. We had many cases where @Async or @Transactional had no effect because of these two problems (see the sketches at the end of this section)
- Try to limit the @Transactional scope as much as you can. This means:
- It is highly debatable to put @Transactional on the class level. Prefer method-level annotations
- It is normal to create a method just to open the transaction.
- Rationale: reduce the DB transaction duration as much as we can
- If you use @Async, make sure you override the Executor bean
- Rationale: the default implementation is `SimpleAsyncTaskExecutor`, which may create an unlimited number of worker threads (see the override sketch at the end of this section)
- For OLTP APIs affecting UX, no timeout > 5s, ideally no more than 2s
- Rationale: if one API can’t finish under 2s, some system is under stress or faulty already. We fail early to prevent cascading failures across services
- Prefer native query to JPA/hibernate ORM. Rationale:
- Often you need to specify index hints
- The generated query is often too long to fit into the slow query log. What makes it worse is that the slow query log often truncates the WHERE condition
- Do not include inter-service call inside transaction. Rationale:
- Reduce DB txn duration as much as we can. This is a common source of deadlock
- It gives people a false sense of transaction between services when there is none. You have to implement TCC/Saga yourself to have the transaction-ish semantics
- All calls to third party APIs must be protected by a circuit breaker
- Rationale: Fail fast, so the worker thread on our tomcat can be freed quickly
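Two sketches for the pitfalls above. First, the self-invocation trap (`OrderService` and `Order` are made-up names): calling an annotated method from within the same class bypasses the Spring proxy, so the advice never runs.

```java
import java.util.List;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class OrderService {

    @Transactional
    public void save(Order order) { /* repository calls... */ }

    public void saveAll(List<Order> orders) {
        for (Order order : orders) {
            save(order); // self-invocation: bypasses the proxy, so @Transactional has NO effect here
        }
    }
}
```

Second, a bounded Executor override for @Async, sized loosely after the "cores * 10" queue rule from the Network section (all numbers are illustrative):

```java
import java.util.concurrent.Executor;
import java.util.concurrent.ThreadPoolExecutor;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
@EnableAsync
public class AsyncConfig {

    @Bean
    public Executor taskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(8);
        executor.setMaxPoolSize(8);
        executor.setQueueCapacity(80); // bounded queue: fail fast instead of queueing forever
        executor.setRejectedExecutionHandler(new ThreadPoolExecutor.AbortPolicy());
        executor.initialize();
        return executor;
    }
}
```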
Debugging JVM memory issues
Is the survivor space big enough?
If not, then after a minor GC the surviving objects can’t fit into the survivor space and will be promoted to the old gen directly. Consider making the new gen 50% bigger than the old gen
Which GC to use?
ParNew for new gen, CMS for old gen.
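On a pre-JDK 9 JVM, that combination maps to flags like the following (a sketch; pair them with your heap sizing flags):

```
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC
```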
Metaspace full
- Commonly caused by too many proxies generated by cglib
- By default the initial size is about 20M; on start this is very likely to trigger metaspace resizes, which lead to GCs. Consider setting metaspace size to min = max = 256MB
Native memory usage too high
e.g., off heap memory such as DirectByteBuffer
Where are strings stored?
- Constant strings are stored in string constant pool, which is on the heap. It is implemented as a hashtable of pointers to the strings on the heap
- Other strings created by the constructor just stay on the heap as usual
Tools
- `jmap -heap` - check new and old gen sizes
- `jmap -histo:live` - what is alive, and is it too much?
- `/proc/${PID}/fd`, `/proc/${PID}/task` - check FD usage and thread count
- `pstree`, `ss` - check process creation and the number of network connections
Who’s Got the Monkey
- The time imposed by your supervisor and peers is hard to cut down. Therefore, we have to control how much time we spend on subordinates, and many people spend way more than they prefer or even realize
- Progress reports should come from the subordinates. This also means a manager should try not to “get back to you later” - it is now or never. In case a decision does take time, make that decision in a scheduled meeting with the owner together, ideally face-to-face.
- At the end of such a meeting, the problem owner (not the manager) and the next meeting time must be explicitly set
- This meeting should not take more than 15 mins. Longer than that means the subordinate is not prepared
- The manager should NEVER make a decision on a subordinate’s problem alone!
On a side note, build a universal accountability culture to shorten the request-response time and feedback loop, i.e., everyone should be able to hold each other accountable
Outage caused by Log4j's Async Appender
In `AsyncAppender.append`, if the event fails to be moved to the in-memory buffer:

```java
if (blocking) {
    if (AbstractLogger.getRecursionDepth() > 1) { // LOG4J2-1518, LOG4J2-2031
        // If queue is full AND we are in a recursive call, call appender directly to prevent deadlock
        AsyncQueueFullMessageUtil.logWarningToStatusLogger();
        logMessageInCurrentThread(logEvent);
    } else {
        // delegate to the event router (which may discard, enqueue and block, or log in current thread)
        final EventRoute route = asyncQueueFullPolicy.getRoute(thread.getId(), memento.getLevel());
        route.logMessage(this, memento);
    }
} else {
    error("Appender " + getName() + " is unable to write primary appenders. queue is full");
    logToErrorAppenderIfNecessary(false, memento);
}
```
On default settings:
- The buffer is an `ArrayBlockingQueue`
- `blocking` is set to true
- It uses `DefaultAsyncQueueFullPolicy`, which most likely returns `EventRoute.ENQUEUE`, which in turn calls `BlockingQueue.put(logEvent)`

This means that under default settings, if the buffer remains full for a long time, e.g., the consumer is unable to keep up, your main logging thread will block on that `put` call, and in turn the Tomcat worker thread pool will be exhausted!
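One possible mitigation is to turn off blocking, so a full queue takes the `error(...)` branch above and routes to the error appender instead of stalling the caller. A sketch of the log4j2 XML (appender names and sizes are illustrative; dropping log events is the trade-off):

```xml
<Appenders>
  <Console name="Console" target="SYSTEM_OUT"/>
  <!-- blocking="false": on a full queue, route to the error appender instead of blocking -->
  <Async name="Async" blocking="false" bufferSize="1024">
    <AppenderRef ref="Console"/>
  </Async>
</Appenders>
```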
On AbstractQueuedSynchronizer
State
- Internally, AQS maintains a `volatile int state`
- state = 0 means free, 1 means acquired, and > 1 is the number of reentrant acquisitions
- state is updated via `unsafe.compareAndSwapInt`
Node and CLH lock
In a CLH lock, the acquiring thread spins only on a local variable, polling the previous node’s status; once it discovers that the prev node has released the lock, it stops spinning
```java
static final class Node {
    volatile int waitStatus; // controls blocking
    volatile Node prev;
    volatile Node next;
    volatile Thread thread;
    Node nextWaiter; // uses a sentinel value to mark whether the mode is shared or exclusive
}
```
- If `prev` is cancelled, this node can use the prev node’s status
- The head inside AQS is a sentinel node
- A custom synchronizer needs to implement how `state` is acquired and released; queue maintenance is done by AQS already
- waitStatus values:
  - 1 - cancelled
  - -1 - signal, i.e., the current node needs to wake up its `next` when it releases or cancels
  - -2 - condition, waiting to be woken up by a condition
  - -3 - propagate, for shared locks
Acquire
- You need to implement your own `tryAcquire` or `tryAcquireShared`; AQS supplies the queueing and parking (see the sketch below)
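For example, a minimal non-reentrant mutex only has to define how `state` is acquired and released; everything else comes from AQS (a sketch, not production code):

```java
import java.util.concurrent.locks.AbstractQueuedSynchronizer;

public class SimpleMutex {
    private static class Sync extends AbstractQueuedSynchronizer {
        @Override
        protected boolean tryAcquire(int arg) {
            return compareAndSetState(0, 1); // 0 = free, 1 = acquired
        }
        @Override
        protected boolean tryRelease(int arg) {
            setState(0); // we are the exclusive owner, so a plain write is fine
            return true;
        }
    }

    private final Sync sync = new Sync();
    public void lock()   { sync.acquire(1); }
    public void unlock() { sync.release(1); }
}
```

The AQS internals that maintain the wait queue: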
```java
private Node enq(final Node node) {
for (;;) {
Node t = tail;
if (t == null) { // Must initialize
if (compareAndSetHead(new Node()))
tail = head;
} else {
node.prev = t;
if (compareAndSetTail(t, node)) {
t.next = node;
return t;
}
}
}
}
private Node addWaiter(Node mode) {
Node node = new Node(Thread.currentThread(), mode);
// Try the fast path of enq; backup to full enq on failure
Node pred = tail;
if (pred != null) {
node.prev = pred;
if (compareAndSetTail(pred, node)) { //uses unsafe
pred.next = node;
return node;
}
}
enq(node);
return node;
}
private static boolean shouldParkAfterFailedAcquire(Node pred, Node node) {
int ws = pred.waitStatus;
if (ws == Node.SIGNAL)
/*
* This node has already set status asking a release
* to signal it, so it can safely park.
*/
return true;
if (ws > 0) {
/*
* Predecessor was cancelled. Skip over predecessors and
* indicate retry.
*/
do {
node.prev = pred = pred.prev;
} while (pred.waitStatus > 0);
pred.next = node;
} else {
/*
* waitStatus must be 0 or PROPAGATE. Indicate that we
* need a signal, but don't park yet. Caller will need to
* retry to make sure it cannot acquire before parking.
*/
compareAndSetWaitStatus(pred, ws, Node.SIGNAL);
}
return false;
}
private final boolean parkAndCheckInterrupt() {
    LockSupport.park(this); // to wake the thread up, use unpark() or interrupt()
    return Thread.interrupted(); // interruptions during park() are swallowed; a self-interrupt is added only after the resource is acquired
}
final boolean acquireQueued(final Node node, int arg) {
boolean failed = true;
try {
boolean interrupted = false;
for (;;) {
final Node p = node.predecessor();
if (p == head && tryAcquire(arg)) {//it is possible to get lock now, e.g., head has been released
setHead(node);
p.next = null; // help GC
failed = false;
return interrupted;
}
if (shouldParkAfterFailedAcquire(p, node) &&
parkAndCheckInterrupt())
interrupted = true;
}
} finally {
if (failed)
cancelAcquire(node);
}
}
public void acquire(int arg) {
if(!tryAcquire(arg) && acquireQueued(addWaiter(Node.EXCLUSIVE), arg))
        selfInterrupt(); // interrupts during the wait were swallowed; self-interrupt here to restore the interrupt status
}
public void acquireShared(int arg){
if(tryAcquireShared(arg) < 0)
doAcquireShared(arg);
}
```
- Difference between fair and non-fair mode: in non-fair mode, the thread CASes to acquire first and enqueues only on failure; it does not check whether the CLH queue is empty
- Shared lock: when a node becomes the head, it wakes up the next thread; upon release, a shared lock wakes up other threads regardless of state, while an exclusive lock wakes up the next only when state = 0
- For an exclusive lock, each node CAS-checks whether the previous node is the head; if so, it tries to acquire the lock
- CountDownLatch: state is set to N; each countDown() CASes state down by 1; when state = 0, the waiting (e.g., main) thread is unparked
On ThreadLocal
```java
static class ThreadLocalMap {
    private Entry[] table;
    private static final int INITIAL_CAPACITY = 16;

    static class Entry extends WeakReference<ThreadLocal<?>> {
        Object value;
        Entry(ThreadLocal<?> k, Object v) { // v is what the calling code put in
            super(k); // weak ref means the entry's key may become null
            value = v;
        }
    }
    // on get and set, will expungeStaleEntry
}

public class Thread implements Runnable {
    ThreadLocal.ThreadLocalMap threadLocals = null;
    ThreadLocal.ThreadLocalMap inheritableThreadLocals = null;
}

public class ThreadLocal<T> {
    public void set(T value) {
        Thread t = Thread.currentThread();
        ThreadLocalMap map = t.threadLocals;
        if (map != null)
            map.set(this, value);
        else
            createMap(t, value);
    }
}
```
- Standard technique: replace `% pow(2, n)` with `& (pow(2, n) - 1)`, and keep the size of the collection always 2^n
- The table uses linear probing to handle hash conflicts; note that probing will remove entries whose key is null. This is part of the reason why the load factor is 2/3, lower than that of HashMap
- `ThreadLocal.ThreadLocalMap` is a member variable of `Thread`, so we can access it via `Thread.currentThread()`
- The weak reference to the `ThreadLocal` means we will often have an `Entry` with key = null. That is why during get() and set() the code tries to remove stale entries; otherwise, there is no way for outside code to break the strong reference to the `value` object by this time. This is a common source of memory “leaks” and is also the reason it is recommended to call `ThreadLocal.remove()` in a finally clause after processing is done (see the sketch below).
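A sketch of that remove()-in-finally pattern (the SimpleDateFormat use case is just an illustration):

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class DateFormatter {
    private static final ThreadLocal<SimpleDateFormat> FMT =
            ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd"));

    public static String format(Date d) {
        try {
            return FMT.get().format(d);
        } finally {
            FMT.remove(); // drop the entry so the value can be collected even on pooled threads
        }
    }
}
```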
On Java thread state
- Java threads:
- New: by `new Thread()`
- Runnable: after `Thread.start()`
- Waiting: stays in the wait set; the thread previously called one of:
  - `Thread.join()` - needs that thread to finish
  - `Object.wait()` - needs `Object.notify()/notifyAll()` to become runnable
  - `LockSupport.park()`
- Timed_waiting:
  - `Thread.sleep(long)`
  - `Object.wait(timeout)`
  - `Thread.join(timeout)`
  - `LockSupport.parkUntil(long)`
- Blocked: needs to acquire a (monitor) lock but is unable to. Stays in the entry set.
- OS threads:
- new, ready, running, waiting, terminated
- A Java thread has no running state - only runnable, which includes both the ready and running states in the OS sense. Normally one thread executes for about 10ms on the CPU at a time. During blocking IO, e.g., `socket.accept()`, the Java thread is still RUNNABLE rather than BLOCKED or WAITING, but at the OS layer the OS thread is WAITING.
On Spring AOP
Terms
- Join point: A point during the execution of a program, such as the execution of a method or the handling of an exception.
- Advice: the action taken by an aspect at a particular join point, i.e., the actual code; think of it as a method of the aspect. Spring AOP currently supports only method execution join points (advising the execution of methods on Spring beans).
- Introduction: (Also known as an inter-type declaration). Declaring additional methods or fields on behalf of a type. Spring AOP allows us to introduce new interfaces (and a corresponding implementation) to any proxied object.
- Aspect: advice + introduction
cglib vs dynamic proxy
- JDK dynamic proxies work only for classes that implement an interface, and as the name suggests, they are implemented with the proxy pattern/composition. This means that to trigger the annotation, the method has to be called from outside the class. Implemented via reflection
- cglib doesn’t have this restriction, because it is implemented via inheritance. This also means the annotation works only for methods that can be overridden, i.e., non-final and non-private ones
On Java NIO
NIO model: Multiple channel -> one selector -> SelectionKey queue -> thread pool
- A Channel is roughly an abstraction of a file descriptor. It is bi-directional and can asynchronously read from and write to a buffer
- SocketChannel is commonly used in TCP client
- ServerSocketChannel in handling TCP request from outside
- DatagramChannel to handle UDP client and server
- One selector monitors multiple channels, and we poll the selector directly
- Buffers are commonly memory mapped so we don’t have to copy from kernel space to user space
- SelectionKey class states
- Connect – when a client attempts to connect to the server. Represented by SelectionKey.OP_CONNECT
- Accept – when the server accepts a connection from a client. Represented by SelectionKey.OP_ACCEPT
- Read – when the server is ready to read from the channel. Represented by SelectionKey.OP_READ
- Write – when the server is ready to write to the channel. Represented by SelectionKey.OP_WRITE
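A condensed sketch of the NIO model above (single selector; the thread pool hand-off and error handling are omitted):

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

Selector selector = Selector.open();
ServerSocketChannel server = ServerSocketChannel.open();
server.bind(new InetSocketAddress(8080));
server.configureBlocking(false);
server.register(selector, SelectionKey.OP_ACCEPT);

while (true) {
    selector.select(); // blocks until at least one channel is ready
    Iterator<SelectionKey> it = selector.selectedKeys().iterator();
    while (it.hasNext()) {
        SelectionKey key = it.next();
        it.remove();
        if (key.isAcceptable()) { // new TCP connection
            SocketChannel client = server.accept();
            client.configureBlocking(false);
            client.register(selector, SelectionKey.OP_READ);
        } else if (key.isReadable()) { // channel has data
            ByteBuffer buf = ByteBuffer.allocate(1024);
            ((SocketChannel) key.channel()).read(buf);
            // hand the buffer off to the worker thread pool here
        }
    }
}
```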
NIO itself is a case of the reactor model: multiple clients -> one acceptor -> multiple channels -> one reactor -> multiple handlers

select() copies the fd set from user space to kernel space, and tests in kernel mode whether each request is ready.
- Blocks when there is no data
- Like poll(), it requires the user to iterate over all channels again after it reports that some channels are ready

In level-triggered epoll(), all threads will wake up and try accept() - but only one can get it!

Unlike select() and poll(), epoll shares its fd set between user space and kernel space - that copy is saved
- epoll returns only the requests with data. It reorders fds, putting all fds with data at the front, and then returns
My interview questions for SQL DBA candidates
Expectations
- Software engineers are responsible for
- Design and performance of queries and tables
- Basic DB topology design, including the placement of DB servers
- Monitoring and alerting metrics from the application code, but not DB
- DBAs are responsible for
- Monitoring and alerting metrics from DB layer
- Backup and restore db from/to different servers
- Failover/failback drill
- A DBA should be able to answer at minimum 80% of the questions I posted here
- Communication-wise, see if they are aware of and follow the STAR principle
- Culture wise, see if they demonstrate
- Ownership
- Disagree and commit
- Dive deep
Questions
- Give two cases of deadlocks you have seen on the DB, and explain how you solved them.
- Be able to explain isolation levels
- Be able to explain the differences between lock types
- Be able to explain ways to detect and fix deadlock cases
- Give two cases of slow queries you have seen on the DB, and explain how you solved them
- Show the use of slow query log, query plan, and unix performance debugging commands
- Monitoring and alerting should be in place to discover a slow query within 5 mins for OLTP, or high resource usage for OLAP
- Be able to explain the common patterns of setting up indices and identify when the indices are too many or too few
- Give two cases of replication lag you have seen on the DB, and explain how you solved them
- Be able to explain the replication lag threshold they set, which should be no more than 10 minutes
- Be able to explain the HA and DR setup of the replication tool, and the operational excellence needed to ensure the HA
- Be able to give the verification procedure after the replication is fixed
- What is the backup/restore DB speed I can expect?
- Be able to dig for more context without being prompted. Red flag if they jump into an estimate immediately
- Be able to give the verification procedure after the restoration is done
- Be able to explain common parameters of the dump/restore tool
- Design or explain DR drills for the DBs you are responsible for
- Knows the meaning of RTO, RPO, MTTR. Red flag if not able to specify them before jumping into drill details
- The DR drills should cover multiple cases, from the simplest to a complete outage. At least 3 cases should be covered
How tidb-operator constructs an AWS EKS cluster
Context: we want to provision multiple tidb clusters inside the same EKS cluster
Terraform
- Uses different worker groups to host different components, each worker group with specific taint and node label at the same time, e.g.,
--register-with-taints=dedicated=pd:NoSchedule --node-labels=dedicated=pd
- Each worker group maps to an ASG, and we can then add tags to the ASG itself. Note that by default the monitor group doesn’t have extra tags
- The user data of each worker group just configures the docker group and formats and mounts the nvme disk
- Uses a null resource provisioner to apply the following k8s manifests in order: CRD, local volume provisioner, gp2 storage class, tiller RBAC
- inits helm and waits until it completes
- releases a tidb-operator via helm chart of the same name to tidb-admin namespace
- releases a tidb-cluster via helm chart of the same name, to tidb namespace. Note that this chart accepts data.template_file.tidb_cluster_values as params
- Then the null resource provisioner will keep polling via kubectl until pd, tidb, and tikv are ready
- Note that elb and network access rules are not provisioned explicitly here
- The helm release for tidb-cluster is done by a tf module, where `tidb-cluster-values.yaml.tpl` provides params to:

```hcl
resource "helm_release" "tidb-cluster" {
  depends_on = ["helm_release.tidb-operator"]
  name       = "tidb-cluster-${var.cluster_name}"
  namespace  = "tidb"
  chart      = "${path.module}/charts/tidb-cluster"
  values     = ["${data.template_file.tidb_cluster_values.rendered}"]
}
```
Tidb cluster helm chart
- Check `pkg/apis/pingcap.com/v1alpha1/types.go` for the template values; important ones include:

```go
// TidbCluster is the control script's spec
type TidbCluster struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata"`

	// Spec defines the behavior of a tidb cluster
	Spec TidbClusterSpec `json:"spec"`

	// Most recently observed status of the tidb cluster
	Status TidbClusterStatus `json:"status"`
}

type TidbClusterSpec struct {
	SchedulerName   string              `json:"schedulerName,omitempty"`
	PD              PDSpec              `json:"pd,omitempty"`
	TiDB            TiDBSpec            `json:"tidb,omitempty"`
	TiKV            TiKVSpec            `json:"tikv,omitempty"`
	TiKVPromGateway TiKVPromGatewaySpec `json:"tikvPromGateway,omitempty"`

	// Services list non-headless services type used in TidbCluster
	Services        []Service                            `json:"services,omitempty"`
	PVReclaimPolicy corev1.PersistentVolumeReclaimPolicy `json:"pvReclaimPolicy,omitempty"`
	Timezone        string                               `json:"timezone,omitempty"`
}
```
Peak Performance: Elevate Your Game, Avoid Burnout, and Thrive with the New Science of Success
My reading notes
- The team falls back to the lowest common denominator in terms of attitude. It is important to purge the bad apples!
- Do not use outcome as the driver of bettering yourself
- Turn OFF social media while focusing. Even a phone on the desk will distract us - stow it away!
- Make the stress phase about 5% above your current level. More and you get anxious; less and you get bored
- Social media and TV are NOT good relaxation techniques. You need to zone out and check out, instead of doing endorphin-inducing activities.
- For knowledge workers, exercising is a good way to relax.
- Social interaction with friends is also a good way, but not with your colleagues
- Walking breaks every hour
- Deep breathing may not always work; instead, view stress as a challenge rather than a threat.
- The less we think about ourselves, the better we perform
- When your mind is not at peace or you are in a bad mood, it is hard to think creatively. Therefore, be mindful of what is happening in other people’s lives outside work - it affects how they react too.
- Separating work and life’s influence on each other is impossible
- Even top performers are not able to focus for more than 2 hours in one sitting
- Napping cannot make up for the lack of REM sleep, but a short nap (15-20 mins) is a good way to replenish energy during a long and intense day
- Making decisions takes energy, even the slightest ones
- Discomfort is a natural part of stress. Learn to separate yourself from it
- If you check your phone every couple of minutes, then you need a break!
- Fatigue spills over tasks
Product engineer's guide to the tidb migration
Shared by MySQL and TiDB
- 90% of the time you should not use a plain SELECT (no FOR UPDATE) inside a transaction in either MySQL or TiDB. Most likely you need SELECT FOR UPDATE (SFU)
- SFU is a common solution to defend against write skew and the lost update problem.
- No nested transactions or remote calls inside a transaction
- Do not use foreign keys in the DB even if they are supported
- Concurrent insertions into an auto-increment ID column may produce ID values that do not appear in increasing order
Unique to TiDB
- The default isolation level is Snapshot Isolation (SI), which is similar to MySQL’s Repeatable Read (RR). It is stronger than Read Committed (RC), but not exactly the same as RR
- Auto-increment IDs guarantee uniqueness but not continuity. HIGHLY recommend moving from auto-increment IDs to an ID generator such as snowflake.
- Hot key-range problem in the kv storage during batch writes. Note that for int/bigint types, the key in tikv is the primary key
- SFU uses optimistic locking. This means we need to catch optimistic lock exceptions from the mysql client explicitly.
Unique to MySQL
- Phantom reads are possible at Repeatable Read but not under SI
How percolator/tidb transaction works
Uses a timestamp oracle (TO) to allocate monotonically increasing timestamps. Every txn contacts the TO twice: once for start_ts at the beginning of prewrite, and once for commit_ts at the beginning of commit
Each logical k-v has 3 column families (CF):
- Data: key_startTs -> value. Stores the actual data
- Lock: key -> meta. Stores txn lock info
- Write: key_commitTs -> startTs. The value committed at key_commitTs is stored at (Data)key_startTs
Sequence of actions
Initially Bob has $10, and Joe has $2
So the k-vs at the start of the transaction are:
(Data)Bob_5: 10
(Data)Bob_6: null
(Lock)Bob_5 : null
(Lock)Bob_6 : null
(Write)Bob_5: null
(Write)Bob_6: data@5
(Data)Joe_5: 2
(Data)Joe_6: null
(Lock)Joe_5 : null
(Lock)Joe_6 : null
(Write)Joe_5: null
(Write)Joe_6: data@5
Note that the write keys at version 6 point to the data key at version 5. This is the end state of the previous transaction
Now Bob wants to transfer $7 to Joe
- Gets start_ts = 7 from the TO, and writes (Data)Bob_7: 3
- Writes (Lock)Bob_7: Primary. This marks this key as THE primary lock of the whole txn, regardless of how many keys are changed in this txn
- Writes (Data)Joe_7: 9, and (Lock)Joe_7: Primary@Bob. The new lock entry marks the secondary lock, which is used to look up the primary lock. Now the prewrite step is done!
- Gets commit_ts = 8 from the TO, and writes (Write)Bob_8: data@7
- Deletes (Lock)Bob_7
- At this stage we consider the txn complete already, even though we have not cleaned up Joe’s entries yet. In tidb, the steps below are processed asynchronously. See twoPhaseCommitter.doActionOnKeys
- Checks (Lock)Joe_7, finds it points to (Lock)Bob_7, which has been deleted already, so we know the txn is complete, and deletes (Lock)Joe_7
- Writes (Write)Joe_8: data@7
Final shape
(Data)Bob_5: 10
(Data)Bob_6: null
(Data)Bob_7: 3
(Data)Bob_8: null
(Lock)Bob_5 : null
(Lock)Bob_6 : null
(Lock)Bob_7 : null
(Lock)Bob_8 : null
(Write)Bob_5: null
(Write)Bob_6: data@5
(Write)Bob_7: null
(Write)Bob_8: data@7
(Data)Joe_5: 2
(Data)Joe_6: null
(Data)Joe_7: 9
(Data)Joe_8: null
(Lock)Joe_5 : null
(Lock)Joe_6 : null
(Lock)Joe_7 : null
(Lock)Joe_8 : null
(Write)Joe_5: null
(Write)Joe_6: data@5
(Write)Joe_7: null
(Write)Joe_8: data@7
Design Problem: Pagination
Problem
- composite PK for message (msg_id, order_id, post_id)
- We want to paginate by time
- data is sharded across DBs
- partition key is id % shards to ensure even distribution among shards
- We want to be able to jump to a SPECIFIC page with a known page size.
Of course a complete merged view can do that, but can we do better?
100% accurate: next page only
- From a product PoV, allow only “next page”, instead of jumping to a specific page.
- For every page we fetch, we remember the timestamp at the end of the current page
- When we look for the next page, we use that last timestamp to pull a full page from each shard, and aggregate them in memory (see the sketch below)
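A sketch of this scheme (`ShardDao`, `Message`, and `pageAfter` are hypothetical names):

```java
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

List<Message> nextPage(Timestamp lastSeen, int pageSize, List<ShardDao> shards) {
    List<Message> candidates = new ArrayList<>();
    for (ShardDao shard : shards) {
        // per shard: SELECT ... WHERE created_at > ? ORDER BY created_at LIMIT pageSize
        candidates.addAll(shard.pageAfter(lastSeen, pageSize));
    }
    candidates.sort(Comparator.comparing(Message::createdAt));
    // keep the first pageSize rows; the last row's created_at becomes the cursor for the next call
    return candidates.subList(0, Math.min(pageSize, candidates.size()));
}
```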
Mostly accurate: guess the shard
- We just assume the data is evenly distributed
- Therefore, for each specific page we can compute the offset and page size exactly. Both will be the original value divided by the number of shards
100% accurate: good perf but more complicated
- Similar to the guess-the-shard method, pull pages from the sharded DBs based on guessed offsets but the real page size; call the returned pages RS
- Find the min_time across all pages returned
- Query again with the time between min_time and the max time returned by each page in RS
- Find the virtual offset of min_time in each shard’s page; now we know the exact offset of min_time from the global PoV. Note that min_time’s global offset will be less than the target offset for sure.
- Now, with the global view, we have enough information in memory to calculate the real offset of each record
Kelly criterion & Zipf's law
Kelly criterion
A more general problem relevant for investment decisions is the following:
- The probability of success is p.
- If you succeed, the value of your investment increases from 1 to 1+b.
- If you fail (for which the probability is q=1-p) the value of your investment decreases from 1 to 1-a. (Note that the previous description above assumes that a is 1).
The fraction to bet is p/a - q/b. For example, with p = 0.6, q = 0.4 and even odds (a = b = 1), you should bet 0.6 - 0.4 = 20% of your bankroll.
Zipf’s law
The frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
Only 135 vocabulary items are needed to account for half of the Brown Corpus. Note that this is somewhat related to the Pareto principle, i.e., the 80/20 rule.
The same relationship occurs in many other rankings unrelated to language, such as the population ranks of cities in various countries, corporation sizes, income rankings, ranks of number of people watching the same TV channel
Zipf’s law is most easily observed by plotting the data on a log-log graph, with the axes being log (rank order) and log (frequency)
Secretary Problem
The purpose here is not to show that this strategy is perfect. It just shows the inherent complexity in hiring. Whatever scheme you devise, it is mostly luck that filters out the “maybe-no”s. However, in many places the bar happens to be much higher than what the actual job needs. Thus, whatever hazing ritual is in place appears to show good results, and we have to accept that as part of life and play by the rules.
When N is known
Note the differences from real life:
- If you reject, the candidate will not appear again
- Decisions must be made on the spot
- The interviewer will rank all applicants against each other
If N is known, and we are filling only 1 position
- The optimal strategy is to reject first n/e (about 37%) of candidates, and then pick the first candidate “better” than all previous ones
- The probability of picking the “best” candidate converges to 1/e (37%) too. Such a result shows the inherent complexity of hiring
- A variant exists for picking k out of a pool of n; the cut-off threshold is about 0.25
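A quick Monte Carlo check of the 1/e rule (my own illustrative simulation, not from the original analysis):

```java
import java.util.Random;

public class SecretarySim {
    public static void main(String[] args) {
        int n = 100, trials = 100_000, wins = 0;
        int cutoff = (int) (n / Math.E); // reject the first ~37%
        Random rnd = new Random();
        for (int t = 0; t < trials; t++) {
            double[] scores = rnd.doubles(n).toArray();
            double bestSeen = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < cutoff; i++) bestSeen = Math.max(bestSeen, scores[i]);
            int picked = -1;
            for (int i = cutoff; i < n; i++) {
                if (scores[i] > bestSeen) { picked = i; break; } // first better than all seen
            }
            int bestIdx = 0;
            for (int i = 1; i < n; i++) if (scores[i] > scores[bestIdx]) bestIdx = i;
            if (picked == bestIdx) wins++;
        }
        System.out.printf("win rate = %.3f (theory: 1/e = 0.368)%n", (double) wins / trials);
    }
}
```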
When N is unknown, 1/e strategy
But the timeframe T is known, and candidates appear following the same (not necessarily uniform) distribution
- Then we pick the moment where the cumulative probability that the best candidate has already appeared is 1/e
- There is a 1/e chance you will get the best candidate, and a 1/e chance you pick no one AT ALL
Common TCP and HTTP problems
Terms
- C: client
- S: server
- A: active
- P: passive
- ssthresh: slow-start threshold
- FR: fast retransmit
- FCQ: full connection q
TCP
- Why do we need 3 network calls during the init handshake, and 4 to terminate?
- S needs to ensure that C got the ACK
- During termination, the server cannot merge the second FIN with its ACK because it may still need to send remaining data. The first FIN only ensures that C has no remaining data.
- What are possible reasons for many TIME_WAIT? Why wait 2 MSL?
- TIME_WAIT is the A side’s (normally the client) state after it receives FIN from the P side (normally the server) and replies with ACK
- The client has to wait 2 * MSL so that the server’s resent FIN will time out; 2 * MSL is the time for the server to try the FIN twice (the first being the normal try). Otherwise, the server’s FIN may leak into the next session
- Normally MSL is between 30s to 2 mins
- What are possible reasons for many CLOSE_WAIT?
- CLOSE_WAIT is a state for P (normally the server): after A sends its FIN, P ACKs it (with K+1 as the ACK number) but has not yet sent its own FIN. It is needed so that P can still deliver data that was in flight when A sent its FIN
- A common cause is the application not closing the connection on its side (while the client side may have closed it due to long idleness)
- TCP congestion control, how is it done? How does tcp decide there is congestion?
- TCP Tahoe: the window doubles each round until we reach ssthresh; after that we send just 1 more than before
- To avoid the stampede effect, once congestion is detected (via packet loss), we fast-retransmit and halve ssthresh
- On fast retransmit, when the receiver gets a later frame, it just ACKs the last one it is still expecting
- Note that different algorithms set different initial window sizes
- Suppose A and B established a connection with no data sent, and then B restarts. What is the state of the connection on A? How do we get rid of / move on from this state?
- A will remain ESTABLISHED; the connection is recycled by the OS after a timeout
- Half connection queue vs full connection queue
- The FCQ is for connections in the ESTABLISHED state
- If you see many SYN-RCVD connections, it may be a sign of a SYN flood DDoS. Such connections are in the half-open state inside the half-connection queue
- When C starts the first handshake but S is unable to process it, what does S return?
- S will resend SYN+ACK periodically, forcing the client to think its original ACK was lost, so it just retries.
- Why does TCP have the sticky packet problem, but UDP does not?
- The UDP header has 16 bits specifying the datagram’s length, but TCP doesn’t have that
- That is why netty uses various frame decoders (see the sketch below)
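For example, Netty’s length-field framing restores message boundaries on top of TCP’s byte stream (a fragment; the parameter values are illustrative):

```java
// inside a ChannelInitializer's initChannel(ch): ChannelPipeline pipeline = ch.pipeline();
// outbound: prepend a 4-byte length field; inbound: split the stream on that field and strip it
pipeline.addLast(new LengthFieldPrepender(4));
pipeline.addLast(new LengthFieldBasedFrameDecoder(1_048_576, 0, 4, 0, 4));
```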
HTTP
- keep-alive: reuses the underlying TCP connection between requests. Needs a timeout so the server can release it.
- The client side may need a cleanup thread with a shorter timeout. Otherwise, the server may send a FIN right when the new request comes
- How to handle the cross-domain problem?
- Resend vs redirect?
How to deal with credit-stealing management
- Fight only the most important battles. Not getting all the credit you deserve is a fact of working in a team.
- Do NOT act on impulse. Think about how this will affect you in the next 2-5 years. Fight only over things with long-term consequences
- Credit stealing, e.g., the “I” tone, may be unintended or even well-intended.
- He may want to take ownership of the whole thing - both success and failure
- He may want to attach his name to get more buy-ins
- As self-protection, create paper trails and involve other people early
- During 1-on-1s, raise that you want more visibility and want chances to present ideas as your own.
- Make sure you remain neutral and non-confrontational.
- Stick with positive sandwich!
- “Tell her how proud you are of other team members for their specific contributions, and how those are helping her. Over time it will help her see more of the benefits of giving credit; how savvy that makes you look; and how indispensable a team is to her own success.” (against an insecure manager)
- Praise the manager in public if he is seeking visibility, while slipping in your own contribution (the “I” tone is acceptable here)
- Set yourself up as the information source. Let other people know that they can find details from you
- If a hiring manager asks why you left, answer honestly, but emphasize fairness and cooperation, and that your actions demonstrate ethics and personal direction
- Plan the idea out as if it already has buy-in, and present the detailed solution. (Similar to the foot-in-the-door approach?)
- Prepare additional data not included in the presentation, and slip it in during the presentation. The goal is to establish yourself as the expert
Process for infra cost control
Product engineering team
- Anyone in the product team can request resources.
- Use the guideline below to request infra resource.
- Any resource request higher than the guideline needs to be backed by data or concrete reasons. Gut feeling is not acceptable.
- The infra team will challenge the product team on costs.
- Three strikes for overprovisioning. Upon the third strike, the product team must provide a post mortem on why over-provisioning happens so often and publish it to all tech teams
- If the product team is not able to convince the infra team, escalate to the team lead and then the senior tech lead
Infra engineering team
- The infra team is responsible for challenging the product team to bring down the cost.
- Every successful challenge and potentially saved cost should be logged and recorded as part of the KPI. This number will be published to the management monthly or quarterly.
- Use the guideline below to request infra resource.
- Any resource request higher than the guideline needs to be backed by data or concrete reasons. Gut feeling is not acceptable
- If the product team is not able to convince the infra team, escalate to the team lead and then the senior tech lead
Guidelines
- Overall, if cost is the same, prefer bigger but fewer instances, e.g., provision one c4.8xlarge instead of two c4.4xlarge.
- Reduces the probability that any given instance fails
- Reduces the noisy neighbor problem commonly experienced in virtualized environments
- Reduces the per-host monitoring cost, both effort- and money-wise.
- Computing instances for k8s
- Instance type defaults to c5.9xlarge.
- Any instance request less than c5.4xlarge should be extremely rare, e.g., once per month.
- Number of replicas:
- Each tomcat instance should handle between 200 to 2000 RPS.
- This means it is very unlikely you need more than 20 replicas for a single service
- Use Little’s law to estimate. Rule of thumb is 1k RPS per replica
- Each kafka consumer should be able to handle at least 200 msg/sec. In general the consumption is not the bottleneck; the data sink is.
- For OLTP workloads, it is highly unlikely you need more than 3 consumers.
- RDS/Aurora
- Overprovisioning write db is OK, because it is hard to scale up.
- For high traffic OLTP services (> 1000 rps peak), r5.4xlarge is more than enough
- For internal services OLTP (less than 100 rps), m5.2xlarge is more than enough
- Read db is needed only for high traffic OLTP services, max 3 read dbs are enough.
- If the suggested DB specification is not enough, then 80% of the time it is a code or architecture problem
- Redis
- One r5.4xlarge should be enough for almost all our use cases, i.e., 20k rps.
- If the Redis is expected to handle no more than 200 RPS, then you don’t really need Redis. Just use the DB
- Replication factor 2 is enough in 80% cases, although 3 is OK
- Elasticsearch
- Keep between 10% to 30% free storage size. Higher than 30% means over-provision
- For log analysis, defaults to m5.4xlarge, with 2TB storage per node