Outage caused by Log4j's Async Appender
In AsyncAppender.append, if the appender fails to move the message to the in-memory buffer:
if (blocking) {
    if (AbstractLogger.getRecursionDepth() > 1) { // LOG4J2-1518, LOG4J2-2031
        // If queue is full AND we are in a recursive call, call appender directly to prevent deadlock
        AsyncQueueFullMessageUtil.logWarningToStatusLogger();
        logMessageInCurrentThread(logEvent);
    } else {
        // delegate to the event router (which may discard, enqueue and block, or log in current thread)
        final EventRoute route = asyncQueueFullPolicy.getRoute(thread.getId(), memento.getLevel());
        route.logMessage(this, memento);
    }
} else {
    error("Appender " + getName() + " is unable to write primary appenders. queue is full");
    logToErrorAppenderIfNecessary(false, memento);
}
On default settings
- The buffer is an ArrayBlockingQueue
- blocking is set to true
- The DefaultAsyncQueueFullPolicy is used, which most likely returns EventRoute.ENQUEUE, which in turn calls BlockingQueue.put(logEvent)

This means that under default settings, if the buffer remains full for a long time, e.g., the consumer is unable to keep up, your main logging threads will block on that put call, and in turn the Tomcat worker thread pool will be exhausted!
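If blocking is unacceptable for the workload, Log4j 2 also ships a discarding queue-full policy selectable via system properties. A minimal sketch, assuming the documented log4j2.AsyncQueueFullPolicy / log4j2.DiscardThreshold property names (available since roughly Log4j 2.7):

// Sketch: switch the queue-full behavior from ENQUEUE (block) to Discard.
// These properties must be set before Log4j initializes.
public final class Log4jQueueFullSettings {
    public static void main(String[] args) {
        // Route queue-full events to the discarding policy instead of blocking
        System.setProperty("log4j2.AsyncQueueFullPolicy", "Discard");
        // Drop events at INFO level and below; WARN and ERROR are still enqueued
        System.setProperty("log4j2.DiscardThreshold", "INFO");
        // ... initialize logging and start the application afterwards
    }
}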
On AbstractQueuedSynchronizer
State
- Internally, AQS maintains a volatile int state
- state = 0 means free, 1 means acquired, and > 1 is the number of reentrant acquisitions
- state is updated via unsafe.compareAndSwapInt
Node and CLH lock
In a CLH lock, the applying thread spins only on a local variable, polling the previous node's status; once it sees that the previous node has released the lock, it stops spinning.
static final class Node {
    volatile int waitStatus; // controls blocking
    volatile Node prev;
    volatile Node next;
    volatile Thread thread;
    Node nextWaiter; // uses a sentinel value to mark shared vs. exclusive mode
}
- If prev is cancelled, this node can reuse the previous node's status
- The head inside AQS is a sentinel node
- A custom synchronizer needs to implement how state is acquired and released; queue maintenance is already handled by AQS
- waitStatus values:
  - 1: cancelled
  - -1: signal, i.e., the current node must wake up its next node on release or cancellation
  - -2: condition, waiting to be woken up by the condition
  - -3: propagate, for shared locks
Acquire
- You need to implement your own tryAcquire or tryAcquireShared
private Node enq(final Node node) {
    for (;;) {
        Node t = tail;
        if (t == null) { // Must initialize
            if (compareAndSetHead(new Node()))
                tail = head;
        } else {
            node.prev = t;
            if (compareAndSetTail(t, node)) {
                t.next = node;
                return t;
            }
        }
    }
}
private Node addWaiter(Node mode) {
    Node node = new Node(Thread.currentThread(), mode);
    // Try the fast path of enq; backup to full enq on failure
    Node pred = tail;
    if (pred != null) {
        node.prev = pred;
        if (compareAndSetTail(pred, node)) { // uses unsafe
            pred.next = node;
            return node;
        }
    }
    enq(node);
    return node;
}
private static boolean shouldParkAfterFailedAcquire(Node pred, Node node) {
    int ws = pred.waitStatus;
    if (ws == Node.SIGNAL)
        /*
         * This node has already set status asking a release
         * to signal it, so it can safely park.
         */
        return true;
    if (ws > 0) {
        /*
         * Predecessor was cancelled. Skip over predecessors and
         * indicate retry.
         */
        do {
            node.prev = pred = pred.prev;
        } while (pred.waitStatus > 0);
        pred.next = node;
    } else {
        /*
         * waitStatus must be 0 or PROPAGATE. Indicate that we
         * need a signal, but don't park yet. Caller will need to
         * retry to make sure it cannot acquire before parking.
         */
        compareAndSetWaitStatus(pred, ws, Node.SIGNAL);
    }
    return false;
}
private final boolean parkAndCheckInterrupt() {
    LockSupport.park(this); // to wake up the thread, use unpark() or interrupt()
    return Thread.interrupted(); // interruptions during park() are swallowed here; a self-interrupt is added only after the resource is acquired
}
final boolean acquireQueued(final Node node, int arg) {
    boolean failed = true;
    try {
        boolean interrupted = false;
        for (;;) {
            final Node p = node.predecessor();
            if (p == head && tryAcquire(arg)) { // it is possible to get the lock now, e.g., head has been released
                setHead(node);
                p.next = null; // help GC
                failed = false;
                return interrupted;
            }
            if (shouldParkAfterFailedAcquire(p, node) &&
                parkAndCheckInterrupt())
                interrupted = true;
        }
    } finally {
        if (failed)
            cancelAcquire(node);
    }
}
public void acquire(int arg) {
    if (!tryAcquire(arg) && acquireQueued(addWaiter(Node.EXCLUSIVE), arg))
        selfInterrupt(); // the thread ignored interrupts during the wait; self-interrupt here to make up for it
}

public void acquireShared(int arg) {
    if (tryAcquireShared(arg) < 0)
        doAcquireShared(arg);
}
- Difference between fair and nonfair locks: a nonfair lock CASes to acquire first and enqueues only on failure; it does not check whether the CLH queue already has waiters
- Shared lock: when a node becomes head, it wakes up the next thread. Upon release, a shared lock wakes up other threads regardless of state, but an exclusive lock wakes up the next thread only when state = 0
- For an exclusive lock, each node checks (with CAS) whether its previous node is the head; if so, it tries to acquire the lock
- CountDownLatch: state is set to N; each countDown() CASes state down by 1; when state = 0, the waiting threads are unparked
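To make the "implement how state is acquired and released, AQS does the queueing" point concrete, here is a minimal sketch (not from the original notes) of a non-reentrant mutex built on AQS:

import java.util.concurrent.locks.AbstractQueuedSynchronizer;

// Sketch: a non-reentrant mutex. Only the state transfer is implemented here;
// enqueueing, parking, and waking come from AQS as described above.
public class SimpleMutex {
    private static final class Sync extends AbstractQueuedSynchronizer {
        @Override
        protected boolean tryAcquire(int arg) {
            // state 0 -> 1 means we got the lock; on failure AQS enqueues us
            return compareAndSetState(0, 1);
        }

        @Override
        protected boolean tryRelease(int arg) {
            setState(0); // back to free; AQS will unpark the next waiter
            return true;
        }
    }

    private final Sync sync = new Sync();

    public void lock()   { sync.acquire(1); }
    public void unlock() { sync.release(1); }
}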
On ThreadLocal
static class ThreadLocalMap {
    private Entry[] table;
    private static final int INITIAL_CAPACITY = 16;

    static class Entry extends WeakReference<ThreadLocal<?>> {
        Object value;

        Entry(ThreadLocal<?> k, Object v) { // v is what the caller put in
            super(k); // weak ref means the entry's key may become null
            value = v;
        }
    }
    // get and set will expungeStaleEntry
}
public class Thread implements Runnable {
    ThreadLocal.ThreadLocalMap threadLocals = null;
    ThreadLocal.ThreadLocalMap inheritableThreadLocals = null;
}
public class ThreadLocal<T> {
    public void set(T value) {
        Thread t = Thread.currentThread();
        ThreadLocalMap map = t.threadLocals;
        if (map != null)
            map.set(this, value);
        else
            createMap(t, value);
    }
}
- Standard technique: replace % pow(2, n) with & (pow(2, n) - 1), and keep the collection size always a power of two
- The table uses linear probing to handle hash conflicts; note that it removes non-null entries whose key has become null. This is part of the reason its load factor is 2/3, lower than HashMap's
- ThreadLocal.ThreadLocalMap is a member variable of Thread, so we can access it via Thread.currentThread()
- The weak reference to the ThreadLocal means we will often have an Entry with key = null. That is why get() and set() try to remove stale entries; otherwise there is no way for outside code to break the strong reference to the value object by that point. This is a common source of memory "leaks" and is also why it is recommended to call ThreadLocal.remove() in a finally clause after processing is done.
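A minimal sketch of that remove-in-finally pattern (names are illustrative), so a pooled worker thread does not keep a stale value reachable:

public class RequestContextExample {
    private static final ThreadLocal<String> REQUEST_ID = new ThreadLocal<>();

    public static void handle(String requestId) {
        REQUEST_ID.set(requestId);
        try {
            process();
        } finally {
            REQUEST_ID.remove(); // break the strong reference to the value
        }
    }

    private static void process() {
        System.out.println("Processing request " + REQUEST_ID.get());
    }
}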
On Java thread state
- Java threads:
  - New: created by new Thread()
  - Runnable: after Thread.start()
  - Waiting: stays in the wait set; the thread previously called one of:
    - Thread.join(): needs that thread to finish
    - Object.wait(): needs Object.notify()/notifyAll() to become runnable
    - LockSupport.park()
  - Timed_waiting:
    - Thread.sleep(long)
    - Object.wait(timeout)
    - Thread.join(timeout)
    - LockSupport.parkUntil(long)
  - Blocked: needs to acquire a (monitor) lock but is unable to; stays in the entry set
- OS threads:
- new, ready, running, waiting, terminated
- A Java thread has no running state - only RUNNABLE, which covers both the ready and running states in the OS thread's sense. Normally one thread executes about 10ms on the CPU at a time. During blocking IO, e.g., socket.accept(), the JAVA THREAD is still RUNNABLE instead of BLOCKED or WAITING, but at the OS layer, the OS thread is WAITING.
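A small sketch that prints the Java-level states described above; the exact output depends on timing:

public class ThreadStateDemo {
    public static void main(String[] args) throws InterruptedException {
        Object lock = new Object();
        Thread t = new Thread(() -> {
            synchronized (lock) {
                try {
                    lock.wait(); // WAITING until notify()
                } catch (InterruptedException ignored) { }
            }
        });
        System.out.println(t.getState()); // NEW
        t.start();
        System.out.println(t.getState()); // most likely RUNNABLE
        Thread.sleep(100);
        System.out.println(t.getState()); // WAITING (inside lock.wait())
        synchronized (lock) { lock.notify(); }
        t.join();
        System.out.println(t.getState()); // TERMINATED
    }
}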
On Spring AOP
Terms
- Join point: A point during the execution of a program, such as the execution of a method or the handling of an exception.
- Advice: the action taken by an aspect at a particular join point, i.e., the actual code; think of it as a method of the aspect. Spring AOP currently supports only method execution join points (advising the execution of methods on Spring beans).
- Introduction: (Also known as an inter-type declaration). Declaring additional methods or fields on behalf of a type. Spring AOP allows us to introduce new interfaces (and a corresponding implementation) to any proxied object.
- Aspect: advice + introduction
cglib vs dynamic proxy
- The JDK dynamic proxy works only for classes implementing interfaces and, as the name suggests, is implemented via the proxy pattern/composition. This means that to trigger the annotation, the method has to be called from outside the class, through the proxy. Implemented by reflection.
- cglib doesn't have this restriction, because it is implemented via inheritance. This also means annotations work only for methods that can be overridden (e.g., not private or final ones), because only those can be inherited.
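A minimal sketch of the JDK dynamic proxy approach (interface + reflection + composition) described above; the interface and names are illustrative:

import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

interface Greeter {
    String greet(String name);
}

public class JdkProxyDemo {
    public static void main(String[] args) {
        Greeter target = name -> "Hello, " + name;
        // The InvocationHandler wraps the target: this is the "advice" around the call
        Greeter proxy = (Greeter) Proxy.newProxyInstance(
                Greeter.class.getClassLoader(),
                new Class<?>[] { Greeter.class },
                (Object p, Method m, Object[] a) -> {
                    System.out.println("before " + m.getName());
                    return m.invoke(target, a); // delegate via reflection
                });
        System.out.println(proxy.greet("world"));
    }
}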
On Java NIO
NIO model: multiple channels -> one selector -> SelectionKey queue -> thread pool
- A channel is roughly an abstraction of a file descriptor. It is bi-directional and can read and write buffers asynchronously
- SocketChannel is commonly used for TCP clients
- ServerSocketChannel for handling TCP requests from outside
- DatagramChannel for UDP clients and servers
- One selector monitors multiple channels, and we check the selector directly
- Buffers are commonly memory-mapped so we don't have to copy from kernel space to user space
- SelectionKey class states
- Connect – when a client attempts to connect to the server. Represented by SelectionKey.OP_CONNECT
- Accept – when the server accepts a connection from a client. Represented by SelectionKey.OP_ACCEPT
- Read – when the server is ready to read from the channel. Represented by SelectionKey.OP_READ
- Write – when the server is ready to write to the channel. Represented by SelectionKey.OP_WRITE
NIO itself is a case of the reactor model: multiple clients -> one acceptor -> multiple channels -> one reactor -> multiple servers

select() copies requests from user space to kernel space, and tests in kernel mode whether each request is ready.
- Blocks when no data is ready
- Similar to poll(); it requires the user to iterate over all channels once more after it is notified that some channels are ready

In level-triggered epoll(), all threads will wake up and try accept() - but only one can get it!

Unlike select() and poll(), epoll's fds are shared between user space and kernel space - that copy is saved
- epoll returns only the requests with data. It reorders fds, putting all fds with data at the front, and then returns
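A minimal sketch of the selector loop described above: one selector watching a ServerSocketChannel for OP_ACCEPT and client channels for OP_READ, echoing data back (the port and buffer size are arbitrary):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.*;
import java.util.Iterator;

public class NioEchoServer {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8080));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        while (true) {
            selector.select(); // blocks until at least one channel is ready
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    ByteBuffer buf = ByteBuffer.allocate(1024);
                    if (client.read(buf) == -1) {
                        client.close();
                    } else {
                        buf.flip();
                        client.write(buf); // echo back
                    }
                }
            }
        }
    }
}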
My interview questions for SQL DBA candidates
Expectations
- Software engineers are responsible for
- Design and performance of queries and tables
- Basic DB topology design, including the placement of DB servers
- Monitoring and alerting metrics from the application code, but not DB
- DBAs are responsible for
- Monitoring and alerting metrics from DB layer
- Backup and restore db from/to different servers
- Failover/failback drill
- A DBA should be able to answer at minimum 80% of the questions I post here
- Communication-wise, see if they are aware of and follow the STAR principle
- Culture wise, see if they demonstrate
- Ownership
- Disagree and commit
- Dive deep
Questions
- Give two cases of deadlocks you have seen on the DB, and explain how you solved them.
- Be able to explain isolation levels
- Be able to explain the differences between lock types
- Be able to explain ways to detect and fix deadlock cases
- Give two cases of slow queries you have seen on the DB, and explain how you solved them
- Show the use of slow query log, query plan, and unix performance debugging commands
- Monitoring and alerting should be in place to discover the slow query after 5 mins for OLTP, or high resource usage for OLAP
- Be able to explain the common patterns of setting up indices and identify when the indices are too many or too few
- Give two cases of replication lag you have seen on the DB, and explain how you solved them
- Be able to explain the replication lag threshold to set, which should be no more than 10 minutes
- Be able to explain the HA and DR setup of the replication tool, and the operational excellence needed to ensure the HA
- Be able to give a verification procedure for after the replication is fixed
- What is the backup/restore DB speed I can expect?
- Be able to dig for more context without being prompted. Red flag if they jump into an estimate immediately
- Be able to give verification procedure after the restoration is done
- Be able to explain common parameters for the dump/restore tool
- Design or explain DR drills for the DBs you are responsible for
- Knows the meaning of RTO, RPO, MTTR. Red flag if not able to specify them before jumping into drill details
- The DR drills should cover multiple cases, from the most simple to complete outage. At least 3 cases should be covered
How TiDB Operator constructs an AWS EKS cluster
Context: we want to provision multiple TiDB clusters inside the same EKS cluster
Terraform
- Uses different worker groups to host different components, each worker group with specific taint and node label at the same time, e.g.,
--register-with-taints=dedicated=pd:NoSchedule --node-labels=dedicated=pd
- Each worker group maps to an ASG, and we can then add tags to the worker group itself. Note that by default the monitor group doesn't have extra tags
- The user data of each worker group just configures the docker group and formats and mounts the NVMe disk
- Uses a null_resource provisioner to apply the following k8s manifests in order: CRD, local volume provisioner, gp2 storage class, tiller RBAC
- inits helm and waits until it completes
- releases a tidb-operator via the helm chart of the same name to the tidb-admin namespace
- releases a tidb-cluster via the helm chart of the same name to the tidb namespace. Note that this chart accepts data.template_file.tidb_cluster_values as params
- Then the null_resource provisioner keeps polling via kubectl until pd, tidb, and tikv are ready
- Note that the ELB and network access rules are not provisioned explicitly here
- The helm release for tidb-cluster is done by a Terraform module, where tidb-cluster-values.yaml.tpl provides params to:

resource "helm_release" "tidb-cluster" {
  depends_on = ["helm_release.tidb-operator"]
  name       = "tidb-cluster-${var.cluster_name}"
  namespace  = "tidb"
  chart      = "${path.module}/charts/tidb-cluster"
  values     = ["${data.template_file.tidb_cluster_values.rendered}"]
}
TiDB cluster helm chart
- Check pkg/apis/pingcap.com/v1alpha1/types.go for the template values; important ones include:

// TidbCluster is the control script's spec
type TidbCluster struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata"`

    // Spec defines the behavior of a tidb cluster
    Spec TidbClusterSpec `json:"spec"`

    // Most recently observed status of the tidb cluster
    Status TidbClusterStatus `json:"status"`
}

type TidbClusterSpec struct {
    SchedulerName   string              `json:"schedulerName,omitempty"`
    PD              PDSpec              `json:"pd,omitempty"`
    TiDB            TiDBSpec            `json:"tidb,omitempty"`
    TiKV            TiKVSpec            `json:"tikv,omitempty"`
    TiKVPromGateway TiKVPromGatewaySpec `json:"tikvPromGateway,omitempty"`

    // Services list non-headless services type used in TidbCluster
    Services        []Service                            `json:"services,omitempty"`
    PVReclaimPolicy corev1.PersistentVolumeReclaimPolicy `json:"pvReclaimPolicy,omitempty"`
    Timezone        string                               `json:"timezone,omitempty"`
}
Peak Performance: Elevate Your Game, Avoid Burnout, and Thrive with the New Science of Success
My reading notes
- The team falls back to the lowest common denominator in terms of attitude. How important it is to purge them!
- Do not use outcome as the driver of bettering yourself
- Turn OFF social media while focusing. Even a phone on the desk will distract us - stow them away!
- Make the stress phase about 5% above your current level. More and you get anxiety; less and you get bored
- Social media and TV are NOT good relaxation techniques. You need to zone out and check out, instead of doing endorphin-inducing activities.
- For knowledge workers, exercising is a good way to relax.
- Social interaction with friends is also a good way, but not with your colleagues
- Walking breaks every hour
- Deep breathing may not always work; instead, view stress as a challenge rather than a threat.
- The less we think about ourselves, the better we perform
- When your mind is not at peace or is in a bad mood, it is hard to think creatively. Therefore, be mindful of what happens in other people's lives outside work - it affects how they react too.
- Separating work's and life's influence on each other is impossible
- Even top performers are not able to focus more than 2 hours in one sitting
- Napping cannot make up for the lack of REM sleep, but a short nap (15-20 mins) is a good way to replenish energy on a long and intense day
- Making decisions takes energy, even the slightest ones
- Discomfort is a natural part of stress. Learn to separate yourself from it
- If you check your phone every couple of minutes, then you need a break!
- Fatigue spills over tasks
Product engineer's guide to the TiDB migration
Shared by MySQL and TiDB
- 90% of the time you should not use a plain SELECT (no FOR UPDATE) inside a transaction in either MySQL or TiDB. Most likely you need SELECT FOR UPDATE (SFU)
- SFU is a common defense against the write skew and lost update problems (see the JDBC sketch after this list)
- No nested transactions or remote calls inside the transaction
- Do not use foreign keys in the DB even if they are supported
- Concurrent insertions into an auto-increment ID column may produce ID values that do not appear in increasing order
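A minimal JDBC sketch of the SELECT FOR UPDATE pattern referenced above, guarding a balance update against lost updates; the table and column names are illustrative:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class SfuExample {
    public static void debit(Connection conn, long accountId, long amount) throws SQLException {
        conn.setAutoCommit(false);
        try {
            // Lock the row first so a concurrent writer cannot interleave
            try (PreparedStatement lock = conn.prepareStatement(
                    "SELECT balance FROM account WHERE id = ? FOR UPDATE")) {
                lock.setLong(1, accountId);
                try (ResultSet rs = lock.executeQuery()) {
                    if (!rs.next() || rs.getLong(1) < amount) {
                        conn.rollback();
                        throw new SQLException("insufficient balance");
                    }
                }
            }
            try (PreparedStatement update = conn.prepareStatement(
                    "UPDATE account SET balance = balance - ? WHERE id = ?")) {
                update.setLong(1, amount);
                update.setLong(2, accountId);
                update.executeUpdate();
            }
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}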
Unique to TiDB
- The default isolation level is Snapshot Isolation (SI), which is similar to MySQL's Repeatable Read (RR). It is stronger than Read Committed (RC), but not exactly the same as RR
- Auto-increment IDs guarantee uniqueness but not continuity. HIGHLY recommend moving from auto-increment IDs to an ID generator such as Snowflake.
- Hot key ranges in the KV storage are a problem during batch writes. Note that for int/bigint types, the key in tikv is the primary key
- SFU uses optimistic locking. This means we need to catch the optimistic lock exceptions from the MySQL client explicitly.
Unique to MySQL
- Phantom reads are possible at Repeatable Read but not under SI
How Percolator/TiDB transactions work
Uses a timestamp oracle (TO) to allocate monotonically increasing timestamps. Every txn contacts the TO twice: once for start_ts at the beginning of prewrite, and once for commit_ts at the beginning of commit.
Each logical k-v has 3 column families (CF):
- Data: key_startTs -> value. Stores the actual data
- Lock: key -> meta. Stores the txn lock info
- Write: key_commitTs -> startTs. The value committed at commit_ts is stored at (Data) key_startTs
Sequence of actions
Initially Bob has $10, and Joe has $2
So the k-vs at the start of the transaction are:
(Data)Bob_5: 10
(Data)Bob_6: null
(Lock)Bob_5 : null
(Lock)Bob_6 : null
(Write)Bob_5: null
(Write)Bob_6: data@5
(Data)Joe_5: 2
(Data)Joe_6: null
(Lock)Joe_5 : null
(Lock)Joe_6 : null
(Write)Joe_5: null
(Write)Joe_6: data@5
Note that the write keys at version 6 point to the data keys at version 5. This is the end state of the previous transaction.
Now Bob wants to transfer $7 to Joe
- Get start_ts = 7 from the TO, and write (Data)Bob_7: 3
- Write (Lock)Bob_7: Primary. This marks this key as THE primary lock of the whole txn, regardless of how many keys are changed in the txn
- Write (Data)Joe_7: 9, and (Lock)Joe_7: Primary@Bob. The new lock entry marks the secondary lock, which is used to look up the primary lock. Now the prewrite step is done!
- Get commit_ts = 8 from the TO, and write (Write)Bob_8: data@7
- Delete (Lock)Bob_7
- At this stage the txn is considered complete, even though we have not cleaned up Joe's entries yet. In TiDB, the steps below are processed asynchronously. See twoPhaseCommitter.doActionOnKeys
- Check (Lock)Joe_7; it points to (Lock)Bob_7, which is already deleted, so we know the txn is complete, and delete (Lock)Joe_7
- Write (Write)Joe_8: data@7
Final shape
(Data)Bob_5: 10
(Data)Bob_6: null
(Data)Bob_7: 3
(Data)Bob_8: null
(Lock)Bob_5 : null
(Lock)Bob_6 : null
(Lock)Bob_7 : null
(Lock)Bob_8 : null
(Write)Bob_5: null
(Write)Bob_6: data@5
(Write)Bob_7: null
(Write)Bob_8: data@7
(Data)Joe_5: 2
(Data)Joe_6: null
(Data)Joe_7: 9
(Data)Joe_8: null
(Lock)Joe_5 : null
(Lock)Joe_6 : null
(Lock)Joe_7 : null
(Lock)Joe_8 : null
(Write)Joe_5: null
(Write)Joe_6: data@5
(Write)Joe_7: null
(Write)Joe_8: data@7
Design Problem: Pagination
Problem
- composite PK for message (msg_id, order_id, post_id)
- We want to paginate by time
- data is sharded across DBs
- partition key is id % shards to ensure even distribution among shards
- We want to be able to go to a SPECIFIC page with a known page size.
Of course a complete (global) view can do that, but can we do better?
100% accurate: next page only
- From the product PoV, allow only "next page" instead of jumping to a specific page.
- For every page we fetch, we remember the timestamp at the end of the current page
- When we look for the next page, we use that last timestamp to pull a full page from each shard, and aggregate them in memory (see the sketch after this list)
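A minimal sketch of this keyset approach, assuming one JDBC connection per shard, a recent JDK (records), and an illustrative message(msg_id, created_at) schema:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class KeysetPagination {
    public record Message(long msgId, Timestamp createdAt) {}

    public static List<Message> nextPage(List<Connection> shards, Timestamp lastSeen, int pageSize)
            throws SQLException {
        List<Message> merged = new ArrayList<>();
        for (Connection shard : shards) {
            // One full page per shard, starting after the last seen timestamp
            try (PreparedStatement ps = shard.prepareStatement(
                    "SELECT msg_id, created_at FROM message " +
                    "WHERE created_at > ? ORDER BY created_at LIMIT ?")) {
                ps.setTimestamp(1, lastSeen);
                ps.setInt(2, pageSize);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        merged.add(new Message(rs.getLong(1), rs.getTimestamp(2)));
                    }
                }
            }
        }
        // Aggregate in memory and keep only the first pageSize rows
        merged.sort(Comparator.comparing(Message::createdAt));
        return merged.subList(0, Math.min(pageSize, merged.size()));
    }
}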
Mostly accurate: guess the shard
- We just assume the data is evenly distributed
- Therefore, for each specific page we can compute the offset and page size exactly. Both will be the original values divided by the number of shards
100% accurate: good perf but more complicated
- Similar to the random-partition method, pull pages from the sharded DBs based on guessed offsets but the real page size; call the returned pages RS
- Find the min_time across all pages returned
- Query again with the time range between min_time and the max time returned by each page in RS
- Find the virtual offset of min_time in each shard's page; now we know the exact global offset of min_time. Note that min_time's global offset is guaranteed to be less than the target offset.
- Now, with the global view, we have enough information in memory to calculate the real offset of each record
Kelly criterion & Zipf's law
Kelly criterion
A more general problem relevant for investment decisions is the following:
- The probability of success is p.
- If you succeed, the value of your investment increases from 1 to 1+b.
- If you fail (for which the probability is q = 1 - p), the value of your investment decreases from 1 to 1-a. (Note that the previous description assumes that a is 1.)
The fraction to bet is p/a - q/b. For example, with p = 0.6 and even odds (a = b = 1), the Kelly fraction is 0.6/1 - 0.4/1 = 20% of the bankroll.
Zipf’s law
The frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
Only 135 vocabulary items are needed to account for half the Brown Corpus. Note that this is somewhat related to the Pareto principle, i.e., the 80/20 rule.
The same relationship occurs in many other rankings unrelated to language, such as the population ranks of cities in various countries, corporation sizes, income rankings, ranks of number of people watching the same TV channel
Zipf’s law is most easily observed by plotting the data on a log-log graph, with the axes being log (rank order) and log (frequency)
Secretary Problem
The purpose here is not to show that this strategy is perfect. It just shows the inherent complexity in hiring. Whatever scheme you devise is just luck in filtering out the "maybe-no"s. However, in many places the bar happens to be much higher than what the actual job needs. Thus, whatever hazing ritual shows good results, and we have to accept that as part of life and play by the rules.
When N is known
Note its differences from real life:
- If you reject, the candidate will not appear again
- Decision must be made on the spot
- the interviewer can rank all applicants against each other
If N is known, and we are filling only 1 position
- The optimal strategy is to reject the first n/e (about 37%) of candidates, and then pick the first candidate better than all previous ones
- The probability of picking the best candidate converges to 1/e (37%) too. This result shows the inherent complexity of hiring (see the simulation sketch after this list)
- A variant exists for picking k out of a pool of n; the cut-off threshold is about 0.25
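A quick Monte Carlo sketch (not from the original notes) of the 1/e rule; it should print a success rate near 1/e, about 0.37:

import java.util.Random;

public class SecretarySim {
    public static void main(String[] args) {
        int n = 100, trials = 100_000, wins = 0;
        int cutoff = (int) (n / Math.E); // reject the first n/e candidates
        Random rnd = new Random();
        for (int t = 0; t < trials; t++) {
            double[] score = rnd.doubles(n).toArray();
            double bestSeen = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < cutoff; i++) bestSeen = Math.max(bestSeen, score[i]);
            int hired = n - 1; // forced to take the last one if nobody beats bestSeen
            for (int i = cutoff; i < n; i++) {
                if (score[i] > bestSeen) { hired = i; break; }
            }
            // Did we hire the globally best candidate?
            int bestIdx = 0;
            for (int i = 1; i < n; i++) if (score[i] > score[bestIdx]) bestIdx = i;
            if (hired == bestIdx) wins++;
        }
        System.out.printf("success rate: %.3f%n", (double) wins / trials);
    }
}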
When N is unknown, the 1/e strategy
But the timeframe T is known, and candidates appear under the same (not necessarily uniform) distribution
- Then we pick the moment where the combined probability that the best candidate has already appeared is 1/e
- There is a 1/e chance you get the best candidate, and a 1/e chance you pick no one AT ALL
Common TCP and HTTP problems
Terms
- C: client
- S: server
- A: active
- P: passive
- ssthresh: slow-start threshold
- FR: fast retransmit
- FCQ: full connection queue
TCP
- Why do we need 3 network calls for the initial handshake, but 4 to terminate?
- S needs to ensure that C received the ACK
- The server cannot merge the second FIN with its ACK because it may still need to send remaining data. The first FIN only ensures that C has no remaining data.
- What are possible reasons for many TIME_WAIT states? Why wait 2 MSL?
- TIME_WAIT is the A side's (normally the client) state after it receives FIN from the P side (normally the server) and replies ACK
- The client has to wait 2 * MSL so that the server's resent FIN will time out; 2 * MSL is the time for the server to try FIN twice (the first being the normal try). Otherwise, the server's FIN may bleed into the next session
- Normally MSL is between 30s and 2 mins
- What are possible reasons for many CLOSE_WAIT states?
- CLOSE_WAIT is a state for P (normally the server): after A sends FIN and P replies ACK, P stays in CLOSE_WAIT until its application closes the socket. It is needed so that P can finish sending data that was in flight when the FIN arrived; the ACK number is set to K+1.
- A common problem is the application not closing the connection (while the client side may have closed due to long idleness)
- TCP congestion control: how is it done? How does TCP decide there is congestion?
- TCP Tahoe: double the number of frames sent each round until we reach ssthresh; after that, just send 1 more than before.
- To avoid the stampede effect, once congestion is detected (by loss of a packet), we fast-retransmit and halve ssthresh.
- In FR, when the receiver gets a later frame, it just ACKs the last one it is expecting
- Note that different algorithms set different initial window sizes
- Suppose A and B established a connection with no data sent, and then B restarts. What is the state of the connection on A? How do we get rid of / move on from this state?
- A will remain ESTABLISHED, and be recycled by the OS after a timeout
- Half connection queue vs full connection queue
- The FCQ is for the ESTABLISHED state
- If you have many SYN-RCVD connections, it may be a sign of a SYN flood DDoS. The connection is then in the half-open state inside the half-connection queue
- When C starts the first handshake but S is unable to process it, what does S return?
- S will resend SYN+ACK on a timer, forcing the client to think its original ACK was lost, so it just retries.
- Why may TCP have the sticky packet problem, but UDP does not?
- UDP's header has 16 bits specifying the datagram's length, but TCP doesn't have that; it is a boundary-less byte stream
- That is why Netty provides various frame decoders (see the sketch after this list)
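A minimal sketch of length-prefixed framing, the idea behind those frame decoders: since TCP provides no message boundaries, we prepend each message with its length:

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class LengthPrefixedFraming {
    public static void writeFrame(OutputStream out, byte[] payload) throws IOException {
        DataOutputStream dos = new DataOutputStream(out);
        dos.writeInt(payload.length); // 4-byte length header
        dos.write(payload);
        dos.flush();
    }

    public static byte[] readFrame(InputStream in) throws IOException {
        DataInputStream dis = new DataInputStream(in);
        int length = dis.readInt();       // read the header first
        byte[] payload = new byte[length];
        dis.readFully(payload);           // then exactly `length` bytes
        return payload;
    }
}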
HTTP
- keep-alive: reuses the underlying TCP connection between requests. Needs a timeout so the server can release it.
- The client side may need a cleanup thread with a shorter timeout. Otherwise, the server may send a FIN just as a new request arrives
- How do we handle the cross-domain problem?
- Resend vs redirect?
How to deal with credit-stealing management
- Fight only the most important battles. Not getting all the credit you deserve is a fact of working in a team.
- Do NOT act on impulse. Think about how this will affect you in the next 2-5 years. Fight only the battles with long-term consequences
- Credit stealing, e.g., the "I" tone, may be unintended or even well-intended:
- He may want to take ownership of the whole thing - both success and failure
- He may want to attach his name to get more buy-ins
- As self-protection, create paper trails and involve other people early
- During 1 on 1s, raise that you want more visibility and chances to present ideas as your own.
- Make sure you remain neutral and non-confrontational.
- Stick with positive sandwich!
- “Tell her how proud you are of other team members for their specific contributions, and how those are helping her. Over time it will help her see more of the benefits of giving credit; how savvy that makes you look; and how indispensable a team is to her own success.” Against insecure manager
- Praise the manager in public if he is seeking visibility, while slipping in your own contribution (the "I" tone is acceptable here)
- Set yourself up as the information source. Let other people know that they can find details from you
- If a hiring manager asks why you left, answer honestly, but emphasize fairness and cooperation, and that your actions only demonstrate ethics and personal direction
- Plan the idea out as if it already has buy-in, and present the detailed solution. (Similar to the foot-in-the-door approach?)
- Prepare additional data not included in the presentation, and slip it in during the presentation. The goal is to establish you as the expert
Process for infra cost control
Product engineering team
- Anyone in the product team can request resources.
- Use the guideline below to request infra resource.
- Any resource request higher than the guideline needs to be backed by data or concrete reasons. Gut feeling is not acceptable.
- The infra team will challenge the product team on costs.
- Three strikes for overprovisioning. Upon the third strike, the product team must provide a post-mortem on why over-provisioning happens so often and publish it to all tech teams
- If the product team is not able to convince the infra team, escalate to the team lead and then the senior tech lead
Infra engineering team
- The infra team is responsible for challenging the product team to bring down the cost.
- Every successful challenge and potentially saved cost should be logged and recorded as part of the KPI. This number will be published to the management monthly or quarterly.
- Use the guideline below to request infra resource.
- Any resource request higher than the guideline needs to be backed by data or concrete reasons. Gut feeling is not acceptable
- If the product team is not able to convince the infra team, escalate to the team lead and then the senior tech lead
Guidelines
- Overall, if cost is the same, prefer bigger but fewer instances, e.g., provision one c4.8xlarge instead of two c4.4xlarge.
- Reduces the probability that any given instance fails
- Reduces the noisy neighbor problem commonly experienced in virtualized environments
- Reduces the per-host monitoring burden, both effort- and money-wise.
- Computing instances for k8s
- Instance type defaults to c5.9xlarge.
- Any instance request less than c5.4xlarge should be extremely rare, e.g., once per month.
- Number of replicas:
  - Each tomcat instance should handle between 200 and 2000 RPS.
  - This means it is very unlikely you need more than 20 replicas for a single service
  - Use Little's law (in-flight requests = arrival rate x latency) to estimate. The rule of thumb is 1k RPS per replica; e.g., at 1k RPS and 50 ms average latency, a replica holds about 50 concurrent requests
  - Each kafka consumer should be able to handle at least 200 msg/sec. In general the consumption is not the bottleneck, the data sink is.
  - For OLTP workloads, it is highly unlikely you need more than 3 consumers.
- RDS/Aurora
  - Overprovisioning the write DB is OK, because it is hard to scale up.
  - For high-traffic OLTP services (> 1000 RPS peak), r5.4xlarge is more than enough
  - For internal OLTP services (less than 100 RPS), m5.2xlarge is more than enough
  - A read DB is needed only for high-traffic OLTP services; at most 3 read DBs are enough.
  - If the suggested DB specification is not enough, then 80% of the time it is a code or architecture problem
- Redis
- One r5.4xlarge should be enough for almost all our use cases, i.e., 20k rps.
- If the Redis expects to handle no more than 200 RPS, then we don't really need the Redis. Just use the DB
- A replication factor of 2 is enough in 80% of cases, although 3 is OK
- Elasticsearch
- Keep between 10% and 30% of the storage free. More than 30% free means over-provisioning
- For log analysis, defaults to m5.4xlarge, with 2TB storage per node
On IO
- Sync/async vs blocking/non-blocking
- The first is about whether the caller gets the result immediately; the second is about where the caller waits
- In the context of IO, sync often means blocking
- One example of sync + non-blocking is self-spinning, which is what the selector in Java NIO does. The selector itself is sync + blocking, but underneath it busy-polls the sockets, i.e., the epoll() syscall.
- In fact, most non-blocking IO is implemented by self-spinning, but encapsulated as sync/blocking.
- Java AIO is purely async + non-blocking: it returns immediately and you pass the callback function at construction
- Concurrency is the fact that multiple tasks can make progress over a period of time. Parallelism means they can make progress AT THE SAME TIME
Zero-copy
- Note that data needs to be copied from the kernel-space buffer to the user-space buffer, and vice versa. This motivates the zero-copy technique
- A direct buffer maps the kernel-space address and the user process's address onto the SAME physical memory
- mmap maps the file/object into the user process's address space. Note that we still need to copy from the kernel read buffer to the kernel socket buffer if we want to send to the network device.
- sendfile means the data transfer happens only inside kernel space; the kernel buffer's memory address and offset are recorded into the socket buffer. This saves the CPU copy from the kernel read buffer to the kernel socket buffer
- In Java: MappedByteBuffer, DirectByteBuffer, channel-to-channel transfer (see the sketch after this list)
- In later Linux versions, DMA copies the file content to the kernel's read buffer, then the FD with data location and length info is passed to the socket buffer, and DMA transfers directly from the kernel buffer to the protocol engine/NIC buffer. Note that in the whole process no CPU copy is involved
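A minimal sketch of channel-to-channel zero copy in Java: FileChannel.transferTo maps to sendfile on Linux, so the payload never enters user space (host/port/file are illustrative):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    public static void send(Path file, String host, int port) throws IOException {
        try (FileChannel src = FileChannel.open(file, StandardOpenOption.READ);
             SocketChannel dst = SocketChannel.open(new InetSocketAddress(host, port))) {
            long position = 0, size = src.size();
            while (position < size) {
                // transferTo may send fewer bytes than requested; loop until done
                position += src.transferTo(position, size - position, dst);
            }
        }
    }
}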
Unix IO models
- Blocking IO: the process uses recvfrom, which copies from kernel space to user space
- Non-blocking IO: the process polls to check whether the kernel has prepared the data yet, still via recvfrom
- Event-driven IO: the process uses sigaction; the kernel notifies it with SIGIO, and the process then calls recvfrom
- IO multiplexing: multiple processes' IO can register on the same channel, i.e., select(). Note that after select(), the process has to wait until at least one required piece of data is ready before it returns
- Async IO: aio_read with an FD, buffer pointer, and buffer size; after the kernel has prepared the data, it copies it to user space and sends a signal
Manage energy, not time
- Energy can be renewed via rituals. Make them unconscious and automatic as quickly as possible. If you can completely disengage and change channels, it may take only a few minutes
- Change energy-depleting behaviors! Reduce fight-or-flight situations if you can; they reduce the ability to think clearly
- Lower energy also means it is harder to manage emotion. Note that people who don't feel positive will not be able to perform well
- Small meals throughout the day, once every 3 hours
- The energy cycle is between 90 and 120 mins
- Expressing appreciation to others is also a way to regain energy. The more specific the appreciation, the better
- Deep inhaling and exhaling for 5-6 seconds is an effective way to defuse
- To gain positivity, ask:
- What would the other side in this conflict say, and in what ways might it be true?
- How will I view this situation in 6 months?
- How can I learn and grow from it, regardless of the outcome?
- Try not to answer email or the phone during your true focus sessions; check email only at fixed times
- Identify the most important task the NIGHT before work
- Ask people for the top two work experiences where they found themselves in their sweet spots, then deconstruct them to find the motivating factors.
- However, defining values is often too predictive. Instead, ask what people dislike!
- Schedule relaxation and fun activities and stick to them, even if you think you don't need them! Double the break you think you need!
- For every 30 mins of focused work, take a 10-min break. Note that each person has a different attention span. Identify your own!
- A detailed daily plan demotivates people.
Notes on TiDB Operator
TO = TiDB Operator; TPR = (k8s) third-party resource; PV = persistent volume
- An operator is an application-specific controller, mostly for handling stateful applications. The controller watches changes to etcd (the state behind the k8s API server) and executes corresponding changes
- Use k8s's TPR API to encapsulate ops domain knowledge into the operator, and then inject the module into k8s. TO is to k8s what bash is to the OS
- We should use local volumes for prod persistence, but we can use networked volumes for backup. Note that local PVs use special rules for failover
- When one machine is down, the pod's local PV may not be restored => a stateful set by itself is not enough
- In the k8s case, to do IOPS isolation, you have to use one PV per physical disk. Bind mounts do not support volume and IOPS separation
- Due to security concerns, loading a network PV onto other nodes is dangerous, e.g., double writes due to eager writes
- Set the resource limit equal to the request to guarantee QoS
Debug pending pod
When a pod is pending, it means the required resources are not satisfied. The most common cases are:
- CPU, memory or storage insufficient
- Storage class does not exist or no PV is available
- By default the TiDB service is exposed using NodePort. You can modify it to ClusterIP, which will disable access from outside the cluster, or to LoadBalancer if the underlying Kubernetes supports this kind of service.
- It's recommended to scale horizontally rather than vertically when the workload increases. Just modify the replicas of PD, TiKV, and TiDB in the values.yaml file.
- By default the monitor data is not persistent; when the monitor pod is killed for some reason, the data will be lost. This can be avoided by setting monitor.persistent to true in the values.yaml file.
How TO works
- Create a k8s config map for tidb storage. The storage's physical resources are inside the config, e.g., the machine's IP and the disk's corresponding folder
- Create a tidb-volume TPR; this resource reads the physical disk status and converts the disk resources into a TPR too
- Create a volume controller to mark the usage status of the resources; the VC also monitors the config map's physical resources in real time
- Package the local disk resources into boot instances
- Create a daemonset to maintain the disk's resource usage, e.g., when a tikv node is down, you need to delete it and clear the data.