On CMS GC
Commonly used in old gen GC. New gen uses ParNew
Heap size
- normally 2-6 G, higher than that becomes much more debatable.
- common to set min heap size same as max heap size to reduce OS scheduling cost
Goal: shortest stop the world time.
- tag objects directed referenced by GC roots
- Remark: fix changed introduced during the concurrent tag stage
- By default GC thread 3 = (CPU + 3) /4
- Still need to mark & sweep for old ge
- Old gen will have frag problem that may trigger full gc too much
- New gen GC strategy is same as ParNew
- STW, init mark: mark old gen objects DIRECTLY reachable from the GC root, and old gen objects referred by the new gen.
- Concurrent mark: use GC root tracing to tag all garbages, user thread still running
- Pre-sweep: tag old gen survival objects
- Interruptable pre-sweep
- STW: retag, scan GC roots + new gen + old gen objects with dirty cards
- concurrent sweep: mark unreachable object
- concrrent reset: reset gc state and enter the next GC cycle
Manage oneself
Reading notes on the essay of same name by Peter Drucker
- Most self-evaluation on one’s own strength and weakness are not reliable - has to go through proper try-feedback loop over some time, AT LEAST couple of months
- Focusing on improving strength instead of weakness
- Things need to figure out on people, again, NOT by self-evaluation:
- Reader or listener, i.e., needs preparation or on the spot
- Work with people or alone?
- Advisor or decision maker?
- Under stress or predicative environment?
- Big corporation or small one?
- When short term results are in conflict with long-term growths. A company’s value will determine the priority
- A Plan > 18 months is hard to stay clear and specific
- The key to up-management is to understand in which your boss works in the most efficient way, and adjust to that
- Communicate with your 360 clearly:
- This is what I am good at.
- This is how I work.
- These are my values.
- This is the contribution I plan to concentrate on and the results I should be expected to deliver
- And ask them the same thing (NOT via plain self-evaluation though!)
- The existence of trust between people does not necessarily mean that they like one another. It means that they understand one another
- Don’t try hard to change people’s behaviors
(Un)reliability of timing code
Suppose we have code
long epochBefore = currentTime()
//A
remoteSlowCall()
//B
long epochAfter = currentTime()
long callTime = epochAfter - epochBefore
//C
- Suppose we have full GC happened at A or B, then callTime is much longer than the actual cost! For the same reason, in production, STW is a common cause for read timeout
- For the similar reason, with epochBefore, use only epochAfter at C is not reliable either, because it includes GC time! Therefore, we need to record BOTH epochBefore and epochAfter to perform sanity check and detect if gc possibly happened, and extend potential expiry time/time out to take GC time into consideration
On red black tree
Invariants
- Root node is black, leaf nodes are EMPTY black nodes
- Parent and child can NOT be both R
- From any node, the paths to ALL leaves pass through same # of B
Insertion
- Use the standard BST way to insert a R leaf node, and then try to fix the invariant
- Case 1: P is B, we are good, because the old paths to the P can not extened to the current node (the new leaf that replaced the old P), with no additional cost
- Case 2: P is R, and sibling of P is R too
- Paint U and P B, and G R
- RECURSIVELY fix the invariants with G (now R) as the leaf param
- Case 3: P is R, and U is B or U does not exist
- if necessary, left rotate or right rotate P so that C is on the “outside” of the G, i.e., it is not the left-right or right-left of G. No color change at this step
- Then rotate the grandparent to make C the new grand parent, and paind the old G red, C black
Removal
- Idea similar to BST - we just find the in-order predecessor or successor to the node , swap the value, and then delete that predecessor/successor node
- However, the actual cases are complicated. If both P and child are B, there are SIX cases to consider
On Load Balancing
- Nginx does level 7 load balancing to the individual services based on url/service. The server behind LBs repond the request directly, because all servers will have the same VIP bound to their loopback interface
- LVS does level 4 load balancing that LBs multiple Nginx instances. It uses keepalived, which is on active-standby model, to ensure LVS’s HA. Note that we don’t use keepalived on Nginx, since LVS will handle that.
- Use lvs/f5 IN FRONT OF nginx, e.g. f5 can do 100k qps, and use the same keepalived + virtual IP trick on lvs, i.e., LVS will share the same VIP. This should be enough for most cases.
- If a single LVS is not enougu, we can let DNS point to multiple virtual ips backed by the lvl 4 load balancer
- All level 4 LB instances in the same cluster will have the same virtual IP and same virtual MAC address. Note that level 4 LB can easily handle 100k qps
- DR mode: to change mac address on the data link layer. LB will not change ip address, but change target’s MAC address. All physical servicers and LB servers have the same virtual IP. The web server can respond to the user directly, bypassing LB
- Use ARP to broacast VIP, and machine with such IP will reply with its own MAC address, but only LB can respond to the ARP reqeust
- Problem with source IP hash with lvl 4 load balancing
- When source IP is NAT gateway for large LAN
- When mobile client reconnects, hard to go back to the original server with consistent hashing
Hashmap and ConcurrentHashmap
Hashmap
- Array init size defaults to 16. Double the size when it is 75% full
- Length is always a power of two, because
hash%length==hash&(length-1)
- How to avoid/optimize rehash when resizing
- When array length > 64 and some link list(LL) length > 8, this LL will be changed to red-black tree(RBT)
- When iterating entrys, prefer EntrySet to retrive k & v at the same time.
Hashmap’s concurrency problem
- When insert two elements with the same hash, while there is no existing node at the entry, in
putVal()
if ((p = tab[i = (n - 1) & hash]) == null)
tab[i] = newNode(hash, key, value, null);
and one node will be overwritten
- When
resize()
, this may introduce a cycle in the link
ConcurrentHashMap
Node type
static class Node<K,V> implements Map.Entry<K,V> {
final int hash;
final K key;
volatile V val;
volatile Node<K,V> next;
//Note that Node.setValue gives UnsupportedOperationException(),i.e., Node itself is immutable
//....
Members
/**
* The array of bins. Lazily initialized upon first insertion.
* Size is always a power of two. Accessed directly by iterators.
*/
transient volatile Node<K,V>[] table;
/**
* The next table to use; non-null only while resizing.
*/
private transient volatile Node<K,V>[] nextTable;
Put
- Assuming the normal case, I deleted/commented/erased out some lines for the ease of reading
- CHM does not allow key or value to be null
final V putVal(K key, V value, boolean onlyIfAbsent) {
int hash = spread(key.hashCode());
int binCount = 0;//size in the list to decide if need to convert to RBT
for (Node<K,V>[] tab = table;;) {
Node<K,V> f;//current node at tbl
int n; //size of array, always a power of two
int i; //index at the tbl array
int fh; //mod hash value
if (tab == null || (n = tab.length) == 0)
tab = initTable();
else if ((f = tabAt(tab, i = (n - 1) & hash)) == null) {
// (n-1) & hash, fast mod with 2^n, i,e. first time ever at this slot
if (casTabAt(tab, i, null,
new Node<K,V>(hash, key, value, null)))
break; // no lock when adding to empty bin
//can't CAS, spin again
}
else if ((fh = f.hash) == MOVED)
//resizing is underway, let it finish first, may create a thread to speed up resizing
tab = helpTransfer(tab, f);
else {
V oldVal = null;
synchronized (f) {
if (tabAt(tab, i) == f) {
//now that we are in mutex, first check if the node value remains valid
if (fh >= 0) {
binCount = 1;
//linked list upsert
}
else if (f instanceof TreeBin) {
binCount = 2;
//RBT upsert
}
}
}
if (binCount != 0) {
if (binCount >= TREEIFY_THRESHOLD)
treeifyBin(tab, i);
if (oldVal != null)
//update case. no need to resize
return oldVal;
break; //conclude spinning
}
}
}
//attempt to help resize in case of insert
addCount(1L, binCount);
return null;
}
get()
does not use any mutex or CAS, why does it work?- Every time we update the table, we acquire the mutex first, i.e., in effect only 1 thread updating volatiles. Therefore, reading volatiles becomes thread safe
Concurrent resizing
Inside transfer()
- Mark copied node and empty node as
ForwardingNode
. When iterating, when we see such nodes, move to the next - Uses
MIN_TRANSFER_STRIDE
to lower bound the range - No new thread is spawned. It reuses threads in
put()
.
On G1 GC
How does G1 separate new gen and old gen?
Concepts
- Collet Set(CSet): collecatable-region set. Survivors in the CSet will be moved to antoher region. LT 1% of heap size
- Remembmered Set(RS): RS records Points-into,i.e., object references in other regions that use this region’s objects. It is a (region address, set of card table index) hashtable
- Snapshot-At-The-Beginning (SATB): previous bitmap and next bitmap
- Eviction failure (EF): new gen not able to find regions to accept survivors
- jdk9 default, separate heap into same size regions. Some for new gen, and some for old gen.
- Each block can be one of Eden/Survivor/Old. Note that new gen will still do recycle/promotion when it is full.
- New gen GC is similar to ParNew, will STW and copy to survivor or old gen
- Designed to handle cases where stop-the-world is unacceptable,e.g., HFT.
- Humogous Object goes directly to old gen, may use mutliple continous regions to hold it
Steps (for mixed GC)
- Initital Marking (IM)
- concurrently clean up next marking bitmap,
- pause all user threads, and scan and tag all DIRECTLY reachable objects from root in each region
- Put the top value to the next top at mark start(TAMS)
- resume all user threads
- Root region scan - G1 scans survivor regions, and mark all referenced objects
- Concurrent Marking
- scan based on the objects found in IM
- Use remembered set logs to track what is object relationship is changed concurrently
- Final marking pause
- process RSL and update RS.
- process SATB logs
- Live Data Counting and Cleanup
- Note that the cleanup trigger time is based on max gc pause allowed and a few other steps
Optimization & Tuning
- Goal is to avoid full gc and EF
- Don’t set NewRatio yourself
- Prioritize maxGCPauseMillis (default 200 ms) instead of configing others
- Default Heap space reachs 45% usage will trigger concurrent
On Java Synchronized
Synchronized
- every object has a monitor lock, montiorenter and monitorexit
- Synchronized static methods are synchronized on the class object of the class the synchronized static method belongs to.
- Every thread has an available monitor record list, and a global available list, every locked object is assocated with a montior, and monitor has an owner section to store the thread holding the the monitor
- Mark word in object header,e.g., hashcode, ago, bias, lock tag. Object header = mark word + type pointer
- Klass Pointer to the class’s meta data
- JVM will take note of the thread id/address of Thread pointer as the owner of the syncrhonzied object
CAS
- Heavily used by atomic types in java.util.concurrent
- ABA problem: the value changes in the sequence of A-B-A, so from CAS pov the value is not changed since the last change on A, but it did in fact!
- By default PreBlockSpin = 10. Note that locks can only upgrade but not downgrade
- LongAdder in Java 8
- If the current threads too much, it will split the addtions into a cell array, each with its own CAS
- If one failed to CAS one cell, LongAdder will try a different cell.
- The final result is the sum of base and all values from cells
Biased lock
- Thread will store its id in the lock object’s mark word. When entering and leaving sync block it will not use CAS to lock/unlock, it will CAS the thread id instead. When the thread id is different from the current one:
- If biased mark is 0, means it is LW lock already, use LW’s CAS to acquire lock
- If biased mark is 1, just CAS to change thread id to current id. If this CAS failed, will ask the cancel the biased lock, and pause the thread holding the biased lock to check if the that thread is still active
- Thread will not release biased lock itself. When no bytecode is running, the thread holding biased lock checks if the lock object is locked or not, and will cancel biased lock to restore to no-lock(01) or lightweight lock (00), depending on the answer
Lightweight lock
- Other thread will spin to try acquiring the lock but will not block
- if the sync obj lock is 01, jvm will setup a Lock Record in the stack frame to store the lock object’s Mark Word, and then use CAS to change object’s mark word to pointer to the Lock Record, and set Lock Record’s owner to object’s Mark Word, if good, then Mark Word’s lock status is 00
- if update failed, jvm will check object’s mark word is pointing to current thread’s stack frame, if only 1 waiting, the thread will just spin, if the thread spins too many times, or one thread is hold, one spinning, and a third coming, LWL will upgrade to HWL
- When release, try CAS to replace Mark Word back into object header. if failed, means there is contention with the lock, release the lock, upgrade to HWL, and wake up blocked threads
Heavyweight lock
- When upgrading to heavy weight lock, lock status changed to “10”, Mark Word stores pointer to HWL, all waiting threads wil block
- The synched object will point to the created monitor object, i.e., the monitor inside the object, which in turns based on MutexLock from OS - blocking and waking up requiring OS, i.e., from user mode to kernel mode (hence “heavy”)
On JVM Class Loading
Class load sequence of actions
- load - done by the class loader
- get byte stream via FQTN
- transform byte stream to metaspace runtime data structure
- create java.lang.Class for this type
- Linking - Verification, including file format, meta data, byte code, symbol reference
- Linking - Prep
- allocate memories for class variable and assign init values inside metaspace
- However for final value, at this same it will be that value already
- Linking - Analyze/Resolution. Note this step can be delayed after init
- Compiler generate symbolic ref including class name, class method name, params type, and return types
- the class does not even know its own method and string’s address yet. Need to replace symbolic refrence with direct reference
- Init
- execute type constructor clinit() method
- () is to construct types: which has static code block or static assign value
- () method is to execute instances, which runs () first
- Init of class variables will be in the order of declaration
Types of class loader
- Bootstrap ClassLoader: load JAVA_HOME/lib and boot classpath
JAVA_HOME/lib/rt/jar
- Extension ClassLoader: load JAVA_HOME/libext and java.ext.dirs. Parent is Bootstrap CL
- Application/System ClassLoader: classpath in the normal sense. Parent is Extension CL. Loads from ENV VAR and class.path
- User ClassLoader: inherits form Application ClassLoader. Note classes loaded by this CL can be unloaded, but not the other 3
Parent delegation model for loadClass()
- See if the requested type has already been loaded into this class loader’s namespace (via findLoadedClass()). If so, return the Class instance for that already-loaded type.
- Otherwise, delegate to this class loader’s parent loader. If the parent returns a Class instance, return that same Class instance.
- Otherwise, invoke findClass(), which should attempt to locate or produce an array of bytes that define the desired type. If successful, findClass() should pass those bytes to defineClass(), which will attempt to import the type and return a Class instance. If findClass() returns a Class instance, loadClass() returns that same Class instance.
- Otherwise, findClass() completes abruptly with some exception, and loadClass() completes abruptly with the same exception.
- When the loadClass() method of a class loader successfully loads a type, it returns a java.lang.Class object to represent the newly loaded type.
- Note that tomcat does not really follow this model
Custum class loader
- Types loaded through the bootstrap class loader will always be reachable and never be unloaded. Only types loaded through user-defined class loaders can become unreachable and be unloaded by the virtual machine. A type is unreachable if its Class instance is found to be unreachable through the normal process of garbage collecting the heap.
- Use case:
- Byte encrption: use tool to encrypt class file. when running, use customized ClassLoader to decrypt file content and then load the decrypted bytecode
- hot load class file
- load class at specific location, e.g., URL
- to uniquely identify a type loaded into JVM requires the fully qualified name and the defining class loader. Every Class object has reference to its class loader
When will java init the class?
- create class instance type
- use class/interfaces’ static val/method. Note that before that the interface of the class will not be inited. Note that this class’s instanct variable types will not be inited
- Class.forName(). Note that Classloader.loadClass() will not init the class immediately
- init a class’s subclass
- Tagged during JVM’s start
On OKR
Objective
- In a single sentence, qualitative goal
- Should be inspirational, not conform to status quo or describe business as usual
- the goal should be time bound, but not too long
- Try setting only one
Key Results
- most often three
- quantitative metrics to meature if objective has been reached by the end of the period, should include evidence of completion
- difficult but not impossible, i.e., 50% of failing, as if it is truly random
- something happened because of something you did => describe impacts instead of activities
- can be time-related if not by default end of the whole period
- Don’t give different weights to key results even though they contribute different level of values=> it is about tendency, not perfection
- much of the value in OKRs comes from the conversation on what matters
- baking OKR into weekly meetings and weekly status mail
- DO NOT CHANCE OKR, let it fail and set better goal next time
- If you can’t keep on track longer than a week. Then you are not ready for OKR yet. This is especailly true for startups. OKR should be reviewed weekly or biweekly to ensure focus and progress
- Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes
- rather than using a single number, the best assessment is usually a set of measurements. By choosing multiple metrics, we can design a solution without the unintended consequences that occur when optimizing for a narrow objective.
From YC
- that the closer to an individual level you get the less useful OKRs are. The backward process of, “we know what we are going to do, how to we make it fit the OKR formula”
- Spotify decided to stop using OKRs - they grow too quickly so that OKR changes a lot
- OKRs are designed to exist in levels, “trickling down” in a way that narrows them down more and more as they reach specific teams and individuals
- OKRs in particular are hugely reliant on leadership defining goals for others to align with.
- For a single team, OKR is used to associate task with goals, instead of vice-versa as intended
Inter-process communication mechanisms
- File IO
- Signals, e.g., SIGTERM, SIGSTOP
- Socket - a form of file descriptor
- Uses socket descriptor as handle. Underlying it associates with IP + port combo of LOCAL node
- a concept used mostly in the transport layer
- Unix domain sockets. similar to the internet socket, but all communications happens with the kernel. Referred to by the process as inode
- MQ in the OS
- Pipe
- parent process creates the anonymous pipe, and let the child process inherit it
- appearts to be normal FD, but no seek capability
- In Unix, processes of the same pipe are started at the same time. The buffering ensures that no data is lost
- Named pipe. appears as a file. Processes use it instead of standard streams
- Shared memory, e.g., mmap file, note it is different from sendfile
- Pros: without shared memory, message,pipe,and files have to copy from user space’s buffer to kernel space’s IPC buffer, and again from kernel to user space
- In most operating systems the memory region mapped actually is the kernel’s page cache (file cache), meaning that no copies need to be created in user space
- In the IPC context, normally it is non-persist file, i.e., when the last process is finished the data is lost
- The memory-mapped approach has its cost in minor page faults—when a block of data is loaded in page cache, but is not yet mapped into the process’s virtual memory space
- Aside from the popular IPC usage, most common usage is the process loader
On Little's Law
Num of items in the system = rate of items arriving in the system * time each item stay in the system
In terms of performance analysis, this means:
- Num of executors we provide >= # of our expected throughput * # of expected latencies
- Conversely, given the size of resource/thread/conneciton pools, and the average latencies we observce, we can infer our max expected throughput
- Tomcat has default thread pool size of 200. Mysql’s
max_connections
defaults to 151.
On JVM GC
When is full gc triggered?
Concepts
- safe point: points where you can pause user thread,i.e., needed for GC
- safe area: for sleep or blocked threads. In the code, reference relationship won’t change
- Scavengc GC: triggered when Eden failed get more space. Eden region only
- From OS pov, perm is in the jvm process’s HEAP.
- Default max thread stack size is 1MB
- Note that SWAP usage along with GC is a common source of problem
- NIO’s memory is mostly in Kernal memory’s System area and PageCache area
GC roots
- referred by local variables and operand stacks in jvm stack frame
- method area, referred by class static
- method area, referred by constant
- local method stack, referred by JNI
Copying GC
- Use only half of space, gc, and then move the survior into the other region, so that the other region has no frag
- Often used by new gen,e.g., Serial, ParNew, Parrallel Scanvenge (at new gen), Serial old (for old gen), because new gen has less surviors and more garbages than older gen
- New gen often uses ParNew
Compacting GC
- move live objects toward one end of the heap. Note this idea is similar to Copying GC but NOT same!
- may use a table of object handles, so that only handle -> actual actual address needs updated during copyign
- Good for old gen where relativly less garbages exist,i.e., less need to move
- Used in Mark-compact, which is common for old gens
Mark and Sweep GC
- Separate of mark phase and sweep stage (and potentially compact stage to defrag), often used for old gen. Will stop the world and generate memory fragments
- For each unreferenced objects, need to also mark if the finalizer has been run or not. The sweep phase will include finalization. Therefore, object may be resurrected by finalizer which makes the object reachable, and GC can not decide if change the object back to reachable to unreachable until after all resurrectable objects’ finalizers are run
- Parallel Scanvenge at old gen, Parallel old,
Method area gc
Decide if constant and types are still in use. Need to satify ALL of
- all instance are gced
- the ClassLoader for this type is gced. Hence, types (not its instances!) loaded by the bootstrap classloader is never gced
- java.lang.Class object for that type is not referenced/unreachable at all
Generational GC
The young generation is divided into Eden/From Survivor(S0)/To Survivor(S1). Size ratio 8:1:1, default to 1/15 of the heap size
- The majority of newly created objects are located in the Eden space. Use copying GC
- After a GC in the Eden space, the objects are piled up into the Survivor space, where other surviving objects already exist.
- If (Eden + From Surivior) obj size is less than To Survivor space, move them to To Surivior, otherwise, move them to old gen directly. increase these To Survivor object age by 1, when it hits age 15, they enter old gen
- After minor GC, if Eden is not enough to for the new object, the new object goes to old gen directly
-
To Surivior half size, and with equal age, any objects older than this age enter the old gen
- when doing minor GC, will determin if old gen’s max continious space > total object sizes in the new gen, this is ONE of the steps of deciding if we should do full gc instead
- For high throughput requirement, more likely to have bigger young gen and smaller old gens, so that most short term objects can be gced ASAP
Bump the pointer(BP) vs TLABs (Thread-Local Allocation Buffers)
- BP just keep track of top pointer in Eden, and see if the new object can fit it in before updating the pointer
- But in a multi-thread case, BP will cause lock contention. Therefore, TLAB gives each thread a small, thread-local space in Eden
Old gen obj refering to young gen
- Use a 512 byte card table to record which old gen objects are referring to young gen
- managed by a write barrier
Train algorithm
- each gen divided into blocks of same size
- each gc runs only on one block
- for objects promoted to older gen, they are added to the tail of the older gen’s train
InnoDB lock behavior under repeatable read
- By default InnoDB is on repeatable read isolation level.
- S - shared locks, X - exclusive locks
- The actual lock is on the clustered index, in most cases this means the PK index.
- A consistent read(CR) is done by mult-versioning and reading a snaphost, i.e., NO lock involved
- Gap lock(GL) exists to prevent other txns from inserting into the gap. Therefore, conflicting GL can co-exist, and GL is a concept applied to cluster index
- S, X locks are row level locks (UPDATE/DELETE), but LOCK TABLE..READ and LOCK TABLE …WRITE are on full table lock
Suppose we have a table t like this
Id | Val |
---|---|
2 | 200 |
4 | 400 |
SELECT * FROM t WHERE Val = 200 -- CR
SELECT * FROM t WHERE Val > 200 -- CR
---- S lock on all rows, GL covering all intervals in (-INF, INF)
SELECT * FROM t WHERE Val = 200 LOCK IN SHARE MOD
---- S lock on all rows, CL covering all intervals in (-INF, INF)
SELECT * FROM t WHERE Val > 200 LOCK IN SHARE MOD
---- X lock on all rows, gap lock covering all intervals in (-INF, INF)
SELECT * FROM t WHERE Val = 200 FOR UPDATE
---- X lock on all rows, gap lock covering all intervals in (-INF, INF)
SELECT * FROM t WHERE Val > 200 FOR UPDATE
SELECT * FROM t WHERE Id = 2 -- CR
SELECT * FROM t WHERE Id > 2 -- CR
-- S on the row, NO gap lock or next-key lock
SELECT * FROM t WHERE Id = 2 LOCK IN SHARE MODE
-- S on the row, and GL on (2, 4], (5, +INF)
SELECT * FROM t WHERE Id > 2 LOCK IN SHARE MODE
-- X on the row
SELECT * FROM t WHERE Id = 2 FOR UPDATE
-- X on the row, and GL on (2, 4], (5, +INF)
SELECT * FROM t WHERE Id > 2 FOR UPDATE
Similarly, for UPDATE and DELETE
-- record lock because it hits the clustered index
UPDATE t_user set ago = 10 where uid = 1;
-- table lock(gap lock, X next-key lock) because it does not hit clustered index
UPDATE t_user set age = 10 where uid != 1;
For insert, will use X on the inserted record, but will not lock the range BEFORE this record. At the same time, it will add an insert intention lock, which does NOT block the gap or different key inserting into the same gap, as long as not the same postion within the gap
INSERT INTO t values (20);
commit;
--txn A
INSERT INTO t values(11)
--txn B
--IIK will be in effect, and B will not be blocked
INSERT INTO t values(12)
Note that concurrent insert is different from concurrent select for update
+ insert. The gap lock will start taking effect!
Generic, covariance, and contravariance
- a <: b means a is a subtype of b
- The overall goal of covariance and contravariance is to reduce the case where type check passed but runtime gives typing error.
- The main use case is when the type is in a higher-order function or generic collections.
- Adding a type to a generic collection is somewhat similar to the function call in typing rules, i.e., the generic type is the parameter type
- Reading from a generic colleciton is somewhat similar to the function return in typing rules, i.e., The generic type beais the returned type
Covariance
- Return types are covariants for function subtyping, that is, if tsub <: tsuper then t -> tsub <: t -> tsuper
- Use
to mark the upper bound of generic (at super). - Required when we return a generic collection, so that we have a guarantee that we can get super at least from collection
- No guarantee on what we can write, because we don’t know the concretely underlying type, which will be casted to on read.
- Class and types are different! Class defines an object’s behavior, a type describes what fields an object has and what messages it can respond to. Subtyping is a question of substitutsubbility and what we want to flag as a type error
Contravariance
- Argument types are contravariant for function subtyping, that is, if tsub <: tsuper, then tsuper -> t <: tsub -> t
- Suppose you wrap your tsuper -> t type in a collection or a higher-order function. If you pass a tsub -> t type, then during execution, the underlying type always expects passing in tsub. However, because the signature is tsuper-> t, we can still pass the tsuper to the concrete tsub-> t function without violating type check => contradiction!
- Similar reasoning on why in Java you can’t pass List
to a method accepting List - Note that in the source they are all stored in an object array, and cast to the type on read
- We can’t pass List
- Use
to mark the lower bound of generic (at subtype). This means we are 100% safe to add only the subtype and its subtypes to the generic collections
On Handling Silence
Source
- The subtle power of uncomfortable silences
- Comforatble silence margin: 3-4 seconds. Also, use open-ended question instead of letting people make a choice. Note that we often feel the slience longer than it actually is.
- Uncomfortable silence actually helps the buyer guide the conversation
- If you are losing the room, be slient for 3-5 secs, and let people zoom out. Once they are back, touch base again.
- Pause 3-5 before and after the important points to improve the impact of critical points - try not to ramble! Similarly, before starting the presentation, stay silent for 3-5 seconds and look around the audience
- During listening, if you can’t resist on thinking what you want to say instead of actual listening, focus on staying slient.
- If the slience approaching 1 min, offer to come back to the point later and move on.
On select(), poll(), and epoll()
- select() requires the kernel to iterate through all FDs passed to the select() call, check their state and register CBs. When an event on any FD happend, kernel has to iterate through these FDs again to deregister CB. Note that every time you get a new connection with accept() system call, you will get a new FD
- Linux uses epoll_ctl() to manage the registration. Call epoll_wait to wait for updates about the list of files you’re interested in.
- Level-triggered epoll() inherits from the “thundering herd” semantics form select() - get list of every FDs you are interested that is operatable. This means everyone will wake up from epoll_wait() and try accept() - all but one will get EAGAIN (Resource temporarily unavailable)!
- In edge-triggered, only 1 thread is triggered, but the threads may wake up unneccearily only to get EAGAIN, because on awake, it does not/can not know how many connections kernel received originally - use level-triggered + EPOLLEXCLUSIVE to wake up 1 thread at a time
- In the case of writing to FD with level triggered, your thread will get constant notification that FD is ready, even though you may not be able to write yet because your resource to FD is not available yet.
- In the case of writing to FD with edge triggered, you get triggered by FD is ready ONCE and write at much as you want until your write(2), and then you wait for the next signal
- Reading from FD with level-triggered is useful when you don’t want to consume all data at once and want epoll to keep triggring while data is available. However, for performance, probably want to use edge-triggered with all data buffered and will be eventually handled
- Poll() is very similar to select() and returns more status,i.e., it also requires the full list of FDs to watch on each invocation
-
Poll() can perform better than select() if you have a sparse set of FDs you are interested, because select() accepts a RANGE of FDs and bitsets for FDs you want for reads/writes/exceptions
- It is also called IO multiplexing because one thread can watch over the change of multiple FDs, i.e., the processing thread itself is reused.
- Note that call to select() and poll() are still blocking
Design Discussion: Event Sourcing
Scenario
- In a domain event, Service A persists one domain change to MySQL database and publishs the event DTO to Kafka, which acts as the event bus. Service B listens to Kafka for the event DTO.
- Reliability requirement: one domain change persisted should cause at least one DTO published to Kafka eventually,i.e., no lost update to Kafka or MySQL
- Latency requirment: The 99th percentile time between Service A starting the persistence and Service B receiving the message should be < 1s.
- Throughput requirement: 500 msg/s
- Following the original BASE paper, service A can have a log table so that DTO can be persisted in the same transaction as the domain change.
- CDC/binlog is not feasible because we have mutli-row, multi-table txns
Option: Compensating transaction + polling
- Following Ebay’s BASE paper, service A writes to log table and domain table in the same transaction
- After the transaction commits, A will asyncly send the DTO to kafka, and upon sendng successful, A will update the entry to log table status to published, or delete the record
- At the same time, we have log cleaner daemon process, which will clean up the old logs and re-publish logs not marked as published
- Note that the async call means that in a single domain flow, two DTOs may arrive at the kafka out of order, because they are from two different processes of the same service A. It really depends on the actual logic to figure it out
Domain event format
- PK
- OP
- State before the change
- State after the change
- src metadata
- src log/db/tbl
- src timestamp (event time)
- capture time (processing time)
Inside DB, need message id , and message status (new, published). Debatable if we want different messages go into the same log table. If we need it, then a separate message type column is required
Case study: spring-domain-events
- DB Schema:
- msg id
- msg pub date
- listener id
- seriaized string
- event type
- completion date
- Problems with this repo in a production setting
- The recovery didn’t consider the case of concurrent recovery
- The recovery can be triggered only at
afterSingletonsInstantiated
, i.e., lacks a separate, continuous recovery process - The recovery query itself is just a full table scan
- Persisting the event happens outside the same db transaction, i.e., with TransactionalApplicationEvent. Therefore, it is still possible that domain change is done, but event is lost
Analysis on the Eventuate Performance
Overview
A very nicely designed and abstracted library that solves the message passing between bounded contexts. However, the performance of this library is not enough if you service emits > 500 messages/sec to Kafka.
Design Analysis
- In the case of MySQL backend, the library uses this mysql CDC tool. However, our past experience with mysql CDC is not overly positive. That is, it is stable enough for data analytics pipelines, but debatable for latency sensitve (<5s) cross-service communication. This suggests us to use polling instead of CDC to detect message to send.
- The library uses a
published
column in itsevents
table andmessage
table. Right now it has no logic the evict/purge published records. Moreover, the code uses aSELECT * FROM %s WHERE %s = 0 ORDER BY %s ASC LIMIT :limit
statement to find message to publish,i.e., it is a naive full table scan - The library stores messages for different topics in the same
events
ormessage
table. The lock contention on the table’s PK column will be a major problem - ‘events
or
message` table uses a long VARCHAR as the id column type. In reality, a 128-bit field,e.g., one generated by snowflake is enough. - The producer uses the following code the send the message
this.producer.send(aggregateTopic, this.publishingStrategy.partitionKeyFor(event), json).get(10L, TimeUnit.SECONDS);
This means it sends each message synchronously. Then if we estimate the latency
- 1 call to broker > 0.5 ms
- broker does majority write > 0.5ms network + 0.1 ms SSD write
- 0.5m network + 0.1 ms SSD write for DB to update the
published
column. - Add the above steps up, The hard limit time alone is already 2 ms per message,i.e., this pseudo code can support 500 msg/sec max
How to write design doc
Summary section
- Roles of stakeholders. Declare it explicitly for each person. Important roles includes:
- Perform: carry out
- Accountable: one single owner
- Control: review the results, can veto. binding advice
- Suggest: non-binding advice
- Informed: should know the result of activity. Do NOT have a say in how work is done
- Overview
- Read by all engineers to decide if they need to keep reading
- 3 paragraphs max
Context section
Different teams should mostly interact in this section instead of the decision section. So that teams can be highly aligned on goals
- Motivation
- Non-dev should be able to read it
- Show how the project is tied to the long-term goals
- Assumptions
- Goals
- in the dimension of user-driven impact
- measuable key results
- Non-goals
Decision
Mostly for intra-team discussions.
- Milestones
- For PM and manager’s manager to quick read
- Set up milestones and check points. Ideally, milestones should enable partial launch of the project
- Current/Proposed solution
- Probably need to include user stories (from different povs) to show how the solution behaves dynamically
- De-risk the project ASAP, by focusing on the risk part and prototyping and scaffolding
- Probably need to prototype before you are able to propose
- Details go here. Use user stories to show how the system behaves
- Include precise pseudo code and data structure
- Someone didn’t write the doc but understands the problem should be able to implement the solution
- Alternative and Rejected Solutions
- Monitoring and Alerting
- Probably include correctness invariant and testing strategy. Testing strategy needs to be detailed too
- Cross-Team Impact
- Discussion
- If the discussion has more than 5 comments, in-person discussion is probably better
- Detailed Scoping and Timeline
- By engineers, tech lead, and managers, that is why it is near the end of the doc,i.e., end user is the consumer of the design docs without the scoping part
- Update the scope throughput the project. Most likely you need to update your estimate after each milestone
- Each task step should take 2-3 days max
- 1.5x of your estimated dev time to include buffer for interrupt tasks
- Timebox open-ended questions and commit to a solution within the box, but be aware of trade-off between time box and long term cost
- Dropping the project or stop it right after certain milestone is an option too.
A example template
- Context and motivation
- Assumptions and prerequisites
- Goals
- in the dimension of user-driven impact
- measuable key results
- Non-goals
- Role assignment for each member, use one of the following role:
- Perform: carry out
- Accountable: one single owner
- Control: review the results, can veto. binding advice
- Suggest: non-binding advice
- Informed: should know the result of activity. Do NOT have a say in how work is done
- Current/Proposed solution
- Alternative/Rejected Solutions
- Cross-Team Impact
- Monitoring and Alerting
- Milestones/Timelines