Design Discussion: Event Sourcing

2018-11-01 00:00:00 +0000

Scenario

  • For a domain event, Service A persists one domain change to the MySQL database and publishes the event DTO to Kafka, which acts as the event bus. Service B listens to Kafka for the event DTO.
  • Reliability requirement: each persisted domain change should eventually cause at least one DTO to be published to Kafka, i.e., no lost updates to Kafka or MySQL
  • Latency requirement: the 99th percentile time between Service A starting the persistence and Service B receiving the message should be < 1 s
  • Throughput requirement: 500 msg/s
  • Following the original BASE paper, Service A can have a log table so that the DTO can be persisted in the same transaction as the domain change
  • CDC/binlog is not feasible because we have multi-row, multi-table transactions

Option: Compensating transaction + polling

  • Following eBay's BASE paper, Service A writes to the log table and the domain table in the same transaction
  • After the transaction commits, A asynchronously sends the DTO to Kafka; upon a successful send, A updates the log entry's status to published, or deletes the record (a minimal sketch follows this list)
  • At the same time, a log cleaner daemon process cleans up old logs and re-publishes logs not marked as published
  • Note that the async call means that, in a single domain flow, two DTOs may arrive at Kafka out of order, because they come from two different processes of the same Service A. Whether this matters depends on the actual domain logic
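
A minimal sketch of this option, assuming plain JDBC and the standard Kafka Java client; the table names (orders, event_log), columns, and topic are illustrative, not part of the original design:

import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.UUID;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OutboxPublisher {
    private final DataSource dataSource;
    private final KafkaProducer<String, String> producer;

    public OutboxPublisher(DataSource dataSource, KafkaProducer<String, String> producer) {
        this.dataSource = dataSource;
        this.producer = producer;
    }

    // Step 1: domain change and log entry committed in the SAME local MySQL transaction.
    public String applyDomainChange(String orderId, String dtoJson) throws Exception {
        String msgId = UUID.randomUUID().toString();
        try (Connection conn = dataSource.getConnection()) {
            conn.setAutoCommit(false);
            try (PreparedStatement domain = conn.prepareStatement(
                         "UPDATE orders SET state = 'APPROVED' WHERE id = ?");
                 PreparedStatement log = conn.prepareStatement(
                         "INSERT INTO event_log (msg_id, payload, status) VALUES (?, ?, 'NEW')")) {
                domain.setString(1, orderId);
                domain.executeUpdate();
                log.setString(1, msgId);
                log.setString(2, dtoJson);
                log.executeUpdate();
                conn.commit();
            } catch (Exception e) {
                conn.rollback();
                throw e;
            }
        }
        return msgId;
    }

    // Step 2: after the commit, publish asynchronously and mark the row only on success.
    // Rows still in status NEW are picked up and re-published by the log cleaner daemon.
    public void publishAfterCommit(String msgId, String dtoJson) {
        producer.send(new ProducerRecord<>("domain-events", msgId, dtoJson), (metadata, ex) -> {
            if (ex == null) {
                markPublished(msgId); // or DELETE the row instead
            }
        });
    }

    private void markPublished(String msgId) {
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(
                     "UPDATE event_log SET status = 'PUBLISHED' WHERE msg_id = ?")) {
            ps.setString(1, msgId);
            ps.executeUpdate();
        } catch (Exception ignored) {
            // leave the row as NEW; the cleaner will retry the publish
        }
    }
}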

Domain event format

  • PK
  • OP
  • State before the change
  • State after the change
  • src metadata
    • src log/db/tbl
    • src timestamp (event time)
  • capture time (processing time)

Inside the DB, we need a message id and a message status (new, published). It is debatable whether different message types should go into the same log table; if they do, a separate message type column is required.
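
For illustration, the event format above could be carried as a DTO shaped like the following (a Java 16+ record; the field names are assumptions, and the message id and status columns live in the log table rather than in the DTO itself):

import java.time.Instant;

// Illustrative only: one possible shape of the domain event DTO described above.
public record DomainEventDto(
        String pk,            // primary key of the changed aggregate/row
        String op,            // operation: INSERT / UPDATE / DELETE
        String stateBefore,   // serialized state before the change
        String stateAfter,    // serialized state after the change
        String srcLocation,   // src metadata: log/db/tbl
        Instant srcTimestamp, // src timestamp (event time)
        Instant captureTime   // capture time (processing time)
) {}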

Case study: spring-domain-events

  • DB Schema:
    • msg id
    • msg pub date
    • listener id
    • serialized string
    • event type
    • completion date
  • Problems with this repo in a production setting
    • The recovery logic does not consider concurrent recovery
    • The recovery can be triggered only at afterSingletonsInstantiated, i.e., it lacks a separate, continuous recovery process
    • The recovery query itself is just a full table scan
    • Persisting the event happens outside the same DB transaction, i.e., with TransactionalApplicationEvent. Therefore, it is still possible that the domain change is committed but the event is lost

Analysis on the Eventuate Performance

2018-10-29 00:00:00 +0000

Overview

A very nicely designed and abstracted library that solves message passing between bounded contexts. However, the performance of this library is not enough if your service emits > 500 messages/sec to Kafka.

Design Analysis

  1. In the case of the MySQL backend, the library uses this MySQL CDC tool. However, our past experience with MySQL CDC is not overly positive: it is stable enough for data analytics pipelines, but debatable for latency-sensitive (< 5 s) cross-service communication. This suggests using polling instead of CDC to detect messages to send.
  2. The library uses a published column in its events table and message table. Right now it has no logic to evict/purge published records. Moreover, the code uses a SELECT * FROM %s WHERE %s = 0 ORDER BY %s ASC LIMIT :limit statement to find messages to publish, i.e., a naive full table scan
  3. The library stores messages for different topics in the same events or message table. The lock contention on the table's PK column will be a major problem
  4. The events or message table uses a long VARCHAR as the id column type. In reality, a 128-bit field, e.g., one generated by Snowflake, is enough.
  5. The producer uses the following code to send the message
    this.producer.send(aggregateTopic, this.publishingStrategy.partitionKeyFor(event), json).get(10L, TimeUnit.SECONDS);
    

    This means it sends each message synchronously. If we estimate the latency:

    • 1 call to the broker > 0.5 ms
    • broker does a majority write > 0.5 ms network + 0.1 ms SSD write
    • 0.5 ms network + 0.1 ms SSD write for the DB to update the published column
    • Adding the above up, the hard lower bound alone is already ~2 ms per message, i.e., this code can support at most ~500 msg/sec (a sketch of the asynchronous alternative follows this list)
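
To go beyond that limit, the usual fix is to send asynchronously and let the client batch, so that one broker round trip is amortized over many messages. A hedged sketch (topic name and callback behavior are illustrative):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AsyncPublisher {
    public static KafkaProducer<String, String> buildProducer(String bootstrap) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // majority write on the broker side
        props.put(ProducerConfig.LINGER_MS_CONFIG, 5); // allow batching instead of per-message round trips
        return new KafkaProducer<>(props);
    }

    public static void publish(KafkaProducer<String, String> producer,
                               String topic, String key, String json) {
        // No blocking .get(): many in-flight sends share one broker round trip per batch.
        producer.send(new ProducerRecord<>(topic, key, json), (metadata, ex) -> {
            if (ex == null) {
                // mark the corresponding log row as published (not shown)
            } else {
                // leave the row unpublished; the cleaner daemon will retry it
            }
        });
    }
}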

How to write a design doc

2018-10-19 00:00:00 +0000

Summary section

  • Roles of stakeholders. Declare them explicitly for each person. Important roles include:
    • Perform: carry out
    • Accountable: one single owner
    • Control: review the results, can veto; binding advice
    • Suggest: non-binding advice
    • Informed: should know the result of the activity. Do NOT have a say in how the work is done
  • Overview
    • Read by all engineers to decide if they need to keep reading
    • 3 paragraphs max

Context section

Different teams should mostly interact in this section instead of the decision section, so that teams can be highly aligned on goals.

  • Motivation
    • Non-dev should be able to read it
    • Show how the project is tied to the long-term goals
  • Assumptions
  • Goals
    • in the dimension of user-driven impact
    • measurable key results
  • Non-goals

Decision

Mostly for intra-team discussions.

  • Milestones
    • For the PM and the manager's manager to read quickly
    • Set up milestones and checkpoints. Ideally, milestones should enable partial launch of the project
  • Current/Proposed solution
    • Probably need to include user stories (from different POVs) to show how the solution behaves dynamically
    • De-risk the project ASAP, by focusing on the risky parts and on prototyping and scaffolding
    • Probably need to prototype before you are able to propose
    • Details go here. Use user stories to show how the system behaves
    • Include precise pseudo code and data structure
    • Someone who didn't write the doc but understands the problem should be able to implement the solution
  • Alternative and Rejected Solutions
  • Monitoring and Alerting
    • Probably include correctness invariants and the testing strategy. The testing strategy needs to be detailed too
  • Cross-Team Impact
  • Discussion
    • If the discussion has more than 5 comments, in-person discussion is probably better
  • Detailed Scoping and Timeline
    • By engineers, the tech lead, and managers; that is why it is near the end of the doc, i.e., the end user is the consumer of the design doc without the scoping part
    • Update the scope throughout the project. Most likely you need to update your estimate after each milestone
    • Each task step should take 2-3 days max
    • Budget 1.5x of your estimated dev time to include a buffer for interrupting tasks
    • Timebox open-ended questions and commit to a solution within the box, but be aware of the trade-off between the time box and long-term cost
    • Dropping the project or stopping it right after a certain milestone is an option too.

An example template

  • Context and motivation
  • Assumptions and prerequisites
  • Goals
    • in the dimension of user-driven impact
    • measurable key results
  • Non-goals
  • Role assignment for each member, using one of the following roles:
    • Perform: carry out
    • Accountable: one single owner
    • Control: review the results, can veto; binding advice
    • Suggest: non-binding advice
    • Informed: should know the result of the activity. Do NOT have a say in how the work is done
  • Current/Proposed solution
  • Alternative/Rejected Solutions
  • Cross-Team Impact
  • Monitoring and Alerting
  • Milestones/Timelines

References

Change attitudes by changing behavior

2018-10-11 00:00:00 +0000

source

  • If we engage in behaviours we didn't expect of ourselves, our thoughts and feelings are likely to change
  • Too much reward/penalty may lead to over-justification, i.e., a person may attribute the behavior more to external reasons than internal ones.
  • Praise that focuses on the internals of the activity, i.e., makes us feel good about ourselves due to our accomplishments, is more likely to increase our performance and our liking of the activity
  • When we review our past activity and decide that we are not going to do anything differently, this is often a sign of reducing cognitive dissonance, i.e., it makes us feel less bad about ourselves - but it also introduces bias!
  • Given too many choices, people may worry more about opportunity cost, which creates cognitive dissonance - a free gift is better than making people choose between two!

Persuasion techniques

2018-10-10 00:00:00 +0000

  • Make sure the other side FEELs that they won
  • To trigger other people's bragging and make them feel good: time + scenario + slightly irritating/negative experience + a positive peak at the end.
  • Provide three options instead of yes/no questions, so that people are more likely to give valid info, especially when under stress

Decoy effect

  • When the target and the competitor have trade-offs that make the decision non-obvious, insert a decoy which is closer to the target but clearly inferior to it. Research shows that people will be more likely to pick the target - by as much as a 40% difference
    • However, the decoy cannot be too undesirable
  • More applicable to intuitive thinkers than to analytical ones
  • Somewhat similar to "door in the face" and "selling top of the line", where the proposal is meant to be rejected. The difference is that those two generally aim for the more expensive proposal, whereas the decoy can work in both directions

pre-giving

  • Give a small favor, ideally something physical, before the persuasion
  • Don't wait too long between the favor and the persuasion

foot-in-the-door

  • A smaller request in a similar domain, followed by larger (and larger) requests
  • The initial request should not have too many external incentives, to avoid the over-justification symptom
  • Note that this technique is most useful when the person's self-image aligns with your persuasion, i.e., from yes to a bigger yes, not from no to yes
  • Another scenario is waiting for a person’s acknowledgement before

Low-ball

  • Paint a pleasant mental picture as if the customer can get it at a reduced price, even if we intend to sell it at a higher price
  • The key is that the pleasant mental picture should buy initial agreement, and the second offer should not be too outrageous
  • Similar to bait-and-switch, but low-ball is more about money and a single interaction; bait-and-switch has separate bait and actual sales activities

Understanding Kafka exactly once semantics

2018-10-10 00:00:00 +0000

The whole process involves many moving parts. I will approach it by viewing the process from each component's POV.

Producer

  • The producer registers its transactional id/PID with the transaction coordinator (TC) and gets a new epoch from the TC, with the sequence number = 0 in the producer's memory

  • The producer starts producing to partitions.

    • To the broker, each message includes the PID, epoch, and sequence number. After sending a batch to a partition, the producer advances its own per-partition sequence number accordingly.

    • To the TC, the producer registers the (topic, partition, PID, epoch). Note that if we use the Streams API, the consumer offsets are written too.

  • When the producer commits, it asks the TC to write a PREPARE_COMMIT(PID, epoch) message internally (a minimal sketch of this flow follows).
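
A minimal sketch of the producer-side flow described above, using the standard Kafka Java client's transactional API (topic names, offsets, and the group id are placeholders):

import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class TransactionalProduceExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        // transactional.id is what gets registered with the transaction coordinator (PID/epoch)
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "order-service-tx-1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();   // register with the TC and obtain a new epoch
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("orders", "order-1", "created"));
            // if consuming-and-producing (Streams-like), commit the input offsets in the same txn
            producer.sendOffsetsToTransaction(
                Map.of(new TopicPartition("order-requests", 0), new OffsetAndMetadata(42L)),
                "order-processor-group");
            producer.commitTransaction();  // TC writes PREPARE_COMMIT, then markers, then COMPLETED
        }
    }
}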

Consumer

  • Assume we set the consumer isolation level to read_committed

  • The consumer filters out messages from aborted transactions, and blocks on the first message from an ongoing transaction

  • The consumer gets a list of aborted PIDs whose offset ranges intersect with the fetched batch, returned as part of the fetch response. This means filtering is done on the consumer's client side

Broker

  • The broker process also hosts the coordinator. Thus, the HA of the 2PC coordinator is guaranteed via Kafka's partition leader election.

Transaction Coordinator

  • The transaction log just stores the status of transactions instead of the actual messages themselves

    • Ongoing (PID, epoch) - analogous to the PREPARE-stage of 2PC

    • (PID, epoch, topic, partition) - information to commit for PRODUCER

    • (PID, epoch, topic, partition, offset) - information to commit for CONSUMER

    • Prepare-commit(PID, epoch) - marks the PREPARE stage of 2PC as done. This means the transaction will always be rolled forward to commit from this point on, even if the coordinator fails

    • COMPLETED(PID, epoch) - end of log marker for 2PC

Consumer Coordinator

  • manages the internal consumer offsets topic

Numbers for performance and capacity estimation

2018-10-09 00:00:00 +0000

  1. mutex lock/unlock is in the same range as a RAM access - 100 ns = 0.0001 ms. L2 access is at 7 ns
  2. Reading 1 MB sequentially from RAM - 0.25 ms.
  3. Ping between AWS AZs. Remember to allow ICMP, which is an IP-level protocol.
    • Same AZ, 0.3 ms
    • ap-northeast-1a to ap-northeast-1c, 2.3 ms
    • ap-northeast-1a to ap-northeast-1d, 1.6 ms
    • ap-northeast-1c to ap-northeast-1d, 0.8 ms
  4. Reading 1 MB from the network: 10 ms. Note this number seems to include the cost of copying from kernel space to user space
  5. Reading 1 MB sequentially from disk: 30 ms, excluding seek time. Sequential read is 1 GB/s for SSD, 4 GB/s for memory. A random 4 KB read from SSD is about 0.15 ms
    • Disk seek is 10ms on traditional, 0.1 ms on SSD.
    • For a traditional disk, writing has performance similar to reading, but you need to add a full rotation + blocksize/transfer rate
    • SSD is log-structured and has lots of errors, but they are masked by the on-device logic. Hard to survive multiple power outages, but the average time to failure is > 6 years
    • for SSD, sequential write improves throughput (250 MB/s vs 25 MB/s) and latency is roughly the same. Sequential read and random read have roughly the same throughput and latency.
  6. Roundtrip between CA and the Netherlands is 150 ms. SF to NYC is about 40 ms
  7. Compressing 1K bytes with Zippy is in the same range as sending 1k bytes over 1 Gbps network - 0.01ms
  8. An empty object in Java is 8 bytes (some sources claim 12?). The actual size is padded up to the nearest multiple of 8 bytes, so it is at least the estimate

Numbers for capacity planing

These are just rules of thumb. The strongest data point is a PoC on the projected workload

  • Nginx
    • at least 20k concurrent connections per machine
  • Tomcat
    • Between 200 and 2k QPS per instance. Use Little's law to estimate (a worked example follows this list)
      • Timeouts for microservices: try not to make them > 1 s; aim for 200 ms at the 90th percentile
      • Each day has 86.4k seconds, although during estimation we probably want to use 40k instead of 80k to take sleep time into account.
      • Without other supporting data, we can assume 80% of traffic happens within 20% of the time, i.e., peak is 4 times the average traffic
    • Default thread pool size = 200. Max # of threads in the OS is 3-5k; 500 threads in a single process is OK.
    • For OLTP workloads, the most likely bounding factor is blocking calls, followed by CPU. During daily peaks, CPU usage should be between 10 and 30 percent
  • Redis
    • 20k to 50k QPS, mostly bounded by network instead of CPU
    • CPU usage should not be > 15%
  • MySQL
    • Default max connections = 151. Note the RDS connection limit is based on the instance type. Normally 1.5k requests/sec per instance is no problem, but something to watch out for when it reaches 3k rps
    • The InnoDB default lock wait timeout is 50 s, i.e., at least 50 s before one deadlocked process exits
    • In OLTP workloads, each txn should take no more than 10 ms on average at the DB layer
    • c5.4xlarge (16 cores, 68 GB RAM) should be enough to handle a 500 TPS OLTP workload. Achieving less than that means a problem with the code
  • Kafka
    • 95th percentile at 5 ms, with replication factor = 3, within the same DC
    • Most likely network and memory bound rather than CPU bound
    • 2M records per second on 3 cheap nodes in their standard benchmark
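
As a worked example of the Little's law estimate mentioned under Tomcat above (the thread count and latency are the figures quoted in the list, used purely for illustration):

public class LittlesLaw {
    public static void main(String[] args) {
        // Little's law: L = lambda * W  =>  lambda = L / W
        int threads = 200;        // default Tomcat worker pool (requests in flight, L)
        double latencySec = 0.2;  // 200 ms per request (W)
        double qps = threads / latencySec;
        System.out.printf("Max sustainable throughput ~= %.0f req/s%n", qps); // ~1000 req/s
        // With a 1 s timeout the same pool caps out around 200 req/s,
        // which brackets the 200-2k QPS per instance quoted above.
    }
}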

AWS cost saving tips

  • Recycle unattached EBS volumes. They are not reclaimed when you retire EC2 instances
  • Rarely do you need more than 15k IOPS if it is not the storage layer. 30k is more than enough for most storage layers
  • On-demand instances should not be more than 1/3 of the fleet. The rest should be a mixture of RIs and spot instances

On Fermi Problems

2018-10-08 00:00:00 +0000

  • Often you have a factor or two you have NO domain knowledge about at all; just give your best-effort estimate. This works because multiplying the steps becomes a random walk on a log scale, so the mean gets closer to the expectation

  • The more steps you have, the higher the standard deviation becomes. However, you often have better estimates on the smaller steps. This is essentially a trade-off. More steps also suggest your mental model is better

  • If the guess is too far off from the actual value, the reason for the error is more valuable than the result itself

  • Only keep the part that is the deciding factor in the estimation

How many people in the world are talking on their cell phones in any given minute?

Each month I use about 3 hours of cell phone, so there is roughly a 1/(8 * 30) chance I am on the phone at any moment; multiplying by 8 billion people gives roughly 30 million.

How many iPhone screen repairmen are there in the United States?

400 million people, 60% with iPhones, each repairing the phone once every 1.5 years: about 160 million repairs per year. Each repair takes 2 hours, so one person can do 4 * 365 ≈ 1500 repairs per year, i.e., about 100k iPhone repair people!

How many people in US die each year?

400 million people, with a birth rate of 1.5 per woman, so in 80 years about 300 million people are born; each year ~3.5 million are born, which roughly equals the # of deaths

The number of grains of sand on all the beaches of the world

Guidelines on monitors and alerting

2018-09-13 00:00:00 +0000

  • Feel free to add alerts, but only page when the symptom impacts users or a user-facing impact is very likely to ensue.
  • For alerts that require intervention, but not right away, email and Slack are enough.
  • We highly recommend you select "Do Not Require" a full window of data for sparse metrics; otherwise some evaluations will be skipped - but this may trigger more false alarms if the window is too short
  • One idea is to create a composite monitor: one monitor for the erroneous data and one for the amount of data, and the alert fires only when both are satisfied
  • Datadog generally recommends against using interpolation for count/rate metrics in monitors, i.e., interpolation works only with gauge metrics
  • A short evaluation window means even shorter time buckets. In monitors that divide one metric by another over a short evaluation window, if metrics for the numerator and denominator aren't both available at query time, you could get unwanted evaluation values.

References

Datadog monitoring and alert for k8s

2018-09-12 00:00:00 +0000

  1. kubernetes_state.deployment.replicas_available should not be too far off from kubernetes_state.deployment.replicas_desired

  2. Keep a timeseries on # of running pods by node or replica set, and correlate it with resource metrics

  3. For CPU and memory usage, prefer the standard metrics from Docker rather than from k8s. A similar principle applies to time-series data - but later on, you can break the data down by pod and then filter by k8s labels (k8s labels are already applied to Docker metrics)

  4. Focus on monitoring sum of requests on hosts, instead of simple CPU and memory usage

  5. Group/filter your docker metrics by k8s labels instead of hosts

On distributed lock

2018-08-27 00:00:00 +0000

Requirements

  1. only 1 thread holds the lock
  2. no deadlock
  3. can be blocked, and can be woken up from the blocked status
  4. Note that it is almost impossible to achieve 1 and 2 at the same time (see the analysis below). Normally it becomes a tradeoff.

Redis

  1. Set an expiry time on the lock. Other contenders will get the expiry time so they know when to retry
    • the current holder will have a watchdog process to renew the lease.
    • setnx and expire can be combined into a single SET command (NX + PX)
  2. Record the lock holder to prevent unlocking by mistake.
  3. Compare-and-unlock is done in one atomic action (see the sketch after this list)
  4. Time shift, e.g., from NTP, may trigger an early release
  5. Need to record the number of acquires and incr/decr it accordingly; when it is decremented to 0, it is safe to delete (re-entrancy).
  6. Note that in production it is common to just use Redisson, although it does NOT handle correctness in the case of master fail-over
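
A minimal sketch of items 1-3 above, assuming the Jedis client (any client exposing SET NX PX and EVAL works the same way); the key name, holder id, and TTL are illustrative, and the re-entrancy counter and watchdog are omitted:

import java.util.Collections;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class RedisLock {
    private static final String UNLOCK_SCRIPT =
        // compare holder and delete in ONE atomic step, so we never delete someone else's lock
        "if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end";

    // SET key value NX PX ttl: create only if absent, with an expiry to avoid deadlock
    public static boolean tryLock(Jedis jedis, String key, String holderId, long ttlMillis) {
        String result = jedis.set(key, holderId, SetParams.setParams().nx().px(ttlMillis));
        return "OK".equals(result);
    }

    public static boolean unlock(Jedis jedis, String key, String holderId) {
        Object result = jedis.eval(UNLOCK_SCRIPT,
                Collections.singletonList(key), Collections.singletonList(holderId));
        return Long.valueOf(1L).equals(result);
    }
}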

Problem:

  1. TTL is hard to set right
  2. Master-slave replication may lose data

Zookeeper

  1. Zookeeper has an official lock recipe; also check out Apache Curator (a sketch using Curator follows this section)
    • create an ephemeral, ordered (sequential) node in a lock dir
    • if a node with a smaller number already exists in that lock dir, listen for its deletion event; on that node's deletion, ask again whether we now hold the lock
    • on releasing the lock, delete the ordered node created in the first step
  2. heartbeat to prevent deadlock
  3. the caller thread can be blocked, and will be woken up via a Watch
  4. Problem: a lost heartbeat from the old caller will trick the ZK server into releasing the lock too early, while the caller process itself is still up
  • To scale up, e.g., for a flash sale, it is common to use segmented locks on segmented content, but this also means the client code HAS to try different segments if the current segment does not work
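
A minimal sketch using Apache Curator's lock recipe mentioned in item 1 (the connection string and lock path are placeholders):

import java.util.concurrent.TimeUnit;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ZkLockExample {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Under the hood: an ephemeral sequential node, a watch on the next-smaller node,
        // and deletion of our node on release - the recipe described above.
        InterProcessMutex lock = new InterProcessMutex(client, "/locks/flash-sale/segment-0");
        if (lock.acquire(1, TimeUnit.SECONDS)) {   // blocks, with a timeout
            try {
                // critical section
            } finally {
                lock.release();
            }
        }
        client.close();
    }
}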

On Oral Communications

2018-07-29 00:00:00 +0000

  1. SLOW DOWN! A pause is better than filler phrases
  2. In terms of body language: a. Lean forward b. Avoid huge movements, e.g., waving hands, shaking your head c. Point your feet toward the talker. d. Don't cross your arms. It often sends a signal of defensiveness
  3. Have some warm-up talk/questions before moving onto the true questions
  4. Shut up so that people can move onto the topic they want to talk about
    • stay silent to see if the other side will pursue the topic
  5. Notice the difference between venter and explainer mode - this applies to both sides!
    • Provide three options instead of yes/no questions, so that people are more likely to give valid info, especially when under stress
  6. When giving feedback, use "I" instead of "you"
  7. Avoid: a. Arguing about the good intentions behind the behavior b. Jargon if possible c. Comparison with other people d. When you are in a position of authority, pay very close attention to what you say - it is very easy to get over-analyzed e. Give feedback immediately, privately, but direct and honest, using concrete and recent examples. Only criticize behavior that can be improved
    • Instead of using "talk", "speak" seems to work, e.g., "I wanna come down and I wanna speak to you…"
    • Use “something” instead of “anything”, e.g., “Is there something else you want to address in the visit today?”
    • Find a way to reply with an equal or greater amount of information to keep the conversation going

On Handling Defects

2018-07-29 00:00:00 +0000

  1. The overall goal is to deliver value to the customer, not to build a perfect product, i.e., we have to be aware of the opportunity cost of doing so!

  2. The product owner (PO) needs to negotiate the quality cutoff line with the dev team

  3. Fix defects from the current sprint's user stories. Otherwise, the backlog item is not considered done! However, if the defect is not from the current sprint, it is much more debatable (see below)

  4. Fixing defects is more of a "now or never" mentality. If the defect is critical enough, most likely it will manifest itself again. Otherwise, the issue will rarely manifest, and we can probably close it with a "won't fix". However, we also have to keep in mind that the later we fix a defect, the more expensive it becomes

  5. For quick defect fixes, we may squeeze them into the current sprint. For bigger defects, we can put them in the backlog, and delete them if they last over two sprints - we need to keep the backlog short!

On Meeting Agendas

2018-07-16 00:00:00 +0000

Source

  1. Ask the meeting participants to suggest items along with the reason

  2. Each topic should cover as many members as possible. Otherwise, people will disengage

  3. List topics in the form of a specific problem to solve - a question to answer

  4. For each topic, specify the purpose: sharing info, asking for input, or making a decision?

  5. Propose and agree on a process to address each topic. Also, identify the leader for each topic

  6. The first item, however, is to review and modify the agenda

  7. For frequent meetings, ask for feedback to improve the meeting process, e.g., preparation time, process effectiveness, off-topic time, topic time estimates, etc.

Common concurrency problems

2018-07-04 00:00:00 +0000

Producer - consumer problem

There can be multiple producers and consumers


mutex = 1       // binary semaphore guarding the queue
cur_cnt = 0     // # of items currently in the queue
remain_cap = N  // remaining capacity; q has max size N
q = []

def producer(item):
	down(remain_cap) // Note the order of downs can't be swapped: holding mutex
	down(mutex)      // while waiting for capacity would deadlock with consumers
	q.append(item)
	up(mutex)
	up(cur_cnt)


def consumer():
	down(cur_cnt)
	down(mutex)
	item = q.pop(0)  // take from the head (FIFO)
	up(mutex)
	up(remain_cap)
	return item

Readers - writers problem


reader_count = 0
read_lock = 1
write_lock = 1

def read():
	down(read_lock)
	reader_count++
	if reader_count == 1 :  // first reader locks out writers
		down(write_lock)
	up(read_lock)
	read_op()
	down(read_lock)
	reader_count--
	if reader_count == 0 :
		up(write_lock)
	up(read_lock)
	
def write(v):
	down(write_lock)
	write_op(v)
	up(write_lock)

Dining philosophers problem

Cigarette smokers problem

Sleeping barber problem

On Emotion Control

2018-07-02 00:00:00 +0000

  1. When you enter "fight or flight" mode, you lose the ability to think rationally. Also, your counterpart will sense your emotion and start feeling a similar way
  2. A common sign is that your muscles get tense too. You should start deep breathing - inhale through the nose for 5 seconds, hold for a moment, and then exhale. Note this technique is common in fall-asleep-fast methods too.
  3. Standing up and walking around helps activate the thinking part of the brain
  4. Come up with a neutral phrase to remind yourself to stay calm
  5. Try separating thoughts and emotions. This distances you from your feelings and helps you understand WHY you feel this way
  6. Take a break, and give a neutral reason to stand up and pause the conversation
  7. Let the other side vent. Be empathetic and not defensive
  8. Fake your emotions sparingly, because it is hugely draining, i.e., only when you know your current emotion is really inappropriate for the current scenario
  9. To get out of a negative emotion, change the environment and switch focus if possible. Disconnect from the outside world - cut off interruptions and noise
  10. If you have a hard time getting out of a negative emotion, try delaying your reaction as much as you can

References

atCoder notes

2018-07-01 00:00:00 +0000

arc100_a: Linear Approximation

By observation, we can see that the answer is always attained at the median of the A[i] - i array. We can prove it by moving any chosen point toward the median, which never worsens the solution. Note that for the even-N case either of the two middle values would do. (A minimal sketch follows.)
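
A minimal sketch of the observation above (written in Java for illustration): shift each element by its index, take the median, and sum the absolute differences.

import java.util.Arrays;

public class LinearApproximation {
    // Minimize sum_i |A[i] - (b + i)| over integer b.
    public static long minCost(long[] a) {
        int n = a.length;
        long[] shifted = new long[n];
        for (int i = 0; i < n; i++) {
            shifted[i] = a[i] - i;        // reduce to the classic "meet at a point" problem
        }
        Arrays.sort(shifted);
        long median = shifted[n / 2];     // for even n, either middle element works
        long cost = 0;
        for (long v : shifted) {
            cost += Math.abs(v - median);
        }
        return cost;
    }
}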

arc100_b: Equal Cut

A multi-dimensional problem. We use the standard technique of brute-forcing one dimension and analyzing the others.

So we brute force on the central cut.

Claim: if we fix the central cut, we can directly infer the value of the other two cuts

Proof:

  1. For each half, the cut closest to sum(half)/2 maximizes the min and minimizes the max AT THE SAME TIME, because the overall sum is the same.

  2. If the global optima are on the same side, then we have found it already.

  3. If the global optima are on different sides, then they must be the min of both sides and the max of both sides. Note that we cannot push the min up or the max down any further, so no other choices are possible, i.e., we min the max and max the min AT THE SAME TIME

I was able to get the greedy insight during the contest, but missing the A[i] > 0 requirement means I was not able to come up with the linear-time method!!!

atCoder notes

2018-06-23 00:00:00 +0000

arc099_a: Minimization

Official solution: always ceil((N-1)/(K-1))

Proof:

  1. each op increases the # of 1s by at most K-1, so ceil((N-1)/(K-1)) ops is a lower bound. We claim this bound is always achievable

  2. Construct segments [c(K-1), (c+1)(K-1)]. We can see that neighboring segments share a common element, the index i with A[i] = 1 must fall within one of the segments, and ceil((N-1)/(K-1)) such segments are enough to cover the whole array.

  3. Therefore, we just start with the segment containing A[i] = 1, and expand to both sides

arc099_b: Snuke Numbers

I had problems proving my insights during the contest!!!

  1. Because each N we find is the new mean, the N/S(N) ratios are actually increasing as K increases

  2. Therefore, to iterate, we can just calculate F(N) = M, where M is the smallest # > N that yields the min M/S(M), and then assign N = F(N)

Claim: the suffix of M must end with 9s - possibly more than one.

Proof:

a. Because M > N, from left to right there must exist a digit where M(i) > N(i). We can move some of that delta to turn the first non-9 digit (from right to left) into a 9.

b. Thus, M/S(M) is reduced; we can repeat this until no further such move is possible

Therefore, the format of M is {prefix of N}{a single digit >= N[i]}{9s}. Given that each step involves at most 15 digits, we can just brute-force the search space and find the min M/S(M)

2PC, Saga, and TCC

2018-06-19 00:00:00 +0000

2PC

  • 2PC assumes that data in stable storage is never lost. No node crashes forever
  • Classic 2PC blocks when a machine fails - it locks the object down for further changes until the commit
  • Once we pass the prepare stage, we know the execution result will be committed everywhere

TCC

  • Try: reserve resources
  • Confirm: execute without re-checking; idempotent. Note that txns other than the one from the main service can be executed asynchronously
  • Cancel: cancel the execution and release the resources reserved by the try; idempotent. Note that txns other than the one from the main service can be executed asynchronously (an interface sketch follows)
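
For illustration, one way to express the three phases of a TCC participant in code (names are assumptions, not a standard API):

// Illustrative only: one participant in a TCC transaction.
public interface TccParticipant {
    // Try: reserve resources (e.g., freeze credit, hold inventory); may fail.
    void tryReserve(String txId);

    // Confirm: apply the reserved change without re-checking; must be idempotent,
    // because the coordinator may retry it.
    void confirm(String txId);

    // Cancel: release whatever Try reserved; must also be idempotent.
    void cancel(String txId);
}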

Saga

P2p distributed saga

  • The Order Service creates an Order in a PENDING state and publishes an OrderCreated event to the Customer Service
  • The Customer Service receives the event and attempts to reserve credit for that Order. It publishes either a Credit Reserved event or a CreditLimitExceeded event to the Order Service
  • The Order Service receives the event and changes the state of the order to either approved or cancelled

Centralized coordinator saga

  1. The Order Service creates an Order in a pending state and creates a CreateOrderSaga
  2. The CreateOrderSaga sends a ReserveCredit command to the Customer Service
  3. The Customer Service attempts to reserve credit for that Order and sends back a reply
  4. The CreateOrderSaga receives the reply and sends either an ApproveOrder or RejectOrder command to the Order Service
  5. The Order Service changes the state of the order to either approved or cancelled
  6. Note that even in the centralized case, steps are performed IN ORDER rather than in parallel

Saga’s isolation problem

Lost update - a committed txn's change gets overwritten

  1. T1 creates order
  2. T2 cancels order
  3. T1 approves the order - without a proper check, T2's change can be ignored! That is, we need domain logic to handle the case where we are approving an order that is already cancelled

Similarly, during compensation, the previous result of the 'do' step may no longer be there because of another txn.

Dirty read - reading uncommitted changes

  1. T1 reserves credits
  2. T2 reserves credits, before parent txn of T1 commits

Non-repeatable read - reading twice within the same txn gives different results

Scenario:

  1. T1 creates order
  2. T2 updates order
  3. T1 tries to finalize the order, but finds the state has already changed!
  • To defend against the isolation problems above (see the sketch after this list):
    • have commutative updates between try and compensate, or re-read the value
    • mark the out-of-order try as rollback-only - it cannot execute
    • the txn should have a globally unique id
    • in a traditional DB, the read committed isolation level can prevent it
    • in a distributed setting, we can again use a semantic lock, although we also need to introduce a timeout + randomly jittered retries to prevent possible deadlocks
    • domain logic should get money in first before giving money out, to reduce the loss if compensation fails due to concurrency
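
A minimal sketch of the domain-logic defense against the lost-update case above: only approve when the order is still in the expected state, via a conditional UPDATE (table and state names are illustrative):

import java.sql.Connection;
import java.sql.PreparedStatement;

public class OrderApproval {
    // Returns false when another transaction (e.g., a cancel) already changed the state,
    // so the saga can react instead of silently overwriting the committed change.
    public static boolean approveIfStillPending(Connection conn, String orderId) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE orders SET state = 'APPROVED' WHERE id = ? AND state = 'PENDING'")) {
            ps.setString(1, orderId);
            return ps.executeUpdate() == 1;   // 0 rows -> the state was no longer PENDING
        }
    }
}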

TCC vs saga

  1. try vs direct
  2. in sequence vs parallel
  3. compensation order
  4. From the user's POV, TCC offers better consistency, because the try stage reduces the likelihood of the user seeing partial results. However, partial-result cases can always be constructed due to the nature of eventual consistency

How to conduct 1-on-1

2018-06-11 00:00:00 +0000

  1. Minimum: 30 mins every two weeks. Try your best to stick with the plan, and be mentally prepared for it.
    • If you can't handle it, it is a sign the team is too big. 7 +/- 3 people seems to be the sweet spot.
  2. Don’t ask “How are things?” - you should know this through normal standups. Instead, focus on
    • Company direction
    • Career development/path of growth - and follow up before the next review cycle
    • Conflict resolution
    • Constructive feedback
  3. Spend the first 10 mins discussing company- and team-wide issues, to pass the context down to the team
  4. After that, give feedback on performance and their individual goals. Your people are hungry to know how they're doing
  5. Leave the end chunk to questions and discussions. Very likely you don’t have answers to questions on the spot - take notes and follow up later
  6. May consider walking meeting to help relaxation.
  7. Stare into people's eyes and ask "Are you bored?".
    • Dig deep until they are able to look into your eyes too. The goal is to discover problems before they even realize them
    • “I don’t know what to do next” is a common sign of boredom
    • It’s the job of every manager to help with the flow of information up and down the organization. When people express frustration with leadership, it’s usually due to a failure in that flow.
    • stay silent for 10 seconds to see if the other side will pursue the topic or open up. In fact, you should try to stay silent as much as you can
    • The simple act of listening to someone bitch is often all you need to do. Let the fury pass. View it as an opportunity to let the other party organize their thoughts - people want to be heard
      • But do NOT join the bitching
      • Start the triage only after the steam is off

On difficult talks

  • Positive visioning. Before the conversation, think about your ideal outcome. It’s very possible that the conversation will go much better than you initially thought.
  • Create talking points at least one night before - you want to avoid a monologue. Review the points with another, neutral person
  • SLOW DOWN. Disengage once someone becomes defensive
  • Look at the person’s face!
  • Tell the person why you are disappointed and how you feel, but reaffirm that they are better than this and give actionable items
  • When commenting, compare with the same person's previous product, to cushion the shock

References