On MySQL InnoDB buffer pool

2019-11-08 00:00:00 +0000

InnoDB buffer pool

  • One LRU list split into a New (young) and an Old sublist, New:Old = 5:3 by default. Both store index and data pages
  • The New sublist mostly holds pages explicitly hit by queries
  • The Old sublist mostly holds read-ahead pages. Every newly read page starts at the head of the Old sublist, and moves to the New sublist once it is hit again
  • Aurora defaults to 75% of node’s physical memory
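The sublist mechanics above can be sketched as a toy model (Python purely for illustration; the 3/8 old fraction mirrors the default split, but the class and field names are made up and this is far simpler than InnoDB's actual implementation):

```python
from collections import OrderedDict

class MidpointLRU:
    """Toy model of InnoDB's midpoint LRU: one list, two sublists.

    Newly read pages enter the head of the Old sublist; a second
    access promotes them to the head of the New sublist, so one-off
    read-ahead pages cannot flush the hot working set."""
    def __init__(self, capacity, old_fraction=3 / 8):  # New:Old = 5:3
        self.capacity = capacity
        self.new = OrderedDict()  # hot pages, most recent at front
        self.old = OrderedDict()  # read-ahead / first-touch pages

    def access(self, page_id):
        if page_id in self.new:            # already hot: move to head
            self.new.move_to_end(page_id, last=False)
        elif page_id in self.old:          # second touch: promote
            del self.old[page_id]
            self.new[page_id] = True
            self.new.move_to_end(page_id, last=False)
        else:                              # first touch: head of Old
            self.old[page_id] = True
            self.old.move_to_end(page_id, last=False)
            self._evict()

    def _evict(self):
        while len(self.new) + len(self.old) > self.capacity:
            victim = self.old if self.old else self.new
            victim.popitem(last=True)      # drop from the tail

lru = MidpointLRU(capacity=4)
for p in [1, 2, 3, 1]:   # page 1 is touched twice, so it is promoted
    lru.access(p)
```

After this sequence, page 1 sits in the New sublist while the once-touched pages 2 and 3 stay in the Old sublist.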

Change buffer

  • Not available in Aurora
  • Uses buffer pool space; defaults to 25% of the buffer pool, max 50%
  • Caches non-clustered index changes, and merges them only when the affected index page is loaded into the buffer pool
  • The motivation is that modifications to secondary indexes are often very random access, yet without the change buffer we would have to read those pages into the buffer pool just to apply them

Be Professional

2019-11-07 00:00:00 +0000

How to act in scenarios

  • You are in a position of authority and need to issue an unpopular decision from your supervisor
    • just issue the decision, especially for ones applying to everyone and/or when the decision is from your supervisor
  • Sudden bad news/crisis. Options:
    1. From competitors: find their weaknesses to counter with, or escalate it into a common/general problem to shift focus
    2. From the public/users: be transparent about everything known
    3. Product issue: stay silent, fall back, keep a low profile
    4. Note these often appear together. Identify which is the main one
  • Authority issues a decision you deemed incorrect
    • Once the boss made the decision, it is our job to fulfill the vision within the boss’s framework. Don’t expect to influence the boss too much on the decision. In general, do not try to change people
    • You may disagree, but have to commit
  • Rejected by the authority in front of your subordinates
    • Know that the authority often has a different context - thus, complaining only shows that you are narrow-minded
    • One way is to not talk about that, and let time dilute the effect
  • Sudden incidents need to address
    • If the incident is external, focus on it, since it brings more impact. If it is internal, emphasizing it will give a (false) impression of internal disorganization
  • Got blamed for things out of your scope
    • Explain that it is not within your scope
    • Offer your solution - put more emphasis on what you can do to help solve the problem
  • Commanded by the boss’s boss directly
    • Should obey the command
    • Explain current context including boss’s direction
    • Let the boss know the change and command as soon as time permits
  • Someone wants to gossip with you
    • Be very explicit that you are not interested
    • Change topic
    • If it keeps happening, report to their manager

REMEMBER

  • Shut up on issues related to money and people management - that is the management’s job
  • The important thing is to bring value to the boss. Do not threaten to resign or actively reject tasks
  • Don’t mention things you hear in different locations, at least don’t frame as such
  • NEVER complain about your company or your team’s tasks in front of your subordinates; stop the complaints at you! However, it is OK to complain to your boss!
  • Corrective/negative feedback should be executed by the direct supervisor, even if it is from upper management

On motivating the team

2019-11-06 00:00:00 +0000

Training

  • Training is one of the highest-leverage activities you can provide to a team. Do it yourself, or have the best employees in each area run trainings
  • Change how people think by letting them practice different behavior over time
  • Mentorship is a strong reason an employee stays. A cheap, effective, but time-consuming motivator
  • No need to be formal, but need to be intentional
  • In training, spend more time on practice/repetition, and less on content

Finding interesting work for the team

  • If you want to be able to find interesting work and also work on important things, you generally have to go find the interesting important things yourself. This requires that you to talk to a lot of people and listen to their problems, and then place a bet on a solution to one of these problems that will actually both be feasible but will also be seen as important. Your manager might help identify people that you could talk to, but you must take responsibility for doing the legwork and making the final choice in problems to address.
  • A critical feature of effective complex organisations is that they make people do all the jobs.
  • Most companies design jobs and then slot people into them. Our best managers sometimes do the opposite: When they find talented people, they’re open to creating jobs around them.
  • The best go out of their way to help people do work they enjoy — even if it means rotating them out of roles where they’re excelling.
  • In the first week on the job, managers sit down with their new hires and ask them about their favorite projects they’ve done, the moments when they’ve felt most energized at work, the times when they’ve found themselves totally immersed in a state of flow, and the passions they have outside their jobs.
  • Build a searchable database of experts. The goal is to put employees’ strengths on display so that people know whom to contact.
  • Let them experiment. Let them obsess. Let them scratch that itch. If there is no project on their plate that you know is engaging them, create time for them to explore whatever they want to obsess about. I absolutely guarantee there is an investigation somehow related to their work that they are dying to tinker with.

Perks

  • Many companies have over-invested in hygiene factors. High pay, tons of perks, and over-investment in superficial aspects of “company culture” represent the wrong end of the 80-20 rule for impacting the motivation of your people.
  • Perks can be expensive, but they’re easy. Once you add a perk, it just continues to be a line item on your budget. There’s little you have to think about.
  • The next time you think about adding a perk, challenge yourself how beneficial it will really be for your company. Avoid the trap of choosing perks rather than the hard work of improving your managers
  • Talk about great managers as your company’s No. 1 benefit. Manager quality is the single best predictor of whether an employee will stay or leave. Good managers also cause their reports to have higher performance scores.
  • People refer friends because it’s a great place to work, not (primarily) for the referral bonus.
  • Non cash rewards, whether they are experiences (dinner for two, a trip) or gifts (a new phone) trigger an emotional response. When people are surveyed they say they prefer cash awards, but they report higher levels of happiness when receiving experiential awards.

  • Rather than just look at how to engage employees, it’s also helpful to look at how to lose them, and do the opposite.
  • Spread mundane, chore tasks evenly between teams
  • Stress is stress, fatigue on one task spills over into another
  • we need to be mindful that what is going on in our players’ minds, and in their personal lives away from the training ground, plays a huge part in what we get from them on a day-to-day basis

How to win friends and influence people

2019-11-05 00:00:00 +0000

  • Have to be interested in people when face-to-face
  • Start with your own shortcomings and the counterpart’s superiority before criticizing
  • Quick feedback loop!
  • Specific details when you compliment
  • Give people a fine reputation they live up to. Use titles and responsibilities as toys for people that they want to own
  • Always make people happy doing what you suggest. What benefit will they gain from doing what you suggest?
  • Rejecting people is OK, but see if you can provide an alternative solution at the same time
  • Avoid insults, ridicule, and criticism in most cases - they are just not productive. (On a side note, I see so many people violating this by trying to be funny)
  • Use the person’s first name as much as you can
  • Do not tell other people you are wiser than them
  • Don’t teach people; guide them to find it within themselves
  • Presenting facts in a vivid, dramatic way with multiple senses

Guideline for interviewer/hiring

2019-11-04 00:00:00 +0000

  • Secretary Problem. Picking a candidate is inherently hard; expect > 30% not-so-great hires!
  • Culture fit evaluations can invite bias into your process if interviewers see it as simply a way to assess a candidate’s likeability, rather than how they align with your company’s core values.
  • You want people who will fit in and flourish at your company while, at the same time, challenging the outlook and ideas of your employees.
  • Never ever hire a person you feel is overly ego-driven.
  • Overall manager should personally vet the new hires
  • Who to hire/promote etc. should be decided by a committee with believability-weighted decision making
    • How the decision is made should be transparent, but not the individual situation
  • Better to miss a great hire than to make a bad hire. Veto rights should be given to multiple people
  • First impressions are biased and carry too much weight
  • Use structured interview questions or sample work, and score on consistent rubrics, e.g., “Tell me about a time when you [outcome that is needed for this role]”. Avoid unstructured interviews if you can - people over-estimate themselves at evaluating other people
  • Interviewing skill is not evenly distributed; ask the best interviewers to make this a bigger part of their job
  • Have at least two interviewers assess each value or key competency
  • Expert interviewers are often the reason a candidate joins a company
  • Team should feel the alignment of the tech perspective with the candidate
  • Candidate can be a leader or follower regardless of current status and self-esteem
  • Don’t sell the position with cool tech. Instead, look for people who want ownership and are interested in the business and the customer
  • Easier to hire 90th percentile than training avg person to 90th percentile
    • “Instead of 150 new people, are you sure you don’t want 75 people whom you pay twice as much because they have twice as much experience and can be higher performers?”
  • Use closed tech questions to filter out obvious nos
    • Ask why a certain tech is needed, and then dive deep into details to probe the max depth

How do managers and ICs get stuck

2019-11-04 00:00:00 +0000

How do managers get stuck

  • Need to attend to details. Communicate in a timely manner what is accomplished, setbacks, challenges, and decisions made
    • Push it before being asked, and provide details if necessary
  • Professional verbal, written, and body communication in front of the senior management. Can you present to the peers of senior management alone?
  • When you are asked to do something by your manager. Either do it or say you don’t want to do it. Just don’t fail to do it

How do ICs get stuck

Common causes (phrased as the advice to counter them):

  • Don’t try to be 100% prepared before starting the project
  • Timebox the research, especially prototyping
  • Limit refactoring
  • Don’t take on helping others while you have assigned tasks
  • Don’t jump on fires when not on-call

Common symptoms of difficulty

  • Finishing the last 20% of the project
  • Starting a project from scratch
  • Project planning
  • Working with an unfamiliar lib/system/codebase
  • Dealing with surprises and setbacks
  • Dealing with bureaucracy or external partners
  • Deploying to prod

What's the most important part of getting your job done?

2019-11-04 00:00:00 +0000

Source

For interviewer

  • Identify workflow, are you able to learn tools/methods from the candidate?
  • How do they deal with problems that are bound to happen in the work
  • Alternatively, “Describe your process for [a major part of the job]

What’s the most important part of getting your job done

The most important part is to get stakeholders, most likely from different teams, to agree on the context of the problem and document this agreement. Such agreement needs to be revisited once every two weeks

This includes:

  • Goals and measurements, i.e., what external impact we will have after we finish this project. This number should be meaningful to the external user/customer, not an internal metric of the team carrying out the task
  • Roles of stakeholders: who is the DRI, who can veto, who can inform, whose signoffs (ideally no more than 3) are needed
  • Non-goals: clearly rejected scope, so that we can revisit it later
  • Milestones and revisit interval: every 1-2 weeks

Describe your process for handling performance problems

Case 1: slow query

  • API timeout
  • Check the slow query log; identify slow queries via long-running processes
    • Re-run the query on prod or a repro env to reproduce it; especially check WHERE conditions not covered by the index, to make sure it doesn’t scan too many rows or pick a bad query plan
  • Informed stakeholders and applied the fix in 30 mins. Documented the RCA the following morning

Case 2: timeout too long

  • In preparation for the big campaign, I was inspecting modules when one module’s p99/max spiked on production
  • Check db qps and latency - Normal
    • Check external API - way longer than expected, and the main API didn’t return 408 or 504
    • Check the past load testing result, without this external API call all within the expectation
    • Discussed with the owner of the module, confirmed it is a tech debt
  • Do nothing until the next release cycle because this one has no customer impact

Case 3: connection pool size too small

  • API timeout
  • Check the chained API, the latency is good, and request count really low
  • Suspect client side problem. Check the HttpClient docs, find the common problem
  • Informed stakeholders in a day and documented the RCA

Describe your process for handling SEV 2 and above

Case 1: user didn’t see the payment on

  • Unable to reproduce in other environments. Directly affecting 200 txns per sec
  • Add metrics at every single level, and release it to narrow down the code
    • Read source code and draw graph to understand the flow
  • Found the explanation and mitigation; released the degraded but correct experience in 3 hours
    • The complete fix took two weeks to release, because we would not see such traffic again soon, and people were tired after the big campaign

Case 2: periodically latency spike

  • A critical service’s API latency spikes every day, and k8s health check failed
  • Check db queries to find the ones with the longest total time
    • Found that they came from a db co-hosted on the same server
  • Provide mitigation by turning off the cron job
    • Complete fix provided in two weeks with proper db separation and indexes

Case 3: Run out of auto incremental id

  • 36 hours before switching online, unable to execute txn
  • Ask support for mitigation and test case. Tested independently both us and 3rd party
  • Contact sign off people to explain my proposal and the risk of doing so. Got approval to go ahead as planned
  • Goes as planned without incident
    • Hot fix released in 3 months. Upgraded in 5 months

Describe your process for handling disagreement with 360

Case 1 & 2: db selection - round 1 & 2

  • Run into performance problem with existing db
  • Document all pros and cons for discussion with stakeholders
    • Got overruled. Stated my objection and proposed a PoC
    • Decision made without consulting me
    • Provide migration plan to commit the success
  • Ran into the performance problem again after 6 months. Unable to handle it even after upgrading to the highest spec
    • List different options with new context
    • PoCed with all options
    • Stakeholder picked the option and went ahead
  • Stable, incident-free deployment

Case 3: the performance impact of count(*)

  • 500 error due to failed health checks during load testing
  • Explained why the current solution won’t work. Offered a sample solution. The other team’s owner didn’t commit to fixing it.
  • Raised to my chain of command, forwarded solution, and told not to worry about it after that

What happens when JVM creates a new object

2019-11-01 00:00:00 +0000

Sequence of actions

  • Every time we call new, a new object is created. Without new, autoboxing may reuse cached values from the constant pool
  • Load the class via class loading if not done yet
  • Allocate the memory
    • When the heap memory is fragmented, often a result of CMS, maintain a free list of which segments are available
    • When the heap memory is not fragmented, i.e., with Serial and ParNew GC, put used memory on one end and unused on the other, with a pointer as the separator (bump-the-pointer)
    • To preserve thread safety of memory allocation
      • TLAB: reserve a segment for each thread in Eden
      • When TLAB is exhausted, use CAS+retry
    • May maintain a separate handle pool, and the stack just points to the handle, so that after GC, no need to change the reference on the stack. The tradeoff is that any access to the data needs to go through 2 pointers
  • Init 0 value
  • Set object header
  • Execute <init>()
    • Parent class first. Instance initializer blocks first, then the constructor
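The allocation steps above can be modeled roughly as follows (a Python sketch for intuition only; HotSpot does this in C++ with real CAS instructions, and all class and field names here are invented):

```python
import threading

class Heap:
    """Toy bump-the-pointer allocator: used memory sits below `top`,
    free memory above it."""
    def __init__(self, size):
        self.size = size
        self.top = 0
        self.lock = threading.Lock()  # stands in for CAS + retry

    def alloc(self, n):
        with self.lock:               # the JVM would CAS the top pointer
            if self.top + n > self.size:
                raise MemoryError("would trigger GC here")
            start = self.top
            self.top += n
            return start

class TLAB:
    """Thread-local allocation buffer: a segment of Eden reserved for
    one thread, which it bumps without any synchronization."""
    def __init__(self, heap, size):
        self.heap, self.size = heap, size
        self.start = heap.alloc(size)   # reserve the segment up front
        self.cursor = self.start

    def alloc(self, n):
        if self.cursor + n > self.start + self.size:  # TLAB exhausted
            return self.heap.alloc(n)   # fall back to the shared CAS path
        addr = self.cursor
        self.cursor += n
        return addr

heap = Heap(1024)
tlab = TLAB(heap, size=128)   # 128 bytes of Eden reserved for this thread
a = tlab.alloc(16)            # lock-free fast path
b = tlab.alloc(16)
```

The design point: the common case (small object, TLAB has room) touches no shared state at all; only TLAB refills and large objects pay for synchronization.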

What Makes an Effective Executive

2019-10-30 00:00:00 +0000

Source

LISTEN FIRST, SPEAK LAST

Ask “What needs to be done?”

  • NOT “What do I want to do?”. Not asking it almost always guarantees a wrong decision
  • Focus on ONLY one urgent task at a time
    • After it is done, re-evaluate priorities instead of moving on to the second most urgent

Develop action plans

  • Action plan states intention but not commitment. Even if it is written, should expect to revisit and revise often
  • Organizations are time-wasters in nature
  • Most likely you don’t stick with the original plan, but without planning, you become a prisoner of events. Also need regular check-in to figure out which ones truly matter

Responsible for decision

  • A decision must name all stakeholders and their precise roles in the decision
    • Also a deadline must be included
  • Hiring/promoting is often the most difficult decision. Expect 1/3 success rate, and revisit people decision after 6 months.
    • Even if it is a failure, it is the exec’s mistake, not the failed person, BUT this person has to be removed - offer them to go back to the original level
  • On weak areas, delegate

Focus on opportunities rather than problems

  • How can we exploit this change for the company?
  • List and deal with opportunities first, problems next
  • Put the best people on opportunities rather than problems.

Run productive meetings

  • Make sure all participants agree on why the meeting is happening
  • Don’t sign up for committee meetings
  • Don’t brainstorm together. Brainstorm offline, and use meeting to review and eval the ideas. People should feel free to criticize DURING the meeting
  • MAKE SURE decide what kind of meeting it will be, each needs different preparation and results
    • Preparing a statement or announcement: one member comes up with a draft before the meeting. At the end of the meeting, a pre-appointed member distributes the final text
    • Announcement, e.g., org changes: the announcement and discussion about it only
    • One member reports: nothing but the report should be discussed
    • A few member reports: No discussion, or questions for clarification only, or discussion where all participants ask questions. In this case, distribute report before the meeting, and timebox it (e.g., 15 mins)
    • Informing the exec: the exec should listen and ask questions; he should sum up but not make the presentation
    • Meeting just to be in exec’s presence: lunch/dinner format ONLY
  • Designate someone to take notes in the meeting.
  • Sloan’s meeting method
    • Announce the purpose
    • Listen and only clarify points
    • Sum up what is to be included in the memo at the end of each discussion point or end of meeting
    • Immediately follow with an after-meeting memo, which summarizes discussions, conclusions, and action items/assignments/deadlines. This memo is cced to every participant
  • Too many meetings means the responsibility is too decentralized. In a lean org, people should be able to move around without collision and without explaining their work all the time. Otherwise, the org is overstaffed

Common Redis Problems

2019-10-29 00:00:00 +0000

  • Sentinel checks master/slave liveness, and will perform the failover switch
    • Sentinel itself runs with at least one master and one slave (outside the redis master-slaves). Clients connect to the sentinel instead of the main nodes directly
  • Cluster is for sharding
  • How is the master elected?
  • How does redis cluster know which shard to use? Why 16k slots?
    • Every node maintains its slots and their corresponding key ranges in a bitmap. Fewer slots compress better
    • More slots would mean bigger heartbeat messages, which is why the count is capped at 16k

Skip list

  • Redis uses a skip list instead of an RBT to implement sorted sets
  • It needs to support interval search and scan, which is hard to do with an RBT
  • Easier to implement than an RBT
  • Each item has probability p of also existing in the next level of the linked list. Therefore, we need to scan at most 1/p items at each level to detect the segment that contains the item
  • Each level is p times the size of the previous level, so the number of linked lists is log_{1/p}(n). Total search cost is the number of linked lists * expected items scanned in each linked list
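A minimal skip list illustrating these points (a sketch, not Redis's zskiplist; Redis adds backward pointers and spans for rank queries):

```python
import random

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level   # one forward pointer per level

class SkipList:
    """Minimal skip list: each node appears in level i+1 with
    probability P, giving ~log_{1/P}(n) levels and ~1/P nodes
    scanned per level, hence O(log n) expected search."""
    P = 0.5
    MAX_LEVEL = 16

    def __init__(self):
        self.head = Node(None, self.MAX_LEVEL)  # sentinel node
        self.level = 1

    def _random_level(self):
        lvl = 1
        while random.random() < self.P and lvl < self.MAX_LEVEL:
            lvl += 1
        return lvl

    def insert(self, key):
        update = [self.head] * self.MAX_LEVEL
        x = self.head
        for i in range(self.level - 1, -1, -1):  # walk top level down
            while x.forward[i] and x.forward[i].key < key:
                x = x.forward[i]
            update[i] = x
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        node = Node(key, lvl)
        for i in range(lvl):                     # splice in per level
            node.forward[i] = update[i].forward[i]
            update[i].forward[i] = node

    def range_search(self, lo, hi):
        """Interval scan -- the operation that is awkward on an RBT."""
        x = self.head
        for i in range(self.level - 1, -1, -1):
            while x.forward[i] and x.forward[i].key < lo:
                x = x.forward[i]
        x = x.forward[0]
        out = []
        while x and x.key <= hi:
            out.append(x.key)
            x = x.forward[0]
        return out

sl = SkipList()
for k in [30, 10, 50, 20, 40]:
    sl.insert(k)
```

Once the search lands at the left edge of the interval, the scan is a plain bottom-level linked-list walk, which is exactly why range queries are cheap here.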

Eviction

  • Since 3.0, redis maintains a candidate pool of size 16 holding the top-K least recently visited entries seen so far. When eviction is needed, just purge from the candidate pool
  • Strict LRU needs a linked list, which introduces concurrency and size issues. That is why redis uses a 24-bit timestamp to implement approximate LRU
  • The expiration cycle continues if the ratio of expired keys sampled is > 1/4
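The sampling-plus-pool idea can be sketched like this (toy Python; the pool size follows the note above, the sample count and function names are made up, and real Redis stores the clock in each object header rather than a dict):

```python
import random

POOL_SIZE = 16
SAMPLES = 5   # sample a handful of keys per eviction attempt

def evict_one(keys, pool):
    """Approximate LRU: sample some keys, keep a small pool of the
    least recently used candidates seen so far, evict the oldest.

    `keys` maps key -> last-access timestamp (standing in for the
    24-bit clock Redis keeps per object)."""
    sampled = random.sample(list(keys), min(SAMPLES, len(keys)))
    for k in sampled:
        if k not in pool:
            pool.append(k)
    pool.sort(key=lambda k: keys[k])   # oldest access time first
    del pool[POOL_SIZE:]               # keep only the 16 best candidates
    victim = pool.pop(0)               # least recently used so far
    del keys[victim]
    return victim

random.seed(0)
keys = {f"k{i}": i for i in range(100)}   # k0 has the oldest timestamp
pool = []
victim = evict_one(keys, pool)
```

The pool is what makes this "approximate LRU" converge: candidates survive across eviction rounds, so repeated sampling gradually concentrates on genuinely cold keys without any global linked list.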

Defend against a single, super hot key’s expiry

  • If we want to update from the client side, i.e., don’t want a separate process to update cache entries, options:
    • Set a mutex key (SETNX), load the db, and set the cache. Otherwise, retry the whole get-cache method
    • Alternatively, store a logical timeout1 inside the value itself; when we read the cache and find timeout1 already passed, we extend timeout1 and reset it in the cache. This way, other readers still think it is safe to use the cache
    • Or do not set an expiry time at all, but store the logical timeout inside the value; if we find it is about to expire, use a background thread to reconstruct the value asynchronously
  • Still need to combine with rate-limiting + active standby to defend in depth.
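The SETNX-mutex option can be sketched as follows (an in-memory dict stands in for Redis so the sketch is self-contained; `load_db`, the TTLs, and the key names are all hypothetical):

```python
import time

store = {}  # stands in for Redis: key -> (value, expire_at)

def setnx(key, value, ttl):
    """SET key value NX EX ttl: returns True only for the first caller."""
    now = time.time()
    if key in store and store[key][1] > now:
        return False
    store[key] = (value, now + ttl)
    return True

def get(key):
    entry = store.get(key)
    if entry and entry[1] > time.time():
        return entry[0]
    return None

def load_db():                 # hypothetical slow DB read
    return "fresh value"

def get_with_mutex(key, retries=3):
    """On a cache miss, only the winner of the mutex key rebuilds the
    cache; everyone else briefly waits, retries, and picks up the
    rebuilt value instead of stampeding the DB."""
    for _ in range(retries):
        value = get(key)
        if value is not None:
            return value
        if setnx("mutex:" + key, 1, ttl=10):   # we won the rebuild
            value = load_db()
            store[key] = (value, time.time() + 300)
            store.pop("mutex:" + key, None)    # release the mutex
            return value
        time.sleep(0.01)                       # lost the race: retry
    return None

v = get_with_mutex("hot_key")
```

The mutex key's own TTL is the safety net: if the rebuilding process dies, the lock expires and another caller can rebuild.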

HA

  • Sentinel is for HA with master-slave replication. Note redis cluster is for sharding + replication, which also supports master-slave failover.
    • Sentinel and redis cluster should not be mixed
  • RDB
    • BGSAVE: fork a child process to create the file. The forked process uses CoW, i.e., it shares pages with the parent process until they are modified
    • This is mainly used for hourly backup/restore
  • AOF
    • forks a child process to rewrite the old AOF
    • write commands to appendonly.aof
    • Replays AOF when the server reboots
  • The RDB file goes to the replica first, and then the AOF log is replayed

System design: point system for ecommerce

2019-10-29 00:00:00 +0000

All e-commerce order systems follow very similar patterns

Requirements

  • Points gained via activities on the site
  • use points to redeem gifts

Data structure

  • Account table to keep track of points of the user
  • Need a txn table to track redemption history (user_id, product_id, used_points)
  • Storage and shipping module
    • type: is it purchase or point?
    • txn id: the txn record of the redemption history
    • Product id, cnt,….etc
    • tracking number, for 3rd party courier

Idempotency

  • Get token from service with TTL
    • On submit, try deletion. If does not work, reject the request
  • Optimistic lock: version col on the DB row
  • Some people propose a distributed lock too, but I don’t see the point in it
  • API needs to require both source and seq as params
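The optimistic-lock option can be sketched as follows (an in-memory object stands in for the DB row; in SQL this would be an UPDATE with `WHERE id = ? AND version = ?`, and all names here are illustrative):

```python
class AccountRow:
    """Stand-in for a DB row carrying a version column."""
    def __init__(self, points):
        self.points = points
        self.version = 0

def redeem(row, cost, read_version):
    """Compare-and-set on the version column: the update only takes
    effect if nobody modified the row since we read it, so a retried
    duplicate submit with a stale version is rejected."""
    if row.version != read_version:   # WHERE version = ? matched 0 rows
        return False
    if row.points < cost:
        return False
    row.points -= cost
    row.version += 1
    return True

row = AccountRow(points=100)
v = row.version
assert redeem(row, 30, v)   # first submission succeeds
ok = redeem(row, 30, v)     # duplicate retry carries the stale version
```

The second call fails without touching the balance, which is exactly the idempotency we want from a retried redemption request.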

General e-commerce components

  • Catalog service for products
  • Shopping carts
  • Order system
    • Order itself
    • Async processing: coupon, push notification
    • Query
    • Return/Refund
    • Integration with 3rd party courier
  • Payment system
    • Often calls back to the order system to ACK
  • Delivery/Storage system

Order phases

  • Before generating order: txn systems. Online/sync
    • shopping carts - can stay in cache
    • order details
    • fee calculation
  • Generate order: order system. Online/sync
    • balance
    • point and coupon
    • stock freeze
    • generate order and item
  • After generation: fulfillment. Offline/async
    • review
    • delivery

System design: social network

2019-10-28 00:00:00 +0000

Data structure: relationship graph

  • As the number of n-to-n relationships grows, a relational and then a graph model becomes more appropriate.
  • Note that Vs and Es don’t need to be of homogeneous types at all, e.g., a user can like a picture
  • vertices:
    1. uid
    2. metadata
    3. set of incoming edges
    4. set of outgoing edges
    5. what type of vertex it is
  • edges
    1. uid
    2. metadata
    3. marks what kind of relationship this edge represents
    4. both ends of the edge

This separation of model means we can store Vs and Es into different tables/DBs

Requirement: Daily system notification

Display the notification only the first time the user logs in each day.

Design:

  • Maintain a list of notifications. Since these are admin messages, the volume will be no more than ~1k per day, so an HA DB with read replicas is enough
    • We can even cache the most recent 10 days of notifications in memory, as long as they are immutable
    • Thanks to epoll, each server can hold at least 20k connections; for 200 mil users, this means 10k servers are more than enough to handle the peak
  • Client side, keep track of the last login time; if the date is different, ask the server side to compute the list of notifications to push
  • Server side, use rules to compute notification
    1. created since the last poll
    2. within the scope of the notification’s target
    3. may fetch from the most recent, taking only the top k
    4. update the last poll time after push notification is successful

Requirement: Unread group chat for group member

Design:

  • Maintain a group message table (group_id, message seq or message_created_at) -> message_id
    • Assuming each user is rate-limited to 5 sec between messages, writes are only 200/sec for a 1k-member group
  • Maintain a user group membership table (user_id, group_id) -> last_polled_time
  • Client side polls with its locally saved last poll time. Note that the client side may need to dedup messages in case of retries

Requirement: can publish feed and read feeds

Also supports the pagination on the feeds page

Design:

  • May use separate tables to store user -> following and user -> follower relationships for write and read broadcast
  • The data structure will be (user_id, following_id) -> metadata
  • May also need a table for (user_id) -> follower_count/following_count

Do we need to persist read feed on publish?

  • Need a background job to populate to all followers. Latency concern to see popular user’s post
  • Write amplification - imagine a celebrity: one write amplifies to tens of millions
  • So read-feed-on-publish is not feasible
  • What if I want to see the historical posts of a person I just followed in my feed? This means every time I follow and un-follow, I need to mark the currently persisted feed as outdated and populate the delta

Can we generate read feed on read? Possible for the first page

  • Get the top K most recently updated posters by scatter-gather or a join
  • Get the top K posts from each of them and then sort at most k^2 posts in memory
  • BUT, if we include pagination in the picture, scrolling means true scatter-gather, which may cause a read amplification problem. So generating the feed on read may not be feasible if we are strict on the pagination semantics
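The first-page case above is essentially a lazy k-way merge over each followee's newest-first post list, which can be sketched as (post lists, names, and timestamps are all illustrative):

```python
import heapq
from itertools import islice

def first_page(posts_by_user, k):
    """Scatter-gather for page one: take the first k posts from each
    followee's newest-first list and k-way merge them. heapq.merge is
    lazy, so we materialize at most k results."""
    streams = [
        [(-ts, user, post) for ts, post in posts[:k]]  # negate: newest first
        for user, posts in posts_by_user.items()
    ]
    merged = heapq.merge(*streams)           # lazy k-way merge
    return [(user, post) for _, user, post in islice(merged, k)]

posts_by_user = {                 # (timestamp, post_id), newest first
    "alice": [(105, "a2"), (100, "a1")],
    "bob":   [(103, "b1")],
    "carol": [(110, "c1"), (101, "c0")],
}
page = first_page(posts_by_user, k=3)
```

This is also where the pagination objection bites: page two would need each stream to resume past an arbitrary cursor, which forces the full scatter-gather again.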

How about generating persisted feed on demand?

  • This is possible. Consider 100 mil MAU, each person reading 200 feeds per day: that is 20B records per day; at 128 bytes each, that is roughly 2.5 TB per day, not entirely unreasonable
  • Again, this means we need to keep track of deltas of the user’s recently added/removed posters, so we can update the feed

Requirement: User can like a post and undo it

We need to display

  1. like counts
  2. How many among your friends liked the post

Design:

  • maintain a (post_id, user_id) -> metadata table
  • a post_id -> liked_cnt table is debatable. Most likely we need it. We can also use a HyperLogLog data structure, which has roughly 2% error with ~1.5 KB of memory even for counts > 1e9
  • Similarly, we can use a post_id -> liked-user_id bloom filter table to speed up scatter-gather. Less than 10 bits per element for a 1% false-positive rate
  • Whether to pre-compute the liked count is very similar to the post feed discussion. Each page shows no more than 10 new posts, so computing in memory is good enough
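A toy bloom filter showing the bits-per-element trade-off mentioned above (a sketch built on a stdlib hash; production systems would use faster non-cryptographic hashes, and the sizing constants here are illustrative):

```python
import hashlib

class BloomFilter:
    """Toy bloom filter: ~10 bits per expected element with several
    hash functions gives roughly a 1% false-positive rate."""
    def __init__(self, expected_items, bits_per_item=10, hashes=7):
        self.m = expected_items * bits_per_item   # total bits
        self.k = hashes
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Derive k independent positions by salting one hash function
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        """False means definitely absent; True may be a false positive."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

liked = BloomFilter(expected_items=1000)
liked.add("user:42")
```

For the scatter-gather use case, a negative answer lets us skip a post entirely; only positives need the exact (post_id, user_id) lookup.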

Requirement: post can be nested replied, and replies sorted by number of likes

That is, show the most popular replies under the post.

This means we need to store a tree like structure in DB

  • Each reply needs to keep track of
    • root post
    • parent reply
    • number of likes
    • who liked them
    • Position among its siblings
    • How many children it has
  • Need a separate OLAP pipeline to generate this top liked reply tree

Requirement: User searches keywords

  • Upon posting, send the tweet to the indexer
  • A separate search service to handle the request
  • It chose to store the home timeline of the follower in the cache via a fanout service, to optimize for the read performance
    • By a redis list
  • It adds the tweet to the search index service
  • Avoid fanning out for highly followed users
  • Abstract follower/following relationship into a user graph service

System design: search engine

2019-10-28 00:00:00 +0000

Crawler

  • Spider crawls web to create build data (BD)
  • Estimate 1T new pages generated per day, e.g., 10 pages for everyone in the world
    • At ~100 KB each, that is 100 PB per day
  • Need to store:
  • URL - which can be in a trie to reduce the char space
  • last_visited_at
  • hash
  • Links to crawl, separately from the crawled links
  • May separate into day, hour, and month dbs; change only the hour db, and divide & conquer the search result
  • Seed URLs can be populated by hand
  • DNS caching needs to be tuned or enhanced, since the crawler discovers a lot of new URLs

How much can a trie save?

  • Assume all URLs are of length up to l, with evenly distributed chars. Storing them flat costs 26 + 26^2 + … = O(l * 26^l) chars
  • In a trie, each node has up to 26 children, so it stores O(26^l) nodes => we saved a factor of l
  • The intuition being that no matter which way we estimate, we only need space to keep track of the unique markers (distinct prefixes)

Page indexer

  • Indexer reads/receives URL from the crawler, and build inverted index by visiting the URL
  • Child links from the page are needed for PageRank
  • Choose to snapshot the pages visited, including the links and snippets. This will power the document service when the search indexer returns the URLs
    • May build a forward index of keywords and prefixes for each document
    • For quick filtering, can use a bloom filter for the keywords and prefixes

Search indexer

  • Analyze the search keyword

Search flow

  • Goes through search indexer to limit the documents to search
  • Ranker will score and sort results and return the first page result
  • Skip lists are the most common approach to the ordered linked-list union/intersection problem. O(log n) perf.
  • Cache popular search results. Reading 1MB sequentially takes 250 microsec from memory, 1ms from SSD
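The posting-list combination step can be sketched with a skip-ahead intersection (here binary search stands in for skip pointers; the doc ids and term names are made up):

```python
from bisect import bisect_left

def intersect(a, b):
    """Intersect two sorted posting (doc-id) lists. Instead of walking
    the longer list one element at a time, binary-search ahead -- a
    stand-in for skip pointers -- which pays off when one list is much
    shorter than the other."""
    if len(a) > len(b):
        a, b = b, a                   # iterate over the shorter list
    out, lo = [], 0
    for doc in a:
        lo = bisect_left(b, doc, lo)  # jump ahead in the longer list
        if lo == len(b):
            break                     # longer list exhausted
        if b[lo] == doc:
            out.append(doc)
    return out

docs_foo = [1, 3, 7, 9, 42, 100]      # postings for term "foo"
docs_bar = [3, 9, 10, 42, 77]         # postings for term "bar"
both = intersect(docs_foo, docs_bar)
```

Because `lo` only moves forward, each list is scanned at most once; the binary search just shortens the scan, which is the same intuition as skip-list forward pointers.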

System design: tiny url

2019-10-25 00:00:00 +0000

Generate new URL:

  • A brute force hash won’t work because we want to keep the URL short
  • hash of (snowflake id + salt) - need to do base conversion, e.g., from decimal to base-62
    • Similarly, we can pre-generate short URLs by hashing an incremental id, and assign them to incoming long URL requests
  • MurmurHash
  • Brute force picking 7 random chars may cause multiple rerolls when the space is close to full
  • Limit it to 255 varchars. Note 62^7 is already about 3.5 trillion combinations
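
The decimal-to-base-62 conversion is plain repeated division (the alphabet ordering below is an arbitrary choice):

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base62(n):
    # convert a non-negative integer (e.g., a snowflake id) to base-62
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

# 7 base-62 chars cover 62**7, about 3.5 trillion ids
```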

Need to persist:

  • past long -> short mappings for lookup
  • past short mappings for conflict check
  • We can also solve conflicts with a unique index on the short url
  • Bloom filter to detect obviously new ones - a 1-billion-bit filter consumes only about 125MB
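
Standard Bloom filter sizing formulas back up the capacity math (a sketch; the 1% false-positive target below is an assumption):

```python
import math

def bloom_bits(n_items, fp_rate):
    # optimal bit count: m = -n * ln(p) / (ln 2)^2
    return math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2)

def bloom_hashes(m_bits, n_items):
    # optimal hash function count: k = (m / n) * ln 2
    return max(1, round(m_bits / n_items * math.log(2)))

# a plain 10^9-bit set is 10^9 / 8 bytes = 125 MB, as noted above
```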

The server can return either 301 (permanent redirect) or 302 (temporary redirect). The difference is whether the client will cache the redirect result

Handling edits and deletes

System design: design a rpc framework

2019-10-25 00:00:00 +0000

Check eureka for design details

Service provider registers itself on the config center on start

Service consumer fetches (and caches) endpoint info

Message format

  • interface
  • method
  • paramType
  • paramVal
  • requestId - client identity + sequence number, so that multiple threads can issue the same method concurrently
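
The fields above might be serialized like this (a JSON sketch; field names follow the list, the client id scheme is an assumption):

```python
import itertools
import json

_seq = itertools.count(1)

def make_request(interface, method, param_types, param_vals, client_id="client-1"):
    # requestId combines client identity and a sequence number, so that
    # concurrent calls to the same method can be matched to their responses
    return json.dumps({
        "interface": interface,
        "method": method,
        "paramType": param_types,
        "paramVal": param_vals,
        "requestId": f"{client_id}-{next(_seq)}",
    })
```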

Registry:

  • Class
  • hostName
  • ip
  • port

Use ZK

  • For each method of each interface, register an ephemeral node; use the method id as the key, with the service list as the data

System design: flash sale

2019-10-24 00:00:00 +0000

General workflow

  • Dedup multiple requests from the same user
    • unique token for each submission request on the frontend
    • redis for consistency check, with a 5 min expire time
    • Use DEL to check if it is a dup or not. A successful DEL means it is the first time
    • API needs to accept source + seq at the same time, and they should form a unique index
    • Can also use standard distributed lock
  • check stock. The user is redirected to a waiting page instead of the sales page now
    • optimistic lock/or distributed lock. Note that pessimistic lock has bad performance
    • stock count cached in redis, 5 min expiry time. Note that we can set the count slightly higher than the actual stock, and reject the out-of-stock ones at the next level
    • MQ
  • deduct stock
    • If we create the order first and deduct later, people may create many orders without paying
  • create order
    • Cancel order can be done via MQ
    • Status is polled by the waiting page
  • pay
    • If no payment by a certain time, release the stock
    • If we deduct only after payment, a user may place an order but then be unable to pay because stock ran out
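
The DEL-based dedup check can be sketched like this (a dict stands in for Redis; in production this would be SET with a 5-minute expiry followed by DEL):

```python
# dict stands in for Redis here; real keys would carry a 5-minute TTL
pending_tokens = {}

def issue_token(token):
    # frontend gets a unique token per submission attempt
    pending_tokens[token] = 1

def try_submit(token):
    # a successful delete means this is the first submission;
    # a second attempt finds nothing to delete and is rejected as a dup
    return pending_tokens.pop(token, None) is not None
```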

Rate limiting

  • Reduce the traffic layer by layer. Needs to be done on both producer and consumer
    • ngx_http_limit_conn_module, ngx_http_limit_req_module
    • interceptor at the service + centralized redis counter
  • Often need to RL at dynamic hotspots
  • Tools:
    1. Token bucket - tokens drip in at a constant rate; no producer thread is needed, since from the last refill timestamp we can calculate how many tokens to add. When the bucket is full, excess tokens (not packets) are discarded. Requests can drain the bucket at once, so the goal is to limit the AVERAGE rate while accepting bursts up to the bucket capacity. Good for blocking use cases
    2. Leaky bucket - unlike token bucket, the output drains at a fixed rate, and packets are discarded when the bucket is full. Good for blocking use cases. Lower resource utilization than token bucket, i.e., it caps the max output rate; the goal is to smooth the output, so bursts are not allowed
    3. Sliding time window - slide the window end, delete obsolete entries, and then update the counter - but may still see peak traffic within a REALLY small sub-window inside the window. Good for rejecting-style rate limiting. Simple
  • When the rate limit is hit, ways to handle the overflow:
    1. direct reject
    2. add to a blocking Q
    3. log + warning
  • To test:
    • redirect prod traffic to a small subset of service instances, e.g., even only 1 instance, and see if the rate limiter is in effect
    • the shape of the TPS graph should be more reasonable
  • Can be deployed at API gateway level or a separate RPC service
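
A lazily-refilled token bucket needs no producer thread - the time elapsed since the last call tells us how many tokens to add (a sketch):

```python
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # max burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        # refill from elapsed time instead of a background producer
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```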

The Standard of Code Review

2019-10-24 00:00:00 +0000

  • Approve the CL as long as it improves the overall health of the code base, even if it is not perfect.
    • If it is hard to get a CL approved, people will be less motivated to make improvements
    • Don’t require the author to polish every piece of the CL before approval
    • This is the senior principle among all of the code review guidelines.
  • Just keep in mind that if your comment is purely educational, but not critical to meeting the standards described in this document, prefix it with “Nit: “ or otherwise indicate that it’s not mandatory for the author to resolve it in this CL.
  • Any purely style point (whitespace, etc.) that is not in the style guide is a matter of personal preference. The style should be consistent with what is there. If there is no previous style, accept the author’s.
  • If the author can demonstrate (either through data or based on solid engineering principles) that several approaches are equally valid, then the reviewer should accept the preference of the author. Otherwise the choice is dictated by standard principles of software design.
  • You can validate the CL if you want—the time when it’s most important for a reviewer to check a CL’s behavior is when it has a user-facing impact, such as a UI change. It’s hard to understand how some changes will impact a user when you’re just reading the code. For changes like that, you can have the developer give you a demo of the functionality if it’s too inconvenient to patch in the CL and try it yourself.
  • Encourage developers to solve the problem they know needs to be solved now, not the problem that the developer speculates might need to be solved in the future. The future problem should be solved once it arrives and you can see its actual shape and requirements in the physical universe.
  • Compliment people in CR, especially a good answer
  • At Google, we optimize for the speed at which a team of developers can produce a product together, as opposed to optimizing for the speed at which an individual developer can write code.
  • If you are in the middle of a focused task, such as writing code, don’t interrupt yourself to do a code review.
  • to be sure that you are always making comments about the code and never making comments about the developer. You don’t always have to follow this practice, but you should definitely use it when saying something that might otherwise be upsetting or contentious. For example:
    • Bad: “Why did you use threads here when there’s obviously no benefit to be gained from concurrency?”
    • Good: “The concurrency model here is adding complexity to the system without any actual performance benefit that I can see. Because there’s no performance benefit, it’s best for this code to be single-threaded instead of using multiple threads.”
  • If you ask a developer to explain a piece of code that you don’t understand, that should usually result in them rewriting the code more clearly. Occasionally, adding a comment in the code is also an appropriate response, as long as it’s not just explaining overly complex code.
  • Explanations written only in the code review tool are not helpful to future code readers. They are acceptable only in a few circumstances, such as when you are reviewing an area you are not very familiar with and the developer explains something that normal readers of the code would have already known.
  • Experience shows that as more time passes after a developer writes the original CL, the less likely this clean up is to happen. In fact, usually unless the developer does the clean up immediately after the present CL, it never happens. This isn’t because developers are irresponsible, but because they have a lot of work to do and the cleanup gets lost or forgotten in the press of other work. Thus, it is usually best to insist that the developer clean up their CL now, before the code is in the codebase and “done.” Letting people “clean things up later” is a common way for codebase to degenerate.
  • If you previously had fairly lax code reviews and you switch to having strict reviews, some developers will complain very loudly. Improving the speed of your code reviews usually causes these complaints to fade away.
  • Sometimes it can take months for these complaints to fade away, but eventually developers tend to see the value of strict code reviews as they see what great code they help generate. Sometimes the loudest protesters even become your strongest supporters once something happens that causes them to really see the value you’re adding by being strict.
  • Priorities in order: message format/schema, tests, interface, implementation
  • Claims about speed must be backed by micro-benchmarks
  • Reviewer is the owner of the code they are reviewing
  • Favor the author’s approach if the author can prove it by qualitative or quantitative proofs
  • If something is off track early, send feedback immediately
  • Review CLs as soon as you are not in the middle of a focused task, and within 1 business day in any case.
  • Most deadlines are soft. Favor code quality over meeting soft deadline. Hard deadline include
    • Contract
    • Market share speed
    • Missing an important conference (debatable)
  • Can reject CL just because it is too large.
    • Leave refactoring in a separate CL whenever possible.
    • Small cleanup can be in the same CL
    • May do a refactor CL before the implementation
  • If a review is not constructive or polite, explain in person, or use a private email to explain in a kind way what you wish could have been done differently

Common phrases in toxic comments

Excluding second person pronouns, which are the most common symptom.

  • it is
  • that is
  • going to
  • trying to
  • to do
  • do not
  • it is not
  • this is
  • in the
  • to be
  • have to
  • there is no
  • to do with
  • the problem is
  • want to
  • of these
  • of our
  • is to
  • if we
  • depend on
  • we use
  • the cl
  • this case
  • is of
  • the project
  • the linter
  • read the
  • as long as
  • just wanted to
  • we don’t want

Push back: the perception of unnecessary interpersonal conflict while a reviewer is blocking a change request

  • Rounds of a review
  • Active reviewing time
    • not wall-clock time of start-to-finish reviews, but based on time specifically spent working on the code review and related actions.
  • Active shepherding time
    • time the author spent actively viewing, responding to reviewer comments, or working on the selected CR, between requesting the code review and actually merging the change into the code base.
  • To flag potential push back
    • 48 minutes for active reviewing time
    • 112 minutes for shepherding time
    • 9 batches of comments for number of rounds of review
  • Compared to very small code changes (less than 10 lines), changes that were between 250-999 lines long were 4.8 times more likely for authors to feel more change was requested than necessary and changes that were between 10-49 lines were 4.7 times more likely to have authors report they felt negatively about submitting a similar change in the future
  • The number of reviewers is not predictive of any of our push back feelings.
  • Some of the most commonly mentioned associated factors were delays (65), references to code style (46) and documentation (39), the tone in which feedback was delivered (43), familiarity with the codebase (35), and the CR being part of a readability review (34). The most common emergent themes which we classified as consequences were inefficient use of time (31), negative emotions (29), and offline conversations (28).

Code review at Google

  • Expectations for code review at Google do not center around problem solving. Reviewing was introduced at Google to ensure code readability and maintainability. Today’s developers also perceive this educational aspect, in addition to maintaining norms, tracking history, gate-keeping, and accident prevention. Defect finding is welcomed but not the only focus.
    • This does not align with the previous idea that CR is a group problem solving activity
  • Usually, only one reviewer is required to satisfy the aforementioned requirements of ownership and readability.
    • This does not align with previous idea that two reviewers find an optimal number of defects
  • Over 80% of all changes involve at most one iteration of resolving comments.
  • We find that the median developer authors about 3 changes a week, and 80 percent of authors make fewer than 7 changes a week. Similarly, the median for changes reviewed by developers per week is 4, and 80 percent of reviewers review fewer than 10 changes a week. In terms of speed, we find that developers have to wait for initial feedback on their change a median time of under an hour for small changes and about 5 hours for very large changes. The overall (all code sizes) median latency for the entire review process is under 4 hours. This is significantly lower than the median time to approval reported by Rigby and Bird [33], which is 17.5 hours for AMD, 15.7 hours for Chrome OS and 14.7, 19.8, and 18.9 hours for the three Microsoft projects. Another study found the median time to approval at Microsoft to be 24 hours
  • At Google, over 35% of the changes under consideration modify only a single file and about 90% modify fewer than 10 files. Over 10% of changes modify only a single line of code, and the median number of lines modified is 24. The median change size is significantly lower than reported by Rigby and Bird for companies such as AMD (44 lines), Lucent (263 lines), and Bing, Office and SQL Server at Microsoft (somewhere between those boundaries), but in line for change sizes in open source projects
  • fewer than 25% of changes have more than one reviewer, and over 99% have at most five reviewers with a median reviewer count of 1. Larger changes tend to have more reviewers on average. However, even very large changes on average require fewer than two reviewers.
  • Developers spend an average of 3.2 (median 2.6 hours a week) reviewing changes. This is low compared to the 6.4 hours/week of self-reported time for OSS projects
  • The median change size in OSS projects varies from 11 to 32 lines changed
  • During the week, 70% of changes are committed less than 24 hours after they are mailed out for an initial review.

References

Scalable microservice checklist for engineers

2019-10-17 00:00:00 +0000

This is based on the internal guide I wrote for Paypay teams. I took out data that is specific to our environment. Almost every item has a SEV 2 behind it.

Target audience

Backend engineers developing Java microservices

Java

  • GC’s stop the world (STW) is a common source of CPU spikes and latency spikes.
    • Most backend services are IO intensive instead of CPU intensive. In such cases, the CPU usage of the process should not exceed 50% of the cores.
    • Use metrics to check both young GC and old GC times and frequency. Frequent young GC is OK. Any sign of old GC requires attention
  • STW also means the timer you set at the java side could include the GC time on top of the actual execution time. Therefore, make sure you track client side AND server side processing time at the same time
  • Heap size rule of thumb: try to keep it between 4 and 8G
    • Rationale: Less than 4G, the survivor region may be too small and gets promoted to old generation too easily. Therefore, it is more likely to trigger old GC
    • Rationale: Greater than 8G, the full GC STW may be too long
  • JVM settings
    • Same -Xmx and -Xms
    • For OLTP systems, use a bigger new gen
    • 256MB is more than enough for metaspace
    • 1M max stack size is more than enough

Network

  • Make sure you understand Little’s law, and use that to estimate your connection pool size and task queue size
    • Pool’s init/min/max size should be constant, so we can warm up the pool on start
    • On db/server side, prefer smaller connection pool size. Higher than (number of cores * 3) is very debatable.
      • Rationale: your cores will be saturated with (number of cores * 3) workers already. Adding more workers will only add the context switch cost without adding breathing room for the CPU
    • Set max task queue size. Higher than (num of cores * 10) is very debatable. When the queue is full, prefer abort or discard policy.
      • Rationale: The system can not process fast enough already, adding in the task queue will not help with the task processing time
  • Do NOT make any remote/inter-process calls when you are inside a mutex. 95% of chance such code is wrong
  • The network latency between nodes in our DC is about 1-2 ms. Do not issue sequential network call, as the latency will add up quickly
  • Low timeout so we can fail early, which also protects the downstream system. Timeout should be no more than 2s on UX-impacting calls (may even consider 1s)
  • Retry no more than 3 times with random jitters to avoid thundering herd problem
  • Because of retries, the timeout of each call needs to be 1/2 or 1/3 of the client’s read timeout. Otherwise, retries have no effect because the client has already timed out
  • A timeout normally just kills the current request connection. Often the client side needs to do additional things to trigger cleanup of long running actions. The key point: don’t expect the server side to do cleanup without explicit instruction from the client side
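
Little’s law (L = λW) gives a first-cut pool size: in-flight requests = arrival rate × time each request spends in the system. The numbers below are illustrative assumptions:

```python
def littles_law_pool_size(arrival_rate_per_s, avg_latency_s):
    # Little's law: concurrent in-flight work = arrival rate * avg time in system
    return arrival_rate_per_s * avg_latency_s

# e.g., 500 req/s hitting the DB at 40 ms average needs about 20 connections,
# before adding headroom for latency spikes
estimate = littles_law_pool_size(500, 0.04)
```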

Operation

  • Health check
    • The thread answering health checks is often different from the worker threads. To prevent a half-dead process where worker threads silently keep doing work, shut down the whole process upon detecting a health check failure
    • ELB’s default check is a 10s timeout, and it registers the instance as failed after 3 misses. This is too generous if the load balancer is within 100 KM of the instances, i.e., failures may not be detected fast enough
    • In the same-region scenario, consider a health check timeout of 1s, with 3 as the failure threshold and 2 as the success threshold

RDS

  • Memory should be able to hold all indices. You can estimate the size of indices by (number of rows * avg size of primary keys)
  • Try not to use incremental id for PK on big tables, use snowflake/uid generator if possible.
    • Rationale: strict incremental id generation is NOT supported in most distributed systems. We would have to use snowflake for id generation when we migrate off RDS anyway.
  • Make sure you add quotes when filtering by varchar type. Otherwise, the index on that varchar type will not be hit
  • Solving deadlock:
    • On Repeatable-Read isolation level, a common source is the gap lock introduced by batch inserts
    • Check the execution plan of the conflicting queries. Most likely one is holding gap lock or table lock by mistake
  • Solving slow queries
    • Check query plan by explain analyze. If there is no “range” step, something is wrong.
    • Note that the “index” access type is still a full scan - it scans the entire index
    • The most reliable fix is to add an index hint. With it, you don’t even need to worry about the outdated table statistics.
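
The index-memory estimate from the first bullet of this section, as rough arithmetic (row count, key sizes, and the per-entry model are all illustrative assumptions):

```python
def index_memory_bytes(rows, avg_pk_bytes, secondary_key_bytes=()):
    # primary (clustered) index key space: rows * avg PK size;
    # each secondary index entry stores its own key plus the PK
    primary = rows * avg_pk_bytes
    secondary = sum(rows * (k + avg_pk_bytes) for k in secondary_key_bytes)
    return primary + secondary

# e.g., 100M rows, 8-byte PK, two secondary indexes with 16-byte keys
estimate = index_memory_bytes(100_000_000, 8, (16, 16))
```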

Redis

  • You don’t need redis, if
    • Your distributed lock sees no more than 100 requests per second - you can just implement the lock in the DB
      • Rationale: trade off between code complexity vs architecture complexity
    • 80% data hits less than 50k entries - you can just use local in-memory cache
      • Rationale: 50k * 1K per entry = 50M memory only
  • Redis is single threaded. This means any slower command will block everyone else.
    • For non k-v look up case, make sure your collection size is less than 200 items. Otherwise, you need to redesign your data structure
      • Rationale: avoid the big key problem

Kafka

  • Common kafka use cases:
    • Uses event sourcing or an event-driven model, or
    • Have to buffer requests to process asynchronously
  • People often use Kafka just to partition workloads. Explore other options before opting for kafka. Rationales:
    • Your scaling factor is limited by the number of topic partitions, which is very hard to scale up
    • Very hard to deploy 50+ consumer pods on the cluster, because each Spring pod is a separate process consuming > 1G memory
    • Your true bottleneck is almost always the datasink anyway
  • Most producer/consumer logics are at least once, so idempotency should be in your logic all the time
    • Kafka does have exactly once processing, but that works only when both source and sink are kafka topics
    • Most likely you need producer ack = 1
  • If you do DB write and kafka write in the same API request, you will need a recon job to recon inconsistencies. Rationales:
    • The DB write and kafka write are not in the same transaction. So all-or-nothing semantics can’t be guaranteed
    • Reliable message delivery is possible, but not trivial. The effort/reward trade off in our experience is not worth it.
  • Replication factor 3 is enough. You can even set it to 2 if losing data is acceptable, e.g., low importance logs
    • Rationale: Kafka deployment is evenly spread across 3 AZs. Replication factor 3 is enough to counter 1 AZ failure

Spring

  • Tomcat + JVM idle will consume almost 1G ram. This, combined with the process context switch costs, means that we prefer bigger but fewer pods
  • Spring AOP annotation works only in a @Bean, and only when it is called from another class. We had many cases where @Async, @Transactional annotations have no effect because of these two problems
  • Try to limit the @Transactional scope as much as you can, this means
    • Highly debatable to put @Transactional on the class level. Prefer method level annotation
    • It is normal to create a method just to open the transaction.
    • Rationale: reduce the DB transaction duration as much as we can
  • If you use @Async, make sure you override the Executor bean
    • Rationale: the default implementation is SimpleAsyncTaskExecutor, which creates a new thread for every task with no upper bound
  • For OLTP APIs affecting UX, no timeout > 5s, ideally no more than 2s
    • Rationale: if one API can’t finish under 2s, that means some systems are under stress or faulty already. We fail early to prevent cascading failures across services
  • Prefer native query to JPA/hibernate ORM. Rationale:
    • Often you need to specify index hints
    • Generated query is too long to fit into slow query log. What makes it worse is that the slow query log often truncates the WHERE condition
  • Do not include inter-service call inside transaction. Rationale:
    • Reduce DB txn duration as much as we can. This is a common source of deadlock
    • It gives people a false sense of transaction between services when there is none. You have to implement TCC/Saga yourself to have the transaction-ish semantics
  • All calls to third party APIs must be protected by a circuit breaker
    • Rationale: Fail fast, so the worker thread on our tomcat can be freed quickly

Debugging JVM memory issues

2019-10-10 00:00:00 +0000

Is the survivor space enough?

If not, then after a minor GC the objects can’t fit into the survivor space and have to be promoted to the old gen directly. Consider making the new gen 50% bigger than the old gen

Which GC to use?

ParNew for new gen, CMS for old gen.

Metaspace full

  • commonly caused by too many proxies generated by cglib
  • By default the size is about 20M, and on start this size is very likely to trigger metaspace resizes, which lead to GCs. Consider setting metaspace to min=max=256MB

Native memory usage too high

e.g., off heap memory such as DirectByteBuffer

Where are strings stored?

  • Constant strings are stored in string constant pool, which is on the heap. It is implemented as a hashtable of pointers to the strings on the heap
  • Other strings created by ctor just stay on heap as usual

Tools

  1. jmap -heap - check new and old gen sizes
  2. jmap -histo:live - what is alive, and is it too much?
  3. /proc/${PID}/fd, /proc/${PID}/task - check FD usage and thread count
  4. pstree, ss - check process creation and network connection counts

Who’s Got the Monkey

2019-10-09 00:00:00 +0000

Source

  • The time imposed by the supervisor and peers is hard to cut down. Therefore, we have to control how much time we spend on subordinates, and many people spend way more than they prefer or even realize
  • Progress reports should come from the subordinates. This also means a manager should try not to “get back to you later” - it is now or never. In case a decision does take time, make that decision in a scheduled meeting with the owner together, ideally face-to-face.
    • At the end of such meeting, the problem owner (not the manager) and next meeting time must be explicitly set
    • This meeting should not take > 15 mins. Longer than that means the subordinate is not prepared
    • The manager should NEVER make a decision on a subordinate’s problem alone!

On a side note, build a universal accountability culture to shorten the request-response time and feedback loop, i.e., everyone should be able to hold each other accountable