On SSH

2017-11-01 00:00:00 +0000

  1. Client’s private key (identity key, e.g., id_rsa) stays with the client and should never be distributed. The client’s public key, e.g., id_rsa.pub, goes to the server, where it is stored in the server’s authorized_keys file

  2. The passphrase is used to encrypt/decrypt the client’s private key

  3. In a cluster setup, sharing a host key between hosts may be acceptable

  4. The host key is the SERVER’s key pair; its public half is what clients verify. It lives in the ssh_host_<rsa/dsa/ecdsa/ed25519>_key(.pub) files on the server, and in known_hosts on the client once the client has connected to the server.

  5. known_hosts entries are of the format [host],IP ALGO key - i.e., the host name and IP, the key algorithm (such as ssh-ed25519), then the base64-encoded host public key

  6. Login is accepted if the client proves that it knows the private key and the corresponding public key is in the authorized_keys file

CODE FESTIVAL 2017 Qualification Round C

2017-10-22 00:00:00 +0000

B: Similar Arrays

I used the brute force approach, which is slower to implement.

Note that the product is even as long as there is at least one even number in the array. It is easier to count the complement => i.e., arrays where all numbers are odd!

C: Inserting ‘x’

I calculated the palindrome length and then calculated how many x’s should be inserted - this got WA because I didn’t consider cases like AXBXXCCXBXXA!!!

Instead, we can just modify the standard palindrome check slightly: when we see a mismatch between the current head and tail, the rule tells us exactly what the optimal move is (if one of the two characters is an ‘x’, insert an ‘x’ on the other side; if neither is, the answer is -1). This greatly simplifies the code.

Expanding on this: what if we can insert any letters - can it be done in O(n)?

Idea 1

Construct t = reverse(s) + $ + s, and then run the z-function on t; if i + z[i] == length(t), then we know the suffix of t starting at i corresponds to a palindromic suffix of s!
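A minimal sketch of this idea (helper names mine): run the standard z-function on t, then read off palindromic suffixes of s.

// Standard z-function: z[i] = length of the longest common prefix of t and t[i..].
static int[] zFunction(String t) {
    int n = t.length();
    int[] z = new int[n];
    z[0] = n;
    for (int i = 1, l = 0, r = 0; i < n; i++) {
        if (i < r) z[i] = Math.min(r - i, z[i - l]);
        while (i + z[i] < n && t.charAt(z[i]) == t.charAt(i + z[i])) z[i]++;
        if (i + z[i] > r) { l = i; r = i + z[i]; }
    }
    return z;
}

// With t = reverse(s) + "$" + s, the suffix of s starting at j is a palindrome
// exactly when i + z[i] == t.length() for i = s.length() + 1 + j.
static boolean suffixIsPalindrome(String s, int j, int[] z) {
    int i = s.length() + 1 + j;
    return i + z[i] == 2 * s.length() + 1;
}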

Idea 2

Construct a rolling hash of each suffix/prefix, and compare each (i, n-i-1) pair; this should be close to O(n), as long as our hash function does not have too many collisions!
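A minimal single-modulus sketch of Idea 2 (constants and names mine; a production version would double-hash to reduce collision risk):

// A suffix s[i..] is a palindrome iff its forward hash equals the hash of the
// same characters read backwards, i.e., of reverse(s)'s prefix of the same length.
static boolean[] palindromicSuffixes(String s) {
    int n = s.length();
    final long MOD = 1_000_000_007L, B = 131;
    long[] pow = new long[n + 1], fwd = new long[n + 1], rev = new long[n + 1];
    pow[0] = 1;
    for (int i = 0; i < n; i++) {
        pow[i + 1] = pow[i] * B % MOD;
        fwd[i + 1] = (fwd[i] * B + s.charAt(i)) % MOD;          // hash of s[0..i]
        rev[i + 1] = (rev[i] * B + s.charAt(n - 1 - i)) % MOD;  // hash of reverse(s)[0..i]
    }
    boolean[] isPal = new boolean[n];
    for (int i = 0; i < n; i++) {
        int len = n - i;
        long suffixHash = ((fwd[n] - fwd[i] * pow[len]) % MOD + MOD) % MOD;
        isPal[i] = suffixHash == rev[len];   // rev[len] is s[i..n-1] read backwards
    }
    return isPal;
}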

Spring Cloud Eureka on ECS

2017-10-15 00:00:00 +0000

How the EIP is bound via DNS

  • Spring-Cloud-Eureka-Server has its own ServerBootstrap. This means the bootstrap logic from vanilla Eureka-Server will be ignored
  • Spring-Cloud by default sets the binding strategy to EIP, i.e., we will be using EIPManager from vanilla Eureka
  • To use DNS resolving, we need to set shouldUseDnsForFetchingServiceUrls to true in the client config.
  • The DNS name we need to config on the client side is
clientConfig.getEurekaServerDNSName()
  • EIPManager looks for the record “txt.YOUR_REGION.SERVICE_DNS_NAME” of TXT type, and gets a list of CNAMEs from the value of the TXT field. Note it expects the first part of each CNAME to be your zone name
  • For each CNAME record, EIPManager will look up the “txt.”CNAME TXT field for another list of CNAMEs, and construct the service URL in the form of
 String serviceUrl = "http://" + ec2Url + ":"
                            + clientConfig.getEurekaServerPort()
                            + "/" + clientConfig.getEurekaServerURLContext()
                            + "/";

Note EurekaServerURLContext - we need to configure it in the client config

  • To translate a service URL from DNS into an EIP, note that it is brute-force string parsing - take the substring between “ec2-” and “.yourregion.compute”, and replace every - in it with . (see the sketch after this list)
  • To bind EIP, we need to configure
String aWSAccessId = serverConfig.getAWSAccessId();
String aWSSecretKey = serverConfig.getAWSSecretKey();

in the server config
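A hypothetical re-creation of that parsing (my code, not the actual EIPManager source):

// Recover the EIP from an EC2 public hostname, e.g.
// "ec2-52-1-2-3.us-west-2.compute.amazonaws.com" -> "52.1.2.3".
static String eipFromHostname(String host) {
    int start = host.indexOf("ec2-") + "ec2-".length();
    int end = host.indexOf('.', start);                   // first dot ends the IP portion
    return host.substring(start, end).replace('-', '.');
}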

  1. Note that an EIP is a public IPv4 address, but we really want to shield Eureka from internet traffic. Instead, we can put the Eureka instances inside a private subnet while still using the public host name. Communication from within the VPC still works, because AWS resolves the public DNS host name to the private IPv4 address

On ECS

A working config for a Eureka cluster deployed on AWS

spring:
  profiles:
    active: NotDefault #to trigger the @Bean injection on aws aware context
server:
  port: 8761

eureka:
  environment: aws #shown in the Eureka UI
  client:
#I chose true for the cluster case, so that the cluster keeps receiving heartbeats and some false alarms are not raised
    registerWithEureka: true 
    fetchRegistry: true
    region: us-east-1
    eurekaServerURLContext: eureka
    eurekaServerPort: 8761
    useDnsForFetchingServiceUrls: true
    eurekaServerDNSName: YOUR.DOMAIN.IN.ROUTE.53
    availabilityZones:
      us-east-1: us-east-1a, us-east-1b

  server:
    waitTimeInMsWhenSyncEmpty: 0
    #Note we need to set server.aWSAccessId and server.aWSSecretKey, so that the instance can bind to an EIP
    #They are omitted here for confidentiality
    #Alternatively, you can just set environment variable eureka.server.aWSAccessId and 
    #eureka.server.aWSSecretKey
  region: us-east-1
  datacenter: cloud
  us-east-1:
    availabilityZones: us-east-1a, us-east-1b

Problems I ran into when packing Eureka inside Docker

  1. Running java -jar eureka.jar does not trigger the injection at @Bean. I still don’t understand why; most likely because of a missing jar in the classpath
  2. Then I tried running the gradle wrapper directly inside the container, but the minimal Java docker image does not have bash, which the gradle wrapper requires
  3. So I changed the base image to centos 7, but JAVA_HOME is not set, so I had to add a CMD to install java and set JAVA_HOME explicitly
  4. Need to set the IP and host name explicitly during server bootstrap. Otherwise, the host will register with the internal host name instead of the public one. Such a discrepancy will cause Eureka to report replicas as unavailable, because the registered host name differs from the EIP host name stored in Route 53

When a Eureka instance is on ECS, it will use the Docker container ID as the host name to register. Therefore, service consumers will not be able to resolve it

We have to inject an EurekaInstanceConfigBean ourselves, and manually set the host name and port to be registered on Eureka. Note that this bean lives on the client side

    @Bean
    @Profile("!default")
    public EurekaInstanceConfigBean eurekaInstanceConfigBean(InetUtils utils) {
        final int  managementPort = YOUR_SERVICE_PORT;

        log.info("Setting AmazonInfo on EurekaInstanceConfigBean");
        final EurekaInstanceConfigBean instance = new EurekaInstanceConfigBean(utils) {

	   //needed only when Eureka server instance binds to EIP
            @Scheduled(initialDelay = 10000L, fixedRate = 30000L)
            public void refreshInfo() {
                log.debug("Checking datacenter info changes");
                AmazonInfo newInfo = AmazonInfo.Builder.newBuilder().autoBuild("eureka");
                if (!this.getDataCenterInfo().equals(newInfo)) {
                    log.info("Updating datacenterInfo");
                    ((AmazonInfo) this.getDataCenterInfo()).setMetadata(newInfo.getMetadata());
                }
            }

            private AmazonInfo getAmazonInfo() {
                return (AmazonInfo) getDataCenterInfo();
            }

            @Override
            public String getHostname() {
                AmazonInfo info = getAmazonInfo();
                final String publicHostname = info.get(AmazonInfo.MetaDataKey.publicHostname);
                return this.isPreferIpAddress() ?
                        info.get(AmazonInfo.MetaDataKey.localIpv4) :
                        publicHostname == null ?
                                info.get(AmazonInfo.MetaDataKey.localHostname) : publicHostname;
            }

            @Override
            public String getHostName(final boolean refresh) {
                return getHostname();
            }

            @Override
            public int getNonSecurePort() {
                return managementPort;
            }

            @Override
            public String getHomePageUrl() {
                return super.getHomePageUrl();
            }

            @Override
            public String getStatusPageUrl() {
                String scheme = getSecurePortEnabled() ? "https" : "http";
                return scheme + "://" + getHostname() + ":"
                        + managementPort + getStatusPageUrlPath();
            }

            @Override
            public String getHealthCheckUrl() {
                String scheme = getSecurePortEnabled() ? "https" : "http";
                return scheme + "://" + getHostname() + ":"
                        + managementPort + getHealthCheckUrlPath();
            }
        };

        AmazonInfo info = AmazonInfo.Builder.newBuilder().autoBuild("cloudconfig");
        instance.setDataCenterInfo(info);

        return instance;
    }


Config Related Code in Spring Cloud Netflix - Eureka

2017-10-13 00:00:00 +0000

Server Dependency

Your Eureka project -> spring-cloud-starter-eureka-server -> spring-cloud-starter-netflix-eureka-server, which depends on

  1. spring-cloud-starter
  2. spring-cloud-netflix-eureka-server
  3. spring-cloud-starter-netflix-archaius
  4. spring-cloud-starter-netflix-ribbon
  5. com.netflix.ribbon.ribbon-eureka

Note that there is no concrete code until spring-cloud-netflix-eureka-server

During Server Bootstrap

  1. Look for spring.profiles.active
  2. Look for spring.profiles.default
  3. Look for eureka.datacenter
  4. Look for eureka.environment
  5. Check if applicationInfoManager’s dataCenterInfo is AWS/Amazon; if so, start the AWS binder. Note that it defaults to the EIP binding strategy
  6. For other settings, check ServerConfigBean, which is populated via reflection/injection. AVOID THE VANILLA EUREKA CONFIG PATH FROM THE NETFLIX DOC!

Spring Cloud Eureka Client Config

  1. The client will look for the AZs of its region; if no AZ is defined, it will use “defaultZone”
  2. The client will look for serviceUrl.”myZone”; if that is undefined, it will look for serviceUrl.defaultZone
  3. serviceUrl defaults to “defaultZone” -> “http://localhost:8761/eureka”. That is why a local run can do without setting a default URL
  4. For other settings, check ClientConfigBean, which is populated via reflection/injection. AVOID THE VANILLA EUREKA CONFIG PATH FROM THE NETFLIX DOC!

Note

  1. In my Eureka project, application.yml and application.properties can co-exist in src/main/resources and both will be read. I use application.properties to configure the logger, which is logback
  2. Following the guide, the AWS-specific init is placed in the main class of my Eureka server. Note the example has a @Profile(“!default”) flag, i.e., you need to set spring.profiles.active to something other than “default”, either via -D arguments or application.properties

AWS specific code in Spring Cloud Netflix - Eureka

2017-10-11 00:00:00 +0000

Note that Spring Cloud Eureka adds a layer of abstraction on top of vanilla Netflix Eureka, even though they share similar names such as client/server/registry etc. Therefore, when navigating through the code, I often run into definitions embedded in the vanilla jar

Eureka Client - Client config

  1. Default region is us-east-1

  2. serviceUrl is an az -> fully-qualified-URL map; the default mapping is “defaultZone” -> “http://localhost:8761/eureka”

  3. availabilityZones is a region -> az map. Default is empty

Eureka Client - Instance config

  1. aSGName - the autoscaling group associated with this instance

  2. dataCenterInfo - the data center where this instance is deployed - defaults to DataCenterInfo.Name.MyOwn

Eureka Server - Bootstrap

  1. awsBinder - during serverContext init, we check if applicationInfoManager.getInfo() is from aws
	protected boolean isAws(InstanceInfo selfInstanceInfo) {
		boolean result = DataCenterInfo.Name.Amazon == selfInstanceInfo
				.getDataCenterInfo().getName();
		log.info("isAws returned " + result);
		return result;
	}

Eureka Server - Server Config

  1. aWSAccessId

  2. aWSSecretKey

  3. aWSBindingStrategy, default EIP

Eureka Server - Controller

  1. In header
     if (info.getName() == DataCenterInfo.Name.Amazon) {
             AmazonInfo amazonInfo = (AmazonInfo) info;
             model.put("amazonInfo", amazonInfo);
             model.put("amiId", amazonInfo.get(AmazonInfo.MetaDataKey.amiId));
             model.put("availabilityZone",
                     amazonInfo.get(AmazonInfo.MetaDataKey.availabilityZone));
             model.put("instanceId", amazonInfo.get(AmazonInfo.MetaDataKey.instanceId));
         }
    
  2. When populating apps on the controller, it does a similar check as above

Outstanding Questions

  1. How is an instance of ApplicationInfoManager populated?
  2. In the Eureka client, how is the region info populated?
  3. Why is my Spring Cloud Eureka not logging while the original is?

Round 439

2017-10-07 00:00:00 +0000

869A - The Artful Expedient

Claim: Karen always wins.

Proof (sketch): count the pairs (i, j) whose xor x(i) xor y(j) appears in the combined set. If x(i) xor y(j) = x(k), then x(k) xor y(j) = x(i), which is also in the set, pairing (i, j) with (k, j); symmetrically when the xor equals some y(k). Since all 2n integers are distinct and positive, this pairing has no fixed point, so the counted pairs come in twos: the count is always even, and Karen (who wins on an even count) wins.

On Netflix Eureka

2017-10-06 00:00:00 +0000

Deploying a new mid-tier service on AWS without Eureka

This means our service does not face clients on the internet, i.e., it is not an edge service. Instead, its customers are other services.

  1. Add a DNS entry, i.e., in Route 53, so that traffic within the same AZ can be routed within the AZ, e.g., inter-service calls from within the same AZ. The name is the single endpoint exposed to other services; the value is the internal load balancer name.

  2. Add an internal load balancer, whose endpoint is the value of the DNS name in the previous step. This serves the standard LB purposes within the AZ

  3. Add an auto scaling group (ASG) to host all service instances, so that # of instances can be scaled based on traffic

Note that these steps must be performed for EVERY SINGLE service, i.e., each service will have 3 fixed points (DNS, LB, ASG) that the service’s team has to manage.

Deploying a new mid-tier service on AWS with Eureka

  1. After Eureka is up, the service just uses Eureka’s API to register its instances. Moreover, given the spring-cloud-netflix suite, such registration is handled automatically by the client => the service team doesn’t need to worry about any fixed point, except its own stable service name!

  2. The only fixed points Eureka ops need to maintain are directly Eureka-related and have nothing to do with the registered services, i.e., Eureka is like a service database, and ops don’t worry about the data inside it, only the database parameters

Notes

  1. The Eureka client uses client-side load balancing, whereas ELB is traditional proxy-based load balancing, i.e., all traffic goes through the ELB
  2. Netflix prefers stateless servers, i.e., centralizing all state (“fixed points”) into DBs
  3. Registrations may happen in an orphaned server, so some clients may see new registrations while others may not.
  4. Eureka doesn’t seem to maintain persistent state itself; instead, the registry is constructed purely from the heartbeats of services, both between Eureka servers and from application servers.
  5. Zookeeper is a CP system. As a comparison, Eureka is an AP/BASE system, since service discovery is a fundamental service that must stay available.

HA Eureka in AWS

  1. One ASG for a Eureka cluster within a region
  2. One Elastic IP (EIP) for every server in the cluster. A Eureka server will find an unused EIP and bind it to itself during startup, after you configure Eureka with the list of EIP addresses
  3. Add DNS TXT records for the whole region; the records should be of the format
txt.region1.eureka-server-domain-name="az1.eureka-server-domain-name" "az2.eureka-server-domain-name" "az3.eureka-server-domain-name" 
txt.az1.eureka-server-domain-name="domain-with-EIP1-in-zone1" "domain-with-EIP2-in-zone2"
txt.az2.eureka-server-domain-name="domain-with-EIP3-in-zone2"
txt.az3.eureka-server-domain-name="domain-with-EIP4-in-zone3"

By default, the Eureka client searches for the property file eureka-client.properties on the classpath. In eureka-client.properties:

eureka.shouldUseDns=true
eureka.eurekaServer.domainName=mydomaintest.netflix.net
eureka.eurekaServer.port=7001
eureka.eurekaServer.context=eureka/v2

Note that values within a TXT record are space-separated

You can also hardcode Eureka instances in the Eureka config files, but this means the file content will be baked into the AMI and distributed to ALL instances!

MemSQL Start[c]UP 3.0 Round 2

2017-10-04 00:00:00 +0000

865A: Save the problem!

Official solution: suppose we have only denominations 1 and 2; then given amount A, we have exactly A/2 + 1 ways to form it, because the only replacement we can do is swapping two 1s for a 2. For example, A = 5 gives 5/2 + 1 = 3 ways: 1+1+1+1+1, 1+1+1+2, 1+2+2.

This is way quicker than my reverse dp approach during the contest!!!

865B: Ordering Pizza

I tried a two-pointer approach, and it got TLE!!! Very likely because I put the # of pizzas in an infinite loop.

Official solution:

To simplify the implementation, we add a dummy person that eats all leftover slices without gaining any happiness. Also, there is at most one pizza shared by people preferring different types. So there are 2 potential cases, one for each type of pizza. We just try both and pick the better one.

865D: Buy Low Sell High

Every day, we gain one option to buy, and if today’s price is higher than some option’s price, we know we can sell an option at today’s price. But we don’t know if it is the best time to sell, so we sell and give ourselves an option to buy back at the price we just sold at. This transformation works because doing nothing on a day <=> selling and buying on the same day. See the sketch below.
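A minimal sketch of this idea with a min-heap of buy options (my code, not the editorial’s):

import java.util.PriorityQueue;

static long maxProfit(int[] prices) {
    PriorityQueue<Integer> options = new PriorityQueue<>(); // min-heap of open buy options
    long profit = 0;
    for (int p : prices) {
        options.add(p);                   // every day adds one option to buy
        if (p > options.peek()) {
            profit += p - options.poll(); // sell the cheapest option at today's price...
            options.add(p);               // ...and allow buying it back at today's price,
        }                                 // which models deferring the real sell
    }
    return profit;
}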

On TCP

2017-10-03 00:00:00 +0000

  1. Even though we casually call TCP units “packets”, the correct term is segment for TCP, datagram for IP, and frame for the link layer

  2. The length of the data section is not in the TCP segment header, and thus needs to be inferred: payload length = IP total length - IP header length - TCP header length (data offset × 4)

  3. During the 3-way handshake of establishing connection, the initial sequence number on either side is randomly chosen

  4. During the 4-way handshake of connection termination, each side terminates independently. Also, the side that sends the first FIN will wait for a timeout period after sending the final ACK before closing the connection, so that the local port is unavailable for new connections during that period. An optimization turns the handshake into 3 ways, by letting the receiving end send FIN & ACK together, since they are neighboring steps.

  5. Most implementations map a session to an OS socket (owned by a process), with address and port (0 to 65535) as the identifier. Note that the port exists only at the TCP layer; the IP address only at the IP layer.

  6. The window field specifies the # of bytes the other side is willing to receive

DR Drills with AWS Aurora

2017-10-01 00:00:00 +0000

Objectives

  1. Verify that our services on Aurora can still perform within SLA under degraded Aurora service
  2. Build tools and procedures for such drills, so that we can repeat them against other services on different platforms.

Note

  1. All drills should happen in the context of our performance-testing load, i.e., with new traffic coming in.
  2. Upon injected failure, Aurora will just restart on the spot instead of failing over; this means the service will wait until the master recovers
  3. Between AZs, the replica lag is around 20ms. However, this only means lag in the cache; because Aurora uses a shared-disk design, the data is always consistent.
  4. Failover times are typically 60-120 seconds. This means most connections will time out during failover
  5. Aurora also has the capability to simulate disk failure and disk congestion, but drilling on those brings debatable additional value until we gain more experience

Crash dispatcher on writer

  1. master: create db instance
  2. read replica: Read replica has been disconnected from master. Restarting Mysql => create db instance

Crash instance on writer

  1. master: DB instance restarted
  2. read replica: DB instance restarted

Crash node on writer

  1. master: DB instance restarted
  2. read replica: Read replica has fallen behind the master too much. Restarting Mysql => DB instance restarted

Failover

  1. old master:
a. Started cross AZ failover to DB instance
b. A new writer was promoted. Restarting database as a reader.
c. DB instance restarted
d. Completed failover to DB instance
  2. new master:
a. Started cross AZ failover to DB instance
b. DB instance shutdown
c. DB instance restarted	
d. Completed failover to DB instance

Schedule

  1. Such drills should happen once a month, before each major version release.
  2. The drill should start during low-traffic hours, e.g., 2am local time

Drill 1: Failover

  1. Ensure traffic is going through our service. Either through traffic replication or load testing tool
  2. Failover the current writer to a reader in a different AZ
  3. During the failover, the service health check should remain OK the whole time
  4. During the failover, write failure is expected, but read failure should not happen

Drill 2: Deleting a read replica

  1. Ensure traffic is going through our service. Either through traffic replication or load testing tool
  2. Ensure we have at least 2 healthy aurora instances running
  3. Pick a read replica and delete it
  4. During the failover, the service health check should remain OK the whole time
  5. During the failover, neither write nor read failures should happen
  6. Create a new read replica off the current writer

Drill 3: Deleting the current writer

  1. Ensure traffic is going through our service. Either through traffic replication or load testing tool
  2. Ensure we have at least 2 healthy aurora instances running
  3. Pick the current writer and delete it
  4. During the failover, the service health check should remain OK the whole time
  5. During the failover, write failure is expected, but read failure should not happen
  6. Create a new read replica off the current writer

CS Academy Round 50: Min Races

2017-09-29 00:00:00 +0000

I solved it during the contest, but the official explanation is cleaner than mine.

Official solution

Sort all drivers by their final rank, i.e., we get a list of their classes

Claim: the answer to the problem = the length of the longest decreasing subsequence.

Proof: This is actually Dilworth’s theorem: in any finite partially ordered set, the maximum number of elements in any antichain equals the minimum number of chains in any partition of the set into chains.

Our greedy construction actually serves as the proof.
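A minimal O(n log n) sketch of the quantity in the claim (my code): the longest strictly decreasing subsequence, computed as an LIS on the negated sequence.

static int longestDecreasing(int[] a) {
    int len = 0;
    int[] tails = new int[a.length];      // tails[k] = smallest tail of an increasing run of length k+1 on -a
    for (int x : a) {
        int v = -x, lo = 0, hi = len;
        while (lo < hi) {                 // lower bound: first index with tails[i] >= v
            int mid = (lo + hi) >>> 1;
            if (tails[mid] < v) lo = mid + 1; else hi = mid;
        }
        tails[lo] = v;
        if (lo == len) len++;
    }
    return len;
}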

Codeforces Notes

2017-09-26 00:00:00 +0000

855B: Marvolo Gaunt’s Ring

The official solution uses a multi-stage DP approach

v[0][0] = p * a[0]
v[0][1] = (p + q) * a[0]
v[0][2] = (p + q + r) * a[0]

LP(i , 1, n) {
	v[i][0] = max(v[i-1][0], p *a[i]) 
	v[i][1] = max(v[i-1][1], q * a[i] + v[i][0])
	v[i][2] = max(v[i-1][2], r * a[i] + v[i][1])
}

return v[n-1][2];

855C: Helga Hufflepuff’s Cup

I got the DP states during the contest, but I got stuck calculating the ways to distribute max-security nodes among the children!!!

Consider a node with n children; the # of ways to distribute k top-security nodes among them is

ways(childI, type, secN) = sum over j of ways(childI - 1, type, secN - j) * f(childI, type, j)

Note the runtime/memory is OK because secN <= 10

864E: Fire

Obviously, if we save 2 items in the final solution, we should greedily save the one with smaller t(i) first. So we sort by t(i); the final solution must then be a subsequence of this sorted array

Now we have 2 states to worry about: current time and value. I got stuck on how to handle it!!!

Note that the value range is small, so we can encode it as part of the DP state, i.e., the knapsack idea. See the sketch below.

minT(i, v) = minimum time to save value v among the first i items

Alternatively, we can use maxV(i, t) too, but the value range for v is 400, much smaller than the t range
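A minimal sketch of the minT DP (naming mine: t[i] = time to save item i, d[i] = the moment it burns, p[i] = its value; items assumed pre-sorted by d, reconstruction of the chosen set omitted):

import java.util.Arrays;

static int maxSavedValue(int[] t, int[] d, int[] p) {
    int total = Arrays.stream(p).sum();
    final int INF = Integer.MAX_VALUE / 2;
    int[] minT = new int[total + 1];            // minT[v] = min total time to save value v
    Arrays.fill(minT, INF);
    minT[0] = 0;
    for (int i = 0; i < t.length; i++) {
        for (int v = total; v >= p[i]; v--) {   // iterate v downward: 0/1 knapsack
            int finish = minT[v - p[i]] + t[i];
            if (finish < d[i])                  // must finish strictly before the item burns
                minT[v] = Math.min(minT[v], finish);
        }
    }
    int best = 0;
    for (int v = 0; v <= total; v++) if (minT[v] < INF) best = v;
    return best;
}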

On HA of AWS Mysql RDS and Aurora

2017-09-25 00:00:00 +0000

MySQL RDS

In multi-AZ deployments for MySQL, data in the primary DB is synchronously replicated to a standby instance in a different AZ. Failover is automatic, and the endpoint remains the same. Backups are taken from the standby, so that I/O is not suspended on the primary

Note that a read replica, a different concept from the multi-AZ standby, uses MySQL’s async replication

Aurora RDS

Storage is handled by the storage layer, i.e., a shared-disk approach, and is replicated across 3 AZs, each with 2 copies.

Aurora uses an SSD-backed virtualized storage layer.

Each write has to be acknowledged by 4 out of 6 copies (sync), while reads cover 3 (4 + 3 > 6). This is to ensure

  1. If 1 AZ + 1 machine is down, can still read
  2. If lost 2 replicas, even if they are in the same AZ, can still write

For each source DB cluster, you can have only one cross-region Read Replica DB cluster. Data transfer between Regions incurs an extra charge.

It is possible to promote the cross-region replica to a standalone cluster, but this also means we need to add a new cross-region replica to the newly promoted master - the old master will be on a split brain, and cannot be re-incorporated!

Also note that for EACH region, Aurora gives 2 endpoints: one read endpoint and one write endpoint. It is not smart enough to do automatic load balancing!

CODE FESTIVAL 2017 qual A

2017-09-24 00:00:00 +0000

C: Palindromic Matrix

I used case analysis to solve it. The official solution uses brute force

Given H and W, we can tell how many positions have 4 mirrors, 2 mirrors, and 1 mirror

So for each letter, we try to fit it into 4-mirror, 2-mirror, and 1-mirror positions, in that order, one by one

The answer is no if we run out of possible 1-mirror positions. Otherwise, yes.

D: Four Coloring

  1. Convert Manhattan distance to Chebyshev distance

  2. Divide the converted grid into d * d squares

  3. Assign the squares 4 colors in a repeating 2 × 2 pattern. It is correct because each pair at distance exactly d lands in neighboring converted squares (see the sketch below)
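A minimal sketch of the construction (the coordinate convention and color letters are my assumption):

// Points at Manhattan distance d differ by exactly d in u = x + y or v = x - y,
// so their d-blocks differ by exactly 1 in that coordinate and get different colors.
static char color(long x, long y, long d) {
    long u = x + y, v = x - y;                         // Manhattan -> Chebyshev
    long bu = Math.floorDiv(u, d) & 1;                 // parity of the d-block along u
    long bv = Math.floorDiv(v, d) & 1;                 // parity of the d-block along v
    return "RYGB".charAt((int) (bu * 2 + bv));         // repeating 2x2 color pattern
}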

Distributed Systems Testing - from TiDB talk

2017-09-22 00:00:00 +0000

The original talk is in Chinese

  1. Profile everything, even in production - to catch once-in-a-lifetime bugs

  2. Tests may make your code less beautiful - you may need to add members to structs just for testing, but we still need to design for tests

  3. Fault injection: to test network failure cases, try to automate them without human intervention - otherwise it is inefficient

  4. Disks fail at a rate > 8% after 3 years

  5. Important to monitor NTP; detect clocks jumping back => normally bad!!

  6. Reading data from disk without a checksum => no protection against potential data corruption

  7. Fault injection: disk errors, network card, CPU, clock, file system, network & protocol => you need to simulate everything so that you can inject errors

  8. Common tools: libfiu, OpenStack fault injection, Jepsen (the most famous)

  9. FoundationDB limitation: its fake multi-process simulation does not work well with languages whose single thread is in effect multi-threaded (e.g., Go channels)

  10. TiKV uses namazu. Planning to introduce OpenTracing (in Go) to fill a role similar to Google’s Dapper

  11. Don’t test failure cases by triggering real failures; use your simulation layer instead

My TiDB benchmark runs

2017-08-21 00:00:00 +0000

Documenting the results here since there are not many English sources on this

Setup

6 r3.2xlarge (8vCPU, 61 G ram). 3 act as PD/TiDB, 3 act as TiKV. All TiDB/TiKVs run on mounted instance storage SSD

2 c4.2xlarge (8vCPU, 15G ram) as test driver servers.

All 8 servers are within the same availability zone. Ping between servers is around 0.15ms

Sysbench: Oltp_read_only

A. Prepare data: 16 tables, each with 1M rows

./sysbench ./lua/oltp_read_only.lua         --mysql-host=$TIDB1 --mysql-port=4000 --mysql-user=root --mysql-password=""         --mysql-db=test --tables=16 --table-size=1000000         --report-interval=10         --threads=10 --time=0  prepare

B. On the first test driver server, run

./sysbench ./lua/oltp_read_only.lua         --mysql-host=$TIDB1 --mysql-port=4000 --mysql-user=root --mysql-password=""         --mysql-db=test --tables=16 --table-size=1000000         --report-interval=10         --threads=10 --time=0         run

C. On the second test driver server, run

./sysbench ./lua/oltp_read_only.lua         --mysql-host=$TIDB2 --mysql-port=4000 --mysql-user=root --mysql-password=""         --mysql-db=test --tables=16 --table-size=1000000         --report-interval=10         --threads=10 --time=0         run

Note that the two test servers use different --mysql-host values in an attempt to balance network traffic on the TiDB/PD servers

Result: Combined QPS from the test servers around 12.5 k. 95% latency between 25-39 ms

Sysbench: Oltp_insert

A. Prepare data: same command as oltp_read_only: 16 tables each with 1M rows

B. On the first test driver server, run

./sysbench ./lua/oltp_insert.lua         --mysql-host=$TIDB1 --mysql-port=4000 --mysql-user=root --mysql-password=""         --mysql-db=test --tables=16 --table-size=1000000         --report-interval=10         --threads=20 --time=0         run

C. On the second test driver server, run

./sysbench ./lua/oltp_insert.lua         --mysql-host=$TIDB2 --mysql-port=4000 --mysql-user=root --mysql-password=""         --mysql-db=test --tables=16 --table-size=1000000         --report-interval=10         --threads=20 --time=0         run

Again, the two test servers use different --mysql-host values in an attempt to balance network traffic on the TiDB/PD servers

Result: Combined TPS from the two test servers around 4.5 k. 95% latency around 14 ms

Findings from sysbench runs

  1. When I run the same commands with --range_selects=false, the combined QPS is around 17k, with 95% latency between 14-22 ms

  2. Among the 4 range queries in oltp_read_only.lua, only execute_distinct_ranges is slower. Such behaviour is expected, given the cost of the distinct_ranges query.

  3. The latency/QPS improves almost linearly as I take out the range queries one by one, which is expected

TPCC: latency test

I populated the data with the simplest load

./tpcc_load -h 127.0.0.1 -P 4000 -d tpcc1000 -u root -p "" -w 1

and run a simple test

./tpcc_start -h 127.0.0.1 -P 4000 -d tpcc1000 -u root -w 1 -c 1 -r 10 -l 40

Results

 10, trx: 80, 95%: 59.947, 99%: 62.177, max_rt: 62.718, 80|54.310, 8|61.505, 8|144.893, 9|304.212
  20, trx: 86, 95%: 59.696, 99%: 60.778, max_rt: 61.960, 83|55.630, 8|59.896, 8|142.030, 8|290.288
  30, trx: 81, 95%: 59.911, 99%: 62.270, max_rt: 63.184, 85|55.237, 9|61.069, 8|144.407, 8|300.596
  40, trx: 74, 95%: 61.437, 99%: 62.344, max_rt: 64.457, 75|54.663, 7|58.993, 8|145.278, 9|308.553

Findings from TPCC runs

  1. When I deploy the 3 instances across AZs but within the same region, the 95th percentile jumps to about 110ms

  2. Because of MVCC, if the number of warehouses is too low (< 10 * the number of threads) we will see unable-to-update errors from time to time

  3. Even though TPCC is the established benchmark, both AliSQL and TiDB primarily use sysbench

Expected Values in Codeforces

2017-08-21 00:00:00 +0000

Easy problems in the last two contests, but I didn’t solve them quickly because of formulas with expected values

839C: Journey

Idea 1

ans = sum over leaves of Pr(leaf) * Depth(leaf). Pr(leaf) can be calculated in DP steps, while we can calculate Depth(leaf) in a separate DFS, when we reach the leaf

Idea 2

For each subtree, ans(root) = sum over children of ans(child)/k + 1, where k is the number of children. This formula works because every step down increases the depth by 1, so we can abstract all the 1s out

841C: Leha and function

The intuition is obvious, but how do we prove it? I used experimentation to verify the greedy solution - a slow approach.

For any K-size subset of [1..N], we can use a bijection to represent it as a list of deltas d(1), d(2), ….. d(k), d(k+1), where d(1) = a(1), d(i) = a(i) - a(i-1) for 1 < i <= k, and d(k+1) = N + 1 - a(k), so the sum of all d(i) is N + 1

Since this applies to any K-size subset of [1..N], we know Ex(d(1)) + Ex(d(2)) + … + Ex(d(k+1)) = N + 1, by linearity of expectation

The hard part is to prove Ex(d(1)) = (N + 1) / (K + 1)

Idea 1

Use the hockey-stick identity!

sum over i in [r, n] of C(i, r) = C(n + 1, r + 1)
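For completeness, a worked derivation (my notation): d(1) is the minimum of the K chosen values, so Pr(d(1) >= v) = C(N - v + 1, K) / C(N, K), and summing tail probabilities with the hockey-stick identity gives

\mathbb{E}[d_1] = \sum_{v \ge 1} \Pr(d_1 \ge v)
              = \sum_{v=1}^{N-K+1} \frac{\binom{N-v+1}{K}}{\binom{N}{K}}
              = \frac{\sum_{m=K}^{N} \binom{m}{K}}{\binom{N}{K}}
              = \frac{\binom{N+1}{K+1}}{\binom{N}{K}}
              = \frac{N+1}{K+1}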

Idea 2

Note that if we swap any d(i) and d(j) and keep the others the same, the corresponding subset is still valid, i.e., d(i) and d(j) follow the same distribution => Ex(d(i)) = Ex(d(j))

835D: Palindromic characteristics

2017-08-01 00:00:00 +0000

Insights

  1. if a string is a k-palindrome, then it is also a (k-1)-palindrome!!!
    Proof: simple induction on k
    
  2. a string is a k-palindrome <=> it is a palindrome and its left half is a (k-1)-palindrome
    Proof: induction on k, base case k = 2
    
So we need to calculate maxK(l, r) = the max degree k of the substring [l, r]. This means we need 2 DPs: one for whether the substring is a palindrome, another for the max degree of the left half. In the end, we do a suffix sum over degrees from n down to 1, since a k-palindrome is also a (k-1)-palindrome. See the sketch below.
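A minimal O(n^2) sketch of the two DPs plus the final suffix sum (my code; cnt[k] ends up as the number of k-palindromic substrings):

static long[] palindromicCharacteristics(String s) {
    int n = s.length();
    boolean[][] isPal = new boolean[n][n];
    int[][] deg = new int[n][n];
    long[] cnt = new long[n + 2];               // cnt[k] = # substrings with max degree exactly k
    for (int len = 1; len <= n; len++) {
        for (int l = 0, r = len - 1; r < n; l++, r++) {
            if (s.charAt(l) == s.charAt(r) && (len <= 2 || isPal[l + 1][r - 1])) {
                isPal[l][r] = true;
                int half = len / 2;             // degree = 1 + degree of the left half
                deg[l][r] = (half == 0) ? 1 : deg[l][l + half - 1] + 1;
                cnt[deg[l][r]]++;
            }
        }
    }
    for (int k = n; k >= 1; k--) cnt[k] += cnt[k + 1];  // a k-palindrome is also a (k-1)-palindrome
    return cnt;                                 // answers are cnt[1..n]
}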

TiDB vs CockroachDB

2017-07-28 00:00:00 +0000

Last updated: Apr 8, 2019

Similar designs

  • Both follow the architecture of a stateless SQL layer on top of a replicated, strongly consistent KV store.
  • Both use RocksDB to serve the KV store. Consistency is guaranteed by the standard write-ahead log (WAL) + replicated state machine (RSM) model; the WAL is replicated by raft.
    • This architecture is inspired by Spanner/F1.
    • Each db has its own optimizations on top of vanilla raft. I will not dive deep into the details here, because the actual optimizations keep evolving
  • To implement transactions, both use the standard 2PC + MVCC idea.
    • Both, by default, follow the serializable concurrency model, implemented with lease reads.
  • Both use online schema change, based on the paper ”Online, Asynchronous Schema Change in F1”, i.e., multi-stage schema change

The similarities end at this high-level design. There are huge differences in the implementation details

Performance

  • The test I did in 2017. Note that I was completely new to TiDB; if I ran it again, the numbers would be much better.
    • It used snapshot reads instead of consistent reads
  • For standard InnoDB MySQL, our planning expectation is 5k QPS at 10 ms avg latency on an r4.4xlarge.

System design: distribute whitelist

2017-07-24 00:00:00 +0000

Suppose we have a mail client that pulls the blocked-website list from our service. Design such a service so that we can distribute the block list. The server-side list is updated once per second; the client polls our service once every 30 mins

Design: Obviously, we need to optimize for read performance

So the list has been updated up to 1800 times between consecutive pulls

The client should send a request fetchUpdate(clientVersion), which returns the deltas since the client’s version

On the server side, we should keep a list of deltas keyed by version, and return the deltas after the client’s version. In a KV store that supports range scans, this should be highly efficient => consecutive versions are laid out together in the persistence layer

To further improve performance for clients that have missed many updates, we can keep a compacted snapshot every certain number of versions; if the requested version < the last compacted version, just return the whole snapshot. We should also run log compaction from time to time to purge deltas that are too old. See the sketch below.

In terms of global updates, we designate 1 DC as the master and the others as slaves, and we use a cross-region ZK to track which one is alive and forward all write traffic to it. When the old master recovers, it will just talk to the new master and sync from the single source of truth.
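A hypothetical sketch of the delta-pull API described above (all names mine):

import java.util.List;

interface BlockListService {
    // Returns all deltas after clientVersion, or a full compacted snapshot
    // when clientVersion is older than the last compaction point.
    UpdateResponse fetchUpdate(long clientVersion);
}

record Delta(long version, List<String> added, List<String> removed) {}

record UpdateResponse(long latestVersion, boolean fullSnapshot, List<Delta> deltas) {}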