TiDB in OLAP
Note: all links are in Chinese
-
Use case: aggregating isolated MySQL clusters into one store for OLAP, with TiSpark running on top of it.
-
The production TiDB cluster has tens of nodes and tens of TBs of data.
-
To keep production table sizes in check, they regularly purge production tables, moving old entries into separate archive tables.
-
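The post doesn't show the archival mechanics; below is a minimal sketch of a batched move, with hypothetical table and column names (`orders`, `orders_archive`, `id`, `created_at`). Small batches keep each transaction short, which matters on a busy production table.

```python
# Sketch of batched archival: move rows older than a cutoff from a
# production table into an archive table, one small batch at a time.
# Table/column names are hypothetical; the post does not show the schema.

def archive_batch_sql(cutoff: str, batch_size: int = 5000) -> list[str]:
    """Return the SQL statements for one archival batch."""
    return [
        # Copy one batch of old rows into the archive table.
        f"INSERT INTO orders_archive "
        f"SELECT * FROM orders "
        f"WHERE created_at < '{cutoff}' "
        f"ORDER BY id LIMIT {batch_size};",
        # Delete the same rows from the production table (MySQL/TiDB
        # both support single-table DELETE ... ORDER BY ... LIMIT).
        f"DELETE FROM orders "
        f"WHERE created_at < '{cutoff}' "
        f"ORDER BY id LIMIT {batch_size};",
    ]

stmts = archive_batch_sql("2018-01-01")
```

The two statements would be run repeatedly inside a loop until the copy returns zero rows.
-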
They need to watch out for the performance impact of DDL operations on both production and archive tables.
-
For migration, they use the official TiDB Syncer tool, which makes TiDB act as a MySQL read slave.
-
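A minimal `syncer.toml` sketch for that setup, replicating binlog from an upstream MySQL into TiDB (hosts, credentials, and `server-id` are placeholders; check the Syncer docs for the full option list):

```toml
# Syncer replicates the upstream MySQL binlog into TiDB.
server-id = 101          # unique replica id, like a MySQL slave's

[from]                   # upstream MySQL master
host = "192.168.0.10"
user = "syncer"
password = ""
port = 3306

[to]                     # downstream TiDB
host = "192.168.0.20"
user = "root"
password = ""
port = 4000
```
-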
Daily archive volume: > 100 million rows, > 100 GB. Current TiDB data volume: “tens of TBs”.
-
As of when the post was written, TiDB Binlog relied on Kafka.
-
Use case: TiDB as a data warehouse, replacing AliCloud’s ADS and ES.
-
ADS has cost issues; ES has difficulty handling complex queries and carries high dev/ops costs.
-
Current cluster: 5 nodes, each with 16 cores and 32 GB RAM.
-
They sync data from AliCloud’s DRDS to TiDB, and run TiSpark against TiKV, TiDB’s storage layer.
-
2 TB of raw data incoming per day.
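Running TiSpark against TiKV amounts to registering TiSpark's SQL extension and pointing Spark at the TiDB cluster's Placement Driver (PD), so Spark reads TiKV directly instead of going through the TiDB SQL layer. A minimal PySpark-style sketch (the PD address and table names are placeholders):

```python
# Minimal TiSpark wiring: the two Spark settings that connect a
# Spark session to a TiDB cluster. PD address is a placeholder.

def tispark_conf(pd_addresses: str) -> dict[str, str]:
    return {
        # TiSpark's Spark SQL extension, which plans reads over TiKV.
        "spark.sql.extensions": "org.apache.spark.sql.TiExtensions",
        # Placement Driver endpoints of the target TiDB cluster.
        "spark.tispark.pd.addresses": pd_addresses,
    }

conf = tispark_conf("192.168.0.30:2379")

# With pyspark and the TiSpark jar on the classpath, this would be
# applied roughly as (db/table names hypothetical):
#   spark = (SparkSession.builder
#            .config("spark.sql.extensions", conf["spark.sql.extensions"])
#            .config("spark.tispark.pd.addresses", conf["spark.tispark.pd.addresses"])
#            .getOrCreate())
#   spark.sql("SELECT COUNT(*) FROM mydb.orders")
```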
A Restaurant Merchant/Order/Cashier SaaS
-
TiDB backs the operational data store. Check the link to see the test queries they run.
-
Before: RDS -> Mongo via Kafka -> Hive. After: RDS -> TiDB via Kafka -> Hive. TiSpark queries both TiDB and Hive
-
Cluster setup: 8 nodes, 5 of which are for the storage layer. Each TiKV/storage node has 16 cores, 128 GB RAM, and two 1.8 TB SSDs.
-
Peak QPS: 23K. Data volume: a “couple of Ts”.
Another Restaurant Merchant/Cashier SaaS
-
Near-real-time complex queries; TiDB replaces Solr.
-
Current deployment: 8 nodes. Storage-layer nodes have 16 cores and 64 GB RAM.
-
They need real-time analysis capabilities.
-
Their raw data source is MySQL databases, but they don’t want to spend too much effort on a MySQL -> Hive/HBase ETL pipeline.
-
They need Spark support, so they chose TiSpark + TiDB.