System design: web scale chat systems
Requirements
- Multiple clients for a single user are allowed, and all messages should be synced across the clients
- Messages are displayed in as close to the same order as possible on all clients. See the discussion below
- Web scale
- Use WebSocket to deliver messages
- User authentication is needed
- Can ignore friend list feature for now
- Need to support group chat
What kind of order must be maintained?
- Partial order within the same sender on the same client must be maintained. From the sender’s point of view, the messages it sends should arrive in the same order on all clients. E.g., if the user sends messages in the order 1, 2, it is OK to delay the delivery of 2 until 1 is delivered, but it is not OK to display 2 on the receiver’s client and then insert 1 in front of 2
- Total order across senders within the same dialog should not be maintained. Otherwise, the whole group would be slowed down by the slowest participant - everyone would be fighting for the same distributed lock within the group
Capacity estimation
- 1B user clients. Each sends 1 message every 2 min over 12 active hours = 1B * 30 * 12 = 360B messages per day
- We will assume they are in the same region instead of distributed across regions
- Each message is ~1 KB = 360 TB of storage per day
- Average write traffic = 360B / (12 * 3600 s) ≈ 8.3M writes per sec; peak = 4x the average ≈ 33M, call it 35M writes per sec
- Because reads are pushed over WebSocket, read traffic will be somewhat higher than write traffic, but not much higher
- Each WebSocket server maintains 100k connections, so we need 1B / 100k = 10k nodes
- Each DB node handles 10k requests per sec at peak, so we need 35M / 10k = a 3,500-node cluster
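The arithmetic above, compressed into a quick back-of-the-envelope script (the inputs are the assumptions stated above, not measurements):

```python
# Back-of-the-envelope capacity math from the assumptions above.
clients = 1_000_000_000               # 1B user clients
msgs_per_client_per_day = 30 * 12     # 1 msg / 2 min over 12 active hours
msg_size_bytes = 1_000                # ~1 KB per message

msgs_per_day = clients * msgs_per_client_per_day            # 360B messages
storage_tb_per_day = msgs_per_day * msg_size_bytes / 1e12   # ~360 TB

avg_writes_per_sec = msgs_per_day / (12 * 3600)             # ~8.3M/s
peak_writes_per_sec = 4 * avg_writes_per_sec                # ~33M/s

ws_nodes = clients / 100_000              # 100k conns per node -> 10k nodes
db_nodes = peak_writes_per_sec / 10_000   # 10k req/s per node -> ~3.3k nodes

print(f"messages/day = {msgs_per_day:,.0f}, storage = {storage_tb_per_day:,.0f} TB/day")
print(f"writes/sec: avg = {avg_writes_per_sec:,.0f}, peak = {peak_writes_per_sec:,.0f}")
print(f"nodes: websocket = {ws_nodes:,.0f}, db = {db_nodes:,.0f}")
```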
High level components breakdown
Excluding common services, we need
- User authentication and authorization module (UAA) for the concrete auth logic
- User client service that manages the 1-to-many user-client relationship
- User domain service that stores detailed user info. Can be combined with the user client service
- Message domain service that manages the details and life cycle of individual messages
- Dialog domain service that manages the dialog life cycle, including the dialog -> message mapping and the dialog -> user mapping
- Notification service to host WebSocket connections to clients
Why separate message domain and dialog domain?
Aside from conceptual reasons, the practical ones:
- Messages are subject to extension, e.g., video/voice messages
- We need a way to locate a message by its id for various purposes
Why separate notification domain and dialog domain?
The two have different characteristics
- Dialog: heavy state + short connections
- Notification: light state + long connections/WebSocket
Otherwise, heavy state + long connections would be a much harder problem
User authentication and authorization (UAA)
- SSO such as OpenID gives an access_token and a refresh_token to every single client. For each request, the user will provide
- access_token
- openid
- login_type
- When a request hits the API gateway, the gateway will hit the UAA and get the corresponding JWT before forwarding to the actual target service
- UAA makes sure (login_type, openid) exists, and returns a code to the client
- The client will use the returned code to get the token
- Need to maintain
- (user_id, auth_type) -> (token, expiry, meta)
- (user_id, auth_id) -> (name, salted hash, meta) - can be combined with above table
- auth_id -> (user_id, openid, login_type, access_token)
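A minimal in-memory sketch of the code/token exchange described above, using the table shapes just listed; the function names and field choices are illustrative assumptions, not a concrete schema:

```python
import time
import uuid

# Illustrative in-memory stand-ins for the UAA tables above.
auth_ids = {}   # (login_type, openid) -> auth_id
codes = {}      # one-time code -> auth_id
tokens = {}     # (user_id, auth_type) -> (token, expiry, meta)

def login(login_type, openid, access_token):
    """UAA checks that (login_type, openid) exists and returns a one-time code."""
    auth_id = auth_ids.get((login_type, openid))
    if auth_id is None:
        return None                       # unknown identity -> reject
    # In production, verify access_token against the SSO provider here.
    code = uuid.uuid4().hex
    codes[code] = auth_id
    return code

def redeem(code, user_id, auth_type="sso"):
    """Client trades the one-time code for an internal token with an expiry."""
    auth_id = codes.pop(code, None)       # pop: the code is single-use
    if auth_id is None:
        return None
    token = uuid.uuid4().hex
    tokens[(user_id, auth_type)] = (token, time.time() + 3600, {"auth_id": auth_id})
    return token

# usage
auth_ids[("wechat", "openid-123")] = "auth-1"
code = login("wechat", "openid-123", access_token="sso-access-token")
print(redeem(code, user_id="u-42"))
```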
Message service
- Maintains the mapping (msg_id) -> (sender_id, dialog_id, content, status, created_at, updated_at). msg_id is a snowflake ID, which is roughly sorted by processing time (see the sketch after this list)
- Status marks whether the message has been delivered to the dialog service. This is required as the 2PC/TCC state
- We need 2PC/TCC support between the message domain and the dialog domain because messages from the same sender must be displayed in the same relative order across all clients
- Only after we deliver to the dialog service do we tell the client the message has been delivered. Again, if we did not have to maintain a consistent order for messages from the same sender, we could return to the client the moment the message is persisted to the message DB and do the delivery asynchronously
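To make "roughly sorted by processing time" concrete, here is a minimal snowflake-style generator. The 41/10/12 bit split and the custom epoch are common choices assumed here, not part of the design above:

```python
import threading
import time

class Snowflake:
    """Minimal snowflake-style ID: ms timestamp (41 bits), worker id (10 bits),
    per-ms sequence (12 bits). IDs sort roughly by processing time."""
    EPOCH_MS = 1_600_000_000_000   # arbitrary custom epoch (assumption)

    def __init__(self, worker_id):
        assert 0 <= worker_id < 1024
        self.worker_id = worker_id
        self.last_ms = -1
        self.seq = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF   # 12-bit sequence
                if self.seq == 0:                   # sequence exhausted:
                    while now <= self.last_ms:      # spin to the next ms
                        now = int(time.time() * 1000)
            else:
                self.seq = 0   # clock-rollback handling omitted for brevity
            self.last_ms = now
            return ((now - self.EPOCH_MS) << 22) | (self.worker_id << 12) | self.seq

gen = Snowflake(worker_id=7)
a, b = gen.next_id(), gen.next_id()
assert a < b   # monotonic per worker, roughly time-ordered across workers
```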
TCC vs MQ for transferring data between the message domain and dialog domain
Basically a decision between sync and async communication. In this case TCC is preferred over MQ
- Chat favors low latency over throughput. We can fail early and let the user try again
- Reverting an operation in MQ is a lot trickier than in sync communication
This also means we accept the risk of losing the buffering capability of an MQ when there is a major outage across the cluster
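A sketch of the resulting sync handshake; message_db and dialog_svc_deliver are hypothetical stand-ins for the two domains, and the status field is the 2PC/TCC state from the message-service section:

```python
# Sync TCC-style handshake between the message and dialog domains.
message_db = {}   # msg_id -> {"content": ..., "status": ...}

class DialogServiceDown(Exception):
    pass

def dialog_svc_deliver(msg_id, dialog_id):
    pass   # stand-in for the synchronous call into the dialog domain

def send_message(msg_id, dialog_id, content):
    # Try: persist with PENDING status before touching the dialog domain.
    message_db[msg_id] = {"content": content, "status": "PENDING"}
    try:
        dialog_svc_deliver(msg_id, dialog_id)   # sync call, fails fast
    except DialogServiceDown:
        # Cancel: mark failed and surface the error so the user can retry.
        message_db[msg_id]["status"] = "FAILED"
        raise
    # Confirm: only now do we ack the client.
    message_db[msg_id]["status"] = "DELIVERED"
    return "ack"

print(send_message("m1", "d1", "hello"))   # -> ack, status DELIVERED
```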
Dialog service
Maintains the following mappings
- (dialog_id, user_id) -> metadata - members in a dialog
- (user_id, dialog_id) -> metadata - dialogs a user is in
- (dialog_id, message_created_at) -> message_id - the content of a dialog
Partitioned by dialog_id; the order key is part of the PK, so that transactions remain on the same node
- Upon persisting a message from the message service, the dialog service sends a notification via the notification service to clients that new messages are available. Clients will then poll from the dialog service (sketched after this list)
- We can also push directly to the client as part of the notification instead of notifying the client to poll. In production, it is most likely a hybrid of the two to stay reactive
- Delivered messages can be purged from this DB, since the message service is the source of truth
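A sketch of that notify-then-poll read path over the (dialog_id, message_created_at) -> message_id mapping; the in-memory index and the notify hook are stand-ins for the real storage and notification service:

```python
import bisect

# Stand-in for the (dialog_id, message_created_at) -> message_id table,
# kept sorted per dialog (the real table is partitioned by dialog_id).
dialog_index = {}   # dialog_id -> sorted list of (created_at, message_id)

def notify(dialog_id):
    pass   # stand-in: wake subscribed clients via the notification service

def persist(dialog_id, created_at, message_id):
    entries = dialog_index.setdefault(dialog_id, [])
    bisect.insort(entries, (created_at, message_id))
    notify(dialog_id)            # "new messages are available"

def poll(dialog_id, after):
    """Client polls for message ids created after its checkpoint."""
    entries = dialog_index.get(dialog_id, [])
    start = bisect.bisect_right(entries, (after, "\uffff"))
    return [mid for _, mid in entries[start:]]

persist("d1", 100, "m1")
persist("d1", 101, "m2")
print(poll("d1", after=100))     # -> ['m2']
```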
User client service
- Maintains (user_id, client_id) -> (notification node, metadata)
- Acts as a router to route incoming notification requests to the real server
- 1B clients * 1 KB per entry = 1 TB of storage
- Simple ID-based sharding + 10 nodes is enough to host the metadata
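For the ID-based sharding, a stable hash of user_id modulo the node count is enough; the hash choice below is an assumption:

```python
import hashlib

NODES = 10   # per the estimate above

def shard_for(user_id):
    """Stable hash of user_id -> one of the metadata nodes."""
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NODES

print(shard_for("user-42"))   # the same user always routes to the same node
```

Consistent hashing would reduce data movement when resharding, but with a small fixed metadata tier plain modulo is adequate.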
Where do we maintain the checkpoint on the messages the user polls?
Either the local client side or the user client service side could work. In either case, the server side needs to defend against malicious poll requests
Local client
- Local storage maintains received message ids and their content so that the client can do local dedup and reordering.
- Client does not send out another message until the previous one has been acked by the server side.
- Typing messages is a low-throughput activity, so such delay is OK.
- If we really want to support concurrent sends, then we have to maintain a sequence number for each dialog, which further complicates the design
- Note we can’t rely on client time because of the consistent-order requirement - the processing time in the message service is the source-of-truth time
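A sketch of that client-side behavior: a local dedup store plus stop-and-wait sending. All names here are illustrative:

```python
class LocalClient:
    """Client-side state: dedup by message id, stop-and-wait sending."""

    def __init__(self, transport):
        self.transport = transport   # stand-in for the WebSocket channel
        self.seen = {}               # msg_id -> content (local dedup store)
        self.outbox = []             # queued messages not yet sent
        self.in_flight = None        # at most one unacked message

    def on_receive(self, msg_id, content):
        if msg_id in self.seen:
            return None                   # duplicate push/poll result: drop
        self.seen[msg_id] = content
        # Render in msg_id order: snowflake ids encode server processing
        # time, the source-of-truth order (not client clocks).
        return sorted(self.seen)

    def send(self, content):
        self.outbox.append(content)
        self._pump()

    def _pump(self):
        # Stop-and-wait: don't send the next message until the previous ack.
        if self.in_flight is None and self.outbox:
            self.in_flight = self.outbox.pop(0)
            self.transport(self.in_flight)

    def on_ack(self):
        self.in_flight = None
        self._pump()

sent = []
c = LocalClient(transport=sent.append)
c.send("1"); c.send("2")
assert sent == ["1"]          # "2" waits for the ack of "1"
c.on_ack()
assert sent == ["1", "2"]
```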
What if I want to maintain the same relative order of the messages I send across all clients?
- This means we need to introduce a sequence number for each (dialog_id, user_id), which acts as a logical distributed lock: fetch and then increment it every time I send a message (sketched below)
- In theory we could still use the message creation time instead of a sequence number, but in practice it is hard to deal with clock-drift issues
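A minimal fetch-and-increment sketch; in production a distributed counter (e.g. Redis INCR on a per-(dialog_id, user_id) key) would play this role, and the local lock below stands in for that:

```python
import threading
from collections import defaultdict

_lock = threading.Lock()
_counters = defaultdict(int)   # (dialog_id, user_id) -> last sequence number

def next_seq(dialog_id, user_id):
    """Fetch-and-increment: the logical per-(dialog, user) lock."""
    with _lock:
        _counters[(dialog_id, user_id)] += 1
        return _counters[(dialog_id, user_id)]

print(next_seq("d1", "u1"), next_seq("d1", "u1"))   # -> 1 2
```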
Notification service
- This layer is stateless from the business point of view; it only holds the long-living client connections
- The long connection gateway (LCG) connects multiple clients and multiple business backends (many-to-many), multiplexed so that each client needs only 1 long connection
- LCG supports a pub/sub model (sketched after this list)
- LCG should provide a callback hook to the business backend to support domain-specific auth
- OpenResty layer to keep sessions and do LB
- Long connection broker for protocol, auth, session, and pub/sub management
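A toy sketch of the pub/sub core and the auth callback; topic naming and all identifiers here are illustrative:

```python
from collections import defaultdict

class LongConnectionGateway:
    """Toy pub/sub core of the LCG: backends publish to a topic (e.g. a
    dialog id); subscribed clients receive the payload over their single
    long connection."""

    def __init__(self):
        self.subs = defaultdict(set)   # topic -> set of client send-callbacks
        self.auth_cb = lambda client_id, topic: True  # backend-provided hook

    def subscribe(self, client_id, topic, send):
        if not self.auth_cb(client_id, topic):   # domain-specific auth
            raise PermissionError(client_id)
        self.subs[topic].add(send)

    def publish(self, topic, payload):
        for send in self.subs[topic]:
            send(payload)              # fan out over each long connection

gw = LongConnectionGateway()
inbox = []
gw.subscribe("client-1", "dialog-42", inbox.append)
gw.publish("dialog-42", "new messages available")
print(inbox)   # -> ['new messages available']
```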
If we have to deploy globally to serve global traffic, which areas need to change?
TODO: dive deep on this later