TAO: Facebook’s Distributed Data Store for the Social Graph
Workload
- 99.8% of requests are reads and only 0.2% are writes
- Creation time locality dominates the workload — a data item is likely to be accessed if it has been recently created.
- The content on a page depends heavily on each user's privacy settings and is personalized for every viewer. This means that the data needs to be stored as-is and then filtered when it is viewed/rendered.
- A relatively large percentage of requests are for relations that do not exist — e.g., “Does this user like that story?” is false for most of the stories
Requirements
- Users should almost always see their own updates; 99.99% of results matched what would have been served under a strict consistency model
- It is tolerable to have some inconsistencies in the content presented to the user, while high latency or unavailability is not tolerable.
Sample queries to support
- “Does this user like that story?”
- “What are the 50 most recent comments on this piece of content?”
- “Give me the most recent 10 comments about a check-in by Alice”
- “How many likes did the comment from Cathy have?”
Pain points
- Storing the list of edges as a single value
- Every read loads all edges, even if the final result contains only a few edges or is empty.
- Small incremental updates to the list invalidate the whole cached list
- A native list type would solve the edge-list lookup, but would make updating such a list tricky
- MySQL assumes data is accessed with spatial locality. This assumption does not hold for Facebook's workload
- The workload exhibits creation-time locality instead, so whole blocks loaded into the MySQL block cache contain mostly unneeded rows that occupy the cache without ever being hit
- Hard to mitigate thundering herds with leases handed out to Memcache clients (a toy lease sketch follows this list)
- Reads and writes on the same popular objects cause cache misses that fall through to the database
- Hard to coordinate between clients since they do not talk to each other.
- Changing table schemas as products evolved required coordination between engineers and MySQL cluster operators.
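For reference, a toy sketch (not Facebook's memcache code) of the lease idea mentioned above: on a miss, the cache hands out a lease token so only one client refills the key while others back off. The class, method names, and token scheme are illustrative assumptions.

```python
import itertools

class LeasingCache:
    """Toy cache with leases: on a miss, only the first requester gets a lease
    (permission to fill); others are told to wait and retry."""

    def __init__(self):
        self._data = {}
        self._leases = {}                      # key -> outstanding lease token
        self._tokens = itertools.count(1)

    def get(self, key):
        if key in self._data:
            return "hit", self._data[key]
        if key not in self._leases:
            token = next(self._tokens)
            self._leases[key] = token
            return "miss-with-lease", token    # caller should fetch from the DB and fill
        return "wait", None                    # someone else is already filling this key

    def fill(self, key, token, value):
        # Only the lease holder may install the value; stale fills are dropped.
        if self._leases.get(key) == token:
            self._data[key] = value
            del self._leases[key]
            return True
        return False

cache = LeasingCache()
status, token = cache.get("story:42")          # first reader gets the lease
print(status, cache.get("story:42"))           # concurrent readers are told to wait
cache.fill("story:42", token, b"from-db")
print(cache.get("story:42"))                   # now a hit
```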
Data model
“Alice checks in at the Golden Gate Bridge and tags Bob there, while Cathy comments on the check-in and David likes it”
Data items as nodes/objects
- E.g., user, check-in, comment, and location are all nodes. Actions that can be repeated, such as comments, are modeled as nodes instead of edges
- A typed object containing a dictionary of named fields, e.g., the text of a comment is a field in the comment node.
- Each node has a 64-bit id
- New fields can be registered for an object type at any time and existing fields can be marked deprecated by editing that type’s schema. In most cases product engineers can change the schemas of their types without any operational work.
- Each object_id contains a shard_id in it, reflecting the logical location of that object.
- All the data belonging to objects is serialized and stored against the id.
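A minimal sketch of the object model just described, assuming an illustrative 16-bit shard prefix in the high bits of the 64-bit id (the real bit layout is not specified in these notes) and JSON as a stand-in serialization format:

```python
import json
from dataclasses import dataclass, field

SHARD_BITS = 16  # illustrative assumption; the real id layout is not described here

@dataclass
class TaoObject:
    id: int                                       # 64-bit id with the shard_id embedded (assumed: high bits)
    otype: str                                    # e.g., "USER", "CHECKIN", "COMMENT", "LOCATION"
    fields: dict = field(default_factory=dict)    # named fields, e.g., {"text": "..."}

    @property
    def shard_id(self) -> int:
        return self.id >> (64 - SHARD_BITS)

    def serialize(self) -> bytes:
        # All field data is serialized and stored against the id.
        return json.dumps(self.fields).encode()

comment = TaoObject(id=(5 << 48) | 12345, otype="COMMENT",
                    fields={"text": "Wish we were there!"})
print(comment.shard_id, comment.serialize())
```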
Relationships between nodes as edges/associations
- Models relationships that happen at most once or record a state transition, e.g., “liked by”, “friend of”, and “accepts invitation”
- Grouped into association lists by their originating object, ordered by time
- Multiple associations may connect the same pair of objects as long as the types of all those associations are distinct.
- Together objects and associations form a labeled directed multi-graph.
- For every association type a so-called inverse type can be specified, e.g., “likes” and “liked by”. TAO maintains both the original and the inverse edge when an edge is created or deleted.
- The inverse association type is often not the same as the original type. For example, the Golden Gate Bridge location object is connected to the check-in object by a “check-in” association, while the check-in object connects back to the location object by a “location” association. By contrast, “friend” has a symmetric inverse type.
- Every association has a special time attribute that commonly represents the creation time of the association. Creation-time locality depends on this timestamp.
- Associations are stored similarly, with the id as the key and the data serialized into one column. The association table has an index on (originating id (id1), association type, time) to support time-ordered range scans.
API
- Standard CRUD APIs for nodes and edges respectively
- A point query on (node_id_1, edge_type, node_id_2)
- Range queries find outgoing associations given an (id1, type) pair.
- Count queries give the total number of outgoing associations for an (id1, type) pair. TAO optionally keeps track of counts as association lists grow and shrink, and can report them in constant time
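A toy in-memory stand-in for the association API, reused by the query examples later in these notes. The method names (assoc_add, assoc_get, assoc_count, assoc_range) follow the paper; the storage, the inverse-type registry, and the tuple layout are illustrative assumptions, and other calls from the paper (e.g., assoc_delete, assoc_time_range) are elided.

```python
from collections import defaultdict

class TaoClient:
    """Toy in-memory stand-in for TAO's association API (illustrative only)."""

    INVERSE = {"LIKES": "LIKED_BY", "LIKED_BY": "LIKES"}  # assumed inverse-type registry

    def __init__(self):
        # (id1, atype) -> list of (time, id2, data) tuples, kept newest-first
        self._lists = defaultdict(list)

    def assoc_add(self, id1, atype, id2, time, data=b""):
        """Create the edge (id1, atype, id2); the inverse edge, if the type
        has one, is maintained as part of the same operation."""
        self._insert(id1, atype, id2, time, data)
        inverse = self.INVERSE.get(atype)
        if inverse:
            self._insert(id2, inverse, id1, time, data)

    def assoc_get(self, id1, atype, id2set):
        """Point query: edges from id1 of type atype whose destination is in id2set."""
        wanted = set(id2set)
        return [e for e in self._lists[(id1, atype)] if e[1] in wanted]

    def assoc_count(self, id1, atype):
        """Total number of outgoing associations for (id1, atype)."""
        return len(self._lists[(id1, atype)])

    def assoc_range(self, id1, atype, pos, limit):
        """Range query: up to `limit` newest-first edges starting at offset `pos`."""
        return self._lists[(id1, atype)][pos:pos + limit]

    def _insert(self, id1, atype, id2, time, data):
        lst = self._lists[(id1, atype)]
        lst.append((time, id2, data))
        lst.sort(key=lambda e: e[0], reverse=True)  # order by the time attribute, newest first
```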
Translating user queries into TAO operations
- Are two objects connected by an association?
- A point query on (node_id_1, edge_type, node_id_2). This point query is also used to fetch the data for an edge
- “What are the 50 most recent comments on this piece of content?”
- Range queries find outgoing associations given an (id1, type) pair.
- “Give me the most recent 10 comments about a check-in by Alice”
- This can be modeled as assoc_range(CHECK_IN_ID, COMMENT, 0, 10).
- “How many likes did the comment from Cathy have?”
- assoc_count(COMMENT_ID, LIKED_BY)
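Putting these translations together against the hypothetical TaoClient sketched in the API section; all ids and type names are placeholders.

```python
client = TaoClient()                  # hypothetical client from the API sketch above

USER_ID, STORY_ID = 101, 202          # placeholder ids
CHECK_IN_ID, COMMENT_ID = 303, 404    # placeholder ids

# "Does this user like that story?" -> point query; an empty result means "no",
# which matches the workload note that most such relations do not exist.
likes_story = bool(client.assoc_get(USER_ID, "LIKES", [STORY_ID]))

# "What are the 50 most recent comments on this piece of content?" -> range query
recent_50 = client.assoc_range(STORY_ID, "COMMENT", 0, 50)

# "Give me the most recent 10 comments about a check-in by Alice" -> range query
recent_10 = client.assoc_range(CHECK_IN_ID, "COMMENT", 0, 10)

# "How many likes did the comment from Cathy have?" -> count query
num_likes = client.assoc_count(COMMENT_ID, "LIKED_BY")
```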
Out of scope
- NO operations for complex traversals or pattern matching on the graph. Executing such queries while responding to a user request is almost always a suboptimal design decision.
- TAO does not offer a server-side set intersection primitive. Instead we provide a client library function. The lack of clustering in the data set virtually guarantees that having the client orchestrate the intersection through a sequence of simple point and range queries on associations will require about the same amount of network bandwidth and processing power as doing such intersections entirely on the server side.
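A sketch of what such a client-library helper might look like, using the toy TaoClient above: “which of this user’s friends liked the story?” done as one range query for friends plus one point query against the story’s LIKED_BY list. The FRIEND type name and the friend_limit cutoff are illustrative assumptions.

```python
def friends_who_like(client, user_id, story_id, friend_limit=500):
    """Client-side set intersection via simple TAO queries:
    a range query for the user's friends, then a point query asking
    which of those friends appear in the story's LIKED_BY list."""
    friends = client.assoc_range(user_id, "FRIEND", 0, friend_limit)
    friend_ids = [id2 for (_time, id2, _data) in friends]
    likers = client.assoc_get(story_id, "LIKED_BY", friend_ids)
    return [id2 for (_time, id2, _data) in likers]
```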
Cluster topology
- Nodes and edges are stored in separate clusters, since they have different workload patterns.
- All objects and associations in the same shard are stored persistently in the same MySQL database, and are cached on the same set of servers in each caching cluster. Individual objects and associations can optionally be assigned to specific shards at creation time.
- There is a single primary region per shard; TAO relies on MySQL replication to propagate updates from the region where the shard is primary to all other regions (secondary regions). A secondary region cannot update the shard in its regional persistent store.
- There are far more shards in the system than MySQL hosts, so many shards are mapped onto a single host (see the sketch below).
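A sketch of that shard-to-host mapping, with made-up shard and host counts: far more shards than hosts, an explicit mapping table, and migration as a mapping change rather than a re-shard of the data.

```python
SHARD_BITS = 16                                   # assumption, matching the object sketch above
NUM_SHARDS = 2 ** SHARD_BITS
DB_HOSTS = [f"mysql{i:02d}" for i in range(8)]    # assumption: 8 database hosts

# Many shards per host: initial placement is a simple round-robin mapping table.
shard_to_host = {s: DB_HOSTS[s % len(DB_HOSTS)] for s in range(NUM_SHARDS)}

def host_for_object(object_id: int) -> str:
    # The shard_id is embedded in the object id (bit layout assumed for illustration).
    shard_id = object_id >> (64 - SHARD_BITS)
    return shard_to_host[shard_id]

# Moving a shard to re-balance load is a mapping change, not a re-shard of the data.
shard_to_host[7] = "mysql05"
```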
Failover
- A slave database failure can be addressed by going to the leader in the master region.
- The primary region can be switched to another region at any time. This is an automated procedure.
Removing hot spots
- Hot spots can be alleviated by adding cache tiers; consistent hashing maps shards to cache servers, so tiers can be added without extensive re-balancing of the caches (a consistent-hashing sketch follows this list).
- Shards can be migrated or cloned among servers in the same cluster to equalize the load and to smooth out load spikes
- A shard can also be served from a slave region (e.g., Asia) that has a replica DB plus its own leader and follower caches.
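A minimal consistent-hashing sketch for placing shards on cache servers within a tier; the virtual-node count and the MD5-based hashing are illustrative choices, not TAO's actual scheme.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy ring: a shard maps to the first cache server clockwise of its hash,
    so adding a server only remaps a small fraction of shards."""

    def __init__(self, servers, vnodes=64):
        points = []
        for server in servers:
            for v in range(vnodes):                # virtual nodes smooth the distribution
                points.append((self._hash(f"{server}#{v}"), server))
        points.sort()
        self._keys = [h for h, _ in points]
        self._servers = [s for _, s in points]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def server_for_shard(self, shard_id: int) -> str:
        h = self._hash(f"shard:{shard_id}")
        i = bisect.bisect(self._keys, h) % len(self._keys)
        return self._servers[i]

ring = ConsistentHashRing([f"cache{i:02d}" for i in range(4)])
print(ring.server_for_shard(1234))
```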
Sequence of actions
Assuming the client is in Asia
SoA: Read
- The client talks to the follower TAO cluster in the Asia slave region
- If the cache misses, the follower fetches the data from the leader cluster
- The leader cluster fetches the data from the MySQL cluster in the region (see the sketch below)
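A sketch of this read path with hypothetical class names: the follower serves from its cache, misses go to the shard's leader, and the leader fills from the regional database.

```python
class Leader:
    """Leader cache for a shard; fills from the regional MySQL database on a miss."""

    def __init__(self, regional_db):
        self.cache = {}
        self.db = regional_db

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.db.get(key)       # leader cluster fetches from the regional MySQL
        self.cache[key] = value
        return value

class Follower:
    """Follower cache that clients talk to; misses are forwarded to the leader."""

    def __init__(self, leader):
        self.cache = {}
        self.leader = leader

    def read(self, key):
        if key in self.cache:
            return self.cache[key]     # served directly from the follower cache
        value = self.leader.read(key)  # cache miss: ask the shard's leader
        self.cache[key] = value
        return value

regional_db = {"obj:632": b"serialized-object"}
follower = Follower(Leader(regional_db))
print(follower.read("obj:632"))        # miss -> leader -> regional DB, then cached
```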
SoA: Write
- The client talks to the slave leader cluster in the Asia slave region.
- The slave leader forwards the request to the leader in the master region synchronously
- The master leader forwards the write to the master DB synchronously
- The master DB replicates to the slave DB (asynchronously, via MySQL replication)
- The master leader updates the master cache
- The slave leader updates the slave cache
- The slave leader sends cache maintenance/invalidation messages to the followers asynchronously. These messages let each follower learn about changes made through other followers.
- TAO maintains a write-through cache, i.e., new data is not visible in the cache until the write to storage completes. The write-through design simplifies maintaining read-after-write consistency for writes made in a secondary region for the affected shard.
- TAO clients may override the default policy at the expense of higher processing cost and potential loss of availability.
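Finally, a sketch of the remote-region write path above, with the synchronous hops (slave leader to master leader to master DB) shown as direct calls and the steps that are asynchronous in reality (MySQL replication, follower invalidation) simulated as plain calls; all names are hypothetical.

```python
class MasterLeader:
    """Leader cache in the master region for the shard being written."""

    def __init__(self, master_db, slave_db):
        self.cache = {}
        self.master_db = master_db
        self.slave_db = slave_db

    def write(self, key, value):
        self.master_db[key] = value     # synchronous write-through to the master DB
        self.cache[key] = value         # then update the master leader cache
        self.slave_db[key] = value      # asynchronous in reality: MySQL replication to the replica

class SlaveLeader:
    """Leader cache in the secondary (Asia) region; it cannot write its regional DB."""

    def __init__(self, master_leader, follower_caches):
        self.cache = {}
        self.master_leader = master_leader
        self.follower_caches = follower_caches

    def write(self, key, value):
        self.master_leader.write(key, value)   # synchronous forward to the master region
        self.cache[key] = value                # write-through: visible only after storage accepted it
        for follower in self.follower_caches:  # asynchronous in reality: invalidation messages
            follower.pop(key, None)

master_db, slave_db = {}, {}
follower_caches = [{}, {}]
slave_leader = SlaveLeader(MasterLeader(master_db, slave_db), follower_caches)
slave_leader.write("obj:632", b"updated-object")
```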