Common problems for huge data processing
Find common URLs
Given 2 files, each with 5 billion URLs; each URL is 64 bytes. Memory size is 4 GB.
- Memory can hold about 4 * 10^9 / 64 ≈ 60M URLs at a time.
- Each big file would therefore have to be streamed through memory in roughly 80 chunks to read it all.
- So compute the hash of each URL mod 200 and scatter the URLs of each file into 200 bucket files.
- Load the two bucket files that share the same hash index into memory: put one into a hash set, then look up every URL from the other.
- This step can be parallelized
- Append the matched URLs to the final output file (a sketch follows below).
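A minimal sketch of this hash-partition approach in Python. The file names a.txt / b.txt, the URL-per-line format, and the MD5-based bucket hash are assumptions for illustration; the problem statement only specifies the sizes.

```python
import hashlib

NUM_BUCKETS = 200

def bucket_of(url: str) -> int:
    # Use a stable hash so both files partition identically across runs.
    return int(hashlib.md5(url.encode()).hexdigest(), 16) % NUM_BUCKETS

def partition(path: str, prefix: str) -> None:
    # Stream the big file once and append each URL to its bucket file.
    outs = [open(f"{prefix}_{i}.txt", "w") for i in range(NUM_BUCKETS)]
    with open(path) as f:
        for url in f:
            url = url.strip()
            outs[bucket_of(url)].write(url + "\n")
    for o in outs:
        o.close()

def intersect_buckets(out_path: str) -> None:
    # Each pair of bucket files is small enough to process in memory.
    with open(out_path, "w") as out:
        for i in range(NUM_BUCKETS):
            with open(f"a_{i}.txt") as fa:
                seen = {line.strip() for line in fa}   # one side into a hash set
            with open(f"b_{i}.txt") as fb:
                for url in fb:
                    if url.strip() in seen:            # probe with the other side
                        out.write(url)

partition("a.txt", "a")
partition("b.txt", "b")
intersect_buckets("common_urls.txt")
```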
Top 100 popular words
A 1 GB file of words, each word under 16 bytes. Find the 100 most frequent words using 1 MB of memory.
- 1 MB of memory fits at least 50k 16-byte words; the file contains on the order of 100M words.
- Read the file sequentially and hash the words into 2,000 bucket files, about 0.5 MB per file. Each bucket must stay small enough to fit in memory even in the worst case where all words are distinct.
- Read each bucket file and build its word counts in memory.
- Merge each bucket's counts into the running top-100 result kept in memory (see the sketch after this list).
- If the distinct words fit into memory, a trie (or a suffix array) can also be used to save space.
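A minimal sketch of the two-pass bucket-and-count approach. The input file name words.txt and the word-per-line format are assumptions; the 2,000 buckets follow the estimate above.

```python
import hashlib
import heapq
from collections import Counter

NUM_BUCKETS = 2000

def bucket_of(word: str) -> int:
    return int(hashlib.md5(word.encode()).hexdigest(), 16) % NUM_BUCKETS

# Pass 1: scatter words into bucket files so each bucket fits in 1 MB of memory.
# (2,000 simultaneously open files may need a raised open-file limit, or
# several partitioning passes, on some systems.)
outs = [open(f"bucket_{i}.txt", "w") for i in range(NUM_BUCKETS)]
with open("words.txt") as f:
    for word in f:
        word = word.strip()
        if word:
            outs[bucket_of(word)].write(word + "\n")
for o in outs:
    o.close()

# Pass 2: count each bucket independently, keeping a running top-100 min-heap.
top100 = []  # (count, word) pairs; smallest count sits at top100[0]
for i in range(NUM_BUCKETS):
    with open(f"bucket_{i}.txt") as f:
        counts = Counter(line.strip() for line in f)
    for word, cnt in counts.items():
        if len(top100) < 100:
            heapq.heappush(top100, (cnt, word))
        elif cnt > top100[0][0]:
            heapq.heapreplace(top100, (cnt, word))

print(sorted(top100, reverse=True))  # most frequent first
```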
Find all distinct numbers in a file
3 billion numbers in the file; they do not fit into memory.
- Idea 1: hash the numbers into bucket files and deduplicate each bucket independently. Again, each bucket file must be smaller than memory to handle the all-distinct case.
- Idea 2: a bitmap over the value range would consume about 3 GB of RAM. Note that a Bloom filter may still be too large, because it needs roughly 10 bits per element.
- The same idea can be used to check whether a given number exists in the file (see the bitmap sketch below).
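A minimal sketch of Idea 2, assuming the values are 32-bit unsigned integers read one per line from a hypothetical numbers.txt. Under that assumption the bitmap is 2^32 bits = 512 MB; a wider value range grows it proportionally.

```python
def distinct_numbers(path: str):
    bitmap = bytearray(2 ** 32 // 8)         # one bit per possible 32-bit value
    with open(path) as f:
        for line in f:
            n = int(line)
            byte, bit = n >> 3, n & 7
            if not (bitmap[byte] >> bit) & 1:
                bitmap[byte] |= 1 << bit
                yield n                       # first time this value is seen

for n in distinct_numbers("numbers.txt"):
    print(n)
```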
Top K in N sorted arrays of the same size
- Maintain a min-heap of size N and a cursor index into each array.
- Push the first element of each array into the heap, recording which array it came from.
- Pop the minimum of the heap and push the next item from the same array.
- Stop after doing this K times; the popped elements are the answer (sketch below).
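A minimal sketch of the heap-based N-way merge, assuming the arrays are ascending and that "top K" means the K smallest elements overall; the sample input is made up.

```python
import heapq

def top_k(arrays: list, k: int) -> list:
    # Seed the heap with the first element of each array, tagged with its source.
    heap = [(arr[0], i, 0) for i, arr in enumerate(arrays) if arr]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < k:
        value, arr_idx, pos = heapq.heappop(heap)    # current global minimum
        result.append(value)
        if pos + 1 < len(arrays[arr_idx]):           # advance within that array
            heapq.heappush(heap, (arrays[arr_idx][pos + 1], arr_idx, pos + 1))
    return result

print(top_k([[1, 4, 9], [2, 3, 8], [0, 5, 7]], 4))   # [0, 1, 2, 3]
```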
Sort queries by frequency
10 files, each 1 GB in size.
- If duplication is high, just use an in-memory hash map to store the counts, then sort the keys by count (sketch below).
- If duplication is low, hash the queries into bucket files, count each bucket, then merge-sort the per-bucket results by frequency.
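A minimal sketch of the high-duplication case, assuming the queries sit one per line in files named queries_0.txt through queries_9.txt (hypothetical names).

```python
from collections import Counter

counts = Counter()
for i in range(10):
    with open(f"queries_{i}.txt") as f:    # the ten 1 GB query-log files
        for line in f:
            counts[line.strip()] += 1

# Most frequent first; this only works if the distinct queries fit in memory.
for query, cnt in counts.most_common():
    print(cnt, query)
```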
Find median of 5B numbers
- Idea 1: if the numbers fit into memory, use two heaps: a max-heap for the half below the current median and a min-heap for the half above it (see the sketch after this list).
- Invariant: the sizes of the two heaps differ by at most 1.
- Idea 2: partition the numbers into bucket files by their high-order bits (prefix). From the count in each bucket we know exactly which bucket holds the median and its rank within that bucket, so only that one bucket needs to be sorted or selected from.
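A minimal sketch of Idea 1's two-heap approach, assuming the numbers fit in memory as a Python iterable.

```python
import heapq

def median(nums):
    lo, hi = [], []    # lo: max-heap (stored negated) for the smaller half
                       # hi: min-heap for the larger half
    for x in nums:
        heapq.heappush(lo, -x)
        heapq.heappush(hi, -heapq.heappop(lo))    # keep every lo item <= every hi item
        if len(hi) > len(lo) + 1:                 # rebalance so sizes differ by at most 1
            heapq.heappush(lo, -heapq.heappop(hi))
    if len(hi) > len(lo):
        return hi[0]                              # odd count: middle element
    return (hi[0] - lo[0]) / 2                    # even count: average the two middles

print(median([5, 2, 8, 1, 9]))   # 5
print(median([5, 2, 8, 1]))      # 3.5
```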