Common problems for huge data processing
Find common URLs
Given 2 files, each with 5 billion URLs; each URL is 64 bytes. Memory size is 4 GB.
- Memory can hold about 4 * 10^9 / 64 ≈ 60M URLs at a time.
- Each big file would therefore have to be streamed through memory in roughly 80 chunks to read it all.
- So compute the hash of each URL mod 200 and scatter the URLs of each file into 200 bucket files.
- Load the two bucket files that share the same hash index into memory: put one into a hash set, then look up every URL from the other.
- This step can be parallelized
- Append the matched URLs to the final output file (a sketch follows below).
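A minimal sketch of this hash-partition approach in Python. The file names a.txt / b.txt, the URL-per-line format, and the MD5-based bucket hash are assumptions for illustration; the problem statement only specifies the sizes.

```python
import hashlib

NUM_BUCKETS = 200

def bucket_of(url: str) -> int:
    # Use a stable hash so both files partition identically across runs.
    return int(hashlib.md5(url.encode()).hexdigest(), 16) % NUM_BUCKETS

def partition(path: str, prefix: str) -> None:
    # Stream the big file once and append each URL to its bucket file.
    outs = [open(f"{prefix}_{i}.txt", "w") for i in range(NUM_BUCKETS)]
    with open(path) as f:
        for url in f:
            url = url.strip()
            outs[bucket_of(url)].write(url + "\n")
    for o in outs:
        o.close()

def intersect_buckets(out_path: str) -> None:
    # Each pair of bucket files is small enough to process in memory.
    with open(out_path, "w") as out:
        for i in range(NUM_BUCKETS):
            with open(f"a_{i}.txt") as fa:
                seen = {line.strip() for line in fa}   # one side into a hash set
            with open(f"b_{i}.txt") as fb:
                for url in fb:
                    if url.strip() in seen:            # probe with the other side
                        out.write(url)

partition("a.txt", "a")
partition("b.txt", "b")
intersect_buckets("common_urls.txt")
```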
Top 100 popular words
A 1 GB file of words, each word under 16 bytes. Find the 100 most frequent words using 1 MB of memory.
- 1 MB of memory fits at least 50k 16-byte words; the file contains on the order of 100M words.
- Read the file sequentially and hash the words into 2,000 bucket files, about 0.5 MB per file. Each bucket must stay small enough to fit in memory even in the worst case where all words are distinct.
- Read each bucket file and build its word counts in memory.
- Merge each bucket's counts into the running top-100 result kept in memory (see the sketch after this list).
- If the distinct words fit into memory, a trie (or a suffix array) can also be used to save space.
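A minimal sketch of the two-pass bucket-and-count approach. The input file name words.txt and the word-per-line format are assumptions; the 2,000 buckets follow the estimate above.

```python
import hashlib
import heapq
from collections import Counter

NUM_BUCKETS = 2000

def bucket_of(word: str) -> int:
    return int(hashlib.md5(word.encode()).hexdigest(), 16) % NUM_BUCKETS

# Pass 1: scatter words into bucket files so each bucket fits in 1 MB of memory.
# (2,000 simultaneously open files may need a raised open-file limit, or
# several partitioning passes, on some systems.)
outs = [open(f"bucket_{i}.txt", "w") for i in range(NUM_BUCKETS)]
with open("words.txt") as f:
    for word in f:
        word = word.strip()
        if word:
            outs[bucket_of(word)].write(word + "\n")
for o in outs:
    o.close()

# Pass 2: count each bucket independently, keeping a running top-100 min-heap.
top100 = []  # (count, word) pairs; smallest count sits at top100[0]
for i in range(NUM_BUCKETS):
    with open(f"bucket_{i}.txt") as f:
        counts = Counter(line.strip() for line in f)
    for word, cnt in counts.items():
        if len(top100) < 100:
            heapq.heappush(top100, (cnt, word))
        elif cnt > top100[0][0]:
            heapq.heapreplace(top100, (cnt, word))

print(sorted(top100, reverse=True))  # most frequent first
```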
Find all distinct numbers in a file
3 billion numbers in the file; they do not fit into memory.
- Idea 1: hash the numbers into bucket files and deduplicate each bucket independently. Again, each bucket file must be smaller than memory to handle the all-distinct case.
- Idea 2: a bitmap over the value range would consume about 3 GB of RAM. Note that a Bloom filter may still be too large, because it needs roughly 10 bits per element.
- The same idea can be used to check whether a given number exists in the file (see the bitmap sketch below).
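A minimal sketch of Idea 2, assuming the values are 32-bit unsigned integers read one per line from a hypothetical numbers.txt. Under that assumption the bitmap is 2^32 bits = 512 MB; a wider value range grows it proportionally.

```python
def distinct_numbers(path: str):
    bitmap = bytearray(2 ** 32 // 8)         # one bit per possible 32-bit value
    with open(path) as f:
        for line in f:
            n = int(line)
            byte, bit = n >> 3, n & 7
            if not (bitmap[byte] >> bit) & 1:
                bitmap[byte] |= 1 << bit
                yield n                       # first time this value is seen

for n in distinct_numbers("numbers.txt"):
    print(n)
```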
Top K in N sorted arrays of the same size
- Maintain a min-heap of size N and a cursor index into each array.
- Push the first element of each array into the heap, recording which array it came from.
- Pop the minimum of the heap and push the next item from the same array.
- Stop after doing this K times; the popped elements are the answer (sketch below).
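A minimal sketch of the heap-based N-way merge, assuming the arrays are ascending and that "top K" means the K smallest elements overall; the sample input is made up.

```python
import heapq

def top_k(arrays: list, k: int) -> list:
    # Seed the heap with the first element of each array, tagged with its source.
    heap = [(arr[0], i, 0) for i, arr in enumerate(arrays) if arr]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < k:
        value, arr_idx, pos = heapq.heappop(heap)    # current global minimum
        result.append(value)
        if pos + 1 < len(arrays[arr_idx]):           # advance within that array
            heapq.heappush(heap, (arrays[arr_idx][pos + 1], arr_idx, pos + 1))
    return result

print(top_k([[1, 4, 9], [2, 3, 8], [0, 5, 7]], 4))   # [0, 1, 2, 3]
```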
Sort queries by frequency
10 files, each 1 GB in size.
- If duplication is high, just use an in-memory hash map to store the counts, then sort the keys by count (sketch below).
- If duplication is low, hash the queries into bucket files, count each bucket, then merge-sort the per-bucket results by frequency.
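A minimal sketch of the high-duplication case, assuming the queries sit one per line in files named queries_0.txt through queries_9.txt (hypothetical names).

```python
from collections import Counter

counts = Counter()
for i in range(10):
    with open(f"queries_{i}.txt") as f:    # the ten 1 GB query-log files
        for line in f:
            counts[line.strip()] += 1

# Most frequent first; this only works if the distinct queries fit in memory.
for query, cnt in counts.most_common():
    print(cnt, query)
```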
Find median of 5B numbers
- Idea 1: if the numbers fit into memory, use two heaps: a max-heap for the half below the current median and a min-heap for the half above it (see the sketch after this list).
- Invariant: the sizes of the two heaps differ by at most 1.
- Idea 2: partition the numbers into bucket files by their high-order bits (prefix). From the count in each bucket we know exactly which bucket holds the median and its rank within that bucket, so only that one bucket needs to be sorted or selected from.
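A minimal sketch of Idea 1's two-heap approach, assuming the numbers fit in memory as a Python iterable.

```python
import heapq

def median(nums):
    lo, hi = [], []    # lo: max-heap (stored negated) for the smaller half
                       # hi: min-heap for the larger half
    for x in nums:
        heapq.heappush(lo, -x)
        heapq.heappush(hi, -heapq.heappop(lo))    # keep every lo item <= every hi item
        if len(hi) > len(lo) + 1:                 # rebalance so sizes differ by at most 1
            heapq.heappush(lo, -heapq.heappop(hi))
    if len(hi) > len(lo):
        return hi[0]                              # odd count: middle element
    return (hi[0] - lo[0]) / 2                    # even count: average the two middles

print(median([5, 2, 8, 1, 9]))   # 5
print(median([5, 2, 8, 1]))      # 3.5
```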