Crawler
- Spider crawls the web to create the build data (BD)
- Estimate ~1T new pages generated per day, i.e., on the order of 100 pages per person in the world
- At ~100 KB per page, that is 10^12 × 10^5 B ≈ 100 PB per day
- Need to store:
- URL, which can be stored in a trie to reduce the character space
- last_visited_at
- hash
- Links to crawl, kept separately from already-crawled links (see the frontier sketch after this list)
- May separate into hour, day, and month DBs so that only the hour DB changes, then divide & conquer the search across them and merge the results
- Seed URLs can be populated by hand
- DNS caching needs to be tuned or enhanced, since the crawler discovers a lot of new URLs
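A minimal in-memory sketch of the per-URL state and the frontier vs. crawled split from the list above. The class names, the SHA-256 content hash, and the deque-based frontier are illustrative assumptions; a real system would back this with the hour/day/month DBs mentioned above.

```python
import hashlib
import time
from collections import deque
from dataclasses import dataclass

@dataclass
class CrawlRecord:
    # Per-URL state from the list above; names other than last_visited_at
    # and the hash are illustrative assumptions.
    url: str
    last_visited_at: float = 0.0   # epoch seconds of the last fetch
    content_hash: str = ""         # hash of the last fetched body, to detect changes

class CrawlState:
    """Keeps links-to-crawl (frontier) separate from already-crawled links."""
    def __init__(self, seed_urls):
        self.frontier = deque(seed_urls)   # discovered but not yet fetched
        self.seen = set(seed_urls)         # everything ever enqueued
        self.crawled = {}                  # url -> CrawlRecord

    def next_url(self):
        return self.frontier.popleft() if self.frontier else None

    def mark_crawled(self, url, body: bytes, discovered_links):
        rec = self.crawled.get(url, CrawlRecord(url))
        rec.last_visited_at = time.time()
        rec.content_hash = hashlib.sha256(body).hexdigest()
        self.crawled[url] = rec
        for link in discovered_links:      # enqueue only never-seen links
            if link not in self.seen:
                self.seen.add(link)
                self.frontier.append(link)

# Seed URLs are populated by hand, as noted above.
state = CrawlState(["https://example.com"])
state.mark_crawled("https://example.com", b"<html>...</html>",
                   ["https://example.com/a", "https://example.com/b"])
```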
How much can a trie save?
- Assume all URLs have length up to l with evenly distributed characters. Stored flat, the total space is 1·26 + 2·26^2 + … + l·26^l = O(l · 26^l)
- In a trie there are at most 26 children per node, so the total node count is 26 + 26^2 + … + 26^l = O(26^l) => we save a factor of l
- The intuition: no matter how we estimate, we only need space for each unique prefix (marker) once, instead of repeating shared prefixes in every URL
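A minimal trie sketch that makes the saving concrete: each shared prefix is stored once, so the node count tracks distinct prefixes rather than the summed length of all URLs. Class and method names are illustrative.

```python
class TrieNode:
    __slots__ = ("children", "is_url")
    def __init__(self):
        self.children = {}    # char -> child TrieNode
        self.is_url = False   # marks the end of a stored URL

class UrlTrie:
    def __init__(self):
        self.root = TrieNode()
        self.node_count = 1   # count the root

    def insert(self, url: str):
        node = self.root
        for ch in url:
            if ch not in node.children:
                node.children[ch] = TrieNode()
                self.node_count += 1
            node = node.children[ch]
        node.is_url = True

trie = UrlTrie()
urls = ["example.com/a", "example.com/b", "example.com/ab"]
for u in urls:
    trie.insert(u)
flat_chars = sum(len(u) for u in urls)  # 40 characters if stored flat
print(flat_chars, trie.node_count)      # 40 16: the shared prefix is paid for once
```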
Page indexer
- Indexer reads/receives URLs from the crawler and builds an inverted index by visiting each URL
- Child links from each page are needed for PageRank
- Choose to snapshot each page visited, including its links and snippets; this powers the document service when the search indexer returns URLs
- May also build, per document, a forward index of keywords and prefixes
- For quick filtering, can use a Bloom filter over the keywords and prefixes (see the sketch after this list)
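A sketch of the indexer structures above: an inverted index (term → URLs), a forward index (URL → terms), and a Bloom filter for quick keyword membership checks. The hand-rolled Bloom filter, its parameters, and the whitespace tokenization are assumptions for illustration only.

```python
import hashlib
from collections import defaultdict

class BloomFilter:
    """Tiny Bloom filter for quick membership checks on keywords/prefixes."""
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

inverted = defaultdict(set)   # term -> set of URLs containing it
forward = {}                  # URL  -> list of terms (forward index)
bloom = BloomFilter()

def index_page(url: str, text: str):
    terms = text.lower().split()          # real tokenization would be richer
    forward[url] = terms
    for term in terms:
        inverted[term].add(url)
        bloom.add(term)

index_page("https://example.com", "web crawler builds the index")
print(bloom.might_contain("crawler"), bloom.might_contain("zebra"))  # True, (almost surely) False
```

The Bloom filter answers "definitely not indexed" cheaply before touching the inverted index; false positives only cost an extra index lookup.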
Search indexer
- Analyze the search keyword
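The note does not spell out the analysis step; the sketch below assumes simple lowercasing, tokenization, and de-duplication. Stemming, stopword removal, and synonyms would be further assumptions beyond the note.

```python
import re

def analyze_query(query: str) -> list[str]:
    # Normalize and tokenize the search keywords; real analysis would also
    # handle stemming, stopwords, synonyms, etc.
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    seen, terms = set(), []
    for t in tokens:
        if t not in seen:        # keep order, drop duplicates
            seen.add(t)
            terms.append(t)
    return terms

print(analyze_query("Web Crawler crawler design"))  # ['web', 'crawler', 'design']
```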
Search flow
- Goes through the search indexer to limit the set of documents to search
- Ranker scores and sorts the results and returns the first page of results
- Skip list is the most common approach for the ordered linked-list union problem on posting lists; O(log n) per lookup (see the sketch after this list)
- Cache popular search results. Reading 1 MB sequentially takes ~250 µs from memory vs. ~1 ms from SSD
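The note names skip lists for combining ordered posting lists. The sketch below approximates the same O(log n) skip-ahead using sorted Python lists, with heapq.merge for the union case and bisect for intersection; the posting-list representation here is an assumption, not the note's exact structure.

```python
from bisect import bisect_left
from heapq import merge

def union_postings(*postings):
    """Union of sorted posting lists (doc-ID lists), e.g. for OR queries."""
    result = []
    for doc_id in merge(*postings):
        if not result or result[-1] != doc_id:   # drop duplicates across lists
            result.append(doc_id)
    return result

def intersect_postings(a, b):
    """Intersection via binary-search skip-ahead: O(len(a) * log len(b))."""
    out = []
    for doc_id in a:
        i = bisect_left(b, doc_id)               # skip ahead in b in O(log n)
        if i < len(b) and b[i] == doc_id:
            out.append(doc_id)
    return out

term_a_docs = [1, 4, 7, 9, 20]
term_b_docs = [4, 9, 13, 20, 31]
print(union_postings(term_a_docs, term_b_docs))      # [1, 4, 7, 9, 13, 20, 31]
print(intersect_postings(term_a_docs, term_b_docs))  # [4, 9, 20]
```

The ranker would then score only the merged doc IDs, and popular query results can be cached given the memory vs. SSD read latencies noted above.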