 |

 |
| |
The Anatomy of a Large-Scale Hypertextual Web Search Engine
|
|
|
|  |  |
|
 |
|
|
|
URL: http://infolab.stanford.edu/~backrub/google.html
Summary:
- Human-maintained indices (e.g. Yahoo) are expensive to build and maintain, subjective, and hard to cover all topics.
- Design Goals of Google:
- Quality search results (e.g. only 1 of 4 top commercial search engines returns itself when its name is searched). The goal is to have very high precision even at the expense of recall because people are only willing to look at the first few ten results.
- Academic search engine research - the goal is to give people opportunities to do experiment with large-scale web data. These large-scale data is traditionally considered valuable by commercial search engines.
- System Features
- PageRank - an objective measure of a webpage's citation importance that corresponds well with people's subjective idea of importance. Therefore, it is an excellent way to rank search results. PageRank is defined as follows:
We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.
- In the initial experiment, Google is able to compute the PageRank of 26 million web pages in a few hours on a medium size workstation.
- Anchor Text - the anchor text generally provides a better description of the target page than the page itself. It can also be used to crawl non-textual documents such as images, multimedia, databases.
- Other features
- Location information of all hits so it takes care of proximity.
- Font size and weight - larger fonts are more important.
- Full raw HTML of webpages are available in repository.
- System Anatomy
- Components: URL Server, Crawler, Store Server, Repository, URL Resolver, Indexer, Barrels, Searcher.
- Data Structures
- BigFiles - virtual file spanning multiple file system and are addressable by 64-bit integers.
- Repository - contains full HTML of all webpages. Each page is compressed using zlib.
- Document Index - information includes current document status, a pointer into the respository, document checksum, and other statistics.
- Lexicon - a list of words concatenated by separated by nulls.
- Hit List - a list of occurrences of a particular word in a particular document including position, font, and capitalization information.
- Forward Index
- Inverted Index - consists of the same barrel as the forward index except that it has been processed by the sorter.
- Crawling - most challenging task because it involves hundreds of thousands of web servers that are beyond Google's control.
- Indexing - includes parsing, indexing documents into barrels, and sorting.
- Searching - the algorithm is as follow:
- Parse the query.
- Convert words into wordIDs.
- Seek to the start of the doclist in the short barrel for every word.
- Scan through the doclists until there is a document that matches all the search terms.
- Compute the rank of that document for the query.
- If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
- If we are not at the end of any doclist go to step 4.
- Sort the documents that have matched by rank and return the top k.
|
 |
|
|
|

|
|
 |

|
 |

|