DNN Designer

Login |  
Special Topics Presentation
opera fix
Print   Minimize 
opera fix
View Notes

Title: Google PageRank Algorithm

Introduction: Many of us are fascinated with the success of the Google search engine since its inception in 1998. What was the landscape of the World Wide Web back then? What made the Google search engine unique among all of the other search engines at the time? The answer lies in Google’s PageRank algorithm. In 1998, Sergey Brin and Lawrence Page, two Stanford University doctorate students, presented their search engine prototype called Google at the Seventh International World Wide Web conference (WWW98) in Brisbane, Australia. The Google search engine is based primarily on the PageRank algorithm, which is different from the three traditional information retrieval (IR) models: boolean model, vector space model, and probabilistic model. In their own words, Brin and Page defined PageRank as "an objective measure of its citation importance that corresponds well with people’s subjective idea of importance" (Brin & Page, 1998). This paper presents an overview of the unique problems in IR on the web, design goals of the Google search engine, description of the PageRank algorithm, and system architecture of the initial prototype.

Downloads:

Report (165KB)

Presentation (806KB)

opera fix

The Anatomy of a Large-Scale Hypertextual Web Search Engine
opera fix
Print   Minimize 
opera fix

URL: http://infolab.stanford.edu/~backrub/google.html

Summary:

  • Human-maintained indices (e.g. Yahoo) are expensive to build and maintain, subjective, and hard to cover all topics.
  • Design Goals of Google:
    • Quality search results (e.g. only 1 of 4 top commercial search engines returns itself when its name is searched).  The goal is to have very high precision even at the expense of recall because people are only willing to look at the first few ten results.
    • Academic search engine research - the goal is to give people opportunities to do experiment with large-scale web data.  These large-scale data is traditionally considered valuable by commercial search engines.
  • System Features
    • PageRank - an objective measure of a webpage's citation importance that corresponds well with people's subjective idea of importance.  Therefore, it is an excellent way to rank search results.  PageRank is defined as follows:
      • We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

        PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

        Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.

      • In the initial experiment, Google is able to compute the PageRank of 26 million web pages in a few hours on a medium size workstation.
    • Anchor Text - the anchor text generally provides a better description of the target page than the page itself.  It can also be used to crawl non-textual documents such as images, multimedia, databases.
    • Other features
      • Location information of all hits so it takes care of proximity.
      • Font size and weight - larger fonts are more important.
      • Full raw HTML of webpages are available in repository.
  • System Anatomy
    • Components: URL Server, Crawler, Store Server, Repository, URL Resolver, Indexer, Barrels, Searcher.
    • Data Structures
      • BigFiles - virtual file spanning multiple file system and are addressable by 64-bit integers.
      • Repository - contains full HTML of all webpages.  Each page is compressed using zlib.
      • Document Index - information includes current document status, a pointer into the respository, document checksum, and other statistics.
      • Lexicon - a list of words concatenated by separated by nulls.
      • Hit List - a list of occurrences of a particular word in a particular document including position, font, and capitalization information.
      • Forward Index
      • Inverted Index - consists of the same barrel as the forward index except that it has been processed by the sorter.
    • Crawling - most challenging task because it involves hundreds of thousands of web servers that are beyond Google's control.
    • Indexing - includes parsing, indexing documents into barrels, and sorting.
    • Searching - the algorithm is as follow:
      1. Parse the query.
      2. Convert words into wordIDs.
      3. Seek to the start of the doclist in the short barrel for every word.
      4. Scan through the doclists until there is a document that matches all the search terms.
      5. Compute the rank of that document for the query.
      6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
      7. If we are not at the end of any doclist go to step 4.
      8. Sort the documents that have matched by rank and return the top k.
opera fix


The PageRank Citation Ranking: Bringing Order to the Web
opera fix
Print   Minimize 
opera fix

URL: http://dbpubs.stanford.edu:8090/pub/1999-66

Summary: The article first differentiates between PageRank and citation analysis.  Citation analysis is geared towards academic publications where they typically go through a long peer-reviewed process.  As compared to webpages where anybody can post any content on the web.  It is also much easily to create new webpages, thus inflating (manipulating) "citation counts".  The following is an intuitive description of PageRank:

A page has a high rank if the sum of the ranks of its backlinks is high.

This description takes care of the situations where a page has alot of backlinks and where a page has a few backlinks that have high PageRanks.  The articles describes the two prototype search engines that were designed.  The first is a title-based search engine.  The second is a full-text search engine called Google.  In the first prototype, a webpage title is used to match a user's query and PageRank is used to rank the search results.  The results were very successful.  The title match ensures high precision; while PageRank ensures high quality.

opera fix
Can Google's PageRank be used to find the most important ...
opera fix
Print   Minimize 
opera fix

URL: http://www.emeraldinsight.com/Insight/viewContentItem.do;jsessionid=0AA6E2DF1D8893F5360DC95A31298912?contentType=Article&hdAction=lnkpdf&contentId=864214

Summary:

  • PageRank is based on the assumption that good quality pages are more likely to be linked to than poor quality ones.
  • The 2 research questions in this study are:
    • Are the pages given the highest rank by PageRank clearly the most useful or highest quality in the system analysed, or can their high positions be the result of unrelated factors?
    • Is PageRank more successful than simple inlink counts at identifying the top pages?
  • Conclusion:
    • There are no significant difference between the success of raw inline counts and PageRank in the rank of the university homepage.
    • There is a number of pages that have high number of inlink counts but low PageRanks because these pages have contain a large number of links which dissipates the effect of each individual link through sharing "rank" between all targets of a page.
    • Certain pages that have high PageRank are not high quality (e.g. disclaimer page, copyright page).
opera fix
Investigating the impact of the blogosphere
opera fix
Print   Minimize 
opera fix

URL: http://eprints.qut.edu.au/archive/00010517/01/10517.pdf

Summary: The study analyzes 8.8 million blogs (about 10% of entire industry) in 2005 and 2006 and investigates their apparent impact on the web itself.  The paper addresses these questions:

  • How is PageRank distributed across the blogsphere?
  • Does it indicate the existence of measurable, visible effects of blogs on the overal mediasphere?
  • Can we compare the distribution of attention to blogs as characterized by the PageRank with the situation for other forms of web content?
  • Has there been a growth in the impact of blogsphere on the web over the two years?

The author hypothesizes that many blogs may have high PageRanks because of their search engine-friendly URLs, which means that most of the blogs are indexed by Google.  The large number of interlinkings make the impression that the content is of high quality.

  • Definition of Impact - the chance of a blog to be read and cited by many people, and therefore as the possibility to influence their cognitive reception and behavior.
  • Results of the Study: PageRank 0 remained steady; PageRank 1-2, and 7-10 diminished substantially; PageRank 3-6 grew slightly.
opera fix

Copyright 2008 by WillWork.Org
Terms Of Use | Privacy Statement