Project author: freniapinto

Project description:
Implement the PageRank algorithm in Hadoop to retrieve the top 100 pages
Language: Java
Repository: git://github.com/freniapinto/Hadoop-Pagerank-Impl.git
Created: 2018-05-12T23:18:21Z
Project page: https://github.com/freniapinto/Hadoop-Pagerank-Impl

License: MIT License


Implementation of PageRank in a distributed environment (Hadoop)

Preprocessing

The preprocessing stage consists of a MapReduce job, which collects every page (including dangling nodes) together with its adjacency list, and a map-only job, which initializes each page's rank to 1/numberOfPages.
Parser.java is a standalone program that parses the input files, prints them in human-readable form, and builds a graph from the wiki dump.
Issues:

  • Special characters in Wiki page names (handled by converting to bytes and using Latin encoding)
  • Replacing &amp; with &
  • Removing all duplicates from each adjacency list
  • If a linked page has no adjacency list of its own, it is treated as a dangling node
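
The last three cleanup steps above can be sketched in plain Java, without Hadoop. This is a minimal in-memory illustration, not the repository's MapReduce code: the class `GraphCleanup` and its methods are hypothetical names, and the graph is assumed to fit in a single `Map`. The special-character handling is left out because the README does not give enough detail to reproduce it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for the cleanup steps listed above. */
final class GraphCleanup {

    /** Decodes the HTML entity for an ampersand in a page name. */
    static String normalizeName(String name) {
        return name.replace("&amp;", "&");
    }

    /**
     * Returns a graph in which every adjacency list is duplicate-free and
     * every link target appears as a key. Targets without a list of their
     * own become dangling nodes with an empty adjacency list.
     */
    static Map<String, List<String>> clean(Map<String, List<String>> raw) {
        // A LinkedHashSet drops duplicate links while keeping first-seen order.
        Map<String, LinkedHashSet<String>> merged = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : raw.entrySet()) {
            LinkedHashSet<String> links = merged.computeIfAbsent(
                    normalizeName(e.getKey()), k -> new LinkedHashSet<>());
            for (String target : e.getValue()) {
                links.add(normalizeName(target));
            }
        }

        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, LinkedHashSet<String>> e : merged.entrySet()) {
            result.put(e.getKey(), new ArrayList<>(e.getValue()));
        }

        // Any link target that never appeared as a source is a dangling node.
        for (LinkedHashSet<String> links : merged.values()) {
            for (String target : links) {
                result.putIfAbsent(target, new ArrayList<>());
            }
        }
        return result;
    }
}
```

For example, cleaning `{A: [B, B, C], B: [A]}` yields `A: [B, C]`, `B: [A]`, and a new dangling entry `C: []`.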

PageRank calculation

The PageRank computation runs 10 MapReduce iterations, followed by a final map-only job that distributes the delta values across all PageRank scores.
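
The iteration can be illustrated with a small in-memory sketch. It rests on several assumptions that the README does not confirm: that "delta" means the rank mass held by dangling nodes, that this mass is spread evenly over all pages, and that the damping factor is 0.85. The class name `PageRankSketch` is hypothetical. In a MapReduce job the dangling mass of an iteration is usually known only once that job has finished, which is presumably why the project needs a final map-only job; the sketch folds the mass in within the same pass instead.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical single-machine sketch of the iterative PageRank step. */
final class PageRankSketch {

    // Assumed damping factor; the README does not state the value used.
    static final double DAMPING = 0.85;

    /**
     * Runs the given number of iterations. Every link target is assumed to
     * be a key of the graph; dangling nodes have an empty adjacency list.
     */
    static Map<String, Double> run(Map<String, List<String>> graph, int iterations) {
        int n = graph.size();
        Map<String, Double> rank = new HashMap<>();
        for (String page : graph.keySet()) {
            rank.put(page, 1.0 / n); // initial rank, as in preprocessing
        }

        for (int i = 0; i < iterations; i++) {
            Map<String, Double> incoming = new HashMap<>();
            double delta = 0.0; // rank mass held by dangling nodes

            // "Map" phase: each page sends an equal share to its out-links.
            for (Map.Entry<String, List<String>> e : graph.entrySet()) {
                double r = rank.get(e.getKey());
                List<String> links = e.getValue();
                if (links.isEmpty()) {
                    delta += r;
                    continue;
                }
                double share = r / links.size();
                for (String target : links) {
                    incoming.merge(target, share, Double::sum);
                }
            }

            // "Reduce" phase: sum the contributions and add delta / n.
            Map<String, Double> next = new HashMap<>();
            for (String page : graph.keySet()) {
                double sum = incoming.getOrDefault(page, 0.0) + delta / n;
                next.put(page, (1 - DAMPING) / n + DAMPING * sum);
            }
            rank = next;
        }
        return rank;
    }
}
```

Because the dangling mass is redistributed in every pass, the ranks keep summing to 1 (up to floating-point error), which makes a convenient sanity check.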

Top-100

Each mapper emits its local top 100 pages by PageRank value. The number of reducers is set to 1, so a single reducer computes the global top 100 pages.
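
A common way to implement this pattern is a bounded min-heap. The sketch below is a plain-Java illustration under that assumption, not the project's actual mapper and reducer: `TopK`, `topK`, and `globalTopK` are hypothetical names, `topK` plays the role of one mapper, and `globalTopK` plays the role of the single reducer.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical sketch of local top-k selection followed by a global merge. */
final class TopK {

    private static final Comparator<Map.Entry<String, Double>> BY_RANK =
            (a, b) -> Double.compare(a.getValue(), b.getValue());

    /** Returns the k highest-ranked entries, highest first. */
    static List<Map.Entry<String, Double>> topK(
            Iterable<Map.Entry<String, Double>> pages, int k) {
        // Min-heap: the lowest rank among the kept entries sits at the head.
        PriorityQueue<Map.Entry<String, Double>> heap = new PriorityQueue<>(BY_RANK);
        for (Map.Entry<String, Double> page : pages) {
            heap.add(page);
            if (heap.size() > k) {
                heap.poll(); // evict the current lowest-ranked entry
            }
        }
        List<Map.Entry<String, Double>> result = new ArrayList<>(heap);
        result.sort(BY_RANK.reversed());
        return result;
    }

    /** Merges the per-mapper lists, as the single reducer would. */
    static List<Map.Entry<String, Double>> globalTopK(
            List<List<Map.Entry<String, Double>>> localTops, int k) {
        List<Map.Entry<String, Double>> all = new ArrayList<>();
        for (List<Map.Entry<String, Double>> local : localTops) {
            all.addAll(local);
        }
        return topK(all, k);
    }
}
```

With this design each mapper forwards at most k records, so the single reducer sees at most k times the number of mappers regardless of how many pages the input contains.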