Project author: guillaume6pl
Project description:
Computing PageRank with Hadoop MapReduce
Language: Python
Project URL: git://github.com/guillaume6pl/mr_pagerank.git
PageRank
Model
White Paper from Brin & Page
We describe the web according to the model defined by Brin & Page.
- Web = directed graph with n nodes (pages) and edges (links)
- Web surfer = moves from node to node by following links or by teleportation (damping factor)
When the web surfer is on a node "i", it
* [proba (1-d)] follows one of the n_i outgoing links of i, landing uniformly at random on one of the linked nodes (j_1, ..., j_k)
* [proba d] teleports uniformly at random among the n nodes (a toy simulation of this surfer step follows below)
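The following sketch is illustrative only: the toy graph, the function name and the step count are assumptions and are not part of this project.
# Toy simulation of the random-surfer model (illustrative only, not part of this repo)
import random

def surfer_step(i, out_links, n, d=0.15):
    """Return the next node visited by a surfer currently on node i."""
    if random.random() < d or not out_links[i]:
        # with probability d (or from a node without outgoing links), teleport uniformly
        return random.randrange(n)
    # with probability (1-d), follow one of the n_i outgoing links of i
    return random.choice(out_links[i])

# hypothetical 4-page web: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}, 3 -> {2}
out_links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
position = 0
for _ in range(10):
    position = surfer_step(position, out_links, n=4)
print("surfer ends on node", position)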
Algorithm
Variables
- n : number of pages in the web
- d : teleportation probability (damping factor) = 0.15
- n_i : number of outgoing links of page i
- tol : tolerance for convergence
- (future version) memory_size : minimum memory size of a node in the cluster
Description
Process
- (not done yet) Launch mapper0 to distribute values per range so that they fit in the memory of each node in the cluster
- with the test dataset, n < 10 000, so this step is not required
- Launch the mapper/reducer through Hadoop Streaming for each step until convergence (a single-machine sketch of this iteration follows below)
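The following sketch is illustrative only; NumPy, the 4-node transition matrix and the variable names are assumptions, while the update rule and the tolerance test follow the algorithm described in this document.
# Single-machine sketch of the PageRank iteration (illustrative; the real job runs on Hadoop)
import numpy as np

n, d, tol = 4, 0.15, 1e-6
# Hypothetical column-stochastic T: T[i, j] = 1/n_j if page j links to page i, else 0
T = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.5, 0.0, 0.0, 0.0],
              [0.5, 1.0, 0.0, 1.0],
              [0.0, 0.0, 0.0, 0.0]])
P = np.full(n, 1.0 / n)                    # P(0): uniform start vector
while True:
    P_next = (1 - d) * T.dot(P) + d / n   # P(s) = (1-d) x T x P(s-1) + d/n x U
    if np.abs(P_next - P).max() < tol:    # convergence test against tol
        break
    P = P_next
print(P_next)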
Architecture
MAPPER
INPUT DATA: file “inv_adj_list” on stdin with each line such as
* <node "i">: <j1, j2... -1>
* <key>: node "i"
* <value>: "incoming links" j1,...
- getting variables
- building P(s-1) vector from last line in “ps.txt”
- building outcoming_links_number from adj_list (outcoming_links_number: number of outgoing links "n_j" for each node j)
- building Tmat (the T matrix) from the input data
- computing (Tmat*P_previous_step)
- printing output data on stdout
OUTPUT DATA: on stdout with each line such as
- <key>\t<value>
- <key>: node "i" | <value>: Ti,j x Pj(s-1)
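Below is a hypothetical sketch of the mapper just described; it is not the repo's actual pagerank_mapper.py. The file locations (ps.txt in $L_OUTPUT_DIR, adj_list in $L_INPUT_DIR), the 0-based node ids and the exact line formats are assumptions.
#!/usr/bin/env python
# Hypothetical mapper sketch (illustrative; not the repo's pagerank_mapper.py)
# Reads the inverted adjacency list on stdin and emits "i \t T_ij * P_j(s-1)" lines.
import sys

def load_previous_p(ps_path):
    # P(s-1) is assumed to be the last line of ps.txt, whitespace-separated floats
    with open(ps_path) as f:
        return [float(x) for x in f.readlines()[-1].split()]

def load_out_degrees(adj_path):
    # n_j = number of outgoing links of node j, from the (non-inverted) adjacency list
    out_deg = {}
    with open(adj_path) as f:
        for line in f:
            if not line.strip():
                continue
            node, links = line.split(":")
            out_deg[int(node)] = len([t for t in links.replace(",", " ").split() if t != "-1"])
    return out_deg

def main(ps_path, adj_path):
    p_prev = load_previous_p(ps_path)
    out_deg = load_out_degrees(adj_path)
    for line in sys.stdin:
        if not line.strip():
            continue
        node, links = line.split(":")      # "<node i>: <j1, j2 ... -1>"
        i = int(node)
        for j in (int(t) for t in links.replace(",", " ").split() if t != "-1"):
            # T_ij = 1/n_j when j links to i; print one contribution per incoming link
            print("%d\t%f" % (i, p_prev[j] / out_deg[j]))

if __name__ == "__main__":
    local_in, local_out = sys.argv[1], sys.argv[2]   # $L_INPUT_DIR $L_OUTPUT_DIR
    main(local_out + "/ps.txt", local_in + "/adj_list")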
REDUCER
INPUT DATA: Mapper OUTPUT DATA
- setting variables
- building P(s-1) vector from last line in “ps.txt”
- getting input values on stdin as <key>\t<value>
- computing “TMat_Pvect_vector” by summing “TMat_Pvect_product” for each node “i”
- computing P(s) as P(s) = (1-d) x T x P(s-1) + d/n x U
- sending output data
OUTPUT DATA:
- if P did not converge: P(s) is appended to "/output/pagerank/ps.txt" as a new line
- if P converged:
- ordered pagerank list written to "/output/pagerank/pagerank.txt"
- top 20 highest pageranks printed on stdout
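Below is a hypothetical sketch of the reducer just described; it is not the repo's actual pagerank_reducer.py. The ps.txt format, the 0-based node ids, the file locations and the max-absolute-difference convergence test are assumptions.
#!/usr/bin/env python
# Hypothetical reducer sketch (illustrative; not the repo's pagerank_reducer.py)
import sys

def main(ps_path, pagerank_path, d=0.15, tol=1e-6):
    with open(ps_path) as f:
        p_prev = [float(x) for x in f.readlines()[-1].split()]   # P(s-1), last line of ps.txt
    n = len(p_prev)

    # TMat_Pvect_vector: sum the mapper contributions T_ij * P_j(s-1) per node i
    sums = [0.0] * n
    for line in sys.stdin:
        if not line.strip():
            continue
        key, value = line.split("\t")
        sums[int(key)] += float(value)

    # P(s) = (1-d) x T x P(s-1) + d/n x U
    p_next = [(1 - d) * s + d / n for s in sums]

    if max(abs(a - b) for a, b in zip(p_next, p_prev)) < tol:
        print("P(s) converged!")
        ranked = sorted(enumerate(p_next), key=lambda kv: kv[1], reverse=True)
        with open(pagerank_path, "w") as f:                      # ordered pagerank list
            for node, pr in ranked:
                f.write("%d\t%f\n" % (node, pr))
        for node, pr in ranked[:20]:                             # top 20 on stdout
            print("%d\t%f" % (node, pr))
        open(ps_path, "w").close()                               # clear ps.txt
    else:
        with open(ps_path, "a") as f:                            # append P(s) for the next step
            f.write(" ".join("%f" % x for x in p_next) + "\n")

if __name__ == "__main__":
    out_dir = sys.argv[2]                                        # $L_OUTPUT_DIR
    main(out_dir + "/ps.txt", out_dir + "/pagerank.txt",
         d=float(sys.argv[3]), tol=float(sys.argv[4]))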
Dataset
Instructions
Manually
- Download the dataset of your choice and put the 3 required files on HDFS at /input/pagerank/
- Start hadoop
$ start-dfs.sh
- Set variables
$ cd <path_to_scripts>
$ HADOOP_HOME=<path/to/hadoop>
$ HDFS_INPUT_DIR=<path/to/hdfs/input/dir>
$ HDFS_OUTPUT_DIR=<path/to/hdfs/output/dir>
$ L_INPUT_DIR=<path/to/local/input/dir>
$ L_OUTPUT_DIR=<path/to/local/output/dir>
$ SCRIPT_PATH=<path/to/scripts>
- Execute mapper/reducer until convergence (each execution is one step)
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
-input $HDFS_INPUT_DIR/inv_adj_list \
-output $HDFS_OUTPUT_DIR \
-mapper "$SCRIPT_PATH/pagerank_mapper.py $L_INPUT_DIR $L_OUTPUT_DIR" \
-reducer "$SCRIPT_PATH/pagerank_reducer.py $L_INPUT_DIR $L_OUTPUT_DIR <dumping_factor> <tolerance>"
With pagerank_launcher.sh
"pagerank_launcher.sh" is a bash script that allows you to:
- set the initial parameters
- automatically execute the mapper/reducer until convergence (a Python sketch of this loop follows below)
$ ./pagerank_launcher.sh
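For illustration, the driver loop could look roughly like the Python sketch below; the per-step HDFS output directory, the use of exported environment variables with subprocess, and checking part-00000 for the "converged" message are assumptions, and the actual pagerank_launcher.sh may differ.
# Hypothetical Python equivalent of the pagerank_launcher.sh loop (illustrative only)
import subprocess

def run_step(step):
    # One mapper/reducer step through Hadoop Streaming (command taken from the manual
    # instructions above; a per-step output directory is assumed because Hadoop refuses
    # to reuse an existing output directory)
    cmd = (
        "hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar "
        "-input $HDFS_INPUT_DIR/inv_adj_list "
        f"-output $HDFS_OUTPUT_DIR/step{step} "
        '-mapper "$SCRIPT_PATH/pagerank_mapper.py $L_INPUT_DIR $L_OUTPUT_DIR" '
        '-reducer "$SCRIPT_PATH/pagerank_reducer.py $L_INPUT_DIR $L_OUTPUT_DIR 0.15 1e-6"'
    )
    subprocess.run(cmd, shell=True, check=True)
    # The reducer prints "P(s) converged!" at convergence; look for it in part-00000
    out = subprocess.run(f"hadoop fs -cat $HDFS_OUTPUT_DIR/step{step}/part-00000",
                         shell=True, capture_output=True, text=True)
    return "converged" in out.stdout

step = 1
while not run_step(step):
    step += 1
print("converged after %d steps" % step)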
Output data
- reducer stdout on HDFS: $HDFS_OUTPUT_DIR/part-00000:
- not converged yet: nothing is displayed
- at convergence: "P(s) converged!" is printed, followed by the top 20 highest pageranks
- $L_OUTPUT_DIR/ps.txt:
- each line holds P(s), from s=1 (line 1) up to the last computed step
- the file is cleared once P has converged
- $L_OUTPUT_DIR/pagerank.txt:
- after convergence, holds the ordered list of pageranks for all nodes
- each line as: <node>\tPagerank[node]
Any feedback is welcome! Thanks for reading.