Project author: thongnguyen2410

Project description: Hadoop MapReduce project

Language: Java

Repository: git://github.com/thongnguyen2410/BD_MapRed_Project.git

Created: 2019-10-14T19:30:53Z

Project page: https://github.com/thongnguyen2410/BD_MapRed_Project

License: (not specified)

Big data - Map Reduce project

Part 1 (a) (b)

Set up a single-node cluster and, optionally, an Eclipse development environment to create and test your programs.

This project uses a Google Cloud VM to set up a Cloudera QuickStart Docker container as the Hadoop MapReduce development environment. The article @alipazaga07/big-data-as-a-service-get-easily-running-a-cloudera-quickstart-image-with-dockers-in-gcp-34d28aa7dad7 describes the setup in detail.

Below is how to connect to the Cloudera quick start container and check Hadoop services status:

  thongnguyen2410@small:~$ sudo docker ps
  CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
  194453c4f758 90568ffbcb7c "/usr/bin/docker-qui…" 2 days ago Up 2 days 0.0.0.0:90->80/tcp, 0.0.0.0:7190->7180/tcp, 0.0.0.0:8777->8888/tcp trusting_gagarin
  thongnguyen2410@small:~$ sudo docker exec -it 194453c4f758 bash
  [root@quickstart /]# service --status-all
  ...
  Hadoop datanode is running [ OK ]
  Hadoop journalnode is running [ OK ]
  Hadoop namenode is running [ OK ]
  Hadoop secondarynamenode is running [ OK ]
  Hadoop httpfs is running [ OK ]
  Hadoop historyserver is running [ OK ]
  Hadoop nodemanager is running [ OK ]
  ...

Part 1 (c)

Get WordCount (test run)

How to build and run

The script run.sh will:

  • run javac to compile the .java files, using -cp to point at the Hadoop libs (.jar files), e.g. in /usr/lib/hadoop
  • run jar to package the .class files into a .jar file
  • run hadoop fs -copyFromLocal to copy the test files in input/* to HDFS
  • run hadoop jar to execute the .jar file in pseudo-distributed mode (or java -jar in local mode)
  • run hadoop fs -cat to display the output

Usage of run.sh:

By default the script uses the Hadoop libs at /usr/lib/hadoop. If the Hadoop libs are at a different location in your environment, set the variable below before running the script:

  HADOOP_LIB_DIR=</path/to/lib/hadoop>

To run in pseudo distributed mode:

  Usage : ./run.sh <package_dir> <class_name> [numReduceTasks]
  Example: ./run.sh part1/c WordCount

Or to run in local mode:

  Usage : ./run.sh <package_dir> <class_name> local
  Example: ./run.sh part1/c WordCount local

WordCount output

  [cloudera@quickstart BD_MapRed_Project]$ ./run.sh part1/c WordCount 4
  ...
  ==================================================
  hadoop fs -cat /user/cloudera/input/*
  ==================================================
  one six three
  two three five
  two six four six five
  three six four
  four five five six
  four five six
  ==================================================
  hadoop fs -cat /user/cloudera/output/*
  ==================================================
  ==>/user/cloudera/output/_SUCCESS<==
  ==>/user/cloudera/output/part-r-00000<==
  ==>/user/cloudera/output/part-r-00001<==
  one 1
  six 6
  three 3
  ==>/user/cloudera/output/part-r-00002<==
  ==>/user/cloudera/output/part-r-00003<==
  five 5
  four 4
  two 2
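To make the map/reduce flow above concrete, here is a minimal plain-Java sketch of the word-count logic (hypothetical class name, no Hadoop dependencies); the real job implements the same two steps with Hadoop's Mapper and Reducer classes:

```java
import java.util.*;

// Plain-Java sketch of the WordCount map/reduce flow (illustrative only).
class WordCountSketch {
    // "map" phase: emit a (word, 1) pair for every token in every line
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.trim().split("\\s+"))
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        return pairs;
    }

    // "shuffle + reduce" phase: group pairs by key and sum the values
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }
}
```

With four reducers, as in the run above, each reducer produces one part-r-0000x file holding the counts for the keys hashed to its partition.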

Part 1 (d)

Modify WordCount to InMapperWordCount and test run

  ./run.sh part1/d InMapperWordCount
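In-mapper combining replaces the per-token (word, 1) emissions with a per-mapper buffer of partial counts that is flushed once at the end. A plain-Java sketch of that buffering (hypothetical class name; in the real job, map() and cleanup() are the Hadoop Mapper hooks):

```java
import java.util.*;

// Sketch of in-mapper combining: buffer counts in a HashMap and emit one
// (word, partialCount) pair per distinct word at cleanup time, cutting the
// number of intermediate pairs sent to the shuffle.
class InMapperCombiningSketch {
    private final Map<String, Integer> buffer = new HashMap<>();

    // called once per input line (like Mapper.map)
    void map(String line) {
        for (String word : line.trim().split("\\s+"))
            buffer.merge(word, 1, Integer::sum);
    }

    // called once after all lines (like Mapper.cleanup); returns the emitted pairs
    Map<String, Integer> cleanup() {
        return buffer;
    }
}
```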

Part 1 (e)

Average Computation Algorithm for Apache access log

  ./run.sh part1/e ApacheLogAvg
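The exact quantity being averaged is not stated here; the sketch below assumes the common textbook variant of this exercise: average response size (the last field of a Common Log Format line) per client IP (the first field). Class and field choices are illustrative, not the project's actual code:

```java
import java.util.*;

// Sketch of the averaging job, assuming (IP -> mean response size in bytes).
class ApacheLogAvgSketch {
    // map: parse one log line into an (ip, bytes) pair
    static Map.Entry<String, Long> map(String logLine) {
        String[] fields = logLine.trim().split("\\s+");
        String ip = fields[0];                                // client IP
        long bytes = Long.parseLong(fields[fields.length - 1]); // response size
        return new AbstractMap.SimpleEntry<>(ip, bytes);
    }

    // reduce: average all byte counts seen for one IP
    static double reduce(List<Long> sizes) {
        long sum = 0;
        for (long s : sizes) sum += s;
        return (double) sum / sizes.size();
    }
}
```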

Part 1 (f)

In-mapper combining version of Average Computation Algorithm for Apache access log

  ./run.sh part1/f InMapperApacheLogAvg
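For the in-mapper combining version, the mapper cannot buffer partial averages, because averages of averages are wrong when the group sizes differ; the standard fix is to buffer (sum, count) pairs and divide only once in the reducer. A plain-Java sketch with illustrative names:

```java
import java.util.*;

// Sketch of in-mapper combining for averages: buffer a (sum, count) pair per
// key so that partial results remain mergeable; divide only at the very end.
class InMapperAvgSketch {
    private final Map<String, long[]> buffer = new HashMap<>(); // key -> {sum, count}

    void map(String ip, long bytes) {
        long[] sc = buffer.computeIfAbsent(ip, k -> new long[2]);
        sc[0] += bytes;  // running sum
        sc[1] += 1;      // running count
    }

    Map<String, long[]> cleanup() { return buffer; }

    // reducer side: merge the (sum, count) pairs, then divide once
    static double reduce(List<long[]> pairs) {
        long sum = 0, count = 0;
        for (long[] p : pairs) { sum += p[0]; count += p[1]; }
        return (double) sum / count;
    }
}
```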

Part 2

Pairs algorithm to compute relative frequencies

  ./run.sh part2 RelativeFreqPair 2
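In the pairs approach, each co-occurring pair (a, b) is counted alongside a marginal pseudo-pair (a, *), so the relative frequency is f(b|a) = count(a, b) / count(a, *). The sketch below is plain Java (no Hadoop) and assumes, for illustration, that the neighbors of a word are the words following it on the same line:

```java
import java.util.*;

// Sketch of the pairs algorithm for relative frequencies.
class RelativeFreqPairSketch {
    // count every (a, b) pair plus the marginal pseudo-pair (a, *)
    static Map<String, Integer> countPairs(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            String[] w = line.trim().split("\\s+");
            for (int i = 0; i < w.length; i++)
                for (int j = i + 1; j < w.length; j++) {
                    counts.merge("(" + w[i] + "," + w[j] + ")", 1, Integer::sum);
                    counts.merge("(" + w[i] + ",*)", 1, Integer::sum); // marginal
                }
        }
        return counts;
    }

    // f(b|a) = count(a, b) / count(a, *)
    static double relFreq(Map<String, Integer> counts, String a, String b) {
        return (double) counts.get("(" + a + "," + b + ")")
             / counts.get("(" + a + ",*)");
    }
}
```

In the real MapReduce job, a custom sort order delivers (a, *) to the reducer before any (a, b), so the marginal is known when the division happens.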

Part 3

Stripes algorithm to compute relative frequencies

  ./run.sh part3 RelativeFreqStripe 2
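In the stripes approach, each word maps to a "stripe" (an associative array of neighbor counts); stripes for the same word are merged, then normalized by the stripe's row sum. A plain-Java sketch under the same neighbor assumption as above (words following on the same line):

```java
import java.util.*;

// Sketch of the stripes algorithm for relative frequencies.
class RelativeFreqStripeSketch {
    // build one stripe (neighbor -> count) per left word
    static Map<String, Map<String, Integer>> buildStripes(List<String> lines) {
        Map<String, Map<String, Integer>> stripes = new HashMap<>();
        for (String line : lines) {
            String[] w = line.trim().split("\\s+");
            for (int i = 0; i < w.length; i++)
                for (int j = i + 1; j < w.length; j++)
                    stripes.computeIfAbsent(w[i], k -> new HashMap<>())
                           .merge(w[j], 1, Integer::sum);
        }
        return stripes;
    }

    // normalize one stripe: divide each count by the row sum
    static Map<String, Double> normalize(Map<String, Integer> stripe) {
        int total = 0;
        for (int c : stripe.values()) total += c;
        Map<String, Double> freqs = new HashMap<>();
        for (Map.Entry<String, Integer> e : stripe.entrySet())
            freqs.put(e.getKey(), (double) e.getValue() / total);
        return freqs;
    }
}
```

Compared with pairs, stripes needs no special sort order (the row sum is computed from the stripe itself), at the cost of holding a whole stripe in memory.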

Part 4

Pairs in Mapper and Stripes in Reducer to compute relative frequencies

  ./run.sh part4 RelativeFreqPairStripe 2
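One way to combine the two schemes: the mapper emits individual pairs as in the pairs algorithm, and the reducer, which receives all pairs for one left word together, rebuilds that word's stripe and normalizes it. Whether this matches the project's exact wiring is an assumption; the sketch below shows only the reducer-side idea in plain Java:

```java
import java.util.*;

// Sketch of the hybrid reducer: given all right-hand words emitted for one
// left word `a`, rebuild the stripe for `a` and normalize it in one pass.
class PairStripeHybridSketch {
    static Map<String, Double> reduce(List<String> rightWords) {
        Map<String, Integer> stripe = new HashMap<>();
        for (String b : rightWords)
            stripe.merge(b, 1, Integer::sum);           // rebuild the stripe
        Map<String, Double> freqs = new HashMap<>();
        for (Map.Entry<String, Integer> e : stripe.entrySet())
            freqs.put(e.getKey(), (double) e.getValue() / rightWords.size());
        return freqs;                                   // f(b|a) per neighbor b
    }
}
```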

Part 5

Solve a MapReduce problem of your choice!

The problem of finding common Facebook friends is described here.

Assume the friends are stored as Person->[List of Friends], our friends list is then:

A -> B C D

B -> A C D E

C -> A B D E

D -> A B C E

E -> B C D

The result after reduction is:

(A B) -> (C D)

(A C) -> (B D)

(A D) -> (B C)

(B C) -> (A D E)

(B D) -> (A C E)

(B E) -> (C D)

(C D) -> (A B E)

(C E) -> (B D)

(D E) -> (B C)
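The reduction above follows the standard pair-key trick: for person P with friend list L, the map step emits the sorted pair (P, F) -> L for every F in L; each pair key then receives exactly two lists, and the reduce step intersects them. A plain-Java sketch of both steps (hypothetical class name, no Hadoop dependencies):

```java
import java.util.*;

// Sketch of the common-friends algorithm.
class FriendFindingSketch {
    // map: for each person, emit (sortedPair(person, friend) -> person's friend list)
    static Map<String, List<Set<String>>> map(Map<String, Set<String>> friends) {
        Map<String, List<Set<String>>> grouped = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : friends.entrySet())
            for (String f : e.getValue()) {
                String a = e.getKey(), b = f;
                String key = a.compareTo(b) < 0 ? "(" + a + "," + b + ")"
                                                : "(" + b + "," + a + ")";
                grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(e.getValue());
            }
        return grouped;
    }

    // reduce: intersect the lists received for one pair key
    static Set<String> reduce(List<Set<String>> lists) {
        Set<String> common = new TreeSet<>(lists.get(0));
        for (Set<String> l : lists.subList(1, lists.size()))
            common.retainAll(l);
        return common;
    }
}
```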

  ./run.sh part5 FriendFinding 2

  ==================================================
  hadoop fs -cat /user/cloudera/input/*
  ==================================================
  A B C D
  B A C D E
  C A B D E
  D A B C E
  E B C D
  ==================================================
  hadoop fs -cat /user/cloudera/output/*
  ==================================================
  ==>/user/cloudera/output/_SUCCESS<==
  ==>/user/cloudera/output/part-r-00000<==
  (A, B) [ D, C ]
  (A, D) [ B, C ]
  (B, C) [ D, E, A ]
  (B, E) [ D, C ]
  (C, D) [ E, A, B ]
  (D, E) [ B, C ]
  ==>/user/cloudera/output/part-r-00001<==
  (A, C) [ D, B ]
  (B, D) [ E, A, C ]
  (C, E) [ D, B ]