Project author: ShanLu1984

Project description: Udacity-Intro to Hadoop and MapReduce-Part 1
Language: Python
Project URL: git://github.com/ShanLu1984/Hadoop-and-MapReduce.git
Created: 2017-10-09T23:59:55Z
Project community: https://github.com/ShanLu1984/Hadoop-and-MapReduce

License:

Download


Udacity-Course-Projects-Intro-to-Hadoop-and-MapReduce

Introduction

Class website: https://classroom.udacity.com/courses/ud617. Cited from the homepage:
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing. Learn the fundamental principles behind it, and how you can use its power to make sense of your Big Data.

What’s inside:

Lesson 6 Projects:

Part One:

  • Data: purchases.txt, purchase records of different stores.
  • Code: mapper.py, reducer.py
  • Experiment results:

    | Quiz | Results |
    | --- | --- |
    | Sales per Category | Toys: 57463477.11, Consumer Electronics: 57452374.13 |
    | Highest Sale | Reno: 499.99, Toledo: 499.98, Chandler: 499.98 |
    | Total Sales | Number of sales: 4138476, total value of sales: 1034457953.26 |
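The Part One job can be sketched roughly as follows. The six-field, tab-separated layout of purchases.txt (date, time, store, category, cost, payment) is an assumption about the dataset, and the helper names are illustrative, not the repository's actual mapper.py/reducer.py:

```python
def map_line(line):
    """Emit a (category, cost) pair for one purchase record, or None if malformed.

    Assumes purchases.txt holds six tab-separated fields:
    date, time, store, category, cost, payment (field order is an assumption).
    In the real mapper.py this would be driven by a loop over sys.stdin,
    printing "key\tvalue" lines for Hadoop Streaming.
    """
    fields = line.strip().split("\t")
    if len(fields) != 6:
        return None
    return fields[3], fields[4]

def reduce_pairs(pairs):
    """Sum the cost per category.

    Relies on the pairs arriving sorted by key, which Hadoop Streaming's
    shuffle guarantees between mapper and reducer.
    """
    totals, current_key, current_sum = [], None, 0.0
    for key, value in pairs:
        if key != current_key:
            if current_key is not None:
                totals.append((current_key, current_sum))
            current_key, current_sum = key, 0.0
        current_sum += float(value)
    if current_key is not None:
        totals.append((current_key, current_sum))
    return totals
```

The same skeleton answers all three Part One quizzes by changing which field becomes the key and how the reducer folds the values (sum, max, or count).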

Part Two:

  • Data: access_log, web server log file from a public relations company whose clients were DVD distributors.
  • Code: mapper.py, reducer.py
  • Experiment results:

    | Quiz | Results |
    | --- | --- |
    | Hits to Page | /assets/js/the-associates.js: 2456 |
    | Hits from IP | 10.99.99.186: 6 |
    | Most Popular | File path: /assets/css/combined.css, number of occurrences: 117352 |
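The Part Two mapper's parsing step can be sketched as below. It assumes access_log uses the Apache Common Log Format; the regex and helper names are illustrative, not the repository's actual code:

```python
import re

# Matches the quoted request field of a Common Log Format line,
# e.g. "GET /assets/css/combined.css HTTP/1.1", and captures the path.
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP')

def extract_path(line):
    """Return the requested path from one log line, or None if it does not parse."""
    match = REQUEST_RE.search(line)
    return match.group(1) if match else None

def extract_ip(line):
    """Return the client IP, the first whitespace-separated field of the line."""
    fields = line.split()
    return fields[0] if fields else None
```

Emitting `extract_path(line)` as the key answers the page-hit quizzes, and `extract_ip(line)` the per-IP quiz; the reducer is a plain per-key counter in both cases.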

Installation

Example Use

  1. Download and run the virtual machine (includes datasets). The datasets are also available from:
  2. Upload the dataset "access_log" into Hadoop HDFS:
     `hadoop fs -put access_log myinput`
  3. Make the corresponding changes to mapper.py and reducer.py according to the quiz questions.
  4. To test your Python code, make a small test file:
     `head -100 ../data/access_log > testfile`
     and use a pipeline to test it:
     `cat testfile | ./mapper.py | sort | ./reducer2.py`
  5. Run a MapReduce job:
     `hs mapper.py reducer.py myinput myoutput`
  6. Read the results directly from the output file:
     `hadoop fs -cat myoutput/part-00000`
  7. If the output file is too big to read, save it locally:
     `hadoop fs -get myoutput/part-00000 mylocalfile.txt`
     and search for the answer to the quiz questions with the "grep" command:
     `grep "/assets/js/the-associates.js" mylocalfile.txt`
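The local test in step 4 mirrors what Hadoop Streaming itself does between mapper and reducer: map each line, sort by key, then reduce. As a minimal sketch of that `cat | mapper | sort | reducer` pipeline for a counting job (the toy mapper and sample lines below are made up for illustration):

```python
from itertools import groupby

def local_streaming(lines, mapper):
    """Simulate `cat file | mapper | sort | reducer` for a counting job.

    `mapper` maps a line to a key (or None to skip the line); the "reducer"
    counts occurrences per key, relying on the sort step exactly as a
    Hadoop Streaming reducer relies on the shuffle's key ordering.
    """
    keys = sorted(k for k in map(mapper, lines) if k is not None)
    return [(key, sum(1 for _ in group)) for key, group in groupby(keys)]

# Dry run on two fake log lines, keyed by the requesting IP (first field).
sample = [
    '10.0.0.1 - - [..] "GET /index.html HTTP/1.1" 200 100',
    '10.0.0.1 - - [..] "GET /about.html HTTP/1.1" 200 100',
]
print(local_streaming(sample, lambda line: line.split()[0]))
# → [('10.0.0.1', 2)]
```

Running the real mapper.py and reducer.py through the step-4 pipeline should agree with this simulation on small inputs, which makes it a quick sanity check before submitting the full job in step 5.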