项目作者: Banyc

项目描述 :
In C#. Master-Worker. From scratch. No Hadoop. Done.
高级语言: C#
项目地址: git://github.com/Banyc/MapReduce.git
创建时间: 2021-03-12T17:16:52Z
项目社区:https://github.com/Banyc/MapReduce

开源协议:

下载


MapReduce

  • Distributed systems.
  • Object-oriented programming.
  • Educational only.

How to run

  1. dotnet run -p src/MapReduce.Sample

How to use

Principle

map():

  1. part of object -> list<(key, value)>
  2. return list<(key, value)>

combine():

  1. hash<key, list<value>>
  2. foreach ((key,value) in list<(key, value)>)
  3. {
  4. hash<key, list<value>>[key].Add(value)
  5. }
  6. return hash<key, list<value>>

partition():

  1. hash<partitionIndex, hash<key, list<value>>>

reduce():

  1. hash<key, valueAggregated>
  2. foreach ((key,values) in hash<key, list<value>>)
  3. {
  4. foreach (value in values)
  5. {
  6. hash<key, valueAggregated>[key] += value
  7. }
  8. }
  9. // foreach (key,value) in other list<(key, value)>
  10. // omitted
  11. return hash<key, valueAggregated>

  • each intermediate file is a partition.
  • ith reducer take every ith partition in each mapper’s local disk.

Master Data Structure

  • class master
    • List<MapTask>
    • List<ReduceTask>
    • List<Worker>
  • relative data structures
    • enum state { idle, in-progress, completed }
      • idle:
        • task waiting to be scheduled.
        • the task is not done yet.
    • class MapTask { state, CompletedFile, ... }
    • class ReduceTask { state, CompletedFile, ... }
    • class CompletedFile { location, size }

Failure

  • worker failure
    • master pings worker.
      • no response in amount of time -> worker failed.
  • master failure
    • exception on user code.
    • master writes data structures in checkpoints periodically.
    • master gives the same task to a different worker.

Use Cases

  • When map worker completes a map task
    1. worker —-{file names}—> master.
    2. master saves file names to data structure.
  • When reduce worker completes a reduce task
    1. rename temp output file to final output file.
  • Task processing
    • worker
      1. The workers talk to the master via RPC.
      2. worker ask the master for a task
      3. worker read the task’s input from one or more files,
      4. worker executes the task,
      5. worker writes the task’s output to one or more files.

Partitioning

Assignment

Your job is to implement a distributed MapReduce, consisting of two programs, the master and the worker. There will be just one master process, and one or more worker processes executing in parallel. In a real system the workers would run on a bunch of different machines, but for this lab you’ll run them all on a single machine. The workers will talk to the master via RPC. Each worker process will ask the master for a task, read the task’s input from one or more files, execute the task, and write the task’s output to one or more files. The master should notice if a worker hasn’t completed its task in a reasonable amount of time (for this lab, use ten seconds), and give the same task to a different worker.