In C#. Master-Worker. From scratch. No Hadoop. Done.
Run the sample:

```sh
dotnet run -p src/MapReduce.Sample
```

- Program entry: `src/MapReduce.Sample/Program.cs`, `src/MapReduce.Sample/Playbook/WorkerHelper.cs`
- Example Map and Reduce functions: `src/MapReduce.Sample/Playbook/WordCount.cs`, `src/MapReduce.Sample/Playbook/InvertedIndex.cs`
The stages in pseudocode:

```text
map():
    // part of the input object -> list<(key, value)>
    return list<(key, value)>

combine():
    // list<(key, value)> -> hash<key, list<value>>
    foreach ((key, value) in list<(key, value)>)
    {
        hash<key, list<value>>[key].Add(value)
    }
    return hash<key, list<value>>
```
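To make the shapes concrete, here is a minimal C# sketch of a word-count map/combine pair. The class and method names are illustrative assumptions; the repo's real versions live in `src/MapReduce.Sample/Playbook/WordCount.cs`.

```csharp
using System;
using System.Collections.Generic;

static class WordCountSketch
{
    // map(): one chunk of input -> list<(key, value)>, here (word, 1).
    public static List<KeyValuePair<string, int>> Map(string chunk)
    {
        var pairs = new List<KeyValuePair<string, int>>();
        var separators = new[] { ' ', '\t', '\r', '\n' };
        foreach (var word in chunk.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            pairs.Add(new KeyValuePair<string, int>(word, 1));
        }
        return pairs;
    }

    // combine(): group the pairs locally into hash<key, list<value>>.
    public static Dictionary<string, List<int>> Combine(
        IEnumerable<KeyValuePair<string, int>> pairs)
    {
        var grouped = new Dictionary<string, List<int>>();
        foreach (var (key, value) in pairs)
        {
            if (!grouped.TryGetValue(key, out var values))
            {
                values = new List<int>();
                grouped[key] = values;
            }
            values.Add(value);
        }
        return grouped;
    }
}
```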
```text
partition():
    // hash<key, list<value>> -> hash<partitionIndex, hash<key, list<value>>>
```
```text
reduce():
    // hash<key, list<value>> -> hash<key, valueAggregated>
    foreach ((key, values) in hash<key, list<value>>)
    {
        foreach (value in values)
        {
            hash<key, valueAggregated>[key] += value
        }
    }
    // merging other mappers' list<(key, value)> omitted
    return hash<key, valueAggregated>
```
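A matching word-count reduce in C#, summing each key's values into one aggregate; the merge of pairs from other mappers is omitted, as in the pseudocode. Again a sketch with assumed names, not the repo's exact code:

```csharp
using System.Collections.Generic;

static class ReduceSketch
{
    // reduce(): hash<key, list<value>> -> hash<key, valueAggregated>
    public static Dictionary<string, int> Reduce(Dictionary<string, List<int>> grouped)
    {
        var aggregated = new Dictionary<string, int>();
        foreach (var (key, values) in grouped)
        {
            var sum = 0;
            foreach (var value in values)
            {
                sum += value;
            }
            aggregated[key] = sum; // word -> total occurrences
        }
        return aggregated;
    }
}
```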
The *i*-th reducer takes the *i*-th partition from each mapper's local disk.

The master's bookkeeping:

```text
class master
{
    List<MapTask>
    List<ReduceTask>
    List<Worker>
}

enum state { idle, in-progress, completed }

class MapTask       { state, CompletedFile, ... }
class ReduceTask    { state, CompletedFile, ... }
class CompletedFile { location, size }
```
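Spelled out as C# declarations, the master's state could look like the following; any field not named above (e.g. `Endpoint`) is an assumption for illustration:

```csharp
using System.Collections.Generic;

enum TaskState { Idle, InProgress, Completed }

class CompletedFile
{
    public string Location; // where the output file lives (worker-local path)
    public long Size;       // size in bytes
}

class MapTask
{
    public TaskState State = TaskState.Idle;
    public List<CompletedFile> CompletedFiles = new List<CompletedFile>(); // one per partition
}

class ReduceTask
{
    public TaskState State = TaskState.Idle;
    public CompletedFile CompletedFile; // this reducer's final output
}

class Worker
{
    public string Endpoint; // hypothetical RPC address
}

class Master
{
    public List<MapTask> MapTasks = new List<MapTask>();
    public List<ReduceTask> ReduceTasks = new List<ReduceTask>();
    public List<Worker> Workers = new List<Worker>();
}
```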
Partitioning uses `key.GetHashCode() % numPartitions`, where `numPartitions` is the number of reduce tasks. The *i*-th reduce task reads the *i*-th partition of the outputs of all mappers.
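A sketch of `partition()` under this rule. Note that `GetHashCode()` can return a negative number in C#, so the sketch masks the sign bit before taking the modulo; whether the repo does the same is an assumption:

```csharp
using System.Collections.Generic;

static class PartitionSketch
{
    // partition(): hash<key, list<value>> -> hash<partitionIndex, hash<key, list<value>>>
    public static Dictionary<int, Dictionary<string, List<int>>> Partition(
        Dictionary<string, List<int>> grouped, int numPartitions)
    {
        var partitions = new Dictionary<int, Dictionary<string, List<int>>>();
        foreach (var (key, values) in grouped)
        {
            // Mask the sign bit so the index is always non-negative.
            int index = (key.GetHashCode() & int.MaxValue) % numPartitions;
            if (!partitions.TryGetValue(index, out var bucket))
            {
                bucket = new Dictionary<string, List<int>>();
                partitions[index] = bucket;
            }
            bucket[key] = values;
        }
        return partitions;
    }
}
```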
Your job is to implement a distributed MapReduce, consisting of two programs: the master and the worker. There will be just one master process, and one or more worker processes executing in parallel. In a real system the workers would run on a bunch of different machines, but for this lab you'll run them all on a single machine. The workers will talk to the master via RPC.

Each worker process will ask the master for a task, read the task's input from one or more files, execute the task, and write the task's output to one or more files. The master should notice if a worker hasn't completed its task in a reasonable amount of time (for this lab, use ten seconds), and give the same task to a different worker.
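One way to meet the ten-second rule is to stamp each task when it is handed out and have the master periodically sweep in-progress tasks back to idle once they exceed the timeout. A minimal sketch of that policy, with names and structure as assumptions rather than the lab's required design:

```csharp
using System;
using System.Collections.Generic;

enum TaskState { Idle, InProgress, Completed } // same states as above

class TrackedTask
{
    public TaskState State = TaskState.Idle;
    public DateTime StartedAt; // stamped in UTC when handed to a worker
}

static class Scheduler
{
    static readonly TimeSpan Timeout = TimeSpan.FromSeconds(10);

    // Called periodically by the master: any task still in progress
    // after ten seconds is presumed lost and returned to the idle
    // pool, so the next worker that asks for work picks it up again.
    public static void ReclaimStragglers(IEnumerable<TrackedTask> tasks)
    {
        var now = DateTime.UtcNow;
        foreach (var task in tasks)
        {
            if (task.State == TaskState.InProgress && now - task.StartedAt > Timeout)
            {
                task.State = TaskState.Idle;
            }
        }
    }
}
```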