Project author: haodemon

Project description:
Set of Input Formats for Hadoop Streaming
Primary language: Java
Project URL: git://github.com/haodemon/HadoopStreaming.git
Created: 2019-01-24T15:49:09Z
Project community: https://github.com/haodemon/HadoopStreaming

License: MIT License


HadoopStreaming

Set of Input Formats for Hadoop Streaming.

These classes are specifically designed to let you pass a large number of small files into Hadoop Streaming.

A single map task can process up to the amount of input, in bytes, given by the formula:

    max(
      mapreduce.input.fileinputformat.split.minsize,
      min(mapreduce.input.fileinputformat.split.maxsize, dfs.blocksize)
    )
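
The formula above can be sketched in plain Java to get a feel for how many map tasks a directory of small files might need. This is only an illustration, not code from this project: the class and method names are hypothetical, the sizes in `main` are example values rather than your cluster's settings, and the greedy packing is a simplification (Hadoop's own combining input formats also take node and rack locality into account, and split files larger than the limit).

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of the split-size formula; not taken from the project's source. */
final class SplitSizeSketch {

    /** Upper bound, in bytes, on the input one map task processes. */
    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /**
     * Estimates the number of map tasks by greedily packing files, in order,
     * into splits of at most splitSize bytes. Simplifying assumption: files
     * are never split and locality is ignored.
     */
    static int estimateMapTasks(List<Long> fileSizes, long splitSize) {
        int tasks = 0;
        long current = 0;
        for (long size : fileSizes) {
            // Close the current split if this file would push it over the limit.
            if (current > 0 && current + size > splitSize) {
                tasks++;
                current = 0;
            }
            current += size;
        }
        // Count the last, partially filled split.
        return current > 0 ? tasks + 1 : tasks;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        // Example values only: minsize = 1 B, maxsize = 256 MiB, blocksize = 128 MiB.
        long split = splitSize(1L, 256 * mib, 128 * mib);
        System.out.println("split size (bytes): " + split);

        // 1000 small files of 512 KiB each.
        List<Long> files = Collections.nCopies(1000, 512 * 1024L);
        System.out.println("estimated map tasks: " + estimateMapTasks(files, split));
    }
}
```

With these example values, the 1000 files fit into a handful of map tasks instead of one task per file, which is the point of combining small inputs.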

Reading Avro:

    $ hadoop jar <hadoop-streaming-jar> \
        -libjars streaming-1.0.jar \
        -inputformat com.haodemon.streaming.avro.CombinedAvroInputFormat \
        -input <input> \
        -output <output> \
        -mapper <mapper> \
        -reducer <reducer>

Reading SequenceFiles:

    $ hadoop jar <hadoop-streaming-jar> \
        -libjars streaming-1.0.jar \
        -inputformat com.haodemon.streaming.sequence.CombinedSequenceInputFormat \
        -input <input> \
        -output <output> \
        -mapper <mapper> \
        -reducer <reducer>
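
Both commands take a `<mapper>`, which in Hadoop Streaming is any executable that reads records from standard input and writes tab-separated key/value lines to standard output. The sketch below shows what a minimal Java mapper might look like. It is an assumption-laden example, not part of this project: the class name `RecordCountMapper` is hypothetical, and it assumes each record arrives as one line of text (how these input formats render Avro or SequenceFile records as text is not described here).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical streaming mapper: counts the records seen by one map task. */
final class RecordCountMapper {

    /**
     * Reads records line by line (assumed: one record per line) and emits a
     * single "key<TAB>value" pair once the input is exhausted.
     */
    static void run(BufferedReader in, PrintStream out) throws IOException {
        long records = 0;
        String line;
        while ((line = in.readLine()) != null) {
            if (!line.isEmpty()) {
                records++;
            }
        }
        out.println("records\t" + records);
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        run(in, System.out);
    }
}
```

A reducer that sums the emitted values would then give the total record count. Because many small files are combined into one map task, a mapper like this emits one count per combined split rather than one per file.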

To build the package:

    $ mvn package