Project author: davideanastasia

Project description:
Getting Started with Apache Beam: inverted index
Language: Java
Project URL: git://github.com/davideanastasia/apache-beam-getting-started.git


Getting Started with Apache Beam

This is a 3-2-1-go project on how to get started with Apache Beam.

Inverted Index

More on this on Medium: https://medium.com/@davide.anastasia/getting-started-with-apache-beam-26bfc5126438

The idea behind this simple batch job is to create an inverted index: given a set of documents in text format, the job parses them and builds a word -> location mapping for each word.
The job is an interesting toy, as it shows how to:

  • read data + file name (slightly different than using TextIO)
  • filter out common stop words (in a very naive way, but more interesting ways can be found!)
  • create a CombineFn in order to avoid streaming all the data for a single key to a single node
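
The last two points can be sketched in plain Java, without the Beam dependency. Beam's Combine.CombineFn is built around four steps (createAccumulator, addInput, mergeAccumulators, extractOutput), and the hypothetical LocationsCombiner below mirrors that shape so the idea is visible on its own: each worker pre-combines the file names it has seen for a word, and only the small partial sets are merged afterwards. The class names, the stop-word list, and the tokenizer are assumptions made for illustration, not the project's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java sketch with no Beam dependency; every name here is hypothetical.
class InvertedIndexSketch {

    // Naive stop-word list (an assumption; the project's real list is not shown here).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to", "in", "is"));

    // Split a line into lower-case words and drop stop words.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                words.add(token);
            }
        }
        return words;
    }

    // Mirrors the four steps of a Beam CombineFn.
    // The accumulator is the sorted set of file names seen so far for one word.
    static final class LocationsCombiner {
        Set<String> createAccumulator() {
            return new TreeSet<>();
        }

        Set<String> addInput(Set<String> accumulator, String fileName) {
            accumulator.add(fileName);
            return accumulator;
        }

        // Partial sets built independently (for example on different workers) are merged.
        Set<String> mergeAccumulators(Iterable<Set<String>> accumulators) {
            Set<String> merged = createAccumulator();
            for (Set<String> partial : accumulators) {
                merged.addAll(partial);
            }
            return merged;
        }

        List<String> extractOutput(Set<String> accumulator) {
            return new ArrayList<>(accumulator);
        }
    }

    public static void main(String[] args) {
        LocationsCombiner combiner = new LocationsCombiner();

        // Two simulated workers pre-combine the locations of the word "beam".
        Set<String> worker1 = combiner.addInput(combiner.createAccumulator(), "b.txt");
        Set<String> worker2 = combiner.addInput(combiner.createAccumulator(), "a.txt");
        combiner.addInput(worker2, "b.txt");

        List<String> locations =
                combiner.extractOutput(combiner.mergeAccumulators(Arrays.asList(worker1, worker2)));

        System.out.println(tokenize("The index of Beam!"));
        System.out.println("beam -> " + locations);
    }
}
```

In an actual Beam pipeline the same four methods would live in a subclass of Combine.CombineFn and be applied per key, with Beam deciding where the partial accumulators are built and merged.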

References