Project author: laidasani

Project description:
Implemented the SON Algorithm using the Apache Spark Framework to find frequent itemsets. Used the A-Priori Algorithm to process each chunk of the data.
Language: Python
Repository: git://github.com/laidasani/Finding-Frequent-Itemset.git
Created: 2020-06-18T05:58:41Z
Project community: https://github.com/laidasani/Finding-Frequent-Itemset

Finding-Frequent-Itemset

Overview

In this project, I implemented the SON algorithm using the Apache Spark framework to find frequent itemsets.

One of the major tasks is to find all frequent itemsets in a given input file using the A-Priori algorithm. The project runs the SON algorithm on two different datasets: one simulated dataset and one generated from real-world data.

Apart from the input file, two separate inputs are provided:
Filter threshold: an integer used to filter out qualified users
Support: an integer giving the minimum count an itemset needs to qualify as frequent
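
As a rough illustration of how the filter threshold might be applied, the sketch below groups hypothetical (user, item) pairs into one basket per user and keeps only users with more than `filter_threshold` distinct items. The input layout, the function name `build_baskets`, and the strict "more than" comparison are all assumptions made for this example; the project's own input format and rule may differ.

```python
from collections import defaultdict


def build_baskets(pairs, filter_threshold):
    """Group (user, item) pairs into baskets and keep the qualified users.

    Assumption: a user is "qualified" when their basket holds more than
    filter_threshold distinct items.
    """
    items_by_user = defaultdict(set)
    for user, item in pairs:
        # A set drops duplicate (user, item) records automatically.
        items_by_user[user].add(item)
    return {
        user: items
        for user, items in items_by_user.items()
        if len(items) > filter_threshold
    }


# Hypothetical records, for illustration only.
pairs = [("u1", "a"), ("u1", "b"), ("u1", "a"), ("u2", "a")]
baskets = build_baskets(pairs, filter_threshold=1)
```

Here only `u1` survives, because its basket has two distinct items while `u2` has one.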

The steps for finding frequent itemsets include:

1) Finding the candidate itemsets (singletons, pairs, triples, etc.) that may qualify as frequent given a support threshold (that maps to a frequent bucket).

2) Counting the candidate itemsets (singletons, pairs, triples, etc.) to determine which are actually frequent given the support threshold.
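
The two steps above can be sketched in plain Python, with a list of lists standing in for the data chunks. This is a minimal sketch under stated assumptions, not the project's Spark code: the function names `apriori` and `son` are hypothetical, and the per-chunk threshold is assumed to be the global support scaled by the chunk's share of the baskets (rounded up), which is the usual SON choice. In a Spark job each chunk would typically be an RDD partition.

```python
from collections import Counter
from itertools import combinations


def apriori(baskets, min_count):
    """Return the itemsets (frozensets) found in at least min_count baskets."""
    baskets = [frozenset(b) for b in baskets]
    # Level 1: count single items.
    counts = Counter(item for b in baskets for item in b)
    current = {frozenset([i]) for i, n in counts.items() if n >= min_count}
    frequent = set(current)
    k = 2
    while current:
        # Build size-k candidates from items that are still frequent and keep
        # only those whose (k-1)-subsets are all frequent (monotonicity).
        items = sorted(set().union(*current))
        candidates = [
            frozenset(c)
            for c in combinations(items, k)
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        ]
        counts = Counter(c for b in baskets for c in candidates if c <= b)
        current = {c for c, n in counts.items() if n >= min_count}
        frequent |= current
        k += 1
    return frequent


def son(chunks, support):
    """Two-pass SON over a list of chunks; returns (candidates, frequent)."""
    total = sum(len(chunk) for chunk in chunks)
    # Step 1: every chunk proposes candidates using a scaled-down threshold.
    candidates = set()
    for chunk in chunks:
        if not chunk:
            continue
        # Integer ceiling of support * len(chunk) / total.
        local_support = -(-support * len(chunk) // total)
        candidates |= apriori(chunk, local_support)
    # Step 2: count each candidate over all baskets and keep the true frequents.
    counts = Counter()
    for chunk in chunks:
        for basket in chunk:
            b = frozenset(basket)
            counts.update(c for c in candidates if c <= b)
    return candidates, {c for c in candidates if counts[c] >= support}


# Hypothetical toy data: two chunks of three baskets each.
chunks = [
    [["a", "b", "c"], ["a", "b"], ["a", "c"]],
    [["b", "c"], ["a", "b", "c"], ["d"]],
]
candidates, frequent = son(chunks, support=4)
```

With this toy data, step 1 proposes the three singletons and the three pairs as candidates, and step 2 keeps only the singletons, since each pair appears in just three baskets. An itemset that is frequent overall must reach the scaled threshold in at least one chunk, so step 1 produces no false negatives and step 2 only needs to remove false positives.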

The code is optimized to run in under 500 seconds with support 50 and filter threshold 20. The printed itemsets are sorted in lexicographical order.
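
One way to produce such an ordering is sketched below. It assumes items are strings, that the items inside each itemset are sorted first, and that itemsets are grouped by size before being compared; the grouping by size and the helper name `sort_itemsets` are assumptions for illustration, not something the project description specifies.

```python
def sort_itemsets(itemsets):
    """Sort itemsets by size, then lexicographically within each size."""
    # Sort the items inside each itemset so the tuples compare consistently.
    as_tuples = [tuple(sorted(itemset)) for itemset in itemsets]
    # Tuples compare element by element, which gives lexicographical order.
    return sorted(as_tuples, key=lambda t: (len(t), t))


# Hypothetical itemsets, for illustration only.
ordered = sort_itemsets([{"b", "a"}, {"c"}, {"a"}, {"a", "c"}])
# ordered == [("a",), ("c",), ("a", "b"), ("a", "c")]
```

Because the comparison is on strings, an item such as "10" sorts before "2"; numeric identifiers would need to be converted first if numeric order were wanted.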