Project author: ebtelmarz

Project description:
MapReduce implementation of LSH Ensemble
Language: Python
Repository: git://github.com/ebtelmarz/big_data_lsh_ensemble.git
Created: 2020-06-18T15:50:40Z
Project page: https://github.com/ebtelmarz/big_data_lsh_ensemble


LSH Ensemble

This is an assignment for the Big Data course at Roma Tre University.

This repo is based on the work reported in the paper LSH Ensemble: Internet-Scale Domain Search.
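For readers new to the paper, the sketch below illustrates the containment score that domain search is built around, and how a containment threshold maps to a Jaccard threshold once an upper bound on candidate set size is known. It is a plain-Python illustration under stated assumptions, not this repository's MapReduce code; the function names (containment, jaccard, jaccard_threshold) are hypothetical and chosen only for this example.

```python
def containment(query, candidate):
    """Fraction of the query's values that also appear in the candidate."""
    query, candidate = set(query), set(candidate)
    if not query:
        return 0.0
    return len(query & candidate) / len(query)


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def jaccard_threshold(containment_threshold, query_size, size_upper_bound):
    """Jaccard threshold implied by a containment threshold.

    Assumes every candidate has at most size_upper_bound values, so the
    returned value is a lower bound on the Jaccard similarity of any
    candidate that meets the containment threshold.
    """
    t = containment_threshold
    return t / (size_upper_bound / query_size + 1.0 - t)


if __name__ == "__main__":
    q = {1, 2, 3, 4}
    x = {3, 4, 5, 6, 7, 8}
    print(containment(q, x), jaccard(q, x), jaccard_threshold(0.5, len(q), len(x)))
```

When the upper bound equals the candidate's actual size, the converted threshold coincides with the true Jaccard similarity. A looser bound gives a lower threshold, which is why grouping sets of similar size keeps the conversion tight.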

Requirements

To run this project you need:

  • Python 3.6.9
  • Hadoop 3.2.1
  • Spark 3.0.0
  • pip3 installed on your machine. To install pip3, run the following commands in a shell:
    1. sudo apt update
    2. sudo apt install python3-pip
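Before running anything, it can help to confirm that the required tools are reachable. The following is a minimal sketch of such a check, not part of the repository: it assumes the Hadoop and Spark launchers are exposed on PATH under the names hadoop and spark-submit, and the helper names (missing_tools, python_is_supported) are hypothetical.

```python
import shutil
import sys

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = ("hadoop", "spark-submit", "pip3")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_is_supported(version_info=sys.version_info, minimum=(3, 6)):
    """True if the interpreter is at least `minimum` (major, minor)."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    if not python_is_supported():
        print("Python 3.6 or newer is expected, found", sys.version.split()[0])
    for tool in missing_tools():
        print("Not found on PATH:", tool)
```

The script only reports what is missing; it does not verify that the installed Hadoop and Spark versions match the ones listed above.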

Usage

To run the project locally

Start Hadoop: open a shell and run

  1. $HADOOP_HOME/sbin/start-dfs.sh

Download this repo or clone it by running

  1. git clone https://github.com/ebtelmarz/big_data_lsh_ensemble.git

Move inside the downloaded directory

  1. cd big_data_lsh_ensemble/

Execute the run.sh script from a shell by running

  1. sh run.sh

To run the project on a cluster

Create and activate a virtual environment

  1. python3 -m venv my_env
  2. source my_env/bin/activate

Execute the run.sh script by running

  1. sh run.sh