Project author: ikeratzakis

Project description:
Algorithms for duplicate document and question detection/classification, implemented as part of a project
Language: Python
Project address: git://github.com/ikeratzakis/duplicate-detection.git
Created: 2020-11-23T08:06:36Z
Project community: https://github.com/ikeratzakis/duplicate-detection

License: GNU General Public License v3.0


duplicate-detection

Algorithms for duplicate document and question detection/classification, implemented as part of a project. The file de_duplication.py contains three functions that compute the exact cosine distance, the exact Jaccard distance, and an approximate Jaccard distance via MinHash LSH, respectively. SpaCy is used for the exact cosine distance and datasketch for MinHash LSH; Jaccard is implemented from scratch, probably in a less than optimal fashion. The functions cosine_naive and jaccard_naive are essentially brute force and take a very long time to run, but that is on purpose: the main objective of this file is to compare the runtimes of brute force against faster (but less accurate) alternatives like MinHash LSH. To that end, three dictionaries with statistics are created and printed.
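
As a rough illustration of the three approaches (not the repository's exact code: only cosine_naive and jaccard_naive are named above, and the LSH helper's name, signatures and thresholds here are assumptions), the brute-force functions compare every test question against every training question, while the MinHash LSH version indexes the training set once and only queries candidate buckets:

```python
# Sketch only: names other than cosine_naive/jaccard_naive and all signatures
# are illustrative assumptions, not the repository's API.
import spacy
from datasketch import MinHash, MinHashLSH

nlp = spacy.load("en_core_web_md")  # assumes a SpaCy model with word vectors is installed


def cosine_naive(train_docs, test_docs, threshold=0.8):
    """Brute force: compare every test question to every training question
    using SpaCy's vector-based (cosine) similarity."""
    train_parsed = [nlp(doc) for doc in train_docs]
    duplicates = {}
    for i, doc in enumerate(test_docs):
        parsed = nlp(doc)
        duplicates[i] = [j for j, train in enumerate(train_parsed)
                         if parsed.similarity(train) >= threshold]
    return duplicates


def jaccard_naive(train_docs, test_docs, threshold=0.8):
    """Brute force exact Jaccard similarity on whitespace-token sets."""
    train_sets = [set(doc.split()) for doc in train_docs]
    duplicates = {}
    for i, doc in enumerate(test_docs):
        test_set = set(doc.split())
        matches = []
        for j, train_set in enumerate(train_sets):
            union = test_set | train_set
            if union and len(test_set & train_set) / len(union) >= threshold:
                matches.append(j)
        duplicates[i] = matches
    return duplicates


def minhash_lsh(train_docs, test_docs, threshold=0.8, num_perm=128):
    """Approximate Jaccard with datasketch: index the training set once, then
    each query only inspects candidate buckets instead of every document."""
    def signature(doc):
        m = MinHash(num_perm=num_perm)
        for token in doc.split():
            m.update(token.encode("utf8"))
        return m

    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    for j, doc in enumerate(train_docs):
        lsh.insert(str(j), signature(doc))
    return {i: lsh.query(signature(doc)) for i, doc in enumerate(test_docs)}
```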

Dataset

The dataset used is deliberately large, to showcase how well LSH scales to big data. More specifically, there are 531990 questions in the training set, and the objective is to see whether any of the 5374 questions in the test set are duplicates. The dataset was provided as part of a Big Data Mining course in the Department of Informatics and Telecommunications, University of Athens, so I'm not sure if I have permission to upload it just yet. However, by simply changing the file names that the load_data function loads, the script can be run on any data consisting of two sets of documents.
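
A minimal sketch of what such a load_data function might look like is below; the repository's actual file format and column names are not documented here, so the CSV layout is purely an assumption:

```python
# Hypothetical only: the real load_data may read a different format. This sketch
# assumes two CSV files with one question per row in a "Question" column.
import csv


def load_data(train_path="train_set.csv", test_path="test_set.csv"):
    """Return (train_docs, test_docs); change the file names to run on other data."""
    def read_questions(path):
        with open(path, newline="", encoding="utf-8") as f:
            return [row["Question"] for row in csv.DictReader(f)]

    return read_questions(train_path), read_questions(test_path)
```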

Possible optimizations

Instead of running the algorithms directly on sets of strings, a vectorizer like the ones from sklearn (TfIdf and Count) could be used to speed up computations. However, the dataset is so massive that it can't be processed this way without running out of memory on my machine (8GB RAM). The brute force approaches could be improved as well, because the cosine_naive function takes a very long time to run, indicating a suboptimal implementation.
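
One possible direction (not implemented in the repository; the function name and parameters below are illustrative) is to build a single sparse TF-IDF matrix with sklearn and compute cosine similarities in small chunks, which keeps the comparison vectorised while bounding the size of the dense similarity block held in memory:

```python
# Illustrative sketch, not repository code: names and parameters are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def cosine_tfidf(train_docs, test_docs, threshold=0.8, chunk_size=100):
    """Vectorised cosine similarity on sparse TF-IDF vectors, processing the
    test set in chunks so only a small dense block exists at any time."""
    vectorizer = TfidfVectorizer()
    train_matrix = vectorizer.fit_transform(train_docs)  # sparse, memory-friendly
    test_matrix = vectorizer.transform(test_docs)
    duplicates = {}
    for start in range(0, test_matrix.shape[0], chunk_size):
        sims = cosine_similarity(test_matrix[start:start + chunk_size], train_matrix)
        for offset, row in enumerate(sims):
            duplicates[start + offset] = [j for j, s in enumerate(row) if s >= threshold]
    return duplicates
```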