Project author: petrvecera

Project description:
School project for the Parallel Algorithms course. Text similarity, written in Python using MPI.
Language: Python
Repository: git://github.com/petrvecera/pai.git
Created: 2016-01-22T03:14:54Z
Project page: https://github.com/petrvecera/pai


PA I

School project for the Parallel Algorithms course. It computes text similarity and is written in Python using MPI.

Installation

  1. Python 3
  2. mpi4py: pip install mpi4py
  3. An MPI implementation (mpiexec has to be available on the PATH)
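
A quick way to confirm the three prerequisites above is a small check script. This is only a sketch and not part of the project: the function name check_prerequisites is hypothetical, and it relies solely on the standard library (shutil.which to look for mpiexec, importlib.util.find_spec to look for mpi4py without importing it).

```python
import importlib.util
import shutil
import sys


def check_prerequisites():
    """Return a dict mapping each prerequisite name to True or False.

    Hypothetical helper, not shipped with the project.
    """
    return {
        # The interpreter running this script must be Python 3.
        "python3": sys.version_info.major >= 3,
        # find_spec locates the package without actually importing it.
        "mpi4py": importlib.util.find_spec("mpi4py") is not None,
        # which() returns None when the executable is not on the PATH.
        "mpiexec": shutil.which("mpiexec") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

If any entry is reported as MISSING, the MPI run described under Usage will most likely fail to start.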

Usage

  1. Enter the paths of the text files to compare into the file list (for example filelist.txt)
  2. For a single-threaded run: python mainsp.py filelist.txt
  3. For an MPI run: mpiexec -n 5 python mainmpi.py filelist.txt
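
To illustrate what running under mpiexec -n 5 can look like in code, here is a minimal sketch of one way to split the document pairs across MPI ranks with mpi4py. It is an assumption for illustration only: the README does not describe how mainmpi.py divides the work, and the helper pairs_for_rank and the round-robin scheme are hypothetical. The sketch falls back to a single process when mpi4py cannot be imported.

```python
from itertools import combinations


def pairs_for_rank(n_docs, rank, size):
    """Hypothetical round-robin split of the unique document pairs (i, j), i < j.

    Rank r takes every size-th pair starting at position r.
    """
    all_pairs = list(combinations(range(n_docs), 2))
    return all_pairs[rank::size]


def main():
    try:
        from mpi4py import MPI  # available after "pip install mpi4py"
        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()
    except ImportError:
        # Assumption: degrade to one process so the sketch still runs.
        comm, rank, size = None, 0, 1

    mine = pairs_for_rank(4, rank, size)
    # gather() returns a list with one entry per rank on the root, None elsewhere.
    gathered = comm.gather(mine, root=0) if comm is not None else [mine]
    if rank == 0:
        for r, pairs in enumerate(gathered):
            print(f"rank {r}: {pairs}")


if __name__ == "__main__":
    main()
```

Saved under a name such as split_demo.py (hypothetical), it can be started the same way as the project script: mpiexec -n 5 python split_demo.py.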

Overview of the algorithm: TF-IDF and cosine similarity

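  1. Compute the term frequency (TF) of each word in each document
  2. Compute the inverse document frequency (IDF) of every word across all documents
  3. Build a TF*IDF vector for each document
  4. Compute the cosine similarity between the document vectors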
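
The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not the project's code: all function names are hypothetical, tokenization is a naive lowercase split, and the IDF uses the unsmoothed log(N / df) variant, which may differ from the weighting the project actually applies.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """Step 1: TF = occurrences of a word divided by the document length."""
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {word: n / len(tokens) for word, n in counts.items()}


def inverse_document_frequencies(docs_tokens):
    """Step 2: IDF = log(N / number of documents containing the word)."""
    n_docs = len(docs_tokens)
    doc_counts = Counter(w for tokens in docs_tokens for w in set(tokens))
    return {word: math.log(n_docs / df) for word, df in doc_counts.items()}


def tfidf_vector(tf, idf):
    """Step 3: sparse TF*IDF vector stored as a word -> weight dict."""
    return {word: weight * idf[word] for word, weight in tf.items()}


def cosine_similarity(a, b):
    """Step 4: dot(a, b) / (|a| * |b|), defined as 0.0 for a zero vector."""
    dot = sum(w * b.get(word, 0.0) for word, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up sample texts, not the files from the example output.
    texts = [
        "the ring of power",
        "the ring and the fellowship",
        "spice of the desert",
    ]
    docs = [t.lower().split() for t in texts]
    idf = inverse_document_frequencies(docs)
    vectors = [tfidf_vector(term_frequencies(d), idf) for d in docs]
    for i, a in enumerate(vectors, start=1):
        for j, b in enumerate(vectors, start=1):
            print(f"Text {i}-{j}: {cosine_similarity(a, b):.3f}")
```

With this IDF variant, a word that occurs in every document gets weight 0 and so contributes nothing to the similarity.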

Example output

  Input file not specified trying with default:
  Usage: file.py filelist.txt
  Data load time --- 0.040 seconds ---
  Loaded files:
  File 1: lotr1.txt
  File 2: lotr2.txt
  File 3: twk.txt
  File 4: Dune.txt
  Text 1-1: 1.000
  Text 1-2: 0.991
  Text 1-3: 0.904
  Text 1-4: 0.906
  Text 2-1: 0.991
  Text 2-2: 1.000
  Text 2-3: 0.906
  Text 2-4: 0.904
  Text 3-1: 0.904
  Text 3-2: 0.906
  Text 3-3: 1.000
  Text 3-4: 0.903
  Text 4-1: 0.906
  Text 4-2: 0.904
  Text 4-3: 0.903
  Text 4-4: 1.000
  TF compute time --- 0.996 seconds ---
  IDF compute time --- 0.066 seconds ---
  TF*IDF compute time --- 0.029 seconds ---
  Cos Sim compute time --- 0.171 seconds ---
  Complete compute time --- 1.313 seconds ---

Sources

https://en.wikipedia.org/wiki/Tf%E2%80%93idf
https://en.wikipedia.org/wiki/Cosine_similarity
https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/