Project author: charulatalodha

Project description:
Data compression (Golomb codes), index creation, and a document search tool using the vector space model, BM25, and Max Score heuristic algorithms
Language: Java
Repository: git://github.com/charulatalodha/InformationRetrieval.git
Created: 2019-10-20T15:46:13Z
Project community: https://github.com/charulatalodha/InformationRetrieval

License:


InformationRetrieval-

1. Document Search Tool using Vector Space Model (VSM scores)

  1. This project builds an inverted index that supports Boolean queries for information retrieval.
  2. The code ranks results according to the vector space model (VSM).
  3. VSM scores are computed using all the terms in the query; for the example below these are: good, dog, bad, cat.
  4. An example of running the program might be:
     $ java PositiveRank my_corpus.txt 5 "_OR _AND good dog _AND bad cat"
  5. Files used: PositiveRank.java, InvertedIndexer.java
  6. The output is a header line reading DocId Score, followed by num_results lines giving this information for the top num_results documents. For example:
     DocId Score
     7 .65
     2 .51
     3 .23
     11 .0012
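
To make the ranking step concrete, here is a minimal sketch of VSM scoring. It is not the code in PositiveRank.java: the class and method names (VsmSketch, weight, scores) are hypothetical, and the weighting scheme (log-scaled TF-IDF with cosine normalization) is an assumption, since the README does not say which variant the project uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; names and weighting scheme are assumptions.
final class VsmSketch {

    /** Counts how often each term occurs in a token list. */
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokens) {
            tf.merge(t, 1, Integer::sum);
        }
        return tf;
    }

    /** Log-scaled TF-IDF weight; 0 when the term is absent. */
    static double weight(int tf, int df, int numDocs) {
        if (tf == 0 || df == 0) {
            return 0.0;
        }
        return (1.0 + Math.log(tf)) * Math.log((double) numDocs / df);
    }

    /** Returns the cosine score of the query against each document, by document position. */
    static double[] scores(List<List<String>> docs, List<String> query) {
        int n = docs.size();
        List<Map<String, Integer>> tfs = new ArrayList<>();
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            Map<String, Integer> tf = termFrequencies(doc);
            tfs.add(tf);
            for (String term : tf.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }

        // Build the query vector with the same weighting as the documents.
        Map<String, Double> queryWeights = new HashMap<>();
        double queryNormSq = 0.0;
        for (Map.Entry<String, Integer> e : termFrequencies(query).entrySet()) {
            double w = weight(e.getValue(), df.getOrDefault(e.getKey(), 0), n);
            queryWeights.put(e.getKey(), w);
            queryNormSq += w * w;
        }

        double[] result = new double[n];
        for (int d = 0; d < n; d++) {
            double dot = 0.0;
            double docNormSq = 0.0;
            for (Map.Entry<String, Integer> e : tfs.get(d).entrySet()) {
                double w = weight(e.getValue(), df.get(e.getKey()), n);
                docNormSq += w * w;
                dot += w * queryWeights.getOrDefault(e.getKey(), 0.0);
            }
            double denom = Math.sqrt(docNormSq) * Math.sqrt(queryNormSq);
            result[d] = denom == 0.0 ? 0.0 : dot / denom;
        }
        return result;
    }
}
```

In this sketch the Boolean operators (_AND, _OR) would only decide which documents are candidates; the score itself is computed from the plain query terms, which matches the description in item 3 above.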

2. Document Search Tool / Retrieval using the BM25 Algorithm with the Max Score Heuristic

  1. This program ranks results using the BM25 algorithm with the Max Score heuristic.
  2. An example of running the program might be:
     $ java Bm25MaxScore my_corpus.txt 5 "good php python cat"
  3. Files used: Bm25MaxScore.java, InvertedIndexer.java, MinHeap.java, Node.java
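
As a rough illustration of the two ideas named above, the sketch below shows a per-term BM25 score and the upper bound that the Max Score heuristic relies on. It is not taken from Bm25MaxScore.java: the class name Bm25Sketch, the method names, the parameter values k1 = 1.2 and b = 0.75, and the IDF form log(N / df) are all assumptions.

```java
import java.util.Arrays;

// Hypothetical illustration; names, constants, and IDF variant are assumptions.
final class Bm25Sketch {
    static final double K1 = 1.2;
    static final double B = 0.75;

    /** Inverse document frequency, assumed here to be log(N / df). */
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    /** BM25 contribution of one query term to one document. */
    static double termScore(int tf, int docLen, double avgDocLen, int numDocs, int docFreq) {
        if (tf == 0) {
            return 0.0;
        }
        double lengthNorm = K1 * ((1.0 - B) + B * docLen / avgDocLen);
        return idf(numDocs, docFreq) * tf * (K1 + 1.0) / (tf + lengthNorm);
    }

    /**
     * Upper bound on termScore for any document: the TF factor
     * approaches (K1 + 1) as tf grows, and never reaches it.
     */
    static double maxTermScore(int numDocs, int docFreq) {
        return idf(numDocs, docFreq) * (K1 + 1.0);
    }

    /**
     * Given the per-term upper bounds and the score of the current k-th best
     * document, returns how many of the lowest-bound terms can be skipped as
     * candidate generators: a document containing only those terms cannot
     * score above the threshold.
     */
    static int countNonEssential(double[] maxScores, double threshold) {
        double[] sorted = maxScores.clone();
        Arrays.sort(sorted);
        double sum = 0.0;
        int count = 0;
        for (double s : sorted) {
            if (sum + s > threshold) {
                break;
            }
            sum += s;
            count++;
        }
        return count;
    }
}
```

The threshold would typically come from the smallest score held in a min-heap of the current top results, which is presumably the role of MinHeap.java and Node.java, although the README does not spell that out.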

3. Golomb-Rice Coding Implementation for the Index File

  1. Golomb coding is a lossless data compression method highly suitable for situations in which the occurrence of small values in the input stream is significantly more likely than large values. [1]
  2. Because index files are quite large, this implementation significantly reduces the size of the index file.

References:
  [1] https://en.wikipedia.org/wiki/Golomb_coding
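
For readers unfamiliar with the scheme, the following is a minimal sketch of Rice coding, the special case of Golomb coding where the divisor is a power of two (M = 2^k). It is not the repository's implementation: the class name RiceSketch, the use of a bit string instead of packed bytes, the unary convention (q ones followed by a zero), and the choice to encode gaps between document ids are all assumptions made for illustration.

```java
// Hypothetical illustration; names, bit layout, and gap encoding are assumptions.
final class RiceSketch {

    /** Encodes a non-negative value as unary quotient + k-bit remainder. */
    static String encode(int value, int k) {
        if (value < 0 || k < 0 || k > 30) {
            throw new IllegalArgumentException("value must be >= 0 and 0 <= k <= 30");
        }
        StringBuilder bits = new StringBuilder();
        int quotient = value >>> k;
        for (int i = 0; i < quotient; i++) {
            bits.append('1');
        }
        bits.append('0'); // terminates the unary part
        for (int i = k - 1; i >= 0; i--) {
            bits.append(((value >>> i) & 1) == 1 ? '1' : '0');
        }
        return bits.toString();
    }

    /** Decodes a single codeword produced by encode with the same k. */
    static int decode(String bits, int k) {
        int pos = 0;
        int quotient = 0;
        while (bits.charAt(pos) == '1') {
            quotient++;
            pos++;
        }
        pos++; // skip the terminating 0
        int remainder = 0;
        for (int i = 0; i < k; i++) {
            remainder = (remainder << 1) | (bits.charAt(pos++) - '0');
        }
        return (quotient << k) | remainder;
    }

    /** Encodes a strictly increasing postings list as concatenated gap codewords. */
    static String encodePostings(int[] docIds, int k) {
        StringBuilder bits = new StringBuilder();
        int previous = 0;
        for (int id : docIds) {
            bits.append(encode(id - previous, k));
            previous = id;
        }
        return bits.toString();
    }
}
```

With k = 2, the value 5 is encoded as 1001: quotient 1 in unary (10) followed by remainder 1 in two bits (01). Gaps between consecutive document ids in a postings list tend to be small, which is the situation the method is described as being suited to.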