Project author: FarhanShoukat

Project description:
Parse HTML pages. Create an inverted index. Search for pages.
Language: Python
Repository: git://github.com/FarhanShoukat/Information-Retrieval.git
Created: 2018-10-12T16:40:18Z
Project community: https://github.com/FarhanShoukat/Information-Retrieval

License: MIT License


Information-Retrieval

Abstract:

In this project, a parser and an inverter were built to parse HTML pages and create an inverted index. Four retrieval scoring functions (Okapi-TF, Okapi-TFIDF, Okapi-BM25, and a Language Model with Jelinek-Mercer smoothing) were also implemented for document retrieval.
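
As an illustration of the last scoring function, here is a minimal sketch of query-likelihood scoring with Jelinek-Mercer smoothing. It is not taken from query.py: the function name `jm_score`, its parameters, the default `lam=0.6`, and the choice to put the weight `lam` on the document model (some texts put it on the collection model instead) are all assumptions made for this example.

```python
import math


def jm_score(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.6):
    """Log query likelihood of one document under Jelinek-Mercer smoothing.

    doc_tf:   {term: count of term in this document}      (hypothetical layout)
    coll_tf:  {term: count of term in the whole collection}
    doc_len:  number of tokens in the document
    coll_len: number of tokens in the collection
    lam:      weight of the document model (assumed convention)
    """
    score = 0.0
    for term in query_terms:
        p_doc = doc_tf.get(term, 0) / doc_len if doc_len else 0.0
        p_coll = coll_tf.get(term, 0) / coll_len
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            # Term never occurs in the collection; skipping it is an
            # assumption of this sketch, not a rule of the method.
            continue
        score += math.log(p)
    return score
```

Because the collection model gives every known term a non-zero probability, a document that lacks a query term is penalized rather than scored as impossible.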

How to run:

Run the files in the following order:

parser.py

  • python parser.py \
  • uses stoplist.txt and the files in the folder (containing HTML files) provided at execution
  • creates docids.txt, termids.txt, doc_index.txt
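
The parsing step described above can be sketched roughly as follows. This is only an illustration using the standard library's `html.parser`; the class `TextExtractor`, the function `tokenize_html`, and the tokenization rule (lowercase runs of letters and digits) are assumptions, and the real parser.py may tokenize, stem, or filter differently.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def tokenize_html(html, stopwords):
    """Return the lowercase tokens of an HTML string, minus stopwords."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(extractor.chunks).lower()
    # Assumed token rule: maximal runs of ASCII letters and digits.
    return [tok for tok in re.findall(r"[a-z0-9]+", text) if tok not in stopwords]
```

A caller would load the stop words from stoplist.txt into a set and pass each file's contents through `tokenize_html` before assigning document and term IDs.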

inverter.py

  • python inverter.py
  • uses docids.txt, termids.txt, doc_index.txt
  • creates term_info.txt, term_index.txt
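
Conceptually, the inversion step turns a document-to-terms mapping into a term-to-documents mapping. The sketch below shows that idea on in-memory dictionaries; the function name `invert` and the data layout are assumptions, and the on-disk formats of doc_index.txt, term_index.txt, and term_info.txt are not reproduced here.

```python
from collections import defaultdict


def invert(forward_index):
    """Build positional postings from a forward index.

    forward_index: {doc_id: [term_id at position 0, term_id at position 1, ...]}
    returns:       {term_id: {doc_id: [positions where the term occurs]}}
    """
    postings = defaultdict(lambda: defaultdict(list))
    # Visit documents in sorted order so each postings list is built in doc_id order.
    for doc_id in sorted(forward_index):
        for position, term_id in enumerate(forward_index[doc_id]):
            postings[term_id][doc_id].append(position)
    # Convert nested defaultdicts to plain dicts for the caller.
    return {term_id: dict(docs) for term_id, docs in postings.items()}
```

With this layout, the document frequency of a term is `len(postings[term_id])` and its frequency within a document is the length of the corresponding position list.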

docLengthCalculator.py

  • python docLengthCalculator.py
  • uses doc_index.txt
  • creates doc_lengths.txt
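
Document lengths matter because length-normalized scoring functions such as BM25 compare each document's length to the collection average. A minimal sketch, assuming the same hypothetical in-memory forward index (`{doc_id: [term_id, ...]}`) rather than the actual doc_index.txt format:

```python
def doc_lengths(forward_index):
    """Map each doc_id to its number of indexed tokens."""
    return {doc_id: len(terms) for doc_id, terms in forward_index.items()}


def average_length(lengths):
    """Mean document length, or 0.0 for an empty collection."""
    return sum(lengths.values()) / len(lengths) if lengths else 0.0
```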

query.py

  • python query.py --score \ --query \
  • available score functions: TF, TF-IDF, BM25, JM
  • uses docids.txt, termids.txt, stoplist.txt, term_index.txt, doc_lengths.txt
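
To give a sense of what a score function computes, here is a small BM25 sketch. It is not the code in query.py: the function name `bm25_score`, the data layout, the defaults `k1=1.2` and `b=0.75`, and the smoothed idf form `log(1 + (N - df + 0.5) / (df + 0.5))` are assumptions. BM25 has several published variants, and the repository may use a different one (for example, one with a query-term-frequency factor).

```python
import math


def bm25_score(query_terms, doc_id, postings, doc_lengths, k1=1.2, b=0.75):
    """Score one document against a query with a common BM25 variant.

    postings:    {term: {doc_id: term frequency in that document}}  (assumed layout)
    doc_lengths: {doc_id: number of tokens in the document}
    """
    n_docs = len(doc_lengths)
    avg_len = sum(doc_lengths.values()) / n_docs
    score = 0.0
    for term in query_terms:
        docs = postings.get(term, {})
        tf = docs.get(doc_id, 0)
        if tf == 0:
            continue  # term absent from this document contributes nothing
        df = len(docs)  # number of documents containing the term
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Length normalization: longer-than-average documents are damped.
        norm = k1 * ((1 - b) + b * doc_lengths[doc_id] / avg_len)
        score += idf * tf * (k1 + 1) / (tf + norm)
    return score
```

Ranking a query then amounts to computing this score for every candidate document and sorting in descending order.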

Contact

You can get in touch with me on my LinkedIn Profile: Farhan Shoukat

License

MIT
Copyright (c) 2018 Farhan Shoukat