Project author: FarhanShoukat
Project description:
Parse HTML pages, create an inverted index, and search for pages.
Language: Python
Project URL: git://github.com/FarhanShoukat/Information-Retrieval.git
Abstract:
In this project, a parser and an inverter were built to parse HTML pages and create an inverted index. Four scoring functions (Okapi-TF, Okapi-TFIDF, Okapi-BM25, and a language model with Jelinek-Mercer smoothing) were also implemented for document retrieval.
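As an illustration of the Okapi-BM25 scoring named above, here is a minimal sketch of the common textbook form of the per-term score. The function name, argument names, and the defaults k1=1.2 and b=0.75 are assumptions for this example; the exact variant and parameters used in query.py are not stated here.

```python
import math

def bm25_term_score(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """Contribution of one query term to one document's BM25 score.

    tf: occurrences of the term in the document
    df: number of documents containing the term
    k1, b: assumed defaults, not taken from the project
    """
    # Inverse document frequency component (can be negative for very common terms).
    idf = math.log((num_docs - df + 0.5) / (df + 0.5))
    # Length normalisation: longer-than-average documents are penalised.
    norm = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + norm)

def bm25_score(term_stats, doc_len, avg_doc_len, num_docs):
    """Sum the per-term scores; term_stats is a list of (tf, df) pairs."""
    return sum(
        bm25_term_score(tf, df, doc_len, avg_doc_len, num_docs)
        for tf, df in term_stats
    )
```

A document's score for a query is the sum of the contributions of the query terms, and a term that does not occur in the document (tf = 0) contributes nothing.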
How to run:
Run the files in the following order:
parser.py
- python parser.py \
- uses stoplist.txt and the HTML files in the folder supplied at execution
- creates docids.txt, termids.txt, doc_index.txt
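The parsing step can be sketched roughly as follows, using only the standard library. This is not the code in parser.py: the class and function names, the tokenisation rule (lowercase alphanumeric runs), and the choice to start ids at 1 are all assumptions made for illustration, and the on-disk formats of the output files are not shown.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text content, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def tokenize(html, stopwords):
    """Return lowercase tokens from the page text, with stopwords removed."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    words = re.findall(r"[a-z0-9]+", " ".join(extractor.chunks).lower())
    return [w for w in words if w not in stopwords]

def assign_ids(tokens, term_ids):
    """Map each token to an integer id, adding unseen terms to term_ids."""
    return [term_ids.setdefault(t, len(term_ids) + 1) for t in tokens]
```

For example, `tokenize("<p>The quick fox</p>", {"the"})` yields `["quick", "fox"]`, and passing that list to `assign_ids` with an empty dictionary assigns the ids 1 and 2.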
inverter.py
- python inverter.py
- uses docids.txt, termids.txt, doc_index.txt
- creates term_info.txt, term_index.txt
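The inversion step turns a forward index (per document: which terms occur, and where) into postings lists (per term: which documents contain it). A minimal in-memory sketch, assuming each forward-index entry is a (doc_id, term_id, positions) tuple; that layout and the function names are hypothetical and may differ from what inverter.py reads and writes.

```python
from collections import defaultdict

def invert(doc_index):
    """Build term_id -> [(doc_id, positions), ...] from forward-index entries."""
    postings = defaultdict(list)
    for doc_id, term_id, positions in doc_index:
        postings[term_id].append((doc_id, list(positions)))
    # Keep each postings list ordered by document id.
    for plist in postings.values():
        plist.sort(key=lambda entry: entry[0])
    return dict(postings)

def term_stats(postings):
    """Per term: (total occurrences in the collection, number of documents)."""
    return {
        term_id: (sum(len(pos) for _, pos in plist), len(plist))
        for term_id, plist in postings.items()
    }
```

The per-term totals computed by `term_stats` are the kind of summary a scoring function needs, for example the document frequency used by TF-IDF and BM25.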
docLengthCalculator.py
- python docLengthCalculator.py
- uses doc_index.txt
- creates doc_lengths.txt
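Document lengths are needed for the length normalisation in BM25 and for the language model. A small sketch of how they could be derived from a forward index, again assuming hypothetical (doc_id, term_id, positions) entries rather than the actual layout of doc_index.txt:

```python
from collections import Counter

def doc_lengths(doc_index):
    """Count the indexed term occurrences in each document."""
    lengths = Counter()
    for doc_id, _term_id, positions in doc_index:
        lengths[doc_id] += len(positions)
    return dict(lengths)

def average_length(lengths):
    """Mean document length, or 0.0 for an empty collection."""
    return sum(lengths.values()) / len(lengths) if lengths else 0.0
```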
query.py
- python query.py --score \ --query \
- available score functions: TF, TF-IDF, BM25, JM
- uses docids.txt, termids.txt, stoplist.txt, term_index.txt, doc_lengths.txt
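For the JM score function listed above, a language model with Jelinek-Mercer smoothing mixes the term's relative frequency in the document with its relative frequency in the whole collection. The sketch below is one common formulation, not necessarily the one in query.py: the function name, the dictionary inputs, the default lam=0.6, and the convention that lam weights the document model are all assumptions.

```python
import math

def jm_score(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.6):
    """Log-likelihood of the query under a JM-smoothed document model.

    doc_tf / coll_tf: term -> count in the document / in the collection
    lam: weight on the document model (assumed convention and value)
    """
    score = 0.0
    for term in query_terms:
        p_doc = doc_tf.get(term, 0) / doc_len if doc_len else 0.0
        p_coll = coll_tf.get(term, 0) / coll_len if coll_len else 0.0
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            # Term never seen anywhere in the collection.
            return float("-inf")
        score += math.log(p)
    return score
```

Because of the collection term, a document that lacks a query term still gets a finite score as long as the term occurs somewhere in the collection.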
You can get in touch with me on my LinkedIn Profile: Farhan Shoukat
License
MIT
Copyright (c) 2018 Farhan Shoukat