Project author: vancanhuit

Project description:
Simple boolean retrieval model implementation with Python 3
Language: Python
Repository: git://github.com/vancanhuit/boolean-retrieval-engine.git
Created: 2018-03-11T02:54:40Z
Project page: https://github.com/vancanhuit/boolean-retrieval-engine


Simple boolean retrieval implementation with Python 3

Prepare

  • Install Python 3.5+
  • Install NLTK 3
  • Open a terminal or command prompt and enter the following commands:
    $ python
    >>> import nltk
    >>> nltk.download('stopwords')

Usage

To index data, run the index.py script, passing the directory of documents to index and the directory in which to store the indexed data:

    $ python index.py --help
    usage: index.py [-h] docs_path data_path

    Index data for boolean retrieval

    positional arguments:
      docs_path   Directory for documents to be indexed
      data_path   Directory for storing indexed data

    optional arguments:
      -h, --help  show this help message and exit

    $ python index.py ./docs ./data
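
For readers new to boolean retrieval, the indexing step can be pictured as building an inverted index: a mapping from each term to the set of documents that contain it. The following is a minimal sketch of that idea, not the code in index.py. The names `tokenize`, `build_index`, and `STOPWORDS` are hypothetical, the tiny stopword set stands in for the NLTK stopword list, and documents are passed as an in-memory dict rather than read from a directory.

```python
import re
from collections import defaultdict

# Assumption: a tiny stand-in for NLTK's English stopword list, so the
# sketch runs without any downloads.
STOPWORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def build_index(docs):
    """Map each term to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for term in tokenize(text):
            index[term].add(name)
    return dict(index)


# Hypothetical documents, given as {name: text} for illustration only.
docs = {
    "a.txt": "Books are popular and widely available.",
    "b.txt": "The festival is popular.",
}
index = build_index(docs)
print(sorted(index["popular"]))
```

Storing each posting list as a set keeps the later AND step a plain set intersection.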

After the data has been indexed successfully, run the query.py script to perform a query:

    $ python query.py --help
    usage: query.py [-h] query

    Boolean query

    positional arguments:
      query       words seperated by space

    optional arguments:
      -h, --help  show this help message and exit

    $ python query.py "popular available"
    {'D:\\workspace\\boolean-retrieval-engine\\docs\\A Festival of Books.txt'}

When providing input to the query script, separate the words with spaces. For example, the input "popular available" means: find all documents that contain both popular AND available. The result is the set of documents that satisfy the query. Numbers, punctuation, and words that are not in the dictionary are ignored.
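
The AND semantics described above amount to intersecting the posting sets of the query terms. The sketch below illustrates this under stated assumptions and is not the code in query.py: `boolean_and` is a hypothetical name, the index is a hand-written dict of term to document set, and "dictionary" is taken to mean the vocabulary of that index, so unknown words are skipped rather than forcing an empty result.

```python
import re


def boolean_and(index, query):
    """Return the documents that contain every known term in the query."""
    # Keep alphabetic tokens only, so numbers and punctuation are dropped;
    # then skip words that are not in the index vocabulary.
    terms = [t for t in re.findall(r"[a-z]+", query.lower()) if t in index]
    if not terms:
        return set()
    # Intersect the smallest posting sets first to keep intermediates small.
    postings = sorted((index[t] for t in terms), key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


# Hypothetical index, for illustration only.
index = {
    "popular": {"a.txt", "b.txt"},
    "available": {"a.txt"},
    "festival": {"b.txt"},
}
print(boolean_and(index, "popular available"))
```

Sorting the posting sets by size before intersecting is a common optimization rather than something the original text specifies.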