Project author: aligusnet

Project description:
Information Retrieval
Language: C#
Repository: git://github.com/aligusnet/InformationRetrieval.git
Created: 2019-04-30T20:36:56Z
Project page: https://github.com/aligusnet/InformationRetrieval

Information Retrieval

Definitions

List of definitions of key types and concepts.

Corpus

The project defines how a collection of documents is organized into a corpus and blocks.

  • Corpus - collection of text documents organized into blocks;
  • Block - subset of the corpus, small enough to be processed in memory;
  • Document - piece of text with metadata, of which the most important is DocumentId;
  • DocumentId - unique identifier of the document.
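
To make the corpus/block/document hierarchy concrete, here is a minimal sketch in C#. The class shapes, the `FromDocuments` helper, and the use of a plain `int` in place of DocumentId are assumptions made for illustration; the project's actual types may look different.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shapes for illustration only; not the project's actual types.
public sealed class Document
{
    public Document(int id, string text) { Id = id; Text = text; }
    public int Id { get; }       // stands in for DocumentId
    public string Text { get; }
}

public sealed class Block
{
    // Copies the input so later changes to the source list do not leak in.
    public Block(IEnumerable<Document> documents) { Documents = documents.ToList(); }
    public IReadOnlyList<Document> Documents { get; }
}

public sealed class Corpus
{
    public Corpus(IEnumerable<Block> blocks) { Blocks = blocks.ToList(); }
    public IReadOnlyList<Block> Blocks { get; }

    // Splits a flat sequence of documents into blocks of at most blockSize documents,
    // so that each block can be processed in memory on its own.
    public static Corpus FromDocuments(IEnumerable<Document> documents, int blockSize)
    {
        if (blockSize <= 0) throw new ArgumentOutOfRangeException(nameof(blockSize));
        var blocks = new List<Block>();
        var current = new List<Document>();
        foreach (var doc in documents)
        {
            current.Add(doc);
            if (current.Count == blockSize)
            {
                blocks.Add(new Block(current));
                current = new List<Document>();
            }
        }
        if (current.Count > 0) blocks.Add(new Block(current));
        return new Corpus(blocks);
    }
}

public static class CorpusDemo
{
    public static void Main()
    {
        var docs = Enumerable.Range(0, 5).Select(i => new Document(i, "text " + i));
        var corpus = Corpus.FromDocuments(docs, blockSize: 2);
        // Prints the document ids of each block on its own line.
        foreach (var block in corpus.Blocks)
            Console.WriteLine(string.Join(", ", block.Documents.Select(d => d.Id)));
    }
}
```

With five documents and a block size of two, the sketch produces three blocks holding two, two, and one document. Counting documents is only a stand-in here; a real block limit would more likely be based on memory use.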

InformationRetrieval

The project defines a number of types for processing text documents organized in a corpus.

  • Transformer - converts a corpus of documents, preserving the structure of the corpus but changing its presentation: text parsing, cleaning, tokenization, etc.
  • Indexer - builds an index from a corpus.
  • Token is a tuple of term, document id and term’s position in the document.
  • BuildableIndex is a type used to build an index out of a list of tokens; it creates a SearchableIndex.
  • SearchableIndex supports search for a term in the corpus.
  • Boolean Search Engine - performs text searching in the corpus using the index. Supports AND/OR/NOT operators.
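
The flow from tokens to index to Boolean search can be sketched as follows. This is a simplified illustration, not the project's API: `SimpleIndex`, `Tokenize`, and the `And`/`Or`/`Not` methods are hypothetical names, and the sketch folds building and searching into one class instead of separating BuildableIndex from SearchableIndex. Term positions are stored in the tokens but not used by the search.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: names and signatures are assumptions, not the project's API.
public sealed class Token
{
    public Token(string term, int documentId, int position)
    {
        Term = term; DocumentId = documentId; Position = position;
    }
    public string Term { get; }
    public int DocumentId { get; }
    public int Position { get; }
}

public sealed class SimpleIndex
{
    // Maps each term to the sorted set of ids of documents that contain it.
    private readonly Dictionary<string, SortedSet<int>> postings =
        new Dictionary<string, SortedSet<int>>();
    private readonly SortedSet<int> allDocuments = new SortedSet<int>();

    public SimpleIndex(IEnumerable<Token> tokens)
    {
        foreach (var token in tokens)
        {
            if (!postings.TryGetValue(token.Term, out var docs))
            {
                docs = new SortedSet<int>();
                postings[token.Term] = docs;
            }
            docs.Add(token.DocumentId);
            allDocuments.Add(token.DocumentId);
        }
    }

    // Returns a copy of the posting set, or an empty set for an unknown term.
    public ISet<int> Search(string term) =>
        postings.TryGetValue(term, out var docs) ? new SortedSet<int>(docs) : new SortedSet<int>();

    public ISet<int> And(ISet<int> a, ISet<int> b)
    {
        var result = new SortedSet<int>(a);
        result.IntersectWith(b);
        return result;
    }

    public ISet<int> Or(ISet<int> a, ISet<int> b)
    {
        var result = new SortedSet<int>(a);
        result.UnionWith(b);
        return result;
    }

    // NOT is taken relative to every document seen while building the index.
    public ISet<int> Not(ISet<int> a)
    {
        var result = new SortedSet<int>(allDocuments);
        result.ExceptWith(a);
        return result;
    }
}

public static class BooleanSearchDemo
{
    // Naive tokenizer: lower-cases the text and splits it on spaces.
    public static IEnumerable<Token> Tokenize(int documentId, string text)
    {
        var terms = text.ToLowerInvariant()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return terms.Select((term, position) => new Token(term, documentId, position));
    }

    public static void Main()
    {
        var tokens = Tokenize(1, "the quick brown fox")
            .Concat(Tokenize(2, "the lazy dog"))
            .Concat(Tokenize(3, "quick lazy fox"));
        var index = new SimpleIndex(tokens);

        // quick AND NOT lazy
        var result = index.And(index.Search("quick"), index.Not(index.Search("lazy")));
        Console.WriteLine(string.Join(", ", result));
    }
}
```

In this example "quick" appears in documents 1 and 3 and "lazy" in documents 2 and 3, so the query `quick AND NOT lazy` matches only document 1. Keeping posting sets sorted is one common choice for this kind of index because intersection and union can then be done by merging.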

Wikidump

A set of types for building a corpus from a Wikipedia dump.