Project author: aligusnet
Project description:
Information Retrieval
Language: C#
Project URL: git://github.com/aligusnet/InformationRetrieval.git

Definitions
List of definitions of key types and concepts.
Corpus
The project defines how a collection of documents is organized into a corpus and blocks.
- Corpus - collection of text documents organized into blocks;
- Block - subset of corpus, small enough to fit processing in memory;
- Document - piece of text with metadata, the most important of which is DocumentId;
- DocumentId - unique identifier of the document.
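
The relationship between these four concepts can be sketched in C# roughly as follows. This is a minimal illustration, not the project's actual code: the member names, the `Title` field, the `uint` backing type, and the `FromDocuments` helper are all assumptions, and the sketch assumes .NET 6 or later (for `Enumerable.Chunk` and record structs).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shapes for illustration; the project's real types may differ.
public readonly record struct DocumentId(uint Value);

// A piece of text plus metadata; Title is an assumed extra metadata field.
public sealed record Document(DocumentId Id, string Title, string Text);

// A block is a subset of the corpus small enough to process in memory.
public sealed record Block(IReadOnlyList<Document> Documents);

// A corpus exposes its blocks lazily, so only one block has to be materialized at a time.
public sealed class Corpus
{
    private readonly IEnumerable<Block> blocks;

    public Corpus(IEnumerable<Block> blocks) => this.blocks = blocks;

    public IEnumerable<Block> Blocks => blocks;

    // Splits a flat sequence of documents into blocks of at most blockSize documents.
    public static Corpus FromDocuments(IEnumerable<Document> documents, int blockSize)
    {
        if (blockSize <= 0) throw new ArgumentOutOfRangeException(nameof(blockSize));
        return new Corpus(documents.Chunk(blockSize).Select(chunk => new Block(chunk)));
    }
}
```

With these assumed types, `Corpus.FromDocuments(documents, 1000)` would yield blocks of up to 1000 documents each, and a consumer would iterate `corpus.Blocks` one block at a time.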
The project defines a number of types for processing text documents organized in a corpus.
- Transformer - converts a corpus of documents, preserving the structure of the corpus but changing the representation: text parsing, cleaning, tokenization, etc.
- Indexer - builds an index from a corpus.
- Token - a tuple of a term, a document id, and the term’s position in the document.
- BuildableIndex - a type used to build an index out of a list of tokens; it creates a SearchableIndex.
- SearchableIndex - supports searching for a term in the corpus.
- Boolean Search Engine - performs text search over the corpus using the index and supports the AND, OR, and NOT operators.
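
To make the flow from tokens to boolean search concrete, here is a small sketch of an inverted index in C#. It is an illustration under stated assumptions, not the project's API: the `Sketch*` class names, the method names, and the use of plain `int` document ids are hypothetical, and the term position is stored but not used (it would only matter for phrase queries).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified token: term, document id, position in the document.
public readonly record struct Token(string Term, int DocumentId, int Position);

// Collects tokens into postings lists, then freezes them into a searchable index.
public sealed class SketchBuildableIndex
{
    private readonly Dictionary<string, SortedSet<int>> postings = new();
    private readonly SortedSet<int> allDocuments = new();

    public void Add(Token token)
    {
        if (!postings.TryGetValue(token.Term, out var docs))
        {
            docs = new SortedSet<int>();
            postings[token.Term] = docs;
        }
        docs.Add(token.DocumentId);
        allDocuments.Add(token.DocumentId);
    }

    public SketchSearchableIndex Build() =>
        new(postings.ToDictionary(p => p.Key, p => p.Value.ToArray()), allDocuments.ToArray());
}

public sealed class SketchSearchableIndex
{
    private readonly IReadOnlyDictionary<string, int[]> postings;
    private readonly int[] allDocuments;

    public SketchSearchableIndex(IReadOnlyDictionary<string, int[]> postings, int[] allDocuments)
    {
        this.postings = postings;
        this.allDocuments = allDocuments;
    }

    // Returns the sorted ids of documents containing the term (empty if the term is unknown).
    public int[] Search(string term) =>
        postings.TryGetValue(term, out var docs) ? docs : Array.Empty<int>();

    // Boolean operators expressed as set operations over document ids.
    public int[] And(IEnumerable<int> a, IEnumerable<int> b) =>
        a.Intersect(b).OrderBy(id => id).ToArray();

    public int[] Or(IEnumerable<int> a, IEnumerable<int> b) =>
        a.Union(b).OrderBy(id => id).ToArray();

    // NOT is the complement relative to every document seen while building.
    public int[] Not(IEnumerable<int> a) =>
        allDocuments.Except(a).ToArray();
}
```

In this sketch a query such as "cats AND NOT dogs" would be evaluated as `index.And(index.Search("cats"), index.Not(index.Search("dogs")))`. Set operations from LINQ keep the example short; a production index would more likely merge sorted postings lists directly.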
Wikidump
A set of types for building a corpus from a Wikipedia dump.