A simple document search engine powered by Lucene.
A simple document search engine powered by Lucene.
Gradle is used thoroughly in this project. Check out build.gradle by yourself if you have knowledge on Gradle.
I recommend open it with IDE who supports Gradle, such as Intellij IDEA.
# Make sure gradle is installed.
git clone https://github.com/PointerFLY/Lucene-Example.git
cd Lucene-Example
gradle run
Class | Functionality |
---|---|
FileUtils | Create local folder structures, automatically download and decompress Cranfield files. Provide access to all necessary files (Cranfield files, index files). |
FileParser | A parser used to parse and load Cranfield files into a Java supported data structre. |
DocumentModel | A simple object-oriented way to represent Cranfield documents. |
Indexer | Responsible to create all indices. |
Searcher | Read indices created by Indexer, parse the query string and exectute search. A search will return top hits documents ids. |
CustomAnalyzer | A analyzer created by myself to compare its performances with other built-in Analyzer in Lucene. |
Evaluator | Calculate Mean Average Precision, Recall. |
Main | Entry of program, run different analyzers and scoring models and print results. |
StandardAnalyzer
Built-in Analyzer in Lucene, the most robust one. Filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words.
CustomAnalyzer
I select differnt tokenizer and filters, combine them, fine-tune the pipeline. Finally choose a analysis pipeline below:
public class CustomAnalyzer extends Analyzer {
@Override
protected TokenStreamComponents createComponents(String fieldName) {
Tokenizer source = new ClassicTokenizer();
TokenStream tokenStream = new LowerCaseFilter(source);
tokenStream = new EnglishPossessiveFilter(tokenStream);
tokenStream = new EnglishMinimalStemFilter(tokenStream);
tokenStream = new KStemFilter(tokenStream);
tokenStream = new PorterStemFilter(tokenStream);
CharArraySet stopSet = CharArraySet.copy(StopAnalyzer.ENGLISH_STOP_WORDS_SET);
tokenStream = new StopFilter(tokenStream, stopSet);
return new TokenStreamComponents(source, tokenStream);
}
}
I choose two built-in Lucene scoring model, which is subclasses of Similarity in Lucene.
ClassicSimilarity
A implementation of TFIDFSimilarity. It is actually a Vector Sapce Model base on TF-IDF term weighting.
BM25Similarity
A TF-IDF-like ranking function used in document retrieval, which is generally considered superior to TF-IDF.
Note: MAP and recall are average of 225 queries.
N/A | StandardAnalyzer | CustomAnalyzer |
---|---|---|
Vector Space Model | MAP: 0.4905 Recall: 0.6445 | MAP: 0.5038 Recall: 0.6839 |
BM25 Model | MAP: 0.5418 Recall: 0.6826 | MAP: 0.5435 Recall: 0.7233 |
Conclusion: