Currently only tantivy and lucene are supported, but you can add another search
engine by creating a directory in the engines directory and adding a Makefile
implementing the following commands:
Removes all files, including the built index, and your compiled bench program.
Starts a program that will receive documents from stdin and build a search index. Check out the lucene implementation for reference.
Stemming should be disabled. Tokenization should be something reasonably close to Lucene's
StandardTokenizer. Discrepancies should be documented in README.md.
Starts a program that reads tests from stdin and outputs
a result hit count as fast as possible. *If this is not your language's default,
be sure to flush stdout after writing your answer.*
Each test consists of a command followed by a query.
The command describes the type of operation that should be performed. Right now there are three commands:

- COUNT: outputs the document count.
- TOP10: computes the top-K elements. Just outputs "1".
- TOP10_COUNT: computes the top-K documents and the overall count of matching documents. Outputs the document count.
Scores for these commands should be as close as possible to Lucene's BM25.
If BM25 is not available, fall back to TF-IDF. If TF-IDF is not available,
just implement whatever is available to you. Make sure to document any difference in the README.md
file.
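
For engines that have to implement scoring by hand, the sketch below shows one common formulation of BM25 for a single document. It is only an illustration: the function and parameter names are hypothetical, the defaults `k1 = 1.2` and `b = 0.75` are assumed to match Lucene's, and the constant `(k1 + 1)` numerator factor is left out on the assumption that recent Lucene versions omit it too. Lucene also stores document lengths in a lossy encoding, so small score differences are to be expected and should be documented.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """Inverse document frequency, smoothed so that it never goes negative."""
    return math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(freq, doc_len, avg_doc_len, doc_count, doc_freq,
                    k1=1.2, b=0.75):
    """Score contribution of one query term for one document.

    freq        -- occurrences of the term in the document
    doc_len     -- number of tokens in the document
    avg_doc_len -- average number of tokens per document in the index
    doc_count   -- number of documents in the index
    doc_freq    -- number of documents containing the term
    """
    if freq <= 0:
        return 0.0
    # Length normalisation: longer-than-average documents are penalised.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return bm25_idf(doc_count, doc_freq) * freq / (freq + norm)


def bm25_score(term_freqs, doc_len, avg_doc_len, doc_count, doc_freqs,
               k1=1.2, b=0.75):
    """Sum the per-term scores over all query terms.

    term_freqs -- {term: occurrences in this document} for the query terms
    doc_freqs  -- {term: number of documents containing the term}
    """
    return sum(
        bm25_term_score(freq, doc_len, avg_doc_len, doc_count,
                        doc_freqs[term], k1, b)
        for term, freq in term_freqs.items()
    )
```

If your engine's scores diverge from this shape (for example because it keeps the `(k1 + 1)` factor or uses exact document lengths), rankings are usually unaffected, but the difference is still worth a line in README.md.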
Queries are expressed in the Lucene query language.
If a command is not supported, just print "UNSUPPORTED" to stdout.
If you would like a command to be added, please open an issue. Wanting to show a specific case where your engine shines is a perfectly valid motivation.
TOP10 should give some advantage to engines implementing variations of the WAND
algorithm.
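
To make the protocol described above more concrete, here is a minimal sketch of the query-answering loop in Python. Several things in it are assumptions rather than part of the specification: the command and the query are assumed to be separated by whitespace (for instance a tab) on a single line, and `ToyEngine`, `make_handlers`, `answer` and `serve` are hypothetical names. `ToyEngine` is only a stand-in that matches on lowercase whitespace-separated terms; a real engine would parse the Lucene query language and score with BM25.

```python
class ToyEngine:
    """Placeholder engine: a document matches if it contains every query term."""

    def __init__(self, docs):
        self.docs = [set(doc.lower().split()) for doc in docs]

    def count(self, query):
        terms = set(query.lower().split())
        return sum(1 for doc in self.docs if terms <= doc)

    def top_k(self, query, k):
        # No scoring here: simply the first k matching document ids.
        terms = set(query.lower().split())
        return [i for i, doc in enumerate(self.docs) if terms <= doc][:k]


def make_handlers(engine):
    """Map each supported command name to a function returning the answer text."""

    def count(query):
        return str(engine.count(query))

    def top10(query):
        engine.top_k(query, 10)  # do the work, then answer with a constant
        return "1"

    def top10_count(query):
        engine.top_k(query, 10)
        return str(engine.count(query))

    return {"COUNT": count, "TOP10": top10, "TOP10_COUNT": top10_count}


def answer(line, handlers):
    """Answer one test line of the assumed form '<COMMAND> <query>'."""
    parts = line.strip().split(None, 1)
    if not parts:
        return "UNSUPPORTED"
    command = parts[0]
    query = parts[1] if len(parts) > 1 else ""
    handler = handlers.get(command)
    if handler is None:
        return "UNSUPPORTED"
    return handler(query)


def serve(lines, out, handlers):
    """Read tests line by line and write one answer per line."""
    for line in lines:
        out.write(answer(line, handlers) + "\n")
        out.flush()  # the benchmark driver waits for each answer
```

A real program would call `serve(sys.stdin, sys.stdout, make_handlers(engine))` with its own engine object. The explicit `flush()` after every answer matters because Python buffers stdout when it is not attached to a terminal.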