This layer provides two integration points with Lucene, FDBDirectory and FDBCodec. These are full implementations of the Directory and Codec interfaces which are backed entirely by FoundationDB.
FDBDirectory can be used on its own with the default Codec doing the interesting work. Files generated by Lucene are stored as blobs in the database instead of the file system.
FDBCodec, which must be used in conjunction with FDBDirectory, implements new serialization and data models for Lucene. This results in explicit keys and values in the database instead of file-like blobs.
This layer is at an early alpha stage (note the 0.0.1 version number). While most of the stock Lucene tests pass when using FDBDirectory, many currently fail when running with FDBCodec. There are no known correctness issues at this time, but slowness and timeout issues could easily be hiding such problems.
Please try it out and let us know how it works (e.g. on our community site), but production usage is not recommended.
The Subspace concept is used extensively to provide a simple, logical mapping and easy storage and retrieval. Each directory, segment and format is identified by a unique string. These identifier strings are then concatenated to yield the key range associated with each logical format being stored.
For example, assume we have an FDBDirectory created with the path ("lucene") and a segment named "_0". That would result in the following Tuples:
- ("lucene", "_0", "dat") for DocValues
- ("lucene", "_0", "inf") for FieldInfos
- ("lucene", "_0", "liv") for LiveDocs
- etc.
Additional keys and values exist under each of those subspaces for storing the information associated with each format. In the documentation below, the full subspace is the concatenation of the directory, segment and format subspaces.
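The concatenation described above can be sketched with plain Java lists standing in for tuples. This is a minimal illustration only: `SubspaceSketch`, `fullSubspace` and `key` are hypothetical names, and the layer itself presumably uses the FoundationDB tuple and subspace classes rather than `List<Object>`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SubspaceSketch {
    // Full subspace = directory path, then segment name, then format identifier.
    static List<Object> fullSubspace(List<Object> directory, String segment, String format) {
        List<Object> out = new ArrayList<>(directory);
        out.add(segment);
        out.add(format);
        return out;
    }

    // A concrete key is the full subspace followed by the format-specific key parts.
    static List<Object> key(List<Object> subspace, Object... keyParts) {
        List<Object> out = new ArrayList<>(subspace);
        out.addAll(Arrays.asList(keyParts));
        return out;
    }

    public static void main(String[] args) {
        List<Object> dir = Arrays.<Object>asList("lucene");
        List<Object> liv = fullSubspace(dir, "_0", "liv");
        System.out.println(liv);              // [lucene, _0, liv]
        System.out.println(key(liv, 0L, 5L)); // [lucene, _0, liv, 0, 5]
    }
}
```

Because every key for a format shares the same leading elements, all data for that format in one segment occupies a single contiguous key range.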
Encodes/decodes strongly typed, per-document values. See DocValuesFormat and FieldInfo.DocValuesType.
The long_BINARY, long_NUMERIC, long_SORTED and long_SORTED_SET key parts below refer to the DocValuesType enum ordinal() values.
Subspace: ("dat")
(str_fieldName, long_BINARY, long_doc0) => (bytes_value)
(str_fieldName, long_BINARY, long_doc1) => (bytes_value)
...
(str_fieldName, long_NUMERIC, long_doc0) => (long_value)
(str_fieldName, long_NUMERIC, long_doc1) => (long_value)
...
(str_fieldName, long_SORTED, "bytes", long_ordinal0) => (bytes_value)
(str_fieldName, long_SORTED, "bytes", long_ordinal1) => (bytes_value)
...
(str_fieldName, long_SORTED_SET, "ord", long_doc0) => (long_ordinal)
(str_fieldName, long_SORTED_SET, "ord", long_doc1) => (long_ordinal)
...
(str_fieldName, long_SORTED_SET, "bytes", long_ordinal0) => (bytes_value)
(str_fieldName, long_SORTED_SET, "bytes", long_ordinal1) => (bytes_value)
...
(str_fieldName, long_SORTED_SET, "doc_ord", long_doc0, long_ordinal0) => ()
(str_fieldName, long_SORTED_SET, "doc_ord", long_doc0, long_ordinal1) => ()
(str_fieldName, long_SORTED_SET, "doc_ord", long_doc1, long_ordinal0) => ()
...
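To illustrate why this layout is convenient, the sketch below reads every NUMERIC value of one field with a single ordered prefix scan. It is a simplified model, not the layer's code: a `TreeMap` stands in for the database, keys are relative to the ("dat") subspace, the local `DocValuesType` enum is a stand-in whose ordinals may not match Lucene's, and the comparator only approximates tuple ordering (it assumes strings sort before integers).

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class DocValuesLayoutSketch {
    // Local stand-in for Lucene's DocValuesType; the real enum's ordering may differ.
    enum DocValuesType { NUMERIC, BINARY, SORTED, SORTED_SET }

    // Compare one tuple element. Assumption: strings sort before integers.
    static int compareElement(Object x, Object y) {
        if (x instanceof String && y instanceof String) {
            return ((String) x).compareTo((String) y);
        }
        if (x instanceof Long && y instanceof Long) {
            return Long.compare((Long) x, (Long) y);
        }
        return (x instanceof String) ? -1 : 1;
    }

    // Element-wise ordering; a shorter tuple sorts before any tuple it prefixes.
    static final Comparator<List<Object>> TUPLE_ORDER = (a, b) -> {
        int n = Math.min(a.size(), b.size());
        for (int i = 0; i < n; i++) {
            int c = compareElement(a.get(i), b.get(i));
            if (c != 0) return c;
        }
        return Integer.compare(a.size(), b.size());
    };

    static List<Object> key(Object... parts) {
        return Arrays.asList(parts);
    }

    // Scan (fieldName, NUMERIC, doc) => value entries and return doc -> value.
    static Map<Long, Long> numericValues(NavigableMap<List<Object>, Object> store, String field) {
        List<Object> prefix = key(field, (long) DocValuesType.NUMERIC.ordinal());
        Map<Long, Long> out = new LinkedHashMap<>();
        for (Map.Entry<List<Object>, Object> e : store.tailMap(prefix, true).entrySet()) {
            List<Object> k = e.getKey();
            boolean inRange = k.size() >= prefix.size()
                    && k.subList(0, prefix.size()).equals(prefix);
            if (!inRange) break; // left the field's NUMERIC range
            if (k.size() == 3) out.put((Long) k.get(2), (Long) e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        long numeric = DocValuesType.NUMERIC.ordinal();
        long binary = DocValuesType.BINARY.ordinal();
        NavigableMap<List<Object>, Object> store = new TreeMap<>(TUPLE_ORDER);
        store.put(key("price", numeric, 1L), 25L);
        store.put(key("price", numeric, 0L), 10L);
        store.put(key("price", binary, 0L), new byte[] {1});
        store.put(key("title", numeric, 0L), 7L);
        System.out.println(numericValues(store, "price")); // {0=10, 1=25}
    }
}
```

The scan stops as soon as a key no longer starts with the prefix, so values of other types and other fields are never touched.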
Encodes/decodes field metadata. See FieldInfosFormat and FieldInfos:
Subspace: ("inf")
(long_field0, "name") => (string_fieldName)
(long_field0, "has_index") => (boolean_value)
(long_field0, "has_payloads") => (boolean_value)
(long_field0, "has_norms") => (boolean_value)
(long_field0, "has_vectors") => (boolean_value)
(long_field0, "doc_values_type") => (string_docValuesType)
(long_field0, "norms_type") => (string_normsType)
(long_field0, "index_options") => (string_indexOptions)
(long_field0, "attr", string_attr0) => (string_value)
(long_field0, "attr", string_attr1) => (string_value)
...
(long_field1, "name") => (string_fieldName)
...
Encodes/decodes live-ness of documents. See LiveDocsFormat.
Subspace: ("liv")
(long_liveGen0) => (long_totalSize)
(long_liveGen0, long_setBitIndex0) => ()
(long_liveGen0, long_setBitIndex1) => ()
(long_liveGen1) => (long_totalSize)
...
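As a rough illustration of this layout, the sketch below writes a bit set as one size entry plus one empty-valued key per set bit, then rebuilds it. An in-memory map stands in for the database, `null` stands in for the empty value, and the class and method names are hypothetical. It also assumes a set bit marks a live document, which the layout above does not state explicitly.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LiveDocsSketch {
    // (gen) => totalSize, and (gen, setBitIndex) => () for every set bit.
    static Map<List<Long>, Long> encode(long gen, BitSet live, long totalSize) {
        Map<List<Long>, Long> kv = new LinkedHashMap<>();
        kv.put(Arrays.asList(gen), totalSize);
        for (int i = live.nextSetBit(0); i >= 0; i = live.nextSetBit(i + 1)) {
            kv.put(Arrays.asList(gen, (long) i), null); // null models the empty value
        }
        return kv;
    }

    // Rebuild the bit set for one generation from its keys.
    static BitSet decode(Map<List<Long>, Long> kv, long gen) {
        Long total = kv.get(Arrays.asList(gen));
        if (total == null) {
            throw new IllegalArgumentException("unknown generation " + gen);
        }
        BitSet live = new BitSet(total.intValue());
        for (List<Long> k : kv.keySet()) {
            if (k.size() == 2 && k.get(0) == gen) {
                live.set(k.get(1).intValue());
            }
        }
        return live;
    }

    public static void main(String[] args) {
        BitSet live = new BitSet(5);
        live.set(0);
        live.set(2);
        live.set(3);
        Map<List<Long>, Long> kv = encode(0L, live, 5L);
        System.out.println(decode(kv, 0L)); // {0, 2, 3}
    }
}
```

Storing only the set bits keeps the number of keys proportional to the number of set bits rather than to the segment size.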
Encodes/decodes per-document score normalization values. See NormsFormat.
Subspace: ("len")
Uses DocValuesFormat with a different subspace extension.
Encodes/decodes terms, postings, and proximity data. See PostingsFormat.
Subspace: ("pst")
(long_field0, bytes_term0, "numDocs") => (littleEndianLong_value)
(long_field0, bytes_term0, long_doc0) => (long_termDocFreq)
(long_field0, bytes_term0, long_doc0, long_pos0) => (long_startOffset, long_endOffset, bytes_payload)
...
(long_field1, bytes_term1, "numDocs") => (littleEndianLong_value)
...
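The "numDocs" value is the only one in this layout described as a little-endian long rather than a tuple-encoded one. A minimal sketch of that byte encoding follows, assuming a fixed 8-byte width; the source does not state the width or the reason for the choice, and the helper names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LittleEndianLongSketch {
    // Encode a long as 8 bytes, least significant byte first.
    static byte[] encode(long value) {
        return ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putLong(value).array();
    }

    // Decode 8 little-endian bytes back into a long.
    static long decode(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }

    public static void main(String[] args) {
        byte[] raw = encode(3L);
        System.out.println(raw[0]);      // 3
        System.out.println(decode(raw)); // 3
    }
}
```

FoundationDB's atomic add mutation treats values as little-endian integers, so a counter stored this way could be incremented without first reading it; whether the layer relies on that is not stated here.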
Encodes/decodes segment metadata. See SegmentInfoFormat.
Subspace: ("si")
("doc_count") => (long_docCount)
("is_compound_file") => (boolean_value)
("version") => (long_version)
("attr", string_attr0) => (string_value)
("attr", string_attr1) => (string_value)
...
("diag", string_diag0) => (string_value)
("diag", string_diag1) => (string_value)
...
("file", string_file0) => ()
("file", string_file1) => ()
...
Encodes/decodes per-document fields. See StoredFieldsFormat.
The key parts long_TYPE and long_DATA below refer to constant values, currently 0 and 1.
Subspace: ("fld")
(long_doc0, long_TYPE, long_field0) => (string_typeName, long_dataIndex)
(long_doc0, long_TYPE, long_field1) => (string_typeName, long_dataIndex)
...
(long_doc0, long_DATA, long_field0, long_dataIndex, long_offset0) => (bytes_value)
(long_doc0, long_DATA, long_field0, long_dataIndex, long_offset1) => (bytes_value)
(long_doc0, long_DATA, long_field1, long_dataIndex, long_offset0) => (bytes_value)
...
(long_doc1, long_TYPE, long_field0) => (string_typeName, long_dataIndex)
...
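The DATA keys end in an offset, which suggests that a large field value is split across several entries. The sketch below shows one way such chunking could work for a single (doc, field, dataIndex) prefix. It assumes the offset is a byte offset into the value and uses an arbitrary `chunkSize` parameter; the layer's actual chunk size and offset semantics are not given here.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.NavigableMap;
import java.util.TreeMap;

public class StoredFieldChunksSketch {
    // Split a value into offset => chunk entries (the last key part and the value).
    static NavigableMap<Long, byte[]> split(byte[] value, int chunkSize) {
        NavigableMap<Long, byte[]> chunks = new TreeMap<>();
        for (int off = 0; off < value.length; off += chunkSize) {
            int end = Math.min(value.length, off + chunkSize);
            chunks.put((long) off, Arrays.copyOfRange(value, off, end));
        }
        return chunks;
    }

    // Reassemble by reading the chunks in ascending offset order.
    static byte[] join(NavigableMap<Long, byte[]> chunks) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] chunk : chunks.values()) {
            out.write(chunk, 0, chunk.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] value = "hello world".getBytes();
        NavigableMap<Long, byte[]> chunks = split(value, 4);
        System.out.println(chunks.keySet());          // [0, 4, 8]
        System.out.println(new String(join(chunks))); // hello world
    }
}
```

Since the offsets sort numerically, an ordered range read over the prefix returns the chunks already in the order needed to rebuild the value.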
Encodes/decodes per-document term vectors. See TermVectorsFormat.
Subspace: ("vec")
(long_doc0, "field", string_field0) => (long_fieldNum, long_numTerms, boolean_hasPositions, boolean_hasOffsets, boolean_hasPayloads)
(long_doc0, "field", string_field1) => (long_fieldNum, long_numTerms, boolean_hasPositions, boolean_hasOffsets, boolean_hasPayloads)
...
(long_doc0, "term", string_field0, bytes_term0) => (long_freq)
(long_doc0, "term", string_field0, bytes_term0, long_pos0) => (long_startOffset, long_endOffset, bytes_payload)
(long_doc0, "term", string_field0, bytes_term0, long_pos1) => (long_startOffset, long_endOffset, bytes_payload)
(long_doc0, "term", string_field0, bytes_term1) => (long_freq)
...
(long_doc1, "field", string_field0) => (long_fieldNum, long_numTerms, boolean_hasPositions, boolean_hasOffsets, boolean_hasPayloads)
...
Maven is used for building, packaging and running tests.
$ mvn test
- Package fdb-lucene-layer
  $ mvn package
- Download the Solr source
  $ curl -O http://mirror.nexcess.net/apache/lucene/solr/4.4.0/solr-4.4.0-src.tgz
  $ tar xzf solr-4.4.0-src.tgz
  $ cd solr-4.4.0/
- Run the full test suite
  $ ant test -Dtests.codec=FDBCodec \
        -Dtests.directory=com.foundationdb.lucene.FDBTestDirectory \
        -lib ../target/fdb-lucene-layer-0.0.1-SNAPSHOT.jar \
        -lib ../target/dependency/fdb-java-1.0.0.jar