Overview

mod-search is based on a metadata-driven approach: the resource description is specified in a JSON file, and all rules, mappings, and other processing are applied by internal mod-search services.

Supported search field types

Elasticsearch mapping field types are documented in field data types. The field type defines which search capabilities the corresponding field provides. For example, the keyword field type is used for term queries and aggregations (providing facets for the record), while text fields are intended for full-text queries.
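
For illustration, the following hedged sketch shows how a term/facet field and a full-text field might be declared side by side. The field names `status` and `title` and the `multilang` index reference are assumptions made for this example, not taken verbatim from the shipped configuration:

```json
{
  "fields": {
    "status": {
      "index": "keyword",
      "searchTypes": [ "facet", "filter" ]
    },
    "title": {
      "index": "multilang"
    }
  }
}
```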

Resource description

| Property name | Description |
|---|---|
| name | The resource name. It is used for searching by resource, determining the index name, and creating index settings and mappings. |
| parent | The parent resource name (currently used for browsing by subjects, when an additional index is added to arrange the instance subjects uniquely). |
| eventBodyJavaClass | The Java class that the incoming JSON can be mapped to. Currently it is used to make processing of search fields more convenient. |
| languageSourcePaths | Contains a list of JSON path expressions to extract language values in ISO-639 format. If multi-language support is required for the resource, this path must be specified. |
| searchFieldModifiers | Contains a list of field modifiers that pre-process incoming fields for the Elasticsearch request. |
| fields | List of field descriptions used to extract values from the incoming resource event. |
| fieldTypes | List of reusable field descriptions that can be referenced by alias via the $ref field of PlainFieldDescription. This reduces duplication in the resource description. |
| searchFields | Contains a list of generated fields for the resource events (for example, it can contain ISBN normalized values or a generated subset of field values). |
| indexMappings | Object with additional index mappings for the resource (it can be helpful for the copy_to functionality of Elasticsearch). |
| mappingSource | Used to include or exclude fields from the _source object in Elasticsearch. Mainly used to reduce the size per index. See also: _source field |
| reindexSupported | Indicates whether the resource can be reindexed. |
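
As an illustration, a minimal resource description could combine several of these properties. This is a hedged sketch: the resource name, Java class, JSON paths, and field names below are assumptions made for the example, not the actual configuration shipped with mod-search:

```json
{
  "name": "instance",
  "eventBodyJavaClass": "org.folio.search.domain.dto.Instance",
  "languageSourcePaths": [ "$.languages" ],
  "fields": {
    "title": { "index": "multilang" },
    "languages": { "index": "keyword", "searchTypes": [ "facet", "filter" ] }
  },
  "reindexSupported": true
}
```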

Supported field description types

| Field type | Description |
|---|---|
| plain | The default field type; there is no need to specify it explicitly. It can be used to define all fields containing the following values: string, number, boolean, or an array of plain values. |
| object | Marks a field whose key contains subfields; each subfield must have its own field description. |
| authority | Provides special options to divide a single authority record into multiple records based on the distinctType property value. |

Plain field description

| Property name | Description |
|---|---|
| searchTypes | List of search types that are supported for the current field. Allowed values: facet, filter, sort |
| searchAliases | List of aliases that can be used as a field name in the CQL search query. It can be used to combine several fields during the search. For example, a query `keyword all title` combines the following instance record fields: title, alternativeTitles.alternativeTitle, indexTitle, identifiers.value, contributors.name. Another way of using it is to rename a field while keeping backward compatibility without requiring a reindex. |
| index | Reference to the Elasticsearch mappings that are specified in index-field-types |
| showInResponse | Marks the field to be returned by the search operation. mod-search adds all marked field paths to the Elasticsearch query. See also: Source filtering |
| searchTermProcessor | Search term processor that pre-processes the incoming value from the CQL query for the search request. |
| mappings | Elasticsearch field mappings. It can contain a new field mapping or enrich the referenced mappings that come from index-field-types. |
| defaultValue | The default value for the plain field. |
| indexPlainValue | Specifies whether the plain keyword value should be indexed along with the field. Works only for full-text fields. See also: Full-text plain fields |
| sortDescription | Provides the sort description for the field. If not specified, standard rules are applied for the sort field. See also: Sorting by fields |
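
Putting several of these properties together, a plain field description might look like the following hedged sketch (the field name, alias names, and sort field name are illustrative assumptions):

```json
{
  "title": {
    "searchTypes": [ "sort" ],
    "searchAliases": [ "title", "keyword" ],
    "index": "multilang",
    "showInResponse": true,
    "indexPlainValue": true,
    "sortDescription": { "fieldName": "sort_title" }
  }
}
```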

Object field description

| Property name | Description |
|---|---|
| properties | Map where the key is the subfield name and the value is the field description. |
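
For example, an object field whose subfields each carry their own field description might be declared as in the sketch below (the `type` discriminator, the field name, and the subfield names are assumptions made for this illustration):

```json
{
  "identifiers": {
    "type": "object",
    "properties": {
      "value": { "index": "keyword", "searchAliases": [ "identifiers.value" ] },
      "identifierTypeId": { "index": "keyword", "searchTypes": [ "filter" ] }
    }
  }
}
```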

Authority field description

| Property name | Description |
|---|---|
| distinctType | Distinct type used to split a single entity into multiple records that contain only the common fields, excluding all fields marked with other distinct types. |
| headingType | Heading type that should be set on the resource if the field contains values. |
| authRefType | Authorized, Reference, or Auth/Ref type for the divided authority record. |
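
A hedged sketch of how these properties might be combined in an authority field description follows. The field name and the distinctType, headingType, and authRefType values are illustrative assumptions, not copied from the real authority resource description:

```json
{
  "personalName": {
    "type": "authority",
    "index": "standard",
    "distinctType": "personalName",
    "headingType": "Personal Name",
    "authRefType": "Authorized"
  }
}
```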

Creating Elasticsearch mappings

Elasticsearch mappings are created using field descriptions. All fields specified in the resource description will be added to the index mappings and used to prepare the Elasticsearch document.

By default, mappings are taken from index-field-types. This is a common file containing pre-defined mapping values that can be referenced from the index field of PlainFieldDescription. The mappings for a specific field can be enriched using the mappings field. Also, the ResourceDescription contains an indexMappings section, which allows developers to add custom mappings without specifying them in the index-field-types.json file.

For example, the resource description contains the following field description:

{
  "fields": {
    "f1": {
      "index": "keyword",
      "mappings": {
        "copy_to": [ "sort_f1" ]
      }
    },
    "f2": {
      "index": "keyword"
    }
  },
  "indexMappings": {
    "sort_f1": {
      "type": "keyword",
      "normalizer": "keyword_lowercase"
    }
  }
}

Then the mappings helper will create the following mappings object:

{
  "properties": {
    "f1": {
      "type": "keyword",
      "copy_to": [ "sort_f1" ]
    },
    "f2": {
      "type": "keyword"
    },
    "sort_f1": {
      "type": "keyword",
      "normalizer": "keyword_lowercase"
    }
  }
}

Adding mod-search specific kafka topics

To make mod-search create its own Kafka topics, they should be added to the application.yml file under the application.kafka.topics path.

Topic parameters:

| Property name | Description |
|---|---|
| name | Topic base name that will be concatenated with the environment name and tenant name. |
| numPartitions | The number of partitions for the topic. Can be left blank to use the default value of '-1'. |
| replicationFactor | The number of replicas for the topic. Can be left blank to use the default value of '-1'. |

Example

application:
  kafka:
    topics:
      - name: search.instance-contributor
        numPartitions: ${KAFKA_CONTRIBUTORS_TOPIC_PARTITIONS:50}
        replicationFactor: ${KAFKA_CONTRIBUTORS_TOPIC_REPLICATION_FACTOR:}

Full-text fields

Currently, 2 field types are supported for full-text search: multi-language fields and standard fields.

Also, to support wildcard search by the whole phrase, plain values are added to the generated document. For example, for a multi-language analyzed field with indexPlainValue = true (the default):

Source record:

{
  "title": "Semantic web primer",
  "language": "eng"
}

Result document:

{
  "title": {
    "eng": "Semantic web primer",
    "src": "Semantic web primer"
  },
  "plain_title": "Semantic web primer"
}

Example of a document with a field where index = standard:

Source:

{
  "contributors": [
    {
      "name": "A contributor name",
      "primary": true
    }
  ]
}

Result document:

{
  "contributors": [
    {
      "name": "A contributor name",
      "plain_name": "A contributor name",
      "primary": true
    }
  ]
}

Field Sorting

All fields marked with searchType = sort can be used for sorting. To sort by text values, the following field indices are applicable:

  • keyword (case-sensitive)
  • keyword_lowercase (case-insensitive)

Sort Description

| Property name | Description |
|---|---|
| fieldName | Custom field name. If it is not specified, the default strategy is applied: sort_${fieldName}. |
| sortType | Sort field type: single or collection |
| secondarySort | List of fields that must be added as secondary sorting (e.g., sorting by itemStatus and then by instance title). |
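
For instance, a field sorted as a collection with a secondary sort could carry a sort description like the following hedged sketch (the field names and generated sort field names are illustrative assumptions):

```json
{
  "itemStatus": {
    "index": "keyword_lowercase",
    "searchTypes": [ "sort", "facet" ],
    "sortDescription": {
      "fieldName": "sort_itemStatus",
      "sortType": "collection",
      "secondarySort": [ "sort_title" ]
    }
  }
}
```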

By default, if the field is only marked with searchType = sort, mod-search will generate the following sort condition:

{
  "sort": [
    {
      "name": "sort_$field",
      "order": "${value comes from cql query: asc/desc}"
    }
  ]
}

If sortDescription contains sortType = collection, the following rules are applied (see the sketch after this list):

  • if sortOrder is asc, then the mode will be equal to min. It means that, when sorting by a field containing a list of values, the lowest value is picked for sorting.
  • if sortOrder is desc, then the mode will be equal to max. It means that, when sorting by a field containing a list of values, the highest value is picked for sorting.
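
Under these rules, an ascending sort on a collection field would extend the default sort condition shown above with a mode, roughly as in this hedged sketch (the exact generated request may differ):

```json
{
  "sort": [
    {
      "name": "sort_$field",
      "order": "asc",
      "mode": "min"
    }
  ]
}
```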

Testing

Unit testing

The project mostly uses a single assertion framework - AssertJ. A few examples:

assertThat(actualQuery).isEqualTo(matchAllQuery());

assertThat(actualCollection).isNotEmpty().containsExactly("str1", "str2");

assertThatThrownBy(() -> service.doExceptionalOperation())
  .isInstanceOf(IllegalArgumentException.class)
  .hasMessage("invalid parameter");

Integration testing

The module uses Testcontainers to run Elasticsearch, Apache Kafka and PostgreSQL in embedded mode. It is required to have Docker installed and available on the host where the tests are executed.

Local environment testing

Navigate to the docker folder in the project and run docker-compose up. This will build a local mod-search image and bring it up along with all the necessary infrastructure:

  • Elasticsearch along with Dashboards (the OpenSearch analogue of Kibana)
  • Kafka along with ZooKeeper
  • PostgreSQL
  • WireMock server for mocking external API calls (for example, authorization)

Then, you should invoke

curl --location --request POST 'http://localhost:8081/_/tenant' \
--header 'Content-Type: application/json' \
--header 'x-okapi-tenant: test_tenant' \
--header 'x-okapi-url: http://api-mock:8080' \
--data-raw '{
  "module_to": "mod-search-$version$",
  "purge": "false"
}'

to post a tenant in order to bring up Kafka listeners and get the indices created. You can check which tenants are enabled by WireMock in src/test/resources/mappings/user-tenants.json.

To rebuild the mod-search image you should:

  • bring down the existing containers by running docker-compose down
  • run docker-compose build mod-search to build a new mod-search image
  • run docker-compose up to bring the infrastructure back up

Hosts/ports of the containers to access their functionality:

  • http://localhost:5601/ - Dashboards UI for Elasticsearch monitoring and data modification through the dev console
  • localhost - host, 5010 - port for remote JVM debug
  • http://localhost:8081 - for calling the mod-search REST API. Note that the header x-okapi-url: http://api-mock:8080 should be added to requests for APIs that take the Okapi URL from headers
  • localhost:29092 - for Kafka interaction. If you are sending messages to Kafka from a Java application with spring-kafka, then this host should be added to the spring.kafka.bootstrap-servers property of application.yml

Consortium support for Local environment testing

The consortium feature is detected automatically at runtime by calling the /user-tenants endpoint. On module enable, the consortium feature is defined by the 'centralTenantId' tenant parameter.

Invoke the following

curl --location --request POST 'http://localhost:8081/_/tenant' \
--header 'Content-Type: application/json' \
--header 'x-okapi-tenant: consortium' \
--header 'x-okapi-url: http://api-mock:8080' \
--data-raw '{
  "module_to": "mod-search-$version$",
  "parameters": [
    {
      "key": "centralTenantId",
      "value": "consortium"
    }
  ]
}'

Then execute the following to enable a member tenant:

curl --location --request POST 'http://localhost:8081/_/tenant' \
--header 'Content-Type: application/json' \
--header 'x-okapi-tenant: member_tenant' \
--header 'x-okapi-url: http://api-mock:8080' \
--data-raw '{
  "module_to": "mod-search-$version$",
  "parameters": [
    {
      "key": "centralTenantId",
      "value": "consortium"
    }
  ]
}'

Note that tenantParameters such as loadReference and loadSample won't work, because the loadReferenceData method is not implemented in the SearchTenantService yet.