Elasticsearch 7.* and 8.* integration. OpenSearch integration. #469

ivanmrsulja · 2024-06-10T14:02:05Z

What does this pull request do?

Updates the current ES 6.x integration to support ES 7.x and 8.x, and adds OpenSearch support.

What's new?

Changes in ResponseParser and ES documentation on the first draft.

Example:

Changed src/main/java/edu/cornell/mannlib/vitro/webapp/searchengine/elasticsearch/ResponseParser.java to be in line with current ES API
Updated src/main/java/edu/cornell/mannlib/vitro/webapp/searchengine/elasticsearch/Elasticsearch_notes_on_the_first_draft.md with new mapping
Updated example.applicationSetup.n3 to show ES setup example

How should this be tested?

Initial setup

Install elasticsearch/opensearch somewhere.
Create a search index with the appropriate mapping (see below).
Check out VIVO and this branch of Vitro (see below), and do the usual installation procedure.
Modify {vitro_home}/config/applicationSetup.n3 to use this driver (see below).
Modify the vitro.local.searchengine.url configuration property to contain the ES index base URL. For backward compatibility, Solr can still be configured using vitro.local.solr.url; this, however, logs a warning advising the client to switch to the new configuration parameter.
Modify the vitro.local.searchengine.username configuration property to contain the ES/OS basic auth username.
Modify the vitro.local.searchengine.password configuration property to contain the ES/OS basic auth password.
Start elasticsearch/opensearch
Start VIVO
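
The fallback between the two URL properties described in the steps above can be sketched roughly as follows. This is only an illustration: the class name SearchEngineUrlResolver and its method are hypothetical and are not taken from this PR; only the two property keys come from the description.

```java
import java.util.Map;
import java.util.Optional;
import java.util.logging.Logger;

/** Hypothetical helper; class and method names are illustrative, not from the PR. */
final class SearchEngineUrlResolver {
    private static final Logger LOG =
            Logger.getLogger(SearchEngineUrlResolver.class.getName());

    static final String PRIMARY_KEY = "vitro.local.searchengine.url";
    static final String LEGACY_KEY = "vitro.local.solr.url";

    /** Prefers the new key; falls back to the legacy key and logs a warning. */
    static Optional<String> resolve(Map<String, String> props) {
        String primary = props.get(PRIMARY_KEY);
        if (primary != null && !primary.trim().isEmpty()) {
            return Optional.of(primary.trim());
        }
        String legacy = props.get(LEGACY_KEY);
        if (legacy != null && !legacy.trim().isEmpty()) {
            LOG.warning(LEGACY_KEY + " is deprecated; please switch to " + PRIMARY_KEY);
            return Optional.of(legacy.trim());
        }
        // Neither property is set: the caller decides how to report this.
        return Optional.empty();
    }
}
```

Returning an Optional keeps the "nothing configured" case explicit instead of passing null around; the real implementation may handle that case differently.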

A mapping for the search index

curl -X PUT "localhost:9200/vivo?pretty" -H 'Content-Type: application/json' -d'
{
  "settings":{
    "index":{
      "analysis":{
        "tokenizer":{
          "keyword_tokenizer":{
            "type":"keyword"
          },
          "whitespace_tokenizer":{
            "type":"whitespace"
          }
        },
        "filter":{
          "lowercase_filter":{
            "type":"lowercase"
          },
          "edgengram_filter":{
            "type":"edge_ngram",
            "min_gram":2,
            "max_gram":25
          },
          "word_delimiter_filter":{
            "type":"word_delimiter",
            "generate_word_parts":true,
            "generate_number_parts":true,
            "catenate_words":false,
            "catenate_numbers":false,
            "catenate_all":false,
            "split_on_case_change":true
          },
          "porter_stem_filter":{
            "type":"snowball",
            "language":"English"
          }
        },
        "analyzer":{
          "default":{
            "type":"english"
          },
          "edgengram_untokenized":{
            "type":"custom",
            "tokenizer":"keyword_tokenizer",
            "filter":[
              "lowercase_filter",
              "edgengram_filter"
            ]
          },
          "edgengram_untokenized_query":{
            "type":"custom",
            "tokenizer":"keyword_tokenizer",
            "filter":[
              "lowercase_filter"
            ]
          },
          "edgengram_stemmed":{
            "type":"custom",
            "tokenizer":"whitespace_tokenizer",
            "filter":[
              "word_delimiter_filter",
              "lowercase_filter",
              "porter_stem_filter",
              "edgengram_filter"
            ]
          },
          "edgengram_stemmed_query":{
            "type":"custom",
            "tokenizer":"whitespace_tokenizer",
            "filter":[
              "word_delimiter_filter",
              "lowercase_filter",
              "porter_stem_filter"
            ]
          },
          "sort_field_analyzer":{
            "type":"custom",
            "tokenizer":"keyword",
            "filter":[
              "lowercase"
            ]
          }
        }
      }
    }
  },
  "mappings":{
    "dynamic_templates":[
      {
        "field_sort_template":{
          "match":"*_label_sort",
          "mapping":{
            "type":"text",
            "fields":{
              "keyword":{
                "type":"keyword"
              }
            },
            "fielddata":true,
            "analyzer":"sort_field_analyzer"
          }
        }
      },
      {
        "field_ss_template":{
          "match":"*_ss",
          "mapping":{
            "type":"text",
            "fields":{
              "keyword":{
                "type":"keyword",
                "ignore_above":256
              }
            },
            "fielddata":true
          }
        }
      },
      {
        "date_range_template":{
          "match":"*_drsim",
          "mapping":{
            "type":"date_range",
            "format":"strict_date_optional_time||epoch_millis"
          }
        }
      }
    ],
    "properties":{
      "ALLTEXT":{
        "type":"text",
        "analyzer":"english",
        "fields":{
          "keyword":{
            "type":"keyword",
            "ignore_above":256
          }
        }
      },
      "ALLTEXTUNSTEMMED":{
        "type":"text",
        "analyzer":"standard"
      },
      "DocId":{
        "type":"keyword"
      },
      "classgroup":{
        "type":"keyword"
      },
      "type":{
        "type":"keyword"
      },
      "mostSpecificTypeURIs":{
        "type":"keyword"
      },
      "indexedTime":{
        "type":"long"
      },
      "nameRaw":{
        "type":"keyword"
      },
      "URI":{
        "type":"keyword"
      },
      "THUMBNAIL":{
        "type":"integer"
      },
      "THUMBNAIL_URL":{
        "type":"keyword"
      },
      "nameLowercaseSingleValued":{
        "type":"text",
        "analyzer":"standard",
        "fielddata":true
      },
      "BETA":{
        "type":"float"
      },
      "acNameUntokenized":{
        "type":"text",
        "analyzer":"edgengram_untokenized",
        "search_analyzer":"edgengram_untokenized_query"
      },
      "acNameStemmed":{
        "type":"text",
        "analyzer":"edgengram_stemmed",
        "search_analyzer":"edgengram_stemmed_query"
      }
    }
  }
}
'

Modify `applicationSetup.n3`

Change this (it is already changed in this PR):

# ----------------------------
#
# Search engine module: 
#    The Solr-based implementation is the only standard option, but it can be
#    wrapped in an "instrumented" wrapper, which provides additional logging 
#    and more rigorous life-cycle checking.
#

:instrumentedSearchEngineWrapper 
    a   <java:edu.cornell.mannlib.vitro.webapp.searchengine.InstrumentedSearchEngineWrapper> , 
        <java:edu.cornell.mannlib.vitro.webapp.modules.searchEngine.SearchEngine> ;
    :wraps :solrSearchEngine .

To this:

# ----------------------------
#
# Search engine module: 
#    The Solr-based implementation is the only standard option, but it can be
#    wrapped in an "instrumented" wrapper, which provides additional logging 
#    and more rigorous life-cycle checking.
#

:instrumentedSearchEngineWrapper 
    a   <java:edu.cornell.mannlib.vitro.webapp.searchengine.InstrumentedSearchEngineWrapper> , 
        <java:edu.cornell.mannlib.vitro.webapp.modules.searchEngine.SearchEngine> ;
    :wraps :elasticSearchEngine .

:elasticSearchEngine
    a   <java:edu.cornell.mannlib.vitro.webapp.searchengine.elasticsearch.ElasticSearchEngine> ,
        <java:edu.cornell.mannlib.vitro.webapp.modules.searchEngine.SearchEngine> .

Your setup should now be complete 😃! After this, you should perform the common manual tests that are done for every new release.

Interested parties

@chenejac

Reviewers' expertise

Candidates for reviewing this PR should have expertise in some of the following:

Java
Elasticsearch 7.* or 8.*

chenejac · 2024-10-08T12:54:10Z

The following features should be tested:

search form - searching, filtering and sorting (changing localization)
list of research, persons, org units - alphabetical index (changing localization)
lookup/autocompletion at some forms - for instance adding author to a publication (changing localization)
visualizations - map of science, collaboration network

chenejac · 2024-11-11T15:47:30Z

@ivanmrsulja please create a VIVO PR with an updated example.runtime.properties. Also, please move the JSON configuration into the vivo-es project. Add a Dockerfile to the vivo-es project, and update the README file to explain how ES should be run.

chenejac

@ivanmrsulja basic VIVO search functionality works for me. I didn't review the code. The instructions in the PR description about setting up the Elasticsearch index could be replaced with a pointer to the vivo-es README file.

api/src/main/java/edu/cornell/mannlib/vitro/webapp/searchengine/base/SearchEngineUtil.java

...ain/java/edu/cornell/mannlib/vitro/webapp/searchengine/elasticsearch/CustomQueryBuilder.java

litvinovg · 2025-03-21T12:02:08Z

It seems the mapping in the PR description is not the same as the mapping in https://github.com/ivanmrsulja/vivo-es/blob/main/index-config.json
Also, if the search string is empty, there are no search results on the search page and no filters are visible. That differs from the current behavior with Solr-based search indexes.

litvinovg · 2025-06-05T13:08:00Z

@ivanmrsulja Could you have a look at why results are not shown on pages with content type Browse search filter results?
Another issue on the same type of pages is that alphabetical index filtering does not work: zero results are returned if a letter is selected.

ivanmrsulja · 2025-06-11T13:57:30Z

@ivanmrsulja Could you have a look at why results are not shown on pages with content type Browse search filter results? Another issue on the same type of pages is that alphabetical index filtering does not work: zero results are returned if a letter is selected.

I hadn't realized there was additional functionality besides facet search on the global search page 😄. My newest commit should address both of the problems you mentioned. Please test it when you have time and let me know if something is not working.

litvinovg · 2025-06-12T09:08:12Z

Thanks a lot!
I just found one more issue while trying the date range slider (similar to this). Could you take a look?
error.log
Example filter to reproduce (remove the .txt suffix and put it into the rdf/display/firsttime directory): time_period_example.n3.txt

ivanmrsulja · 2025-06-13T14:10:32Z

Thanks a lot! I just found one more issue while trying the date range slider (similar to this). Could you take a look? error.log Example filter to reproduce (remove the .txt suffix and put it into the rdf/display/firsttime directory): time_period_example.n3.txt

Should be fixed now, please test it out when you have time 😄

litvinovg · 2025-06-24T12:42:38Z

I think at the previous dev meeting we discussed null pointer exceptions when brackets are used in the search text input field.
There was also an issue with the ":" character; in that case the word was removed from the search, if I remember correctly.
It seems it still doesn't work for inputs like Andrew:-
And it throws a new exception on input like Andrew:"
VItro-469.error.log

litvinovg · 2025-06-26T09:59:09Z

api/src/main/java/edu/cornell/mannlib/vitro/webapp/searchengine/elasticsearch/ESQuery.java

+        try {
+            parser.parse(query.getQuery());
+        } catch (ParseException e) {
+            treatAsLuceneQuery = false;


Isn't the Lucene query format used for all current search queries in VIVO?

Unfortunately not. A query like :something"( is regarded as an invalid Lucene query; Solr handles this implicitly by treating everything as full text.
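
The "try to parse, otherwise treat as full text" decision discussed in this thread can be illustrated with a small self-contained sketch. It is not the PR's code: the class QueryModeSketch is hypothetical, and looksWellFormed is a deliberately crude stand-in for the Lucene query parser that only checks balanced quotes and parentheses, so it accepts some inputs that Lucene itself would reject.

```java
/** Hypothetical illustration of the fallback decision; not the PR's ESQuery code. */
final class QueryModeSketch {
    enum Mode { LUCENE_SYNTAX, PLAIN_TEXT }

    /** Crude stand-in for a real parser: checks only balanced quotes and parentheses. */
    static boolean looksWellFormed(String q) {
        int depth = 0;
        boolean inQuotes = false;
        for (int i = 0; i < q.length(); i++) {
            char c = q.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character
            } else if (c == '"') {
                inQuotes = !inQuotes;
            } else if (!inQuotes && c == '(') {
                depth++;
            } else if (!inQuotes && c == ')') {
                if (--depth < 0) {
                    return false; // closing parenthesis without an opener
                }
            }
        }
        return depth == 0 && !inQuotes;
    }

    /** Malformed input is not rejected; it is simply searched as plain text. */
    static Mode chooseMode(String q) {
        return looksWellFormed(q) ? Mode.LUCENE_SYNTAX : Mode.PLAIN_TEXT;
    }
}
```

With this sketch, chooseMode(":something\"(") falls back to plain text because the quote is never closed, while a well-formed input such as title:(deep learning) keeps its Lucene interpretation.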

chenejac

@ivanmrsulja please check my comments.

chenejac · 2025-06-27T13:31:00Z

api/pom.xml

+        <dependency>
+            <groupId>org.apache.lucene</groupId>
+            <artifactId>lucene-core</artifactId>
+            <version>9.9.2</version>
+        </dependency>


Please double-check whether we need this if we are using only the queryparser.

Yes, we need it. The lucene-queryparser module depends on core classes from lucene-core, such as:

org.apache.lucene.analysis.Analyzer

org.apache.lucene.search.Query

org.apache.lucene.util.*

I updated the dependencies to the latest 9.x.x version as version >=10 needs newer Java.

chenejac · 2025-06-27T14:02:25Z

.../java/edu/cornell/mannlib/vitro/webapp/searchengine/elasticsearch/ExpressionTransformer.java

@@ -0,0 +1,341 @@
+package edu.cornell.mannlib.vitro.webapp.searchengine.elasticsearch;


This class (and the linked classes CustomQueryBuilder and SearchType) makes this PR a little more complex. If I understood correctly, using query_string in ElasticSearch is possible, meaning we pass the query to ElasticSearch and wait for a response, but it is less powerful than Ivan's approach and it is not good practice, especially for search boxes (check the first warning box at this link, and also this discussion).

Regarding the advanced features of the ElasticSearch query DSL over the Lucene-style query_string, I think we might conclude they are not crucial for us at the moment, because we expect that end users won't use advanced query syntax elements in the search box; they will only enter the content they are looking for (meaning we need keyword searching).

However, the question is whether it is good practice to use query_string via a search box, given the following limitations:

While versatile, the query is strict and returns an error if the query string includes any invalid syntax (due to brackets, colons, and other elements that might be present in the title of a work someone is looking for) - source. This might be fixed by using the lenient flag.

A malicious bot could inject special Lucene syntax such as wildcards, range queries, or even malformed expressions, leading to performance degradation (e.g., complex regex/wildcard queries), etc.

Can we discuss here the advantages and disadvantages of query_string and the ElasticSearch DSL query?
What about simple_query_string? It might be safer for us.
Moreover, I am wondering whether the issues listed above are also present when we are using Solr.

Thanks for the detailed comment! You're absolutely right to bring up the trade-offs between query_string, simple_query_string, and our current custom query parser based approach. Let me add some thoughts on why I’ve stuck with the custom builder and why switching to query_string (or even simple_query_string) might not be the best path forward in our case:

As Georgy mentioned, users often submit queries like Deep Learning: A Survey (2023). With query_string, these inputs must be perfectly formed Lucene syntax, which users won't know. Even a missing quote or special character can cause parsing failures or incorrect behavior. While the lenient flag can help, it can also mask deeper issues and produce confusing results, often returning 0 results because of a parse failure or malformed query (e.g. :something"().

Using query_string directly opens the door to Lucene syntax injection. Malicious users or bots could send wildcard-heavy, deeply nested, or regex-based queries that can degrade search performance or cause errors. Our current approach allows us to filter, escape, or block these patterns early before hitting Elasticsearch.
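
As one illustration of the "escape" option mentioned above, the sketch below backslash-escapes the characters that Lucene's query syntax treats as special. The class LuceneEscapeSketch is hypothetical and not part of this PR; the character set is intended to match what Lucene's own QueryParser.escape utility covers, but that correspondence is an assumption worth checking against the Lucene version in use.

```java
/** Hypothetical helper; a stdlib-only approximation of escaping Lucene query syntax. */
final class LuceneEscapeSketch {
    // Characters with special meaning in the classic Lucene query syntax.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    /** Prefixes every special character with a backslash so it is matched literally. */
    static String escape(String input) {
        StringBuilder sb = new StringBuilder(input.length() + 8);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }
}
```

Escaping everything turns the whole input into literal text, so it only suits the free-text case; structured queries that rely on field:value syntax would need to bypass it.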

Elasticsearch’s query DSL gives us the ability to define clear search logic with must, should, boost, and filter. We can control how fields like title and author contribute to scoring, or apply different analyzers. query_string flattens this control, and it becomes harder to evolve the search experience in the future.

Since we must support both structured and free-text queries in the same engine endpoint (I know of no way to tell where a query came from, and not all structured parts are in filters), a naive switch to query_string wouldn't improve our situation. It would require extra parsing or escaping on our side anyway. Our current query builder handles both gracefully and consistently, keeping our logic centralized and testable. simple_query_string also doesn't support more advanced query structures or field-level boosting (which we might want to add in the future). It's a safer subset of query_string, but I think it will not be flexible enough even for our current feature set.
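
To make the point about must, should, boost, and filter more concrete, here is a minimal sketch of assembling a bool query body by hand. The class BoolQuerySketch is hypothetical and is not the PR's CustomQueryBuilder; the field names ALLTEXT, nameLowercaseSingleValued, and classgroup are borrowed from the mapping in the PR description, and the 2.0 boost is an arbitrary example value.

```java
/** Hypothetical sketch of a bool query body; not the PR's CustomQueryBuilder. */
final class BoolQuerySketch {

    /** Minimal JSON string escaping for quotes, backslashes and control characters. */
    static String jsonString(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            if (c == '"') {
                sb.append("\\\"");
            } else if (c == '\\') {
                sb.append("\\\\");
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    /** Full-text match is required, a name match boosts the score, classgroup only filters. */
    static String build(String text, String classgroup) {
        String q = jsonString(text);
        return "{\"query\":{\"bool\":{"
                + "\"must\":[{\"match\":{\"ALLTEXT\":{\"query\":" + q + "}}}],"
                + "\"should\":[{\"match\":{\"nameLowercaseSingleValued\":"
                + "{\"query\":" + q + ",\"boost\":2.0}}}],"
                + "\"filter\":[{\"term\":{\"classgroup\":" + jsonString(classgroup) + "}}]"
                + "}}}";
    }
}
```

Because the user's text is only ever placed inside match clauses as a JSON string, characters such as : or ( are analyzed as ordinary text rather than parsed as query syntax, which is the main contrast with query_string.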

chenejac · 2025-06-27T14:06:17Z

...rc/main/java/edu/cornell/mannlib/vitro/webapp/searchengine/elasticsearch/QueryConverter.java

@@ -140,25 +197,25 @@ public QueryStringMap(String queryString) {

        /**
         * This is a kluge, but perhaps it will work for now.
-         * 
+         * <p>


Please remove those p tags if they are not needed.

This is a code style convention; they are needed if you want a blank line in IntelliSense or the generated documentation.

ivanmrsulja marked this pull request as draft June 11, 2024 07:11

chenejac marked this pull request as ready for review June 18, 2024 13:54

chenejac linked an issue Jun 21, 2024 that may be closed by this pull request

VIVO-1646: Epic for tracking the implementation of ElasticSearch functionality vivo-project/VIVO#3236

Open

ivanmrsulja changed the title ~~Small mapping update and response parsing fix.~~ Elasticsearch 7.* and 8.* integration. Jun 24, 2024

ivanmrsulja changed the title ~~Elasticsearch 7.* and 8.* integration.~~ Elasticsearch 7.* and 8.* integration. OpenSearch integration. Jul 5, 2024

chenejac linked an issue Oct 29, 2024 that may be closed by this pull request

VIVO-1587: Elasticsearch integration with VIVO vivo-project/VIVO#3177

Open

chenejac requested a review from litvinovg November 7, 2024 16:00

chenejac previously approved these changes Dec 5, 2024

ivanmrsulja dismissed chenejac’s stale review via 8b659c1 December 9, 2024 10:30

ivanmrsulja requested a review from chenejac December 9, 2024 10:31

wwtamu reviewed Dec 12, 2024

api/src/main/java/edu/cornell/mannlib/vitro/webapp/searchengine/base/SearchEngineUtil.java

wwtamu reviewed Dec 12, 2024

...ain/java/edu/cornell/mannlib/vitro/webapp/searchengine/elasticsearch/CustomQueryBuilder.java

ivanmrsulja force-pushed the feature/elasticsearch-integration branch from f9e3825 to 8a008c0 Compare April 1, 2025 13:34

ivanmrsulja added 14 commits May 21, 2025 09:03

Small mapping update and response parsing fix.

3723d73

Added query parser to transform Solr queries to ES JSON-like queries.

951508b

Updated mapping to support aggregations.

987fa5c

Added small ES query optimizations.

22a85e6

Completed implementation of missing ES engine methods. Improved docum…

ee3fbb4

…entation and configuration mechanisms.

Switched to one search engine configuration URL. Implemented a server…

b74091c

…-side service detection mechanism.

Added SSL and Basic auth support. Added OpenSearch support.

d46a80c

Added fallback configuration property for legacy Solr configurations.

84a822c

Refactored code so that common property fallback resolution is in it'…

a6b288a

…s own utility class.

Fixed fetch count bug. Fixed advanced search filter bug where filter …

789a23e

…did not work.

Fixed facets query bug. Fixed delete by query bug.

1a4b269

Fixed UTF-8 parsing bug while indexing.

db39bcc

Added support for *_drsim fields. Fixed pagination and statistics bug.

991d06c

Fixed sort by relevance bug.

162a03e

ivanmrsulja added 7 commits May 21, 2025 09:03

Updated documentation.

b6fbaf2

Updated example.runtime.properties

a1fd76c

Fixed inaccurate count retrieved for large indexes when using search …

829a0e6

…query.

Added field mappings which were missing for autocomplete search.

b5b7383

Small bugfix and code refactor.

03d74b9

Fixed sorting issue.

6c0cd8b

Fixed match all query bug.

f95843d

ivanmrsulja force-pushed the feature/elasticsearch-integration branch from 8a008c0 to f95843d Compare May 21, 2025 07:04

Added support for aggregation (facet) search.

d346d22

Added support for facet browsing functionality. Added support for reg…

d11b311

…exp queries.

Fixed date range slider issue.

023e53f

Implemented a linear pass experimental fix to parsing problems.

7e01a57

Fixed all query parsing issues.

ba5387a

litvinovg reviewed Jun 26, 2025

chenejac requested changes Jun 27, 2025

Updated lucene dependencies to latest supported version.

b008651

ivanmrsulja · Jun 30, 2025 (edited)
