Elasticsearch 7.* and 8.* integration. OpenSearch integration. #469

Open
wants to merge 27 commits into
base: main

ivanmrsulja
Member

@ivanmrsulja ivanmrsulja commented Jun 10, 2024

What does this pull request do?

Updates current ES 6.x integration to 8.x.

What's new?

Changes in ResponseParser and ES documentation on the first draft.

Example:

  • Changed src/main/java/edu/cornell/mannlib/vitro/webapp/searchengine/elasticsearch/ResponseParser.java to be in line with current ES API
  • Updated src/main/java/edu/cornell/mannlib/vitro/webapp/searchengine/elasticsearch/Elasticsearch_notes_on_the_first_draft.md with new mapping
  • Updated example.applicationSetup.n3 to show ES setup example

How should this be tested?

Initial setup

  • Install elasticsearch/opensearch somewhere.
  • Create a search index with the appropriate mapping (see below).
  • Check out VIVO and this branch of Vitro (see below), and do the usual installation procedure.
  • Modify {vitro_home}/config/applicationSetup.n3 to use this driver (see below).
  • Set the vitro.local.searchengine.url configuration property to the ES index base URL. For backward compatibility, Solr can still be configured through vitro.local.solr.url, but this logs a warning advising the client to switch to the new configuration parameter.
  • Set the vitro.local.searchengine.username configuration property to the ES/OS basic auth username.
  • Set the vitro.local.searchengine.password configuration property to the ES/OS basic auth password.
  • Start elasticsearch/opensearch
  • Start VIVO
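
The property handling described in the list above can be sketched as follows. This is a minimal illustration, not the code in this PR: the class and method names are hypothetical, only the three property keys come from the description, and a real implementation would use the application's logger instead of System.err.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Properties;

/** Hypothetical helper for illustration only; not part of Vitro. */
final class SearchEngineConfigSketch {
    static final String NEW_URL_KEY = "vitro.local.searchengine.url";
    static final String LEGACY_URL_KEY = "vitro.local.solr.url";

    /** Prefers the new key; falls back to the legacy Solr key and warns. */
    static String resolveBaseUrl(Properties props) {
        String url = props.getProperty(NEW_URL_KEY);
        if (url != null && !url.trim().isEmpty()) {
            return url.trim();
        }
        String legacy = props.getProperty(LEGACY_URL_KEY);
        if (legacy != null && !legacy.trim().isEmpty()) {
            // Assumption: the real code logs through the application logger.
            System.err.println("WARN: " + LEGACY_URL_KEY
                    + " is deprecated; please use " + NEW_URL_KEY);
            return legacy.trim();
        }
        throw new IllegalStateException("No search engine URL configured");
    }

    /** Builds the value of an HTTP Basic "Authorization" header. */
    static String basicAuthHeader(String username, String password) {
        String credentials = username + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }
}
```

With only vitro.local.solr.url set, resolveBaseUrl returns that value and prints the warning; once vitro.local.searchengine.url is set, it takes precedence.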

A mapping for the search index

curl -X PUT "localhost:9200/vivo?pretty" -H 'Content-Type: application/json' -d'
{
  "settings":{
    "index":{
      "analysis":{
        "tokenizer":{
          "keyword_tokenizer":{
            "type":"keyword"
          },
          "whitespace_tokenizer":{
            "type":"whitespace"
          }
        },
        "filter":{
          "lowercase_filter":{
            "type":"lowercase"
          },
          "edgengram_filter":{
            "type":"edge_ngram",
            "min_gram":2,
            "max_gram":25
          },
          "word_delimiter_filter":{
            "type":"word_delimiter",
            "generate_word_parts":true,
            "generate_number_parts":true,
            "catenate_words":false,
            "catenate_numbers":false,
            "catenate_all":false,
            "split_on_case_change":true
          },
          "porter_stem_filter":{
            "type":"snowball",
            "language":"English"
          }
        },
        "analyzer":{
          "default":{
            "type":"english"
          },
          "edgengram_untokenized":{
            "type":"custom",
            "tokenizer":"keyword_tokenizer",
            "filter":[
              "lowercase_filter",
              "edgengram_filter"
            ]
          },
          "edgengram_untokenized_query":{
            "type":"custom",
            "tokenizer":"keyword_tokenizer",
            "filter":[
              "lowercase_filter"
            ]
          },
          "edgengram_stemmed":{
            "type":"custom",
            "tokenizer":"whitespace_tokenizer",
            "filter":[
              "word_delimiter_filter",
              "lowercase_filter",
              "porter_stem_filter",
              "edgengram_filter"
            ]
          },
          "edgengram_stemmed_query":{
            "type":"custom",
            "tokenizer":"whitespace_tokenizer",
            "filter":[
              "word_delimiter_filter",
              "lowercase_filter",
              "porter_stem_filter"
            ]
          },
          "sort_field_analyzer":{
            "type":"custom",
            "tokenizer":"keyword",
            "filter":[
              "lowercase"
            ]
          }
        }
      }
    }
  },
  "mappings":{
    "dynamic_templates":[
      {
        "field_sort_template":{
          "match":"*_label_sort",
          "mapping":{
            "type":"text",
            "fields":{
              "keyword":{
                "type":"keyword"
              }
            },
            "fielddata":true,
            "analyzer":"sort_field_analyzer"
          }
        }
      },
      {
        "field_ss_template":{
          "match":"*_ss",
          "mapping":{
            "type":"text",
            "fields":{
              "keyword":{
                "type":"keyword",
                "ignore_above":256
              }
            },
            "fielddata":true
          }
        }
      },
      {
        "date_range_template":{
          "match":"*_drsim",
          "mapping":{
            "type":"date_range",
            "format":"strict_date_optional_time||epoch_millis"
          }
        }
      }
    ],
    "properties":{
      "ALLTEXT":{
        "type":"text",
        "analyzer":"english",
        "fields":{
          "keyword":{
            "type":"keyword",
            "ignore_above":256
          }
        }
      },
      "ALLTEXTUNSTEMMED":{
        "type":"text",
        "analyzer":"standard"
      },
      "DocId":{
        "type":"keyword"
      },
      "classgroup":{
        "type":"keyword"
      },
      "type":{
        "type":"keyword"
      },
      "mostSpecificTypeURIs":{
        "type":"keyword"
      },
      "indexedTime":{
        "type":"long"
      },
      "nameRaw":{
        "type":"keyword"
      },
      "URI":{
        "type":"keyword"
      },
      "THUMBNAIL":{
        "type":"integer"
      },
      "THUMBNAIL_URL":{
        "type":"keyword"
      },
      "nameLowercaseSingleValued":{
        "type":"text",
        "analyzer":"standard",
        "fielddata":true
      },
      "BETA":{
        "type":"float"
      },
      "acNameUntokenized":{
        "type":"text",
        "analyzer":"edgengram_untokenized",
        "search_analyzer":"edgengram_untokenized_query"
      },
      "acNameStemmed":{
        "type":"text",
        "analyzer":"edgengram_stemmed",
        "search_analyzer":"edgengram_stemmed_query"
      }
    }
  }
}
'
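
To make the autocomplete analyzers in this mapping easier to reason about, here is a small plain-Java emulation of what edgengram_untokenized (index time) and edgengram_untokenized_query (search time) do to a value such as acNameUntokenized. It is only a sketch: the class is hypothetical, it uses the min_gram/max_gram values from the mapping, and it works on Java chars, so it is only faithful for simple (e.g. ASCII) input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical emulation of the edge n-gram analyzers; illustration only. */
final class EdgeNgramSketch {

    /** Index side: keyword tokenizer + lowercase + edge_ngram(minGram..maxGram). */
    static List<String> indexTerms(String value, int minGram, int maxGram) {
        // The keyword tokenizer keeps the whole input as a single token.
        String token = value.toLowerCase(Locale.ROOT);
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, token.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(token.substring(0, len)); // prefixes only ("edge" n-grams)
        }
        return grams;
    }

    /** Query side: keyword tokenizer + lowercase only, so exactly one term. */
    static String queryTerm(String typed) {
        return typed.toLowerCase(Locale.ROOT);
    }

    /** A typed prefix matches when it equals one of the indexed prefixes. */
    static boolean matches(String indexedValue, String typed) {
        return indexTerms(indexedValue, 2, 25).contains(queryTerm(typed));
    }
}
```

For example, matches("Smith, John", "smi") is true, while matches("Smith, John", "john") is false because the untokenized variant only indexes prefixes of the whole value; a single typed character never matches because min_gram is 2.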

Modify applicationSetup.n3

  • Change this (it is already changed in this PR):
# ----------------------------
#
# Search engine module: 
#    The Solr-based implementation is the only standard option, but it can be
#    wrapped in an "instrumented" wrapper, which provides additional logging 
#    and more rigorous life-cycle checking.
#

:instrumentedSearchEngineWrapper 
    a   <java:edu.cornell.mannlib.vitro.webapp.searchengine.InstrumentedSearchEngineWrapper> , 
        <java:edu.cornell.mannlib.vitro.webapp.modules.searchEngine.SearchEngine> ;
    :wraps :solrSearchEngine .

  • To this:
# ----------------------------
#
# Search engine module: 
#    The Solr-based implementation is the only standard option, but it can be
#    wrapped in an "instrumented" wrapper, which provides additional logging 
#    and more rigorous life-cycle checking.
#

:instrumentedSearchEngineWrapper 
    a   <java:edu.cornell.mannlib.vitro.webapp.searchengine.InstrumentedSearchEngineWrapper> , 
        <java:edu.cornell.mannlib.vitro.webapp.modules.searchEngine.SearchEngine> ;
    :wraps :elasticSearchEngine .

:elasticSearchEngine
    a   <java:edu.cornell.mannlib.vitro.webapp.searchengine.elasticsearch.ElasticSearchEngine> ,
        <java:edu.cornell.mannlib.vitro.webapp.modules.searchEngine.SearchEngine> .

Your setup should now be complete 😃! After this, perform the common manual tests that are done for every new release.

Interested parties

@chenejac

Reviewers' expertise

Candidates for reviewing this PR should have expertise in some of the following:

  1. Java
  2. Elasticsearch 7.* or 8.*

@ivanmrsulja ivanmrsulja marked this pull request as draft June 11, 2024 07:11
@chenejac chenejac marked this pull request as ready for review June 18, 2024 13:54
@ivanmrsulja ivanmrsulja changed the title from "Small mapping update and response parsing fix." to "Elasticsearch 7.* and 8.* integration." Jun 24, 2024
@ivanmrsulja ivanmrsulja changed the title from "Elasticsearch 7.* and 8.* integration." to "Elasticsearch 7.* and 8.* integration. OpenSearch integration." Jul 5, 2024
@chenejac
Contributor

chenejac commented Oct 8, 2024

The following features should be tested:

  • search form - searching, filtering and sorting (changing localization)
  • list of research, persons, org units - alphabetical index (changing localization)
  • lookup/autocompletion at some forms - for instance adding author to a publication (changing localization)
  • visualizations - map of science, collaboration network

@chenejac chenejac linked an issue Oct 29, 2024 that may be closed by this pull request
@chenejac chenejac requested a review from litvinovg November 7, 2024 16:00
@chenejac
Contributor

@ivanmrsulja please create a VIVO PR with an updated example.runtime.properties. Also, please move the JSON configuration into the vivo-es project, add a Dockerfile there, and update the README file to explain how ES should be run.

chenejac
chenejac previously approved these changes Dec 5, 2024
Contributor

@chenejac chenejac left a comment

@ivanmrsulja basic VIVO search functionality works for me. I didn't review the code. The instructions in the PR description for setting up the Elasticsearch index could be replaced with a pointer to the vivo-es README file.

@litvinovg
Member

litvinovg commented Mar 21, 2025

It seems the mapping in the PR description is not the same as the mapping in https://github.com/ivanmrsulja/vivo-es/blob/main/index-config.json
Also, if the search string is empty, there are no search results on the search page and no filters are visible. That differs from the current behavior with Solr-based search indexes.
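
One way to picture the Solr-like behavior for an empty search string is to fall back to a match_all clause, so that all documents (and therefore facets) are still returned. The sketch below is only an assumption about how this could be handled, not the fix applied in this PR; the class is hypothetical and the ALLTEXT field name is taken from the mapping in the description.

```java
/** Hypothetical sketch: blank input becomes match_all instead of an empty result. */
final class EmptyQuerySketch {

    static String queryClause(String userInput) {
        if (userInput == null || userInput.trim().isEmpty()) {
            return "{\"match_all\":{}}";
        }
        return "{\"match\":{\"ALLTEXT\":{\"query\":\""
                + jsonEscape(userInput.trim()) + "\"}}}";
    }

    /** Minimal JSON string escaping for quotes, backslashes and control chars. */
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```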

@ivanmrsulja ivanmrsulja force-pushed the feature/elasticsearch-integration branch from f9e3825 to 8a008c0 Compare April 1, 2025 13:34
@ivanmrsulja ivanmrsulja force-pushed the feature/elasticsearch-integration branch from 8a008c0 to f95843d Compare May 21, 2025 07:04
@litvinovg
Member

@ivanmrsulja Could you have a look at why results are not shown on pages with content type Browse search filter results?
Another issue on the same type of page is that alphabetical index filtering does not work: zero results are returned when a letter is selected.

@ivanmrsulja
Member Author

I hadn't realized there was additional functionality besides facet search on the global search page 😄. My newest commit should address both of the problems you mentioned. Please test it when you have time and let me know if something is not working.

@litvinovg
Member

Thanks a lot!
I just found one more issue while trying the date range slider (similar to this). Could you take a look?
error.log
Example filter to reproduce (remove the .txt suffix and put it into the rdf/display/firsttime directory): time_period_example.n3.txt

@ivanmrsulja
Member Author

Should be fixed now, please test it out when you have time 😄
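
The attached log and the actual fix are not shown in this thread, so the following is only a sketch of how a field matched by the *_drsim template (type date_range) can be queried for a year interval, as a date range slider would need. The class and the example field name are hypothetical; the range query with a relation parameter is standard Elasticsearch DSL for range fields.

```java
import java.util.Locale;

/** Hypothetical sketch of a range query against a date_range field. */
final class DateRangeQuerySketch {

    /**
     * Matches documents whose stored range overlaps [fromYear, toYear].
     * The field name is assumed to be trusted (not user input).
     */
    static String yearRangeQuery(String field, int fromYear, int toYear) {
        if (fromYear > toYear) {
            throw new IllegalArgumentException("fromYear must not exceed toYear");
        }
        return String.format(Locale.ROOT,
                "{\"query\":{\"range\":{\"%s\":{"
                        + "\"gte\":\"%04d-01-01\",\"lte\":\"%04d-12-31\","
                        + "\"relation\":\"intersects\"}}}}",
                field, fromYear, toYear);
    }
}
```

The dates are written as yyyy-MM-dd, which the strict_date_optional_time format declared in the mapping accepts.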

@litvinovg
Member

I think at the previous dev meeting we discussed null pointer exceptions when brackets are used in the search text input field.
There was also an issue with the ":" character: in that case the word was removed from the search, if I remember correctly.
It seems it still doesn't work for inputs like Andrew:-
And it throws a new exception on input like Andrew:"
VItro-469.error.log

try {
    parser.parse(query.getQuery());
} catch (ParseException e) {
    treatAsLuceneQuery = false;
Member

Isn't the Lucene query format used for all current search queries in VIVO?

Member Author

Unfortunately not. A query like :something"( is regarded as an invalid Lucene query; Solr handles this implicitly by treating everything as full text.
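
For comparison with the parse-and-fall-back approach in the snippet under review, a different technique is to escape Lucene's special characters so that arbitrary user input is always read as plain text. The sketch below is a standalone re-implementation for illustration; the class is hypothetical, and the character set is intended to mirror what Lucene's QueryParser.escape handles, as far as I know. It is not what this PR does.

```java
/** Hypothetical stand-in for Lucene-style escaping; illustration only. */
final class LuceneEscapeSketch {

    // Characters with special meaning in the classic Lucene query syntax.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    /** Prefixes every special character with a backslash. */
    static String escape(String input) {
        StringBuilder sb = new StringBuilder(input.length() * 2);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }
}
```

Escaping keeps malformed input from failing to parse, but it also disables intentional operators, which is why a parse attempt followed by a full-text fallback can be preferable when both structured and free-text queries reach the same endpoint.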

Contributor

@chenejac chenejac left a comment

@ivanmrsulja please check my comments.

api/pom.xml Outdated
Comment on lines 115 to 119
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>9.9.2</version>
</dependency>
Contributor

Please double-check whether we need this if we are only using queryparser.

Member Author

@ivanmrsulja ivanmrsulja Jun 30, 2025

Yes, we need it. The lucene-queryparser module depends on core classes from lucene-core, such as:

  • org.apache.lucene.analysis.Analyzer
  • org.apache.lucene.search.Query
  • org.apache.lucene.util.*

Member Author

I updated the dependencies to the latest 9.x.x version, as versions >= 10 need a newer Java.

@@ -0,0 +1,341 @@
package edu.cornell.mannlib.vitro.webapp.searchengine.elasticsearch;
Contributor

This class (and the linked classes CustomQueryBuilder and SearchType) makes this PR a bit more complex. If I understood correctly, using query_string in Elasticsearch is possible, meaning we pass the query to Elasticsearch and wait for a response, but it is less powerful than Ivan's approach and it is not good practice, especially for search boxes (check the first warning box at this link, and also this discussion).

As for the advanced features of the Elasticsearch query DSL over the Lucene-style query_string, we might conclude they are not crucial for us at the moment, because we expect that end users won't use advanced query syntax elements in the search box; they will enter only the content they are looking for (meaning we need keyword searching).

However, the question is whether it is good practice to use query_string via a search box, given the following limitations:

  • While versatile, the query is strict and returns an error if the query string includes any invalid syntax (due to brackets, colons, and other elements which might be present in the title of a work someone is looking for) - source. This might possibly be fixed by using the lenient flag.
  • A malicious bot could inject special Lucene syntax like wildcards, range queries, or even malformed expressions, leading to performance degradation (e.g., complex regex/wildcard queries), etc.

Can we discuss here the advantages and disadvantages of query_string and the Elasticsearch DSL query?
What about simple_query_string? It might be safer for us.
Moreover, I am wondering whether the issues listed above are also present when we are using Solr.
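
To make the two options concrete, the sketch below builds the request bodies for query_string (with the lenient flag) and simple_query_string (with a restricted flags list). The class is hypothetical and the ALLTEXT field comes from the mapping in the PR description. One caveat, to the best of my knowledge: the Elasticsearch documentation describes lenient as ignoring format-based errors (such as text supplied for a numeric field), so it may not help with syntax errors, whereas simple_query_string is documented as ignoring invalid parts of the query instead of returning an error.

```java
/** Hypothetical sketch comparing the two query types; illustration only. */
final class QueryStringVariantsSketch {

    /** query_string: full Lucene syntax, errors on malformed input. */
    static String queryString(String userInput, boolean lenient) {
        return "{\"query\":{\"query_string\":{\"query\":\"" + jsonEscape(userInput)
                + "\",\"default_field\":\"ALLTEXT\",\"lenient\":" + lenient + "}}}";
    }

    /** simple_query_string: limited syntax; only the listed operators are enabled. */
    static String simpleQueryString(String userInput) {
        return "{\"query\":{\"simple_query_string\":{\"query\":\"" + jsonEscape(userInput)
                + "\",\"fields\":[\"ALLTEXT\"],\"default_operator\":\"AND\","
                + "\"flags\":\"AND|OR|NOT|PHRASE\"}}}";
    }

    /** Minimal JSON string escaping for quotes, backslashes and control chars. */
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Leaving PREFIX, FUZZY and similar flags out of the flags list is one way to limit the expensive patterns mentioned in the second bullet.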

Member Author

Thanks for the detailed comment! You're absolutely right to bring up the trade-offs between query_string, simple_query_string, and our current custom query parser based approach. Let me add some thoughts on why I’ve stuck with the custom builder and why switching to query_string (or even simple_query_string) might not be the best path forward in our case:

  • As Georgy mentioned, users often submit queries like Deep Learning: A Survey (2023). With query_string, these inputs must be perfectly formed Lucene syntax, which users won’t know. Even a missing quote or special character can cause parsing failures or incorrect behavior. While the lenient flag can help, it can also mask deeper issues and produce confusing results, often returning 0 results because of a parse failure or malformed query (e.g. :something"().

  • Using query_string directly opens the door to Lucene syntax injection. Malicious users or bots could send wildcard-heavy, deeply nested, or regex-based queries that can degrade search performance or cause errors. Our current approach allows us to filter, escape, or block these patterns early before hitting Elasticsearch.

  • Elasticsearch’s query DSL gives us the ability to define clear search logic with must, should, boost, and filter. We can control how fields like title and author contribute to scoring, or apply different analyzers. query_string flattens this control, and it becomes harder to evolve the search experience in the future.

  • Since we must support both structured and free-text queries in the same engine endpoint (I know of no way to differentiate where the query came from, and not all structured parts are in filters), a naive switch to query_string wouldn’t improve our situation. It would require extra parsing or escaping on our side anyway. Our current query builder handles both gracefully and consistently, keeping our logic centralized and testable. simple_query_string also doesn’t support more advanced query structures or field-level boosting (which we might want to add in the future). It's a safer subset of query_string, but I think it will not be flexible enough even for our current feature set.
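
As an illustration of the kind of control the query DSL gives, here is a sketch of a bool query with must, should (boosted) and filter clauses. It is not the CustomQueryBuilder from this PR: the class is hypothetical, the field names are taken from the mapping in the description, and the boost value is an arbitrary assumption.

```java
/** Hypothetical sketch of a bool query; not the PR's CustomQueryBuilder. */
final class BoolQuerySketch {

    /** Free text must match ALLTEXT; name matches score higher; class group filters. */
    static String build(String text, String classGroupUri) {
        String q = jsonEscape(text);
        StringBuilder sb = new StringBuilder("{\"query\":{\"bool\":{");
        sb.append("\"must\":[{\"match\":{\"ALLTEXT\":{\"query\":\"").append(q)
          .append("\",\"operator\":\"and\"}}}],");
        sb.append("\"should\":[{\"match\":{\"nameLowercaseSingleValued\":{\"query\":\"")
          .append(q).append("\",\"boost\":2.0}}}]");
        if (classGroupUri != null) {
            // Filters restrict the result set without affecting scoring.
            sb.append(",\"filter\":[{\"term\":{\"classgroup\":\"")
              .append(jsonEscape(classGroupUri)).append("\"}}]");
        }
        return sb.append("}}}").toString();
    }

    /** Minimal JSON string escaping for quotes, backslashes and control chars. */
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Because the user text only ever appears as the value of a match clause, input such as Deep Learning: A Survey (2023) is analyzed as plain text and cannot introduce Lucene operators.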

@@ -140,25 +197,25 @@ public QueryStringMap(String queryString) {

/**
* This is a kluge, but perhaps it will work for now.
*
* <p>
Contributor

Please remove those p tags if they are not needed.

Member Author

This is a code style convention; they are needed if you want a blank line in the IntelliSense or generated documentation.
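
A small, made-up example of the convention (the class and method are hypothetical and unrelated to the code under review): Javadoc ignores blank comment lines when rendering, so the tag is what starts a new paragraph in the generated output and in IDE tooltips.

```java
/** Hypothetical class used only to illustrate the paragraph tag in Javadoc. */
final class JavadocParagraphSketch {

    /**
     * Splits a query string into its "key=value" pairs.
     *
     * <p>
     * Without the tag above, this sentence would be rendered in the same
     * paragraph as the summary line.
     *
     * @param queryString raw query string, e.g. {@code a=1&b=2}
     * @return the individual pairs, in order
     */
    static String[] splitPairs(String queryString) {
        return queryString.split("&");
    }
}
```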

Successfully merging this pull request may close these issues.

VIVO-1646: Epic for tracking the implementation of ElasticSearch functionality
VIVO-1587: Elasticsearch integration with VIVO
4 participants