-
Notifications
You must be signed in to change notification settings - Fork 85
Elasticsearch 7.* and 8.* integration. OpenSearch integration. #469
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Elasticsearch 7.* and 8.* integration. OpenSearch integration. #469
Conversation
The following features should be tested:
|
@ivanmrsulja please create a VIVO PR with updated example.runtime.properties. Also, please move JSON configuration into vivo-es project. Add in the vivo-es project a Docker file, and update README file to explain how ES should be run. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ivanmrsulja basic VIVO search functionalities works for me. I didn't review the code. Instructions from the PR description about setup of the elasticsearch index might be replaced with a pointer to the vivo-es Readme file.
api/src/main/java/edu/cornell/mannlib/vitro/webapp/searchengine/base/SearchEngineUtil.java
Show resolved
Hide resolved
...ain/java/edu/cornell/mannlib/vitro/webapp/searchengine/elasticsearch/CustomQueryBuilder.java
Show resolved
Hide resolved
It seems the mapping in the PR description is not the same as the mapping in https://github.com/ivanmrsulja/vivo-es/blob/main/index-config.json |
f9e3825
to
8a008c0
Compare
…entation and configuration mechanisms.
…-side service detection mechanism.
…s own utility class.
8a008c0
to
f95843d
Compare
@ivanmrsulja Could you have a look why results on pages with content type Browse search filter results is not shown? |
I haven't realized there is additional functionality besides facet search on global search page 😄 . My newest commit should address both of the problems you mentioned. Please test it when you have time and let me know if something is not working. |
Thanks a lot! |
Should be fixed now, please test it out when you have time 😄 |
I think on previous dev meeting we discussed null pointer exceptions in case of using brackets in search text input field. |
try { | ||
parser.parse(query.getQuery()); | ||
} catch (ParseException e) { | ||
treatAsLuceneQuery = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the Lucene query format used for all current search queries in VIVO?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately not, if you add a query like :something"(
that is regarded as an invalid Lucene query, Solr handles this implicitly by treating everything as full text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ivanmrsulja please check my comments.
api/pom.xml
Outdated
<dependency> | ||
<groupId>org.apache.lucene</groupId> | ||
<artifactId>lucene-core</artifactId> | ||
<version>9.9.2</version> | ||
</dependency> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please double check do we need this if we are using only queryparser?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we need it. The lucene-queryparser module depends on core classes from lucene-core
, such as:
- org.apache.lucene.analysis.Analyzer
- org.apache.lucene.search.Query
- org.apache.lucene.util.*
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the dependencies to the latest 9.x.x
version as version >=10
needs newer Java.
@@ -0,0 +1,341 @@ | |||
package edu.cornell.mannlib.vitro.webapp.searchengine.elasticsearch; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class (and linked classes CustomQueryBuilder and SearchType) make this PR a little bit more complex. If I understood well using query_string in ElasticSearch is possible, meaning to pass query to ElasticSearch and to wait for a response, but it is less powerful than Ivan's approach and it is not a good practice especially for search boxes (check the first warning box at this link, and also this discussion).
I think regarding some advance features of query syntax of ElasticSearch query language over Lucene query_string, we might conclude it is not crucial for us at the moment, because we are expecting end-users wouldn't use advance query syntax elements in the search box, they will use only content they are looking for (meaning we need keywords searching).
However, the question is whether it is a good practice to use query_string via search box due to the following limitations:
- While versatile, the query is strict and returns an error if the query string includes any invalid syntax (due to some brackets, columns, and other elements which might be present in title of work someone is looking for) - source. This maybe might be fixed by using
lenient
flag. - A malicious bot could inject special Lucene syntax like wildcards, range queries, or even malformed expressions, leading to performance degradation (e.g., complex regex/wildcard queries), etc.
Can we discuss here advantages and disadvantages of query_string and ElasticSearch DSL query?
What about simple_query_string? It might be more safe for us.
Moreover, I am wondering whether the listed issues above are also present when we are using Solr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed comment! You're absolutely right to bring up the trade-offs between query_string, simple_query_string, and our current custom query parser based approach. Let me add some thoughts on why I’ve stuck with the custom builder and why switching to query_string
(or even simple_query_string
) might not be the best path forward in our case:
-
As Georgy mentioned, users often submit queries like
Deep Learning: A Survey (2023)
, withquery_string
, these inputs must be perfectly formed Lucene syntax which users won’t know. Even a missing quote or special character can cause parsing failures or incorrect behavior. While thelenient
flag can help, it can also mask deeper issues and result in confusing results, often returning 0 results because of parse failure or malformed query (e.g.:something"(
). -
Using
query_string
directly opens the door to Lucene syntax injection. Malicious users or bots could send wildcard-heavy, deeply nested, or regex-based queries that can degrade search performance or cause errors. Our current approach allows us to filter, escape, or block these patterns early before hitting Elasticsearch. -
Elasticsearch’s query DSL gives us the ability to define clear search logic with must, should, boost, and filter. We can control how fields like
title
andauthor
contribute to scoring, or apply different analyzers.query_string
flattens this control, and it becomes harder to evolve the search experience in the future. -
Since we must support both structured and free-text queries in the same engine endpoint (I have no known way to differentiate where the query came from, and not all structured parts are in filters), a naive switch to
query_string
wouldn’t improve our situation. It would require extra parsing or escaping on our side anyway. Our current query builder handles both gracefully and consistently, keeping our logic centralized and testable.simple_query_string
also doesn’t support more advanced query structures or field-level boosting (which we might want to add in the future). It's a safer subset ofquery_string
, but I think it will not be flexible enough even for our current feature set.
@@ -140,25 +197,25 @@ public QueryStringMap(String queryString) { | |||
|
|||
/** | |||
* This is a kluge, but perhaps it will work for now. | |||
* | |||
* <p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove those p tags if they are not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this pull request do?
Updates current ES 6.x integration to 8.x.
What's new?
Changes in ResponseParser and ES documentation on the first draft.
Example:
src/main/java/edu/cornell/mannlib/vitro/webapp/searchengine/elasticsearch/ResponseParser.java
to be in line with current ES APIsrc/main/java/edu/cornell/mannlib/vitro/webapp/searchengine/elasticsearch/Elasticsearch_notes_on_the_first_draft.md
with new mappingHow should this be tested?
Initial setup
{vitro_home}/config/applicationSetup.n3
to use this driver (see below).vitro.local.searchengine.url
configuration property to contain ES index base URL (due to backward compatibility, Solr can also be configured usingvitro.local.solr.url
. This will however result in a warning that is shown in logs, advising the client to switch to a new configuration parameter)vitro.local.searchengine.username
configuration property to contain ES/OS basic auth usernamevitro.local.searchengine.password
configuration property to contain to contain ES/OS basic auth passwordA mapping for the search index
Modify
applicationSetup.n3
Your setup should be completed now 😃 ! After this, you should perform common manual tests that are done for every new release.
Interested parties
@chenejac
Reviewers' expertise
Candidates for reviewing this PR should have some of the following expertises: