Fix search substrings #14574
Please also come up with a test case covering mixed search strings, e.g. author=~*dorf AND Düssel. See https://docs.jabref.org/finding-sorting-and-cleaning-entries/search for a full syntax description.
Implemented 'searchWithMixedQueryMatchesContentAndMetaData' to verify simultaneous matching of Author regex and PDF content. Updated test data to match 'thesis-example.pdf' and corrected the regex pattern from '~*dorf' to '~.*dorf' to comply with Java Pattern syntax.
@koppor Hi! I've pushed the new test case for the mixed search query (author metadata plus full text in the linked file). I made one small adjustment: the original suggestion was 'author=~*dorf', but * is a quantifier that requires a preceding element, so I changed it to 'author=~.*dorf' (where . matches any character). I also updated the test data to use the actual author name ('Author Name') from thesis-example.pdf. The test passes and confirms that the searcher filters by both metadata and file content.
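The point about the leading quantifier can be checked in isolation with `java.util.regex`. The following is a minimal sketch, not JabRef code: the class and method names are made up for illustration, and it uses `find()` only to show that the corrected pattern is valid and matches; it makes no claim about how JabRef itself applies the regex.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; not part of JabRef.
public class LeadingQuantifierDemo {

    /** Returns true if the string compiles as a java.util.regex pattern. */
    public static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // "*dorf" ends up here: '*' has nothing to repeat.
            return false;
        }
    }

    /** Returns true if the pattern is found anywhere in the input. */
    public static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(compiles("*dorf"));              // false
        System.out.println(compiles(".*dorf"));             // true
        System.out.println(found(".*dorf", "Düsseldorf"));  // true
    }
}
```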
Please be aware of our AI usage guidelines. You may be blocked if you keep using AI to communicate.
String query = SearchQueryConversion.searchToLucene(searchQuery);
if (!query.contains("*") && !query.contains("?")) {
    query = "*" + query + "*";
}
This way makes every query a substring match, even when the query is intended to be an exact match (e.g., content == "dorf").
This logic should be handled during query parsing, where you have access to the search flags, each field, and can decide when to add the surrounding asterisks. Please take a look at the SearchQueryConversion and SearchToLuceneVisitor classes.
It may be enough to test only that the search query is parsed and converted correctly, with asterisks added where needed. You can add test cases to SearchQueryLuceneConversionTest.
Also, this will make the default Lucene search behave as a substring search, which will cause Lucene to scan the entire index for each search. It's better to keep the default as Lucene's standard lexical search to benefit from features like fuzzy matching, tokenization, and performance optimizations.
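To illustrate why a leading wildcard is costly, here is a toy model of a sorted term dictionary. It is only a sketch under simplifying assumptions: it does not reflect Lucene's actual data structures, and the class name, term list, and `visited` counter are invented for the example. A prefix lookup can seek into the sorted terms and stop early, while a substring lookup has to inspect every term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy model only; Lucene's real term dictionary works differently.
public class TermScanDemo {

    // Sorted set of indexed terms (hypothetical sample data).
    static final TreeSet<String> TERMS = new TreeSet<>(
            List.of("berlin", "dorfkirche", "dortmund", "düsseldorf", "hamburg"));

    // Number of terms inspected by the most recent lookup.
    public static int visited;

    /** Prefix lookup: seek to the prefix, stop at the first non-matching term. */
    public static List<String> prefix(String p) {
        visited = 0;
        List<String> out = new ArrayList<>();
        for (String term : TERMS.tailSet(p, true)) {
            visited++;
            if (!term.startsWith(p)) {
                break; // sorted order: no later term can match
            }
            out.add(term);
        }
        return out;
    }

    /** Substring lookup: no ordering to exploit, so every term is inspected. */
    public static List<String> substring(String s) {
        visited = 0;
        List<String> out = new ArrayList<>();
        for (String term : TERMS) {
            visited++;
            if (term.contains(s)) {
                out.add(term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(prefix("dorf") + " visited=" + visited);
        System.out.println(substring("dorf") + " visited=" + visited);
    }
}
```

In this toy data set the prefix lookup touches two terms, while the substring lookup touches all five; the gap grows with the size of the index.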
@LoayGhreeb The main motivation for this issue was finding substrings in linked files. But yes, applying wildcards to every query would be highly inefficient. So should I add the asterisks only for the 'content' field, i.e. only when searching inside linked files?
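A field-aware variant of the wrapping could look roughly like the sketch below. Everything here is hypothetical: the `Operator` enum, the `toLuceneTerm` method, and the class name are invented for illustration and are not JabRef's API. In the real code base the decision would presumably live in the query conversion (SearchQueryConversion / SearchToLuceneVisitor), as suggested in the review, and would also need to handle escaping, which this sketch ignores.

```java
// Hypothetical sketch; none of these names exist in JabRef.
public class FieldAwareWildcardSketch {

    /** Assumed operator kinds: "contains" (=) versus exact match (==). */
    public enum Operator { CONTAINS, EXACT }

    /**
     * Builds a Lucene-style term. Asterisks are added only for a
     * "contains" search on the 'content' field, and only if the user
     * has not already supplied a wildcard.
     */
    public static String toLuceneTerm(String field, Operator op, String term) {
        boolean isContentField = "content".equals(field);
        boolean hasWildcard = term.contains("*") || term.contains("?");
        if (isContentField && op == Operator.CONTAINS && !hasWildcard) {
            return field + ":*" + term + "*";
        }
        // Exact matches and all other fields are left untouched.
        return field + ":" + term;
    }

    public static void main(String[] args) {
        System.out.println(toLuceneTerm("content", Operator.CONTAINS, "dorf")); // content:*dorf*
        System.out.println(toLuceneTerm("content", Operator.EXACT, "dorf"));    // content:dorf
        System.out.println(toLuceneTerm("author", Operator.CONTAINS, "dorf"));  // author:dorf
    }
}
```

This keeps exact queries such as content == "dorf" exact and leaves the default search on other fields as a standard lexical search.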
Closes #14569
Description
This PR addresses the issue where searching for partial words (e.g., "dorf") in linked file content failed to find documents containing the full word (e.g., "Düsseldorf").
I modified LinkedFilesSearcher.java to:
- Enable Leading Wildcards: Updated the Lucene query parser configuration to allow leading wildcards (parser.setAllowLeadingWildcard(true)).
- Automate Wildcard Wrapping: Modified the query logic to automatically wrap search terms in asterisks (e.g., transforming query to *query*) if the user hasn't explicitly provided wildcards. This ensures that terms like "dorf" effectively search for *dorf*, enabling substring matches.


Steps to test
Note: Make sure the file link in the BibTeX source includes the .pdf extension (e.g., file = {:Dusseldorf.pdf:PDF}) so that the indexer reads the content correctly.
Mandatory checks
I described the change in CHANGELOG.md in a way that is understandable for the average user (if change is visible to the user)