Fix search substrings #14574

mayank1008-tech · 2025-12-11T20:23:39Z

Description
This PR addresses the issue where searching for partial words (e.g., "dorf") in linked file content failed to find documents containing the full word (e.g., "Düsseldorf").

I modified LinkedFilesSearcher.java to:

Enable Leading Wildcards: Updated the Lucene query parser configuration to allow leading wildcards (parser.setAllowLeadingWildcard(true)).

Automate Wildcard Wrapping: Modified the query logic to automatically wrap search terms in asterisks (e.g., transforming query to *query*) if the user hasn't explicitly provided wildcards. This ensures that terms like "dorf" effectively search for *dorf*, enabling substring matches.
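As a rough illustration of the wrapping rule described above, here is a minimal standalone sketch. The class and method names are hypothetical and are not part of JabRef; only the rule itself (wrap in asterisks unless the query already contains `*` or `?`) comes from the description.

```java
public final class WildcardWrapSketch {
    // Hypothetical helper: wraps a query in asterisks unless the user
    // already supplied a Lucene wildcard character (* or ?).
    public static String wrapInWildcards(String query) {
        if (query.contains("*") || query.contains("?")) {
            return query; // respect explicit wildcards
        }
        return "*" + query + "*";
    }

    public static void main(String[] args) {
        System.out.println(wrapInWildcards("dorf"));       // prints *dorf*
        System.out.println(wrapInWildcards("D?sseldorf")); // prints D?sseldorf
    }
}
```

Because the rule operates on the whole query string, a multi-word query such as `foo bar` would become `*foo bar*`, with asterisks only at the outer ends.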

Steps to test

Have a PDF file Dusseldorf.pdf containing unique text (e.g., the word "Düsseldorf").
Create a BibTeX entry and link this PDF to it.
Note: Make sure the file link includes the .pdf extension in the BibTeX source (e.g., file = {:Dusseldorf.pdf:PDF}) so that the indexer reads the content correctly.
In the search bar, type a substring of the word (e.g., eldorf).
Verify that the entry containing "Düsseldorf" appears in the search results.

Mandatory checks

I own the copyright of the code submitted and I license it under the MIT license
I manually tested my changes in running JabRef (always required)
I added JUnit tests for changes (if applicable)
I added screenshots in the PR description (if change is visible to the user)
I described the change in CHANGELOG.md in a way that is understandable for the average user (if change is visible to the user)
I checked the user documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request updating file(s) in https://github.com/JabRef/user-documentation/tree/main/en.

koppor · 2025-12-12T06:43:24Z

Please also come up with a test case covering mixed search strings.

author=~*dorf AND Düssel

See https://docs.jabref.org/finding-sorting-and-cleaning-entries/search for a full syntax description

Implemented 'searchWithMixedQueryMatchesContentAndMetaData' to verify simultaneous matching of Author regex and PDF content. Updated test data to match 'thesis-example.pdf' and corrected the regex pattern from '~*dorf' to '~.*dorf' to comply with Java Pattern syntax.

mayank1008-tech · 2025-12-12T10:37:14Z

@koppor Hi! I've pushed the new test case for the mixed search query (Metadata(author) + Fulltext in the file).

I made a small adjustment: the original suggestion was 'author=~*dorf', but * is a quantifier that requires a preceding character, so I updated it to 'author=~.*dorf' (where . matches any character).

I also updated the test data to use the actual author name ('Author Name') from thesis-example.pdf. The test is passing and confirms that the searcher correctly filters by both metadata and file content.
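For readers unfamiliar with the quantifier issue, the following small sketch shows how java.util.regex.Pattern treats the two patterns. The class and the `compiles` helper are hypothetical and only illustrate Java's regex behaviour; they say nothing about how JabRef applies the pattern internally.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public final class RegexQuantifierSketch {
    // Hypothetical helper: reports whether a string is a valid Java regex.
    public static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false; // e.g. a dangling '*' with nothing to repeat
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("*dorf"));  // prints false
        System.out.println(compiles(".*dorf")); // prints true
        // ".*dorf" matches any string that ends in "dorf".
        System.out.println(Pattern.compile(".*dorf").matcher("Düsseldorf").matches()); // prints true
    }
}
```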

subhramit · 2025-12-12T10:47:38Z

> @koppor Hi! I've pushed the new test case for the mixed search query (Metadata(author) + Fulltext in the file).
>
> I made a small adjustment to the regex syntax in the query. The original suggestion was 'author=*dorf', but since JabRef uses Java's regex engine, * is a quantifier that requires a preceding character. I updated it to 'author=.*dorf' (where . matches any character).
>
> I also updated the test data to use the actual author name ('Author Name') from thesis-example.pdf to make the test scenario more realistic. The test is passing and confirms that the searcher correctly filters by both metadata and file content.

Please be aware of our AI usage guidelines. You may be blocked if you keep using AI to communicate.
Ref. https://github.com/JabRef/jabref/blob/main/AI_USAGE_POLICY.md, https://github.com/JabRef/jabref/blob/main/CONTRIBUTING.md#ai-usage-policy

LoayGhreeb · 2025-12-12T12:28:43Z

jablib/src/main/java/org/jabref/logic/search/retrieval/LinkedFilesSearcher.java

        String query = SearchQueryConversion.searchToLucene(searchQuery);
+        if (!query.contains("*") && !query.contains("?")) {
+            query = "*" + query + "*";
+        }


This approach turns every query into a substring match, even when the query is intended to be an exact match (e.g., content == "dorf").

This logic should be handled during query parsing, where you have access to the search flags, each field, and can decide when to add the surrounding asterisks. Please take a look at the SearchQueryConversion and SearchToLuceneVisitor classes.
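One possible shape of such a field-aware decision is sketched below. Everything in it is an assumption made for illustration: the `Operator` enum, the `toLuceneTerm` method, and the choice to wrap only the `content` field are hypothetical and do not reflect the actual SearchToLuceneVisitor API.

```java
public final class FieldAwareWildcardSketch {
    // Hypothetical operators: CONTAINS stands for "=", EXACT for "==".
    public enum Operator { CONTAINS, EXACT }

    // Hypothetical conversion of a single parsed term to a Lucene fielded term.
    public static String toLuceneTerm(String field, Operator operator, String value) {
        boolean hasWildcard = value.contains("*") || value.contains("?");
        boolean isContentField = "content".equals(field);
        // Wrap only "contains" terms on the content field without explicit wildcards.
        if (operator == Operator.CONTAINS && isContentField && !hasWildcard) {
            value = "*" + value + "*";
        }
        return field + ":" + value;
    }

    public static void main(String[] args) {
        System.out.println(toLuceneTerm("content", Operator.CONTAINS, "dorf")); // prints content:*dorf*
        System.out.println(toLuceneTerm("content", Operator.EXACT, "dorf"));    // prints content:dorf
    }
}
```

Deciding per term keeps an exact-match query such as content == "dorf" untouched, which the string-level wrapping in the diff above cannot do.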

LoayGhreeb · 2025-12-12T14:00:21Z

jablib/src/test/java/org/jabref/logic/search/retrieval/LinkedFilesSearcherTest.java

You might only test that the search query is being parsed and converted correctly to include asterisks when needed. You can add test cases to SearchQueryLuceneConversionTest.

LoayGhreeb · 2025-12-12T14:17:36Z

Also, this will make the default Lucene search behave as a substring search, which will cause Lucene to scan the entire index for each search. It's better to keep the default as Lucene's standard lexical search to benefit from features like fuzzy matching, tokenization, and performance optimizations.
For substring searches on linked files, queries like content = *query* can be used.
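The cost difference can be pictured with a toy term dictionary. The sketch below is a deliberately simplified model and not how Lucene is implemented; all names are hypothetical. It only shows why an exact term can be looked up directly while a substring with a leading wildcard forces a pass over every indexed term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public final class TermLookupSketch {
    // Toy term dictionary: term -> ids of the documents containing it.
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    public void add(int docId, String term) {
        postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
    }

    // Exact term: a single dictionary lookup.
    public List<Integer> exact(String term) {
        return postings.getOrDefault(term, List.of());
    }

    // Substring: every term in the dictionary has to be inspected.
    public List<Integer> substring(String part) {
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            if (entry.getKey().contains(part)) {
                hits.addAll(entry.getValue());
            }
        }
        return hits;
    }
}
```

With the terms "düsseldorf" (document 1) and "berlin" (document 2) indexed, `exact("dorf")` finds nothing, while `substring("dorf")` finds document 1 after scanning both terms.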

mayank1008-tech · 2025-12-12T16:41:23Z

> Also, this will make the default Lucene search behave as a substring search, which will cause Lucene to scan the entire index for each search. It's better to keep the default as Lucene's standard lexical search to benefit from features like fuzzy matching, tokenization, and performance optimizations. For substring searches on linked files, queries like content = *query* can be used.

@LoayGhreeb The main motivation for this issue was to find substrings in the linked files. But yes, applying wildcards to every incoming query would be highly inefficient. So should I apply the asterisks only to the 'content' field, which is searched in the linked files only?

mayank1008-tech added 2 commits December 12, 2025 01:46

Fix: Enable substring search (wildcards) for linked file content

9b50084

Update CHANGELOG for substring search fix

4a190a2

koppor requested a review from LoayGhreeb December 12, 2025 06:37

koppor mentioned this pull request Dec 12, 2025

Fix Full text search does not find substring. #14573

Closed

5 tasks

github-actions bot added the status: changes-required Pull requests that are not yet complete label Dec 12, 2025

Fix checkstyle: Correct static variable declaration order

34ddffb

github-actions bot removed the status: changes-required Pull requests that are not yet complete label Dec 12, 2025

calixtus added the component: search label Dec 12, 2025

LoayGhreeb requested changes Dec 12, 2025

github-actions bot added the status: changes-required Pull requests that are not yet complete label Dec 12, 2025


5 participants
