@mayank1008-tech mayank1008-tech commented Dec 11, 2025

Closes #14569

Description
This PR addresses the issue where searching for partial words (e.g., "dorf") in linked file content failed to find documents containing the full word (e.g., "Düsseldorf").

I modified LinkedFilesSearcher.java to:

Enable Leading Wildcards: Updated the Lucene query parser configuration to allow leading wildcards (parser.setAllowLeadingWildcard(true)).

Automate Wildcard Wrapping: Modified the query logic to automatically wrap search terms in asterisks (e.g., transforming query to *query*) if the user hasn't explicitly provided wildcards. This ensures that terms like "dorf" effectively search for *dorf*, enabling substring matches.
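
To illustrate why the wrapping matters, here is a minimal sketch in plain Java. It is not the PR's code and not Lucene's implementation. It assumes Lucene's wildcard semantics (* matches any run of characters, ? matches exactly one) and that a term query has to match a whole indexed token. The names WildcardSketch, toRegex, matchesTerm, and wrap are hypothetical.

```java
import java.util.regex.Pattern;

// Illustrative model only: emulates wildcard matching against a single indexed token.
final class WildcardSketch {
    private WildcardSketch() {
    }

    // Translates a wildcard pattern into a regex: '*' becomes ".*", '?' becomes ".",
    // and every other run of characters is quoted so it is matched literally.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);
            if (c == '*' || c == '?') {
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString(),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.DOTALL);
    }

    // True if the wildcard pattern matches the whole token, as a term-level query would require.
    static boolean matchesTerm(String wildcard, String token) {
        return toRegex(wildcard).matcher(token).matches();
    }

    // Mirrors the wrapping described above: leave explicit wildcards alone, otherwise add '*' on both sides.
    static String wrap(String query) {
        if (query.contains("*") || query.contains("?")) {
            return query;
        }
        return "*" + query + "*";
    }
}
```

In this model, the bare term dorf does not match the token Düsseldorf, while wrap("dorf") yields *dorf*, which does. A query that already contains a wildcard, such as dor*, is passed through unchanged.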

Steps to test

  1. Have a PDF file Dusseldorf.pdf containing unique text (e.g., the word "Düsseldorf").
  2. Create a BibTeX entry and link this PDF to it.
    Note: Make sure the file link includes the .pdf extension in the BibTeX source (e.g., file = {:Dusseldorf.pdf:PDF}) so that the indexer reads the content correctly.
  3. In the search bar, type a substring of the word (e.g., eldorf).
  4. Verify that the entry containing "Düsseldorf" appears in the search results.

Mandatory checks

@koppor koppor requested a review from LoayGhreeb December 12, 2025 06:37
koppor commented Dec 12, 2025

Please also come up with a test case covering mixed search strings.

author=~*dorf AND Düssel

See https://docs.jabref.org/finding-sorting-and-cleaning-entries/search for a full syntax description

Implemented 'searchWithMixedQueryMatchesContentAndMetaData' to verify simultaneous matching of Author regex and PDF content. Updated test data to match 'thesis-example.pdf' and corrected the regex pattern from '~*dorf' to '~.*dorf' to comply with Java Pattern syntax.
mayank1008-tech commented Dec 12, 2025

@koppor Hi! I've pushed the new test case for the mixed search query (metadata (author) + full text in the linked file).

I made a small adjustment: the original suggestion was 'author=~*dorf', but * is a quantifier that requires a preceding character. I updated it to 'author=~.*dorf' (where . matches any character).

I also updated the test data to use the actual author name ('Author Name') from thesis-example.pdf. The test is passing and confirms that the searcher correctly filters by both metadata and file content.
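
The regex adjustment described above can be checked with a small sketch using only java.util.regex. The class name RegexSyntaxSketch and its helper methods are hypothetical, and whether JabRef's =~ operator requires a full match or only a partial one is not shown in this thread, so the sketch uses a full match as an assumption.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Illustrative only: shows that a leading '*' is not a valid Java regex, while ".*" is.
final class RegexSyntaxSketch {
    private RegexSyntaxSketch() {
    }

    // Returns true if java.util.regex accepts the expression.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // A quantifier with nothing to repeat ends up here ("dangling meta character").
            return false;
        }
    }

    // Full-match check (assumption: the search operator may instead use a partial match).
    static boolean matchesWhole(String regex, String value) {
        return Pattern.compile(regex).matcher(value).matches();
    }
}
```

Here isValidRegex("*dorf") is false because the quantifier has nothing to repeat, whereas ".*dorf" compiles and matches any value ending in "dorf".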

@github-actions github-actions bot added the status: changes-required Pull requests that are not yet complete label Dec 12, 2025
@subhramit
Member

@koppor Hi! I've pushed the new test case for the mixed search query (Metadata(author) + Fulltext in the file).

I made a small adjustment to the regex syntax in the query. The original suggestion was 'author=*dorf', but since JabRef uses Java's regex engine, * is a quantifier that requires a preceding character. I updated it to 'author=.*dorf' (where . matches any character).

I also updated the test data to use the actual author name ('Author Name') from thesis-example.pdf to make the test scenario more realistic. The test is passing and confirms that the searcher correctly filters by both metadata and file content.

Please be aware of our AI usage guidelines. You may be blocked if you keep using AI to communicate.
Ref. https://github.com/JabRef/jabref/blob/main/AI_USAGE_POLICY.md, https://github.com/JabRef/jabref/blob/main/CONTRIBUTING.md#ai-usage-policy

@github-actions github-actions bot removed the status: changes-required Pull requests that are not yet complete label Dec 12, 2025
String query = SearchQueryConversion.searchToLucene(searchQuery);
if (!query.contains("*") && !query.contains("?")) {
    query = "*" + query + "*";
}
Member

This approach turns every query into a substring match, even when the query is intended to be an exact match (e.g., content == "dorf").

This logic should be handled during query parsing, where you have access to the search flags, each field, and can decide when to add the surrounding asterisks. Please take a look at the SearchQueryConversion and SearchToLuceneVisitor classes.
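
A minimal sketch of what a field-aware decision could look like, assuming a simplified model in which each query node carries a field name, an operator, and a term. The class FieldAwareWildcardSketch, the Operator enum, and the method toLuceneTerm are hypothetical and do not reflect the actual API of SearchQueryConversion or SearchToLuceneVisitor. The only facts taken from this thread are the field name content and the distinction between a "contains" query and an exact match.

```java
// Hypothetical model of per-field wildcard wrapping; not JabRef's real visitor.
final class FieldAwareWildcardSketch {
    // Assumed operators: CONTAINS stands for "content = dorf", EXACT for "content == dorf".
    enum Operator { CONTAINS, EXACT }

    private FieldAwareWildcardSketch() {
    }

    // Builds a Lucene-style "field:term" clause and adds surrounding asterisks only
    // for non-exact queries on the linked-file content field that have no wildcard yet.
    static String toLuceneTerm(String field, Operator operator, String term) {
        boolean isLinkedFileContent = "content".equals(field);
        boolean alreadyHasWildcard = term.contains("*") || term.contains("?");
        if (isLinkedFileContent && operator == Operator.CONTAINS && !alreadyHasWildcard) {
            return field + ":*" + term + "*";
        }
        return field + ":" + term;
    }
}
```

With this shape, an exact-match query keeps its term untouched, and other fields are not forced into substring matching.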

Member

It may be enough to test only that the search query is parsed and converted correctly, including the asterisks where they are needed. You can add test cases to SearchQueryLuceneConversionTest.

@github-actions github-actions bot added the status: changes-required Pull requests that are not yet complete label Dec 12, 2025
@LoayGhreeb
Member

Also, this will make the default Lucene search behave as a substring search, which will cause Lucene to scan the entire index for each search. It's better to keep the default as Lucene's standard lexical search to benefit from features like fuzzy matching, tokenization, and performance optimizations.
For substring searches on linked files, queries like content = *query* can be used.
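
The performance concern can be pictured with a deliberately simplified model: a sorted set of terms standing in for an index's term dictionary. This is an assumption for illustration only and not how Lucene stores or scans terms internally. The class TermDictionarySketch and its members are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Simplified stand-in for a sorted term dictionary; counts how many terms each search inspects.
final class TermDictionarySketch {
    private final NavigableSet<String> terms = new TreeSet<>();
    int termsExamined;

    TermDictionarySketch(List<String> indexedTerms) {
        terms.addAll(indexedTerms);
    }

    // A prefix search can jump to the first candidate and stop at the first non-matching term.
    List<String> prefixSearch(String prefix) {
        termsExamined = 0;
        List<String> hits = new ArrayList<>();
        for (String term : terms.tailSet(prefix, true)) {
            termsExamined++;
            if (!term.startsWith(prefix)) {
                break;
            }
            hits.add(term);
        }
        return hits;
    }

    // A substring search (leading wildcard) has no starting point, so every term is inspected.
    List<String> substringSearch(String fragment) {
        termsExamined = 0;
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            termsExamined++;
            if (term.contains(fragment)) {
                hits.add(term);
            }
        }
        return hits;
    }
}
```

In this model the prefix lookup touches only the terms near the prefix, while the substring lookup walks the whole dictionary, which is the cost described above for making substring matching the default.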

@mayank1008-tech
Contributor Author

Also, this will make the default Lucene search behave as a substring search, which will cause Lucene to scan the entire index for each search. It's better to keep the default as Lucene's standard lexical search to benefit from features like fuzzy matching, tokenization, and performance optimizations. For substring searches on linked files, queries like content = *query* can be used.

@LoayGhreeb The main goal of this issue was to find substrings in linked files. But yes, applying wildcards to every query would be highly inefficient. Should I apply the asterisks only to the 'content' field, so that substring matching is limited to searches in linked files?

Labels

component: search status: changes-required Pull requests that are not yet complete

Development

Successfully merging this pull request may close these issues.

Full text search does not find substring.

5 participants