Implementation for regex_instr #15928

Open
wants to merge 20 commits into base: main

Conversation

nirnayroy

Which issue does this PR close?

Rationale for this change

Implements regexp_instr, a standard SQL regex function, in DataFusion.

What changes are included in this PR?

Implementation, tests, benches and docs for the regexp_instr function

Are these changes tested?

Yes

Are there any user-facing changes?

Yes

No

github-actions bot added the documentation, sqllogictest, and functions labels May 2, 2025
@2010YOUY01
Contributor

Thank you. I'm wondering what's the reference system for this function's behavior (like postgres or others)

@nirnayroy
Author

> Thank you. I'm wondering what's the reference system for this function's behavior (like postgres or others)

The reference system for this function's behaviour is postgres.

@alamb
Contributor

alamb commented May 7, 2025

Thank you for this PR @nirnayroy

Can you please resolve the CI error: https://github.com/apache/datafusion/actions/runs/14820525339/job/41754009017?pr=15928

If you encounter an error, run './dev/update_function_docs.sh' and commit

@nirnayroy
Author

> Thank you for this PR @nirnayroy
>
> Can you please resolve the CI error: https://github.com/apache/datafusion/actions/runs/14820525339/job/41754009017?pr=15928
>
> If you encounter an error, run './dev/update_function_docs.sh' and commit

I ran the bash script, but I’m not sure if the workflow succeeded.

@nirnayroy
Author

Fixed the clippy errors showing up in the workflow.

@nirnayroy
Author

Fixed the formatting error in the workflow.

@blaginin self-requested a review May 21, 2025 16:54
@alamb
Contributor

alamb commented May 21, 2025

@Omega359 I wonder if you might have time to review this PR?

@Omega359
Contributor

Of course @alamb, not sure how I missed this one. It may be a day or two though

Contributor

@blaginin left a comment

Thank you for the PR!

I've left some comments - a few are just nits, so feel free to dismiss them or raise separate issues on top of this PR.

My main concern right now is that I’d expect this function to be a superset of strpos, but it currently behaves differently in some edge cases. For example:

SELECT regexp_instr('😀abcdef', 'abc');
SELECT strpos('😀abcdef', 'abc');

SELECT strpos(NULL, 'abc');
SELECT regexp_instr(NULL, 'abc');

SELECT strpos('abc', NULL);
SELECT regexp_instr('abc', NULL);

Do you think we should unify them? 🙏
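
To illustrate one way the first pair of queries could diverge: the sketch below assumes (this is not confirmed anywhere in this thread) that the mismatch comes from reporting a byte offset instead of a character position. Both helper names are hypothetical, and the code uses only the Rust standard library, not DataFusion or the regex crate.

```rust
/// Hypothetical helper: 1-based position of `needle` counted in characters,
/// or 0 when `needle` does not occur in `haystack`.
fn char_position(haystack: &str, needle: &str) -> i64 {
    match haystack.find(needle) {
        // `find` yields a byte offset; count the characters before it.
        Some(byte_idx) => haystack[..byte_idx].chars().count() as i64 + 1,
        None => 0,
    }
}

/// Hypothetical helper: 1-based position of `needle` counted in bytes,
/// or 0 when `needle` does not occur in `haystack`.
fn byte_position(haystack: &str, needle: &str) -> i64 {
    haystack.find(needle).map_or(0, |byte_idx| byte_idx as i64 + 1)
}
```

For the input `'😀abcdef'`, `char_position` gives 2 while `byte_position` gives 5, because the emoji occupies four bytes in UTF-8. Whichever convention is chosen, both functions would need to use the same one to agree on multi-byte input.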

@nirnayroy
Author

Hi @blaginin, thanks for the review, and sorry for the delay in replying. I think I have addressed most of the concerns raised. Please have another look.

@blaginin self-requested a review June 17, 2025 20:25
@blaginin requested a review from Copilot June 17, 2025 20:35
Contributor

@Copilot AI left a comment

Pull Request Overview

This PR adds the regexp_instr SQL function, including documentation, SQL logic tests, benchmarking, and integration into the function registry.

  • Introduce regexp_instr in docs and user guide
  • Add regexp_instr SQL tests and benches
  • Expose regex compilation helpers and register the new UDF

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| docs/source/user-guide/sql/scalar_functions.md | Add docs for regexp_instr, including signature and examples |
| datafusion/sqllogictest/test_files/regexp/regexp_instr.slt | Add SQL logic tests covering regexp_instr behavior |
| datafusion/functions/src/regex/regexpcount.rs | Make compile_and_cache_regex and compile_regex public |
| datafusion/functions/src/regex/mod.rs | Register regexp_instr UDF and define its expression builder |
| datafusion/functions/benches/regx.rs | Add benchmark cases for regexp_instr |

Comments suppressed due to low confidence (4)

docs/source/user-guide/sql/scalar_functions.md:1844

  • The start argument description is duplicated (- **start**: - **start**:). Remove the extra label for clarity.
- **start**: - **start**: Optional start position (the first position is 1) to search for the regular expression. Can be a constant, column, or function. Defaults to 1

docs/source/user-guide/sql/scalar_functions.md:1837

  • The function signature is missing the subexpr (and return_option) parameters implemented in code. Update the signature to match the actual argument order.
regexp_instr(str, regexp[, start[, N[, flags]]])

datafusion/functions/src/regex/mod.rs:71

  • [nitpick] The parameter name endoption is unclear. Consider renaming it to return_option or another descriptive name consistent with SQL terminology.
        endoption: Option<Expr>,

datafusion/sqllogictest/test_files/regexp/regexp_instr.slt:1

  • There are no tests covering the subexpr argument. Add test cases that use the subexpr parameter to validate capture-group position behavior.
# Licensed to the Apache Software Foundation (ASF) under one

Contributor

@blaginin left a comment

Thanks for your work, @nirnayroy!

The PR is much better now, but I think there's still room for improvement:

  • Tests are failing. If that helps, you can run the CI command locally to debug (the command was attached as a screenshot).
  • The code around handling literals and arrays feels a bit complex. I think arrow/df does this in a simpler and more concise way – I’ve attached some links for inspiration :)

I haven’t reviewed the tests yet since I assume things might change as you iterate. But I can see you're covering a lot of cases, which is great ☘️

@nirnayroy
Author

Hi @blaginin, thanks for the help and the suggestions for improvement.
I have addressed the requested changes. Please have another look.

> Tests are failing. If that helps, you can run the CI command locally to debug:

I have tried running it, and the tests pass locally.

args.push(subexpr);
};
super::regexp_instr().call(args)
}
/// Returns true if a has at least one match in a string, false otherwise.
Contributor

Suggested change
- /// Returns true if a has at least one match in a string, false otherwise.
+ /// Returns true if a regex has at least one match in a string, false otherwise.

Some(flags) => {
if flags.contains("g") {
return Err(ArrowError::ComputeError(
"regexp_count() does not support global flag".to_string(),
Contributor

Suggested change
- "regexp_count() does not support global flag".to_string(),
+ "regexp_count()/regexp_instr() does not support the global flag".to_string(),
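
For context, a minimal standalone sketch of this kind of flag check. It assumes a plain `String` error in place of `ArrowError`, and the helper name `reject_global_flag` is hypothetical; the actual code in the PR may be structured differently.

```rust
/// Hypothetical helper: rejects a flags string that contains the global
/// flag `g`, which the shared regex helpers do not support.
fn reject_global_flag(flags: Option<&str>) -> Result<(), String> {
    match flags {
        // Any occurrence of 'g' is treated as a request for the global flag.
        Some(f) if f.contains('g') => Err(
            "regexp_count()/regexp_instr() does not support the global flag".to_string(),
        ),
        // No flags, or flags without 'g', pass through unchanged.
        _ => Ok(()),
    }
}
```

Naming both functions in the message matters once the check is shared, since a user calling regexp_instr would otherwise see an error that mentions only regexp_count.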

Comment on lines 217 to 220
let (start_array, _is_start_scalar) = start_array.map_or((None, true), |start| {
let (start, is_start_scalar) = start.get();
(Some(start), is_start_scalar)
});
Contributor

Suggested change
- let (start_array, _is_start_scalar) = start_array.map_or((None, true), |start| {
-     let (start, is_start_scalar) = start.get();
-     (Some(start), is_start_scalar)
- });
+ let start_array = start_array.map_or(None, |start| {
+     let (start, _is_start_scalar) = start.get();
+     Some(start)
+ });

???
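
The suggested closure only rewraps its result in `Some`, which is the pattern `Option::map` expresses directly. A minimal sketch of that equivalence, where `ScalarOrArray` and `values_only` are hypothetical stand-ins and not DataFusion types:

```rust
/// Hypothetical stand-in for the wrapped argument; not a DataFusion type.
struct ScalarOrArray {
    values: Vec<i64>,
    is_scalar: bool,
}

impl ScalarOrArray {
    /// Returns the underlying values plus a flag saying whether the
    /// argument was a scalar, mirroring the `get()` call in the diff.
    fn get(&self) -> (&[i64], bool) {
        (&self.values, self.is_scalar)
    }
}

/// Keeps only the values and drops the scalar flag.
/// `opt.map(f)` behaves the same as `opt.map_or(None, |x| Some(f(x)))`.
fn values_only(start: Option<&ScalarOrArray>) -> Option<&[i64]> {
    start.map(|s| s.get().0)
}
```

Clippy usually flags `map_or(None, ...)` with a closure that returns `Some(...)` and proposes `map` or `and_then` instead, so this form may also keep the lint check quiet.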

Author

Addressed; removed the unused variables.

@Omega359
Contributor

Omega359 commented Jul 1, 2025

I'll pull this branch later this week and run the tests, but in general this PR is looking pretty good! I left a few comments and suggestions on things I found in a quick review.

Contributor

@blaginin left a comment

I think all my points are resolved, so approving. Thank you!
Will merge after the review round from @Omega359 ☺️

@Omega359
Contributor

Omega359 commented Jul 2, 2025

Run extended tests

@alamb
Contributor

alamb commented Jul 2, 2025

> Run extended tests

It works!

@Omega359
Contributor

Omega359 commented Jul 2, 2025

Clippy failures related to rand update (I think #16062)

Edit: it looks like the usages of rand in the benchmarks were updated in the above commit, and I think the additions in this PR do not reflect that change.

@Omega359
Contributor

Omega359 commented Jul 2, 2025

Run extended tests

Labels
documentation (Improvements or additions to documentation); functions (Changes to functions implementation); sqllogictest (SQL Logic Tests (.slt))
Development

Successfully merging this pull request may close these issues.

Add regexp function - regexp_instr()
5 participants