Implementation for regex_instr #15928
Thank you. I'm wondering what's the reference system for this function's behavior (like postgres or others)
The reference system for this function's behaviour is postgres.
Thank you for this PR @nirnayroy. Can you please resolve the CI error: https://github.com/apache/datafusion/actions/runs/14820525339/job/41754009017?pr=15928
I ran the bash script, but I’m not sure if the workflow succeeded.
fixed the clippy errors showing up in the workflow
fixed formatting error in workflow
@Omega359 I wonder if you might have time to review this PR?
Of course @alamb, not sure how I missed this one. It may be a day or two though
Thank you for the PR!
I've left some comments - a few are just nits, so feel free to dismiss them or raise separate issues on top of this PR.
My main concern right now is that I’d expect this function to be a superset of strpos, but it currently behaves differently in some edge cases. For example:
SELECT regexp_instr('😀abcdef', 'abc');
SELECT strpos('😀abcdef', 'abc');
SELECT strpos(NULL, 'abc');
SELECT regexp_instr(NULL, 'abc');
SELECT strpos('abc', NULL);
SELECT regexp_instr('abc', NULL);
Do you think we should unify them? 🙏
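One way to read the first pair of queries: a position can be counted in characters or in bytes, and '😀' occupies four bytes in UTF-8, so the two conventions disagree. The Rust sketch below is illustrative only and is not the PR's code. `char_position` and `byte_position` are hypothetical helpers, and the NULL-in/NULL-out and 0-when-absent behavior is an assumption modeled on Postgres's strpos.

```rust
/// Hypothetical helper (not DataFusion's implementation): 1-based character
/// position of the first occurrence of `needle` in `haystack`, 0 when it is
/// absent, and `None` (standing in for SQL NULL) when either input is NULL.
fn char_position(haystack: Option<&str>, needle: Option<&str>) -> Option<i64> {
    // NULL in, NULL out (assumed Postgres-style behavior).
    let (haystack, needle) = (haystack?, needle?);
    Some(match haystack.find(needle) {
        // `find` yields a byte offset, so count the characters before it.
        Some(byte_offset) => haystack[..byte_offset].chars().count() as i64 + 1,
        None => 0,
    })
}

/// The same lookup reporting a 1-based byte position, for contrast.
fn byte_position(haystack: &str, needle: &str) -> i64 {
    haystack.find(needle).map_or(0, |offset| offset as i64 + 1)
}
```

With these helpers, `char_position(Some("😀abcdef"), Some("abc"))` gives `Some(2)`, while `byte_position("😀abcdef", "abc")` gives `5`. A mismatch of this kind would be one possible source of the discrepancy above.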
Hi @blaginin, thanks for the review, and apologies for the delayed reply. I think I have addressed most of the concerns raised. Please have another look.
Pull Request Overview
This PR adds the regexp_instr SQL function, including documentation, SQL logic tests, benchmarking, and integration into the function registry.
- Introduce regexp_instr in docs and user guide
- Add regexp_instr SQL tests and benches
- Expose regex compilation helpers and register the new UDF
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
docs/source/user-guide/sql/scalar_functions.md | Add docs for regexp_instr, including signature and examples |
datafusion/sqllogictest/test_files/regexp/regexp_instr.slt | Add SQL logic tests covering regexp_instr behavior |
datafusion/functions/src/regex/regexpcount.rs | Make compile_and_cache_regex and compile_regex public |
datafusion/functions/src/regex/mod.rs | Register regexp_instr UDF and define its expression builder |
datafusion/functions/benches/regx.rs | Add benchmark cases for regexp_instr |
Comments suppressed due to low confidence (4)
docs/source/user-guide/sql/scalar_functions.md:1844
- The start argument description is duplicated (- **start**: - **start**:). Remove the extra label for clarity.
- **start**: - **start**: Optional start position (the first position is 1) to search for the regular expression. Can be a constant, column, or function. Defaults to 1
docs/source/user-guide/sql/scalar_functions.md:1837
- The function signature is missing the subexpr (and return_option) parameters implemented in code. Update the signature to match the actual argument order.
regexp_instr(str, regexp[, start[, N[, flags]]])
datafusion/functions/src/regex/mod.rs:71
- [nitpick] The parameter name endoption is unclear. Consider renaming it to return_option or another descriptive name consistent with SQL terminology.
endoption: Option<Expr>,
datafusion/sqllogictest/test_files/regexp/regexp_instr.slt:1
- There are no tests covering the subexpr argument. Add test cases that use the subexpr parameter to validate capture-group position behavior.
# Licensed to the Apache Software Foundation (ASF) under one
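To make the start and N arguments of the signature quoted above concrete, here is a minimal Rust sketch. It swaps the regular expression for a plain literal pattern so that it needs only the standard library, so it is not the PR's implementation. `instr_literal` is a hypothetical name, and the conventions used (1-based arguments, positions counted in characters from the beginning of the whole string, 0 when there is no N-th match) are assumptions based on the Postgres behavior the PR cites as its reference.

```rust
/// Hypothetical stand-in for regexp_instr(str, pattern, start, N) that
/// matches a literal pattern instead of a regular expression.
/// Returns the 1-based character position of the `n`-th non-overlapping
/// match found at or after character `start`, or 0 if there is none.
fn instr_literal(s: &str, pat: &str, start: usize, n: usize) -> usize {
    // `start` and `n` are 1-based; an empty pattern is rejected for simplicity.
    if start == 0 || n == 0 || pat.is_empty() {
        return 0;
    }
    // Byte offset of the `start`-th character, if the string is long enough.
    let Some((skip, _)) = s.char_indices().nth(start - 1) else {
        return 0;
    };
    match s[skip..].match_indices(pat).nth(n - 1) {
        // Convert the byte offset back into a character position that is
        // relative to the whole string, not to `start`.
        Some((offset, _)) => s[..skip + offset].chars().count() + 1,
        None => 0,
    }
}
```

For example, `instr_literal("abcabcabc", "abc", 1, 2)` returns 4, and so does `instr_literal("abcabcabc", "abc", 2, 1)`, because the reported position does not shift with start.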
Thanks for your work, @nirnayroy!
The PR is much better now, but I think there's still room for improvement:
- Tests are failing. If that helps, you can run the CI command locally to debug:

- The code around handling literals and arrays feels a bit complex. I think arrow/df does this in a simpler and more concise way – I’ve attached some links for inspiration :)
I haven’t reviewed the tests yet since I assume things might change as you iterate. But I can see you're covering a lot of cases, which is great ☘️
add tests for subexp correct function signature for benches
Hi @blaginin , thanks for the help and suggestions for improvement.
I have tried running it and the tests are passing locally.
    args.push(subexpr);
};
super::regexp_instr().call(args)
}
/// Returns true if a has at least one match in a string, false otherwise.
/// Returns true if a has at least one match in a string, false otherwise.
/// Returns true if a regex has at least one match in a string, false otherwise.
Some(flags) => {
    if flags.contains("g") {
        return Err(ArrowError::ComputeError(
            "regexp_count() does not support global flag".to_string(),
"regexp_count() does not support global flag".to_string(),
"regexp_count()/regexp_instr() does not support the global flag".to_string(),
let (start_array, _is_start_scalar) = start_array.map_or((None, true), |start| {
    let (start, is_start_scalar) = start.get();
    (Some(start), is_start_scalar)
});
let (start_array, _is_start_scalar) = start_array.map_or((None, true), |start| {
    let (start, is_start_scalar) = start.get();
    (Some(start), is_start_scalar)
});
let start_array = start_array.map_or(None, |start| {
    let (start, _is_start_scalar) = start.get();
    Some(start)
});
???
addressed, removed the unused variables
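As a side note on the suggestion above, `map_or(None, |x| Some(..))` on an `Option` is equivalent to a plain `map`, which Clippy also flags through its `option_map_or_none` lint. The sketch below shows the equivalence under stated assumptions: `Column` and its `get` method are hypothetical stand-ins for whatever type the PR's `start.get()` is called on, not the real DataFusion types.

```rust
/// Hypothetical wrapper that reports (data, is_scalar), loosely mimicking
/// the shape of the `start.get()` call in the snippet above.
struct Column {
    values: Vec<i64>,
    is_scalar: bool,
}

impl Column {
    fn get(&self) -> (&[i64], bool) {
        (&self.values, self.is_scalar)
    }
}

/// The form from the review suggestion: `map_or` with a `None` default.
fn start_values_map_or(start: Option<&Column>) -> Option<&[i64]> {
    start.map_or(None, |s| {
        let (values, _is_scalar) = s.get();
        Some(values)
    })
}

/// The shorter equivalent: `map`, keeping only the first tuple element.
fn start_values_map(start: Option<&Column>) -> Option<&[i64]> {
    start.map(|s| s.get().0)
}
```

Both functions return the same result for `Some` and `None` inputs; the second simply avoids the redundant `None`/`Some` wrapping.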
I'll pull this branch later this week and run the tests but in general this PR is looking pretty good! I left a few comments/suggestions for a few things I found from a quick review.
I think all my points are resolved, so approving. Thank you!
Will merge after the review round from @Omega359
Run extended tests
It works!
Clippy failures related to rand update (I think #16062). Edit: looks like the usages of rand for the benchmark were updated in the above commit ... I'm thinking the additions in this PR do not reflect that change.
Run extended tests
Which issue does this PR close?
Rationale for this change
Implements a regex SQL standard function in datafusion
What changes are included in this PR?
Implementation, tests, benches and docs for the regexp_instr function
Are these changes tested?
Yes
Are there any user-facing changes?
Yes
No