[DRAFT] feat: adds try_cast via a permissive option in arrow2 cast kernels #4062

Draft
wants to merge 3 commits into
base: main

rchowell
Contributor

Summary

While working on try_{de|en}code, I wanted to see if I could do the try_cast plumbing, but it was getting out of hand. This is a draft so that I can push the encoding work first and revisit this later.

NOTE: I intend to remove the TryCast Expr and instead add options to the existing CAST expression to simplify things.

Related Issues

#3989

Changes Made

  • Adds daft cast options for more advanced cast use-cases (for now just permissive casting)
  • Adds a permissive option to the arrow2 cast options
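
For illustration only (this is not the code in this PR): a minimal pure-Python sketch of what a permissive cast option could mean, assuming it follows the usual try_cast convention that a value which fails to convert becomes null instead of raising. The function name `cast_values` and the `permissive` parameter are hypothetical and are not taken from daft or arrow2.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

S = TypeVar("S")
T = TypeVar("T")


def cast_values(
    values: Sequence[Optional[S]],
    convert: Callable[[S], T],
    permissive: bool = False,
) -> List[Optional[T]]:
    """Cast each value with `convert`; nulls (None) pass through unchanged.

    Hypothetical helper. With permissive=False the first failure raises,
    like a strict cast. With permissive=True a failing value becomes None
    (assumed try_cast semantics).
    """
    out: List[Optional[T]] = []
    for value in values:
        if value is None:
            out.append(None)
            continue
        try:
            out.append(convert(value))
        except (ValueError, TypeError):
            if not permissive:
                raise
            out.append(None)
    return out


print(cast_values(["1", "x", None, "3"], int, permissive=True))
# prints [1, None, None, 3]
```

Keeping the option on the cast itself rather than introducing a separate expression matches the note above about folding TryCast into CAST.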

Checklist

  • All tests have passed
  • Documented in API Docs
  • Documented in User Guide
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

@github-actions github-actions bot added the feat label Mar 25, 2025
@rchowell rchowell changed the title feat: adds try_cast via a permissive option in arrow2 cast kernels [DRAFT] feat: adds try_cast via a permissive option in arrow2 cast kernels Mar 27, 2025
rchowell added a commit that referenced this pull request Mar 27, 2025
## Summary

**This gives a ~3.2x speedup for decoding binary arrays into string
arrays**

This PR adds try_encode and try_decode with a special case for utf-8. It
covers binary-to-binary transforms such as gzip compress and decompress,
as well as binary-to-text and text-to-binary transforms such as
converting bytes to utf-8 and vice versa. We can continue to build from
this [with additional
encodings](https://docs.python.org/3/library/codecs.html#standard-encodings),
and I've carved out a special no-copy path for utf-8.
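
As a rough illustration of the intended behavior (not the implementation in this PR), here is a standard-library sketch of per-value try_decode and try_encode. It assumes that a value which cannot be decoded or encoded becomes None rather than raising, and it handles only two codecs. The function signatures are hypothetical and do not mirror the daft expression API.

```python
import gzip
import zlib
from typing import Optional, Union


def try_decode(data: Optional[bytes], codec: str) -> Optional[Union[bytes, str]]:
    """Decode one value, returning None on failure (assumed semantics)."""
    if data is None:
        return None
    try:
        if codec == "gzip":
            # Binary-to-binary: bytes in, bytes out.
            return gzip.decompress(data)
        # Binary-to-text: bytes in, str out.
        return data.decode(codec)
    except (UnicodeDecodeError, OSError, EOFError, zlib.error):
        return None


def try_encode(value: Optional[Union[bytes, str]], codec: str) -> Optional[bytes]:
    """Encode one value, returning None on failure (assumed semantics)."""
    if value is None:
        return None
    try:
        if codec == "gzip":
            # Binary-to-binary: expects bytes.
            return gzip.compress(value)
        # Text-to-binary: expects str.
        return value.encode(codec)
    except UnicodeEncodeError:
        return None


print(try_decode(b"caf\xc3\xa9", "utf-8"))  # valid utf-8 bytes
print(try_decode(b"\xff", "utf-8"))         # invalid utf-8, so None
print(try_decode(gzip.compress(b"abc"), "gzip"))
```

An unknown codec name still raises `LookupError` in this sketch, since that is a usage error rather than a bad value.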

## Performance Results

Three runs with 10 iterations (+1 warmup) on 1 million rows show a ~3.2x
speedup.

```
❯ pytest ./tests/functions/test_codecs.py -k test_try_decode_utf8_perf -s
Native try_decode stats (seconds): {'mean': 0.1969996452331543, 'median': 0.19691014289855957, 'min': 0.1933138370513916, 'max': 0.20042800903320312, 'stdev': 0.0018028098671721037}
UDF try_decode stats (seconds): {'mean': 0.6376919507980346, 'median': 0.6374071836471558, 'min': 0.6186070442199707, 'max': 0.6605658531188965, 'stdev': 0.011603869017790357}
**Average speedup: 3.24x**

❯ pytest ./tests/functions/test_codecs.py -k test_try_decode_utf8_perf -s
Native try_decode stats (seconds): {'mean': 0.19709632396697999, 'median': 0.19748806953430176, 'min': 0.19363689422607422, 'max': 0.1991891860961914, 'stdev': 0.00167838499446807}
UDF try_decode stats (seconds): {'mean': 0.6387589693069458, 'median': 0.639365553855896, 'min': 0.6251809597015381, 'max': 0.651353120803833, 'stdev': 0.0075957305958397415}
**Average speedup: 3.24x**

❯ pytest ./tests/functions/test_codecs.py -k test_try_decode_utf8_perf -s
Native try_decode stats (seconds): {'mean': 0.19655859470367432, 'median': 0.19698894023895264, 'min': 0.19165897369384766, 'max': 0.19891595840454102, 'stdev': 0.0019603584148133366}
UDF try_decode stats (seconds): {'mean': 0.6334790706634521, 'median': 0.6332188844680786, 'min': 0.6258370876312256, 'max': 0.6455898284912109, 'stdev': 0.0063130945873989455}
**Average speedup: 3.22x**
```
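
For readers who want to reproduce the shape of these numbers, the sketch below shows one way to collect the same statistics: warmup runs that are discarded, then timed iterations summarized with the `statistics` module. It is not the test in `tests/functions/test_codecs.py`. The helper names `bench` and `speedup` are hypothetical, and computing the speedup as the ratio of the two means is an assumption, although it does reproduce the 3.24x reported for the first run.

```python
import statistics
import time
from typing import Callable, Dict


def bench(fn: Callable[[], object], iterations: int = 10, warmup: int = 1) -> Dict[str, float]:
    """Time `fn` and summarize the timed iterations in seconds.

    Hypothetical helper; `iterations` must be at least 2 for stdev.
    """
    for _ in range(warmup):
        fn()  # warmup runs are not timed
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(times),
        "median": statistics.median(times),
        "min": min(times),
        "max": max(times),
        "stdev": statistics.stdev(times),
    }


def speedup(baseline: Dict[str, float], candidate: Dict[str, float]) -> float:
    """Ratio of mean times (assumed definition of 'average speedup')."""
    return baseline["mean"] / candidate["mean"]


# Means copied from the first run reported above.
native = {"mean": 0.1969996452331543}
udf = {"mean": 0.6376919507980346}
print(f"Average speedup: {speedup(udf, native):.2f}x")  # prints "Average speedup: 3.24x"
```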

## Related Issues

#3989 
#4062

## Changes Made

* Adds codec kind to differentiate between text and binary encodings
* Adds try_encode and try_decode to python expression API (and all
layers beneath)
* Adds a special-case udf for decoding utf-8 since we only need to
validate the bytes
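
To make the "only need to validate the bytes" point concrete, here is a pure-Python sketch, not the code in this PR. It assumes an Arrow-style layout in which a binary array and a string array are both an offsets list plus one contiguous values buffer, so a utf-8 decode can reuse both buffers and only compute which slots are valid. `CodecKind`, `kind_of`, and `try_decode_utf8` are hypothetical names.

```python
from enum import Enum
from typing import List, Tuple


class CodecKind(Enum):
    """Hypothetical codec kind: what a codec decodes to."""
    BINARY = "binary"  # bytes -> bytes, e.g. gzip
    TEXT = "text"      # bytes -> str, e.g. utf-8


def kind_of(codec: str) -> CodecKind:
    # Only two codecs are classified in this sketch.
    return CodecKind.BINARY if codec == "gzip" else CodecKind.TEXT


def try_decode_utf8(values: bytes, offsets: List[int]) -> Tuple[bytes, List[int], List[bool]]:
    """Return the same buffers plus a validity list.

    Slot i spans values[offsets[i]:offsets[i + 1]]. Invalid utf-8 slots
    are marked False (null) instead of raising.
    """
    validity = []
    for start, end in zip(offsets, offsets[1:]):
        try:
            # Python copies the slice to check it; the output below still
            # reuses the original buffers untouched.
            values[start:end].decode("utf-8")
            validity.append(True)
        except UnicodeDecodeError:
            validity.append(False)
    return values, offsets, validity


buf = b"hi\xffok"        # three slots: b"hi", b"\xff", b"ok"
offs = [0, 2, 3, 5]
_, _, valid = try_decode_utf8(buf, offs)
print(valid)  # prints [True, False, True]
```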

## Checklist

- [x] All tests have passed
- [x] Documented in API Docs
- [x] Documented in User Guide
- [x] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [x] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)