[DRAFT] feat: adds try_cast via a permissive option in arrow2 cast kernels #4062

rchowell · 2025-03-25T17:51:27Z

Summary

While working on try_{de|en}code I wanted to see if I could do the try_cast plumbing, but it was getting out of hand. This is a draft so that I can push the encoding work and revisit this.

NOTE: I intend to remove the TryCast Expr and instead add options to the existing CAST expression to simplify things.

Related Issues

#3989

Changes Made

Adds Daft cast options for more advanced cast use cases (for now, just permissive casting)
Adds a permissive option to the arrow2 cast options
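
For readers unfamiliar with the idea, here is a minimal sketch of what "permissive" casting means: a value that fails to convert becomes null instead of aborting the whole cast. This is plain Python for illustration only. `cast_values` and its `permissive` flag are hypothetical names, not the Daft or arrow2 API touched by this PR.

```python
from typing import Any, Callable, Iterable, List, Optional


def cast_values(
    values: Iterable[Optional[Any]],
    convert: Callable[[Any], Any],
    permissive: bool = False,
) -> List[Optional[Any]]:
    """Cast each value with `convert` (hypothetical helper, not Daft's API).

    Strict mode re-raises the first conversion error.
    Permissive mode replaces the failing value with None (null).
    """
    out: List[Optional[Any]] = []
    for value in values:
        if value is None:
            # Nulls pass through unchanged in both modes.
            out.append(None)
            continue
        try:
            out.append(convert(value))
        except (ValueError, TypeError):
            if not permissive:
                raise
            out.append(None)
    return out


# Permissive: the bad value "x" becomes None, the rest are cast.
result = cast_values(["1", "x", None, "3"], int, permissive=True)
```

Under this sketch, a `try_cast` is just a cast with `permissive=True`, which is the simplification the NOTE above is aiming for.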

Checklist

All tests have passed
Documented in API Docs
Documented in User Guide
If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

## Summary

**This gives ~3.2x speedup for decoding binary arrays into string arrays**

This PR adds try_encode and try_decode with a utf-8 special case. You'll see cases for binary-to-binary transforms like gzip compress and decompress, as well as binary-to-text and text-to-binary transformations for things like converting bytes to utf-8 and vice versa. We can continue to build from this [with additional encodings](https://docs.python.org/3/library/codecs.html#standard-encodings) and I've carved out a special no-copy path for utf-8.

## Performance Results

Three runs with 10 iterations (+1 warmup) on 1 million rows show a ~3.2x speedup.

```
❯ pytest ./tests/functions/test_codecs.py -k test_try_decode_utf8_perf -s
Native try_decode stats (seconds): {'mean': 0.1969996452331543, 'median': 0.19691014289855957, 'min': 0.1933138370513916, 'max': 0.20042800903320312, 'stdev': 0.0018028098671721037}
UDF try_decode stats (seconds): {'mean': 0.6376919507980346, 'median': 0.6374071836471558, 'min': 0.6186070442199707, 'max': 0.6605658531188965, 'stdev': 0.011603869017790357}
**Average speedup: 3.24x**

❯ pytest ./tests/functions/test_codecs.py -k test_try_decode_utf8_perf -s
Native try_decode stats (seconds): {'mean': 0.19709632396697999, 'median': 0.19748806953430176, 'min': 0.19363689422607422, 'max': 0.1991891860961914, 'stdev': 0.00167838499446807}
UDF try_decode stats (seconds): {'mean': 0.6387589693069458, 'median': 0.639365553855896, 'min': 0.6251809597015381, 'max': 0.651353120803833, 'stdev': 0.0075957305958397415}
**Average speedup: 3.24x**

❯ pytest ./tests/functions/test_codecs.py -k test_try_decode_utf8_perf -s
Native try_decode stats (seconds): {'mean': 0.19655859470367432, 'median': 0.19698894023895264, 'min': 0.19165897369384766, 'max': 0.19891595840454102, 'stdev': 0.0019603584148133366}
UDF try_decode stats (seconds): {'mean': 0.6334790706634521, 'median': 0.6332188844680786, 'min': 0.6258370876312256, 'max': 0.6455898284912109, 'stdev': 0.0063130945873989455}
**Average speedup: 3.22x**
```

## Related Issues

#3989 #4062

## Changes Made

* Adds codec kind to differentiate between text and binary encodings
* Adds try_encode and try_decode to python expression API (and all layers beneath)
* Adds a special-case udf for decoding utf-8 since we only need to validate the bytes

## Checklist

- [x] All tests have passed
- [x] Documented in API Docs
- [x] Documented in User Guide
- [x] If adding a new documentation page, doc is added to `docs/mkdocs.yml` navigation
- [x] Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)
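
To make the try_decode semantics described above concrete, here is a rough sketch in plain Python using only the standard library. It is an illustration under assumptions, not the code in this PR: `try_decode_utf8` and `try_decode_gzip` are hypothetical helper names, and unlike the native no-copy path, `bytes.decode` both validates and copies.

```python
import gzip
import zlib
from typing import Iterable, List, Optional


def try_decode_utf8(values: Iterable[Optional[bytes]]) -> List[Optional[str]]:
    """Binary-to-text: invalid utf-8 becomes None instead of raising.

    Hypothetical helper for illustration only.
    """
    out: List[Optional[str]] = []
    for raw in values:
        if raw is None:
            out.append(None)
            continue
        try:
            # Decoding utf-8 amounts to validating the bytes.
            out.append(raw.decode("utf-8"))
        except UnicodeDecodeError:
            out.append(None)
    return out


def try_decode_gzip(values: Iterable[Optional[bytes]]) -> List[Optional[bytes]]:
    """Binary-to-binary: payloads that fail to decompress become None."""
    out: List[Optional[bytes]] = []
    for raw in values:
        if raw is None:
            out.append(None)
            continue
        try:
            out.append(gzip.decompress(raw))
        except (OSError, EOFError, zlib.error):
            # Bad header, truncated stream, or corrupt deflate data.
            out.append(None)
    return out


text = try_decode_utf8([b"hello", b"\xff\xfe", None])
data = try_decode_gzip([gzip.compress(b"abc"), b"not gzip", None])
```

The two helpers mirror the text/binary codec split listed in the changes: one returns strings, the other returns bytes, and both map failures to null.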

rchowell added 3 commits March 24, 2025 17:56

feat: adds try_encode and try_decode with utf-8 special-case

3571521

implements try_decode and use try_cast for utf8

b1e1f99

backup try_cast

bb97d48

github-actions bot added the feat label Mar 25, 2025

rchowell mentioned this pull request Mar 25, 2025

feat: adds try_encode and try_decode with utf-8 special-case #4060

Merged

5 tasks

rchowell changed the title ~~feat: adds try_cast via a permissive option in arrow2 cast kernels~~ [DRAFT] feat: adds try_cast via a permissive option in arrow2 cast kernels Mar 27, 2025
