@fangchenli (Member) commented Jan 22, 2026

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
  • If I used AI to develop this pull request, I prompted it to follow AGENTS.md.

Add a script that generates documentation of Arrow method fallbacks in pandas: which methods use native PyArrow compute and which fall back to Python/NumPy. The Databricks PySpark team requested this to identify which operations trigger Arrow-to-NumPy conversions in PySpark's pandas UDFs.

@zhengruifeng @Yicong-Huang

fangchenli and others added 5 commits January 11, 2026 15:46
Add documentation and tooling for Arrow method fallback behavior:

- Add scripts/generate_arrow_fallback_table.py: Generator script that
  introspects pandas source to classify methods by their Arrow support
  (ARROW_NATIVE, CONDITIONAL, ELEMENTWISE, OBJECT_FALLBACK, VERSION_GATED,
  NOT_IMPLEMENTED). Includes --check flag for CI validation.

- Add doc/source/user_guide/arrow_string_fallbacks.rst: Generated
  reference documenting which methods use native PyArrow compute vs
  falling back to Python/NumPy. Covers string, arithmetic, datetime,
  aggregation, array, list accessor, and struct accessor methods.

- Add pre-commit hook (arrow-fallback-docs-sync) to ensure documentation
  stays in sync with source code changes.

- Add comprehensive verification tests (204 tests) that validate
  classifications match actual runtime behavior.

- Link new reference from pyarrow.rst user guide.

- Update exclude pattern for private-import check to include scripts/tests.
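
For readers unfamiliar with the pattern, a `--check` flag like the one mentioned above could work roughly as follows: regenerate the document in memory, compare it with the committed file, and return a non-zero exit code on mismatch so that CI or a pre-commit hook fails. This is only a minimal sketch and not the contents of scripts/generate_arrow_fallback_table.py; `render_table`, `is_in_sync`, `main`, the positional `output` argument, and the placeholder data are all assumptions made for illustration.

```python
import argparse
import sys
from pathlib import Path

# Made-up placeholder data for illustration only. The category names come
# from the commit message; the method names are hypothetical.
EXAMPLE_CLASSIFICATIONS = {
    "method_a": "ARROW_NATIVE",
    "method_b": "OBJECT_FALLBACK",
}


def render_table(classifications):
    """Render a {method: category} mapping as an RST list-table."""
    lines = [
        ".. list-table::",
        "   :header-rows: 1",
        "",
        "   * - Method",
        "     - Category",
    ]
    for method, category in sorted(classifications.items()):
        lines += [f"   * - {method}", f"     - {category}"]
    return "\n".join(lines) + "\n"


def is_in_sync(generated, path):
    """Return True if the file at `path` exists and matches `generated`."""
    return path.exists() and path.read_text(encoding="utf-8") == generated


def main(argv=None):
    parser = argparse.ArgumentParser(description="Generate a fallback table.")
    parser.add_argument("output", type=Path, help="file to write or check")
    parser.add_argument(
        "--check",
        action="store_true",
        help="do not write; fail if the file on disk is out of date",
    )
    args = parser.parse_args(argv)

    generated = render_table(EXAMPLE_CLASSIFICATIONS)
    if args.check:
        if is_in_sync(generated, args.output):
            return 0
        print(f"{args.output} is out of date; regenerate it.", file=sys.stderr)
        return 1

    args.output.write_text(generated, encoding="utf-8")
    return 0
```

In a script, the return value would typically be passed to `sys.exit(main())`, so a stale file turns into a failing exit status.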

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Replace the AST-based analysis with runtime observation:
- Actually run all operations on all Arrow dtypes
- Observe return types and errors
- Instrument to_numpy and _apply_elementwise to detect fallbacks

This approach is more accurate because it observes actual behavior
rather than inferring from code analysis.
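
The instrumentation idea described above can be sketched in plain Python: temporarily wrap the conversion method, run one operation, and check whether the wrapper was called. This is a hedged illustration, not the script's actual code. `FakeArrowArray` is a hypothetical stand-in rather than a pandas class, and `record_calls`, `classify`, and the "native"/"fallback" labels are names assumed for this example.

```python
import contextlib
import functools


class FakeArrowArray:
    """Hypothetical stand-in for an Arrow-backed array (not pandas code)."""

    def __init__(self, values):
        self.values = list(values)

    def to_numpy(self):
        # Stand-in for a conversion that leaves Arrow memory.
        return list(self.values)

    def upper(self):
        # "Native" path: never calls to_numpy.
        return FakeArrowArray(v.upper() for v in self.values)

    def casefold(self):
        # "Fallback" path: converts first, then works on Python objects.
        return FakeArrowArray(v.casefold() for v in self.to_numpy())


@contextlib.contextmanager
def record_calls(cls, method_names):
    """Temporarily wrap methods on `cls`, recording each name when called."""
    calls = []
    originals = {name: getattr(cls, name) for name in method_names}

    def make_wrapper(name, func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            calls.append(name)
            return func(*args, **kwargs)

        return wrapper

    for name, func in originals.items():
        setattr(cls, name, make_wrapper(name, func))
    try:
        yield calls
    finally:
        # Restore the original methods even if the operation raised.
        for name, func in originals.items():
            setattr(cls, name, func)


def classify(arr, op_name):
    """Run one operation and report whether it triggered a conversion."""
    with record_calls(type(arr), ["to_numpy"]) as calls:
        try:
            getattr(arr, op_name)()
        except Exception as exc:
            return f"error: {type(exc).__name__}"
    return "fallback" if calls else "native"
```

With this stand-in, `classify(FakeArrowArray(["Ab"]), "upper")` returns `"native"` and `classify(FakeArrowArray(["Ab"]), "casefold")` returns `"fallback"`. Observing calls at runtime avoids guessing from source text which code path a method takes, which is the accuracy argument made in the commit message.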

Changes:
- Rewrite scripts/generate_arrow_fallback_table.py using runtime tests
- Update scripts/tests/test_generate_arrow_fallback_table.py for new API
- Remove scripts/tests/test_arrow_fallback_verification.py (no longer needed)
- Regenerate doc/source/user_guide/arrow_string_fallbacks.rst
- Update pre-commit hook to use manual stage (requires pandas-dev env)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@fangchenli added the Docs and Arrow (pyarrow functionality) labels Jan 22, 2026
@zhengruifeng (Contributor) left a comment

Thank you so much for working on it, it is very helpful!

@zhengruifeng (Contributor) commented

cc @HyukjinKwon
