@sfc-gh-jkew sfc-gh-jkew commented Oct 24, 2025

df.apply(axis=1) should preserve the original index. Previously we would return a RangeIndex regardless of the original index. This approach passes the index data into the underlying UDTF.
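For reference, a minimal sketch of the behavior being matched, shown with native pandas rather than Snowpark pandas. The frame, labels, and function are illustrative assumptions, not taken from this PR's tests.

```python
import pandas as pd

# Illustrative frame with a named, non-default index.
df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [10, 20, 30]},
    index=pd.Index(["x", "y", "z"], name="key"),
)

# Row-wise apply: each row is passed to the function as a Series,
# and the result keeps the original row labels.
result = df.apply(lambda row: row["a"] + row["b"], axis=1)

print(result.index.tolist())  # ['x', 'y', 'z']
```

With the fix described here, the same call through Snowpark pandas would be expected to return the `key` index as well, rather than a RangeIndex.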

Mostly AI-written approach, but with original tests for verification.

Fixes SNOW-1051741

  1. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
    • If adding any arguments to public Snowpark APIs or creating new public Snowpark APIs, I acknowledge that I have ensured my changes include AST support. Follow the link for more information: AST Support Guidelines

@sfc-gh-jkew sfc-gh-jkew added the NO-PANDAS-CHANGEDOC-UPDATES This PR does not update Snowpark pandas docs label Oct 24, 2025
@sfc-gh-jkew sfc-gh-jkew marked this pull request as ready for review October 24, 2025 20:56
@sfc-gh-jkew sfc-gh-jkew requested a review from a team as a code owner October 24, 2025 20:56

@sfc-gh-helmeleegy sfc-gh-helmeleegy left a comment

LGTM, just had one question.

Comment on lines 469 to 489
if num_index_columns > 0:
    # Columns after row position are index columns, then data columns
    index_cols = df.iloc[:, 1 : 1 + num_index_columns]
    data_cols = df.iloc[:, 1 + num_index_columns :]

    # Set the index using the index columns
    if num_index_columns == 1:
        index = index_cols.iloc[:, 0]
        if index_column_pandas_labels:
            index.name = index_column_pandas_labels[0]
    else:
        # Multi-index case
        index = native_pd.MultiIndex.from_arrays(
            [index_cols.iloc[:, i] for i in range(num_index_columns)],
            names=index_column_pandas_labels
            if index_column_pandas_labels
            else None,
        )
    data_cols.index = index
    df = data_cols
else:

can't we use set_index() in both cases?

I meant that you can replace most of the code here with set_index(). See #3979.
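A minimal sketch of what a set_index()-based version could look like, assuming the column layout described in the snippet under review (row position first, then index columns, then data columns). The helper name `restore_index`, its parameters, and the column labels are hypothetical; this is not the code from #3979.

```python
import pandas as pd


def restore_index(df, num_index_columns, index_names=None):
    """Hypothetical helper: rebuild the index from the leading columns.

    Assumes column 0 is the row position, the next `num_index_columns`
    columns hold index values, the rest are data columns, and all
    column labels are unique.
    """
    if num_index_columns == 0:
        # The original else branch is not shown; the frame is left as-is here.
        return df
    index_keys = list(df.columns[1 : 1 + num_index_columns])
    # Drop the row position column, then move the index columns into the index.
    # set_index() gives a plain Index for one key and a MultiIndex for several,
    # so no separate single-index and multi-index branches are needed.
    out = df.iloc[:, 1:].set_index(index_keys)
    # Restore the original level names (None means an unnamed level).
    out.index.names = index_names or [None] * num_index_columns
    return out


raw = pd.DataFrame(
    {"row_pos": [0, 1], "i0": ["x", "y"], "i1": [1, 2], "a": [10, 20]}
)
restored = restore_index(raw, 2, ["outer", "inner"])
single = restore_index(raw, 1, ["k"])
```

Here `restored` has a two-level index and a single data column `a`, while `single` has a plain index named `k` and keeps `i1` as a data column.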

Comment on lines 447 to 448
input_types: Snowpark column types of the input data columns (including index columns).
index_column_pandas_labels: The pandas labels for the index columns, if any.

Suggested change
input_types: Snowpark column types of the input data columns (including index columns).
index_column_pandas_labels: The pandas labels for the index columns, if any.
input_types: Snowpark column types of the input data columns (including index columns).



@sql_count_checker(query_count=5, join_count=2, udtf_count=1)
def test_apply_axis_1_multiindex_preservation():

Could we also test the following?

  1. func with return type annotations. In that case we'll use vectorized UDFs instead of UDTFs.
  2. func returning a Series.
  3. apply() on a Series (with func typed, untyped, or returning a Series).
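As a rough guide to what such tests would compare against, here is how native pandas treats those three cases. The frame and functions are illustrative assumptions; real tests would run the same calls through Snowpark pandas and compare against these results, and the expected query counts are not shown.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2], "b": [3, 4]},
    index=pd.MultiIndex.from_tuples([("x", 1), ("y", 2)], names=["k1", "k2"]),
)


# 1. func with a return type annotation
def typed_sum(row) -> int:
    return int(row["a"] + row["b"])


typed_result = df.apply(typed_sum, axis=1)

# 2. func returning a Series: the result is a DataFrame, one column per key
series_result = df.apply(
    lambda row: pd.Series(
        {"total": row["a"] + row["b"], "diff": row["b"] - row["a"]}
    ),
    axis=1,
)

# 3. apply() on a Series (untyped func shown; typed works the same way here)
elementwise_result = df["a"].apply(lambda x: x + 1)

# In every case the original MultiIndex should survive.
for res in (typed_result, series_result, elementwise_result):
    assert res.index.equals(df.index)
```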

Comment on lines +9812 to +9813
# Determine if we should pass index columns to the UDTF
# We pass index columns when the index is not the row position itself

We always pass the index column names here. We can keep doing that, but we should update the comment and make the parameter required, since there don't seem to be any other invocations of that function.

column_index: native_pd.Index,
input_types: list[DataType],
session: Session,
index_column_labels: list[Hashable] | None = None,

It turns out that just passing the number of index columns is enough:

# columns. We don't care about the index names because `func`
