SNOW-1360263: row_number window function raises a ValueError (index is not in list) #1490

Closed
samuelsongsr opened this issue May 2, 2024 · 3 comments
Labels: bug, local testing, needs triage, status-triage_needed

samuelsongsr commented May 2, 2024

  1. What version of Python are you using?

    Python 3.10.4 (main, May 26 2022, 13:33:07) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]

  2. What are the Snowpark Python and pandas versions in the environment?

    pandas==2.2.1
    snowflake-snowpark-python==1.15.0

  3. What did you do?
    I'm trying to test a row_number() window function, and it raises a ValueError from a line in _plan.py. Reproduction:

from datetime import date

from snowflake.snowpark import Row, Window, Session
from snowflake.snowpark.functions import row_number, col


# create some test data
data = [
    Row(id=1, row_date=date(2024, 1, 1), value=1),
    Row(id=2, row_date=date(2024, 1, 1), value=1),
    Row(id=1, row_date=date(2024, 1, 2), value=1),
    Row(id=1, row_date=date(2024, 1, 2), value=100),
    Row(id=2, row_date=date(2024, 1, 2), value=1)
]

# create a local testing session and dataframe
mock_session = Session.builder.config("local_testing", True).create()
test_data = mock_session.create_dataframe(data)

# partition over id and row_date and get the records with the largest values
window = Window.partition_by("id", "row_date").order_by(col("value").desc())
df = test_data.with_column("row_num", row_number().over(window)).where(col("row_num") == 1)

results = df.collect()

  4. What did you expect to see?

    The results should include the following 4 records:

Row(id=1, row_date=date(2024, 1, 1), value=1),
Row(id=2, row_date=date(2024, 1, 1), value=1),
Row(id=1, row_date=date(2024, 1, 2), value=100),
Row(id=2, row_date=date(2024, 1, 2), value=1)

Instead of the above results, a ValueError is raised stating "1 is not in list".
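
For reference, a minimal verification sketch (not part of the original report): it assumes the reproduction script above has run and results holds the collected rows, indexes rows positionally to avoid assumptions about identifier casing, and compares sets because collect() does not guarantee row order.

# expected (id, row_date, value) triples; the row_num column appended by
# with_column comes last and is ignored here
expected = {
    (1, date(2024, 1, 1), 1),
    (2, date(2024, 1, 1), 1),
    (1, date(2024, 1, 2), 100),
    (2, date(2024, 1, 2), 1),
}
actual = {(row[0], row[1], row[2]) for row in results}
assert actual == expected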

samuelsongsr added the bug, local testing, and needs triage labels on May 2, 2024
github-actions bot changed the title from "row_number window function raises a ValueError (index is not in list)" to "SNOW-1360263: row_number window function raises a ValueError (index is not in list)" on May 2, 2024
sfc-gh-ashahi self-assigned this on May 2, 2024
sfc-gh-ashahi added the status-triage_needed label on May 2, 2024
tvdboom commented Sep 13, 2024

Any update on this one? I am still encountering this issue on version 1.22.1

frederiksteiner (Contributor) commented

I encountered the same problem and played around with it a bit, using a similar, smaller example.

This one does not work:

from datetime import date

from snowflake.snowpark import Row, Window, Session
from snowflake.snowpark.functions import row_number, col


# create some test data
data = [
    Row(id=1, row_date=date(2024, 1, 1), value=1),
    Row(id=2, row_date=date(2024, 1, 1), value=1),
    Row(id=3, row_date=date(2024, 1, 2), value=1),
    Row(id=1, row_date=date(2024, 1, 2), value=100),
]

# create a local testing session and dataframe
mock_session = Session.builder.config("local_testing", True).create()
test_data = mock_session.create_dataframe(data)

# partition over row_date and id and get the records with the largest values
window = Window.partition_by(["row_date", "id"]).order_by(col("value").desc())
df = test_data.with_column("row_num", row_number().over(window)).where(col("row_num") == 1)

results = df.collect()

But when "row_date" and "id" are swapped in the partition_by, it works and raises no such error:

from datetime import date

from snowflake.snowpark import Row, Window, Session
from snowflake.snowpark.functions import row_number, col


# create some test data
data = [
    Row(id=1, row_date=date(2024, 1, 1), value=1),
    Row(id=2, row_date=date(2024, 1, 1), value=1),
    Row(id=3, row_date=date(2024, 1, 2), value=1),
    Row(id=1, row_date=date(2024, 1, 2), value=100),
]

# create a local testing session and dataframe
mock_session = Session.builder.config("local_testing", True).create()
test_data = mock_session.create_dataframe(data)

# partition over id and row_date and get the records with the largest values
window = Window.partition_by(["id", "row_date", ]).order_by(col("value").desc())
df = test_data.with_column("row_num", row_number().over(window)).where(col("row_num") == 1)

results = df.collect()

sfc-gh-jrose (Contributor) commented

This should be fixed in v1.24.0
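
A quick way to confirm the installed version before re-running the reproduction above (a minimal sketch, assuming a plain X.Y.Z version string):

import importlib.metadata

# the fix is reported for v1.24.0, so older versions will likely still raise
# the ValueError under local testing
version = importlib.metadata.version("snowflake-snowpark-python")
print("snowflake-snowpark-python", version)
assert tuple(int(p) for p in version.split(".")[:3]) >= (1, 24, 0), \
    "upgrade with: pip install -U 'snowflake-snowpark-python>=1.24.0'"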
