
SNOW-1458135 Implement DataFrame and Series initialization with lazy Index objects #2137

Merged

Conversation

sfc-gh-vbudati (Contributor):

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1458135

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
  3. Please describe how your code solves the related issue.

  • Implemented functionality to enable creating Series and DataFrame objects with a lazy Index object as the data, index, and/or columns.
  • This also covers creating Series and DataFrames with rows/columns that don't exist in the given data.
  • A special case: when the data is a Series or DataFrame object, the new Series or DataFrame is created by filtering the data with the provided index and columns.
  • If some values in index don't exist in data's index, they are added as new rows and their corresponding data values are NaN.
  • If some values in columns don't exist in data's columns, they are added as new NaN columns.
  • A right outer join is used to add the new index values; the new NaN columns are created and appended in the same logic.
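As a sketch of the intended semantics (shown here with native pandas, whose constructor behavior Snowpark pandas mirrors), index and column labels absent from the data become NaN rows and NaN columns:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"A": [1, 2]}, index=[0, 1])

# Index label 2 and column "B" are not present in `data`, so the
# constructor filters to the requested labels and fills the rest with NaN.
df = pd.DataFrame(data, index=[0, 1, 2], columns=["A", "B"])

assert df.shape == (3, 2)
assert np.isnan(df.loc[2, "A"])   # new row -> NaN data values
assert df["B"].isna().all()       # new column -> all NaN
```

Note that column `A` is upcast to float once NaN is introduced, which is standard pandas behavior.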

…y-index

# Conflicts:
#	src/snowflake/snowpark/modin/plugin/compiler/snowflake_query_compiler.py
@sfc-gh-vbudati sfc-gh-vbudati added the NO-PANDAS-CHANGEDOC-UPDATES This PR does not update Snowpark pandas docs label Aug 21, 2024
@sfc-gh-vbudati (Contributor Author):

All of the join counts in the tests have increased because, during DataFrame/Series creation with a non-Snowpark pandas object as data and a Snowpark pandas Index as index, a join is now performed instead of converting the index to pandas (which would result in an extra query).

In some tests the join count is much higher, but that is due to how they are written: some tests call to_pandas() multiple times, which repeats the join each time.

@sfc-gh-azhan (Collaborator) left a comment:

Thanks for doing this! It's a lot of work btw.

Please also check:

  1. If you identify test code that can be improved, please add a TODO and track it with a Jira ticket.
  2. Please run a Jenkins job before merging to make sure nothing is wrong there.

tests/integ/modin/test_concat.py (review thread resolved)
```python
pytest.param(
    "series",
    marks=pytest.mark.xfail(
        reason="SNOW-1675191 reindex does not work with tuple series"
    ),
),
```

@sfc-gh-yzou (Collaborator) left a comment:

@sfc-gh-vbudati, @sfc-gh-azhan mentioned that the main purpose of this PR is to remove a to_pandas materialization. Can we do just that in this PR and move the other refactoring out of it?

```python
name = data.name
from snowflake.snowpark.modin.plugin.extensions.index import Index

# Setting the query compiler
```

Collaborator:

One more general comment about this change: our original code behaved such that if both data and a query compiler were provided, the data was used. Here it seems we want to change it so that only one of them can be supplied. I think that is fine; however, please make sure we update the docs to make this part clear.

A couple of points:

  1. From a structural point of view, I think we can do the parameter check first (for example, the case where both query_compiler and another parameter are provided), and then initialize the query_compiler following the original code structure, unless some case works very differently.
  2. The error messages don't seem very clear. For example, "query_compiler and index can not be provided together" might be better as "index is not supported when query_compiler is provided", etc.

@sfc-gh-vbudati (Contributor Author):

I can make the error messages clearer as you pointed out in (2), e.g. "index is not supported when query_compiler is provided". The parameters are already checked before they are used, and I don't think there are any cases in the code where both a query compiler and data/index/columns are provided (no tests have failed so far in relation to this). I think it's also simpler behavior to keep it this way.
The doc should also be updated with this behavior.
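A minimal sketch of the mutual-exclusion check being discussed (the function name and exact wording are illustrative, not the actual Snowpark pandas implementation):

```python
def validate_ctor_args(query_compiler=None, data=None, index=None, columns=None):
    # If a query compiler is supplied, it must be the only source of state;
    # reject any other constructor argument with a targeted message.
    if query_compiler is not None:
        for name, value in (("data", data), ("index", index), ("columns", columns)):
            if value is not None:
                raise ValueError(
                    f"{name} is not supported when query_compiler is provided"
                )

validate_ctor_args(query_compiler=object())  # ok: only query_compiler given
try:
    validate_ctor_args(query_compiler=object(), index=[1, 2])
except ValueError as e:
    print(e)  # index is not supported when query_compiler is provided
```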

```python
if hasattr(data, "name") and data.name is not None:
    # If data is an object that has a name field, use that as the name of the new Series.
    name = data.name
# If any of the values are Snowpark pandas objects, convert them to native pandas objects.
```

Collaborator:

In this case, shouldn't we try to convert the other values to Snowpark pandas objects instead of pulling them to local? Or maybe we should just error out.

Do you have an example of this case?

@sfc-gh-vbudati (Contributor Author):

One example where it's better to convert to pandas is this:

```python
data = {"A": pd.Series([1, 2, 3]), "B": pd.Index([4, 5, 6]), "C": 5}
pd.DataFrame(data)
```

```
   A  B  C
0  1  4  5
1  2  5  5
2  3  6  5
```

5 is put in every single row even though it's a scalar in the dict.

@sfc-gh-vbudati (Contributor Author):

@sfc-gh-yzou I prefer not making the refactor changes in a new PR, since I think this one is very close to merging and it would take a lot more work to separate the index changes out.

@sfc-gh-azhan (Collaborator):

> @sfc-gh-yzou I prefer not making the refactor changes in a new PR since I think this one is very close to merging and it will take a lot more work to separate the index changes from this

I kind of agree with @sfc-gh-yun that this PR is becoming too big. Can we use this as the PoC draft PR and review smaller PRs one by one? You can either start with the refactoring pieces first or fix the lazy index first. Try to make sure the refactoring PR only does refactoring, with no test changes.

@sfc-gh-vbudati (Contributor Author):

@sfc-gh-azhan @sfc-gh-yzou I can try to separate this PR into two other PRs - one for the lazy index change and the other for the refactor. It is impossible to avoid test changes in the refactor PR since I introduced functionality to allow passing non-existent columns or index values to the constructor. The constructors should be able to handle any kind of inputs and I added tests for this.

However, that requires me to make a non-trivial amount of redundant code changes, for example, the same set of tests are changed in both PRs where the query count will likely be different due to the refactor. I was hoping to work on IR tickets from Monday, so I still prefer merging this PR as is, please let me know if you both feel strongly about this.

In the future, I'd really appreciate it if feedback about splitting PRs were brought up earlier.

```python
# STEP 2: If columns are provided, set the columns if data is lazy.
# STEP 3: If both the data and index are local (or index is None), create a query compiler from pandas.
# STEP 4: Otherwise, set the index through set_index or reindex.
# STEP 5: The resultant query_compiler is then set as the query_compiler for the DataFrame.
```

Collaborator:

I realize dtype is not always handled in this new code. Can you add it?
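For reference, native pandas applies dtype at construction time, which the lazy-index path would need to mirror (plain pandas shown here to illustrate the expected behavior, not the Snowpark pandas implementation):

```python
import pandas as pd

# dtype passed to the constructor determines the resulting column/series dtype.
s = pd.Series([1, 2, 3], dtype="float64")
df = pd.DataFrame({"a": [1, 2]}, dtype="int16")

assert str(s.dtype) == "float64"
assert str(df["a"].dtype) == "int16"
```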

@sfc-gh-azhan (Collaborator):

> @sfc-gh-azhan @sfc-gh-yzou I can try to separate this PR into two other PRs - one for the lazy index change and the other for the refactor. It is impossible to avoid test changes in the refactor PR since I introduced functionality to allow passing non-existent columns or index values to the constructor. The constructors should be able to handle any kind of inputs and I added tests for this.
>
> However, that requires me to make a non-trivial amount of redundant code changes, for example, the same set of tests are changed in both PRs where the query count will likely be different due to the refactor. I was hoping to work on IR tickets from Monday, so I still prefer merging this PR as is, please let me know if you both feel strongly about this.
>
> In the future, I'd really appreciate if the feedback about splitting PRs is brought up earlier.

Then, I'm okay to keep working on this PR.

```diff
 @pytest.mark.parametrize("limit", [None, 1, 2, 100])
 @pytest.mark.parametrize("method", ["bfill", "backfill", "pad", "ffill"])
 def test_reindex_index_datetime_with_fill(limit, method):
     date_index = native_pd.date_range("1/1/2010", periods=6, freq="D")
     native_series = native_pd.Series(
-        {"prices": [100, 101, np.nan, 100, 89, 88]}, index=date_index
+        {"1/1/2020": [100, 101, np.nan, 100, 89, 88]}, index=date_index
```

Collaborator:

Let's keep the tests with xfail and add a TODO.

@sfc-gh-vbudati (Contributor Author):

done!

@sfc-gh-azhan (Collaborator) left a comment:

Thanks for your help!

@@ -1995,3 +1996,68 @@ def create_frame_with_data_columns(

```python
def rindex(lst: list, value: int) -> int:
    """Find the last index of value in lst."""
    return len(lst) - lst[::-1].index(value) - 1
```
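A quick usage check of the rindex helper above, repeated here so the example is self-contained:

```python
def rindex(lst: list, value: int) -> int:
    """Find the last index of value in lst."""
    return len(lst) - lst[::-1].index(value) - 1

# The value 1 appears at indices 0 and 2; rindex returns the last one.
assert rindex([1, 2, 1, 3], 1) == 2
assert rindex([5], 5) == 0
```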


```python
def convert_index_to_qc(index: Any) -> Any:
```

Contributor:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from snowflake.snowpark.modin.plugin.compiler.snowflake_query_compiler import SnowflakeQueryCompiler
```

Have you tried this?

Other review threads (resolved):
  • src/snowflake/snowpark/modin/plugin/_internal/utils.py
  • tests/integ/modin/frame/test_dtypes.py
  • tests/integ/modin/frame/test_loc.py
  • tests/integ/modin/series/test_rank.py
```diff
@@ -11,6 +11,8 @@
 - Added support for using Snowflake Interval constants with `Window.range_between()` when the order by column is TIMESTAMP or DATE type.
 - Added support for file writes. This feature is currently in private preview.
 - Added support for `DataFrameGroupBy.fillna` and `SeriesGroupBy.fillna`.
+- Added support for constructing `Series` and `DataFrame` objects with the lazy `Index` object as `data`, `index`, and `columns` arguments.
+- Added support for constructing `Series` and `DataFrame` objects with `index` and `column` values not present in `DataFrame`/`Series` `data`.
```

@sfc-gh-vbudati (Contributor Author):

I added this new line as well

@sfc-gh-joshi (Contributor) left a comment:

LGTM! Thanks for the work.

@sfc-gh-vbudati sfc-gh-vbudati merged commit 6ddffdf into main Sep 25, 2024
35 checks passed
@sfc-gh-vbudati sfc-gh-vbudati deleted the vbudati/SNOW-1458135-df-series-init-with-lazy-index branch September 25, 2024 23:29
@github-actions github-actions bot locked and limited conversation to collaborators Sep 25, 2024
Labels: NO-PANDAS-CHANGEDOC-UPDATES (This PR does not update Snowpark pandas docs), snowpark-pandas