
SNOW-1458135 Implement DataFrame and Series initialization with lazy Index objects #2137

Merged

Conversation

sfc-gh-vbudati (Contributor):

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1458135

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
  3. Please describe how your code solves the related issue.

  • Implemented functionality to enable creating Series and DataFrame objects with a lazy Index object as the data, index, and/or columns.
  • This also covers creating Series and DataFrames with rows/columns that don't exist in the given data.
  • A special case: when the data is a Series or DataFrame object, the new Series or DataFrame is created by filtering the data with the provided index and columns.
  • If some values in index don't exist in data's index, they are added as new rows and their corresponding data values are NaN.
  • If some values in columns don't exist in data's columns, they are added as new NaN columns.
  • A right outer join is used to add the new index values; the new NaN columns are created and appended in the same logic.
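As a sketch of the intended semantics (shown here with native pandas, whose constructor behavior Snowpark pandas mirrors), index and column labels absent from the data become NaN rows and NaN columns:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"A": [1, 2]}, index=[0, 1])

# Index label 2 and column "B" are not present in `data`, so the
# constructor filters to the requested labels and fills the rest with NaN.
df = pd.DataFrame(data, index=[0, 1, 2], columns=["A", "B"])

assert df.shape == (3, 2)
assert np.isnan(df.loc[2, "A"])   # new row -> NaN data values
assert df["B"].isna().all()       # new column -> all NaN
```

Note that column `A` is upcast to float once NaN is introduced, which is standard pandas behavior.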

…y-index

# Conflicts:
#	src/snowflake/snowpark/modin/plugin/compiler/snowflake_query_compiler.py
@sfc-gh-vbudati sfc-gh-vbudati added the NO-PANDAS-CHANGEDOC-UPDATES This PR does not update Snowpark pandas docs label Aug 21, 2024
@sfc-gh-vbudati (Contributor Author):

All of the join counts in the tests have increased because, during DataFrame/Series creation with a non-Snowpark pandas object as data and a Snowpark pandas Index as index, a join is now performed instead of converting the index to pandas (which would result in an extra query).

In some tests the join count is much higher, but that is due to how they are written: some tests call to_pandas() multiple times, which repeats the join each time.

@sfc-gh-azhan (Collaborator) left a comment:

Thanks for doing this! It's a lot of work btw.

Please also check:

  1. If you identify test code that can be improved, please add a TODO and track it with a Jira ticket.
  2. Please run a Jenkins job before merging to make sure nothing is wrong there.

tests/integ/modin/test_concat.py (review thread resolved)
```python
pytest.param(
    "series",
    marks=pytest.mark.xfail(
        reason="SNOW-1675191 reindex does not work with tuple series"
    ),
),
```

@sfc-gh-yzou (Collaborator) left a comment:

@sfc-gh-vbudati, @sfc-gh-azhan mentioned that the main purpose of this PR is to remove a to_pandas materialization. Can we do just that in this PR and move the other refactoring out of it?

```python
name = data.name
from snowflake.snowpark.modin.plugin.extensions.index import Index

# Setting the query compiler
```

Collaborator:

One more general comment about this change: our original code behaved such that if both data and a query compiler were provided, the data was used. Here it seems we want to change it so that only one of them can be supplied. I think that is fine; however, please make sure we update the docs to make this part clear.

A couple of points:

  1. From a structural point of view, I think we can do the parameter check first (for example, the case where both query_compiler and another parameter are provided), and then initialize the query_compiler following the original code structure, unless some case works very differently.
  2. The error messages don't seem very clear. For example, "query_compiler and index can not be provided together" might be better as "index is not supported when query_compiler is provided", etc.

@sfc-gh-vbudati (Contributor Author):

I can make the error messages clearer as you pointed out in (2), e.g. "index is not supported when query_compiler is provided". The parameters are already checked before they are used, and I don't think there are any cases in the code where both a query compiler and data/index/columns are provided (no tests have failed so far in relation to this). I think it's also simpler behavior to keep it this way.
The doc should also be updated with this behavior.
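A minimal sketch of the mutual-exclusion check being discussed (the function name and exact wording are illustrative, not the actual Snowpark pandas implementation):

```python
def validate_ctor_args(query_compiler=None, data=None, index=None, columns=None):
    # If a query compiler is supplied, it must be the only source of state;
    # reject any other constructor argument with a targeted message.
    if query_compiler is not None:
        for name, value in (("data", data), ("index", index), ("columns", columns)):
            if value is not None:
                raise ValueError(
                    f"{name} is not supported when query_compiler is provided"
                )

validate_ctor_args(query_compiler=object())  # ok: only query_compiler given
try:
    validate_ctor_args(query_compiler=object(), index=[1, 2])
except ValueError as e:
    print(e)  # index is not supported when query_compiler is provided
```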

```python
if hasattr(data, "name") and data.name is not None:
    # If data is an object that has a name field, use that as the name of the new Series.
    name = data.name
# If any of the values are Snowpark pandas objects, convert them to native pandas objects.
```

Collaborator:

In this case, shouldn't we try to convert the other values to Snowpark pandas objects instead of pulling them to local? Or maybe we should just error out.

Do you have an example of this case?

@sfc-gh-vbudati (Contributor Author):

One example where it's better to convert to pandas is this:

```python
data = {"A": pd.Series([1, 2, 3]), "B": pd.Index([4, 5, 6]), "C": 5}
pd.DataFrame(data)
```

```
   A  B  C
0  1  4  5
1  2  5  5
2  3  6  5
```

5 is put in every single row even though it's a scalar in the dict.

@sfc-gh-vbudati (Contributor Author):

@sfc-gh-yzou I prefer not making the refactor changes in a new PR, since I think this one is very close to merging and it would take a lot more work to separate the index changes out.

@sfc-gh-azhan (Collaborator):

> @sfc-gh-yzou I prefer not making the refactor changes in a new PR since I think this one is very close to merging and it will take a lot more work to separate the index changes from this

I kind of agree with @sfc-gh-yun that this PR is becoming too big. Can we use this as the PoC draft PR and review smaller PRs one by one? You can either start with the refactoring pieces first or fix the lazy index first. Try to make sure the refactoring PR only does refactoring, with no test changes.

@sfc-gh-vbudati (Contributor Author):

@sfc-gh-azhan @sfc-gh-yzou I can try to separate this PR into two other PRs - one for the lazy index change and the other for the refactor. It is impossible to avoid test changes in the refactor PR since I introduced functionality to allow passing non-existent columns or index values to the constructor. The constructors should be able to handle any kind of inputs and I added tests for this.

However, that requires me to make a non-trivial amount of redundant code changes, for example, the same set of tests are changed in both PRs where the query count will likely be different due to the refactor. I was hoping to work on IR tickets from Monday, so I still prefer merging this PR as is, please let me know if you both feel strongly about this.

In the future, I'd really appreciate it if feedback about splitting PRs were brought up earlier.

```python
# STEP 2: If columns are provided, set the columns if data is lazy.
# STEP 3: If both the data and index are local (or index is None), create a query compiler from pandas.
# STEP 4: Otherwise, set the index through set_index or reindex.
# STEP 5: The resultant query_compiler is then set as the query_compiler for the DataFrame.
```

Collaborator:

I realize dtype is not always handled in this new code. Can you add it?
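For reference, native pandas applies dtype at construction time, which the lazy-index path would need to mirror (plain pandas shown here to illustrate the expected behavior, not the Snowpark pandas implementation):

```python
import pandas as pd

# dtype passed to the constructor determines the resulting column/series dtype.
s = pd.Series([1, 2, 3], dtype="float64")
df = pd.DataFrame({"a": [1, 2]}, dtype="int16")

assert str(s.dtype) == "float64"
assert str(df["a"].dtype) == "int16"
```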

@sfc-gh-azhan (Collaborator):

> @sfc-gh-azhan @sfc-gh-yzou I can try to separate this PR into two other PRs - one for the lazy index change and the other for the refactor. It is impossible to avoid test changes in the refactor PR since I introduced functionality to allow passing non-existent columns or index values to the constructor. The constructors should be able to handle any kind of inputs and I added tests for this.
>
> However, that requires me to make a non-trivial amount of redundant code changes, for example, the same set of tests are changed in both PRs where the query count will likely be different due to the refactor. I was hoping to work on IR tickets from Monday, so I still prefer merging this PR as is, please let me know if you both feel strongly about this.
>
> In the future, I'd really appreciate if the feedback about splitting PRs is brought up earlier.

Then, I'm okay to keep working on this PR.

```diff
 @pytest.mark.parametrize("limit", [None, 1, 2, 100])
 @pytest.mark.parametrize("method", ["bfill", "backfill", "pad", "ffill"])
 def test_reindex_index_datetime_with_fill(limit, method):
     date_index = native_pd.date_range("1/1/2010", periods=6, freq="D")
     native_series = native_pd.Series(
-        {"prices": [100, 101, np.nan, 100, 89, 88]}, index=date_index
+        {"1/1/2020": [100, 101, np.nan, 100, 89, 88]}, index=date_index
```

Collaborator:

Let's keep the tests with xfail and add a TODO.

@sfc-gh-vbudati (Contributor Author):

done!

@sfc-gh-azhan (Collaborator) left a comment:

Thanks for your help!

@@ -1995,3 +1996,68 @@ def create_frame_with_data_columns(

```python
def rindex(lst: list, value: int) -> int:
    """Find the last index of value in lst."""
    return len(lst) - lst[::-1].index(value) - 1
```
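A quick usage check of the rindex helper above, repeated here so the example is self-contained:

```python
def rindex(lst: list, value: int) -> int:
    """Find the last index of value in lst."""
    return len(lst) - lst[::-1].index(value) - 1

# The value 1 appears at indices 0 and 2; rindex returns the last one.
assert rindex([1, 2, 1, 3], 1) == 2
assert rindex([5], 5) == 0
```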


```python
def convert_index_to_qc(index: Any) -> Any:
```

Contributor:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from snowflake.snowpark.modin.plugin.compiler.snowflake_query_compiler import SnowflakeQueryCompiler
```

Have you tried this?

Other review threads (resolved):
  • src/snowflake/snowpark/modin/plugin/_internal/utils.py
  • tests/integ/modin/frame/test_dtypes.py
  • tests/integ/modin/frame/test_loc.py
  • tests/integ/modin/series/test_rank.py
```diff
@@ -11,6 +11,8 @@
 - Added support for using Snowflake Interval constants with `Window.range_between()` when the order by column is TIMESTAMP or DATE type.
 - Added support for file writes. This feature is currently in private preview.
 - Added support for `DataFrameGroupBy.fillna` and `SeriesGroupBy.fillna`.
+- Added support for constructing `Series` and `DataFrame` objects with the lazy `Index` object as `data`, `index`, and `columns` arguments.
+- Added support for constructing `Series` and `DataFrame` objects with `index` and `column` values not present in `DataFrame`/`Series` `data`.
```

@sfc-gh-vbudati (Contributor Author):

I added this new line as well

@sfc-gh-joshi (Contributor) left a comment:

LGTM! Thanks for the work.

@sfc-gh-vbudati sfc-gh-vbudati merged commit 6ddffdf into main Sep 25, 2024
35 checks passed
@sfc-gh-vbudati sfc-gh-vbudati deleted the vbudati/SNOW-1458135-df-series-init-with-lazy-index branch September 25, 2024 23:29
@github-actions github-actions bot locked and limited conversation to collaborators Sep 25, 2024
Labels: NO-PANDAS-CHANGEDOC-UPDATES (This PR does not update Snowpark pandas docs), snowpark-pandas