
SNOW-1458135 Implement DataFrame and Series initialization with lazy Index objects #2137

Merged
68 commits
2094c3f
Update Series and DataFrame constructors to handle lazy Index objects…
sfc-gh-vbudati Aug 21, 2024
97a7229
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Aug 21, 2024
1979257
update changelog
sfc-gh-vbudati Aug 21, 2024
5dbb76d
add more tests
sfc-gh-vbudati Aug 21, 2024
7de467f
fix minor bug
sfc-gh-vbudati Aug 21, 2024
5dd06fd
fix isocalendar docstring error
sfc-gh-vbudati Aug 21, 2024
8b94462
truncate tests, update changelog wording, reduce 2 queries to one que…
sfc-gh-vbudati Aug 22, 2024
c89dc5d
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Aug 22, 2024
a2089b8
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Aug 22, 2024
a9376c1
Get rid of the join performed when only index is an Index object and …
sfc-gh-vbudati Aug 22, 2024
420a5ac
Add back the index join query to DataFrame/Series constructor, update…
sfc-gh-vbudati Aug 22, 2024
66d634c
Update tests
sfc-gh-vbudati Aug 23, 2024
f277041
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Aug 23, 2024
6a2cb79
added edge case logic, fix test query count
sfc-gh-vbudati Aug 23, 2024
df96f4a
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Aug 23, 2024
f971b0d
more test fixes
sfc-gh-vbudati Aug 23, 2024
13db956
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Aug 23, 2024
8c78f8d
fix dict case
sfc-gh-vbudati Aug 23, 2024
7970101
more test case fixes
sfc-gh-vbudati Aug 24, 2024
f3de1c3
correct the logic for series created with dict and index
sfc-gh-vbudati Aug 26, 2024
2447022
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Aug 26, 2024
82728bf
fix query counts
sfc-gh-vbudati Aug 26, 2024
23587a4
Merge branch 'vbudati/SNOW-1458135-df-series-init-with-lazy-index' of…
sfc-gh-vbudati Aug 26, 2024
1577ddc
fix join count
sfc-gh-vbudati Aug 26, 2024
8903f60
refactor series and df
sfc-gh-vbudati Sep 4, 2024
668c889
merge main into current branch
sfc-gh-vbudati Sep 4, 2024
67a07c1
refactor dataframe and series constructors
sfc-gh-vbudati Sep 4, 2024
a4351ba
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Sep 4, 2024
1453680
fix docstring tests
sfc-gh-vbudati Sep 5, 2024
f39e751
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Sep 5, 2024
b73f027
fix some tests
sfc-gh-vbudati Sep 6, 2024
c6fc05d
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Sep 6, 2024
d422f86
replace series constructor
sfc-gh-vbudati Sep 6, 2024
1ea5d00
fix tests
sfc-gh-vbudati Sep 9, 2024
024acd8
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Sep 9, 2024
f4a80f3
fix loc and iloc tests
sfc-gh-vbudati Sep 9, 2024
ce1ffa6
fix test
sfc-gh-vbudati Sep 9, 2024
00d2a8b
fix test
sfc-gh-vbudati Sep 9, 2024
cb91849
fix last valid index error
sfc-gh-vbudati Sep 9, 2024
d9fdbb0
remove stuff unnecessarily commented out
sfc-gh-vbudati Sep 9, 2024
3d5b785
explain high query count
sfc-gh-vbudati Sep 9, 2024
7f9dbaa
rewrite binary op test, fix coverage
sfc-gh-vbudati Sep 9, 2024
6de9f49
fix tests
sfc-gh-vbudati Sep 11, 2024
c2fb474
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Sep 11, 2024
2274d1e
remove print statements and unnecessary comments
sfc-gh-vbudati Sep 11, 2024
9eef8d7
fix tests
sfc-gh-vbudati Sep 11, 2024
cc09403
increase coverage
sfc-gh-vbudati Sep 12, 2024
10c3954
try to move out common logic, add more tests
sfc-gh-vbudati Sep 14, 2024
64dda24
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Sep 14, 2024
da56734
update df init
sfc-gh-vbudati Sep 14, 2024
8b47e17
moved common logic out, fixed some tests
sfc-gh-vbudati Sep 16, 2024
fa4eb09
remove unnecessary diffs
sfc-gh-vbudati Sep 16, 2024
db28630
fix doctest and couple of tests
sfc-gh-vbudati Sep 17, 2024
17be4c3
apply feedback to simplify logic
sfc-gh-vbudati Sep 18, 2024
301f47f
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Sep 18, 2024
2eb14a7
update query counts to use constants
sfc-gh-vbudati Sep 18, 2024
95065f7
Merge branch 'vbudati/SNOW-1458135-df-series-init-with-lazy-index' of…
sfc-gh-vbudati Sep 18, 2024
d9bbd9b
remove docstring update, add docstrings for helper functions
sfc-gh-vbudati Sep 18, 2024
f40c5b4
try to break down df init into three steps: data, columns, and index
sfc-gh-vbudati Sep 20, 2024
8cce409
merge main into current branch
sfc-gh-vbudati Sep 20, 2024
f37b80a
add dtype logic
sfc-gh-vbudati Sep 24, 2024
83de657
merge main into current branch
sfc-gh-vbudati Sep 24, 2024
57bf89c
fix tests
sfc-gh-vbudati Sep 24, 2024
8ed8e5b
reduce git diff, add xfails instead of modifying tests
sfc-gh-vbudati Sep 24, 2024
cea97e0
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Sep 25, 2024
03a68dc
Address feedback
sfc-gh-vbudati Sep 25, 2024
19af4bb
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Sep 25, 2024
1eb17dd
add changelog entry about non-existent columns/index values being sup…
sfc-gh-vbudati Sep 25, 2024
23 changes: 23 additions & 0 deletions src/snowflake/snowpark/modin/plugin/_internal/utils.py
@@ -12,6 +12,7 @@
import numpy as np
import pandas as native_pd
from pandas._typing import Scalar
from pandas.core.dtypes.base import ExtensionDtype
from pandas.core.dtypes.common import is_integer_dtype, is_object_dtype, is_scalar
from pandas.core.dtypes.inference import is_list_like

@@ -1998,6 +1999,28 @@ def rindex(lst: list, value: int) -> int:
return len(lst) - lst[::-1].index(value) - 1
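Only the return line of `rindex` is visible in this hunk; a standalone sketch of the helper, which returns the position of the last occurrence of `value`:

```python
def rindex(lst: list, value: int) -> int:
    # Find value in the reversed list, then translate that position
    # back into an index on the original (unreversed) list.
    return len(lst) - lst[::-1].index(value) - 1

print(rindex([1, 2, 3, 2, 1], 2))  # last occurrence of 2 is at index 3
```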


def error_checking_for_init(
index: Any, dtype: Union[str, np.dtype, ExtensionDtype]
) -> None:
"""
Common error messages for the Series and DataFrame constructors.

Parameters
----------
index: Any
The index to check.
dtype: str, numpy.dtype, or ExtensionDtype
The dtype to check.
"""
from modin.pandas import DataFrame

if isinstance(index, DataFrame): # pandas raises the same error
raise ValueError("Index data must be 1-dimensional")

if dtype == "category":
raise NotImplementedError("pandas type category is not implemented")
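Both branches of `error_checking_for_init` fail fast before any query compiler work happens. A self-contained sketch of the same control flow (`_DataFrameStandIn` is a hypothetical stand-in, since the real check imports `modin.pandas.DataFrame`):

```python
class _DataFrameStandIn:
    """Hypothetical stand-in for modin.pandas.DataFrame."""

def error_checking_for_init(index, dtype) -> None:
    # Mirrors the two checks added in the diff above.
    if isinstance(index, _DataFrameStandIn):  # pandas raises the same error
        raise ValueError("Index data must be 1-dimensional")
    if dtype == "category":
        raise NotImplementedError("pandas type category is not implemented")

# Both error paths raise immediately, with no query issued:
try:
    error_checking_for_init(_DataFrameStandIn(), dtype=None)
except ValueError as err:
    index_error = str(err)

try:
    error_checking_for_init(index=[1, 2, 3], dtype="category")
except NotImplementedError as err:
    dtype_error = str(err)
```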


def convert_index_to_qc(index: Any) -> Any:
Contributor Author:
I'm not sure how to make the return type "SnowflakeQueryCompiler" without causing circular import issues

Collaborator:
using "SnowflakeQueryCompiler" with quotes

Contributor Author:
That does not work either - it still causes the issues

Contributor:

    from typing import TYPE_CHECKING

    if TYPE_CHECKING:
        from snowflake.snowpark.modin.plugin.compiler.snowflake_query_compiler import SnowflakeQueryCompiler

have you tried this?
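The suggested pattern works because the guarded import is only evaluated by static type checkers; at runtime `TYPE_CHECKING` is `False`, so the circular import never executes. Quoting the annotation alone fails only because the name is then never importable for the checker. A minimal sketch (the module name here is hypothetical):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only static type checkers evaluate this block; at runtime
    # TYPE_CHECKING is False, so the circular import never happens.
    from snowflake_query_compiler_module import SnowflakeQueryCompiler  # hypothetical module

def convert_index_to_qc(index) -> "SnowflakeQueryCompiler":
    # Quoting the annotation defers its evaluation, so the name only
    # needs to resolve for the type checker, not at runtime.
    ...
```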

"""
Method to convert an object representing an index into a query compiler for set_index or reindex.
@@ -87,6 +87,7 @@
from snowflake.snowpark.modin.plugin._internal.utils import (
convert_index_to_list_of_qcs,
convert_index_to_qc,
error_checking_for_init,
is_repr_truncated,
)
from snowflake.snowpark.modin.plugin._typing import ListLike
@@ -484,8 +485,7 @@ def __init__(
self._query_compiler = query_compiler
return

if isinstance(index, DataFrame): # pandas raises the same error
raise ValueError("Index data must be 1-dimensional")
error_checking_for_init(index, dtype)

# The logic followed here is:
# 1. Create a query_compiler from the provided data. If columns are provided, add/select the columns.
@@ -500,6 +500,7 @@
# If the data is an Index object, convert it to a DataFrame to make sure that the values are in the
# correct format: the values are a data column, not an index column.
if data.name is None:
# If no name is provided, the default name is 0.
new_name = 0 if columns is None else columns[0]
else:
new_name = data.name
@@ -510,6 +511,7 @@
query_compiler = data._query_compiler.copy()
# We set the column name if it is not in the provided Series `data`.
if data.name is None:
# If no name is provided, the default name is 0.
query_compiler = query_compiler.set_columns(columns or [0])
if columns is not None and data.name not in columns:
# If the columns provided are not in the named Series, pandas clears
@@ -607,9 +609,7 @@
if all(isinstance(v, Index) for v in data):
# Special case: if all the values are Index objects, they are always present in the
# final result with the provided column names. Therefore, rename the columns.
new_qc = new_qc.set_columns(
try_convert_index_to_native(columns)
)
new_qc = new_qc.set_columns(columns)
else:
new_qc = new_qc.reindex(axis=1, labels=columns)
self._query_compiler = new_qc
@@ -618,14 +618,16 @@
# If only some data is a Snowpark pandas object, convert it to pandas objects.
res = []
for v in data:
if isinstance(v, (Index)):
res.append(v.to_pandas())
elif isinstance(v, BasePandasDataset):
if isinstance(v, (Index, BasePandasDataset)):
res.append(v.to_pandas())
# elif is_dict_like(v) or isinstance(v, (native_pd.Series, native_pd.DataFrame, native_pd.Index)):
# res.append(v)
else:
# Need to convert this is a native pandas object since native pandas incorrectly
# tries to perform `get_indexer` on it.
res.append(native_pd.Index(v if is_list_like(v) else [v]))
# # Need to convert this is a native pandas object since native pandas incorrectly
# # tries to perform `get_indexer` on it. Specify dtype=object so that pandas does not
# # cast the data provided. In some cases, None turns to NaN, which is not desired.
# res.append(native_pd.Index(v, dtype=object) if is_list_like(v) else v)
res.append(v)
data = res

query_compiler = from_pandas(
Expand Down Expand Up @@ -662,13 +664,14 @@ def __init__(

# 3. If data is a DataFrame, filter result
# ----------------------------------------
if isinstance(data, DataFrame):
# To select the required index and columns for the resultant DataFrame,
# perform .loc[] on the created query compiler.
index = slice(None) if index is None else index
columns = slice(None) if columns is None else columns
if isinstance(data, DataFrame) and columns is not None:
# To select the columns for the resultant DataFrame, perform .loc[] on the created query compiler.
# This step is performed to ensure that the right columns are picked from the InternalFrame since we
# never explicitly drop the unwanted columns.
query_compiler = (
DataFrame(query_compiler=query_compiler).loc[index, columns]._query_compiler
DataFrame(query_compiler=query_compiler)
.loc[slice(None), columns]
._query_compiler
)

# 4. Setting the query compiler
@@ -1181,6 +1184,9 @@ def insert(
# Dictionary keys are treated as index column and this should be joined with
# index of target dataframe. This behavior is similar to 'value' being DataFrame
# or Series, so we simply create Series from dict data here.
if isinstance(value, set):
raise TypeError(f"'{type(value).__name__}' type is unordered")

if isinstance(value, dict):
value = Series(value, name=column)
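The new `set` guard matches native pandas, which rejects sets as column values because they carry no defined order. A quick illustration against native pandas (behavior as observed in recent pandas releases):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
try:
    # Sets are unordered, so there is no well-defined row assignment.
    df.insert(1, "b", {7, 8, 9})
except TypeError as err:
    message = str(err)  # "'set' type is unordered"
```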

@@ -45,6 +45,7 @@
from snowflake.snowpark.modin.plugin._internal.utils import (
convert_index_to_list_of_qcs,
convert_index_to_qc,
error_checking_for_init,
)
from snowflake.snowpark.modin.plugin._typing import DropKeep, ListLike
from snowflake.snowpark.modin.plugin.utils.error_message import (
@@ -367,8 +368,7 @@ def __init__(
self.name = name
return

if isinstance(index, spd.DataFrame): # pandas raises the same error
raise ValueError("Index data must be 1-dimensional")
error_checking_for_init(index, dtype)

if isinstance(data, spd.DataFrame):
# pandas raises an ambiguous error:
@@ -398,11 +398,12 @@
else:
# CASE IV: Non-Snowpark pandas data
# If the data is not a Snowpark pandas object, convert it to a query compiler.
# The query compiler uses the '__reduced__' name internally as a column name to represent pandas
# Series objects that are not explicitly assigned a name.
# This helps to distinguish between an N-element Series and 1xN DataFrame.
name = name or MODIN_UNNAMED_SERIES_LABEL
if (
isinstance(data, (native_pd.Series, native_pd.Index))
and data.name is not None
):
if hasattr(data, "name") and data.name is not None:
# If data is an object that has a name field, use that as the name of the new Series.
name = data.name
# If any of the values are Snowpark pandas objects, convert them to native pandas objects.
Collaborator:
Under this case, shouldn't we try to convert the other ones to Snowpark pandas objects instead of pulling them to local? Or maybe we should just error out.

Do you have an example of this case?

Contributor Author:
One example where it's better to convert to pandas is this:

    data = {"A": pd.Series([1, 2, 3]), "B": pd.Index([4, 5, 6]), "C": 5}
    pd.DataFrame(data)
    Out[58]: 
       A  B  C
    0  1  4  5
    1  2  5  5
    2  3  6  5

5 is put in every single row even though it's a scalar in the dict
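The broadcast in that example can be reproduced with native pandas alone, which is why pulling the mixed values to local pandas preserves the semantics:

```python
import pandas as pd

# A scalar dict value is broadcast to every row, while the list-like
# values must all share the same length.
df = pd.DataFrame({"A": pd.Series([1, 2, 3]), "B": pd.Index([4, 5, 6]), "C": 5})
print(df["C"].tolist())  # [5, 5, 5]
```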

if not isinstance(
@@ -422,9 +423,9 @@
native_pd.DataFrame(
native_pd.Series(
data=data,
dtype=dtype,
# Handle setting the index, if it is a lazy index, outside this block.
# If the index is a lazy index, handle setting it outside this block.
index=None if isinstance(index, (Index, Series)) else index,
dtype=dtype,
name=name,
copy=copy,
fastpath=fastpath,
11 changes: 6 additions & 5 deletions tests/integ/modin/frame/test_dtypes.py
@@ -473,22 +473,23 @@ def test_empty_index(index, expected_index_dtype):


@pytest.mark.parametrize(
"input_data, type_msg",
"input_data, dtype, type_msg",
[
(native_pd.Categorical([1, 2, 3, 1, 2, 3]), "category"),
(native_pd.Categorical(["a", "b", "c", "a", "b", "c"]), "category"),
(native_pd.Categorical([1, 2, 3, 1, 2, 3]), "category", "category"),
(native_pd.Categorical(["a", "b", "c", "a", "b", "c"]), "category", "category"),
(
native_pd.period_range("2015-02-03 11:22:33.4567", periods=5, freq="s"),
None,
r"period\[s\]",
),
],
)
@sql_count_checker(query_count=0)
def test_unsupported_dtype_raises(input_data, type_msg) -> None:
def test_unsupported_dtype_raises(input_data, dtype, type_msg) -> None:
with pytest.raises(
NotImplementedError, match=f"pandas type {type_msg} is not implemented"
):
pd.Series(input_data)
pd.Series(input_data, dtype=dtype)


@pytest.mark.parametrize(
Expand Down
2 changes: 1 addition & 1 deletion tests/integ/modin/frame/test_idxmax_idxmin.py
@@ -194,7 +194,7 @@ def test_idxmax_idxmin_with_dates(func, axis):
)


@sql_count_checker(query_count=1, join_count=1)
@sql_count_checker(query_count=1)
@pytest.mark.parametrize("func", ["idxmax", "idxmin"])
@pytest.mark.parametrize(
"axis",
6 changes: 2 additions & 4 deletions tests/integ/modin/frame/test_insert.py
@@ -212,7 +212,7 @@ def test_insert_dataframe_shape_negative(native_df):
(np.ones((1, 1)), 1),
([1, 2], 1), # len < number of rows
((6, 7, 8, 9), 1), # len > number of rows
({"a", "b", "c"}, 1), # python set
({"a", "b", "c"}, 0), # python set
],
)
def test_insert_value_negative(native_df, value, expected_query_count):
@@ -725,12 +725,10 @@ def test_insert_multiindex_column_negative(snow_df, columns, insert_label):
[["a", "b", "b", "d", "e"], ["x", "y", "z", "u", "u"], True],
],
)
@sql_count_checker(query_count=1, join_count=3)
@sql_count_checker(query_count=3, join_count=1)
def test_insert_with_unique_and_duplicate_index_values(
index_values, other_index_values, expect_mismatch
):
# Two of the three joins come from creating the DataFrame with non-Snowpark pandas data
# and a Snowpark pandas Index. The third join is from the insert operation.
data = list(range(5))
data1 = {"foo": data}
data2 = {"bar": [val * 10 for val in data]}
16 changes: 8 additions & 8 deletions tests/integ/modin/frame/test_loc.py
@@ -3945,12 +3945,12 @@ def test_raise_set_cell_with_list_like_value_error():
reason="SNOW-1652608 result series name incorrectly set"
),
), # 1 join from df creation, 1 join from squeeze, 2 joins from to_pandas during eval
(["1 day", "3 days"], 1, 2),
([True, False, False], 1, 2),
(slice(None, "4 days"), 1, 1),
(slice(None, "4 days", 2), 1, 1),
(slice("1 day", "2 days"), 1, 1),
(slice("1 day 1 hour", "2 days 2 hours", -1), 1, 1),
(["1 day", "3 days"], 1, 1),
([True, False, False], 1, 1),
(slice(None, "4 days"), 1, 0),
(slice(None, "4 days", 2), 1, 0),
(slice("1 day", "2 days"), 1, 0),
(slice("1 day 1 hour", "2 days 2 hours", -1), 1, 0),
],
)
def test_df_loc_get_with_timedelta(key, query_count, join_count):
@@ -4017,7 +4017,7 @@
),
],
)
@sql_count_checker(query_count=1, join_count=1)
@sql_count_checker(query_count=2)
def test_df_loc_get_with_timedelta_behavior_difference(key, expected_result):
# In these test cases, native pandas raises a KeyError but Snowpark pandas works correctly.
data = {
Expand All @@ -4037,7 +4037,7 @@ def test_df_loc_get_with_timedelta_behavior_difference(key, expected_result):
assert_frame_equal(actual_result, expected_result)


@sql_count_checker(query_count=2, join_count=2)
@sql_count_checker(query_count=3, join_count=1)
def test_df_loc_get_with_timedeltaindex_key():
data = {
"A": [1, 2, 3],
7 changes: 3 additions & 4 deletions tests/integ/modin/frame/test_mask.py
@@ -683,7 +683,7 @@ def test_dataframe_mask_with_duplicated_index_aligned(cond_frame, other):
native_other = other
snow_other = other

expected_join_count = 2 if isinstance(other, int) else 3
expected_join_count = 1 if isinstance(other, int) else 2
with SqlCounter(query_count=1, join_count=expected_join_count):
eval_snowpark_pandas_result(
snow_df,
@@ -694,9 +694,8 @@
)


# Three extra joins when creating the 3 snowpark pandas dataframes with non-Snowpark pandas
# data and Snowpark pandas Index.
@sql_count_checker(query_count=1, join_count=5)
# Three extra queries to convert to native index for dataframe constructor when creating the 3 snowpark pandas dataframes
@sql_count_checker(query_count=4, join_count=2)
def test_dataframe_mask_with_duplicated_index_unaligned():
data = [3, 4, 5, 2]
df_index = pd.Index([2, 1, 2, 3], name="index")