SNOW-1458135 Implement DataFrame and Series initialization with lazy Index objects #2137

Merged

Changes from 65 commits

Commits (68)
2094c3f
Update Series and DataFrame constructors to handle lazy Index objects…
sfc-gh-vbudati Aug 21, 2024
97a7229
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Aug 21, 2024
1979257
update changelog
sfc-gh-vbudati Aug 21, 2024
5dbb76d
add more tests
sfc-gh-vbudati Aug 21, 2024
7de467f
fix minor bug
sfc-gh-vbudati Aug 21, 2024
5dd06fd
fix isocalendar docstring error
sfc-gh-vbudati Aug 21, 2024
8b94462
truncate tests, update changelog wording, reduce 2 queries to one que…
sfc-gh-vbudati Aug 22, 2024
c89dc5d
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Aug 22, 2024
a2089b8
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Aug 22, 2024
a9376c1
Get rid of the join performed when only index is an Index object and …
sfc-gh-vbudati Aug 22, 2024
420a5ac
Add back the index join query to DataFrame/Series constructor, update…
sfc-gh-vbudati Aug 22, 2024
66d634c
Update tests
sfc-gh-vbudati Aug 23, 2024
f277041
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Aug 23, 2024
6a2cb79
added edge case logic, fix test query count
sfc-gh-vbudati Aug 23, 2024
df96f4a
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Aug 23, 2024
f971b0d
more test fixes
sfc-gh-vbudati Aug 23, 2024
13db956
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Aug 23, 2024
8c78f8d
fix dict case
sfc-gh-vbudati Aug 23, 2024
7970101
more test case fixes
sfc-gh-vbudati Aug 24, 2024
f3de1c3
correct the logic for series created with dict and index
sfc-gh-vbudati Aug 26, 2024
2447022
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Aug 26, 2024
82728bf
fix query counts
sfc-gh-vbudati Aug 26, 2024
23587a4
Merge branch 'vbudati/SNOW-1458135-df-series-init-with-lazy-index' of…
sfc-gh-vbudati Aug 26, 2024
1577ddc
fix join count
sfc-gh-vbudati Aug 26, 2024
8903f60
refactor series and df
sfc-gh-vbudati Sep 4, 2024
668c889
merge main into current branch
sfc-gh-vbudati Sep 4, 2024
67a07c1
refactor dataframe and series constructors
sfc-gh-vbudati Sep 4, 2024
a4351ba
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Sep 4, 2024
1453680
fix docstring tests
sfc-gh-vbudati Sep 5, 2024
f39e751
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Sep 5, 2024
b73f027
fix some tests
sfc-gh-vbudati Sep 6, 2024
c6fc05d
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Sep 6, 2024
d422f86
replace series constructor
sfc-gh-vbudati Sep 6, 2024
1ea5d00
fix tests
sfc-gh-vbudati Sep 9, 2024
024acd8
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Sep 9, 2024
f4a80f3
fix loc and iloc tests
sfc-gh-vbudati Sep 9, 2024
ce1ffa6
fix test
sfc-gh-vbudati Sep 9, 2024
00d2a8b
fix test
sfc-gh-vbudati Sep 9, 2024
cb91849
fix last valid index error
sfc-gh-vbudati Sep 9, 2024
d9fdbb0
remove stuff unnecessarily commented out
sfc-gh-vbudati Sep 9, 2024
3d5b785
explain high query count
sfc-gh-vbudati Sep 9, 2024
7f9dbaa
rewrite binary op test, fix coverage
sfc-gh-vbudati Sep 9, 2024
6de9f49
fix tests
sfc-gh-vbudati Sep 11, 2024
c2fb474
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Sep 11, 2024
2274d1e
remove print statements and unnecessary comments
sfc-gh-vbudati Sep 11, 2024
9eef8d7
fix tests
sfc-gh-vbudati Sep 11, 2024
cc09403
increase coverage
sfc-gh-vbudati Sep 12, 2024
10c3954
try to move out common logic, add more tests
sfc-gh-vbudati Sep 14, 2024
64dda24
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Sep 14, 2024
da56734
update df init
sfc-gh-vbudati Sep 14, 2024
8b47e17
moved common logic out, fixed some tests
sfc-gh-vbudati Sep 16, 2024
fa4eb09
remove unnecessary diffs
sfc-gh-vbudati Sep 16, 2024
db28630
fix doctest and couple of tests
sfc-gh-vbudati Sep 17, 2024
17be4c3
apply feedback to simplify logic
sfc-gh-vbudati Sep 18, 2024
301f47f
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Sep 18, 2024
2eb14a7
update query counts to use constants
sfc-gh-vbudati Sep 18, 2024
95065f7
Merge branch 'vbudati/SNOW-1458135-df-series-init-with-lazy-index' of…
sfc-gh-vbudati Sep 18, 2024
d9bbd9b
remove docstring update, add docstrings for helper functions
sfc-gh-vbudati Sep 18, 2024
f40c5b4
try to break down df init into three steps: data, columns, and index
sfc-gh-vbudati Sep 20, 2024
8cce409
merge main into current branch
sfc-gh-vbudati Sep 20, 2024
f37b80a
add dtype logic
sfc-gh-vbudati Sep 24, 2024
83de657
merge main into current branch
sfc-gh-vbudati Sep 24, 2024
57bf89c
fix tests
sfc-gh-vbudati Sep 24, 2024
8ed8e5b
reduce git diff, add xfails instead of modifying tests
sfc-gh-vbudati Sep 24, 2024
cea97e0
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Sep 25, 2024
03a68dc
Address feedback
sfc-gh-vbudati Sep 25, 2024
19af4bb
Merge branch 'main' into vbudati/SNOW-1458135-df-series-init-with-laz…
sfc-gh-vbudati Sep 25, 2024
1eb17dd
add changelog entry about non-existent columns/index values being sup…
sfc-gh-vbudati Sep 25, 2024
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -10,6 +10,7 @@
- `make_interval`
- Added support for using Snowflake Interval constants with `Window.range_between()` when the order by column is TIMESTAMP or DATE type.
- Added support for file writes. This feature is currently in private preview.
- Added support for constructing `Series` and `DataFrame` objects with the lazy `Index` object as `data`, `index`, and `columns` arguments.

#### Improvements

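
As a rough illustration of the new constructor behavior added to the changelog above (a sketch only; the values are not taken from the PR's tests):

import modin.pandas as pd

idx = pd.Index([10, 20, 30])  # lazy, Snowflake-backed Index
ser = pd.Series(idx)  # lazy Index as `data`
ser2 = pd.Series([1, 2, 3], index=idx)  # lazy Index as `index`
df = pd.DataFrame(
    [[1, 2], [3, 4], [5, 6]],
    index=idx,  # lazy Index as `index`
    columns=pd.Index(["a", "b"]),  # lazy Index as `columns`
)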
@@ -122,6 +122,7 @@ def get_valid_index_values(
-------
Optional[Row]: The desired index (a Snowpark Row) if it exists, else None.
"""
frame = frame.ensure_row_position_column()
index_quoted_identifier = frame.index_column_snowflake_quoted_identifiers
data_quoted_identifier = frame.data_column_snowflake_quoted_identifiers
row_position_quoted_identifier = frame.row_position_snowflake_quoted_identifier
158 changes: 157 additions & 1 deletion src/snowflake/snowpark/modin/plugin/_internal/utils.py
@@ -12,8 +12,10 @@
import modin.pandas as pd
import numpy as np
import pandas as native_pd
from pandas._typing import Scalar
from pandas._typing import AnyArrayLike, Scalar
from pandas.core.dtypes.base import ExtensionDtype
from pandas.core.dtypes.common import is_integer_dtype, is_object_dtype, is_scalar
from pandas.core.dtypes.inference import is_list_like

import snowflake.snowpark.modin.plugin._internal.statement_params_constants as STATEMENT_PARAMS
from snowflake.snowpark._internal.analyzer.analyzer_utils import (
@@ -2007,3 +2009,157 @@ def create_frame_with_data_columns(
def rindex(lst: list, value: int) -> int:
    """Find the last index in the list of item value."""
    return len(lst) - lst[::-1].index(value) - 1


def error_checking_for_init(
    index: Any, dtype: Union[str, np.dtype, ExtensionDtype]
) -> None:
    """
    Common error checking for the Series and DataFrame constructors.

    Parameters
    ----------
    index: Any
        The index to check.
    dtype: str, numpy.dtype, or ExtensionDtype
        The dtype to check.
    """
    from modin.pandas import DataFrame

    if isinstance(index, DataFrame):  # pandas raises the same error
        raise ValueError("Index data must be 1-dimensional")

    if dtype == "category":
        raise NotImplementedError("pandas type category is not implemented")
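
For orientation, a hedged sketch of the constructor-level behavior this helper guards, assuming the Series/DataFrame constructors call it as the rest of this PR suggests:

import modin.pandas as pd

try:
    # A DataFrame is not a valid index; mirrors the native pandas error.
    pd.Series([1, 2, 3], index=pd.DataFrame({"a": [1, 2, 3]}))
except ValueError as e:
    print(e)  # Index data must be 1-dimensional

try:
    # Categorical dtypes are not implemented in Snowpark pandas.
    pd.Series([1, 2, 3], dtype="category")
except NotImplementedError as e:
    print(e)  # pandas type category is not implemented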


def assert_fields_are_none(
    class_name: str, data: Any, index: Any, dtype: Any, columns: Any = None
) -> None:
    """Assert that `data`, `index`, `dtype`, and `columns` are not provided when `query_compiler` is given."""
    assert (
        data is None
    ), f"Invalid {class_name} construction! The `data` parameter is not supported when `query_compiler` is given."
    assert (
        index is None
    ), f"Invalid {class_name} construction! The `index` parameter is not supported when `query_compiler` is given."
    assert (
        dtype is None
    ), f"Invalid {class_name} construction! The `dtype` parameter is not supported when `query_compiler` is given."
    assert (
        columns is None
    ), f"Invalid {class_name} construction! The `columns` parameter is not supported when `query_compiler` is given."


def convert_index_to_qc(index: Any) -> Any:
Contributor Author

I'm not sure how to make the return type "SnowflakeQueryCompiler" without causing circular import issues.

Collaborator

using "SnowflakeQueryCompiler" with quotes

Contributor Author

That does not work either - it still causes the issues.

Contributor

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from snowflake.snowpark.modin.plugin.compiler.snowflake_query_compiler import SnowflakeQueryCompiler

have you tried this?

"""
Method to convert an object representing an index into a query compiler for set_index or reindex.

Parameters
----------
index: Any
The object to convert to a query compiler.

Returns
-------
SnowflakeQueryCompiler
The converted query compiler.
"""
from modin.pandas import Series

from snowflake.snowpark.modin.plugin.extensions.index import Index

if isinstance(index, Index):
idx_qc = index.to_series()._query_compiler
elif isinstance(index, Series):
# The name of the index comes from the Series' name, not the index name. `reindex` does not handle this,
# so we need to set the name of the index to the name of the Series.
index.index.name = index.name
idx_qc = index._query_compiler
else:
idx_qc = Series(index)._query_compiler
return idx_qc
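
A quick sketch of the three branches above (inputs are illustrative; the import path follows the file this diff modifies):

import modin.pandas as pd
import numpy as np

from snowflake.snowpark.modin.plugin._internal.utils import convert_index_to_qc

qc1 = convert_index_to_qc(pd.Index([1, 2, 3]))  # lazy Index -> query compiler of its Series form
qc2 = convert_index_to_qc(pd.Series([1, 2, 3], name="x"))  # Series -> index name is set to the Series name first
qc3 = convert_index_to_qc(np.array([4, 5, 6]))  # anything else is wrapped in a Series first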


def convert_index_to_list_of_qcs(index: Any) -> list:
    """
    Method to convert an object representing an index into a list of query compilers for set_index.

    Parameters
    ----------
    index: Any
        The object to convert to a list of query compilers.

    Returns
    -------
    list
        The list of query compilers.
    """
    from modin.pandas import Series

    from snowflake.snowpark.modin.plugin.extensions.index import Index

    if (
        not isinstance(index, (native_pd.MultiIndex, Series, Index))
        and is_list_like(index)
        and len(index) > 0
        and all((is_list_like(i) and not isinstance(i, tuple)) for i in index)
    ):
        # If given a list of lists, convert it to a MultiIndex.
        index = native_pd.MultiIndex.from_arrays(index)
    if isinstance(index, native_pd.MultiIndex):
        index_qc_list = [
            s._query_compiler
            for s in [
                Series(index.get_level_values(level)) for level in range(index.nlevels)
            ]
        ]
    else:
        index_qc_list = [convert_index_to_qc(index)]
    return index_qc_list
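
A hedged sketch of how the list-of-lists branch differs from a flat list (values illustrative):

from snowflake.snowpark.modin.plugin._internal.utils import convert_index_to_list_of_qcs

# A list of lists becomes a MultiIndex: one query compiler per level.
qcs = convert_index_to_list_of_qcs([["a", "a", "b"], [1, 2, 3]])
assert len(qcs) == 2

# A flat list yields a single query compiler.
assert len(convert_index_to_list_of_qcs([10, 20, 30])) == 1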


def add_extra_columns_and_select_required_columns(
    query_compiler: Any,
    columns: Union[AnyArrayLike, list],
) -> Any:
"""
Method to add extra columns to and select the required columns from the provided query compiler.
This is used in DataFrame construction in the following cases:
- general case when data is a DataFrame
- data is a named Series, and this name is in `columns`

Parameters
----------
query_compiler: Any
The query compiler to select columns from, i.e., data's query compiler.
columns: AnyArrayLike or list
The columns to select from the query compiler.
"""
from modin.pandas import DataFrame

data_columns = query_compiler.get_columns().to_list()
# The `columns` parameter is used to select the columns from `data` that will be in the resultant DataFrame.
# If a value in `columns` is not present in data's columns, it will be added as a new column filled with NaN values.
# These columns are tracked by the `extra_columns` variable.
if data_columns is not None and columns is not None:
extra_columns = [col for col in columns if col not in data_columns]
if extra_columns is not []:
# To add these new columns to the DataFrame, perform `__getitem__` only with the extra columns
            # and set them to None.
            extra_columns_df = DataFrame(query_compiler=query_compiler)
            # In the case that the columns are MultiIndex but not all extra columns are tuples, we need to flatten the
            # columns to ensure that the columns are a single-level index. If not, `__getitem__` will raise an error
            # when trying to add new columns that are not in the expected tuple format.
            if not all(isinstance(col, tuple) for col in extra_columns) and isinstance(
                query_compiler.get_columns(), native_pd.MultiIndex
            ):
                flattened_columns = extra_columns_df.columns.to_flat_index()
                extra_columns_df.columns = flattened_columns
            extra_columns_df[extra_columns] = None
            query_compiler = extra_columns_df._query_compiler

    # To select the columns for the resultant DataFrame, perform `__getitem__` on the created query compiler.
    # This step is performed to ensure that the right columns are picked from the InternalFrame since we
    # never explicitly drop the unwanted columns. `__getitem__` also ensures that the columns in the resultant
    # DataFrame are in the same order as the columns in the `columns` parameter.
    return DataFrame(query_compiler=query_compiler)[columns]._query_compiler
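
At the API level, the intended effect is ordinary pandas column selection; a hedged example with illustrative column names:

import modin.pandas as pd

data = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
# "c" is not in `data`, so it becomes a new NaN-filled column; column order follows `columns`.
df = pd.DataFrame(data, columns=["b", "c"])
#    b    c
# 0  3  NaN
# 1  4  NaN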
@@ -2336,7 +2336,7 @@ def any(
def reindex(
self,
axis: int,
labels: Union[pandas.Index, "pd.Index", list[Any]],
labels: Union[pandas.Index, "pd.Index", list[Any], "SnowflakeQueryCompiler"],
**kwargs: dict[str, Any],
) -> "SnowflakeQueryCompiler":
"""
@@ -2346,7 +2346,7 @@
----------
axis : {0, 1}
Axis to align labels along. 0 is for index, 1 is for columns.
labels : list-like
labels : list-like, SnowflakeQueryCompiler
Index-labels to align with.
method : {None, "backfill"/"bfill", "pad"/"ffill", "nearest"}
Method to use for filling holes in reindexed frame.
@@ -2544,15 +2544,15 @@ def _add_columns_for_monotonicity_checks(

def _reindex_axis_0(
self,
labels: Union[pandas.Index, "pd.Index", list[Any]],
labels: Union[pandas.Index, "pd.Index", list[Any], "SnowflakeQueryCompiler"],
**kwargs: dict[str, Any],
) -> "SnowflakeQueryCompiler":
"""
Align QueryCompiler data with a new index.

Parameters
----------
labels : list-like
labels : list-like, SnowflakeQueryCompiler
Index-labels to align with.
method : {None, "backfill"/"bfill", "pad"/"ffill", "nearest"}
Method to use for filling holes in reindexed frame.
@@ -2570,12 +2570,15 @@
"""
self._raise_not_implemented_error_for_timedelta()

if isinstance(labels, native_pd.Index):
labels = pd.Index(labels)
if isinstance(labels, pd.Index):
new_index_qc = labels.to_series()._query_compiler
if isinstance(labels, SnowflakeQueryCompiler):
new_index_qc = labels
else:
new_index_qc = pd.Series(labels)._query_compiler
if isinstance(labels, native_pd.Index):
labels = pd.Index(labels)
if isinstance(labels, pd.Index):
new_index_qc = labels.to_series()._query_compiler
else:
new_index_qc = pd.Series(labels)._query_compiler

new_index_modin_frame = new_index_qc._modin_frame
modin_frame = self._modin_frame
@@ -6234,7 +6237,7 @@ def insert(
# 'loc'
def move_last_element(arr: list, index: int) -> None:
if replace:
# swap element at loc with new colun at end, then drop last element
# swap element at loc with new column at end, then drop last element
arr[index], arr[-1] = arr[-1], arr[index]
arr.pop()
else:
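
For reference, a hedged sketch of reindex at the API level; the `_reindex_axis_0` change above additionally lets the constructor pass a SnowflakeQueryCompiler straight through as `labels`:

import modin.pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])
# Reindexing against a lazy Index; "w" has no match and is filled with NaN.
reindexed = df.reindex(pd.Index(["y", "z", "w"]))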
15 changes: 6 additions & 9 deletions src/snowflake/snowpark/modin/plugin/docstrings/dataframe.py
@@ -82,16 +82,13 @@ class DataFrame(BasePandasDataset):
Notes
-----
``DataFrame`` can be created either from passed `data` or `query_compiler`. If both
parameters are provided, data source will be prioritized in the next order:
parameters are provided, an assertion error will be raised. `query_compiler` can only
be specified when the `data`, `index`, and `columns` are None.

1) Modin ``DataFrame`` or ``Series`` passed with `data` parameter.
2) Query compiler from the `query_compiler` parameter.
3) Various pandas/NumPy/Python data structures passed with `data` parameter.

The last option is less desirable since import of such data structures is very
inefficient, please use previously created Modin structures from the fist two
options or import data using highly efficient Modin IO tools (for example
``pd.read_csv``).
Using pandas/NumPy/Python data structures as the `data` parameter is less desirable since
importing such data structures is very inefficient.
Please use previously created Modin structures or import data using highly efficient Modin IO
tools (for example ``pd.read_csv``).

Examples
--------
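
A minimal sketch of the documented constraint, assuming the assertion helpers added in this PR are wired into the constructors:

import modin.pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Allowed: construct purely from an existing query compiler.
same_df = pd.DataFrame(query_compiler=df._query_compiler)

# Not allowed: combining `query_compiler` with `data`, `index`, or `columns` fails the assertion.
# pd.DataFrame(data=[1, 2], query_compiler=df._query_compiler)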