SNOW-1458135 Implement DataFrame and Series initialization with lazy Index objects #2137
Conversation
…, add tests for the same
…y-index # Conflicts: # src/snowflake/snowpark/modin/plugin/compiler/snowflake_query_compiler.py
…data is not a Snowpark pandas object
… the constructor tests, rewrite concat tests
All of the join counts in the tests have increased because DataFrame/Series creation with a non-Snowpark pandas object as the data or index now performs a join. In some cases the join count is a lot higher in tests, but this is because of the way they are written - some tests call the constructor several times.
Thanks for doing this! It's a lot of work, btw.
Please also check:
- If you identify test code that can be improved, please add a TODO and track it with a Jira ticket.
- Please run a Jenkins job to see if anything is wrong there before merging.
pytest.param(
    "series",
    marks=pytest.mark.xfail(
        reason="SNOW-1675191 reindex does not work with tuple series"
    ),
)
@sfc-gh-vbudati @sfc-gh-azhan mentioned that the main purpose of this PR is to remove a to_pandas materialization. Can we just do that in this PR, and move the other refactoring parts out of the current PR?
name = data.name
from snowflake.snowpark.modin.plugin.extensions.index import Index

# Setting the query compiler
One more general comment here about the change: our original code behaves in such a way that if both data and a query compiler are provided, the data is used.
However, here it seems we want to change it so that only one of them can be configured. I think that is fine; however, please make sure we update the doc to make this part clear.
Here are a couple of points:
- From a structural point of view, I think we can do the parameter check first, for example, where both query_compiler and another parameter are provided. Then initialize the query_compiler like the original code structure, unless there are cases that work very differently.
- The check message doesn't seem very clear. For example, "query_compiler and index can not be provided together" might be better as "index is not supported when query_compiler is provided", etc.
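The suggested parameter check and message style could look like this - a hypothetical sketch (the function name and exact messages are illustrative, not the PR's actual code):

```python
def validate_constructor_args(data=None, index=None, columns=None, query_compiler=None):
    """Reject combinations where query_compiler is mixed with other arguments."""
    if query_compiler is not None:
        for name, value in (("data", data), ("index", index), ("columns", columns)):
            if value is not None:
                raise ValueError(
                    f"{name} is not supported when query_compiler is provided"
                )
```

Running the check up front keeps the rest of the constructor free to assume exactly one initialization path.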
I can make the error messages clearer like you pointed out in (2.) --> "index is not supported when query_compiler is provided". But the parameters are right now checked before they are used. I don't think there are any cases in the code where both query compiler and data/index/columns are provided (no tests have failed so far with anything related to this). I think it's also simpler behavior to have it this way.
The doc should also be updated with this behavior.
if hasattr(data, "name") and data.name is not None:
    # If data is an object that has a name field, use that as the name of the new Series.
    name = data.name
# If any of the values are Snowpark pandas objects, convert them to native pandas objects.
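The name-propagation behavior in the snippet above matches what native pandas does; a quick native-pandas illustration:

```python
import pandas as pd

# A Series (or any object carrying a `name` attribute) passed as `data`
# hands its name to the newly constructed Series.
named = pd.Series([1, 2, 3], name="scores")
result = pd.Series(named)
```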
In this case, shouldn't we try to convert the other values to Snowpark pandas objects instead of pulling them to local? Or maybe we should just raise an error.
Do you have an example of this case?
One example where it's better to convert it to native pandas is this:

data = {"A": pd.Series([1, 2, 3]), "B": pd.Index([4, 5, 6]), "C": 5}
pd.DataFrame(data)
Out[58]:
   A  B  C
0  1  4  5
1  2  5  5
2  3  6  5

5 is put in every single row even though it's a scalar in the dict.
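A minimal sketch of the "convert to native pandas" approach under discussion - `materialize_values` and the `to_pandas()` duck-typing are illustrative assumptions, not the PR's actual helper:

```python
import pandas as pd

def materialize_values(data: dict) -> dict:
    # Pull any lazy values that expose to_pandas() back to native pandas so the
    # native DataFrame constructor can align them and broadcast scalars like 5.
    return {k: v.to_pandas() if hasattr(v, "to_pandas") else v for k, v in data.items()}
```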
@sfc-gh-yzou I prefer not to make the refactor changes in a new PR since I think this one is very close to merging, and it would take a lot more work to separate the index changes from it.
I kind of agree with @sfc-gh-yun that this PR is becoming too big. Can we use this as the PoC draft PR, and review smaller PRs one by one? You can either start with the refactoring pieces first or fix the lazy index first. Try to make sure the refactoring PR only does refactoring and has no test changes.
@sfc-gh-azhan @sfc-gh-yzou I can try to separate this PR into two other PRs - one for the lazy index change and the other for the refactor. It is impossible to avoid test changes in the refactor PR since I introduced functionality to allow passing non-existent columns or index values to the constructor. The constructors should be able to handle any kind of input, and I added tests for this. However, that requires me to make a non-trivial amount of redundant code changes; for example, the same set of tests is changed in both PRs, where the query count will likely differ due to the refactor. I was hoping to work on IR tickets from Monday, so I still prefer merging this PR as is - please let me know if you both feel strongly about this. In the future, I'd really appreciate it if feedback about splitting PRs were brought up earlier.
# STEP 2: If columns are provided, set the columns if data is lazy.
# STEP 3: If both the data and index are local (or index is None), create a query compiler from pandas.
# STEP 4: Otherwise, set the index through set_index or reindex.
# STEP 5: The resultant query_compiler is then set as the query_compiler for the DataFrame.
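A pure native-pandas sketch of the step sequence above (names are illustrative; the real code builds a Snowflake query compiler rather than a native frame):

```python
import pandas as pd

def init_frame(data, index=None, columns=None):
    # STEP 3 analogue: build the frame from local data.
    df = pd.DataFrame(data)
    # STEP 2 analogue: apply the provided columns, adding NaN columns as needed.
    if columns is not None:
        df = df.reindex(columns=columns)
    # STEP 4 analogue: align rows to the provided index via reindex.
    if index is not None:
        df = df.reindex(index=index)
    return df
```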
I realize `dtype` is not always handled in this new code. Can you add it?
Then I'm okay with continuing to work on this PR.
@pytest.mark.parametrize("limit", [None, 1, 2, 100])
@pytest.mark.parametrize("method", ["bfill", "backfill", "pad", "ffill"])
def test_reindex_index_datetime_with_fill(limit, method):
    date_index = native_pd.date_range("1/1/2010", periods=6, freq="D")
    native_series = native_pd.Series(
-        {"prices": [100, 101, np.nan, 100, 89, 88]}, index=date_index
+        {"1/1/2020": [100, 101, np.nan, 100, 89, 88]}, index=date_index
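For context, a native-pandas sketch of what this test exercises - reindexing onto a denser index with a fill method and a limit (integer labels are used here instead of dates for brevity):

```python
import numpy as np
import pandas as native_pd

# Sparse series on even labels, including a NaN observation at label 4.
ser = native_pd.Series(
    [100.0, 101.0, np.nan, 100.0, 89.0, 88.0], index=[0, 2, 4, 6, 8, 10]
)
# Reindex onto every integer label, padding forward at most one step.
filled = ser.reindex(range(11), method="ffill", limit=1)
```

Note that reindex's pad/ffill fills from the previous existing label, so a NaN observation is propagated as-is rather than skipped.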
Let's keep the tests with xfail and add a TODO.
done!
Thanks for your help!
@@ -1995,3 +1996,68 @@ def create_frame_with_data_columns(
 def rindex(lst: list, value: int) -> int:
     """Find the last index in the list of item value."""
     return len(lst) - lst[::-1].index(value) - 1


+def convert_index_to_qc(index: Any) -> Any:
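For reference, `rindex` from the hunk above behaves like `str.rindex` but for lists; a standalone copy with a usage note:

```python
def rindex(lst: list, value: int) -> int:
    """Find the last index of `value` in `lst`."""
    return len(lst) - lst[::-1].index(value) - 1

# rindex([1, 2, 3, 2], 2) -> 3 (index of the last occurrence of 2)
```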
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from snowflake.snowpark.modin.plugin.compiler.snowflake_query_compiler import SnowflakeQueryCompiler

Have you tried this?
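The TYPE_CHECKING pattern suggested above works because `typing.TYPE_CHECKING` is `False` at runtime, so the import only runs under static type checkers and cannot create a circular import. A self-contained sketch, using `Decimal` as a stand-in for `SnowflakeQueryCompiler`:

```python
from __future__ import annotations  # annotations stay unevaluated strings at runtime

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only type checkers execute this import; at runtime TYPE_CHECKING is False.
    from decimal import Decimal

def to_float(value: Decimal) -> float:
    # The annotation is never evaluated at runtime, so Decimal needn't be imported.
    return float(value)
```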
…y-index # Conflicts: # CHANGELOG.md
@@ -11,6 +11,8 @@
 - Added support for using Snowflake Interval constants with `Window.range_between()` when the order by column is TIMESTAMP or DATE type.
 - Added support for file writes. This feature is currently in private preview.
 - Added support for `DataFrameGroupBy.fillna` and `SeriesGroupBy.fillna`.
+- Added support for constructing `Series` and `DataFrame` objects with the lazy `Index` object as `data`, `index`, and `columns` arguments.
+- Added support for constructing `Series` and `DataFrame` objects with `index` and `column` values not present in `DataFrame`/`Series` `data`.
I added this new line as well
LGTM! Thanks for the work.
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-1458135
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
- `Series` and `DataFrame` objects can now be constructed with the lazy `Index` object as the `data`, `index`, and/or `columns` arguments.
- If `data` is a Series or DataFrame object, the new Series or DataFrame object is created by filtering the `data` with the provided `index` and `columns`.
- If values in `index` don't exist in `data`'s index, these values are added as new rows and their corresponding data values are `NaN`.
- If values in `columns` don't exist in `data`'s columns, these values are added as new `NaN` columns.
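The new-row/new-column `NaN` behavior described above mirrors what native pandas does when a DataFrame is constructed from another frame with unseen index/column values; a native-pandas illustration:

```python
import pandas as pd

src = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
# Index value 5 and column "c" do not exist in `src`:
# the new row and the new column are filled with NaN.
df = pd.DataFrame(src, index=[0, 5], columns=["a", "c"])
```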