feat(window): window functions SQL #4227

Open · wants to merge 145 commits into main from f4t4nt/window-sql

Conversation


@f4t4nt (Contributor) commented Apr 21, 2025

Adds SQL support for window functions.
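
For context, here's a minimal sketch of the kind of query this change enables, assuming the `daft.sql` entry point and the `ROW_NUMBER() OVER (...)` syntax exercised by the tests in this PR (column names are illustrative):

```python
import daft

df = daft.from_pydict(
    {
        "category": ["a", "a", "b", "b"],
        "value": [3, 1, 4, 2],
    }
)

# Number rows within each category, ordered by value.
sql_result = daft.sql(
    """
    SELECT
        category,
        value,
        ROW_NUMBER() OVER (PARTITION BY category ORDER BY value) AS row_num
    FROM df
    """
).collect()

print(sql_result)
```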

f4t4nt added 30 commits March 26, 2025 18:25
f4t4nt and others added 18 commits April 22, 2025 11:18
## Changes Made

- Fix formatting of the UDF docstring (resource requests and `>>>` examples)
- Remove `TableSource` from the Catalogs doc
- Update headings in the Iceberg doc

## Related Issues

n/a

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)
…ctor pool UDF (#4222)

## Changes Made

There is currently an issue where returning an extension type from an actor
pool UDF raises a cannot-pickle error, see
#4213. This is because the
Daft extension type class is a local object (for lazy-loading purposes).
Instead of pickling the class, we serialize the data itself to Arrow IPC.

Separately, sending data via a pipe is too slow: pipes are designed for message
passing, not bulk data transfer. One solution is to use shared memory.
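
Here is a minimal sketch of that idea using the public `pyarrow` IPC API and Python's `multiprocessing.shared_memory` (illustrative only, not the exact Daft implementation):

```python
import numpy as np
import pyarrow as pa
from multiprocessing import shared_memory

# Producer side: serialize the batch to the Arrow IPC stream format, then copy
# the bytes into a shared-memory block instead of pushing them through a pipe.
table = pa.table({"bar": [np.random.randint(0, 100, size=10, dtype=np.int8)]})
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
payload = sink.getvalue()

shm = shared_memory.SharedMemory(create=True, size=payload.size)
shm.buf[: payload.size] = payload.to_pybytes()

# Consumer side: only the shared-memory name and payload size need to travel
# over the pipe; the actual data is read back directly from shared memory.
view = shared_memory.SharedMemory(name=shm.name)
restored = pa.ipc.open_stream(pa.BufferReader(bytes(view.buf[: payload.size]))).read_all()
assert restored.equals(table)

view.close()
shm.close()
shm.unlink()
```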

Example script (with the current implementation using `mp.Pipe`, this takes
**2 minutes 25 s**; with shared memory, it takes **16 s**):
```python
import daft
import numpy as np


@daft.udf(return_dtype=daft.DataType.list(daft.DataType.int8()), batch_size=10)
def foo(x):
    data_size = 10 * 1024 * 1024  # 10 MB per row
    res = []
    for _ in range(len(x)):
        res.append(np.random.randint(0, 100, size=(data_size,), dtype=np.int8))
    return res


if __name__ == "__main__":
    # total around 10 GB of data
    num_rows = 1000
    df = daft.from_pydict({"index": [i for i in range(num_rows)]})
    foo = foo.with_concurrency(12)
    df = df.with_column("bar", foo(df["index"])).collect()
```

This PR also improves error handling.


## Related Issues

See #4213.

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)
## Changes Made

Adds an integration test to ensure that Glue/Iceberg -> PyTorch dataloading works.
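
A minimal sketch of the shape such a test could take, assuming a `daft.read_table`-style entry point and a placeholder Glue table identifier (the actual test may use different APIs and fixtures):

```python
import daft
from torch.utils.data import DataLoader, Dataset


class DictDataset(Dataset):
    """Wraps a column-oriented dict of lists as a PyTorch map-style dataset."""

    def __init__(self, columns: dict):
        self.columns = columns
        self.num_rows = len(next(iter(columns.values())))

    def __len__(self):
        return self.num_rows

    def __getitem__(self, idx):
        return {name: values[idx] for name, values in self.columns.items()}


def test_glue_iceberg_to_pytorch_dataloading():
    # "glue_db.my_iceberg_table" is a placeholder; the real test points at the
    # integration-test table in Glue.
    df = daft.read_table("glue_db.my_iceberg_table")
    columns = df.to_pydict()
    loader = DataLoader(DictDataset(columns), batch_size=32)
    assert sum(1 for _ in loader) > 0
```
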
## Changes Made

This introduces a simple Term IR ([much like
pyiceberg](https://py.iceberg.apache.org/reference/pyiceberg/expressions/#pyiceberg.expressions.Term)),
which is a model for pushdown expressions. It is scoped to the Python
pushdown package, and the Term IR comes with a visitor that can be used to
translate to other domains, which enables easy consumption of PyExpr / Rust
expressions. I have an example visitor that produces s-expressions, and can
add one that translates to the PyIceberg Expression IR.

An alternative design is visitor-only with no IR. That is nice because there
is no additional IR, but it comes at the tradeoff of possibly tighter coupling
to the Rust expressions, and we may want an IR anyway if we ever want to serde
pushdown expressions. Please let me know what you all think is best here. #4210
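
As a rough illustration of the IR + visitor approach, here's a minimal sketch with made-up class names (the actual Term classes in the pushdown package may differ):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    """A column reference, e.g. `x`."""
    name: str


@dataclass(frozen=True)
class Literal:
    """A constant value, e.g. 1."""
    value: object


@dataclass(frozen=True)
class Predicate:
    """An operator applied to child terms, e.g. `and`, `=`, `<`."""
    op: str
    children: tuple


class TermVisitor:
    """Translates a term tree into another domain; subclasses supply the visit_* hooks."""

    def visit(self, term):
        if isinstance(term, Reference):
            return self.visit_reference(term)
        if isinstance(term, Literal):
            return self.visit_literal(term)
        return self.visit_predicate(term)


class SExpressionVisitor(TermVisitor):
    """Example visitor that renders a term tree as an s-expression string."""

    def visit_reference(self, term):
        return term.name

    def visit_literal(self, term):
        return repr(term.value)

    def visit_predicate(self, term):
        return "(" + " ".join([term.op, *(self.visit(c) for c in term.children)]) + ")"


# x = 1 AND y < 10  ->  "(and (= x 1) (< y 10))"
expr = Predicate("and", (
    Predicate("=", (Reference("x"), Literal(1))),
    Predicate("<", (Reference("y"), Literal(10))),
))
print(SExpressionVisitor().visit(expr))
```

The same term tree could be walked by a second visitor that builds PyIceberg expressions, which is the translation mentioned above.
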
…4236)

## Changes Made

Simplify the example by using `GlueCatalog.from_session()` instead of
`load_glue()`.
This commit moves `SQLPlanner` to use session references instead of owning
the `Session` through `Rc`, which simplifies lifetime handling and state
management.

## Changes Made

- `SQLPlanner` now takes `&Session` instead of `Rc<Session>`
- Removed `fork()` from Session API
- Added bound_tables to `PlannerContext` for local table bindings
- Updated all callers to pass session references




## Related Issues

closes #4207

## Checklist

- [ ] Documented in API Docs (if applicable)
- [ ] Documented in User Guide (if applicable)
- [ ] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [ ] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)
## Changes Made

Previously, `GlueCatalog.from_session()` only accepted a boto3 session. This
adds a runtime check so that it can accept a botocore session too.
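
A minimal sketch of the kind of runtime check described here, with an illustrative helper name (not the actual Daft internals):

```python
import boto3
import botocore.session


def _to_boto3_session(session):
    """Accept either a boto3 or a botocore session and normalize to boto3."""
    if isinstance(session, boto3.Session):
        return session
    if isinstance(session, botocore.session.Session):
        # boto3 sessions can wrap an existing botocore session directly.
        return boto3.Session(botocore_session=session)
    raise TypeError(f"expected a boto3 or botocore session, got {type(session).__name__}")
```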

## Checklist

- [x] Documented in API Docs (if applicable)
## Changes Made

Adds a property map to the `create_table` APIs; context and use cases are
linked in #4195.

## Related Issues

closes #4195

## Checklist

- [x] Documented in API Docs (if applicable)
- [x] Documented in User Guide (if applicable)
- [n/a] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [n/a] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)
## Changes Made

Adds an integration test for pandas -> Daft -> Glue Iceberg.

Currently this creates a new Iceberg commit per write, but I'm trying to
modify the Glue table so that these get cleaned up after a day.
@f4t4nt force-pushed the f4t4nt/window-sql branch from 1d0c256 to 2c665a0 on April 24, 2025 19:58
@f4t4nt force-pushed the f4t4nt/window-sql branch 3 times, most recently from 2a4fc12 to 3e82d49 on April 24, 2025 20:23

codecov bot commented Apr 24, 2025

Codecov Report

Attention: Patch coverage is 69.76744% with 65 lines in your changes missing coverage. Please review.

Project coverage is 78.47%. Comparing base (e99d8cc) to head (acc6cef).
Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/daft-sql/src/functions.rs | 36.36% | 56 Missing ⚠️ |
| src/daft-sql/src/modules/window.rs | 93.93% | 6 Missing ⚠️ |
| src/daft-logical-plan/src/builder/resolve_expr.rs | 89.47% | 2 Missing ⚠️ |
| src/daft-local-plan/src/plan.rs | 66.66% | 1 Missing ⚠️ |
Additional details and impacted files


```diff
@@            Coverage Diff             @@
##             main    #4227      +/-   ##
==========================================
- Coverage   78.48%   78.47%   -0.01%     
==========================================
  Files         798      799       +1     
  Lines      105351   105581     +230     
==========================================
+ Hits        82686    82856     +170     
- Misses      22665    22725      +60     
```
| Files with missing lines | Coverage Δ |
|---|---|
| daft/expressions/expressions.py | 94.91% <100.00%> (ø) |
| ...ecution/src/sinks/window_partition_and_order_by.rs | 80.00% <ø> (ø) |
| src/daft-logical-plan/src/builder/mod.rs | 88.95% <100.00%> (+0.04%) ⬆️ |
| src/daft-local-plan/src/plan.rs | 93.39% <66.66%> (+0.23%) ⬆️ |
| src/daft-logical-plan/src/builder/resolve_expr.rs | 81.96% <89.47%> (+0.63%) ⬆️ |
| src/daft-sql/src/modules/window.rs | 93.93% <93.93%> (ø) |
| src/daft-sql/src/functions.rs | 70.77% <36.36%> (-10.89%) ⬇️ |

... and 11 files with indirect coverage changes


```python
daft_result = df.with_column("row_num", row_number().over(window_spec)).sort(["category", "value"]).collect()

assert_df_equals(sql_result.to_pandas(), daft_result.to_pandas(), sort_key=["category", "value"])
```
Contributor


any particular reason to use pandas for the comparison?

Contributor Author


Yep, `assert_df_equals` uses pandas to check equality of dataframes:

```python
def assert_df_equals(
    daft_df: pd.DataFrame,
    pd_df: pd.DataFrame,
    sort_key: str | list[str] = "Unique Key",
    assert_ordering: bool = False,
    check_dtype: bool = True,
):
```
