Grammar for plan constraints #20

aaneja · 2024-07-26T14:55:49Z

This RFC describes the grammar we would use for specifying plan constraints

RFC-0007-plan-constraints-grammar.md

- Add more examples

RFC-0007-plan-constraints-grammar.md

aditi-pandit · 2024-08-16T07:01:52Z

RFC-0007-plan-constraints-grammar.md

+```
+/*! join ((c cte) n) */
+-- 
+with cte as (


We will have an option for constraint within cte also I presume. I won't push on it, but it would be good to add that example as well.

aditi-pandit

Thanks @aaneja

aaneja · 2024-08-20T09:00:20Z

With #3fcc611 I've update the grammar to make the join specification more explicit.

This disallows specifying grouped relations like (A B C) . Instead we need to specify the full join order like ((A B) C). This will make specifying ambiguous group relations like (A LOJ B ROJ C) impossible

At a later stage, we can enhance the grammar so that inner-join-groups like (A B C) can also be specified - this would be useful to constrain the optimizer to reorder the nodes in it separately from the rest of the join graph

jaystarshot · 2024-08-21T20:01:15Z

RFC-0007-plan-constraints-grammar.md

+
+### Examples of constraints
+
+1. Inner Join constraints - 


Why can't we just use replicated or paritioned syntax? Imo the brackets and join ((a c [R]) b [P]) might get complicated quickly

Can you elaborate on what this would look like for the cited example ?

Like dremio or databricks.

Current way also is fine but suggestion is to use explicit names for broadcast/partition etc since the goal of this is to allow users to explictly set the types

I'm not opposed to this; my only gripe with it is that it that the plan constraint string can get quite verbose, e.g for a 4-table join order - join (d ((a c [REPLICATED]) b [PARTITIONED]) [PARTITIONED] )

Do we need this fine grained kind of control. I like the simplicity of just specifying the partitioning of the table, and every other example I found seems to take that approach.

Spark: BROADCAST(T1)

Oracle: PQ_DISTRIBUTE(s BROADCAST, NONE) (which I also find very clunky).

Vertica: you specify inline distrib(L,B)

Dremio: mark BROADCAST inline

You can specify to use syntactic join order if you want to control the join ordering in a more complex way.

I find the existing syntax complex and hard to use.

DB2 supports 'join requests' similar to these join hints. It has variants that specify the join method too
Using a join hint allow us to :

Specify partial join order - you don't need to force syntactic join on the rest of the query to force a specific join shape for part of the query tree

Hardcode join orders for tools that generate SQL and cannot guarantee syntactic SQL generation

Got it. I see the appeal of specifying a partial join order for auto generated queries, but also would like it to be easy to specify join hints for the common case where people just want to mark some table for broadcast or similar. I wonder if there's an alternative approach we can use to achieve both of these goals

We could add Spark style independent hints -

BROADCAST(T1) - Always broadcast T1 in chosen as the build side of a join

BROADCAST(a b) - If the optimizer chooses a logical join of a and b`, and this is a sub-tree of another join graph, use BROADCAST for the join distribution

These will be complementary to the join-order syntax. So the below join hints are equivalent but provide flexibility to the user -

BROADCAST(a b) join(c (ab)) join (c (a b)[R])

Will think of more examples and incorporate them

RFC-0007-plan-constraints-grammar.md

kaikalur · 2024-09-16T14:18:16Z

A high level comment - plan level hints make them not accessible to most users. We should keep them more high level like:

BROADCAST
MAPJOIN
HIGHDISTINCT
LOWDISTINCT 
RANDOM_SHUFFLE
FORCE_SHUFFLE
SHARDING
PAYLOAD
SALTING
...

that tell the planner something about the data shape. Planner is a not concept that most users understand. But some of the others we can easily explain to users. Also things fragments can change based on other options. So not sure it's good to use those.

rschlussel · 2024-09-16T14:09:37Z

RFC-0007-plan-constraints-grammar.md

+
+Most commercially available cost-based optimizers support some form by which a user could express plan constraints that enable these two use-cases. Common terms used by DB vendors to describe this set of features include "plan constraints", "optimization guidelines" or "plan hints".
+
+There are two broad approaches as to how these plan constraints are stored and represented. Vendors such as [Oracle](https://docs.oracle.com/cd/B10500_01/server.920/a96533/hintsref.htm) & [Vertica](https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Hints/Hints.htm) provide a grammar/syntax where the plan constraint may be expressed inline with the sql query. The parser then picks up the constraint, validates it, and then passes it on to the query optimizer. 


It would helpful to describe the syntax of other systems and how this one compares. like if we're doing something different from everyone else we need a very clear reason why.

I can add a summary section describing the hint syntax, but IMO we can be opinionated about the syntax and make the right tradeoffs for what works best for us.
Happy to discuss this more on a case-by-case basis

I agree we can be opinionated and choose what works best for us, but we should still learn from what others do and where we vary, we should understand why we are going a different route/ are consciously choosing it. Being more similar to other systems people may have encountered can also make it more user friendly.

rschlussel · 2024-09-16T14:10:34Z

RFC-0007-plan-constraints-grammar.md

+    ;
+
+joinAttribute
+    : JOIN_DIST_PARTITIONED


just call this PARTITIONED. JOIN_DIST_PARTITIONED is clunky to write and to remember

I think there's some confusion, the user will use [P] and [R] to specify the constraint specification (see L53-54 for lexer rules), not JOIN_DIST_PARTITIONED or JOIN_DIST_REPLICATED

ah, thanks. missed this. Still would recommend using broadcast/ B vs. R because that's more consistent with the user facing language we use elsewhere (e.g. session property join_distribution_type=BROADCAST).

RFC-0007-plan-constraints-grammar.md

rschlussel · 2024-09-16T14:11:55Z

RFC-0007-plan-constraints-grammar.md

@@ -0,0 +1,168 @@
+# **RFC007 for Presto**
+
+## Grammar for Plan Constraints


I would call this feature Plan Hints rather than Plan Constraints. The optimizer may not follow the hints (though we should issue a warning if we don't)

I would call this feature Plan Hints rather than Plan Constraints.

Sure, will rename
Agreed on providing a warning if some/all of the plan hint is not applied

rschlussel · 2024-09-16T14:14:43Z

RFC-0007-plan-constraints-grammar.md

+    ;
+
+joinType
+    : LOJ


why is this needed?

Had mentioned this in an earlier comment, we plan to support join hints for outer joins as well. This provides support for the same. I am OK removing it for now, since v1 will not have support for using this hint

but the type of join is determined by the query, so even if we support join hints, we shouldn't need to specify if it's an inner or outer join.

If/when we support outer join reordering, we would need these to specify hints for outer-join reordering. Of course, if the planner never evaluates these hint choices, these hints are ignored and there is zero chance of a correctness issue w.r.t what type of join to choose

rschlussel · 2024-09-16T14:24:33Z

RFC-0007-plan-constraints-grammar.md

+
+### Examples of constraints
+
+1. Inner Join constraints - 


Do we need this fine grained kind of control. I like the simplicity of just specifying the partitioning of the table, and every other example I found seems to take that approach.

Spark: BROADCAST(T1)

Oracle: PQ_DISTRIBUTE(s BROADCAST, NONE) (which I also find very clunky).

Vertica: you specify inline distrib(L,B)

Dremio: mark BROADCAST inline

You can specify to use syntactic join order if you want to control the join ordering in a more complex way.

I find the existing syntax complex and hard to use.

rschlussel · 2024-09-16T14:27:08Z

RFC-0007-plan-constraints-grammar.md

+    2. `join ((a c [R]) b [P])` - In addition to the join order, use a  REPLICATED `[R]`  join for sub-plan `(a c)`  and PARTITIONED `[P]`  for `(a c) b`
+    3. If an inner join condition does not exist between nodes a CrossJoin is automatically inferred
+2. Cardinality constraints -
+    1. `card (c 10)` - Set the output row count estimate of `c` to `10` 


I would rather not expose this kind of control. Better to specify what you want to have happen with this table, and otherwise leave it be. Seems pretty risky to encode cardinality estimates in the query text.

Can you elaborate on the risk ? I see this card estimate as a simple way to fixup estimation errors for when -

Presto doesn't have stats, but the user is aware of maxcard for a base table

Presto is unable to figure out the right stats because it has a missing stats rule (e.g see the WITH CTE example at L144 below)

Provides a good debugging aid for devs to quickly test what-if scenarios wrt to cardinality estimates

The risk I see is that if you are specifying a cardinality of x, you are probably doing it because you want the optimizer to do some particular thing about it. But you don't really know what the optimizer will do with that information, and it could do one thing for a while, and then in a new release there's an optimizer change and it does something else. Because you aren't directly controlling what happens when you specify the cardinality, it's hard to say how it might affect the query, and could be hard to debug if the performance degrades (vs. if you e.g. specify broadcast join and your data gets bigger, it's very clear what happens)

It can also get out of sync with the data (there is always a risk with hand tuning that the optimization will no longer be relevan or will perform worse as the data changes, but specifying a specific cardinality estimate can have more varied and unknown effects).

I agree that this is a powerful knob to give to the users, one that we can document as such (with caveats)

To aid debugging, we can add CardEstimateBasedSourceInfo similar to how we have a HistoryBasedSourceInfo that will make it clear to the users how the cardinality was arrived at in EXPLAIN/EXPLAIN ANALYZE (and event listener etc.)

Additional safeguards like warnings & metrics can be incorporated too if the stats estimate differ widely from actual runtime observed cardinality

RFC-0007-plan-constraints-grammar.md

First draft of plan constraints grammar

cac050a

aaneja requested review from aditi-pandit and ZacBlanco July 26, 2024 14:57

aaneja commented Jul 26, 2024

View reviewed changes

RFC-0007-plan-constraints-grammar.md Outdated Show resolved Hide resolved

ZacBlanco reviewed Jul 29, 2024

View reviewed changes

RFC-0007-plan-constraints-grammar.md Outdated Show resolved Hide resolved

ZacBlanco reviewed Jul 29, 2024

View reviewed changes

RFC-0007-plan-constraints-grammar.md Outdated Show resolved Hide resolved

aditi-pandit reviewed Jul 30, 2024

View reviewed changes

RFC-0007-plan-constraints-grammar.md Show resolved Hide resolved

RFC-0007-plan-constraints-grammar.md Show resolved Hide resolved

RFC-0007-plan-constraints-grammar.md Outdated Show resolved Hide resolved

- Modify grammar to be more precise

5ea27d8

- Add more examples

ZacBlanco approved these changes Aug 14, 2024

View reviewed changes

aditi-pandit reviewed Aug 16, 2024

View reviewed changes

RFC-0007-plan-constraints-grammar.md Show resolved Hide resolved

aditi-pandit reviewed Aug 16, 2024

View reviewed changes

aditi-pandit approved these changes Aug 16, 2024

View reviewed changes

Fix to latest grammar

3fcc611

aaneja marked this pull request as ready for review August 20, 2024 08:34

aaneja changed the title ~~[DRAFT] Grammar for plan constraints~~ Grammar for plan constraints Aug 20, 2024

jaystarshot reviewed Aug 21, 2024

View reviewed changes

rschlussel reviewed Aug 23, 2024

View reviewed changes

RFC-0007-plan-constraints-grammar.md Show resolved Hide resolved

Add details about how other DB vendors support plan constraints

ff71b40

kaikalur approved these changes Sep 16, 2024

View reviewed changes

RFC-0007-plan-constraints-grammar.md Show resolved Hide resolved

rschlussel reviewed Sep 16, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Grammar for plan constraints #20

Grammar for plan constraints #20

aaneja commented Jul 26, 2024 •

edited

Loading

aditi-pandit Aug 16, 2024

aditi-pandit left a comment

aaneja commented Aug 20, 2024 •

edited

Loading

jaystarshot Aug 21, 2024

aaneja Aug 23, 2024

jaystarshot Aug 23, 2024

aaneja Aug 26, 2024

rschlussel Sep 16, 2024

aaneja Sep 17, 2024 •

edited

Loading

rschlussel Sep 17, 2024

aaneja Sep 18, 2024 •

edited

Loading

kaikalur commented Sep 16, 2024

rschlussel Sep 16, 2024

aaneja Sep 17, 2024

rschlussel Sep 17, 2024

rschlussel Sep 16, 2024

aaneja Sep 16, 2024

rschlussel Sep 17, 2024

rschlussel Sep 16, 2024

aaneja Sep 17, 2024

rschlussel Sep 16, 2024

aaneja Sep 16, 2024

rschlussel Sep 17, 2024

aaneja Sep 18, 2024 •

edited

Loading

rschlussel Sep 16, 2024

rschlussel Sep 16, 2024

aaneja Sep 17, 2024

rschlussel Sep 17, 2024

aaneja Sep 18, 2024


		Most commercially available cost-based optimizers support some form by which a user could express plan constraints that enable these two use-cases. Common terms used by DB vendors to describe this set of features include "plan constraints", "optimization guidelines" or "plan hints".

		There are two broad approaches as to how these plan constraints are stored and represented. Vendors such as [Oracle](https://docs.oracle.com/cd/B10500_01/server.920/a96533/hintsref.htm) & [Vertica](https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Hints/Hints.htm) provide a grammar/syntax where the plan constraint may be expressed inline with the sql query. The parser then picks up the constraint, validates it, and then passes it on to the query optimizer.

		@@ -0,0 +1,168 @@
		# RFC007 for Presto

		## Grammar for Plan Constraints

Grammar for plan constraints #20

Are you sure you want to change the base?

Grammar for plan constraints #20

Conversation

aaneja commented Jul 26, 2024 • edited Loading

Choose a reason for hiding this comment

aditi-pandit left a comment

Choose a reason for hiding this comment

aaneja commented Aug 20, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aaneja Sep 17, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aaneja Sep 18, 2024 • edited Loading

Choose a reason for hiding this comment

kaikalur commented Sep 16, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aaneja Sep 18, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aaneja commented Jul 26, 2024 •

edited

Loading

aaneja commented Aug 20, 2024 •

edited

Loading

aaneja Sep 17, 2024 •

edited

Loading

aaneja Sep 18, 2024 •

edited

Loading

aaneja Sep 18, 2024 •

edited

Loading