-
Notifications
You must be signed in to change notification settings - Fork 112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SNOW-1731783] Refactor node query comparison for Repeated subquery elimination #2437
Conversation
fb74319
to
a6b696f
Compare
Can you give an example of where this does not make sense? |
@sfc-gh-aalam This has been causing problem to our ctc workloads, where we missed some handling of some nodes, I have tried it with our benchmark workloads, the problem has been gone with this refactoring. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. a few comments.
query_id = encoded_query_id(query, query_params) | ||
if query_id is not None: | ||
return query_id + node_type_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can do f"{query_id}_{node_type_name}"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
def encode_id( | ||
node_type_name: str, query: str, query_params: Optional[Sequence[Any]] = None | ||
) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we call this encode_node_id
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i actually call this encode_node_id_with_query to be more clear of the encoded id to reducing the chance for people to use it incorrectly
else: | ||
return str(uuid.uuid4()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want the same node to have the same encode_id
. If so, we can rely on id(node)
which gives you deterministic result, instead of uuid4()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is used for cached property, so with uuid4 generation it will also be deterministitic, but we can use id also.
query_id = encoded_query_id(query, query_params) | ||
if query_id is not None: | ||
return query_id + node_type_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Encode given query, query parameters and the node type into an id. | ||
|
||
If query and query parameters can be encoded successfully using sha256, | ||
return the encoded query id + node_type_name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need node_type_name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is following the previous equivalence check, we require the node type to be the same, for example, a snowflake plan node and selectstatement node with the same query is not counted as the same
return encode_id(type(self).__name__, self.original_sql, self.query_params) | ||
|
||
@cached_property | ||
def encoded_query_id(self) -> Optional[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these two properties seem having the same docstring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they are slightly different, but i further updated the comment to make it more clear
Also, it sounds like this is a behavioral change. Are there production workloads that can be impacted? Which ones? |
a6b696f
to
87db802
Compare
@sfc-gh-helmeleegy there should be no user facing behavior change, this equivalence check is introduced by the cte transformation, and only used internally for plan node comparison for cte transformation, and no others should rely on this fact. If yes, that would be counted as a bug. This is counted as a refactor work before others misuse this property, so no existing workload should be impacted |
def traverse_plan(plan, plan_id_map): | ||
plan_id = plan._id | ||
plan_type = type(plan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sfc-gh-aalam i am not sure if we are doing the write test here, if two nodes have the same query and same type, there should still be two nodes after deepcopy, because they are two different nodes.
87db802
to
f42030d
Compare
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
SNOW-1731783
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
In the previous repeated subquery elimination, we updated the node comparison for SnowflakePlan for Selectable to
where the id is generated based on the query and query parameter, this means two node are treated as the same if they have same type and same query. This make sense when we do repeated subquery elimination, but not expected by other transformations.
Refactor the comparison to make sure that we only use the id comparison for repeated subquery eliminations, not for others.