
Feat: parse analyze compute statistics #4547

Merged
9 commits merged into tobymao:main on Jan 9, 2025

Conversation

@georgesittas (Collaborator)

Hi @zashroof, thanks for the PR. Can you please share any related documentation? What dialects does this cover?

@zashroof (Contributor, Author)

> Hi @zashroof, thanks for the PR. Can you please share any related documentation? What dialects does this cover?

Sorry about that, updated the PR description with links.
https://spark.apache.org/docs/3.5.1/sql-ref-syntax-aux-analyze-table.html
https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-aux-analyze-table.html

@georgesittas (Collaborator)

FYI, the team's off for the holidays, so we'll take a look in a week or so. Thanks for providing those links. :)

@zashroof (Contributor, Author)

> FYI, the team's off for the holidays, so we'll take a look in a week or so. Thanks for providing those links. :)

No worries, I didn't expect a review during the holidays :) Happy Holidays!

@VaggelisD (Collaborator) left a comment

Hey @zashroof, thank you for the PR!

To my knowledge, there are other dialects that support the ANALYZE statement, e.g. Postgres. If we implement parsing for it, we should make sure that:

  1. We add support for ANALYZE across all dialects
  2. If (1) has a large scope, we add exp.Command fallbacks at any point where the Spark/Databricks syntax is not met.

Otherwise, we risk introducing regressions such as incomplete parsing/generation or errors for other dialects. Check out how self._parse_as_command(...) is used for other statements as well.
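For illustration, here is a hedged, usage-level sketch of the fallback behavior being described; the statement and dialect below are assumptions for the example, not taken from this PR:

```python
# Illustrative sketch: when a statement's syntax isn't fully understood, sqlglot
# parsers commonly fall back to exp.Command so the raw SQL still round-trips
# instead of erroring out.
import sqlglot
from sqlglot import exp

ast = sqlglot.parse_one("ANALYZE VERBOSE tbl", read="postgres")
# Depending on coverage in a given sqlglot version, this may be an exp.Analyze
# node or an exp.Command fallback; either way the SQL should round-trip:
print(type(ast).__name__, ast.sql(dialect="postgres"))
```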

(Inline review threads on sqlglot/parser.py, sqlglot/tokens.py, sqlglot/generator.py, and tests/fixtures/identity.sql, all marked outdated and resolved.)
@zashroof (Contributor, Author) commented on Jan 7, 2025

> Hey @zashroof, thank you for the PR!
>
> To my knowledge, there are other dialects that support the ANALYZE statement, e.g. Postgres. If we implement parsing for it, we should make sure that:
>
>   1. We add support for ANALYZE across all dialects
>   2. If (1) has a large scope, we add exp.Command fallbacks at any point where the Spark/Databricks syntax is not met.
>
> Otherwise, we risk introducing regressions such as incomplete parsing/generation or errors for other dialects. Check out how self._parse_as_command(...) is used for other statements as well.

Skimming through the currently supported dialects' docs to see if they define an ANALYZE statement (there is no ANALYZE statement in the SQL standard):

| Dialect    | ANALYZE statement reference |
|------------|-----------------------------|
| databricks | https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-aux-analyze-table.html |
| doris      | https://doris.apache.org/docs/2.0/sql-manual/sql-reference/Data-Manipulation-Statements/Manipulation/ANALYZE |
| drill      | https://drill.apache.org/docs/analyze-table-compute-statistics, https://drill.apache.org/docs/analyze-table-refresh-metadata/ |
| duckdb     | https://duckdb.org/docs/sql/statements/analyze |
| mysql      | https://dev.mysql.com/doc/refman/8.4/en/analyze-table.html |
| oracle     | https://docs.oracle.com/en/database/oracle/oracle-database/21/sqlrf/ANALYZE.html |
| postgres   | https://www.postgresql.org/docs/current/sql-analyze.html |
| presto     | https://prestodb.io/docs/current/sql/analyze.html |
| redshift   | https://docs.aws.amazon.com/redshift/latest/dg/r_ANALYZE.html |
| spark      | https://spark.apache.org/docs/latest/sql-ref-syntax-aux-analyze-table.html |
| sqlite     | https://www.sqlite.org/lang_analyze.html |
| starrocks  | https://docs.starrocks.io/docs/sql-reference/sql-statements/cbo_stats/ANALYZE_TABLE/ |
| trino      | https://trino.io/docs/current/sql/analyze.html |

Well, tbh this is a bit more than I anticipated. Several of these are already covered by the Spark implementation, but let me see if I can add test cases for a few more. In the meantime, I should add a fallback to parse the statement as a command.
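For context, a per-dialect identity test along the lines being discussed could look roughly like the following; the file, test name, and statement are assumptions for illustration, not code from this PR:

```python
# Hypothetical sketch of a dialect-specific test (e.g. in tests/dialects/test_postgres.py),
# following the validate_identity pattern used elsewhere in sqlglot's test suite.
def test_analyze(self):
    self.validate_identity("ANALYZE tbl")
```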

@georgesittas (Collaborator) left a comment

Left a few comments as well; this should be good to go once we have coverage for the remaining dialects, as Vaggelis said.

(Inline review threads on sqlglot/parser.py, all marked outdated and resolved.)
@VaggelisD (Collaborator) left a comment

A few minor comments, looks much cleaner!

kind = None
this: t.Optional[exp.Expression] = None
partition = None

Collaborator

Nit: could we do kind = self._curr and self._curr.text.upper() here, i.e. before the branches? I think that would remove the hardcoded values in the if/elif.
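As a rough illustration of the suggestion (the branch shown below is assumed for the example and may not match the PR code):

```python
# Illustrative sketch only: read the upcoming token's text once, then branch,
# instead of hardcoding the keyword in each if/elif.
kind = self._curr and self._curr.text.upper()
this: t.Optional[exp.Expression] = None
partition = None

if self._match(TokenType.TABLE):
    this = self._parse_table_parts()
    partition = self._parse_partition()
```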

@zashroof (Contributor, Author)

Rewrote this part in #4591.

@@ -410,6 +410,8 @@ class TokenType(AutoName):
OPTION = auto()
SINK = auto()
SOURCE = auto()
ANALYZE = auto()
COMPUTE_STATISTICS = auto()
Collaborator

We can now remove the COMPUTE STATISTICS token since it was removed from STATEMENT_PARSERS, right?

It can be consumed by the parser through self._match_text_seq("COMPUTE", "STATISTICS")
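A minimal sketch of that approach, assuming a parser helper along the lines of the _parse_compute_statistics mentioned later in this thread (the body here is illustrative, not the merged code):

```python
# Hypothetical sketch (not the merged code): inside sqlglot's Parser, the two
# keywords can be consumed as plain text, so no dedicated COMPUTE_STATISTICS
# token is required.
def _parse_compute_statistics(self) -> t.Optional[exp.Expression]:
    if self._match_text_seq("COMPUTE", "STATISTICS"):
        # The real helper also populates args such as `this`; omitted here.
        return self.expression(exp.ComputeStatistics)
    return None
```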

@zashroof (Contributor, Author)

Done

Comment on lines +906 to +907
ast = parse_one("ANALYZE TABLE tbl COMPUTE STATISTICS FOR ALL COLUMNS")
self.assertIsInstance(ast, exp.Analyze)
Collaborator

Since we're not passing a specific dialect to parse_one, afaict we can merge each of these 2 lines into:

self.validate_identity(...).assert_is(exp.Command)

@zashroof (Contributor, Author)

That's a great idea. I am removing this test here and adding assert_is(exp.Analyze) to all the dialect-specific tests.
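For example, the two lines quoted at +906 to +907 above could then collapse into a single chained assertion (a sketch, assuming the same statement):

```python
self.validate_identity(
    "ANALYZE TABLE tbl COMPUTE STATISTICS FOR ALL COLUMNS"
).assert_is(exp.Analyze)
```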

@zashroof (Contributor, Author)

I then removed it once parse_analyze only returns exp.Analyze. Changes are in #4591.


class ComputeStatistics(Expression):
arg_types = {
"this": False,
Collaborator

We'll always have this here according to _parse_compute_statistics, so we can make it True.
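In other words, the suggested change is roughly the following (a sketch; any other arg_types keys are intentionally omitted since they aren't shown above):

```python
class ComputeStatistics(Expression):
    # `this` is always populated by _parse_compute_statistics, so mark it required.
    arg_types = {"this": True}
```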

@zashroof (Contributor, Author)

Done in #4591

Comment on lines +927 to +929
self.validate_identity(
"ANALYZE TABLE ctlg.db.tbl PARTITION(foo = 'foo', bar = 'bar') COMPUTE STATISTICS NOSCAN"
)
Collaborator

Styling nit: can we move this to the end of this identity chain, since it breaks into multiple lines?

@zashroof (Contributor, Author)

Done.

@georgesittas merged commit c75016a into tobymao:main on Jan 9, 2025
8 checks passed
@georgesittas (Collaborator)

Thanks for the contribution @zashroof, we'll take this to the finish line.

@zashroof (Contributor, Author) commented on Jan 9, 2025

> Thanks for the contribution @zashroof, we'll take this to the finish line.

Thanks for the prompt review; sorry I didn't get a chance to respond to the comments yesterday. I was planning on supporting the rest of the dialects, so I will try to send a follow-up PR if you don't mind.

@georgesittas (Collaborator)

Sounds good, and no worries 👍

@zashroof (Contributor, Author)

FTR: extending the parsing to cover all dialects is carried forward in #4591.
