HIVE-28755: Statistics Management Task #6199
Open
DanielZhu58 wants to merge 37 commits into apache:master from DanielZhu58:HIVE-28755
+341,101
−4,402
0bdbda0 to
3db2848
Compare
...etastore-server/src/main/java/org/apache/hadoop/hive/metastore/StatisticsManagementTask.java
...tore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/conf/MetastoreConf.java
…without optional high/low info (apache#6208)
… local resource leaks (apache#6161)
…increases overhead (apache#6205)
1. Drop CanAggregateDistinct and refactor dependent code accordingly. 2. Remove isDistinct indicator from all classes extending SqlAggFunction. 3. Move the part handling window functions from SqlFunctionConverter#buildAST to ASTConverter 4. Generalize the generation of TOK_FUNCTIONSTAR for aggregate functions by exploiting SqlOperator#getSqlSyntax 5. Replace CalciteUDAF with SqlBasicAggFunction.create since the former does not bring any additional info (operandTypeInference is removed but it is not used anyways from Hive).
…eadable format (apache#6230) 1. Add new property to control indentation of EXPLAIN FORMATTED result 2. Create the appropriate JsonParser in ExplainTask based on explain configurations 3. Drop now unused and redundant JsonParserFactory 4. Extract logic for augmenting RS outputs in separate method dedicated for this purpose
…enabled in case of variant shredding. (apache#6245)
…sRead metrics for tables with multiple partitions (apache#6253)
…tats in HMS for non-native tables (apache#6232)
…pache#6155) * HIVE-29293: Restrict config 'mapreduce.job.queuename' at tez session * Address review comments * Address test failures * Address review comments * indentation issue



What changes were proposed in this pull request?
Adds a new StatisticsManagementTask.java that automatically deletes old statistics.
Why are the changes needed?
Old or stale statistics accumulate in the metastore; this task removes them.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Manual tests and unit tests.
For reviewers: What this PR does
This PR introduces a new “Statistics Management Task” in the Hive metastore that periodically auto-deletes stale column statistics, along with the configuration knobs that control it.
In MetastoreConf.java
Three new configuration variables are added:
STATISTICS_MANAGEMENT_TASK_FREQUENCY
Meaning: Controls how often the StatisticsManagementTask runs, for tables that have statistics.auto.deletion=true in their table properties.
STATISTICS_RETENTION_PERIOD
Meaning: The retention period for stats. If a table/partition’s stats are older than this, they become candidates for auto deletion.
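STATISTICS_AUTO_DELETION
Meaning: Enables or disables automatic deletion of stale stats. When it is disabled, the task logs and returns without deleting anything.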
In StatisticsManagementTask.java
Defines a new StatisticsManagementTask implementing MetastoreTaskThread. Its purpose is to:
Fetch STATISTICS_RETENTION_PERIOD and STATISTICS_AUTO_DELETION from conf. If retention <= 0 or auto deletion is disabled, log and return.
Compute lastAnalyzedThreshold = (now - retentionMillis) / 1000 (in seconds).
Use HMSHandler.getMSForConf(conf) to get RawStore and a PersistenceManager, then query MTableColumnStatistics rows where lastAnalyzed < threshold.
In short, this class implements a background cleanup task that scans MTableColumnStatistics for stale entries and deletes them via the metastore client.
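The threshold check described above can be sketched in isolation. This is a minimal illustration, not the code from the PR: the class name StaleStatsSketch, its method names, and the in-memory map standing in for MTableColumnStatistics rows are all hypothetical, and the real task queries the metastore through RawStore and a PersistenceManager instead.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the stale-stats selection logic; not the PR's class.
final class StaleStatsSketch {

    // lastAnalyzedThreshold = (now - retentionMillis) / 1000, expressed in epoch seconds.
    static long lastAnalyzedThreshold(long nowMillis, long retentionMillis) {
        return (nowMillis - retentionMillis) / 1000;
    }

    // Returns the keys whose lastAnalyzed (epoch seconds) is strictly below the threshold.
    // Mirrors the early return: non-positive retention or disabled auto deletion selects nothing.
    static List<String> findStale(Map<String, Long> lastAnalyzedByColumn,
                                  long nowMillis,
                                  long retentionMillis,
                                  boolean autoDeletionEnabled) {
        if (retentionMillis <= 0 || !autoDeletionEnabled) {
            return List.of();
        }
        long threshold = lastAnalyzedThreshold(nowMillis, retentionMillis);
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastAnalyzedByColumn.entrySet()) {
            if (entry.getValue() < threshold) {
                stale.add(entry.getKey());
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long retention = Duration.ofDays(365).toMillis(); // assumed retention, for illustration only
        Map<String, Long> stats = new LinkedHashMap<>();
        stats.put("db.tbl.col_old", (now - Duration.ofDays(400).toMillis()) / 1000);
        stats.put("db.tbl.col_new", now / 1000);
        System.out.println(findStale(stats, now, retention, true));
    }
}
```

The comparison is strict (lastAnalyzed < threshold), so an entry analyzed exactly at the threshold is kept.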
In BenchmarkTool.java
BenchmarkTool can now benchmark the new statistics management task for different numbers of tables.
In HMSBenchmarks.java Test
Constructs a dedicated database name and table prefix based on tableCount and BenchData.
Gets an HMSClient and instantiates a StatisticsManagementTask.
Configures the client Hadoop conf:
hive.metastore.uris = metastore URI
metastore.statistics.management.database.pattern = dbName (so the task focuses on this DB)
Sets the task’s conf and creates the database and tableCount tables:
Simulates old stats:
For each partition, sets lastAnalyzed to now - 400 days in the partition parameters and alters the partition.
Post-run assertion:
Re-scans all partitions; if any partition parameters still contain lastAnalyzed, it throws an AssertionError("Partition stats not deleted for table: " + tableName).
In other words, this is an end-to-end microbenchmark for the new StatisticsManagementTask that both measures performance and verifies that “old” partition stats are actually cleaned up.
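The "simulate old stats" and "post-run assertion" steps can be sketched without a metastore. This is a rough illustration under stated assumptions: partition parameters are modeled as plain maps, the class and method names are hypothetical, and storing lastAnalyzed as epoch seconds is an assumption (the description does not state the unit used in partition parameters). The real benchmark alters partitions through an HMSClient.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of the benchmark's setup and verification steps.
final class OldStatsSimulationSketch {

    static final String LAST_ANALYZED = "lastAnalyzed";

    // Back-dates lastAnalyzed in every partition's parameters by daysOld days.
    // Assumption: the value is stored as epoch seconds in string form.
    static void markStale(List<Map<String, String>> partitions, long nowMillis, int daysOld) {
        long oldSeconds = (nowMillis - Duration.ofDays(daysOld).toMillis()) / 1000;
        for (Map<String, String> params : partitions) {
            params.put(LAST_ANALYZED, Long.toString(oldSeconds));
        }
    }

    // Post-run check: fails if any partition still carries lastAnalyzed.
    static void assertStatsDeleted(String tableName, List<Map<String, String>> partitions) {
        for (Map<String, String> params : partitions) {
            if (params.containsKey(LAST_ANALYZED)) {
                throw new AssertionError("Partition stats not deleted for table: " + tableName);
            }
        }
    }

    public static void main(String[] args) {
        List<Map<String, String>> partitions = new ArrayList<>();
        partitions.add(new HashMap<>());
        partitions.add(new HashMap<>());

        markStale(partitions, System.currentTimeMillis(), 400);

        // Stand-in for running the cleanup task: drop the stale marker from each partition.
        for (Map<String, String> params : partitions) {
            params.remove(LAST_ANALYZED);
        }
        assertStatsDeleted("example_table", partitions);
        System.out.println("all partition stats cleaned up");
    }
}
```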