diff --git a/docs/en/guides/best-practices/sparse-primary-indexes.md b/docs/en/guides/best-practices/sparse-primary-indexes.md
index 44f9bb72275..47df4660900 100644
--- a/docs/en/guides/best-practices/sparse-primary-indexes.md
+++ b/docs/en/guides/best-practices/sparse-primary-indexes.md
@@ -1,6 +1,6 @@
 ---
 slug: /en/optimize/sparse-primary-indexes
-sidebar_label: Sparse Primary Indexes
+sidebar_label: Primary Indexes
 sidebar_position: 1
 description: In this guide we are going to do a deep dive into ClickHouse indexing.
 ---
diff --git a/docs/en/integrations/data-ingestion/dbms/mysql/index.md b/docs/en/integrations/data-ingestion/dbms/mysql/index.md
index eaa22339981..c3d292f783e 100644
--- a/docs/en/integrations/data-ingestion/dbms/mysql/index.md
+++ b/docs/en/integrations/data-ingestion/dbms/mysql/index.md
@@ -1,7 +1,7 @@
 ---
 sidebar_label: MySQL
 sidebar_position: 10
-slug: /en/integrations/mysql
+slug: /en/integrations/connecting-to-mysql
 description: The MySQL table engine allows you to connect ClickHouse to MySQL.
 keywords: [clickhouse, mysql, connect, integrate, table, engine]
 ---
diff --git a/docs/en/integrations/data-sources/mysql.md b/docs/en/integrations/data-sources/mysql.md
new file mode 100644
index 00000000000..7f6bb47ec98
--- /dev/null
+++ b/docs/en/integrations/data-sources/mysql.md
@@ -0,0 +1,10 @@
+---
+slug: /en/integrations/mysql
+sidebar_label: MySQL
+title: MySQL
+hide_title: true
+---
+
+import MySQL from '@site/docs/en/integrations/data-ingestion/dbms/mysql/index.md';
+
+<MySQL />
diff --git a/docs/en/managing-data/core-concepts/parts.md b/docs/en/managing-data/core-concepts/parts.md
index cb2bb74c445..cec38dcde18 100644
--- a/docs/en/managing-data/core-concepts/parts.md
+++ b/docs/en/managing-data/core-concepts/parts.md
@@ -5,10 +5,10 @@ description: What are data parts in ClickHouse
 keywords: [part]
 ---

-
-
 ## What are table parts in ClickHouse?

+<br/>
+
 The data from each table in the ClickHouse [MergeTree engine family](/docs/en/engines/table-engines/mergetree-family) is organized on disk as a collection of immutable `data parts`.

 To illustrate this, we use this table (adapted from the [UK property prices dataset](/docs/en/getting-started/example-datasets/uk-price-paid)) tracking the date, town, street, and price for sold properties in the United Kingdom:
@@ -30,6 +30,7 @@ ORDER BY (town, street);

 A data part is created whenever a set of rows is inserted into the table. The following diagram sketches this:

 INSERT PROCESSING
+<br/>
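+
+For illustration, a minimal four-row insert could look like the following (a sketch only: the table name `uk_price_paid_simple` and its column names are assumed here to match the example table created above):
+
+```sql
+INSERT INTO uk_price_paid_simple (date, town, street, price) VALUES
+    ('2024-01-02', 'London', 'Baker Street', 950000),
+    ('2024-01-02', 'Liverpool', 'Penny Lane', 310000),
+    ('2024-01-03', 'Oxford', 'High Street', 740000),
+    ('2024-01-03', 'Bristol', 'Park Street', 520000);
+```
+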
 When a ClickHouse server processes the example insert with 4 rows (e.g., via an [INSERT INTO statement](/docs/en/sql-reference/statements/insert-into)) sketched in the diagram above, it performs several steps:
@@ -48,6 +49,6 @@ Data parts are self-contained, including all metadata needed to interpret their
 To manage the number of parts per table, a background merge job periodically combines smaller parts into larger ones until they reach a [configurable](/docs/en/operations/settings/merge-tree-settings#max-bytes-to-merge-at-max-space-in-pool) compressed size (typically ~150 GB). Merged parts are marked as inactive and deleted after a [configurable](/docs/en/operations/settings/merge-tree-settings#old-parts-lifetime) time interval. Over time, this process creates a hierarchical structure of merged parts, which is why it’s called a MergeTree table:

 PART MERGES
-
+<br/>
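+
+The parts belonging to a table, including merged parts that are still marked as inactive, can be inspected through the `system.parts` system table. A minimal sketch, again assuming the example table name used above:
+
+```sql
+SELECT name, active, rows
+FROM system.parts
+WHERE `table` = 'uk_price_paid_simple';
+```
+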
 To minimize the number of initial parts and the overhead of merges, database clients are [encouraged](https://clickhouse.com/blog/asynchronous-data-inserts-in-clickhouse#data-needs-to-be-batched-for-optimal-performance) to either insert tuples in bulk, e.g. 20,000 rows at once, or to use the [asynchronous insert mode](https://clickhouse.com/blog/asynchronous-data-inserts-in-clickhouse), in which ClickHouse buffers rows from multiple incoming INSERTs into the same table and creates a new part only after the buffer size exceeds a configurable threshold, or a timeout expires.
diff --git a/docs/en/managing-data/delete_mutations.md b/docs/en/managing-data/delete_mutations.md
new file mode 100644
index 00000000000..a118428edf4
--- /dev/null
+++ b/docs/en/managing-data/delete_mutations.md
@@ -0,0 +1,16 @@
+---
+slug: /en/managing-data/delete_mutations
+sidebar_label: Delete Mutations
+title: Delete Mutations
+hide_title: false
+---
+
+Delete mutations refer to `ALTER` queries that remove table data through deletes, most notably `ALTER TABLE ... DELETE`. Running such queries produces new, mutated versions of the data parts. This means that such statements trigger a rewrite of whole data parts for all data that was inserted before the mutation, translating to a large number of write requests.
+
+:::info
+For deletes, you can avoid these large numbers of write requests by using specialised table engines like [ReplacingMergeTree](/docs/en/guides/replacing-merge-tree) or [CollapsingMergeTree](/docs/en/engines/table-engines/mergetree-family/collapsingmergetree) instead of the default MergeTree table engine.
+:::
+
+import DeleteMutations from '@site/docs/en/sql-reference/statements/alter/delete.md';
+
+<DeleteMutations />
\ No newline at end of file
diff --git a/docs/en/managing-data/drop_partition.md b/docs/en/managing-data/drop_partition.md
new file mode 100644
index 00000000000..4d8cd29e996
--- /dev/null
+++ b/docs/en/managing-data/drop_partition.md
@@ -0,0 +1,76 @@
+---
+slug: /en/managing-data/drop_partition
+sidebar_label: Drop Partition
+title: Dropping Partitions
+hide_title: false
+---
+
+## Background
+
+Partitioning is specified on a table when it is initially defined via the `PARTITION BY` clause. This clause can contain a SQL expression over any columns, the result of which defines which partition a row is sent to.
+
+The data parts are logically associated with each partition on disk and can be queried in isolation. For the example below, we partition the `posts` table by year using the expression `toYear(CreationDate)`. As rows are inserted into ClickHouse, this expression is evaluated against each row, and the row is routed to the resulting partition if it exists (if the row is the first for a year, the partition is created).
+
+```sql
+CREATE TABLE posts
+(
+    `Id` Int32 CODEC(Delta(4), ZSTD(1)),
+    `PostTypeId` Enum8('Question' = 1, 'Answer' = 2, 'Wiki' = 3, 'TagWikiExcerpt' = 4, 'TagWiki' = 5, 'ModeratorNomination' = 6, 'WikiPlaceholder' = 7, 'PrivilegeWiki' = 8),
+    `AcceptedAnswerId` UInt32,
+    `CreationDate` DateTime64(3, 'UTC'),
+...
+    `ClosedDate` DateTime64(3, 'UTC')
+)
+ENGINE = MergeTree
+ORDER BY (PostTypeId, toDate(CreationDate), CreationDate)
+PARTITION BY toYear(CreationDate)
+```
+
+Read about setting the partition expression in the section [How to set the partition expression](/docs/en/sql-reference/statements/alter/partition/#how-to-set-partition-expression).
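+
+As a quick illustration of how the partition expression routes rows, you can evaluate it against a sample timestamp (a hypothetical value, shown here only for illustration):
+
+```sql
+SELECT toYear(toDateTime64('2021-06-01 12:00:00', 3, 'UTC')) AS partition_key
+
+┌─partition_key─┐
+│          2021 │
+└───────────────┘
+```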
+
+In ClickHouse, users should principally consider partitioning to be a data management feature, not a query optimization technique. Because data is separated logically based on a key, each partition can be operated on independently, e.g. deleted. This allows users to move partitions, and thus subsets of the data, between [storage tiers](/en/integrations/s3#storage-tiers) efficiently as data ages, or to [expire data and efficiently delete it from the cluster](/en/sql-reference/statements/alter/partition).
+
+## Drop Partitions
+
+`ALTER TABLE ... DROP PARTITION` provides a cost-efficient way to drop a whole partition.
+
+```sql
+ALTER TABLE table_name [ON CLUSTER cluster] DROP PARTITION|PART partition_expr
+```
+
+This query tags the partition as inactive and deletes the data completely, in approximately 10 minutes. The query is replicated: it deletes data on all replicas.
+
+In the example below, we remove posts from 2008 from the earlier table by dropping the associated partition.
+
+```sql
+SELECT DISTINCT partition
+FROM system.parts
+WHERE `table` = 'posts'
+
+┌─partition─┐
+│ 2008      │
+│ 2009      │
+│ 2010      │
+│ 2011      │
+│ 2012      │
+│ 2013      │
+│ 2014      │
+│ 2015      │
+│ 2016      │
+│ 2017      │
+│ 2018      │
+│ 2019      │
+│ 2020      │
+│ 2021      │
+│ 2022      │
+│ 2023      │
+│ 2024      │
+└───────────┘
+
+17 rows in set. Elapsed: 0.002 sec.
+
+ALTER TABLE posts
+(DROP PARTITION '2008')
+
+0 rows in set. Elapsed: 0.103 sec.
+```
diff --git a/docs/en/managing-data/truncate.md b/docs/en/managing-data/truncate.md
new file mode 100644
index 00000000000..7c58076ecd6
--- /dev/null
+++ b/docs/en/managing-data/truncate.md
@@ -0,0 +1,12 @@
+---
+slug: /en/managing-data/truncate
+sidebar_label: Truncate Table
+title: Truncate Table
+hide_title: false
+---
+
+Truncate allows the data in a table or database to be removed, while preserving the existence of the table or database itself. This is a lightweight operation that cannot be reversed.
+
+import Truncate from '@site/docs/en/sql-reference/statements/truncate.md';
+
+<Truncate />
diff --git a/docs/en/managing-data/update_mutations.md b/docs/en/managing-data/update_mutations.md
new file mode 100644
index 00000000000..b24c7a0b8b4
--- /dev/null
+++ b/docs/en/managing-data/update_mutations.md
@@ -0,0 +1,16 @@
+---
+slug: /en/managing-data/update_mutations
+sidebar_label: Update Mutations
+title: Update Mutations
+hide_title: false
+---
+
+Update mutations refer to `ALTER` queries that modify table data through updates, most notably `ALTER TABLE ... UPDATE`. Running such queries produces new, mutated versions of the data parts. This means that such statements trigger a rewrite of whole data parts for all data that was inserted before the mutation, translating to a large number of write requests.
+
+:::info
+For updates, you can avoid these large numbers of write requests by using specialised table engines like [ReplacingMergeTree](/docs/en/guides/replacing-merge-tree) or [CollapsingMergeTree](/docs/en/engines/table-engines/mergetree-family/collapsingmergetree) instead of the default MergeTree table engine.
+:::
+
+import UpdateMutations from '@site/docs/en/sql-reference/statements/alter/update.md';
+
+<UpdateMutations />
\ No newline at end of file
diff --git a/docs/en/optimize/index.md b/docs/en/optimize/index.md
new file mode 100644
index 00000000000..54899f025a5
--- /dev/null
+++ b/docs/en/optimize/index.md
@@ -0,0 +1,8 @@
+---
+slug: /en/optimize
+sidebar_label: Overview
+title: Performance and Optimizations
+hide_title: false
+---
+
+This section contains tips and best practices for improving performance with ClickHouse.
+We recommend users read [Core Concepts](/docs/en/parts) as a precursor to this section, as it covers the main concepts required to improve performance, especially [Primary Indices](/docs/en/optimize/sparse-primary-indexes).
diff --git a/sidebars.js b/sidebars.js
index dfb9d9234d3..b56b3cb5492 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -696,11 +696,7 @@ const sidebars = {
         "en/integrations/data-ingestion/apache-spark/spark-jdbc",
       ],
     },
-    {
-      type: "doc",
-      id: "en/integrations/data-ingestion/dbms/mysql/index",
-      label: "MySQL",
-    },
+    "en/integrations/data-sources/mysql",
     "en/integrations/data-sources/cassandra",
     "en/integrations/data-sources/redis",
     "en/integrations/data-sources/rabbitmq",
@@ -876,6 +872,16 @@ const sidebars = {
     ],

   managingData: [
+    {
+      type: "category",
+      label: "Core concepts",
+      collapsed: false,
+      collapsible: false,
+      items: [
+        "en/managing-data/core-concepts/parts",
+        "en/guides/best-practices/sparse-primary-indexes",
+      ]
+    },
     {
       type: "category",
       label: "Updating Data",
@@ -883,11 +889,7 @@
       collapsed: false,
       collapsible: false,
       items: [
         "en/managing-data/updates",
-        {
-          type: "link",
-          label: "Update Mutations",
-          href: "/en/sql-reference/statements/alter/update"
-        },
+        "en/managing-data/update_mutations",
        {
          type: "doc",
          label: "Lightweight Updates",
@@ -916,21 +918,9 @@
          label: "Lightweight Deletes",
          id: "en/guides/developer/lightweight-delete"
        },
-        {
-          type: "link",
-          label: "Delete Mutations",
-          href: "/en/sql-reference/statements/alter/delete"
-        },
-        {
-          type: "link",
-          label: "Truncate Table",
-          href: "/en/sql-reference/statements/truncate"
-        },
-        {
-          type: "link",
-          label: "Drop Partition",
-          href: "/en/sql-reference/statements/alter/partition#drop-partitionpart"
-        }
+        "en/managing-data/delete_mutations",
+        "en/managing-data/truncate",
+        "en/managing-data/drop_partition",
       ]
     },
     {
@@ -1001,7 +991,7 @@
       collapsed: false,
       collapsible: false,
       items: [
-        "en/guides/best-practices/sparse-primary-indexes",
+        "en/optimize/index",
         "en/operations/analyzer",
         "en/guides/best-practices/asyncinserts",
         "en/guides/best-practices/avoidmutations",
diff --git a/src/theme/Navbar/Content/index.js b/src/theme/Navbar/Content/index.js
index a84999c6ac2..8741d9c2c1f 100644
--- a/src/theme/Navbar/Content/index.js
+++ b/src/theme/Navbar/Content/index.js
@@ -127,6 +127,11 @@ const dropdownCategories = [{
   sidebar: 'managingData',
   link: '/docs/en/updating-data',
   menuItems: [
+    {
+      title: 'Core concepts',
+      description: 'Understand core concepts in ClickHouse',
+      link: '/docs/en/parts'
+    },
     {
       title: 'Updating Data',
       description: 'Updating and replacing data in ClickHouse',
@@ -145,7 +150,7 @@
     {
       title: 'Performance and Optimizations',
       description: 'Guides to help you optimize ClickHouse',
-      link: '/docs/en/optimize/sparse-primary-indexes'
+      link: '/docs/en/optimize'
     }
   ]
 },