docs/modules/ROOT/pages/backfill-cli.adoc | 29 additions, 8 deletions
@@ -11,7 +11,7 @@ Developers can also use the backfill CLI to trigger change events for downstream
 == Installation
 
 The CDC backfill CLI is distributed both as a JAR file and as a Pulsar-admin extension NAR file.
-The Pulsar-admin extension is packaged with the DataStax Luna Streaming distribution in the /cliextensions folder, so you don't need to build from source unless you want to make changes to the code.
+The Pulsar-admin extension is packaged with the IBM Elite Support for Apache Pulsar distribution in the `/cliextensions` folder, so you don't need to build from source unless you want to make changes to the code.
 
 Both artifacts are built with Gradle.
 To build the CLI, run the following commands:
@@ -50,19 +50,26 @@ Once the artifacts are generated, you can run the backfill CLI tool as either a
+The Pulsar-admin extension is packaged with the IBM Elite Support for Apache Pulsar (formerly DataStax Luna Streaming) distribution in the /cliextensions folder, so you don't need to build from source unless you want to make changes to the code.
+
+. Move the generated NAR archive to the /cliextensions folder of your Pulsar installation (e.g. /pulsar/cliextensions).
+. Modify the client.conf file of your Pulsar installation to include: `customCommandFactories=cassandra-cdc`.
+. Run the following command (this assumes the https://docs.datastax.com/en/installing/docs/installTARdse.html[default installation] of DSE Cassandra):
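The first two of those steps translate to shell roughly as follows. This is a sketch only: the NAR file name is a placeholder for the artifact your build produced, the paths assume the default `/pulsar` layout, and the command for the final step appears in the full page beyond this hunk.

[source,bash]
----
# Step 1: copy the generated NAR into the CLI extensions folder
# (the file name below is a placeholder -- use your build's output).
cp pulsar-cassandra-admin-<version>-nar.nar /pulsar/cliextensions/

# Step 2: register the extension with the Pulsar admin CLI.
echo 'customCommandFactories=cassandra-cdc' >> /pulsar/conf/client.conf
----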
@@ -255,64 +262,78 @@ be exported in subdirectories of the data directory specified here;
 there will be one subdirectory per keyspace inside the data
 directory, then one subdirectory per table inside each keyspace
 directory.
+
 |--help, -h
 |Displays this help message
+
 |--dsbulk-log-dir=PATH, -l
 |The directory where DSBulk should store its logs. The default is a
 'logs' subdirectory in the current working directory. This
 subdirectory will be created if it does not exist. Each DSBulk
 operation will create a subdirectory inside the log directory
 specified here. This command is not available in the Pulsar-admin extension.
+
 |--export-bundle=PATH
-|The path to a secure connect bundle to connect to the Cassandra
-cluster, if that cluster is a DataStax Astra cluster. Options
---export-host and --export-bundle are mutually exclusive.
+|The path to a Secure Connect Bundle (SCB) to connect to an Astra DB database. Options --export-host and --export-bundle are mutually exclusive.
+
 |--export-consistency=CONSISTENCY
 |The consistency level to use when exporting data. The default is
 LOCAL_QUORUM.
+
 |--export-max-concurrent-files=NUM\|AUTO
 |The maximum number of concurrent files to write to. Must be a positive
 number or the special value AUTO. The default is AUTO.
+
 |--export-max-concurrent-queries=NUM\|AUTO
 |The maximum number of concurrent queries to execute. Must be a
 positive number or the special value AUTO. The default is AUTO.
+
 |--export-splits=NUM\|NC
 |The maximum number of token range queries to generate. Use the NC
 syntax to specify a multiple of the number of available cores, e.g.
 8C = 8 times the number of available cores. The default is 8C. This
 is an advanced setting; you should rarely need to modify the default
 value.
+
 |--export-dsbulk-option=OPT=VALUE
 |An extra DSBulk option to use when exporting. Any valid DSBulk option
 can be specified here, and it will be passed as-is to the DSBulk
 process. DSBulk options, including driver options, must be passed as
 '--long.option.name=<value>'. Short options are not supported. For more DSBulk options, see https://docs.datastax.com/en/dsbulk/docs/reference/commonOptions.html[here].
+
 |--export-host=HOST[:PORT]
 |The host name or IP and, optionally, the port of a node from the
 Cassandra cluster. If the port is not specified, it will default to
 9042. This option can be specified multiple times. Options
 --export-host and --export-bundle are mutually exclusive.
+
 |--export-password
 |The password to use to authenticate against the origin cluster.
 Options --export-username and --export-password must be provided
 together, or not at all. Omit the parameter value to be prompted for
 the password interactively.
+
 |--export-protocol-version=VERSION
 |The protocol version to use to connect to the Cassandra cluster, e.g.
 'V4'. If not specified, the driver will negotiate the highest
 version supported by both the client and the server.
+
 |--export-username=STRING
 |The username to use to authenticate against the origin cluster.
 Options --export-username and --export-password must be provided
 together, or not at all.
+
 |--keyspace=<keyspace>, -k
 |The name of the keyspace where the table to be exported exists
+
 |--max-rows-per-second=NUM
 |The maximum number of rows per second to read from the Cassandra
 table. Setting this option to any negative value or zero will
 disable it. The default is -1.
+
 |--table=<table>, -t
 |The name of the table to export data from for CDC backfilling
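To tie the table's options together, here is a hypothetical invocation via the Pulsar-admin extension. The `cassandra-cdc backfill` command path, keyspace, table, and host are illustrative assumptions, not authoritative syntax:

[source,bash]
----
# Hypothetical example: backfill the events topic from table ks1.orders,
# reading from one contact point at LOCAL_QUORUM with a read-rate cap.
pulsar-admin cassandra-cdc backfill \
  --keyspace=ks1 \
  --table=orders \
  --export-host=10.0.0.5:9042 \
  --export-consistency=LOCAL_QUORUM \
  --max-rows-per-second=1000
----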
docs/modules/ROOT/pages/cdc-cassandra-events.adoc | 4 additions, 4 deletions
@@ -1,6 +1,6 @@
-= CDC for Cassandra Events
+= CDC for Cassandra Events
 
-The DataStax CDC for Cassandra agent pushes the mutation primary key for the CDC-enabled table into the Apache Pulsar events topic (also called the dirty topic). The messages in the data topic (or clean topic) are keyed messages where both the key and the payload are https://avro.apache.org/docs/current/spec.html#schema_record[AVRO records]: +
+The {cdc_cass_first} agent pushes the mutation primary key for the CDC-enabled table into the Apache Pulsar events topic (also called the dirty topic). The messages in the data topic (or clean topic) are keyed messages where both the key and the payload are https://avro.apache.org/docs/current/spec.html#schema_record[AVRO records]:
 
 * The message key is an AVRO record including all the primary key columns of your Cassandra table.
 * The message payload is an AVRO record including regular columns from your Cassandra table.
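For illustration, given a hypothetical table `CREATE TABLE ks1.orders (id int PRIMARY KEY, amount double);`, the two schemas would look roughly like the following sketch. The record names are assumptions; the agent derives the real schemas from the table metadata.

[source,json]
----
{
  "type": "record",
  "name": "orders_key",
  "doc": "Message key: one field per primary key column.",
  "fields": [
    {"name": "id", "type": "int"}
  ]
}
----

[source,json]
----
{
  "type": "record",
  "name": "orders_value",
  "doc": "Message payload: the regular (non-primary-key) columns.",
  "fields": [
    {"name": "amount", "type": ["null", "double"], "default": null}
  ]
}
----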
@@ -18,9 +18,9 @@ Finally, the following CQL data types are encoded as AVRO logical types:
 
 See https://avro.apache.org/docs/current/spec.html#Logical+Types[AVRO Logical Types] for more info on AVRO.
 
-== Change Event’s Key
+== Change Event's Key
 
-For a given table, the change event’s key is an AVRO record that contains a field for each column in the primary key of the table at the time the event was created. Both the events and the data topics (also called the dirty and the clean topics) have the same message key, an AVRO record including the primary key columns.
+For a given table, the change event's key is an AVRO record that contains a field for each column in the primary key of the table at the time the event was created. Both the events and the data topics (also called the dirty and the clean topics) have the same message key, an AVRO record including the primary key columns.
docs/modules/ROOT/pages/faqs.adoc | 20 additions, 31 deletions
@@ -2,34 +2,30 @@
 
 If you are new to {cdc_cass_first}, these frequently asked questions are for you.
 
-== Introduction
-
-=== What is {cdc_cass}?
+== What is {cdc_cass}?
 
 The {cdc_cass} is an open-source product from DataStax.
 
 With {cdc_cass}, updates to data in Apache Cassandra are put into a Pulsar topic, which in turn can write the data to external targets such as Elasticsearch, Snowflake, and other platforms.
 The {csc_pulsar_first} component is simple, with a 1:1 correspondence between the Cassandra table and a single Pulsar topic.
 
-=== What are the requirements for {cdc_pulsar}?
+== What are the requirements for {cdc_pulsar}?
 
 Minimum requirements are:
 
 * Cassandra version 3.11+ or 4.0+, DSE 6.8.16+ for near real-time event streaming CDC
 * Cassandra version 3.0 to 3.10 for batch CDC
-* Luna Streaming 2.8.0+ or Apache Pulsar 2.8.1+
+* IBM Elite Support for Apache Pulsar (formerly DataStax Luna Streaming) or Apache Pulsar 2.8.1+
 * Additional memory and CPU available on all Cassandra nodes
 
 [NOTE]
 ====
-Cassandra has supported batch CDC since Cassandra 3.0, but for near real-time event streaming, Cassandra 3.11+ or DSE 6.8.16+ are required.
+Cassandra has supported batch CDC since Cassandra 3.0, but for near real-time event streaming, Cassandra 3.11+ or DSE 6.8.16+ are required.
 ====
 
-// insert link to pulsar cluster system doc
-
-Depending on the workloads of the CDC-enabled C* tables, you may need to increase the CPU and memory specification of the C* nodes.
+Depending on the workloads of the CDC-enabled C* tables, you may need to increase the CPU and memory specification of the C* nodes.
 
-=== What is the impact of the C* CDC solution on the existing C* cluster?
+== What is the impact of the C* CDC solution on the existing C* cluster?
 
 For each CDC-enabled C* table, C* needs extra processing cycles and storage to process the CDC commit logs. The impact for dealing with a single CDC-enabled table is small, but when there are a large number of C* tables with CDC enabled, the impact within C* increases. The performance impact occurs within C* itself, not the C* CDC solution with Pulsar.
 
@@ -39,7 +35,7 @@ For each C* write operation (one detected change-event), the Pulsar CSC connecto
 
 In a worst-case scenario, where a CDC-enabled C* table has a 100% write workload, the CDC solution would double the workload by adding the same amount of read workload to the C* table. Since the C* read is primary key-based, it will be efficient.
 
-=== What are the {cdc_cass} limitations?
+== What are the {cdc_cass} limitations?
 
 {cdc_cass} has the following limitations:
 
@@ -50,8 +46,7 @@ In a worst-case scenario, where a CDC-enabled C* has 100% write workload, the CD
 * Does not support range deletes.
 * CQL column names must not match a Pulsar primitive type name (ex: INT32); see the table below.
 
-==== Table Pulsar primitive types
-
+.Pulsar primitive types
 [cols=2*, options=header]
 [%autowidth]
 |===
@@ -91,9 +86,9 @@ It stores the number of milliseconds since January 1, 1970, 00:00:00 GMT as an I
 
 |===
 
-=== What happens if Luna Streaming or Apache Pulsar is unavailable?
+== What happens if the Apache Pulsar service is unavailable?
 
-If the Pulsar cluster is down, the CDC agent on each C* node will periodically try to send the mutations, and will keep the CDC commitlog segments on disk until the data sending is successful.
+If the Pulsar cluster is down, the CDC agent on each C* node will periodically try to send the mutations, and will keep the CDC commitlog segments on disk until the data sending is successful.
 
 The CDC agent keeps track of the CDC commitlog segment offsets, so the CDC agent knows where to resume sending the mutation messages when the Pulsar cluster is back online.
 To avoid or recover from this situation, increase the `cdc_total_space_in_mb` setting and restart the node.
-To prevent hitting this new limit, increase the write throughput to Luna Streaming or Apache Pulsar, or decrease the write throughput to your node.
+To prevent hitting this new limit, increase the write throughput to your Apache Pulsar cluster, or decrease the write throughput to your node.
 
-Increasing the Luna Streaming or Apache Pulsar write throughput may involve tuning the change agent configuration (the number of allocated threads, the batching delay, the number of inflight messages), the Luna Streaming or Apache Pulsar configuration (the number of partitions of your topics), or the {cdc_pulsar} configuration (query executors, batching and cache settings, connector parallelism).
+Increasing the write throughput may involve tuning the change agent configuration (the number of allocated threads, the batching delay, the number of inflight messages), the Pulsar cluster configuration (the number of partitions of your topics), or the {cdc_pulsar} configuration (query executors, batching and cache settings, connector parallelism).
 
 As a last resort, if losing data is acceptable in your CDC pipeline, remove `commitlog` files from the `cdc_raw` directory.
 Restarting the node is not needed in this case.
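For reference, the CDC knobs mentioned above live in `cassandra.yaml`. A minimal sketch, with an arbitrary example size rather than a recommendation:

[source,yaml]
----
# cassandra.yaml -- CDC-related settings (illustrative values)
cdc_enabled: true
# Raise this cap if commitlog segments pile up in cdc_raw while Pulsar is unreachable.
cdc_total_space_in_mb: 8192
----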
 
-=== I have multiple Cassandra datacenters. How do I configure {cdc_cass}?
+== I have multiple Cassandra datacenters. How do I configure {cdc_cass}?
 
 In a multi-datacenter Cassandra configuration, enable CDC and install the change agent in only one datacenter.
 To ensure the data sent to all datacenters is delivered to the data topic, make sure to configure replication to the datacenter that has CDC enabled on the table.
@@ -125,36 +120,30 @@ To ensure all updates in DC2 and DC3 are propagated to the data topic, configure
 For example, `replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3, 'dc3': 3}`.
 The data replicated to DC1 will be processed by the change agent and eventually end up in the data topic.
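As a concrete sketch in CQL, assuming a hypothetical keyspace `ks1` and table `orders`, with DC1 as the CDC-enabled datacenter:

[source,cql]
----
-- Replicate the keyspace to DC1 (where the change agent runs) and to DC2/DC3.
ALTER KEYSPACE ks1 WITH replication =
  {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3, 'dc3': 3};

-- Enable CDC on the table; only DC1 nodes need the change agent installed.
ALTER TABLE ks1.orders WITH cdc = true;
----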
 
-=== Is {cdc_cass} an open-source project?
+== Is {cdc_cass} an open-source project?
 
 Yes, {cdc_cass} is open source using the Apache 2.0 license. You can find the source code in the GitHub repository https://github.com/datastax/cdc-apache-cassandra[datastax/cdc-apache-cassandra].
 
-=== What does {cdc_cass} provide that I cannot get with open-source Apache Pulsar?
+== What does {cdc_cass} provide that I cannot get with open-source Apache Pulsar?
 
 In effect, {cdc_cass} implements the reverse of the Apache Pulsar and DataStax Cassandra sink connectors.
 With those sink connectors, data is taken from a Pulsar topic and put into Cassandra.
 With {cdc_cass}, updates to a Cassandra table are converted into events and put into a data topic.
 From there, the data can be published to external platforms like Elasticsearch, Snowflake, and others.
 
-//=== Does {cdc_cass} support Kubernetes?
-
-//Yes.
-//You can run the {cdc_pulsar} on Luna Streaming or Apache Pulsar running on Minikube, Google Kubernetes Engine (GKE), Microsoft Azure Kubernetes Service, Amazon Kubernetes Service (AKS), and other commonly used platforms.
-//You can deploy the change agent with Cassandra on Kubernetes with the https://github.com/datastax/cass-operator[cass-operator].
-
-=== Where is the {cdc_cass} public GitHub repository?
+== Where is the {cdc_cass} public GitHub repository?
 
 The source for this FAQs document is co-located with the {cdc_cass} repository code.
 You can access the repository https://github.com/datastax/cdc-apache-cassandra[here].
 
-=== How do I install {cdc_cass}?
+== How do I install {cdc_cass}?
 
 Follow the xref:install.adoc[install] instructions.
 
-=== What is Prometheus?
+== What is Prometheus?
 
 https://prometheus.io/docs/introduction/overview/[Prometheus] is an open-source tool to collect metrics on a running app, providing real-time monitoring and alerts.
 
-=== What is Grafana?
+== What is Grafana?
 
-https://grafana.com/[Grafana] is a visualization tool that helps you make sense of metrics and related data coming from your apps via Prometheus.
+https://grafana.com/[Grafana] is a visualization tool that helps you make sense of metrics and related data coming from your apps via Prometheus.