
Commit bb0a715

Merge pull request #208 from datastax/doc-5439
DOC-5439 Update links and the name for Luna
2 parents fde1d37 + f50735f

File tree

11 files changed (+123 -134 lines)

docs/modules/ROOT/assets/images/cassandra-source-connector.drawio

Lines changed: 0 additions & 1 deletion
This file was deleted.

docs/modules/ROOT/examples/extension-start.sh

Lines changed: 0 additions & 2 deletions
This file was deleted.

docs/modules/ROOT/examples/java-start.sh

Lines changed: 0 additions & 2 deletions
This file was deleted.

docs/modules/ROOT/pages/backfill-cli.adoc

Lines changed: 29 additions & 8 deletions
@@ -11,7 +11,7 @@ Developers can also use the backfill CLI to trigger change events for downstream
 == Installation
 
 The CDC backfill CLI is distributed both as a JAR file and as a Pulsar-admin extension NAR file.
-The Pulsar-admin extension is packaged with the DataStax Luna Streaming distribution in the /cliextensions folder, so you don't need to build from source unless you want to make changes to the code.
+The Pulsar-admin extension is packaged with the IBM Elite Support for Apache Pulsar distribution in the `/cliextensions` folder, so you don't need to build from source unless you want to make changes to the code.
 
 Both artifacts are built with Gradle.
 To build the CLI, run the following commands:
@@ -50,19 +50,26 @@ Once the artifacts are generated, you can run the backfill CLI tool as either a
 Java standalone::
 +
 --
-[source,shell,subs="attributes+"]
+[source,shell]
 ----
-include::example$java-start.sh[]
+java -jar backfill-cli/build/libs/backfill-cli-{version}-all.jar --data-dir target/export --export-host 127.0.0.1:9042 \
+--export-username cassandra --export-password cassandra --keyspace ks1 --table table1
 ----
 --
 
 Pulsar-admin extension::
 +
 --
-include::partial$extension.adoc[]
+The Pulsar-admin extension is packaged with the IBM Elite Support for Apache Pulsar (formerly DataStax Luna Streaming) distribution in the /cliextensions folder, so you don't need to build from source unless you want to make changes to the code.
+
+. Move the generated NAR archive to the /cliextensions folder of your Pulsar installation (e.g. /pulsar/cliextensions).
+. Modify the client.conf file of your Pulsar installation to include: `customCommandFactories=cassandra-cdc`.
+. Run the following command (this assumes the https://docs.datastax.com/en/installing/docs/installTARdse.html[default installation] of DSE Cassandra):
 +
+[source,shell]
 ----
-include::example$extension-start.sh[]
+-data-dir target/export --export-host 127.0.0.1:9042 \
+--export-username cassandra --export-password cassandra --keyspace ks1 --table table1
 ----
 --
 ====
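
The extension snippet above begins mid-option; for reference, a minimal sketch of a complete invocation, assuming the `customCommandFactories=cassandra-cdc` setting registers a `cassandra-cdc` command group with a `backfill` subcommand (both names are assumptions, so check your distribution):

[source,shell]
----
# Assumed command group and subcommand names; the cassandra-cdc group
# comes from customCommandFactories=cassandra-cdc in client.conf
bin/pulsar-admin cassandra-cdc backfill --data-dir target/export --export-host 127.0.0.1:9042 \
  --export-username cassandra --export-password cassandra --keyspace ks1 --table table1
----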
@@ -255,64 +262,78 @@ be exported in subdirectories of the data directory specified here;
 there will be one subdirectory per keyspace inside the data
 directory, then one subdirectory per table inside each keyspace
 directory.
+
 |--help, -h
 |Displays this help message
+
 |--dsbulk-log-dir=PATH, -l
 |The directory where DSBulk should store its logs. The default is a
 'logs' subdirectory in the current working directory. This
 subdirectory will be created if it does not exist. Each DSBulk
 operation will create a subdirectory inside the log directory
 specified here. This command is not available in the Pulsar-admin extension.
+
 |--export-bundle=PATH
-|The path to a secure connect bundle to connect to the Cassandra
-cluster, if that cluster is a DataStax Astra cluster. Options
---export-host and --export-bundle are mutually exclusive.
+|The path to a Secure Connect Bundle (SCB) to connect to an Astra DB database. Options --export-host and --export-bundle are mutually exclusive.
+
 |--export-consistency=CONSISTENCY
 |The consistency level to use when exporting data. The default is
 LOCAL_QUORUM.
+
 |--export-max-concurrent-files=NUM\|AUTO
 |The maximum number of concurrent files to write to. Must be a positive
 number or the special value AUTO. The default is AUTO.
+
 |--export-max-concurrent-queries=NUM\|AUTO
 |The maximum number of concurrent queries to execute. Must be a
 positive number or the special value AUTO. The default is AUTO.
+
 |--export-splits=NUM\|NC
 |The maximum number of token range queries to generate. Use the NC
 syntax to specify a multiple of the number of available cores, e.g.
 8C = 8 times the number of available cores. The default is 8C. This
 is an advanced setting; you should rarely need to modify the default
 value.
+
 |--export-dsbulk-option=OPT=VALUE
 |An extra DSBulk option to use when exporting. Any valid DSBulk option
 can be specified here, and it will be passed as-is to the DSBulk
 process. DSBulk options, including driver options, must be passed as
 '--long.option.name=<value>'. Short options are not supported. For more DSBulk options, see https://docs.datastax.com/en/dsbulk/docs/reference/commonOptions.html[here].
+
 |--export-host=HOST[:PORT]
 |The host name or IP and, optionally, the port of a node from the
 Cassandra cluster. If the port is not specified, it will default to
 9042. This option can be specified multiple times. Options
 --export-host and --export-bundle are mutually exclusive.
+
 |--export-password
 |The password to use to authenticate against the origin cluster.
 Options --export-username and --export-password must be provided
 together, or not at all. Omit the parameter value to be prompted for
 the password interactively.
+
 |--export-protocol-version=VERSION
 |The protocol version to use to connect to the Cassandra cluster, e.g.
 'V4'. If not specified, the driver will negotiate the highest
 version supported by both the client and the server.
+
 |--export-username=STRING
 |The username to use to authenticate against the origin cluster.
 Options --export-username and --export-password must be provided
 together, or not at all.
+
 |--keyspace=<keyspace>, -k
 |The name of the keyspace where the table to be exported exists
+
 |--max-rows-per-second=PATH
 |The maximum number of rows per second to read from the Cassandra
 table. Setting this option to any negative value or zero will
 disable it. The default is -1.
+
 |--table=<table>, -t
 |The name of the table to export data from for cdc back filling
+
 |--version, -v
 |Displays version info.
 |===
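
These options compose; as a minimal sketch, a throttled export from an Astra DB database might look like the following (bundle path, credentials, keyspace, and table are placeholders, and leaving `--export-password` without a value triggers the interactive prompt documented above):

[source,shell]
----
# Placeholder names and paths; --export-bundle replaces --export-host for Astra DB
java -jar backfill-cli/build/libs/backfill-cli-{version}-all.jar \
  --data-dir target/export \
  --export-bundle /path/to/secure-connect-mydb.zip \
  --export-username myuser --export-password \
  --export-consistency LOCAL_QUORUM \
  --max-rows-per-second 1000 \
  --keyspace ks1 --table table1
----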

docs/modules/ROOT/pages/cdc-cassandra-events.adoc

Lines changed: 4 additions & 4 deletions
@@ -1,6 +1,6 @@
-= CDC for Cassandra Events
+= CDC for Cassandra Events
 
-The DataStax CDC for Cassandra agent pushes the mutation primary key for the CDC-enabled table into the Apache Pulsar events topic (also called the dirty topic). The messages in the data topic (or clean topic) are keyed messages where both the key and the payload are https://avro.apache.org/docs/current/spec.html#schema_record[AVRO records]: +
+The {cdc_cass_first} agent pushes the mutation primary key for the CDC-enabled table into the Apache Pulsar events topic (also called the dirty topic). The messages in the data topic (or clean topic) are keyed messages where both the key and the payload are https://avro.apache.org/docs/current/spec.html#schema_record[AVRO records]:
 
 * The message key is an AVRO record including all the primary key columns of your Cassandra table.
 * The message payload is an AVRO record including regular columns from your Cassandra table.
@@ -18,9 +18,9 @@ Finally, the following CQL data types are encoded as AVRO logical types:
 
 See https://avro.apache.org/docs/current/spec.html#Logical+Types[AVRO Logical Types] for more info on AVRO.
 
-== Change Events Key
+== Change Event's Key
 
-For a given table, the change events key is an AVRO record that contains a field for each column in the primary key of the table at the time the event was created. Both the events and the data topics (also called the dirty and the clean topics) have the same message key, an AVRO record including the primary key columns.
+For a given table, the change event's key is an AVRO record that contains a field for each column in the primary key of the table at the time the event was created. Both the events and the data topics (also called the dirty and the clean topics) have the same message key, an AVRO record including the primary key columns.
 
 == `INSERT` Event
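
Because both topics carry registered AVRO schemas, you can inspect the key and payload record definitions with standard Pulsar tooling; a minimal sketch, with an illustrative topic name:

[source,shell]
----
# Topic name is illustrative; use the events or data topic configured for your table
bin/pulsar-admin schemas get persistent://public/default/data-ks1.table1
----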

docs/modules/ROOT/pages/cdcExample.adoc

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ This installation requires the following. Latest version artifacts are available
 ** DSE - use `agent-dse4-<version>-all.jar`
 ** OSS C* - use `agent-c4-<version>-all.jar`
 * Pulsar
-** DataStax Luna Streaming - use `agent-dse4-<version>-all.jar`
+** IBM Elite Support for Apache Pulsar - use `agent-dse4-<version>-all.jar`
 * Pulsar C* source connector (CSC)
 ** Pulsar Cassandra Source NAR - use `pulsar-cassandra-source-<version>.nar`
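
For context on how the agent jars above are consumed, the change agent is attached to the Cassandra or DSE JVM at startup; a minimal sketch, assuming a tarball install and an illustrative jar path:

[source,shell]
----
# Illustrative path; attach the change agent before starting the node
export JVM_EXTRA_OPTS="-javaagent:/path/to/agent-dse4-<version>-all.jar"
bin/dse cassandra
----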

docs/modules/ROOT/pages/faqs.adoc

Lines changed: 20 additions & 31 deletions
@@ -2,34 +2,30 @@
 
 If you are new to {cdc_cass_first}, these frequently asked questions are for you.
 
-== Introduction
-
-=== What is {cdc_cass}?
+== What is {cdc_cass}?
 
 The {cdc_cass} is a an open-source product from DataStax.
 
 With {cdc_cass}, updates to data in Apache Cassandra are put into a Pulsar topic, which in turn can write the data to external targets such as Elasticsearch, Snowflake, and other platforms.
 The {csc_pulsar_first} component is simple, with a 1:1 correspondence between the Cassandra table and a single Pulsar topic.
 
-=== What are the requirements for {cdc_pulsar}?
+== What are the requirements for {cdc_pulsar}?
 
 Minimum requirements are:
 
 * Cassandra version 3.11+ or 4.0+, DSE 6.8.16+ for near real-time event streaming CDC
 * Cassandra version 3.0 to 3.10 for batch CDC
-* Luna Streaming 2.8.0+ or Apache Pulsar 2.8.1+
+* IBM Elite Support for Apache Pulsar (formerly DataStax Luna Streaming) or Apache Pulsar 2.8.1+
 * Additional memory and CPU available on all Cassandra nodes
 
 [NOTE]
 ====
-Cassandra has supported batch CDC since Cassandra 3.0, but for near real-time event streaming, Cassandra 3.11+ or DSE 6.8.16+ are required. 
+Cassandra has supported batch CDC since Cassandra 3.0, but for near real-time event streaming, Cassandra 3.11+ or DSE 6.8.16+ are required.
 ====
 
-// insert link to pulsar cluster system doc
-
-Depending on the workloads of the CDC enabled C* tables, you may need to increase the CPU and memory specification of the C* nodes. 
+Depending on the workloads of the CDC enabled C* tables, you may need to increase the CPU and memory specification of the C* nodes.
 
-=== What is the impact of the C* CDC solution on the existing C* cluster?
+== What is the impact of the C* CDC solution on the existing C* cluster?
 
 For each CDC-enabled C* table, C* needs extra processing cycles and storage to process the CDC commit logs. The impact for dealing with a single CDC-enabled table is small, but when there are a large number of C* tables with CDC enabled, the impact within C* increases. The performance impact occurs within C* itself, not the C* CDC solution with Pulsar.
 
@@ -39,7 +35,7 @@ For each C* write operation (one detected change-event), the Pulsar CSC connecto
 
 In a worst-case scenario, where a CDC-enabled C* has 100% write workload, the CDC solution would double the workload by adding the same amount of read workload to C* table. Since the C* read is primary key-based, it will be efficient.
 
-=== What are the {cdc_cass} limitations?
+== What are the {cdc_cass} limitations?
 
 {cdc_cass} has the following limitations:
 
@@ -50,8 +46,7 @@ In a worst-case scenario, where a CDC-enabled C* has 100% write workload, the CD
 * Does not support range deletes.
 * CQL column names must not match a Pulsar primitive type name (ex: INT32) below
 
-==== Table Pulsar primitive types
-
+.Pulsar primitive types
 [cols=2*, options=header]
 [%autowidth]
 |===
@@ -91,9 +86,9 @@ It stores the number of milliseconds since January 1, 1970, 00:00:00 GMT as an I
 
 |===
 
-=== What happens if Luna Streaming or Apache Pulsar is unavailable?
+== What happens if the Apache Pulsar service is unavailable?
 
-If the Pulsar cluster is down, the CDC agent on each C* node will periodically try to send the mutations, and will keep the CDC commitlog segments on disk until the data sending is successful. 
+If the Pulsar cluster is down, the CDC agent on each C* node will periodically try to send the mutations, and will keep the CDC commitlog segments on disk until the data sending is successful.
 
 The CDC agent keeps track of the CDC commitlog segment offsets, so the CDC agent knows where to resume sending the mutation messages when the Pulsar cluster is back online.
 
@@ -108,14 +103,14 @@ WARN [CoreThread-5] 2021-10-29 09:12:52,790 NoSpamLogger.java:98 - Rejecting M
 ----
 
 To avoid or recover from this situation, increase the `cdc_total_space_in_mb` and restart the node.
-To prevent hitting this new limit, increase the write throughput to Luna Streaming or Apache Pulsar, or decrease the write throughput to your node.
+To prevent hitting this new limit, increase the write throughput to your Apache Pulsar cluster, or decrease the write throughput to your node.
 
-Increasing the Luna Streaming or Apache Pulsar write throughput may involve tuning the change agent configuration (the number of allocated threads, the batching delay, the number of inflight messages), the Luna Streaming or Apache Pulsar configuration (the number of partitions of your topics), or the {cdc_pulsar} configuration (query executors, batching and cache settings, connector parallelism).
+Increasing the write throughput may involve tuning the change agent configuration (the number of allocated threads, the batching delay, the number of inflight messages), the Pulsar cluster configuration (the number of partitions of your topics), or the {cdc_pulsar} configuration (query executors, batching and cache settings, connector parallelism).
 
 As a last resort, if losing data is acceptable in your CDC pipeline, remove `commitlog` files from the `cdc_raw` directory.
 Restarting the node is not needed in this case.
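
A minimal sketch of that recovery on one node, with an illustrative value and configuration path:

[source,shell]
----
# Illustrative value and path; choose a cap that fits the node's disk
sudo sed -i.bak 's/^cdc_total_space_in_mb:.*/cdc_total_space_in_mb: 8192/' /etc/cassandra/cassandra.yaml
nodetool drain && sudo systemctl restart cassandra
----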

-=== I have multiple Cassandra datacenters. How do I configure {cdc_cass}?
+== I have multiple Cassandra datacenters. How do I configure {cdc_cass}?
 
 In a multi-datacenter Cassandra configuration, enable CDC and install the change agent in only one datacenter.
 To ensure the data sent to all datacenters are delivered to the data topic, make sure to configure replication to the datacenter that has CDC enabled on the table.
@@ -125,36 +120,30 @@ To ensure all updates in DC2 and DC3 are propagated to the data topic, configure
 For example, `replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3, 'dc3': 3})`.
 The data replicated to DC1 will be processed by the change agent and eventually end up in the data topic.
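
A minimal sketch of that configuration in cqlsh, reusing the datacenter names above (keyspace and table names are illustrative):

[source,shell]
----
# Replicate the keyspace to all datacenters, then enable CDC on the table in the
# datacenter that runs the change agent
cqlsh -e "ALTER KEYSPACE ks1 WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3, 'dc3': 3};"
cqlsh -e "ALTER TABLE ks1.table1 WITH cdc = true;"
----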

-=== Is {cdc_cass} an open-source project?
+== Is {cdc_cass} an open-source project?
 
 Yes, {cdc_cass} is open source using the Apache 2.0 license. You can find the source code on the GitHub repository https://github.com/datastax/cdc-apache-cassandra[datastax/cdc-apache-cassandra].
 
-=== What does {cdc_cass} provide that I cannot get with open-source Apache Pulsar?
+== What does {cdc_cass} provide that I cannot get with open-source Apache Pulsar?
 
 In effect, the {cdc_cass} implements the reverse of Apache Pulsar or DataStax Cassandra Sink Connector.
 With those sink connectors, data is taken from a Pulsar topic and put into Cassandra.
 With {cdc_cass}, updates to a Cassandra table are converted into events and put into a data topic.
 From there, the data can be published to external platforms like Elasticsearch, Snowflake, and other platforms.
 
-//=== Does {cdc_cass} support Kubernetes?
-
-//Yes.
-//You can run the {cdc_pulsar} on Luna Streaming or Apache Pulsar running on Minikube, Google Kubernetes Engine (GKE), Microsoft Azure Kubernetes Service, // Amazon Kubernetes Service (AKS), and other commonly used platforms.
-//You can deploy the change agent with Cassandra on Kubernetes with the https://github.com/datastax/cass-operator[cass-operator].
-
-=== Where is the {cdc_cass} public GitHub repository?
+== Where is the {cdc_cass} public GitHub repository?
 
 The source for this FAQs document is co-located with the {cdc_cass} repository code.
 You can access the repository https://github.com/datastax/cdc-apache-cassandra[here].
 
-=== How do I install {cdc_cass}?
+== How do I install {cdc_cass}?
 
 Follow the xref:install.adoc[install] instructions.
 
-=== What is Prometheus?
+== What is Prometheus?
 
 https://prometheus.io/docs/introduction/overview/[Prometheus] is an open-source tool to collect metrics on a running app, providing real-time monitoring and alerts.
 
-=== What is Grafana?
+== What is Grafana?
 
-https://grafana.com/[Grafana] is a visualization tool that helps you make sense of metrics and related data coming from your apps via Prometheus.
+https://grafana.com/[Grafana] is a visualization tool that helps you make sense of metrics and related data coming from your apps via Prometheus.
