hazelcast · TomaszGaweda · Jun 27, 2024
@@ -195,6 +195,7 @@ include::wan:partial$nav.adoc[]
 ** xref:integrate:database-connectors.adoc[Overview]
 ** xref:integrate:jdbc-connector.adoc[]
 ** xref:integrate:cdc-connectors.adoc[]
+** xref:integrate:legacy-cdc-connectors.adoc[]
 ** xref:integrate:elasticsearch-connector.adoc[]
 ** xref:integrate:mongodb-connector.adoc[]
 ** xref:integrate:influxdb-connector.adoc[]

@@ -1,4 +1,5 @@
 = CDC Connector
+[.enterprise]*Enterprise*
 
 Change Data Capture (CDC) refers to the process of observing changes
 made to a database and extracting them in a form usable by other
@@ -8,50 +9,118 @@ Change Data Capture is especially important to Hazelcast, because it allows
 for the _streaming of changes from databases_, which can be efficiently
 processed by the Jet engine.
 
-Implementation of CDC in Hazelcast is based on
-link:https://debezium.io/[Debezium]. Hazelcast offers a generic Debezium source
-which can handle CDC events from link:https://debezium.io/documentation/reference/stable/connectors/index.html[any database supported by Debezium],
-but we're also striving to make CDC sources first class citizens in Hazelcast.
-The ones for MySQL and PostgreSQL already are.
+Implementation of CDC in Hazelcast {enterprise-product-name} is based on
+link:https://debezium.io/[Debezium 2.x, window=_blank]. Hazelcast offers a generic Debezium source
+which can handle CDC events from link:https://debezium.io/documentation/reference/2.7/connectors/index.html[any database supported by Debezium, window=_blank].
+However, we're also striving to make CDC sources first-class citizens in Hazelcast,
+as we have already done for MySQL and PostgreSQL.
 
 == Installing the Connector
 
-This connector is included in the full and slim distributions of Hazelcast.
+This connector is included in the full distribution of Hazelcast {enterprise-product-name}.
+
+=== Maven
+To use this connector in a Maven project, add the following entries to the `<dependencies>` section of your `pom.xml` file:
+
+Generic connector:
+
+[source,xml]
+----
+<dependency>
+    <groupId>com.hazelcast.jet</groupId>
+    <artifactId>hazelcast-enterprise-cdc-debezium</artifactId>
+    <version>{full-version}</version>
+    <classifier>jar-with-dependencies</classifier>
+</dependency>
+----
+
+MySQL-specific connector:
+
+[source,xml]
+----
+<dependency>
+    <groupId>com.hazelcast.jet</groupId>
+    <artifactId>hazelcast-enterprise-cdc-mysql</artifactId>
+    <version>{full-version}</version>
+    <classifier>jar-with-dependencies</classifier>
+</dependency>
+----
+NOTE: The MySQL connector does not include the MySQL driver as a dependency, so you must add the driver to your project separately.
+
+PostgreSQL-specific connector:
+
+[source,xml]
+----
+<dependency>
+    <groupId>com.hazelcast.jet</groupId>
+    <artifactId>hazelcast-enterprise-cdc-postgres</artifactId>
+    <version>{full-version}</version>
+    <classifier>jar-with-dependencies</classifier>
+</dependency>
+----
 
 == CDC as a Source
 
-We have the following types of CDC sources:
+The Java API supports the following types of CDC source:
 
-* link:https://docs.hazelcast.org/docs/{full-version}/javadoc/com/hazelcast/jet/cdc/DebeziumCdcSources.html[DebeziumCdcSources]:
-  generic source for all databases supported by Debezium
-* link:https://docs.hazelcast.org/docs/{full-version}/javadoc/com/hazelcast/jet/cdc/mysql/MySqlCdcSources.html[MySqlCdcSources]:
-  specific, first class Jet CDC source for MySQL databases (also based
-  on Debezium, but benefiting the full range of convenience Jet can
-  additionally provide)
-* link:https://docs.hazelcast.org/docs/{full-version}/javadoc/com/hazelcast/jet/cdc/postgres/PostgresCdcSources.html[PostgresCdcSources]:
-  specific, first class CDC source for PostgreSQL databases (also based
-  on Debezium, but benefiting the full range of convenience Hazelcast can
-  additionally provide)
+* link:https://docs.hazelcast.org/docs/{full-version}/javadoc/com/hazelcast/jet/cdc/DebeziumCdcSources.html[DebeziumCdcSources, window=_blank]:
+  a generic source for all databases supported by Debezium
+* link:https://docs.hazelcast.org/docs/{full-version}/javadoc/com/hazelcast/jet/cdc/mysql/MySqlCdcSources.html[MySqlCdcSources, window=_blank]:
+  a specific, first-class Jet CDC source for MySQL databases (also based
+  on Debezium, but with the additional benefits provided by Hazelcast)
+* link:https://docs.hazelcast.org/docs/{full-version}/javadoc/com/hazelcast/jet/cdc/postgres/PostgresCdcSources.html[PostgresCdcSources, window=_blank]:
+  a specific, first-class CDC source for PostgreSQL databases (also based
+  on Debezium, but with the additional benefits provided by Hazelcast)
 
-For the setting up a streaming source of CDC data is just the matter of pointing it at the right database via configuration:
+To set up a streaming source of CDC data, define it using the following configuration:
 
-```java
+[source,java]
+----
 Pipeline pipeline = Pipeline.create();
 pipeline.readFrom(
     MySqlCdcSources.mysql("customers")
-            .setDatabaseAddress("127.0.0.1")
-            .setDatabasePort(3306)
-            .setDatabaseUser("debezium")
-            .setDatabasePassword("dbz")
+            .setDatabaseAddress("127.0.0.1", 3306)
+            .setDatabaseCredentials("debezium", "dbz")
             .setClusterName("dbserver1")
-            .setDatabaseWhitelist("inventory")
-            .setTableWhitelist("inventory.customers")
+            .setDatabaseIncludeList("inventory")
+            .setTableIncludeList("inventory.customers")
             .build())
     .withNativeTimestamps(0)
     .writeTo(Sinks.logger());
-```
+----
 
-For an example of how to use CDC data see xref:pipelines:cdc.adoc[our tutorial].
+The MySQL- and PostgreSQL-specific source builders provide methods for all major configuration settings and guard against invalid combinations, such as mutually exclusive options being used together. If you use the generic source builder, refer to the link:https://debezium.io/documentation/reference/stable/index.html[Debezium documentation, window=_blank] for the available settings.
+
+Follow the provided xref:pipelines:cdc.adoc[] tutorial to see how CDC processes change events from a MySQL database.
+
+=== Common source builder functions
+[cols="m,a"]
+|===
+|Method name|Description
+
+|changeRecord()
+| Sets the output type to `ChangeRecord`, a wrapper that provides most of the fields in
+a strongly typed manner.
+
+| json()
+| Sets the output type to `JSON`. In the resulting stage, the item type is `Map<String, String>`,
+where each entry's key is the key of the `SourceRecord` in JSON format and the value is the whole `SourceRecord` value in JSON format.
+
+|customMapping(RecordMappingFunction<T>)
+| Sets the output type to an arbitrary user type, `T`. The connector maps each `SourceRecord` to `T` using the provided function.
+
+|withDefaultEngine()
+|Sets the preferred engine to the default (non-async) one. This engine is single-threaded,
+but it is also older and better tested. Use it for the most stable results (for example, it performs no async offset restore). It is the most sensible choice for MySQL and PostgreSQL in particular, because their Debezium connectors are single-threaded only.
+
+|withAsyncEngine()
+|Sets the preferred engine to the async one. This engine is multithreaded (if the connector supports it), but be aware of its asynchronous nature. For example, offset restore may complete asynchronously after the restart is done, which can lead to confusing results.
+
+|setProperty(String, String)
+|Sets a connector property to the given value. Several overloads are available, which accept
+a `long`, `String`, or `boolean` value.
+
+|===
 
 === Fault Tolerance
 
@@ -79,20 +148,19 @@ For example, a sink mapping CDC data to a `Customer` class and
 maintaining a map view of latest known email addresses per customer
 (identified by ID) would look like this:
 
-```java
+[source,java]
+----
 Pipeline p = Pipeline.create();
 p.readFrom(source)
  .withoutTimestamps()
  .writeTo(CdcSinks.map("customers",
     r -> r.key().toMap().get("id"),
     r -> r.value().toObject(Customer.class).email));
-```
+----
 
 [NOTE]
 ====
 The key and value functions have certain limitations. They can be used to map only to objects which the Hazelcast member can deserialize, which unfortunately doesn't include user code submitted as a part of the job. So in the above example it's OK to have `String` email values, but we wouldn't be able to use `Customer` directly.
 
 If user code has to be used, then the problem can be solved with the help of the User Code Deployment feature. Example configs for that can be seen in our xref:pipelines:cdc-join.adoc#7-start-hazelcast-jet[CDC Join tutorial].
-
-Although User Code Deployment has been deprecated, the replacement User Code Namespaces feature does not yet support Jet jobs or pipelines. For now, continue to use the User Code Deployment solution in this scenario. 
 ====
@@ -115,11 +115,36 @@ The Jet API supports more connectors than SQL.
 |batch
 |N/A
 
-|xref:integrate:cdc-connectors.adoc[DebeziumCdcSources.debezium]
+|xref:integrate:legacy-cdc-connectors.adoc[DebeziumCdcSources.debezium] (Legacy)
 |hazelcast-jet-cdc-debezium
 |streaming
 |at-least-once
 
+|xref:integrate:legacy-cdc-connectors.adoc[MySqlCdcSources.mysql] (Legacy)
+|hazelcast-jet-cdc-mysql
+|streaming
+|exactly-once
+
+|xref:integrate:legacy-cdc-connectors.adoc[PostgresCdcSources.postgres] (Legacy)
+|hazelcast-jet-cdc-postgres
+|streaming
+|exactly-once
+
+|xref:integrate:cdc-connectors.adoc[DebeziumCdcSources.debezium] ([.enterprise]*Enterprise*)
+|hazelcast-enterprise-cdc-debezium
+|streaming
+|at-least-once
+
+|xref:integrate:cdc-connectors.adoc[MySqlCdcSources.mysql] ([.enterprise]*Enterprise*)
+|hazelcast-enterprise-cdc-mysql
+|streaming
+|exactly-once
+
+|xref:integrate:cdc-connectors.adoc[PostgresCdcSources.postgres] ([.enterprise]*Enterprise*)
+|hazelcast-enterprise-cdc-postgres
+|streaming
+|exactly-once
+
 |xref:integrate:elasticsearch-connector.adoc[ElasticSources.elastic]
 |hazelcast-jet-elasticsearch-7
 |batch
@@ -150,16 +175,6 @@ The Jet API supports more connectors than SQL.
 |streaming
 |exactly-once
 
-|xref:integrate:cdc-connectors.adoc[MySqlCdcSources.mysql]
-|hazelcast-jet-cdc-mysql
-|streaming
-|exactly-once
-
-|xref:integrate:cdc-connectors.adoc[PostgresCdcSources.postgres]
-|hazelcast-jet-cdc-postgres
-|streaming
-|exactly-once
-
 |xref:integrate:pulsar-connector.adoc[PulsarSources.pulsarConsumer]
 |hazelcast-jet-contrib-pulsar
 |streaming
@@ -270,7 +285,11 @@ The Jet API supports more connectors than SQL.
 |N/A
 
 |xref:integrate:cdc-connectors.adoc[CdcSinks.map]
-|hazelcast-jet-cdc-debezium
+|hazelcast-jet-cdc-debezium (legacy, {open-source-product-name})
+
+or
+
+hazelcast-enterprise-cdc-debezium ({enterprise-product-name})
 |streaming
 |at-least-once
 

@@ -0,0 +1,98 @@
+= Legacy CDC Connector
+
+Change Data Capture (CDC) refers to the process of observing changes
+made to a database and extracting them in a form usable by other
+systems, for purposes such as replication and analysis.
+
+Change Data Capture is especially important to Hazelcast, because it allows
+for the _streaming of changes from databases_, which can be efficiently
+processed by the Jet engine.
+
+Implementation of CDC in Hazelcast {open-source-product-name} is based on
+link:https://debezium.io/[Debezium, window=_blank]. Hazelcast offers a generic Debezium source
+which can handle CDC events from link:https://debezium.io/documentation/reference/stable/connectors/index.html[any database supported by Debezium, window=_blank].
+However, we're also striving to make CDC sources first-class citizens in Hazelcast,
+as we have already done for MySQL and PostgreSQL.
+
+== Installing the Connector
+
+This connector is included in the full distribution of Hazelcast {open-source-product-name}.
+
+== CDC as a Source
+
+We have the following types of CDC sources:
+
+* link:https://docs.hazelcast.org/docs/{full-version}/javadoc/com/hazelcast/jet/cdc/DebeziumCdcSources.html[DebeziumCdcSources, window=_blank]:
+  a generic source for all databases supported by Debezium
+* link:https://docs.hazelcast.org/docs/{full-version}/javadoc/com/hazelcast/jet/cdc/mysql/MySqlCdcSources.html[MySqlCdcSources, window=_blank]:
+  a specific, first-class Jet CDC source for MySQL databases (also based
+  on Debezium, but with the additional benefits provided by Hazelcast)
+* link:https://docs.hazelcast.org/docs/{full-version}/javadoc/com/hazelcast/jet/cdc/postgres/PostgresCdcSources.html[PostgresCdcSources, window=_blank]:
+  a specific, first-class CDC source for PostgreSQL databases (also based
+  on Debezium, but with the additional benefits provided by Hazelcast)
+
+To set up a streaming source of CDC data, define it using the following configuration:
+
+[source,java]
+----
+Pipeline pipeline = Pipeline.create();
+pipeline.readFrom(
+    MySqlCdcSources.mysql("customers")
+            .setDatabaseAddress("127.0.0.1")
+            .setDatabasePort(3306)
+            .setDatabaseUser("debezium")
+            .setDatabasePassword("dbz")
+            .setClusterName("dbserver1")
+            .setDatabaseWhitelist("inventory")
+            .setTableWhitelist("inventory.customers")
+            .build())
+    .withNativeTimestamps(0)
+    .writeTo(Sinks.logger());
+----
+
+For an example of how to use CDC data see xref:pipelines:cdc.adoc[our tutorial].
+
+=== Fault Tolerance
+
+CDC sources offer at-least-once processing guarantees. The source
+periodically saves the database write-ahead log offset up to which it has
+dispatched events. After a failure or restart, it replays all
+events since the last successfully saved offset.
+
+However, there is no guarantee that the last saved offset
+is still in the database changelog. Such logs are always finite and,
+depending on the database configuration, can be relatively short. If the CDC
+source has to replay data after a long period of inactivity,
+data can be lost. With careful management, though, the
+at-least-once guarantee can be provided in practice.
+
+== CDC as a Sink
+
+Change data capture is a source-side functionality in Jet, but we also
+offer some specialized sinks that simplify applying CDC events to a map, which gives you the ability to reconstruct the contents of the
+original database table. The sinks expect to receive `ChangeRecord`
+objects and apply your custom functions to them that extract the key and
+the value that will be applied to the target map.
+
+For example, a sink mapping CDC data to a `Customer` class and
+maintaining a map view of latest known email addresses per customer
+(identified by ID) would look like this:
+
+[source,java]
+----
+Pipeline p = Pipeline.create();
+p.readFrom(source)
+ .withoutTimestamps()
+ .writeTo(CdcSinks.map("customers",
+    r -> r.key().toMap().get("id"),
+    r -> r.value().toObject(Customer.class).email));
+----
+
+[NOTE]
+====
+The key and value functions have certain limitations. They can be used to map only to objects which the Hazelcast member can deserialize, which unfortunately doesn't include user code submitted as a part of the job. So in the above example it's OK to have `String` email values, but we wouldn't be able to use `Customer` directly.
+
+If user code has to be used, then the problem can be solved with the help of the User Code Deployment feature. Example configs for that can be seen in our xref:pipelines:cdc-join.adoc#7-start-hazelcast-jet[CDC Join tutorial].
+
+Although User Code Deployment has been deprecated, the replacement User Code Namespaces feature does not yet support Jet jobs or pipelines. For now, continue to use the User Code Deployment solution in this scenario.
+====