From f915db4926e1dea6cb4ea7ff16b291f631b1d39e Mon Sep 17 00:00:00 2001
From: dailidong
Date: Wed, 13 Nov 2024 23:20:02 +0800
Subject: [PATCH 1/8] Correct errors in the FAQ doc

---
 README.md      |   3 +-
 docs/en/faq.md | 379 +++++++++++--------------------------------------
 docs/zh/faq.md | 376 ++++++++++++------------------------------------
 3 files changed, 175 insertions(+), 583 deletions(-)

diff --git a/README.md b/README.md
index 1404587b0b0..168afd28e30 100644
--- a/README.md
+++ b/README.md
@@ -144,6 +144,7 @@ Yes, SeaTunnel is available under the Apache 2.0 License, allowing commercial us
 Our [Official Documentation](https://seatunnel.apache.org/docs) includes detailed guides and tutorials to help you get started.
 
-### 7. Is there a community or support channel?
+### 6. Is there a community or support channel?
 
 Join our Slack community for support and discussions: [SeaTunnel Slack](https://s.apache.org/seatunnel-slack).
+more information, please refer to [FAQ](https://seatunnel.apache.org/docs/faq).
\ No newline at end of file
diff --git a/docs/en/faq.md b/docs/en/faq.md
index 02c125ad4fd..735dc3f7a69 100644
--- a/docs/en/faq.md
+++ b/docs/en/faq.md
@@ -1,332 +1,123 @@
-# FAQs
+# FAQ
 
-## Why should I install a computing engine like Spark or Flink?
+## What data sources and destinations does SeaTunnel support?
+SeaTunnel supports a wide range of data sources and destinations. You can find the detailed lists here:
+- Supported data sources (Source): [Source List](https://seatunnel.apache.org/docs/connector-v2/source)
+- Supported data destinations (Sink): [Sink List](https://seatunnel.apache.org/docs/connector-v2/sink)
 
-SeaTunnel now uses computing engines such as Spark and Flink to complete resource scheduling and node communication, so we can focus on the ease of use of data synchronization and the development of high-performance components. But this is only temporary.
+## Does SeaTunnel support batch and streaming processing?
+SeaTunnel supports both batch and streaming processing modes. You can select the appropriate mode based on your specific business scenarios and needs. Batch processing is suitable for scheduled data integration tasks, while streaming processing is ideal for real-time integration and Change Data Capture (CDC).
 
-## I have a question, and I cannot solve it by myself
+## Is it necessary to install engines like Spark or Flink when using SeaTunnel?
+Spark and Flink are not mandatory. SeaTunnel supports Zeta, Spark, and Flink as integration engines, allowing you to choose one based on your needs. The community highly recommends Zeta, a new-generation high-performance integration engine designed specifically for integration scenarios. Zeta is affectionately called "Ultraman Zeta" by community users! The community offers extensive support for Zeta, making it the most feature-rich option.
 
-I have encountered a problem when using SeaTunnel and I cannot solve it by myself. What should I do? First, search in [Issue List](https://github.com/apache/seatunnel/issues) or [Mailing List](https://lists.apache.org/list.html?dev@seatunnel.apache.org) to see if someone has already asked the same question and got an answer. If you cannot find an answer to your question, you can contact community members for help in [These Ways](https://github.com/apache/seatunnel#contact-us).
+## What data transformation functions does SeaTunnel provide?
+SeaTunnel supports multiple data transformation functions, including field mapping, data filtering, data format conversion, and more. You can implement data transformations through the `transform` module in the configuration file. For more details, refer to the SeaTunnel [Transform Documentation](https://seatunnel.apache.org/docs/transform-v2).
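+
+Below is a minimal sketch of a complete job that ties these pieces together on the default Zeta engine; the `FakeSource` test source, the field names, and the filter condition are illustrative placeholders, not a prescribed setup:
+
+```plaintext
+env {
+  parallelism = 1
+  # Use "STREAMING" instead for continuous, CDC-style jobs
+  job.mode = "BATCH"
+}
+
+source {
+  FakeSource {
+    result_table_name = "fake"
+    row.num = 16
+    schema = {
+      fields {
+        name = "string"
+        age = "int"
+      }
+    }
+  }
+}
+
+transform {
+  Sql {
+    source_table_name = "fake"
+    result_table_name = "fake_filtered"
+    # Filter rows with a plain SQL expression
+    query = "select name, age from fake where age > 18"
+  }
+}
+
+sink {
+  Console {
+    source_table_name = "fake_filtered"
+  }
+}
+```
+
+Running this with `$SEATUNNEL_HOME/bin/seatunnel.sh -c <config file> -m local` should print the filtered rows to the console.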
-## How do I declare a variable?
+## Can SeaTunnel support custom data cleansing rules?
+Yes, SeaTunnel supports custom data cleansing rules. You can configure custom rules in the `transform` module, such as cleaning up dirty data, removing invalid records, or converting fields.
 
-Do you want to know how to declare a variable in SeaTunnel's configuration, and then dynamically replace the value of the variable at runtime?
+## Does SeaTunnel support real-time incremental integration?
+SeaTunnel supports incremental data integration. For example, the CDC connector allows real-time capture of data changes, which is ideal for scenarios requiring real-time data integration.
 
-Since `v1.2.4`, SeaTunnel supports variable substitution in the configuration. This feature is often used for timing or non-timing offline processing to replace variables such as time and date. The usage is as follows:
+## What CDC data sources are currently supported by SeaTunnel?
+SeaTunnel currently supports MongoDB CDC, MySQL CDC, OpenGauss CDC, Oracle CDC, PostgreSQL CDC, SQL Server CDC, TiDB CDC, and more. For more details, refer to the [Source List](https://seatunnel.apache.org/docs/connector-v2/source).
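+
+As a sketch of how a CDC source is wired up, the following MySQL-CDC block captures changes from a single table; the connection URL, credentials, and table name are placeholders, and the exact option set should be verified against the MySQL CDC connector documentation for your SeaTunnel version:
+
+```plaintext
+source {
+  MySQL-CDC {
+    result_table_name = "orders_cdc"
+    base-url = "jdbc:mysql://localhost:3306/demo"
+    username = "st_user"
+    password = "st_password"
+    table-names = ["demo.orders"]
+    # "initial" takes a snapshot first, then streams binlog changes
+    startup.mode = "initial"
+  }
+}
+```
+
+A source like this is normally paired with `job.mode = "STREAMING"` in the `env` block so the job keeps running and applying changes.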
-Configure the variable name in the configuration. Here is an example of sql transform (actually, anywhere in the configuration file the value in `'key = value'` can use the variable substitution):
+## How do I enable permissions required for SeaTunnel CDC integration?
+Please refer to the official SeaTunnel documentation for the necessary steps to enable permissions for each connector’s CDC functionality.
 
-```
-...
-transform {
-  sql {
-    query = "select * from user_view where city ='"${city}"' and dt = '"${date}"'"
-  }
-}
-...
-```
-
-Taking Spark Local mode as an example, the startup command is as follows:
-
-```bash
-./bin/start-seatunnel-spark.sh \
--c ./config/your_app.conf \
--e client \
--m local[2] \
--i city=shanghai \
--i date=20190319
-```
-
-You can use the parameter `-i` or `--variable` followed by `key=value` to specify the value of the variable, where the key needs to be same as the variable name in the configuration.
+## Does SeaTunnel support CDC from MySQL replicas? How are logs pulled?
+Yes, SeaTunnel supports CDC from MySQL replicas by subscribing to binlog logs, which are then parsed on the SeaTunnel server.
 
-## How do I write a configuration item in multi-line text in the configuration file?
+## Does SeaTunnel support CDC integration for tables without primary keys?
+No, SeaTunnel does not support CDC integration for tables without primary keys. This is because, in cases where two identical records exist in the upstream and one is deleted or modified, the downstream cannot determine which record to delete or modify, leading to potential issues. Having primary keys is essential for ensuring data uniqueness, similar to identifying the real Monkey King in the classic "Journey to the West."
 
-When a configured text is very long and you want to wrap it, you can use three double quotes to indicate its start and end:
-
-```
-var = """
- whatever you want
-"""
-```
-
-## How do I implement variable substitution for multi-line text?
-
-It is a little troublesome to do variable substitution in multi-line text, because the variable cannot be included in three double quotation marks:
+## How does SeaTunnel handle changes in data sources (source) or data destinations (sink)?
+When the structure of a data source or destination changes, SeaTunnel provides various mechanisms to adapt, such as automatically detecting and updating the schema or configuring data mapping rules. You can adjust the `schema_save_mode` or `data_save_mode` parameters to control how these changes are handled based on your needs.
 
-```
-var = """
-your string 1
-"""${you_var}""" your string 2"""
-```
+For more details, refer to the answers on `schema_save_mode` and `data_save_mode` below.
 
-Refer to: [lightbend/config#456](https://github.com/lightbend/config/issues/456).
-
-## Is SeaTunnel supported in Azkaban, Oozie, DolphinScheduler?
+## Does SeaTunnel support automatic table creation?
+Before starting an integration task, you can select different handling schemes for existing table structures on the target side, controlled via the `schema_save_mode` parameter. Available options include:
+- **`RECREATE_SCHEMA`**: Creates the table if it does not exist; if the table exists, it is deleted and recreated.
+- **`CREATE_SCHEMA_WHEN_NOT_EXIST`**: Creates the table if it does not exist; skips creation if the table already exists.
+- **`ERROR_WHEN_SCHEMA_NOT_EXIST`**: Throws an error if the table does not exist.
+- **`IGNORE`**: Ignores table handling.
+
+Many connectors currently support automatic table creation. Refer to the specific connector documentation, such as [Jdbc sink](https://seatunnel.apache.org/docs/connector-v2/sink/Jdbc#schema_save_mode-enum), for more information.
 
-Of course! See the screenshot below:
-
-![workflow.png](../images/workflow.png)
-
-![azkaban.png](../images/azkaban.png)
+## Does SeaTunnel support handling existing data before starting a data integration task?
+Yes, you can specify different processing schemes for existing data on the target side before starting an integration task, controlled via the `data_save_mode` parameter. Available options include:
+- **`DROP_DATA`**: Retains the database structure but deletes the data.
+- **`APPEND_DATA`**: Retains both the database structure and data.
+- **`CUSTOM_PROCESSING`**: User-defined processing.
+- **`ERROR_WHEN_DATA_EXISTS`**: Throws an error if data already exists.
+
+Many connectors support handling existing data; please refer to the respective connector documentation, such as [Jdbc sink](https://seatunnel.apache.org/docs/connector-v2/sink/Jdbc#data_save_mode-enum).
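+
+To illustrate both parameters in context, here is a minimal sketch of a JDBC sink; the URL, driver, credentials, and table names are placeholders, and the option names should be checked against the Jdbc sink documentation:
+
+```plaintext
+sink {
+  Jdbc {
+    url = "jdbc:mysql://localhost:3306/demo"
+    driver = "com.mysql.cj.jdbc.Driver"
+    user = "st_user"
+    password = "st_password"
+    generate_sink_sql = true
+    database = "demo"
+    table = "orders_copy"
+    # Create the target table only if it does not exist yet
+    schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST"
+    # Keep any rows already present in the target table
+    data_save_mode = "APPEND_DATA"
+  }
+}
+```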
+## Does SeaTunnel support exactly-once consistency?
+SeaTunnel supports exactly-once consistency for some data sources, such as MySQL and PostgreSQL, ensuring data consistency during integration. Note that exactly-once consistency depends on the capabilities of the underlying database.
 
+## Can SeaTunnel execute scheduled tasks?
+You can use Linux cron jobs to achieve periodic data integration, or leverage scheduling tools like DolphinScheduler to manage complex scheduled tasks.
 
+## I encountered an issue with SeaTunnel that I cannot resolve. What should I do?
+If you encounter issues with SeaTunnel, here are a few ways to get help:
+1. Search the [Issue List](https://github.com/apache/seatunnel/issues) or [Mailing List](https://lists.apache.org/list.html?dev@seatunnel.apache.org) to see if someone else has faced a similar issue.
+2. If you cannot find an answer, reach out to the community through [these methods](https://github.com/apache/seatunnel#contact-us).
 
+## How do I declare variables?
+Would you like to declare a variable in SeaTunnel's configuration and dynamically replace it at runtime? This feature is commonly used in both scheduled and ad-hoc offline processing to replace time, date, or other variables. Here's an example:
 
+Define the variable in the configuration. For example, in an SQL transformation (the value in any "key = value" pair in the configuration file can be replaced with variables):
 
+```plaintext
+...
+transform {
+  Sql {
+    query = "select * from user_view where city ='${city}' and dt = '${date}'"
+  }
+}
+...
+```
-## Does SeaTunnel have a case for configuring multiple sources, such as configuring elasticsearch and hdfs in source at the same time?
-
-```
-env {
-  ...
-}
-
-source {
-  hdfs { ... }
-  elasticsearch { ... }
-  jdbc {...}
-}
-
-transform {
-  ...
-}
-
-sink {
-  elasticsearch { ... }
-}
-```
-
-## Are there any HBase plugins?
-
-There is a HBase input plugin. You can download it from here: https://github.com/garyelephant/waterdrop-input-hbase .
-
-## How can I use SeaTunnel to write data to Hive?
-
-```
-env {
-  spark.sql.catalogImplementation = "hive"
-  spark.hadoop.hive.exec.dynamic.partition = "true"
-  spark.hadoop.hive.exec.dynamic.partition.mode = "nonstrict"
-}
-
-source {
-  sql = "insert into ..."
-}
-
-sink {
-  // The data has been written to hive through the sql source. This is just a placeholder, it does not actually work.
-  stdout {
-    limit = 1
-  }
-}
-```
-
-In addition, SeaTunnel has implemented a `Hive` output plugin after version `1.5.7` in `1.x` branch; in `2.x` branch. The Hive plugin for the Spark engine has been supported from version `2.0.5`: https://github.com/apache/seatunnel/issues/910.
-
-## How does SeaTunnel write multiple instances of ClickHouse to achieve load balancing?
-
-1. Write distributed tables directly (not recommended)
-
-2. Add a proxy or domain name (DNS) in front of multiple instances of ClickHouse:
-
-   ```
-   {
-       output {
-           clickhouse {
-               host = "ck-proxy.xx.xx:8123"
-               # Local table
-               table = "table_name"
-           }
-       }
-   }
-   ```
-3. Configure multiple instances in the configuration:
-
-   ```
-   {
-       output {
-           clickhouse {
-               host = "ck1:8123,ck2:8123,ck3:8123"
-               # Local table
-               table = "table_name"
-           }
-       }
-   }
-   ```
-4. Use cluster mode:
-
-   ```
-   {
-       output {
-           clickhouse {
-               # Configure only one host
-               host = "ck1:8123"
-               cluster = "clickhouse_cluster_name"
-               # Local table
-               table = "table_name"
-           }
-       }
-   }
-   ```
-
-## How can I solve OOM when SeaTunnel consumes Kafka?
-
-In most cases, OOM is caused by not having a rate limit for consumption. The solution is as follows:
-
-For the current limit of Spark consumption of Kafka:
-
-1. Suppose the number of partitions of Kafka `Topic 1` you consume with KafkaStream = N.
-
-2. Assuming that the production speed of the message producer (Producer) of `Topic 1` is K messages/second, the speed of write messages to the partition must be uniform.
-
-3. Suppose that, after testing, it is found that the processing capacity of Spark Executor per core per second is M.
-
-The following conclusions can be drawn:
-
-1. If you want to make Spark's consumption of `Topic 1` keep up with its production speed, then you need `spark.executor.cores` * `spark.executor.instances` >= K / M
-
-2.
When a data delay occurs, if you want the consumption speed not to be too fast, resulting in spark executor OOM, then you need to configure `spark.streaming.kafka.maxRatePerPartition` <= (`spark.executor.cores` * `spark.executor.instances`) * M / N - -3. In general, both M and N are determined, and the conclusion can be drawn from 2: The size of `spark.streaming.kafka.maxRatePerPartition` is positively correlated with the size of `spark.executor.cores` * `spark.executor.instances`, and it can be increased while increasing the resource `maxRatePerPartition` to speed up consumption. - -![Kafka](../images/kafka.png) - -## How can I solve the Error `Exception in thread "main" java.lang.NoSuchFieldError: INSTANCE`? - -The reason is that the version of httpclient.jar that comes with the CDH version of Spark is lower, and The httpclient version that ClickHouse JDBC is based on is 4.5.2, and the package versions conflict. The solution is to replace the jar package that comes with CDH with the httpclient-4.5.2 version. - -## The default JDK of my Spark cluster is JDK7. After I install JDK8, how can I specify that SeaTunnel starts with JDK8? - -In SeaTunnel's config file, specify the following configuration: - -```shell -spark { - ... - spark.executorEnv.JAVA_HOME="/your/java_8_home/directory" - spark.yarn.appMasterEnv.JAVA_HOME="/your/java_8_home/directory" - ... + Sql { + query = "select * from user_view where city ='${city}' and dt = '${date}'" + } } +... ``` -## What should I do if OOM always appears when running SeaTunnel in Spark local[*] mode? - -If you run in local mode, you need to modify the `start-seatunnel.sh` startup script. After `spark-submit`, add a parameter `--driver-memory 4g` . Under normal circumstances, local mode is not used in the production environment. Therefore, this parameter generally does not need to be set during On YARN. See: [Application Properties](https://spark.apache.org/docs/latest/configuration.html#application-properties) for details. - -## Where can I place self-written plugins or third-party jdbc.jars to be loaded by SeaTunnel? - -Place the Jar package under the specified structure of the plugins directory: +To start SeaTunnel in Zeta Local mode with variables: ```bash -cd SeaTunnel -mkdir -p plugins/my_plugins/lib -cp third-part.jar plugins/my_plugins/lib +$SEATUNNEL_HOME/bin/seatunnel.sh \ +-c $SEATUNNEL_HOME/config/your_app.conf \ +-m local[2] \ +-i city=Singapore \ +-i date=20231110 ``` -`my_plugins` can be any string. - -## How do I configure logging-related parameters in SeaTunnel-V1(Spark)? - -There are three ways to configure logging-related parameters (such as Log Level): - -- [Not recommended] Change the default `$SPARK_HOME/conf/log4j.properties`. - - This will affect all programs submitted via `$SPARK_HOME/bin/spark-submit`. -- [Not recommended] Modify logging related parameters directly in the Spark code of SeaTunnel. - - This is equivalent to hardcoding, and each change needs to be recompiled. -- [Recommended] Use the following methods to change the logging configuration in the SeaTunnel configuration file (The change only takes effect if SeaTunnel >= 1.5.5 ): - - ``` - env { - spark.driver.extraJavaOptions = "-Dlog4j.configuration=file:/log4j.properties" - spark.executor.extraJavaOptions = "-Dlog4j.configuration=file:/log4j.properties" - } - source { - ... - } - transform { - ... - } - sink { - ... 
- } - ``` - -The contents of the log4j configuration file for reference are as follows: - -``` -$ cat log4j.properties -log4j.rootLogger=ERROR, console +Use the `-i` or `--variable` parameter with `key=value` to specify the variable's value, where `key` matches the variable name in the configuration. For details, see: [SeaTunnel Variable Configuration](https://seatunnel.apache.org/docs/concept/config) -# set the log level for these components -log4j.logger.org=ERROR -log4j.logger.org.apache.spark=ERROR -log4j.logger.org.spark-project=ERROR -log4j.logger.org.apache.hadoop=ERROR -log4j.logger.io.netty=ERROR -log4j.logger.org.apache.zookeeper=ERROR +## How can I write multi-line text in the configuration file? +If the text is long and needs to be wrapped, you can use triple quotes to indicate the beginning and end: -# add a ConsoleAppender to the logger stdout to write to the console -log4j.appender.console=org.apache.log4j.ConsoleAppender -log4j.appender.console.layout=org.apache.log4j.PatternLayout -# use a simple message format -log4j.appender.console.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n +```plaintext +var = """ +Apache SeaTunnel is a +next-generation high-performance, +distributed, massive data integration tool. +""" ``` -## How do I configure logging related parameters in SeaTunnel-V2(Spark, Flink)? - -Currently, they cannot be set directly. you need to modify the SeaTunnel startup script. The relevant parameters are specified in the task submission command. For specific parameters, please refer to the official documents: - -- Spark official documentation: http://spark.apache.org/docs/latest/configuration.html#configuring-logging -- Flink official documentation: https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/logging.html - -Reference: - -https://stackoverflow.com/questions/27781187/how-to-stop-info-messages-displaying-on-spark-console - -http://spark.apache.org/docs/latest/configuration.html#configuring-logging - -https://medium.com/@iacomini.riccardo/spark-logging-configuration-in-yarn-faf5ba5fdb01 - -## How do I configure logging related parameters of SeaTunnel-E2E Test? - -The log4j configuration file of `seatunnel-e2e` existed in `seatunnel-e2e/seatunnel-e2e-common/src/test/resources/log4j2.properties`. You can modify logging related parameters directly in the configuration file. - -For example, if you want to output more detailed logs of E2E Test, just downgrade `rootLogger.level` in the configuration file. - -## Error when writing to ClickHouse: ClassCastException - -In SeaTunnel, the data type will not be actively converted. After the Input reads the data, the corresponding -Schema. When writing ClickHouse, the field type needs to be strictly matched, and the mismatch needs to be resolved. - -Data conversion can be achieved through the following two plugins: +## How do I perform variable substitution in multi-line text? +Performing variable substitution in multi-line text can be tricky because variables cannot be enclosed within triple quotes: -1. Filter Convert plugin -2. Filter Sql plugin - -Detailed data type conversion reference: [ClickHouse Data Type Check List](https://interestinglab.github.io/seatunnel-docs/#/en/configuration/output-plugins/Clickhouse?id=clickhouse-data-type-check-list) - -Refer to issue:[#488](https://github.com/apache/seatunnel/issues/488) [#382](https://github.com/apache/seatunnel/issues/382). - -## How does SeaTunnel access kerberos-authenticated HDFS, YARN, Hive and other resources? 
- -Please refer to: [#590](https://github.com/apache/seatunnel/issues/590). - -## How do I troubleshoot NoClassDefFoundError, ClassNotFoundException and other issues? - -There is a high probability that there are multiple different versions of the corresponding Jar package class loaded in the Java classpath, because of the conflict of the load order, not because the Jar is really missing. Modify this SeaTunnel startup command, adding the following parameters to the spark-submit submission section, and debug in detail through the output log. - -``` -spark-submit --verbose - ... - --conf 'spark.driver.extraJavaOptions=-verbose:class' - --conf 'spark.executor.extraJavaOptions=-verbose:class' - ... +```plaintext +var = """ +your string 1 +"""${your_var}""" your string 2""" ``` -## I want to learn the source code of SeaTunnel. Where should I start? - -SeaTunnel has a completely abstract and structured code implementation, and many people have chosen SeaTunnel As a way to learn Spark. You can learn the source code from the main program entry: SeaTunnel.java - -## When SeaTunnel developers develop their own plugins, do they need to understand the SeaTunnel code? Should these plugins be integrated into the SeaTunnel project? - -The plugin developed by the developer has nothing to do with the SeaTunnel project and does not need to include your plugin code. +For more details, see: [lightbend/config#456](https://github.com/lightbend/config/issues/456). -The plugin can be completely independent from SeaTunnel project, so you can write it using Java, Scala, Maven, sbt, Gradle, or whatever you want. This is also the way we recommend developers to develop plugins. +## How do I configure logging parameters for SeaTunnel E2E Tests? +The log4j configuration file for `seatunnel-e2e` is located at `seatunnel-e2e/seatunnel-e2e-common/src/test/resources/log4j2.properties`. You can directly modify logging-related parameters in this configuration file. For example, to produce more detailed E2E Test logs, lower the `rootLogger.level` in the configuration file. -## When I import a project, the compiler has the exception "class not found `org.apache.seatunnel.shade.com.typesafe.config.Config`" +## Where should I start if I want to learn SeaTunnel source code? +SeaTunnel features a highly abstracted and well-structured architecture, making it an excellent choice for learning big data architecture. You can start by exploring and debugging the `seatunnel-examples` module: `SeaTunnelEngineLocalExample.java`. For more details, refer to the [SeaTunnel Contribution Guide](https://seatunnel.apache.org/docs/contribution/setup). -Run `mvn install` first. In the `seatunnel-config/seatunnel-config-base` subproject, the package `com.typesafe.config` has been relocated to `org.apache.seatunnel.shade.com.typesafe.config` and installed to the maven local repository in the subproject `seatunnel-config/seatunnel-config-shade`. +## Do I need to understand all of SeaTunnel’s source code if I want to develop my own source, sink, or transform? +No, you only need to focus on the interfaces for source, sink, and transform. If you want to develop your own connector (Connector V2) for the SeaTunnel API, refer to the **[Connector Development Guide](https://github.com/apache/seatunnel/blob/dev/seatunnel-connectors-v2/README.md)**. \ No newline at end of file diff --git a/docs/zh/faq.md b/docs/zh/faq.md index 4fc24e6a3ad..d1a705d6333 100644 --- a/docs/zh/faq.md +++ b/docs/zh/faq.md @@ -1,56 +1,112 @@ # 常见问题解答 -## 为什么要安装Spark或者Flink这样的计算引擎? 
- -SeaTunnel现在使用Spark、Flink等计算引擎来完成资源调度和节点通信,因此我们可以专注于数据同步的易用性和高性能组件的开发。 但这只是暂时的。 +## SeaTunnel 支持哪些数据来源和数据目的地? +SeaTunnel 支持多种数据源来源和数据目的地,您可以在官网找到详细的列表: +SeaTunnel 支持的数据来源(Source)列表:https://seatunnel.apache.org/docs/connector-v2/source +SeaTunnel 支持的数据目的地(Sink)列表:https://seatunnel.apache.org/docs/connector-v2/sink + +## SeaTunnel 是否支持批处理和流处理? +SeaTunnel 支持批流一体,SeaTunnel 可以设置批处理和流处理两种模式。您可以根据具体的业务场景和需求选择合适的处理模式。批处理适合定时数据同步场景,而流处理适合实时同步和数据变更捕获 (CDC) 场景。 + +## 使用 SeaTunnel 需要安装 Spark 或者 Flink 这样的引擎么? +Spark 和 Flink 不是必需的,SeaTunnel 可以支持 Zeta、Spark 和 Flink 3 种作为同步引擎的选择,您可以选择之一就行,社区尤其推荐使用 Zeta 这种专为同步场景打造的新一代超高性能同步引擎。Zeta 被社区用户亲切的称为 “泽塔奥特曼”! +社区对 Zeta 的支持力度是最大的,功能也更丰富。 + +## SeaTunnel 支持的数据转换功能有哪些? +SeaTunnel 支持多种数据转换功能,包括字段映射、数据过滤、数据格式转换等。可以通过在配置文件中定义 `transform` 模块来实现数据转换。详情请参考 SeaTunnel [Transform 文档](https://seatunnel.apache.org/docs/transform-v2)。 + +## SeaTunnel 是否可以自定义数据清洗规则? +SeaTunnel 支持自定义数据清洗规则。可以在 `transform` 模块中配置自定义规则,例如清理脏数据、删除无效记录或字段转换。 + +## SeaTunnel 是否支持实时增量同步? +SeaTunnel 支持增量数据同步。例如通过 CDC 连接器实现对数据库的增量同步,适用于需要实时捕获数据变更的场景。 + +## SeaTunnel 目前支持哪些数据源的 CDC ? +目前支持 MongoDB CDC、MySQL CDC、Opengauss CDC、Oracle CDC、PostgreSQL CDC、Sql Server CDC、TiDB CDC等,更多请查阅[Source](https://seatunnel.apache.org/docs/connector-v2/source)。 + +## SeaTunnel CDC 同步需要的权限如何开启? +这样就可以了。 +这里多说一句,连接器对应的 cdc 权限开启步骤在官网都有写,请参照 SeaTunnel 对应的官网操作即可 + +## SeaTunnel 支持从 MySQL 备库进行 CDC 么?日志如何拉取? +支持,是通过订阅 MySQL binlog 日志方式到同步服务器上解析 binlog 日志方式进行 + +## SeaTunnel 是否支持无主键表的 CDC 同步? +不支持无主键表的 cdc 同步。原因如下: +比如上游有 2 条一模一样的数据,然后上游删除或修改了一条,下游由于无法区分到底是哪条需要删除或修改,会出现这 2 条都被删除或修改的情况。 +没主键要类似去重的效果本身有点儿自相矛盾,就像辨别西游记里的真假悟空,到底哪个是真的 + +## SeaTunnel 对数据来源(source)或数据目标(sink)发生变更时如何处理? +在数据源或数据目的地结构发生变化时,SeaTunnel 提供多种应对机制,例如自动检测和更新表结构 (schema) 或定制数据映射规则。您可以根据实际需求调整 `schema_save_mode` 或 `data_save_mode` 的配置参数来控制变更处理。 +可以参考下面 2 个问题的回答,了解更多关于 `schema_save_mode` 和 `data_save_mode` 的信息。 + +## SeaTunnel 是否支持自动建表? +在同步任务启动之前,可以为目标端已有的表结构选择不同的处理方案。是通过 `schema_save_mode` 参数来控制的。 +`schema_save_mode` 有以下几种方式可选: +- **`RECREATE_SCHEMA`**:当表不存在时会创建,若表已存在则删除并重新创建。 +- **`CREATE_SCHEMA_WHEN_NOT_EXIST`**:当表不存在时会创建,若表已存在则跳过创建。 +- **`ERROR_WHEN_SCHEMA_NOT_EXIST`**:当表不存在时会报错。 +- **`IGNORE`**:忽略对表的处理。 + 目前很多 connector 已经支持了自动建表,请参考对应的 connector 文档,这里拿 Jdbc 举例,请参考 [Jdbc sink](https://seatunnel.apache.org/docs/2.3.8/connector-v2/sink/Jdbc#schema_save_mode-enum) + +## SeaTunnel 是否支持数据同步任务开始前对已有数据进行处理? +在同步任务启动之前,可以为目标端已有的数据选择不同的处理方案。是通过 `data_save_mode` 参数来控制的。 +`data_save_mode` 有以下几种可选项: +- **`DROP_DATA`**:保留数据库结构,删除数据。 +- **`APPEND_DATA`**:保留数据库结构,保留数据。 +- **`CUSTOM_PROCESSING`**:用户自定义处理。 +- **`ERROR_WHEN_DATA_EXISTS`**:当存在数据时,报错。 + 目前很多 connector 已经支持了对已有数据进行处理,请参考对应的 connector 文档,这里拿 Jdbc 举例,请参考 [Jdbc sink](https://seatunnel.apache.org/docs/2.3.8/connector-v2/sink/Jdbc#data_save_mode-enum) + +## SeaTunnel 是否支持精确一致性管理? +SeaTunnel 支持一部分数据源的精确一致性,例如支持 MySQL、PostgreSQL 等数据库的事务写入,确保数据在同步过程中的一致性,另外精确一致性也要看数据库本身是否可以支持 + +## SeaTunnel 可以定期执行任务吗? +您可以通过使用 linux 自带 cron 能力来实现定时数据同步任务,也可以结合 DolphinScheduler 等调度工具实现复杂的定时任务管理。 ## 我有一个问题,我自己无法解决 - -我在使用SeaTunnel时遇到了问题,无法自行解决。 我应该怎么办? 
首先,在[问题列表](https://github.com/apache/seatunnel/issues)或[邮件列表](https://lists.apache.org/list.html?dev@seatunnel.apache.org)中搜索 )看看是否有人已经问过同样的问题并得到答案。 如果您找不到问题的答案,您可以通过[这些方式](https://github.com/apache/seatunnel#contact-us)联系社区成员寻求帮助。 +我在使用 SeaTunnel 时遇到了问题,无法自行解决。 我应该怎么办?有以下几种方式 +1、在[问题列表](https://github.com/apache/seatunnel/issues)或[邮件列表](https://lists.apache.org/list.html?dev@seatunnel.apache.org)中搜索看看是否有人已经问过同样的问题并得到答案。 +2、如果您找不到问题的答案,您可以通过[这些方式](https://github.com/apache/seatunnel#contact-us)联系社区成员寻求帮助。 +3、中国用户可以添加微信群助手:seatunnel1,加入社区交流群,也欢迎大家关注微信公众号:seatunnel。 ## 如何声明变量? - -您想知道如何在 SeaTunnel 的配置中声明一个变量,然后在运行时动态替换该变量的值吗? - -从“v1.2.4”开始,SeaTunnel 支持配置中的变量替换。 该功能常用于定时或非定时离线处理,以替代时间、日期等变量。 用法如下: - +您想知道如何在 SeaTunnel 的配置中声明一个变量,然后在运行时动态替换该变量的值吗? 该功能常用于定时或非定时离线处理,以替代时间、日期等变量。 用法如下: 在配置中配置变量名称。 下面是一个sql转换的例子(实际上,配置文件中任何地方“key = value”中的值都可以使用变量替换): - ``` ... transform { - sql { - query = "select * from user_view where city ='"${city}"' and dt = '"${date}"'" + Sql { + query = "select * from user_view where city ='${city}' and dt = '${date}'" } } ... ``` -以Spark Local模式为例,启动命令如下: +以使用 SeaTunnel Zeta Local模式为例,启动命令如下: ```bash -./bin/start-seatunnel-spark.sh \ --c ./config/your_app.conf \ --e client \ +$SEATUNNEL_HOME/bin/seatunnel.sh \ +-c $SEATUNNEL_HOME/config/your_app.conf \ -m local[2] \ --i city=shanghai \ --i date=20190319 +-i city=Singapore \ +-i date=20231110 ``` -您可以使用参数“-i”或“--variable”后跟“key=value”来指定变量的值,其中key需要与配置中的变量名称相同。 +您可以使用参数“-i”或“--variable”后跟“key=value”来指定变量的值,其中key需要与配置中的变量名称相同。详情可以参考:https://seatunnel.apache.org/docs/concept/config ## 如何在配置文件中写入多行文本的配置项? - -当配置的文本很长并且想要将其换行时,可以使用三个双引号来指示其开始和结束: +当配置的文本很长并且想要将其换行时,您可以使用三个双引号来指示其开始和结束: ``` var = """ - whatever you want +Apache SeaTunnel is a +next-generation high-performance, +distributed, massive data integration tool. """ ``` ## 如何实现多行文本的变量替换? - 在多行文本中进行变量替换有点麻烦,因为变量不能包含在三个双引号中: ``` @@ -61,273 +117,17 @@ your string 1 请参阅:[lightbend/config#456](https://github.com/lightbend/config/issues/456)。 -## Azkaban、Oozie、DolphinScheduler 是否支持 SeaTunnel? - -当然! 请参阅下面的屏幕截图: - -![工作流程.png](../images/workflow.png) - -![azkaban.png](../images/azkaban.png) - -## SeaTunnel是否有配置多个源的情况,例如同时在源中配置elasticsearch和hdfs? - -``` -env { - ... -} - -source { - hdfs { ... } - elasticsearch { ... } - jdbc {...} -} - -transform { - ... -} - -sink { - elasticsearch { ... } -} -``` - -## 有 HBase 插件吗? - -有一个 HBase 输入插件。 您可以从这里下载:https://github.com/garyelephant/waterdrop-input-hbase - -## 如何使用SeaTunnel将数据写入Hive? - -``` -env { - spark.sql.catalogImplementation = "hive" - spark.hadoop.hive.exec.dynamic.partition = "true" - spark.hadoop.hive.exec.dynamic.partition.mode = "nonstrict" -} - -source { - sql = "insert into ..." -} - -sink { - // The data has been written to hive through the sql source. This is just a placeholder, it does not actually work. - stdout { - limit = 1 - } -} -``` - -此外,SeaTunnel 在 `1.5.7` 版本之后在 `1.x` 分支中实现了 `Hive` 输出插件; 在“2.x”分支中。 Spark 引擎的 Hive 插件已从版本“2.0.5”开始支持:https://github.com/apache/seatunnel/issues/910。 - -## SeaTunnel如何编写ClickHouse的多个实例来实现负载均衡? - -1.直接写分布式表(不推荐) - -2.在ClickHouse的多个实例前面添加代理或域名(DNS): - -``` -{ - output { - clickhouse { - host = "ck-proxy.xx.xx:8123" - # Local table - table = "table_name" - } - } -} -``` - -3. 在配置文件中配置多个ClickHouse实例: - - ``` - { - output { - clickhouse { - host = "ck1:8123,ck2:8123,ck3:8123" - # Local table - table = "table_name" - } - } - } - ``` -4. 
使用集群模式: - - ``` - { - output { - clickhouse { - # Configure only one host - host = "ck1:8123" - cluster = "clickhouse_cluster_name" - # Local table - table = "table_name" - } - } - } - ``` - -## SeaTunnel 消费 Kafka 时如何解决 OOM? - -大多数情况下,OOM是由于没有对消费进行速率限制而导致的。 解决方法如下: - -对于目前Kafka的Spark消费限制: - -1. 假设您使用 KafkaStream 消费的 Kafka `Topic 1` 的分区数量 = N。 - -2. 假设“Topic 1”的消息生产者(Producer)的生产速度为K条消息/秒,则向分区写入消息的速度必须一致。 - -3、假设经过测试发现Spark Executor每核每秒的处理能力为M。 - -可以得出以下结论: - -1、如果想让Spark对`Topic 1`的消耗跟上它的生产速度,那么需要 `spark.executor.cores` * `spark.executor.instances` >= K / M - -2、当出现数据延迟时,如果希望消耗速度不要太快,导致spark执行器OOM,那么需要配置 `spark.streaming.kafka.maxRatePerPartition` <= (`spark.executor.cores` * `spark.executor.instances`) * M / N - -3、一般来说,M和N都确定了,从2可以得出结论:`spark.streaming.kafka.maxRatePerPartition`的大小与`spark.executor.cores` * `spark的大小正相关 .executor.instances`,可以在增加资源`maxRatePerPartition`的同时增加,以加快消耗。 - -![Kafka](../images/kafka.png) - -## 如何解决错误 `Exception in thread "main" java.lang.NoSuchFieldError: INSTANCE`? - -原因是Spark的CDH版本自带的httpclient.jar版本较低,而ClickHouse JDBC基于的httpclient版本是4.5.2,包版本冲突。 解决办法是将CDH自带的jar包替换为httpclient-4.5.2版本。 - -## 我的Spark集群默认的JDK是JDK7。 安装JDK8后,如何指定SeaTunnel以JDK8启动? - -在 SeaTunnel 的配置文件中,指定以下配置: - -```shell -spark { - ... - spark.executorEnv.JAVA_HOME="/your/java_8_home/directory" - spark.yarn.appMasterEnv.JAVA_HOME="/your/java_8_home/directory" - ... -} -``` - -## Spark local[*]模式运行SeaTunnel时总是出现OOM怎么办? - -如果以本地模式运行,则需要修改`start-seatunnel.sh`启动脚本。 在 `spark-submit` 之后添加参数 `--driver-memory 4g` 。 一般情况下,生产环境中不使用本地模式。 因此,On YARN时一般不需要设置该参数。 有关详细信息,请参阅:[应用程序属性](https://spark.apache.org/docs/latest/configuration.html#application-properties)。 - -## 我可以在哪里放置自己编写的插件或第三方 jdbc.jar 以供 SeaTunnel 加载? - -将Jar包放置在plugins目录指定结构下: - -```bash -cd SeaTunnel -mkdir -p plugins/my_plugins/lib -cp third-part.jar plugins/my_plugins/lib -``` - -`my_plugins` 可以是任何字符串。 - -## 如何在 SeaTunnel-V1(Spark) 中配置日志记录相关参数? - -可以通过三种方式配置日志相关参数(例如日志级别): - -- [不推荐] 更改默认的`$SPARK_HOME/conf/log4j.properties`。 - - 这将影响通过 `$SPARK_HOME/bin/spark-submit` 提交的所有程序。 -- [不推荐]直接在SeaTunnel的Spark代码中修改日志相关参数。 - - 这相当于写死了,每次改变都需要重新编译。 -- [推荐] 使用以下方法更改 SeaTunnel 配置文件中的日志记录配置(更改仅在 SeaTunnel >= 1.5.5 时生效): - - ``` - env { - spark.driver.extraJavaOptions = "-Dlog4j.configuration=file:/log4j.properties" - spark.executor.extraJavaOptions = "-Dlog4j.configuration=file:/log4j.properties" - } - source { - ... - } - transform { - ... - } - sink { - ... - } - ``` - -可供参考的log4j配置文件内容如下: - -``` -$ cat log4j.properties -log4j.rootLogger=ERROR, console - -# set the log level for these components -log4j.logger.org=ERROR -log4j.logger.org.apache.spark=ERROR -log4j.logger.org.spark-project=ERROR -log4j.logger.org.apache.hadoop=ERROR -log4j.logger.io.netty=ERROR -log4j.logger.org.apache.zookeeper=ERROR - -# add a ConsoleAppender to the logger stdout to write to the console -log4j.appender.console=org.apache.log4j.ConsoleAppender -log4j.appender.console.layout=org.apache.log4j.PatternLayout -# use a simple message format -log4j.appender.console.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n -``` - -## 如何在 SeaTunnel-V2(Spark、Flink) 中配置日志记录相关参数? 
- -目前,无法直接设置它们。 您需要修改SeaTunnel启动脚本。 相关参数在任务提交命令中指定。 具体参数请参考官方文档: - -- Spark官方文档:http://spark.apache.org/docs/latest/configuration.html#configuring-logging -- Flink 官方文档:https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/logging.html - -参考: - -https://stackoverflow.com/questions/27781187/how-to-stop-info-messages-displaying-on-spark-console - -http://spark.apache.org/docs/latest/configuration.html#configuring-logging - -https://medium.com/@iacomini.riccardo/spark-logging-configuration-in-yarn-faf5ba5fdb01 - -## 如何配置SeaTunnel-E2E Test的日志记录相关参数? - +## 如何配置 SeaTunnel-E2E Test 的日志记录相关参数? `seatunnel-e2e` 的 log4j 配置文件位于 `seatunnel-e2e/seatunnel-e2e-common/src/test/resources/log4j2.properties` 中。 您可以直接在配置文件中修改日志记录相关参数。 - 例如,如果您想输出更详细的E2E Test日志,只需将配置文件中的“rootLogger.level”降级即可。 -## 写入 ClickHouse 时出错:ClassCastException - -在SeaTunnel中,不会主动转换数据类型。 Input读取数据后,对应的 -架构。 编写ClickHouse时,需要严格匹配字段类型,不匹配的情况需要解决。 - -数据转换可以通过以下两个插件实现: - -1.过滤器转换插件 -2.过滤Sql插件 - -详细数据类型转换参考:[ClickHouse数据类型检查列表](https://interestinglab.github.io/seatunnel-docs/#/en/configuration/output-plugins/Clickhouse?id=clickhouse-data-type-check-list) - -请参阅问题:[#488](https://github.com/apache/seatunnel/issues/488)[#382](https://github.com/apache/seatunnel/issues/382)。 - -## SeaTunnel 如何访问经过 kerberos 验证的 HDFS、YARN、Hive 等资源? - -请参考:[#590](https://github.com/apache/seatunnel/issues/590)。 - -## 如何排查 NoClassDefFoundError、ClassNotFoundException 等问题? - -有很大概率是Java类路径中加载了多个不同版本的对应Jar包类,是因为加载顺序冲突,而不是因为Jar确实丢失了。 修改这条SeaTunnel启动命令,在spark-submit提交部分添加如下参数,通过输出日志进行详细调试。 - -``` -spark-submit --verbose - ... - --conf 'spark.driver.extraJavaOptions=-verbose:class' - --conf 'spark.executor.extraJavaOptions=-verbose:class' - ... -``` - -## 我想学习SeaTunnel的源代码。 我应该从哪里开始? - -SeaTunnel 拥有完全抽象、结构化的代码实现,很多人都选择 SeaTunnel 作为学习 Spark 的方式。 您可以从主程序入口了解源代码:SeaTunnel.java - -## SeaTunnel开发者开发自己的插件时,是否需要了解SeaTunnel代码? 这些插件是否应该集成到 SeaTunnel 项目中? - -开发者开发的插件与SeaTunnel项目无关,不需要包含您的插件代码。 +## 如果想学习 SeaTunnel 的源代码,应该从哪里开始? +SeaTunnel 拥有完全抽象、结构化的非常优秀的架构设计和代码实现,很多用户都选择 SeaTunnel 作为学习大数据架构的方式。 您可以从`seatunnel-examples`模块开始了解和调试源代码:SeaTunnelEngineLocalExample.java +具体参考:https://seatunnel.apache.org/docs/contribution/setup +针对中国用户,如果有伙伴想贡献自己的一份力量让 SeaTunnel 更好,特别欢迎加入社区贡献者种子群,欢迎添加微信:davidzollo,添加时请注明 "参与开源共建"。 -该插件可以完全独立于 SeaTunnel 项目,因此您可以使用 Java、Scala、Maven、sbt、Gradle 或任何您想要的方式编写它。 这也是我们推荐开发者开发插件的方式。 +## 如果想开发自己的 source、sink、transform 时,是否需要了解 SeaTunnel 所有源代码? +不需要,您只需要关注 source、sink、transform 对应的接口即可。 +如果你想针对 SeaTunnel API 开发自己的连接器(Connector V2),请查看**[Connector Development Guide](https://github.com/apache/seatunnel/blob/dev/seatunnel-connectors-v2/README.zh.md)** 。 -## 当我导入项目时,编译器出现异常“找不到类`org.apache.seatunnel.shade.com.typesafe.config.Config`” -首先运行“mvn install”。 在 `seatunnel-config/seatunnel-config-base` 子项目中,包 `com.typesafe.config` 已重新定位到 `org.apache.seatunnel.shade.com.typesafe.config` 并安装到 maven 本地存储库 在子项目 `seatunnel-config/seatunnel-config-shade` 中。 From a42df6ff7b1eb05914ce9dd48f6d4a228ccad842 Mon Sep 17 00:00:00 2001 From: David Zollo Date: Wed, 13 Nov 2024 23:25:52 +0800 Subject: [PATCH 2/8] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 168afd28e30..132497c5d6f 100644 --- a/README.md +++ b/README.md @@ -147,4 +147,4 @@ Our [Official Documentation](https://seatunnel.apache.org/docs) includes detaile ### 6. Is there a community or support channel? Join our Slack community for support and discussions: [SeaTunnel Slack](https://s.apache.org/seatunnel-slack). 
-more information, please refer to [FAQ](https://seatunnel.apache.org/docs/faq). \ No newline at end of file +more information, please refer to [FAQ](https://seatunnel.apache.org/docs/faq). From 17c3d0da703d91e57c6d0fa634e37ddffe43690b Mon Sep 17 00:00:00 2001 From: David Zollo Date: Fri, 15 Nov 2024 09:52:24 +0800 Subject: [PATCH 3/8] Update README.md Co-authored-by: Jia Fan --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 132497c5d6f..27cb1da56a3 100644 --- a/README.md +++ b/README.md @@ -147,4 +147,4 @@ Our [Official Documentation](https://seatunnel.apache.org/docs) includes detaile ### 6. Is there a community or support channel? Join our Slack community for support and discussions: [SeaTunnel Slack](https://s.apache.org/seatunnel-slack). -more information, please refer to [FAQ](https://seatunnel.apache.org/docs/faq). +More information, please refer to [FAQ](https://seatunnel.apache.org/docs/faq). From 3240df6545041db71dcd988d6664f365eec35c9c Mon Sep 17 00:00:00 2001 From: David Zollo Date: Fri, 15 Nov 2024 09:52:45 +0800 Subject: [PATCH 4/8] Update docs/en/faq.md Co-authored-by: Jia Fan --- docs/en/faq.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/en/faq.md b/docs/en/faq.md index 735dc3f7a69..e7d2766f0a6 100644 --- a/docs/en/faq.md +++ b/docs/en/faq.md @@ -30,7 +30,7 @@ Please refer to the official SeaTunnel documentation for the necessary steps to Yes, SeaTunnel supports CDC from MySQL replicas by subscribing to binlog logs, which are then parsed on the SeaTunnel server. ## Does SeaTunnel support CDC integration for tables without primary keys? -No, SeaTunnel does not support CDC integration for tables without primary keys. This is because, in cases where two identical records exist in the upstream and one is deleted or modified, the downstream cannot determine which record to delete or modify, leading to potential issues. Having primary keys is essential for ensuring data uniqueness, similar to identifying the real Monkey King in the classic "Journey to the West." +SeaTunnel does not support CDC integration for tables without primary keys. The reason is that if two identical records exist in the upstream and one is deleted or modified, the downstream cannot determine which record to delete or modify, leading to potential issues. Primary keys are essential to ensure data uniqueness. ## How does SeaTunnel handle changes in data sources (source) or data destinations (sink)? When the structure of a data source or destination changes, SeaTunnel provides various mechanisms to adapt, such as automatically detecting and updating the schema or configuring data mapping rules. You can adjust the `schema_save_mode` or `data_save_mode` parameters to control how these changes are handled based on your needs. From 7b9de17945adf62c29f9b2f1e3e610c1f196836c Mon Sep 17 00:00:00 2001 From: David Zollo Date: Fri, 15 Nov 2024 09:53:02 +0800 Subject: [PATCH 5/8] Update docs/en/faq.md Co-authored-by: Jia Fan --- docs/en/faq.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/docs/en/faq.md b/docs/en/faq.md index e7d2766f0a6..f4874fae347 100644 --- a/docs/en/faq.md +++ b/docs/en/faq.md @@ -113,8 +113,6 @@ your string 1 For more details, see: [lightbend/config#456](https://github.com/lightbend/config/issues/456). -## How do I configure logging parameters for SeaTunnel E2E Tests? -The log4j configuration file for `seatunnel-e2e` is located at `seatunnel-e2e/seatunnel-e2e-common/src/test/resources/log4j2.properties`. 
You can directly modify logging-related parameters in this configuration file. For example, to produce more detailed E2E Test logs, lower the `rootLogger.level` in the configuration file. ## Where should I start if I want to learn SeaTunnel source code? SeaTunnel features a highly abstracted and well-structured architecture, making it an excellent choice for learning big data architecture. You can start by exploring and debugging the `seatunnel-examples` module: `SeaTunnelEngineLocalExample.java`. For more details, refer to the [SeaTunnel Contribution Guide](https://seatunnel.apache.org/docs/contribution/setup). From 60981078954bf020327cd85b4bb85a21d86aca0e Mon Sep 17 00:00:00 2001 From: David Zollo Date: Fri, 15 Nov 2024 09:59:21 +0800 Subject: [PATCH 6/8] Update faq.md --- docs/zh/faq.md | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/docs/zh/faq.md b/docs/zh/faq.md index d1a705d6333..9bd40290bf6 100644 --- a/docs/zh/faq.md +++ b/docs/zh/faq.md @@ -47,7 +47,7 @@ SeaTunnel 支持增量数据同步。例如通过 CDC 连接器实现对数据 - **`CREATE_SCHEMA_WHEN_NOT_EXIST`**:当表不存在时会创建,若表已存在则跳过创建。 - **`ERROR_WHEN_SCHEMA_NOT_EXIST`**:当表不存在时会报错。 - **`IGNORE`**:忽略对表的处理。 - 目前很多 connector 已经支持了自动建表,请参考对应的 connector 文档,这里拿 Jdbc 举例,请参考 [Jdbc sink](https://seatunnel.apache.org/docs/2.3.8/connector-v2/sink/Jdbc#schema_save_mode-enum) + 目前很多 connector 已经支持了自动建表,请参考对应的 connector 文档,这里拿 Jdbc 举例,请参考 [Jdbc sink](https://seatunnel.apache.org/docs/connector-v2/sink/Jdbc#schema_save_mode-enum) ## SeaTunnel 是否支持数据同步任务开始前对已有数据进行处理? 在同步任务启动之前,可以为目标端已有的数据选择不同的处理方案。是通过 `data_save_mode` 参数来控制的。 @@ -56,7 +56,7 @@ SeaTunnel 支持增量数据同步。例如通过 CDC 连接器实现对数据 - **`APPEND_DATA`**:保留数据库结构,保留数据。 - **`CUSTOM_PROCESSING`**:用户自定义处理。 - **`ERROR_WHEN_DATA_EXISTS`**:当存在数据时,报错。 - 目前很多 connector 已经支持了对已有数据进行处理,请参考对应的 connector 文档,这里拿 Jdbc 举例,请参考 [Jdbc sink](https://seatunnel.apache.org/docs/2.3.8/connector-v2/sink/Jdbc#data_save_mode-enum) + 目前很多 connector 已经支持了对已有数据进行处理,请参考对应的 connector 文档,这里拿 Jdbc 举例,请参考 [Jdbc sink](https://seatunnel.apache.org/docs/connector-v2/sink/Jdbc#data_save_mode-enum) ## SeaTunnel 是否支持精确一致性管理? SeaTunnel 支持一部分数据源的精确一致性,例如支持 MySQL、PostgreSQL 等数据库的事务写入,确保数据在同步过程中的一致性,另外精确一致性也要看数据库本身是否可以支持 @@ -117,14 +117,11 @@ your string 1 请参阅:[lightbend/config#456](https://github.com/lightbend/config/issues/456)。 -## 如何配置 SeaTunnel-E2E Test 的日志记录相关参数? -`seatunnel-e2e` 的 log4j 配置文件位于 `seatunnel-e2e/seatunnel-e2e-common/src/test/resources/log4j2.properties` 中。 您可以直接在配置文件中修改日志记录相关参数。 -例如,如果您想输出更详细的E2E Test日志,只需将配置文件中的“rootLogger.level”降级即可。 ## 如果想学习 SeaTunnel 的源代码,应该从哪里开始? SeaTunnel 拥有完全抽象、结构化的非常优秀的架构设计和代码实现,很多用户都选择 SeaTunnel 作为学习大数据架构的方式。 您可以从`seatunnel-examples`模块开始了解和调试源代码:SeaTunnelEngineLocalExample.java 具体参考:https://seatunnel.apache.org/docs/contribution/setup -针对中国用户,如果有伙伴想贡献自己的一份力量让 SeaTunnel 更好,特别欢迎加入社区贡献者种子群,欢迎添加微信:davidzollo,添加时请注明 "参与开源共建"。 +针对中国用户,如果有伙伴想贡献自己的一份力量让 SeaTunnel 更好,特别欢迎加入社区贡献者种子群,欢迎添加微信:davidzollo,添加时请注明 "参与开源共建", 群仅仅用于技术交流, 重要的事情讨论还请发到 dev@seatunnel.apache.org 邮件里进行讨论。 ## 如果想开发自己的 source、sink、transform 时,是否需要了解 SeaTunnel 所有源代码? 
不需要,您只需要关注 source、sink、transform 对应的接口即可。 From 3c00baf6348bf97e1a50a05d4bb8691d83929bed Mon Sep 17 00:00:00 2001 From: David Zollo Date: Fri, 15 Nov 2024 10:01:19 +0800 Subject: [PATCH 7/8] Update faq.md --- docs/en/faq.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/en/faq.md b/docs/en/faq.md index f4874fae347..4a06c6a91ae 100644 --- a/docs/en/faq.md +++ b/docs/en/faq.md @@ -57,7 +57,7 @@ Yes, you can specify different processing schemes for existing data on the targe SeaTunnel supports exactly-once consistency for some data sources, such as MySQL and PostgreSQL, ensuring data consistency during integration. Note that exactly-once consistency depends on the capabilities of the underlying database. ## Can SeaTunnel execute scheduled tasks? -You can use Linux cron jobs to achieve periodic data integration, or leverage scheduling tools like DolphinScheduler to manage complex scheduled tasks. +You can use Linux cron jobs to achieve periodic data integration, or leverage scheduling tools like Apache DolphinScheduler or Apache Airflow to manage complex scheduled tasks. ## I encountered an issue with SeaTunnel that I cannot resolve. What should I do? If you encounter issues with SeaTunnel, here are a few ways to get help: @@ -118,4 +118,4 @@ For more details, see: [lightbend/config#456](https://github.com/lightbend/confi SeaTunnel features a highly abstracted and well-structured architecture, making it an excellent choice for learning big data architecture. You can start by exploring and debugging the `seatunnel-examples` module: `SeaTunnelEngineLocalExample.java`. For more details, refer to the [SeaTunnel Contribution Guide](https://seatunnel.apache.org/docs/contribution/setup). ## Do I need to understand all of SeaTunnel’s source code if I want to develop my own source, sink, or transform? -No, you only need to focus on the interfaces for source, sink, and transform. If you want to develop your own connector (Connector V2) for the SeaTunnel API, refer to the **[Connector Development Guide](https://github.com/apache/seatunnel/blob/dev/seatunnel-connectors-v2/README.md)**. \ No newline at end of file +No, you only need to focus on the interfaces for source, sink, and transform. If you want to develop your own connector (Connector V2) for the SeaTunnel API, refer to the **[Connector Development Guide](https://github.com/apache/seatunnel/blob/dev/seatunnel-connectors-v2/README.md)**. From 61cf37013044218f56c8bf7b8db15801e45c471c Mon Sep 17 00:00:00 2001 From: Jia Fan Date: Fri, 15 Nov 2024 19:11:55 +0800 Subject: [PATCH 8/8] update --- docs/en/faq.md | 5 ----- docs/zh/faq.md | 4 ---- 2 files changed, 9 deletions(-) diff --git a/docs/en/faq.md b/docs/en/faq.md index 4a06c6a91ae..6a4e838eaed 100644 --- a/docs/en/faq.md +++ b/docs/en/faq.md @@ -32,11 +32,6 @@ Yes, SeaTunnel supports CDC from MySQL replicas by subscribing to binlog logs, w ## Does SeaTunnel support CDC integration for tables without primary keys? SeaTunnel does not support CDC integration for tables without primary keys. The reason is that if two identical records exist in the upstream and one is deleted or modified, the downstream cannot determine which record to delete or modify, leading to potential issues. Primary keys are essential to ensure data uniqueness. -## How does SeaTunnel handle changes in data sources (source) or data destinations (sink)? 
-When the structure of a data source or destination changes, SeaTunnel provides various mechanisms to adapt, such as automatically detecting and updating the schema or configuring data mapping rules. You can adjust the `schema_save_mode` or `data_save_mode` parameters to control how these changes are handled based on your needs. - -For more details, refer to the answers on `schema_save_mode` and `data_save_mode` below. - ## Does SeaTunnel support automatic table creation? Before starting an integration task, you can select different handling schemes for existing table structures on the target side, controlled via the `schema_save_mode` parameter. Available options include: - **`RECREATE_SCHEMA`**: Creates the table if it does not exist; if the table exists, it is deleted and recreated. diff --git a/docs/zh/faq.md b/docs/zh/faq.md index 9bd40290bf6..26867e4a188 100644 --- a/docs/zh/faq.md +++ b/docs/zh/faq.md @@ -36,10 +36,6 @@ SeaTunnel 支持增量数据同步。例如通过 CDC 连接器实现对数据 比如上游有 2 条一模一样的数据,然后上游删除或修改了一条,下游由于无法区分到底是哪条需要删除或修改,会出现这 2 条都被删除或修改的情况。 没主键要类似去重的效果本身有点儿自相矛盾,就像辨别西游记里的真假悟空,到底哪个是真的 -## SeaTunnel 对数据来源(source)或数据目标(sink)发生变更时如何处理? -在数据源或数据目的地结构发生变化时,SeaTunnel 提供多种应对机制,例如自动检测和更新表结构 (schema) 或定制数据映射规则。您可以根据实际需求调整 `schema_save_mode` 或 `data_save_mode` 的配置参数来控制变更处理。 -可以参考下面 2 个问题的回答,了解更多关于 `schema_save_mode` 和 `data_save_mode` 的信息。 - ## SeaTunnel 是否支持自动建表? 在同步任务启动之前,可以为目标端已有的表结构选择不同的处理方案。是通过 `schema_save_mode` 参数来控制的。 `schema_save_mode` 有以下几种方式可选: