
[Bug] [Engine] Frequent "TaskGroupLocation: TaskGroupLocation xxx already exists" error during task execution, causing task failure. #8566

kwonder0926 opened this issue Jan 21, 2025 · 2 comments


kwonder0926 commented Jan 21, 2025

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

This issue occurs on both versions 2.3.8 and 2.3.9.
From the logs, the task starts executing normally for a short period, then a worker suddenly disconnects for a reason I cannot determine, which leads to a timeout exception. The worker nodes then reinitialize the cluster and the master node resubmits the task to them. At that point an exception is thrown saying the task group already exists, and the task fails.

I will attach the key parts of my logs below for further investigation.

SeaTunnel Version

2.3.8, 2.3.9

SeaTunnel Config

env {
  # You can set SeaTunnel environment configuration here
  parallelism = 2
  job.mode = "STREAMING"
  checkpoint.interval = 2000
}

source {
  MongoDB-CDC {
    hosts = "10.xxx.xxx.xx:27017,10.xxx.xxx.xx:27017,10.xxx.xxx.xxx:27017"
    database = ["letterMsg"]
    collection = ["letterMsg.2024-05-send_detail","letterMsg.2024-04-send_detail"]
    username = ******
    password = "*************"
    connection.options = "replicaSet=rep1&connectTimeoutMS=300000" 
    tables_configs = [
      {
        schema {
          table = "letterMsg.2024-05-send_detail"
          fields {
            _id = STRING,
            order_id = STRING,
            phone_num = STRING,
            message_context = STRING,
            channel_no = STRING,
            send_status = STRING,
            send_time = TIMESTAMP,
            back_status = STRING,
            create_time = TIMESTAMP,
            urgent_lv = STRING,
            sign = STRING,
            sign_id = INT,
            query_report_flag = INT,
            isp_id = INT,
            user_id = BIGINT,
            enterprise_id = BIGINT,
            mongo_submit_id = STRING,
            link_id = STRING,
            batch_id = STRING,
            push_submit_time = TIMESTAMP,
            back_time = TIMESTAMP,
            number_id = BIGINT,
            receive_report_time = TIMESTAMP,
            report_msg_id = STRING,
            push_report_time = STRING,
            seq_id = STRING,
            seq_order_id = STRING
          }
        }
      },
      {
        schema {
          table = "letterMsg.2024-04-send_detail"
          fields {
            _id = STRING,
            order_id = STRING,
            phone_num = STRING,
            message_context = STRING,
            channel_no = STRING,
            send_status = STRING,
            send_time = TIMESTAMP,
            back_status = STRING,
            create_time = TIMESTAMP,
            urgent_lv = STRING,
            sign = STRING,
            sign_id = INT,
            query_report_flag = INT,
            isp_id = INT,
            user_id = BIGINT,
            enterprise_id = BIGINT,
            mongo_submit_id = STRING,
            link_id = STRING,
            batch_id = STRING,
            push_submit_time = TIMESTAMP,
            back_time = TIMESTAMP,
            number_id = BIGINT,
            receive_report_time = TIMESTAMP,
            report_msg_id = STRING,
            push_report_time = STRING,
            seq_id = STRING,
            seq_order_id = STRING
          }
        }
      }
    ]
  }
  # please go to https://seatunnel.apache.org/docs/connector-v2/source
}

sink {
  Clickhouse {
    host = "10.xxx.xxx.xxx:8123"
    database = "xlsms_test"
    table = "sms_send_detail"
    username = "******"
    password = "*********"
    primary_key = "_id"
    support_upsert = true
    allow_experimental_lightweight_delete = true
  }
}

Running Command

bin/seatunnel.sh --config ./config/streaming-mongocdc-clickhouse.conf -n syncSendDetailJob

Error Exception

Caused by: org.apache.seatunnel.engine.common.exception.SeaTunnelEngineException: org.apache.seatunnel.engine.common.exception.TaskGroupDeployException: java.lang.RuntimeException: TaskGroupLocation: TaskGroupLocation{jobId=933914021243912194, pipelineId=1, taskGroupId=1} already exists

Zeta or Flink or Spark Version

SeaTunnel Engine (Zeta) in separated cluster mode

Java or Scala Version

java 1.8.0_171

Screenshots

(two screenshots of the logs attached)

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

kwonder0926 (Author) commented

I saw a suggestion online that when the table being transferred is very large, the checkpoint timeout needs to be set higher. But what value should I set it to? Is there any reference for choosing this value?
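
For reference, a minimal sketch of raising the checkpoint timeout at the job level, assuming the documented checkpoint.timeout env option is available in 2.3.x (the 600000 ms value is only an illustrative starting point, not an official recommendation):

env {
  parallelism = 2
  job.mode = "STREAMING"
  checkpoint.interval = 2000
  # Assumed option: checkpoint timeout in milliseconds; raise it above the
  # longest checkpoint duration observed in the logs for the large tables.
  checkpoint.timeout = 600000
}

The same timeout can also be set cluster-wide under seatunnel.engine.checkpoint.timeout in seatunnel.yaml. There is no universal value; a common approach is to measure the slowest checkpoint in the logs and set the timeout comfortably above it.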


lphbdfj commented Jan 24, 2025

I have also encountered this issue and have not found a solution yet. I hope the maintainers can provide a fix as soon as possible.
