[Bug] Data loss in doris2hive when using DataFrame #314

@chuang-wang-pre

Search before asking

  • I have searched the issues and found no similar ones.

Version

spark-doris-connector: 25.0.1
doris: 3.0.0
spark: 3.0.1

What's Wrong?

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

// feNodes, user, password, and log are defined elsewhere in the job (not shown).
val dorisTableIdentifier = "doris_db.doris_table"
val hiveTableName = "hive_db.hive_table"
val timeColumn = "ctime"
val selectedColumnsStr = args(5).trim
// Assumption: the definition of selectedColumns was omitted from the snippet;
// here it is taken to be the comma-separated column list split into names.
val selectedColumns = selectedColumnsStr.split(",").map(_.trim)
val startTime = "2025-05-06 00:00:00"
val endTime = "2025-05-07 00:00:00"

val appName = s"doris-to-hive-$hiveTableName"
val spark = SparkSession.builder()
  .appName(appName)
  .enableHiveSupport()
  .getOrCreate()

// 1. Read data from Doris.
val dorisDF = spark.read
  .format("doris")
  .option("doris.fenodes", feNodes)
  .option("doris.table.identifier", dorisTableIdentifier)
  .option("user", user)
  .option("password", password)
  .load()
  .filter(col(timeColumn) >= lit(startTime) && col(timeColumn) < lit(endTime)) // limit the time span
  .select(selectedColumns.map(col): _*) // select columns

// Each count() is a separate Spark action, so the DataFrame is re-read from Doris every time.
log.info("doris data count: {}", dorisDF.count())

Thread.sleep(1000)
log.info("doris data count: {}", dorisDF.count())

Thread.sleep(5000)
log.info("doris data count: {}", dorisDF.count())

dorisDF.createOrReplaceTempView("doris_data_detail")

// 2. Write to Hive.
val insertSql =
  s"""
     |INSERT OVERWRITE TABLE $hiveTableName PARTITION (pt='20250410000000')
     |SELECT
     |$selectedColumnsStr
     |FROM doris_data_detail
     |""".stripMargin
log.info("insert hive sql: {}", insertSql)
spark.sql(insertSql)

spark.stop()

I used this code to implement doris2hive and found that the Hive table contained fewer rows than the Doris table. To investigate, I added logging that records the DataFrame row count three times before the write. The log is as follows:

25/05/07 19:43:56 INFO Doris2HiveTask$: doris data count: 68684
25/05/07 19:43:59 INFO Doris2HiveTask$: doris data count: 97918
25/05/07 19:44:05 INFO Doris2HiveTask$: doris data count: 99903
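
The three counts above come from three separate `count()` calls on the same DataFrame. As a minimal sketch of how such a check could be automated, the helper below samples a count several times and reports whether the results agree. `CountStability`, `sample`, and `isStable` are hypothetical names introduced for illustration; they are not part of Spark or the spark-doris-connector, and the helper takes a plain `() => Long` so that it runs without a Spark session.

```scala
// Hypothetical helper for illustration; not part of Spark or the Doris connector.
object CountStability {

  /** Calls `count` `attempts` times, sleeping `pauseMs` between calls, and returns every result in order. */
  def sample(count: () => Long, attempts: Int, pauseMs: Long = 0L): Seq[Long] =
    (1 to attempts).map { i =>
      if (i > 1 && pauseMs > 0) Thread.sleep(pauseMs)
      count()
    }

  /** True when all sampled counts are identical (or when there are no samples). */
  def isStable(samples: Seq[Long]): Boolean = samples.distinct.size <= 1
}
```

In the job above it would presumably be called as `CountStability.sample(() => dorisDF.count(), attempts = 3, pauseMs = 1000)`, and applying `isStable` to the logged values 68684, 97918, 99903 returns `false`.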

The row count in Doris:

Image

I am certain that the row count of the Doris table did not change during this period.

Why did this happen? Is this a bug?

What You Expected?

An explanation of why the row counts differ between runs.

How to Reproduce?

No response

Anything Else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
