The built-in ClickHouse Waterdrop plugin uses JDBC to insert data into the ClickHouse cluster. This can put heavy pressure on the ClickHouse cluster and also cause data duplication due to retries.
Basically, this plugin builds ClickHouse `part` files locally with the `clickhouse local` command and uploads the built `part` files to HDFS. A ClickHouse node can then pull the `part` files from HDFS and call `alter table t attach part 'part_name'` to load the data. A minimal sketch of this attach step follows.
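For context, the attach step on a ClickHouse node looks roughly like this. It is only a sketch: the part name, the HDFS path, and the table's `detached` directory location are assumptions that depend on your deployment.

```sh
# Minimal sketch of the attach flow; the part name and all paths are
# placeholders, not values produced by this plugin.
PART=201901_1_1_0

# Pull a built part from HDFS into the table's detached directory.
hdfs dfs -get \
  "hdfs://remote_hdfs_host:9000/clickhouse-build/test/ontime/${PART}" \
  /var/lib/clickhouse/data/test/ontime/detached/

# Attach the part so its rows become visible in the table.
clickhouse-client --query "ALTER TABLE test.ontime ATTACH PART '${PART}'"
```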
- build this plugin with `mvn clean package` and get clickhouse-offline-build-1.0.jar
- build the plugins.tar.gz with the following structure (a packaging sketch follows the layout):

  ```
  plugins/
  plugins/clickhouse-offline-build/
  plugins/clickhouse-offline-build/files/
  plugins/clickhouse-offline-build/files/clickhouse
  plugins/clickhouse-offline-build/files/config.xml
  plugins/clickhouse-offline-build/files/build.sh
  plugins/clickhouse-offline-build/lib/clickhouse-offline-build-1.0.jar
  ```
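  One way to assemble the tarball, assuming the jar was built with Maven and the `files/` contents (the `clickhouse` binary, `config.xml`, and `build.sh`) are already staged:

  ```sh
  # Hypothetical packaging sketch; stage the layout shown above, then pack it.
  mkdir -p plugins/clickhouse-offline-build/lib
  cp target/clickhouse-offline-build-1.0.jar plugins/clickhouse-offline-build/lib/
  # plugins/clickhouse-offline-build/files/ is assumed to already contain
  # the clickhouse binary, config.xml, and build.sh.
  tar -czf plugins.tar.gz plugins/
  ```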
- follow the steps on https://interestinglab.github.io/waterdrop-docs/#/zh-cn/v1/quick-start
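  For example, deploying the tarball next to a Waterdrop distribution might look like this; the install path is a placeholder, and the quick-start guide above is authoritative:

  ```sh
  # Hypothetical deployment sketch; /opt/waterdrop is a placeholder path.
  cp plugins.tar.gz /opt/waterdrop/
  cd /opt/waterdrop && tar -xzf plugins.tar.gz
  ```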
- with config:

  ```
  output {
    io.github.interestinglab.waterdrop.output.clickhouse.ClickhouseOfflineBuild {
      host = "${clickhouse_cluster}:8023"
      database = "${database}"
      table = "${tableName}"
      username = "${account}"
      password = "${password}"
      hdfsUser = "${hdfsUser}"
      tmpUploadPath = "${tmpUploadPath}"
      defaultValues = {"a":1} # optional
    }
  }
  ```
- ClickHouse is ready, and the `test.ontime` table is ready
- follow https://interestinglab.github.io/waterdrop-docs/#/zh-cn/v1/installation to get Waterdrop
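  With both in place, a minimal sanity check that the target table exists (client connection options are assumptions about your setup):

  ```sh
  # Prints 1 if test.ontime exists, 0 otherwise.
  clickhouse-client --query "EXISTS TABLE test.ontime"
  ```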
- prepare some `ontime` data: https://clickhouse.tech/docs/en/getting-started/example-datasets/ontime/
- create config/application.conf:
  ```
  spark {
    spark.streaming.batchDuration = 5
    spark.app.name = "Waterdrop-sample"
    spark.ui.port = 13000
    spark.executor.instances = 2
    spark.executor.cores = 1
    spark.executor.memory = "1g"
  }

  input {
    file {
      path = "path_to_ontime_data.csv.gz"
      format = "csv"
      options.header = "true"
      result_table_name = "ontime"
    }
  }

  filter {
  }

  output {
    io.github.interestinglab.waterdrop.output.clickhouse.ClickhouseOfflineBuild {
      host = "127.0.0.1:8023"
      database = "test"
      table = "ontime"
      username = "default"
      password = "default_password"
      hdfsUser = "test_user"
      defaultValues = {"a":"b"}
      tmpUploadPath = "hdfs://remote_hdfs_host:9000/clickhouse-build/test/ontime/"
    }
  }
  ```
- run:

  ```sh
  ./bin/start-waterdrop.sh --config config/application.conf --deploy-mode client --master "local[2]"
  ```
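  The same script can also submit to a cluster, since these flags are passed through to spark-submit; a sketch, assuming a working YARN client setup:

  ```sh
  # Hypothetical cluster run; assumes HADOOP_CONF_DIR and YARN are configured.
  ./bin/start-waterdrop.sh --config config/application.conf \
    --deploy-mode client --master yarn
  ```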
- check data on `hdfs://remote_hdfs_host:9000/clickhouse-build/test/ontime/`
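  For example:

  ```sh
  # List the uploaded part files under the configured tmpUploadPath.
  hdfs dfs -ls hdfs://remote_hdfs_host:9000/clickhouse-build/test/ontime/
  ```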
- run the script on a ClickHouse node:

  ```sh
  sh attach.sh test ontime hdfs://remote_hdfs_host:9000/clickhouse-build/test/ontime/
  ```
- verify: `select count(*) from test.ontime`
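  For example, from a shell on a ClickHouse node (connection options are assumptions):

  ```sh
  # The count should match the number of rows in the input CSV.
  clickhouse-client --query "SELECT count(*) FROM test.ontime"
  ```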