Lineage Displays S3 Path Instead of Hive/Delta Table Name with Spark Pipeline/Plugin #13239

Open
anilreddygollapalli opened this issue Apr 16, 2025 · 1 comment
Labels
bug Bug report

Comments

@anilreddygollapalli

Describe the bug
When we use the Spark connector/plugin to generate lineage from Spark applications, the generated lineage shows S3 paths as edges instead of Hive/Delta table names in the lineage flow. Please refer to the attached screenshot.

To Reproduce
Steps to reproduce the behavior:

  1. Launch pyspark with the Spark plugin JARs and the properties below:
    pyspark --packages io.acryl:acryl-spark-lineage:0.2.16 \
      --conf "spark.datahub.rest.server=http://XXXX:8080" \
      --conf "spark.datahub.rest.token=XXXXXX" \
      --conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
      --conf "spark.jars.packages=io.acryl:acryl-spark-lineage:0.2.16" \
      --conf "spark.datahub.metadata.dataset.include_schema_metadata=true" \
      --conf "spark.datahub.metadata.dataset.materialize=true" \
      --conf "spark.datahub.disableSymlinkResolution=true" \
      --conf "spark.datahub.metadata.dataset.hivePlatformAlias=hive" \
      --name datahub_lineage_test_20250415
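
For anyone trying to reproduce this, the command above can be generated from a small script so that a single option is easy to flip between runs. The sketch below is only an illustration: `build_pyspark_command` is a hypothetical helper, and the host and token stay as the redacted placeholders from the report. It exposes `spark.datahub.disableSymlinkResolution` as a toggle because the option name suggests it may influence whether a dataset is reported by storage path or by table name. That is an assumption worth testing, not a confirmed diagnosis.

```python
import shlex
from typing import List

# Package coordinate copied from the reproduction command in this report.
PACKAGE = "io.acryl:acryl-spark-lineage:0.2.16"


def build_pyspark_command(disable_symlink_resolution: bool = True) -> List[str]:
    """Hypothetical helper: return the pyspark argv used in the reproduction.

    The default (True) matches the command in the report. Passing False
    produces the same command with only that one option changed.
    """
    conf = {
        # Server and token are the redacted placeholders from the report.
        "spark.datahub.rest.server": "http://XXXX:8080",
        "spark.datahub.rest.token": "XXXXXX",
        "spark.extraListeners": "datahub.spark.DatahubSparkListener",
        "spark.jars.packages": PACKAGE,
        "spark.datahub.metadata.dataset.include_schema_metadata": "true",
        "spark.datahub.metadata.dataset.materialize": "true",
        "spark.datahub.disableSymlinkResolution": str(disable_symlink_resolution).lower(),
        "spark.datahub.metadata.dataset.hivePlatformAlias": "hive",
    }
    argv = ["pyspark", "--packages", PACKAGE]
    for key, value in conf.items():
        argv += ["--conf", f"{key}={value}"]
    argv += ["--name", "datahub_lineage_test_20250415"]
    return argv


if __name__ == "__main__":
    # Print both variants as shell-safe command lines for side-by-side runs.
    print(shlex.join(build_pyspark_command(disable_symlink_resolution=True)))
    print(shlex.join(build_pyspark_command(disable_symlink_resolution=False)))
```

Running the script prints two command lines that differ only in that one setting, so the resulting lineage graphs can be compared directly.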

Expected behavior
Lineage should show spark job --> Hive/Delta table name as the edge. Instead, the lineage visualization shows the S3 path of that Hive/Delta table, which is not expected. For context, we use a Hive metastore (backed by RDS MySQL), and Spark uses it as its metastore catalog, so the metastore holds the metadata for the Hive and Delta tables.

Screenshots
Attached

Desktop (please complete the following information):

  • Version: DataHub 0.14.0

Additional context
The PySpark application log is attached for reference.

Image
@anilreddygollapalli anilreddygollapalli added the bug Bug report label Apr 16, 2025
@anilreddygollapalli
Author

eventLogs-application_1741977607953_0005.zip
Please find attached the PySpark application log, as requested.
