Description
Apache Iceberg version
1.10.0 (latest release)
Query engine
Spark
Please describe the bug 🐞
I was trying out Iceberg in PySpark by running the following command to start the PySpark shell:
pyspark --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.0
The PySpark version is 3.5.5.
I'd like to write data in Iceberg format without catalog involved. Here is the code I have:
spark.range(10).write.format("iceberg").save("/tmp/test")
The above code failed. Here is a sample of the error message (with the lengthy Java stack trace removed and personal information redacted):
WARN Query: Query for candidates of org.apache.hadoop.hive.metastore.model.MDatabase and subclasses resulted in no possible candidates
Required table missing : "DBS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table missing : "DBS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
WARN Query: Query for candidates of org.apache.hadoop.hive.metastore.model.MTableColumnStatistics and subclasses resulted in no possible candidates
Required table missing : "CDS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table missing : "CDS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
WARN Query: Query for candidates of org.apache.hadoop.hive.metastore.model.MPartitionColumnStatistics and subclasses resulted in no possible candidates
Required table missing : "CDS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table missing : "CDS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
WARN Query: Query for candidates of org.apache.hadoop.hive.metastore.model.MConstraint and subclasses resulted in no possible candidates
Required table missing : "CDS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table missing : "CDS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
WARN MetaStoreDirectSql: Self-test query [select "DB_ID" from "DBS"] failed; direct SQL is disabled
javax.jdo.JDODataStoreException: Error executing SQL query "select "DB_ID" from "DBS"".
NestedThrowablesStackTrace:
java.sql.SQLSyntaxErrorException: Table/View 'DBS' does not exist.
Caused by: ERROR 42X05: Table/View 'DBS' does not exist.
... 108 more
WARN Query: Query for candidates of org.apache.hadoop.hive.metastore.model.MVersionTable and subclasses resulted in no possible candidates
Required table missing : "VERSION" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table missing : "VERSION" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/venv/lib/python3.11/site-packages/pyspark/sql/readwriter.py", line 1463, in save
self._jwrite.save(path)
File "/venv/lib/python3.11/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
File "/venv/lib/python3.11/site-packages/pyspark/errors/exceptions/captured.py", line 179, in deco
return f(*a, **kw)
^^^^^^^^^^^
File "/venv/lib/python3.11/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o43.save.
: org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive Metastore
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
... 52 more
Caused by: java.lang.reflect.InvocationTargetException
... 64 more
Caused by: MetaException(message:Version information not found in metastore. )
... 70 more
Caused by: MetaException(message:Version information not found in metastore. )
... 73 more
I can see that the examples in the documentation mostly involve catalog table names instead of working with table paths directly. However, reading a table path directly seems to work for me:
spark.read.format("iceberg").load("/some/existing/table").show()
So I'm wondering why the writer does not work out of the box.
Also, in the documentation I noticed this sentence:
The v1 DataFrame write API is still supported, but is not recommended.
Is there a reason for this?
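For context, here is how I read the v1 vs. v2 distinction (a sketch; "local.db.events" is a placeholder table in the hypothetical "local" catalog from the snippet above):

df = spark.range(10)

# v2 API (DataFrameWriterV2): the documented replacement, with explicit
# create/append semantics
df.writeTo("local.db.events").createOrReplace()

# v1 API: still supported but, per the docs, not recommended
df.write.format("iceberg").mode("append").save("local.db.events")

# v2 equivalent of the append above
df.writeTo("local.db.events").append()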
Willingness to contribute
- I can contribute a fix for this bug independently
- I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- I cannot contribute a fix for this bug at this time