PySpark `.save()` method does not work #14529

### Apache Iceberg version

1.10.0 (latest release)

### Query engine

Spark

### Please describe the bug 🐞

I was trying out Iceberg in PySpark by running the following command to start the PySpark shell:

```bash
pyspark --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.0
```

The PySpark version is 3.5.5.

I'd like to write data in Iceberg format without involving a catalog. Here is the code I ran:

```python
spark.range(10).write.format("iceberg").save("/tmp/test")
```

The code above failed. Here is an excerpt of the error output (with the lengthy Java stack trace trimmed and personal information redacted):

```text
WARN Query: Query for candidates of org.apache.hadoop.hive.metastore.model.MDatabase and subclasses resulted in no possible candidates
Required table missing : "DBS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table missing : "DBS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"

WARN Query: Query for candidates of org.apache.hadoop.hive.metastore.model.MTableColumnStatistics and subclasses resulted in no possible candidates
Required table missing : "CDS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table missing : "CDS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"

WARN Query: Query for candidates of org.apache.hadoop.hive.metastore.model.MPartitionColumnStatistics and subclasses resulted in no possible candidates
Required table missing : "CDS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table missing : "CDS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"

WARN Query: Query for candidates of org.apache.hadoop.hive.metastore.model.MConstraint and subclasses resulted in no possible candidates
Required table missing : "CDS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table missing : "CDS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"

WARN MetaStoreDirectSql: Self-test query [select "DB_ID" from "DBS"] failed; direct SQL is disabled
javax.jdo.JDODataStoreException: Error executing SQL query "select "DB_ID" from "DBS"".
NestedThrowablesStackTrace:
java.sql.SQLSyntaxErrorException: Table/View 'DBS' does not exist.
Caused by: ERROR 42X05: Table/View 'DBS' does not exist.
	... 108 more

WARN Query: Query for candidates of org.apache.hadoop.hive.metastore.model.MVersionTable and subclasses resulted in no possible candidates
Required table missing : "VERSION" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table missing : "VERSION" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/venv/lib/python3.11/site-packages/pyspark/sql/readwriter.py", line 1463, in save
    self._jwrite.save(path)
  File "/venv/lib/python3.11/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/venv/lib/python3.11/site-packages/pyspark/errors/exceptions/captured.py", line 179, in deco
    return f(*a, **kw)
           ^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o43.save.
: org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive Metastore
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
	... 52 more
Caused by: java.lang.reflect.InvocationTargetException
	... 64 more
Caused by: MetaException(message:Version information not found in metastore. )
	... 70 more
Caused by: MetaException(message:Version information not found in metastore. )
	... 73 more
```

I can see that the examples in the documentation mostly involve catalog table names rather than table paths. However, reading from a table path directly does seem to work for me:

```python
spark.read.format("iceberg").load("/some/existing/table").show()
```

So I'm wondering why the writer does not work out of the box.
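For comparison, here is a rough sketch of the catalog-based route that the documentation examples seem to assume. It is not part of the failing reproduction, and I have not verified it as a fix. The catalog name `local`, the warehouse path `/tmp/warehouse`, the table name `local.db.test`, and both helper functions are placeholders I made up for illustration. It also assumes the catalog settings can be applied to a running session before the catalog is first referenced; passing them as `--conf` flags when starting `pyspark` is the route the documentation shows.

```python
def hadoop_catalog_conf(catalog_name, warehouse):
    """Build Spark settings for an Iceberg catalog backed by a plain directory."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hadoop",          # directory-based catalog, no Hive Metastore
        f"{prefix}.warehouse": warehouse,    # hypothetical warehouse location
    }


def write_via_catalog(spark, catalog_name="local", warehouse="/tmp/warehouse"):
    """Write spark.range(10) to <catalog_name>.db.test with the v2 writer."""
    # Assumption: the settings take effect because the catalog has not been
    # referenced yet in this session.
    for key, value in hadoop_catalog_conf(catalog_name, warehouse).items():
        spark.conf.set(key, value)

    table = f"{catalog_name}.db.test"  # hypothetical table name
    spark.range(10).writeTo(table).using("iceberg").createOrReplace()
    return table
```

In the PySpark shell started with the command above, this would be called as `write_via_catalog(spark)`, followed by `spark.table("local.db.test").show()` to read the data back.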

Also, in the [documentation](https://iceberg.apache.org/docs/nightly/spark-writes/#writing-with-dataframes) I noticed this sentence:

> The v1 DataFrame `write` API is still supported, but is not recommended.

Is there a reason the v1 API is not recommended?
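To make the question concrete, this is how I understand the two DataFrame write APIs to differ for an append. It is a minimal sketch: the helper names are made up, and `table` is assumed to be a catalog table name such as `local.db.test` rather than a path.

```python
def append_v1(df, table):
    """Append using the v1 DataFrameWriter (`df.write`)."""
    df.write.format("iceberg").mode("append").save(table)


def append_v2(df, table):
    """Append using the v2 DataFrameWriterV2 (`df.writeTo`)."""
    df.writeTo(table).append()
```

With the v1 API, the write mode is passed as a string, while the v2 API exposes each operation (`create`, `append`, and so on) as its own method.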

### Willingness to contribute

- [ ] I can contribute a fix for this bug independently
- [ ] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- [x] I cannot contribute a fix for this bug at this time
