-
Notifications
You must be signed in to change notification settings - Fork 13
Open
Description
Summary
When DuckDB creates an Excel file, it is missing a xl/sharedStrings.xml
file inside the .xlsx
ZIP archive. This causes certain Excel reading programs (such as OpenPyXL) to throw an error when trying to read the file as invalid.
To improve compatibility, we may be able to just add a xl/sharedStrings.xml
file to the ZIP with no items, I.e.:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="0" uniqueCount="0">
</sst>
Repro
For reference, here's how I'm creating the xlsx
file.
import duckdb
from duckdb_extensions import extension_importer
extension_importer.import_extension("postgres_scanner")
duckdb.sql(f"""
ATTACH 'host= port= user= dbname= AS pg_connection (TYPE POSTGRES);
""")
extension_importer.import_extension("excel")
duckdb.sql(f"""
LOAD excel;
COPY (SELECT * FROM postgres_query('pg_connection','SELECT * FROM my_table'))
TO '/tmp/duckdb-test.xlsx' WITH (FORMAT xlsx, HEADER true);
""")
Then to trigger the KeyError: "There is no item named 'xl/sharedStrings.xml' in the archive"
with OpenPyXL:
from openpyxl import load_workbook
wb=load_workbook("/tmp/duckdb-test.xlsx")
Possible Solution?
I'm far outside my realm of expertise here, but doing this makes it so OpenPyXL can successfully open the xlsx
file:
mkdir xl
cat > xl/sharedStrings.xml <<EOL
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="0" uniqueCount="0">
</sst>
EOL
zip -ur /tmp/duckdb-test.xlsx xl
After doing this, openpyxl
opens the xlsx
file without issue:
from openpyxl import load_workbook
wb=load_workbook("/tmp/duckdb-test.xlsx")
s=wb.active
print(s.max_row)
# prints "942883"
Metadata
Metadata
Assignees
Labels
No labels