-
Notifications
You must be signed in to change notification settings - Fork 13
Description
I'm receiving the following error from DuckDB when COPYing a query from PostgreSQL to xlsx
. This code works for 99% of the SQL queries we throw at it, but we've found one which produces a large result which is kicking this error. I've searched high and low but can't find any reference to this error in the source (or the web in general!) that matches. I'm hoping someone can point me in the right direction. Here's the stacktrace from Python:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/var/query-manager/workers/postgres_env/lib/python3.12/site-packages/postgres_query/__main__.py", line 6, in <module>
PostgreSQLWorker(argv[1], argv[2], argv[3]).run()
File "/var/query-manager/workers/postgres_env/lib/python3.12/site-packages/worker/__init__.py", line 509, in run
raise (e)
File "/var/query-manager/workers/postgres_env/lib/python3.12/site-packages/worker/__init__.py", line 439, in run
results_files = self.run_query(identifiers, method_2_document_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/query-manager/workers/postgres_env/lib/python3.12/site-packages/postgres_query/utils.py", line 827, in run_query
self.write_query_to_file(cursor, query_runner, pg_format, ext)
File "/var/query-manager/workers/postgres_env/lib/python3.12/site-packages/postgres_query/utils.py", line 292, in write_query_to_file
self.write_duckdb_file(cursor, query_runner, ext, copy_sql)
File "/var/query-manager/workers/postgres_env/lib/python3.12/site-packages/postgres_query/utils.py", line 400, in write_duckdb_file
duckdb.sql(f"""
duckdb.duckdb.IOException: IO Error: Failed to close entry
Here's the pertinent part of the code - the error triggers from duckdb.sql()
:
duckdb.sql(f"SET extension_directory = '{output_path}';")
duckdb.sql(f"SET temp_directory = '{output_path}';")
extension_importer.import_extension("postgres_scanner")
duckdb.sql(f"""
ATTACH 'host={DB["HOST"]} port={DB["PORT"]} user={username}
dbname={DB["DBNAME"]}' AS pg_connection (TYPE POSTGRES);
""")
extension_importer.import_extension("excel")
duckdb.sql(f"""
LOAD excel;
COPY (SELECT * FROM postgres_query('pg_connection','{ddb_query}'))
TO '{output_file}'
WITH (FORMAT xlsx, HEADER true, SHEET 'Results');
""")
The same basic code works when exporting to Parquet: it runs successfully, outputting 676 columns and 717,039 rows when COPY'd to Parquet (or CSV). Here's the code that works for the problematic query to Parquet:
duckdb.sql(f"SET extension_directory = '{output_path}';")
duckdb.sql(f"SET temp_directory = '{output_path}';")
extension_importer.import_extension("postgres_scanner")
duckdb.sql(f"""
ATTACH 'host={DB["HOST"]} port={DB["PORT"]} user={username}
dbname={DB["DBNAME"]}' AS pg_connection (TYPE POSTGRES);
""")
duckdb.sql(f"""
COPY (SELECT * FROM postgres_query('pg_connection','{ddb_query}'))
TO '{output_file}'
WITH (FORMAT parquet, COMPRESSION zstd, COMPRESSION_LEVEL 3);
""")
Here's the SQL for the problematic query (ddb_query
); it runs fine to Parquet:
SELECT
comp_na_daily_all.fundq.costat AS "(costat) Active/Inactive Status Marker",
id_table.add1 AS "(add1) Address Line 1",
id_table.add2 AS "(add2) Address Line 2",
[...trimmed for length...]
comp_na_daily_all.fundq.prchq AS "(prchq) Price High - Quarter",
comp_na_daily_all.fundq.prclq AS "(prclq) Price Low - Quarter"
FROM comp_na_daily_all.fundq
LEFT JOIN (
SELECT
gvkey,
add1,
add2,
add3,
add4,
addzip,
busdesc,
city,
conml,
county,
dldte,
dlrsn,
ein,
fax,
fyrc,
ggroup,
gind,
gsector,
gsubind,
idbflag,
incorp,
ipodate,
loc,
naics,
phone,
prican,
prirow,
priusa,
sic,
spcindcd,
spcseccd,
spcsrc,
state,
stko,
weburl
FROM comp_na_daily_all.company
) AS id_table
ON comp_na_daily_all.fundq.gvkey = id_table.gvkey
WHERE comp_na_daily_all.fundq.datadate BETWEEN
%s AND %s
AND ("comp_na_daily_all"."fundq"."consol" = ANY (ARRAY['C','N','R','P','D']) AND "comp_na_daily_all"."fundq"."indfmt" = ANY (ARRAY['INDL','FS']) AND "comp_na_daily_all"."fundq"."datafmt" = ANY (ARRAY['STD','SUMM_STD','PRE_AMENDS','PRE_AMENDSS']) AND "comp_na_daily_all"."fundq"."curcdq" = ANY (ARRAY['USD','CAD']) AND "comp_na_daily_all"."fundq"."costat" = ANY (ARRAY['A','I'])) AND (datafqtr IS NOT NULL) AND (datacqtr IS NOT NULL)
PARAMETERS: [datetime.date(2010, 1, 1), datetime.date(2025, 6, 30)]
If anyone with more expertise in DuckDB could help point me in the right direction, that would be very much appreciated!