"IO Error: Failed to close entry" when COPYing from PostgreSQL to EXCEL #53

@FlipperPA

Description

I'm receiving the following error from DuckDB when COPYing a query result from PostgreSQL to xlsx. This code works for 99% of the SQL queries we throw at it, but we've found one that produces a large result set and triggers this error. I've searched high and low but can't find any matching reference to this error in the source (or on the web in general!). I'm hoping someone can point me in the right direction. Here's the stack trace from Python:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/var/query-manager/workers/postgres_env/lib/python3.12/site-packages/postgres_query/__main__.py", line 6, in <module>
    PostgreSQLWorker(argv[1], argv[2], argv[3]).run()
  File "/var/query-manager/workers/postgres_env/lib/python3.12/site-packages/worker/__init__.py", line 509, in run
    raise (e)
  File "/var/query-manager/workers/postgres_env/lib/python3.12/site-packages/worker/__init__.py", line 439, in run
    results_files = self.run_query(identifiers, method_2_document_id)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/query-manager/workers/postgres_env/lib/python3.12/site-packages/postgres_query/utils.py", line 827, in run_query
    self.write_query_to_file(cursor, query_runner, pg_format, ext)
  File "/var/query-manager/workers/postgres_env/lib/python3.12/site-packages/postgres_query/utils.py", line 292, in write_query_to_file
    self.write_duckdb_file(cursor, query_runner, ext, copy_sql)
  File "/var/query-manager/workers/postgres_env/lib/python3.12/site-packages/postgres_query/utils.py", line 400, in write_duckdb_file
    duckdb.sql(f"""
duckdb.duckdb.IOException: IO Error: Failed to close entry

Here's the pertinent part of the code; the error is raised by the final duckdb.sql() call:

duckdb.sql(f"SET extension_directory = '{output_path}';")
duckdb.sql(f"SET temp_directory = '{output_path}';")
extension_importer.import_extension("postgres_scanner")

duckdb.sql(f"""
    ATTACH 'host={DB["HOST"]} port={DB["PORT"]} user={username}
    dbname={DB["DBNAME"]}' AS pg_connection (TYPE POSTGRES);
""")

extension_importer.import_extension("excel")
duckdb.sql(f"""
    LOAD excel;
    COPY (SELECT * FROM postgres_query('pg_connection','{ddb_query}'))
    TO '{output_file}'
    WITH (FORMAT xlsx, HEADER true, SHEET 'Results');
""")

The same basic code works when exporting to Parquet (or CSV): the problematic query runs successfully and outputs 676 columns and 717,039 rows. Here's the working Parquet version:

duckdb.sql(f"SET extension_directory = '{output_path}';")
duckdb.sql(f"SET temp_directory = '{output_path}';")
extension_importer.import_extension("postgres_scanner")

duckdb.sql(f"""
    ATTACH 'host={DB["HOST"]} port={DB["PORT"]} user={username}
    dbname={DB["DBNAME"]}' AS pg_connection (TYPE POSTGRES);
""")

duckdb.sql(f"""
    COPY (SELECT * FROM postgres_query('pg_connection','{ddb_query}'))
    TO '{output_file}'
    WITH (FORMAT parquet, COMPRESSION zstd, COMPRESSION_LEVEL 3);
""")
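In the meantime, since CSV output works for this query, a possible stopgap (assuming openpyxl as an extra dependency; it's not part of our pipeline, so this is just a sketch) is to COPY to CSV with DuckDB and then stream the rows into xlsx with openpyxl's write-only mode, which avoids holding the whole workbook in memory:

```python
import csv
from openpyxl import Workbook

def csv_to_xlsx(csv_path: str, xlsx_path: str, sheet: str = "Results") -> None:
    """Stream a CSV file into an xlsx file row by row."""
    wb = Workbook(write_only=True)  # write-only mode streams rows to disk
    ws = wb.create_sheet(title=sheet)
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            ws.append(row)  # note: every cell is written as text
    wb.save(xlsx_path)
```

One caveat: this writes every cell as a string, and Excel caps a sheet at 1,048,576 rows, so the 717,039-row result fits, but not with much headroom.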

Here's the SQL for the problematic query (ddb_query); it runs fine to Parquet:

SELECT
    comp_na_daily_all.fundq.costat AS "(costat) Active/Inactive Status Marker",
    id_table.add1 AS "(add1) Address Line 1",
    id_table.add2 AS "(add2) Address Line 2",
    [...trimmed for length...]
    comp_na_daily_all.fundq.prchq AS "(prchq) Price High - Quarter",
    comp_na_daily_all.fundq.prclq AS "(prclq) Price Low - Quarter"

FROM comp_na_daily_all.fundq
LEFT JOIN (
    SELECT
        gvkey,
        add1,
        add2,
        add3,
        add4,
        addzip,
        busdesc,
        city,
        conml,
        county,
        dldte,
        dlrsn,
        ein,
        fax,
        fyrc,
        ggroup,
        gind,
        gsector,
        gsubind,
        idbflag,
        incorp,
        ipodate,
        loc,
        naics,
        phone,
        prican,
        prirow,
        priusa,
        sic,
        spcindcd,
        spcseccd,
        spcsrc,
        state,
        stko,
        weburl
    FROM comp_na_daily_all.company
) AS id_table 

ON comp_na_daily_all.fundq.gvkey = id_table.gvkey

WHERE comp_na_daily_all.fundq.datadate BETWEEN
    %s AND %s

AND (
    "comp_na_daily_all"."fundq"."consol" = ANY (ARRAY['C','N','R','P','D'])
    AND "comp_na_daily_all"."fundq"."indfmt" = ANY (ARRAY['INDL','FS'])
    AND "comp_na_daily_all"."fundq"."datafmt" = ANY (ARRAY['STD','SUMM_STD','PRE_AMENDS','PRE_AMENDSS'])
    AND "comp_na_daily_all"."fundq"."curcdq" = ANY (ARRAY['USD','CAD'])
    AND "comp_na_daily_all"."fundq"."costat" = ANY (ARRAY['A','I'])
)
AND (datafqtr IS NOT NULL)
AND (datacqtr IS NOT NULL)

PARAMETERS: [datetime.date(2010, 1, 1), datetime.date(2025, 6, 30)]

If anyone with more expertise in DuckDB could help point me in the right direction, that would be very much appreciated!
