Skip to content

XML parse error "not well-formed (invalid token)" for column alias with end-of-file character #50

@FlipperPA

Description

@FlipperPA

I've found an edge case: if a sum character (aka end-of-file, \x1a) appears in a column alias, it will embed the character into the xlsx file, causing many potential issues (such as IO Error: XML parse error at line 13, column 258: not well-formed (invalid token)). We have an automated data pipeline from SAS to create column descriptions as aliases, and one from SAS turned a curly single-quote into \x1a, if you're wondering how this was uncovered.

This would be absolutely fair to tag as "won't fix", but might be worth considering for data safety. Other Excel file creators (such as csv2xlsx) have support built in to drop unknown characters like these.

Example repro in Python (though the SQL query is the only really pertinent part):

extension_importer.import_extension("postgres_scanner")
query = "SELECT column_name AS 'Broker' || E'\x1A' || 's' FROM table_name"
duckdb.sql("ATTACH 'host= user= dbname= AS pg_connection (TYPE POSTGRES);")

duckdb.sql(f"""
    LOAD excel;
    COPY (SELECT * FROM postgres_query('pg_connection','{query}'))
    TO '/tmp/output.xlsx'
    WITH (FORMAT xlsx, HEADER true);
""")

Thanks for the excellent work on this exporter. If my C skills were better, I'd take a hack at this myself.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions