Ensure uniqueness of column names to avoid losing data during serialization #38

svlandeg · 2025-04-03T12:13:01Z

When a parsed table has duplicate column names, all of them are preserved in the pandas DataFrame. However, when the DataFrame is converted to a dictionary in encode_df, pandas raises a warning:

\spacy_layout\util.py:36: UserWarning: DataFrame columns are not unique, some columns will be omitted.

The "duplicate" columns are then dropped, even if their values differ from those of the other column with the same heading.

This may be somewhat surprising for users, so this PR instead makes the column names unique by appending (2), (3), etc. to repeated names before serializing them to the dictionary.
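A minimal sketch of the renaming idea described above, for illustration only. The helper name `make_unique` and the collision handling are assumptions, not the code in this PR, and the input column list is inferred from the expected output in the test rather than taken from the PDF.

```python
def make_unique(names):
    """Return column names with " (2)", " (3)", ... appended to repeats.

    Hypothetical helper: it illustrates the approach, not the PR's code.
    """
    counts = {}   # how many times each original name has been seen
    used = set()  # names already emitted, to avoid accidental collisions
    result = []
    for name in names:
        name = str(name)
        count = counts.get(name, 0) + 1
        candidate = name if count == 1 else f"{name} ({count})"
        # Skip ahead if the generated name is already taken,
        # e.g. when the table really has a column called "Value (2)".
        while candidate in used:
            count += 1
            candidate = f"{name} ({count})"
        counts[name] = count
        used.add(candidate)
        result.append(candidate)
    return result


# Assumed input, inferred from the expected column list in the unit test.
columns = ["Index", "Value", "Value", "Index", "Value", "Value"]
print(make_unique(columns))
# ['Index', 'Value', 'Value (2)', 'Index (2)', 'Value (3)', 'Value (4)']
```

With a DataFrame, the result could be assigned back with `df.columns = make_unique(df.columns)` before the conversion to a dictionary.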

Open questions:

Instead of doing the conversion automatically, should this functionality be behind a feature flag? If so, what should the default be?
If the conversion happens automatically, should spacy-layout issue a warning about the renamed columns?

svlandeg · 2025-04-03T12:27:45Z

tests/test_general.py

+    doc_bin = DocBin(docs=[old_doc], store_user_data=True)
+    new_doc = list(doc_bin.get_docs(nlp.vocab))[0]
+    new_table = new_doc._.tables[0]._.data
+    assert list(new_table.columns) == ['Index', 'Value', 'Value (2)', 'Index (2)', 'Value (3)', 'Value (4)']


On main, the result would be

['Index', 'Value']

i.e. four columns would simply have been dropped from the serialized output.
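The loss follows from dictionary keys being unique. The sketch below shows the same effect with a plain Python dict rather than pandas; the column list and the row values are made up for illustration.

```python
# Assumed column names and one made-up row of values.
columns = ["Index", "Value", "Value", "Index", "Value", "Value"]
row = [1, "a", "b", 2, "c", "d"]

# Building a column-keyed dict keeps one entry per distinct name;
# later values silently overwrite earlier ones.
record = dict(zip(columns, row))

print(list(record))  # ['Index', 'Value']
print(record)        # {'Index': 2, 'Value': 'd'}
```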

svlandeg added 3 commits April 3, 2025 14:03

PDF with a table with duplicate column names

8f653db

unit test failing on main

85d3024

unit test failing on main

118432b
