BUG: Inconsistent BOM handling in pd.read_csv with encoding='utf-8' #63787

@MikiPAUL

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import io

# CSV with a UTF-8 BOM
bytes_data = b"\xef\xbb\xbfName,Age\nJohn,25"
bio = io.BytesIO(bytes_data)


# read_csv strips the BOM from the first column name
df_c = pd.read_csv(bio, encoding="utf-8")
print(repr(df_c.columns[0]))  # 'Name'

bio.seek(0)
# io.TextIOWrapper with the same encoding keeps the BOM
stream = io.TextIOWrapper(bio, encoding="utf-8")

print(repr(stream.read()))  # '\ufeffName,Age\nJohn,25'

Issue Description

When reading a UTF-8 encoded CSV that starts with a BOM, pd.read_csv with encoding='utf-8' strips the BOM from the first column header, whereas io.TextIOWrapper with the same encoding leaves it in place as an invisible \ufeff character. The result therefore depends on which code path decodes the bytes, and this behaviour is both undocumented and inconsistent.
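For comparison, here is a minimal standard-library sketch (no pandas involved) of how the two codecs treat the same bytes. The helper name first_header is made up for this illustration; it only shows the plain Python behaviour that read_csv is being compared against.

```python
import csv
import io

# Same bytes as in the report: UTF-8 BOM followed by a two-line CSV.
bytes_data = b"\xef\xbb\xbfName,Age\nJohn,25"


def first_header(encoding: str) -> str:
    """Decode bytes_data with the given codec and return the first header cell.

    Hypothetical helper for this illustration only.
    """
    text = io.TextIOWrapper(io.BytesIO(bytes_data), encoding=encoding, newline="")
    return next(csv.reader(text))[0]


header_plain = first_header("utf-8")    # BOM is decoded as U+FEFF and kept
header_sig = first_header("utf-8-sig")  # BOM is consumed by the codec

print(repr(header_plain))  # '\ufeffName'
print(repr(header_sig))    # 'Name'
```

With plain 'utf-8' the BOM ends up inside the first header cell, so a lookup by "Name" would miss that column; 'utf-8-sig' is the codec that removes it.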

Expected Behavior

I expected encoding='utf-8' either to always keep the BOM (requiring utf-8-sig to remove it) or to always strip it, regardless of parser. At minimum, the documentation should clarify how BOMs are handled across engines when using encoding='utf-8'.
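Until the behaviour is settled, one way to get the same result from every reader is to drop the BOM from the raw bytes before decoding. The sketch below is only an illustration under that assumption: strip_utf8_bom is a hypothetical helper, not a pandas API, and it uses only the standard library.

```python
import codecs
import io


def strip_utf8_bom(raw: bytes) -> bytes:
    """Return raw without a leading UTF-8 BOM, if one is present.

    Hypothetical helper, not part of pandas.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


with_bom = b"\xef\xbb\xbfName,Age\nJohn,25"
without_bom = b"Name,Age\nJohn,25"

# After stripping, plain 'utf-8' decoding no longer produces U+FEFF.
cleaned = strip_utf8_bom(with_bom)
text = io.TextIOWrapper(io.BytesIO(cleaned), encoding="utf-8").read()
print(repr(text))
```

The cleaned bytes can then be wrapped in io.BytesIO and handed to any reader, so the result no longer depends on whether that reader strips the BOM itself.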

Installed Versions

Replace this line with the output of pd.show_versions()
