Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import io
# CSV with UTF‑8 BOM
bytes_data = b"\xef\xbb\xbfName,Age\nJohn,25"
bio = io.BytesIO(bytes_data)
df_c = pd.read_csv(bio, encoding="utf-8")
print(repr(df_c.columns[0])) # 'Name'
bio.seek(0)
# io.TextIOWrapper with the same encoding keeps the BOM
stream = io.TextIOWrapper(bio, encoding='utf-8')
print(repr(stream.read())) # '\ufeffName,Age\nJohn,25'
Issue Description
When reading a UTF-8 encoded CSV that starts with a BOM, pd.read_csv with encoding='utf-8' strips the BOM from the first column header, whereas io.TextIOWrapper with the same encoding leaves it in the text as an invisible \ufeff character. The result therefore depends on which code path parses the data, and this behaviour is undocumented and inconsistent.
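For comparison, here is a minimal sketch of how the Python standard library treats the same bytes under the two codecs. It uses only io; the helper name read_text is made up for this illustration and is not a pandas or stdlib function.

```python
import io

# Same bytes as in the reproducible example: UTF-8 BOM followed by CSV text.
bytes_data = b"\xef\xbb\xbfName,Age\nJohn,25"


def read_text(data: bytes, encoding: str) -> str:
    """Decode bytes through io.TextIOWrapper (hypothetical helper for this sketch)."""
    with io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="") as stream:
        return stream.read()


plain = read_text(bytes_data, "utf-8")    # the BOM survives as '\ufeff'
sig = read_text(bytes_data, "utf-8-sig")  # the BOM is consumed by the codec

print(repr(plain))
print(repr(sig))
```

In the standard library, only "utf-8-sig" removes the BOM, which is why read_csv stripping it under plain "utf-8" is surprising.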
Expected Behavior
I expected encoding='utf-8' either to always keep the BOM (requiring utf-8-sig to remove it) or to always strip it, regardless of parser. At minimum, the documentation should clarify how BOMs are handled across engines when using encoding='utf-8'.
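Until the behaviour is documented, one possible workaround is to remove the BOM explicitly before parsing, so the result does not depend on how the reader handles it. The sketch below is an assumption about what such a workaround could look like, not pandas behaviour: strip_utf8_bom is a hypothetical helper, and the standard csv module stands in for the parser so the example runs without pandas.

```python
import codecs
import csv
import io


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if present (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


bytes_data = b"\xef\xbb\xbfName,Age\nJohn,25"
clean = strip_utf8_bom(bytes_data)

# With the BOM removed up front, a plain "utf-8" decode yields a clean header.
rows = list(csv.reader(io.StringIO(clean.decode("utf-8"))))
header = rows[0]
print(header)
```

The cleaned bytes could equally be wrapped in io.BytesIO and passed to pd.read_csv; that step is left out here to keep the sketch free of third-party dependencies.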
Installed Versions
Replace this line with the output of pd.show_versions()