Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import io
names = [f"head{x:02}" for x in range(1, 11)]
# NOT WORKING
with pd.read_csv(
    "_data/random_data.csv",
    sep=";",
    names=names,
    encoding="utf-8",
    index_col=False,
    on_bad_lines="skip",
    chunksize=2,
    engine="python",
) as reader:
    for chunk in reader:
        print(chunk)
# head01 head02 head03 head04 head05 head06 head07 head08 head09 head10
# 0 99 Wg1H2ivHFZ BDpXoeOhKH VohJYsnCV8 BtTuebP0nT fmBHwP4IFV TOg9YJp1h6 ooVM44HkzP DSZqukVH3K hU3NZQBuri
# 1 99 gzxEh5ieKn HCAIPudvKj YqTUuDKH8O 5383zSS6E6 7Nr9Ckatuo tqfCuCh52l JFK0cfq9mz yyQsQGC6t3 Xc44lIK4BQ
# head01 head02 head03 head04 head05 head06 head07 head08 head09 head10
# 2 99 M1XNbLOYG9 px78EDlwlW gHdirv59k9 VRJgi4m1H0 vSFkaCbImk IM9V0UCLBa vjnpAidejp chcpZKpn48 UlAzuehJo5
# 3 99 diWUN45qqP 16HJxD3wdU 0WvoDOwKBx XHO9L6qVWX 94DhLCUEA7 vdQ0wFx2u3 ZeF0SOPSsc gJfA44ZSdQ y7rHFlT77G
# head01 head02 head03 head04 head05 head06 head07 head08 head09 head10
# 4 99 XsqKrPi1eO AouPwLJ8cx qERFA7G6oE 2xcUukUfKQ TWXUS2GNWQ wEJ5Xz6Bzf 8G5eEJDsEo 84Gm40s4nh wvZixCSZ5X
# 5 99 ul1YLwdMLJ 9zE2XgrLmV LVccZLrNGl dE6PWSqbYB 3ltSdpDsTf 5QfymfMUM7 KkxipJLtLE hoWZps7wS6 oCrfsk9CsV
# /Users/thomas/Projekte/Entwicklung/Python/pandas_verifier/pandas_reader.py:52: ParserWarning: Length of header or names does not match length of data. This leads to a loss of data with index_col=False.
# for chunk in reader:
...
# WORKING
sim_csv = io.StringIO(
"""99;Wg1H2ivHFZ;BDpXoeOhKH;VohJYsnCV8;BtTuebP0nT;fmBHwP4IFV;TOg9YJp1h6;ooVM44HkzP;DSZqukVH3K;hU3NZQBuri
99;gzxEh5ieKn;HCAIPudvKj;YqTUuDKH8O;5383zSS6E6;7Nr9Ckatuo;tqfCuCh52l;JFK0cfq9mz;yyQsQGC6t3;Xc44lIK4BQ
99;M1XNbLOYG9;px78EDlwlW;gHdirv59k9;VRJgi4m1H0;vSFkaCbImk;IM9V0UCLBa;vjnpAidejp;chcpZKpn48;UlAzuehJo5
99;diWUN45qqP;16HJxD3wdU;0WvoDOwKBx;XHO9L6qVWX;94DhLCUEA7;vdQ0wFx2u3;ZeF0SOPSsc;gJfA44ZSdQ;y7rHFlT77G
99;XsqKrPi1eO;AouPwLJ8cx;qERFA7G6oE;2xcUukUfKQ;TWXUS2GNWQ;wEJ5Xz6Bzf;8G5eEJDsEo;84Gm40s4nh;wvZixCSZ5X
99;ul1YLwdMLJ;9zE2XgrLmV;LVccZLrNGl;dE6PWSqbYB;3ltSdpDsTf;5QfymfMUM7;KkxipJLtLE;hoWZps7wS6;oCrfsk9CsV;
99;cQuqcgc9az;XyE3OYqhRw;HPELcHKBtt;PRR5qLpw1H;FZrXAWdRSZ;gJPL5W6C0Z;uFKnbdtpvS;4j1qBslPc0;imCvulSmhS
99;NzVF74lO9E;M28U9jb3oA;oAAlFQVUVt;6fkOztILHW;MZm20agksL;O0Yik187u6;ZgZMQMkjZc;yHMeT4HPEe;dppbphuT4b;;
99;bnNrfhGWri;HUxtRlvdKU;gyEjO0V1a3;xHh4SgJIfC;lawQnZfiAP;6FiB0bfmh2;shxKCWvV4Z;LmA6ZOidGv;rS8ZGBXQsx;;NF4cRa7bVJ
99;nuziDImo99;arsFldtXRS;DQpoylF0mE;qCh4S3O8hG;PdUexdXCwW;C9GUnzSXi0;ygMAcHTUCp;vH03yILzGm;1m3pSV7Eg0"""
)
names = [f"head{x:02}" for x in range(1, 11)]
with pd.read_csv(
    sim_csv,
    names=names,
    chunksize=2,
    on_bad_lines="warn",
    engine="python",
    delimiter=";",
) as reader:
    for chunk in reader:
        print(chunk)
# head01 head02 head03 head04 head05 head06 head07 head08 head09 head10
# 0 99 Wg1H2ivHFZ BDpXoeOhKH VohJYsnCV8 BtTuebP0nT fmBHwP4IFV TOg9YJp1h6 ooVM44HkzP DSZqukVH3K hU3NZQBuri
# 1 99 gzxEh5ieKn HCAIPudvKj YqTUuDKH8O 5383zSS6E6 7Nr9Ckatuo tqfCuCh52l JFK0cfq9mz yyQsQGC6t3 Xc44lIK4BQ
# head01 head02 head03 head04 head05 head06 head07 head08 head09 head10
# 2 99 M1XNbLOYG9 px78EDlwlW gHdirv59k9 VRJgi4m1H0 vSFkaCbImk IM9V0UCLBa vjnpAidejp chcpZKpn48 UlAzuehJo5
# 3 99 diWUN45qqP 16HJxD3wdU 0WvoDOwKBx XHO9L6qVWX 94DhLCUEA7 vdQ0wFx2u3 ZeF0SOPSsc gJfA44ZSdQ y7rHFlT77G
# /Users/thomas/Projekte/Entwicklung/Python/pandas_verifier/pandas_reader_2.py:36: ParserWarning: Skipping line 6: Expected 10 fields in line 6, saw 11
# for chunk in reader:
...
Issue Description
pandas_reader_2.py
pandas_reader.py
random_data.csv
While trying to reproduce a fault from one of my work projects at home on macOS 15.6, I came across a really strange effect. For the test I created a CSV file (random_data.csv) with 10 rows of 10 columns each, using ';' (semicolon) as the separator. I then added extra fields to 3 of the lines to simulate faulty data.
When reading from the CSV file (pandas_reader.py), I only get the usual ParserWarning, regardless of the value I pass to on_bad_lines; even a callable is ignored. With the same example read through io.StringIO (pandas_reader_2.py), the behavior is as expected: warnings are displayed or lines are skipped, depending on the option's value. Providing a callable that cuts a bad line down to the expected size also works as expected.
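For reference, the kind of callable meant here can be sketched as follows. This is a minimal illustration, not the code from the attached scripts: the three-column data, `EXPECTED_FIELDS`, and `truncate_bad_line` are made up for the example. It relies only on the documented behavior that, with engine="python", a callable passed to on_bad_lines receives the bad row as a list of strings and may return a replacement row.

```python
import io

import pandas as pd

# Hypothetical miniature of the report's data: three columns instead of ten,
# with one over-long row in the middle.
EXPECTED_FIELDS = 3
names = [f"head{x:02}" for x in range(1, EXPECTED_FIELDS + 1)]
sim_csv = io.StringIO("99;aaa;bbb\n99;ccc;ddd;extra\n99;eee;fff\n")


def truncate_bad_line(bad_line):
    """Cut an over-long row (a list of strings) down to the expected width."""
    return bad_line[:EXPECTED_FIELDS]


with pd.read_csv(
    sim_csv,
    names=names,
    chunksize=2,
    on_bad_lines=truncate_bad_line,  # a callable requires engine="python"
    engine="python",
    delimiter=";",
) as reader:
    # Collect the chunks so the result can be inspected as one frame.
    result = pd.concat([chunk for chunk in reader], ignore_index=True)

print(result)
```

With the in-memory input, the over-long row is kept but trimmed to three fields, which is the behavior I expected from the file-based reader as well.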
On Ubuntu 22.04 on my work laptop, reading the CSV file behaves as expected, so I have only encountered this on macOS.
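The two snippets in the example differ in more than the input source: the file-based call also passes encoding="utf-8" and index_col=False, and the warning it prints mentions index_col=False. A possible way to separate the platform, the input source, and the arguments is to run both argument sets against the same in-memory data. The sketch below is only a diagnostic aid under that assumption; `RAW` and `count_rows` are hypothetical names, and the data is a three-column miniature of random_data.csv.

```python
import io
import warnings

import pandas as pd

# Hypothetical miniature of the report's data: the middle row has one extra field.
RAW = "99;aaa;bbb\n99;ccc;ddd;extra\n99;eee;fff\n"
names = ["head01", "head02", "head03"]


def count_rows(**extra_kwargs):
    """Read RAW in chunks with on_bad_lines="skip"; return (row count, warning texts)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        with pd.read_csv(
            io.StringIO(RAW),
            sep=";",
            names=names,
            on_bad_lines="skip",
            chunksize=2,
            engine="python",
            **extra_kwargs,
        ) as reader:
            rows = sum(len(chunk) for chunk in reader)
    return rows, [str(w.message) for w in caught]


# Same data, same engine; only index_col differs between the two runs.
rows_plain, warnings_plain = count_rows()
rows_no_index, warnings_no_index = count_rows(index_col=False)

print("without index_col:", rows_plain, warnings_plain)
print("index_col=False:  ", rows_no_index, warnings_no_index)
```

If the two runs report different row counts or warnings on both machines, the difference would point at the arguments rather than at macOS or at reading from a file.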
Expected Behavior
head01 head02 head03 head04 head05 head06 head07 head08 head09 head10
0 99 Wg1H2ivHFZ BDpXoeOhKH VohJYsnCV8 BtTuebP0nT fmBHwP4IFV TOg9YJp1h6 ooVM44HkzP DSZqukVH3K hU3NZQBuri
1 99 gzxEh5ieKn HCAIPudvKj YqTUuDKH8O 5383zSS6E6 7Nr9Ckatuo tqfCuCh52l JFK0cfq9mz yyQsQGC6t3 Xc44lIK4BQ
head01 head02 head03 head04 head05 head06 head07 head08 head09 head10
2 99 M1XNbLOYG9 px78EDlwlW gHdirv59k9 VRJgi4m1H0 vSFkaCbImk IM9V0UCLBa vjnpAidejp chcpZKpn48 UlAzuehJo5
3 99 diWUN45qqP 16HJxD3wdU 0WvoDOwKBx XHO9L6qVWX 94DhLCUEA7 vdQ0wFx2u3 ZeF0SOPSsc gJfA44ZSdQ y7rHFlT77G
/Users/thomas/Projekte/Entwicklung/Python/pandas_verifier/pandas_reader_2.py:36: ParserWarning: Skipping line 6: Expected 10 fields in line 6, saw 11
for chunk in reader:
head01 head02 head03 head04 head05 head06 head07 head08 head09 head10
4 99 XsqKrPi1eO AouPwLJ8cx qERFA7G6oE 2xcUukUfKQ TWXUS2GNWQ wEJ5Xz6Bzf 8G5eEJDsEo 84Gm40s4nh wvZixCSZ5X
/Users/thomas/Projekte/Entwicklung/Python/pandas_verifier/pandas_reader_2.py:36: ParserWarning: Skipping line 8: Expected 10 fields in line 8, saw 12
for chunk in reader:
head01 head02 head03 head04 head05 head06 head07 head08 head09 head10
5 99 cQuqcgc9az XyE3OYqhRw HPELcHKBtt PRR5qLpw1H FZrXAWdRSZ gJPL5W6C0Z uFKnbdtpvS 4j1qBslPc0 imCvulSmhS
/Users/thomas/Projekte/Entwicklung/Python/pandas_verifier/pandas_reader_2.py:36: ParserWarning: Skipping line 9: Expected 10 fields in line 9, saw 12
for chunk in reader:
head01 head02 head03 head04 head05 head06 head07 head08 head09 head10
6 99 nuziDImo99 arsFldtXRS DQpoylF0mE qCh4S3O8hG PdUexdXCwW C9GUnzSXi0 ygMAcHTUCp vH03yILzGm 1m3pSV7Eg0
Installed Versions
pandas : 2.3.3
numpy : 2.2.6
pytz : 2025.2
dateutil : 2.9.0.post0
pip : None
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : None
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 21.0.0
pyreadstat : None
pytest : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2025.2
qtpy : None
pyqt5 : None
</details>