I can reproduce this with blank-rows.xlsx: I similarly observe roughly 5x worse performance with skip_errors=['blank-row']. It seems xlsx-related, as I can't reproduce it with a CSV file, even one with many blank lines.
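For reference, here is a minimal timing sketch of the comparison I ran (assuming validate() accepts skip_errors as described in this issue, and that blank-rows.xlsx is in the working directory; the helper name is just for illustration):

```python
import time

from frictionless import validate


def timed_validate(path, **options):
    # Time one validation run; options are passed straight through to validate().
    start = time.perf_counter()
    report = validate(path, **options)
    print(f"{path} {options}: {time.perf_counter() - start:.2f}s, valid={report.valid}")
    return report


# Baseline run: blank rows are reported as errors.
timed_validate("blank-rows.xlsx")

# Same file, ignoring blank-row errors (the slow case reported here).
timed_validate("blank-rows.xlsx", skip_errors=["blank-row"])
```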
I haven't had time to investigate why yet, but profiling points to row.errors() as the culprit (10 000 calls with skipped errors, instead of 1000 without), despite it being a @cached_property. I'm not sure how this relates to xlsx vs. csv; I need to look further.
(Note to self: check whether there is some copy involved that may invalidate the cache.)
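The profiling itself is nothing fancy; a sketch with the standard-library profiler, which is where the difference in row.errors call counts shows up (the exact frictionless internals in the output will depend on your version):

```python
import cProfile
import pstats

from frictionless import validate

# Profile the slow configuration and list the most-called functions.
with cProfile.Profile() as profiler:
    validate("blank-rows.xlsx", skip_errors=["blank-row"])

stats = pstats.Stats(profiler)
stats.sort_stats("ncalls").print_stats(20)
```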
The difference on blank-rows.xlsx is actually that the run with the checks enabled stops early because it hits the error limit, so it does not check all >10k lines. With the blank-row check ignored, the error limit is never reached, so all 10k lines are processed.
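If that explanation is right, raising the error limit should make the two runs comparable again, since both would then process every row. A quick sketch, assuming validate() exposes a limit_errors option and that the Report keeps per-task error lists (both worth verifying against your frictionless version):

```python
from frictionless import validate

# With a limit well above the number of blank rows, neither run stops early,
# so both timings are dominated by actually reading all ~10k rows.
report_all = validate("blank-rows.xlsx", limit_errors=100_000)
report_skipped = validate("blank-rows.xlsx", skip_errors=["blank-row"])

# Inspecting tasks[0].errors is an assumption about the Report layout; adapt as needed.
print(len(report_all.tasks[0].errors), len(report_skipped.tasks[0].errors))
```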
It's likely the same for your plans-barcelona-small.xlsx. You can test it at the command line:
Overview
Adding skip_errors to the validate() function has a dramatic impact on the performance of the function. The following is an output on some data I'm working on: it goes from 140ms to 60s when skipping errors. A small blank-rows.xlsx file can change from 100ms to 500ms just by skipping errors.