Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

skip_errors parameter is impacting validate() performance #1699

Open
pdelboca opened this issue Oct 30, 2024 · 2 comments
Open

skip_errors parameter is impacting validate() performance #1699

pdelboca opened this issue Oct 30, 2024 · 2 comments

Comments

@pdelboca
Copy link
Contributor

Overview

Adding skip_errors to the validate() function has a dramatic impact in the performance of the function:

The following is an output on some data I'm working on and it goes from 140ms to 60s when skipping errors:

In [12]: %time report = validate('./data/plans-barcelona-small.xlsx', skip_errors=['blank-row'])
CPU times: user 60 s, sys: 119 μs, total: 60 s
Wall time: 60 s

In [13]: %time report = validate('./data/plans-barcelona-small.xlsx')
CPU times: user 141 ms, sys: 12 μs, total: 141 ms
Wall time: 140 ms

In [14]: %time report = validate('./data/plans-barcelona-small.xlsx', skip_errors=['blank-row'])
CPU times: user 1min 2s, sys: 4 ms, total: 1min 2s
Wall time: 1min 2s

A small blank-rows.xlsx file can change from 100ms to 500ms by just skipping errors.

In [29]: %time report = validate('./data/blank-rows.xlsx', skip_errors=['blank-row'])
CPU times: user 590 ms, sys: 3.98 ms, total: 594 ms
Wall time: 593 ms

In [30]: %time report = validate('./data/blank-rows.xlsx')
CPU times: user 117 ms, sys: 5 μs, total: 117 ms
Wall time: 116 ms
@pierrecamilleri
Copy link
Collaborator

Can reproduce with blank-rows.xlsx, I observe similarly a performance 5x times worse with skip_errors=['blank-row']. Seems xlsx related as I can't reproduce with a csv file, even with many blank lines.

I hadn't had the time to investigate why but profiling gives row.errors() as the culprit (10 000 calls with skipped errors, instead of 1000 without), despite it being a @cached_property. Not sure how this may be related with xlsx vs csv, need to look further.

(note for myself : check if there is some copy involved which may invalidate the cache).

@pierrecamilleri
Copy link
Collaborator

pierrecamilleri commented Jan 8, 2025

The difference on blank-row.xlsx is actually that the run with the checks stops because of an error limit, and does not check all >10k lines. With the blank-row check ignored, the error limit is never reached so all 10k lines are processed.

It's likely to be the same for your plans-barcelona-small.xlsx. You can test it at the command line :

frictionless validate blank-rows.xlsx --json --skip-errors blank-row | grep '"rows"'
        "rows": 10019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants