skip_errors parameter is impacting validate() performance #1699

Open
@pdelboca

Overview

Passing skip_errors to validate() has a dramatic impact on the function's performance.

On some data I'm working with, the wall time goes from 140 ms to 60 s when skipping errors:

In [12]: %time report = validate('./data/plans-barcelona-small.xlsx', skip_errors=['blank-row'])
CPU times: user 60 s, sys: 119 μs, total: 60 s
Wall time: 60 s

In [13]: %time report = validate('./data/plans-barcelona-small.xlsx')
CPU times: user 141 ms, sys: 12 μs, total: 141 ms
Wall time: 140 ms

In [14]: %time report = validate('./data/plans-barcelona-small.xlsx', skip_errors=['blank-row'])
CPU times: user 1min 2s, sys: 4 ms, total: 1min 2s
Wall time: 1min 2s
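
To narrow down where the extra time goes, the slow call could be run under the standard-library profiler. This is only a sketch: `profile_call` is a hypothetical helper, not part of frictionless, and the commented usage assumes `validate` is imported from `frictionless` with the same arguments as in the session above.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    # runcall executes func(*args, **kwargs) with profiling enabled
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show only the `top` entries with the highest cumulative time
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


# Hypothetical usage (assumes frictionless is installed and the file exists):
# from frictionless import validate
# report, text = profile_call(
#     validate, "./data/plans-barcelona-small.xlsx", skip_errors=["blank-row"]
# )
# print(text)
```

Comparing the report for the call with skip_errors against the one without it should show which functions account for the difference.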

Even a small blank-rows.xlsx file goes from about 120 ms to about 590 ms just by skipping errors:

In [29]: %time report = validate('./data/blank-rows.xlsx', skip_errors=['blank-row'])
CPU times: user 590 ms, sys: 3.98 ms, total: 594 ms
Wall time: 593 ms

In [30]: %time report = validate('./data/blank-rows.xlsx')
CPU times: user 117 ms, sys: 5 μs, total: 117 ms
Wall time: 116 ms
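
For a reproduction outside IPython, a minimal timing sketch along these lines could be used. `best_wall_time` and `slowdown` are hypothetical helper names, and the commented usage assumes `validate` is imported from `frictionless`. The figures in the comments are simply the wall times reported above.

```python
import time


def best_wall_time(func, *args, repeats=3, **kwargs):
    """Return the fastest wall-clock time in seconds over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return min(timings)


def slowdown(baseline_seconds, candidate_seconds):
    """How many times slower the candidate is than the baseline."""
    return candidate_seconds / baseline_seconds


# Wall times taken from the sessions in this issue:
small_file = slowdown(0.116, 0.593)  # blank-rows.xlsx, roughly 5x
large_file = slowdown(0.140, 60.0)   # plans-barcelona-small.xlsx, over 400x

# Hypothetical usage (assumes frictionless is installed and the file exists):
# from frictionless import validate
# base = best_wall_time(validate, "./data/blank-rows.xlsx")
# skip = best_wall_time(validate, "./data/blank-rows.xlsx", skip_errors=["blank-row"])
# print(f"{slowdown(base, skip):.1f}x slower with skip_errors")
```

Taking the minimum over a few repeats reduces noise from other processes, which matters for the runs that take around 100 ms.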
