Optimize memory usage #6

cre-os · 2024-06-05T17:00:13Z

Currently, the XML file is parsed in memory, then the document tree is extracted, then hashes are computed, and finally the data is extracted into a flat data model. This PR proposes switching to iterative, event-based parsing by default in order to reduce memory use, while also keeping the original approach, which can be faster even though it uses more memory.

parsing is done iteratively
the document tree is built during parsing
hashes are computed right after a node has been parsed
hash-based deduplication is done as parsing proceeds
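
For illustration only, here is a minimal sketch of the iterative flow described above. It is not the code from this PR: it uses the standard library's `xml.etree.ElementTree.iterparse` (lxml's `iterparse` offers a comparable interface) so that it runs without dependencies, and the function name `iter_unique_records`, the `record_tag` and `hash_method` parameters, and the sample XML are all hypothetical.

```python
import hashlib
import io
import xml.etree.ElementTree as ET

# Hypothetical sample input: the first two <item> records are identical.
SAMPLE = (
    b"<items>"
    b"<item><id>1</id><name>a</name></item>"
    b"<item><id>1</id><name>a</name></item>"
    b"<item><id>2</id><name>b</name></item>"
    b"</items>"
)


def iter_unique_records(source, record_tag, hash_method="sha256"):
    """Yield (digest, fields) once for each distinct <record_tag> element.

    Assumption: a record is flattened to its direct children only;
    the real data model may be richer.
    """
    seen = set()  # digests of records already yielded
    # "end" events fire once an element and all its children are parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != record_tag:
            continue
        # Build a small immutable representation of the node.
        fields = tuple(
            sorted((child.tag, (child.text or "").strip()) for child in elem)
        )
        # Hash right after the node has been parsed.
        digest = hashlib.new(hash_method, repr(fields).encode("utf-8")).hexdigest()
        # Drop the element's children and text now that they are processed.
        elem.clear()
        # Deduplicate on the fly, based on the hash.
        if digest not in seen:
            seen.add(digest)
            yield digest, fields


records = list(iter_unique_records(io.BytesIO(SAMPLE), "item"))
```

Because the function is a generator, only the current record and the set of seen digests are held by the caller at any time. Note that with the standard library parser the root element still keeps references to the cleared (now empty) child elements, so very large files may need additional cleanup.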

Add configurable metadata Add max_lines in insert

Add configurable metadata and record hash column name

Add configurable hash method

cre-os added 9 commits June 5, 2024 18:55

Change XML parsing to iterparse

1d51986

Move hash compute and deduplication to parsing stage

74e0fe2

Add element.clear after parsing a node

5c74f0b

Split transactions

4a070b6

Add configurable metadata Add max_lines in insert

Add iterparse as an optional behavior

e236e35

Add configurable metadata and record hash column name

Fixes and version bump

626d12d

Switch document tree from dict to tuple

e4ef20e

Add configurable hash method

Allow custom indices to be added

6152da6

Deduplicate index name in tests

14bc249

cre-os merged commit ec5ef05 into main Jun 27, 2024
9 checks passed

cre-os deleted the feature/switch_to_lxml_iterparse branch June 27, 2024 15:39
