Read and process mutation record in batches to reduce memory usage #288

leexgh · 2025-05-16T21:10:31Z

Fix: genome-nexus/genome-nexus#796
Problem
gnResponseVariantKeyMap grows as a VariantAnnotation is added for each variant, so it consumes a lot of memory when annotating a large file.
Updates

This PR reads, processes, and writes records in smaller batches to avoid high memory usage.
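To illustrate the idea, here is a minimal sketch of batch-wise processing. It is not the code from this PR: it is written in Python for brevity, and `batched`, `annotate_in_batches`, `annotate_batch`, `write_batch`, and the default `batch_size` are hypothetical names and values chosen for the example. The point is only that the per-variant annotation map is rebuilt for each batch and can be released after that batch is written, so memory is bounded by the batch size rather than the file size.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List


def batched(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def annotate_in_batches(
    records: Iterable[dict],
    annotate_batch: Callable[[List[dict]], Dict[str, object]],
    write_batch: Callable[[List[dict], Dict[str, object]], None],
    batch_size: int = 1000,  # assumed value, not taken from the PR
) -> int:
    """Annotate and write one batch at a time; return the number of records written."""
    written = 0
    for batch in batched(records, batch_size):
        # The annotation map only holds entries for the current batch,
        # so it never grows with the size of the whole input file.
        annotations = annotate_batch(batch)
        write_batch(batch, annotations)
        written += len(batch)
    return written
```

Because `records` can be any iterable, the input file can be streamed line by line instead of being loaded up front.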
Updates to the Annotation Summary: previously, the summary of invalid or failed annotations counted unique variants; now it shows the actual variant count, including duplicated variants. For example:

Annotation Summary:
        Records with ambiguous SNP and INDEL allele changes:  0

        Failed annotations summary:  5 total failed annotations
                Records with HGVSp null variant classification:  1
                Records that failed due to other unknown reason: 4

This makes more sense to me: when annotating a large file, people usually don't know how many variants are unique, so including duplicated variants makes it clearer how many rows have invalid annotations.
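The difference between the two counting schemes can be sketched as follows. This is an illustration only, not the PR's implementation; `summarize_failures`, the variant keys, and the reason labels are hypothetical. The numbers are chosen to mirror the example summary above (5 failed rows: 1 with a null HGVSp classification, 4 with an unknown reason).

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def summarize_failures(failed_rows: Iterable[Tuple[str, str]]) -> Dict[str, object]:
    """Summarize failed rows given as (variant_key, reason) pairs, one per input row."""
    rows = list(failed_rows)
    return {
        # New behaviour: every failed row counts, duplicates included.
        "total_rows": len(rows),
        # Old behaviour: each distinct variant counts once.
        "unique_variants": len({key for key, _ in rows}),
        # Row-level breakdown by failure reason.
        "per_reason": dict(Counter(reason for _, reason in rows)),
    }


# Hypothetical input: three distinct variants, two of them duplicated.
example = [
    ("v1", "hgvsp_null"),
    ("v2", "unknown"),
    ("v2", "unknown"),
    ("v3", "unknown"),
    ("v3", "unknown"),
]
summary = summarize_failures(example)
```

Here `summary["total_rows"]` is 5 while `summary["unique_variants"]` is 3, which is the gap a user would see between the new and the old summary on a file with duplicated variants.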

leexgh · 2025-05-19T18:39:50Z

@inodb @onursumer @ao508 @rmadupuri @callachennault Not sure who should review this pr, feel free to leave comments or unassign yourself from review 🙂

ao508

lgtm but I'd like someone from the pipelines team to confirm that these changes will not impact the ETLs. Looks like the POST method is the same so when the annotator dependency is updated in say, the DMP data fetcher, it should be fine @callachennault

leexgh added 2 commits May 16, 2025 17:10

Read and process mutation record in chunk to reduce memory usage

cfe3132

Remove the warning log and duplicated code

358d73e

leexgh changed the title ~~Read and process mutation record in chunk to reduce memory usage~~ Read and process mutation record in batches to reduce memory usage May 19, 2025

leexgh requested review from inodb, onursumer, ao508, rmadupuri and callachennault May 19, 2025 18:37

leexgh added the bug label May 19, 2025

ao508 reviewed May 20, 2025


callachennault mentioned this pull request May 27, 2025

Review and test GNAP batch annotation updates knowledgesystems/pipelines-scrum#1521

Open

2 tasks

leexgh added 3 commits June 17, 2025 16:48

Fix summaryStatistics over 100% and lost tracking counts

2fa6107

Fix summaryStatistics over 100% and lost tracking counts

5bb62e1

Fix summaryStatistics for failed batch

b846d0d
