Read and process mutation record in batches to reduce memory usage #288

Open. leexgh wants to merge 5 commits into base: master

@leexgh (Member) commented May 16, 2025

Fix: genome-nexus/genome-nexus#796
Problem
gnResponseVariantKeyMap grows by one VariantAnnotation for every variant, so annotating a large file consumes a lot of memory.
Updates

  1. This PR processes and writes records in smaller batches to avoid high memory usage.
  2. Updates the Annotation Summary. Previously, the summary of invalid or failed annotations counted unique variants; now it shows the actual number of records, including duplicated variants. For example:
Annotation Summary:
        Records with ambiguous SNP and INDEL allele changes:  0

        Failed annotations summary:  5 total failed annotations
                Records with HGVSp null variant classification:  1
                Records that failed due to other unknown reason: 4
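
As a rough illustration of the difference between the two counts, here is a minimal Java sketch. The class name, method names, and variant keys are made up for illustration and are not taken from the annotator's code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Hypothetical illustration only; not the annotator's actual summary code.
class FailedAnnotationCountSketch {

    /** Counts every failed row, so a variant that fails on two rows counts twice. */
    static int countFailedRows(List<String> failedVariantKeys) {
        return failedVariantKeys.size();
    }

    /** Counts each distinct failed variant once, regardless of how many rows it appears on. */
    static int countUniqueFailedVariants(List<String> failedVariantKeys) {
        return new HashSet<>(failedVariantKeys).size();
    }

    public static void main(String[] args) {
        // Placeholder keys: "variant-A" and "variant-C" each appear on two rows.
        List<String> failed = Arrays.asList(
                "variant-A", "variant-A", "variant-B", "variant-C", "variant-C");

        System.out.println("Failed rows:            " + countFailedRows(failed));
        System.out.println("Unique failed variants: " + countUniqueFailedVariants(failed));
    }
}
```

With these placeholder keys, the per-row count is 5 while the unique count is 3.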

This makes more sense to me: when annotating a large file, people usually don't know how many variants are unique, so counting duplicated variants tells them more clearly how many rows have invalid annotations.
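
The batching idea in item 1 can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `processInBatches`, the batch size, and the record strings are hypothetical, and the handler stands in for the annotate-and-write step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of batch-wise processing; names are illustrative only.
class BatchProcessingSketch {

    /**
     * Passes records to the handler in batches of at most batchSize.
     * Only the current batch is referenced here, so earlier batches can be
     * garbage-collected once the handler is done with them.
     * Returns the number of batches handled.
     */
    static <T> int processInBatches(Iterable<T> records, int batchSize,
                                    Consumer<List<T>> handler) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> batch = new ArrayList<>(batchSize);
        int batches = 0;
        for (T record : records) {
            batch.add(record);
            if (batch.size() == batchSize) {
                handler.accept(batch);
                batches++;
                // Start a fresh list instead of keeping a growing collection.
                batch = new ArrayList<>(batchSize);
            }
        }
        // Flush the final, possibly smaller, batch.
        if (!batch.isEmpty()) {
            handler.accept(batch);
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("r1", "r2", "r3", "r4", "r5");
        // The handler here only prints; in a real pipeline it would annotate and write.
        int batches = processInBatches(records, 2,
                batch -> System.out.println("Handling batch of " + batch.size()));
        System.out.println("Batches handled: " + batches);
    }
}
```

The point of the design is that peak memory scales with the batch size rather than with the number of rows in the input file.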

@leexgh changed the title from "Read and process mutation record in chunk to reduce memory usage" to "Read and process mutation record in batches to reduce memory usage" on May 19, 2025
@leexgh (Member, Author) commented May 19, 2025

@inodb @onursumer @ao508 @rmadupuri @callachennault Not sure who should review this PR; feel free to leave comments or unassign yourself from review 🙂

@leexgh added the bug label on May 19, 2025
@ao508 (Contributor) left a comment

LGTM, but I'd like someone from the pipelines team to confirm that these changes will not impact the ETLs. It looks like the POST method is unchanged, so it should be fine when the annotator dependency is updated in, say, the DMP data fetcher. @callachennault

Successfully merging this pull request may close these issues.

Genome Nexus Annotation Pipeline out of memory issue
2 participants