"Splitting" the data

Hi :) 

I am annotating VCF files from the SAVANA CNA caller using AnnotSV and am a bit curious as to how the split annotation mode works.

It seems very useful to have one row in the df corresponding to each gene effect instead of each SV. However, I seem to have some problem getting AnnotSV to split the data like that:

![Image](https://github.com/user-attachments/assets/9de7f59f-ed2d-45ff-a1c1-155b30a54a76)

As the screenshot shows, each SV still has numerous gene names listed.

Is this simply because I run larger CNAs through the algorithm, so that it cannot separate them into single genes? Or am I not setting it up correctly?

Here is the script I have run: 

#!/bin/bash
#SBATCH --job-name=annotsv_81_split
#SBATCH --output=annotsv_81_split.out
#SBATCH --error=annotsv_81_split.err
#SBATCH --time=4:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --account=Renal_long_read

####################################### File Paths and Directories #######################################
INPUT_VCF="/faststorage/project/Renal_long_read/derived_data/jesperj/savana/CNA_analysis/SAVANA_CNA_output/patient_81/81_segmented_absolute_copy_number.vcf"
FIXED_VCF="/faststorage/project/Renal_long_read/derived_data/jesperj/savana/CNA_analysis/SAVANA_CNA_output/patient_81/81_segmented_absolute_copy_number_fixed.vcf"
BASE_OUTPUT_DIR="/faststorage/project/Renal_long_read/derived_data/jesperj/savana/annotation/Patient_81_split"
OUTPUT_DIR="$BASE_OUTPUT_DIR"
OUTPUT_TSV="$OUTPUT_DIR/81_CNA_annotated.tsv"
OUTPUT_VCF="$OUTPUT_DIR/81_CNA_annotated.vcf"
GENOME_BUILD="GRCh38"

# Directory for annotation files
ANNOTSV_ANNOTATIONS="/home/jesperjespersen/AnnotSV_annotations"

####################################### Create Output Directory #######################################
mkdir -p "$OUTPUT_DIR"

####################################### Input File Check and VCF Header Fix #######################################
if [[ ! -f "$INPUT_VCF" ]]; then
    echo "Error: Input VCF file not found!"
    exit 1
fi

# Fix VCF header if the FORMAT field is missing (## meta lines are passed through unchanged)
grep "^#CHROM" "$INPUT_VCF" | grep -q "FORMAT" || \
awk 'BEGIN {OFS="\t"}
     /^##/ { print; next }
     /^#CHROM/ { print $0, "FORMAT", "SAMPLE"; next }
     !/^#/ { print $0, "GT", "0/1" }' "$INPUT_VCF" > "$FIXED_VCF"

# If no change was made, copy the original file to FIXED_VCF
if [[ ! -s "$FIXED_VCF" ]]; then
    cp "$INPUT_VCF" "$FIXED_VCF"
fi

####################################### Run AnnotSV with Split Analysis #######################################
AnnotSV \
  -SVinputFile "$FIXED_VCF" \
  -genomeBuild "$GENOME_BUILD" \
  -annotationsDir "$ANNOTSV_ANNOTATIONS" \
  -outputFile "$OUTPUT_TSV" \
  -outputDir "$OUTPUT_DIR" \
  -annotationMode split \
  -vcf 1

# Move the generated VCF to the final location (if created)
GENERATED_VCF="${OUTPUT_DIR}/$(basename "$FIXED_VCF" .vcf).annotated.vcf"

if [[ -f "$GENERATED_VCF" ]]; then
    mv "$GENERATED_VCF" "$OUTPUT_VCF"
else
    echo "Warning: AnnotSV did not generate a VCF file as expected."
fi
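
To check whether any split rows were produced at all, the output could be tallied by annotation mode. This is only a minimal sketch: it assumes the output is a tab-separated file whose header row contains a column named `Annotation_mode`, and the helper name `count_annotation_modes` is made up for illustration.

```shell
# Count rows per Annotation_mode value in an AnnotSV-style TSV.
# Assumption: the file is tab-separated and its header row has a
# column named "Annotation_mode"; the column is located by name,
# not by position.
count_annotation_modes() {
    awk -F'\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "Annotation_mode") col = i
            if (!col) { print "Annotation_mode column not found" > "/dev/stderr"; exit 1 }
            next
        }
        { counts[$col]++ }
        END { for (mode in counts) print mode "\t" counts[mode] }
    ' "$1" | sort
}
```

Calling `count_annotation_modes "$OUTPUT_TSV"` would then print one line per mode with its row count, which should make it clear whether the rows in the screenshot are "full" rows or genuinely unsplit output.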



Best regards 
Jesper :) 
