
"Splitting" the data #273

@JesperHJespersen

Hi :)

I am annotating VCF files from the SAVANA CNA caller using AnnotSV and am a bit curious as to how the split annotation mode works.

It seems very useful to have one row in the df for each gene effect instead of one row per SV. However, I seem to have some problems getting AnnotSV to split the data like that:

Image

Each SV still has numerous gene names listed.

Is this simply because I am running larger CNAs through AnnotSV, so that it cannot separate them into individual genes? Or am I not setting it up correctly?

Here is the script I have run:

#!/bin/bash
#SBATCH --job-name=annotsv_81_split
#SBATCH --output=annotsv_81_split.out
#SBATCH --error=annotsv_81_split.err
#SBATCH --time=4:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --account=Renal_long_read

####################################### File Paths and Directories #######################################
INPUT_VCF="/faststorage/project/Renal_long_read/derived_data/jesperj/savana/CNA_analysis/SAVANA_CNA_output/patient_81/81_segmented_absolute_copy_number.vcf"
FIXED_VCF="/faststorage/project/Renal_long_read/derived_data/jesperj/savana/CNA_analysis/SAVANA_CNA_output/patient_81/81_segmented_absolute_copy_number_fixed.vcf"
BASE_OUTPUT_DIR="/faststorage/project/Renal_long_read/derived_data/jesperj/savana/annotation/Patient_81_split"
OUTPUT_DIR="$BASE_OUTPUT_DIR"
OUTPUT_TSV="$OUTPUT_DIR/81_CNA_annotated.tsv"
OUTPUT_VCF="$OUTPUT_DIR/81_CNA_annotated.vcf"
GENOME_BUILD="GRCh38"

# Directory for annotation files
ANNOTSV_ANNOTATIONS="/home/jesperjespersen/AnnotSV_annotations"

####################################### Create Output Directory #######################################
mkdir -p "$OUTPUT_DIR"

####################################### Input File Check and VCF Header Fix #######################################
if [[ ! -f "$INPUT_VCF" ]]; then
    echo "Error: Input VCF file not found!"
    exit 1
fi

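# Fix the VCF header if the FORMAT field is missing.
# Meta-information lines (##...) are passed through unchanged, and a
# ##FORMAT definition for GT is added just before the #CHROM line.
grep "^#CHROM" "$INPUT_VCF" | grep -q "FORMAT" ||
awk 'BEGIN {OFS="\t"}
/^##/ { print; next }
/^#CHROM/ { print "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
            print $0, "FORMAT", "SAMPLE"; next }
!/^#/ { print $0, "GT", "0/1" }' "$INPUT_VCF" > "$FIXED_VCF"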

# If no change was made, copy the original file to FIXED_VCF
if [[ ! -s "$FIXED_VCF" ]]; then
    cp "$INPUT_VCF" "$FIXED_VCF"
fi

####################################### Run AnnotSV with Split Analysis #######################################
AnnotSV \
    -SVinputFile "$FIXED_VCF" \
    -genomeBuild "$GENOME_BUILD" \
    -annotationsDir "$ANNOTSV_ANNOTATIONS" \
    -outputFile "$OUTPUT_TSV" \
    -outputDir "$OUTPUT_DIR" \
    -annotationMode split \
    -vcf 1

# Move the generated VCF to the final location (if created)
GENERATED_VCF="${OUTPUT_DIR}/$(basename "$FIXED_VCF" .vcf).annotated.vcf"

if [[ -f "$GENERATED_VCF" ]]; then
    mv "$GENERATED_VCF" "$OUTPUT_VCF"
else
    echo "Warning: AnnotSV did not generate a VCF file as expected."
fi
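
One way to check whether the output really contains split rows is to count the values in the annotation-mode column of the TSV. The sketch below assumes the AnnotSV output has a header column named `Annotation_mode` with values such as `full` and `split`; the function name `count_annotation_modes` is made up for illustration and is not part of AnnotSV.

```shell
#!/bin/bash
# Sketch: count rows per annotation mode in an AnnotSV TSV.
# Assumption: the header row has a column named "Annotation_mode".
count_annotation_modes() {
    local tsv="$1"
    awk -F'\t' '
        NR == 1 {
            # Find the Annotation_mode column by name in the header row
            for (i = 1; i <= NF; i++) if ($i == "Annotation_mode") col = i
            if (!col) { print "Annotation_mode column not found" > "/dev/stderr"; exit 1 }
            next
        }
        { count[$col]++ }   # tally each data row by its mode
        END { for (mode in count) print mode "\t" count[mode] }
    ' "$tsv" | sort
}

# Example call (path taken from the script above):
# count_annotation_modes "$OUTPUT_TSV"
```

If only `full` rows show up, each row would be expected to list every gene overlapped by the SV, which would match what I am seeing. If `split` rows are present, filtering on them should give one row per gene.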

Best regards
Jesper :)
