From 508c8c650f75a0368d81954bb1e01d54ba7e9cd8 Mon Sep 17 00:00:00 2001
From: Mike Lin
Date: Sun, 6 Sep 2020 16:53:17 -1000
Subject: [PATCH] document convention for project VCF "QC squeezing"

---
 VCFv4.4.draft.tex | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/VCFv4.4.draft.tex b/VCFv4.4.draft.tex
index 63ce548e2..edbc1df95 100644
--- a/VCFv4.4.draft.tex
+++ b/VCFv4.4.draft.tex
@@ -1409,6 +1409,11 @@ \subsection{Representing unspecified alleles and REF-only blocks (gVCF)}
 \end{flushleft}
 \normalsize
 
+\subsection{Selective genotype fields in many-sample VCF}
+In VCF data representing SNPs and small indels jointly discovered and genotyped across a population, typically the vast majority of genotype fields are homozygous for the reference allele (0/0). When these GT entries are accompanied by the full array of quality-control FORMAT fields supporting variant genotypes (e.g. AD, SB, PL), the resulting file size may grow disproportionately to the practical utility of these fields.
+
+To ameliorate this, joint variant calling tools may opt to order FORMAT fields so that most of them can be omitted in most entries (invoking the previous clause, ``Trailing [FORMAT] fields can be dropped, with the exception of the GT field''). For example, the FORMAT fields might be ordered {\tt GT:DP:AD:SB:PL} so that entries lacking significant evidence of variation may write GT and DP only (e.g. {\tt 0/0:32}). DP might furthermore be binned in some way, to improve compressibility. Tools consuming many-sample VCF files should accommodate this convention.
+
 \pagebreak
 
 \section{BCF specification}
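
Reviewer note: from the consumer side, the convention in the added subsection amounts to tolerating sample columns that carry fewer values than the FORMAT column has keys. The following is a minimal Python sketch of that behaviour, not part of the patch and not taken from any existing tool; the function name `parse_sample` and the choice to represent omitted fields as `None` are assumptions made for illustration.

```python
def parse_sample(format_keys, sample_field):
    """Map FORMAT keys to values for one sample column.

    Hypothetical helper: trailing fields the writer dropped are
    returned as None, as are fields written as the missing value '.'.
    """
    values = sample_field.split(":")
    if len(values) > len(format_keys):
        raise ValueError("sample column has more values than FORMAT keys")
    # Pad on the right: only trailing fields may be omitted.
    values += [None] * (len(format_keys) - len(values))
    return {key: (None if val == "." else val)
            for key, val in zip(format_keys, values)}


# FORMAT ordering from the example in the patch text.
keys = "GT:DP:AD:SB:PL".split(":")

# A reference-homozygous entry written with GT and DP only.
squeezed = parse_sample(keys, "0/0:32")

# A variant entry carrying all five fields (values are made up).
full = parse_sample(keys, "0/1:30:14,16:7,7,8,8:255,0,255")
```

With this padding rule, a reader sees the same set of keys for every sample and can tell an omitted field from a present one by checking for `None`.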