Failure to extend header parameters from tables/output/annotation_fields for 'vembrane table --header "{params.config[header]}" "{params.config[expr]}' #333
Comments
Thanks for digging into this and for taking the time to report this thoroughly! We recently changed how this works, and it seems we haven't gotten to a clean solution yet.
So do you have a clear idea of what you would want in this case, and what you would want in a more general case? Maybe this can help us figure out how exactly we want to eventually resolve this.
Vembrane table creates multiple lines for each variant based on annotation information. Because the files can be so large, I read the tsv, turn it into a zarr file, and then process it in sections while also doing some initial filtering and grouping. I don't generally use the config.yaml file for filtering because I have to change the criteria so often. I have had to use it to remove some chromosomal locations which have caused downstream processing issues, as described in this article about problematic regions.
There can be multiple annotations for a single variant, which means some variants have dozens of lines after vembrane is used to create the table. To get around this, I group by each individual variant for the file (subject or family) and include all columns which are unique for each variant ("chromosome", "position", "allele", max_af, etc.). This ensures that each unique variant has its own line. Any additional annotation that has multiple values is aggregated into a list.
I currently process data using the flow chart method below. It can require allocating a lot of memory due to the size of the tsv files, but that is about the only issue. I had thought about some alternative methods, but those would have used the vcf file. The vcf files are already formatted for vembrane and I just needed to get the data out. The vembrane table rule already did this, and I can easily reformat using config.yaml.
graph TD
A[BCF File] -->|vembrane| B[TSV File]
B -->|xarray| C[Zarr Folder]
C -->|reread| H[Process by chromosome]
H -->|filter, groupby | D[Processed Chr 1]
H -->|filter, groupby| E[Processed Chr ..]
D -->|aggregate| F[Save to TSV]
E -->|aggregate| F[Save to TSV]
F -->|join| G[Final TSV File]
I[Additional TSV files] -->|join| G[Final TSV File]
I think this answers your questions? I attached a copy of one of my config.yaml files for reference since I use a lot more columns than just BIOTYPE. I have also included a test file as an example of the final data exported.
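For readers following along, the group-and-aggregate step described in this comment can be sketched roughly as below. This is a minimal, dependency-free illustration and not the poster's actual code: the function name `collapse_variants`, the key columns, and the sample rows are all assumptions made for the example. With pandas, something along the lines of `df.groupby(key_columns).agg(list)` would likely play the same role.

```python
# Hypothetical key columns; the real table uses whatever headers vembrane emits.
KEY_COLUMNS = ("chromosome", "position", "allele")


def collapse_variants(rows, key_columns=KEY_COLUMNS):
    """Collapse per-annotation rows into one row per variant.

    Columns in key_columns identify a variant and are kept as scalars.
    Every other column is aggregated into a list of its values, in input order.
    """
    grouped = {}
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key not in grouped:
            # First time this variant is seen: start a row with its key columns.
            grouped[key] = dict(zip(key_columns, key))
        entry = grouped[key]
        for col, value in row.items():
            if col not in key_columns:
                entry.setdefault(col, []).append(value)
    return list(grouped.values())


# Made-up example rows: the first variant carries two annotations.
rows = [
    {"chromosome": "1", "position": 100, "allele": "A", "biotype": "protein_coding"},
    {"chromosome": "1", "position": 100, "allele": "A", "biotype": "lncRNA"},
    {"chromosome": "2", "position": 200, "allele": "T", "biotype": "protein_coding"},
]
collapsed = collapse_variants(rows)
```

Each variant ends up on a single line, with its multi-valued annotations held in lists. Running this per chromosome keeps the working set small, which is the point of the sectioned processing in the flow chart.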
A couple of thoughts of how you might get most of this done within the existing workflow:
So if you haven't, maybe you can look into the in-workflow vembrane filtering and the report, and see if this meets some of your needs and if you need more things amended in that setup? Also, if you have suggestions on making any of what I explained here clearer when deploying and configuring the workflow, please let me know. (Useful) documentation is hard...
In table.smk, the rule vembrane_table calls get_vembrane_config to get the parameters for the vembrane header table. This should add the annotation fields from config.yaml.
Example from config.yaml
get_vembrane_config appends the fields (BIOTYPE, Loftool, and REVEL in this example) with the following code.
sort_order is created and contains "ANN['SYMBOL']", "ANN['IMPACT']", etc., but it is a limited list. It is extended with additional values, but it is still not a complete list of all the values available in vembrane ANN types.
Complete list of custom ANN types:
https://github.com/vembrane/vembrane/blob/main/docs/ann_types.md
Finally, the sorted_columns_dict is created to return the values used in the vembrane_table rule. This removes any values which were added to columns_dict if they are not in sort_order. It would remove BIOTYPE and Loftool, but not REVEL as it is in sort_order.
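To make the effect concrete, here is a small sketch of the behavior described above, next to one possible alternative. It is not the workflow's actual get_vembrane_config code: the key format, the dictionary contents, and both helper functions (`filter_by_sort_order`, `sort_keep_extras`) are hypothetical stand-ins chosen only to illustrate the idea.

```python
# Hypothetical stand-ins for the workflow's data, not the real values.
sort_order = ["ANN['SYMBOL']", "ANN['IMPACT']", "ANN['REVEL']"]
columns_dict = {
    "ANN['REVEL']": "revel",
    "ANN['BIOTYPE']": "biotype",
    "ANN['Loftool']": "loftool",
    "ANN['SYMBOL']": "symbol",
}


def filter_by_sort_order(columns, order):
    """Keep only the columns listed in order (the behavior described above)."""
    return {key: columns[key] for key in order if key in columns}


def sort_keep_extras(columns, order):
    """Put known columns first in order, then append the rest unchanged."""
    known = {key: columns[key] for key in order if key in columns}
    extras = {key: value for key, value in columns.items() if key not in known}
    return {**known, **extras}


dropped = filter_by_sort_order(columns_dict, sort_order)
kept = sort_keep_extras(columns_dict, sort_order)
```

In this sketch, `dropped` loses BIOTYPE and Loftool because they are missing from `sort_order`, while `kept` retains them after the sorted columns. Appending unknown columns at the end is only one option. Extending `sort_order` to cover every ANN type would be another.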
For a variants.fdr-controlled.tsv file, the final output of the tsv files consists only of the following columns and can't be extended.
symbol
impact
hgvsp
hgvsc
consequence
clinical significance
gnomad genome af
exon
revel
chromosome
position
reference allele
alternative allele
protein position
protein alteration (short)
canonical
mane_plus_clinical
prob: absent
prob: artifact
prob: variant
patient: allele frequency
patient: read depth
patient: short ref observations
patient: short alt observations
patient: observations
id
gene
hgvsg
feature
end position