theme: ./theme
colorSchema: bright
layout: intro
highlighter: shiki
lineNumbers: false
title: Analyzing genomics data
info: |
  ## Omics Data Analysis: Genomics
  How to obtain biological information from the measurements.
drawings:
css: unocss
themeConfig:
- Life and Medical Sciences in Bonn 🇩🇪
- PhD in leukemia epigenetics in Münster 🇩🇪
- Founder (& liquidator) of start-up Nucleotidy
- Bioinformatician at the NGI, Stockholm 🇸🇪
An exciting day in the lab...
layout: text-image media: '/assets/fun/phd021315s.gif' caption: 'https://phdcomics.com/comics/archive.php?comicid=1780'
layout: text-image reverse: true media: '/assets/tools/excel/nameerrors.png' caption: 'https://doi.org/10.1186/s13059-016-1044-7'
In 2020, the HUGO Gene Nomenclature Committee (HGNC) renamed genes whose symbols Excel auto-converted to dates.
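
A gene list can be screened for date-like symbols before it is opened in a spreadsheet. The sketch below is only a heuristic: the regular expression and the example symbols are assumptions chosen for illustration, not the rule Excel applies and not a complete list of affected genes.

```python
import re

# Month-like prefix followed by a one- or two-digit number, e.g. SEPT1 or DEC1.
# Assumption: a rough heuristic, not Excel's actual conversion logic.
DATE_LIKE = re.compile(
    r"^(JAN|FEB|MAR|MARCH|APR|MAY|JUN|JUL|AUG|SEP|SEPT|OCT|NOV|DEC)-?(\d{1,2})$",
    re.IGNORECASE,
)


def date_like_symbols(symbols):
    """Return the symbols that a spreadsheet might interpret as a date."""
    return [symbol for symbol in symbols if DATE_LIKE.match(symbol)]


# Illustrative input mixing old-style and unaffected symbols.
genes = ["SEPT1", "MARCH1", "TP53", "DEC1", "SEPTIN1", "BRCA2"]
print(date_like_symbols(genes))
```

Symbols flagged this way are candidates for importing the column explicitly as text rather than letting the spreadsheet guess the type.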
layout: text-image media: '/assets/tools/excel/conversionoptional.png' caption: 'https://gizmodo.com/microsoft-fixes-excel-feature-that-forced-scientists-to-1850949443'
- They might not have a GUI.
- They might not run on your machine.
- For remote compute, mind data privacy!
- GNU / Linux
- macOS
- Windows Subsystem for Linux
::window::
_ _ ____ ____ __ __ _ __ __
| | | | _ \| _ \| \/ | / \ \ \/ / | System: rackham3
| | | | |_) | |_) | |\/| | / _ \ \ / | User: bioinfomagician
| |_| | __/| __/| | | |/ ___ \ / \ |
\___/|_| |_| |_| |_/_/ \_\/_/\_\ |
########################################################################
User Guides: http://www.uppmax.uu.se/support/user-guides
FAQ: http://www.uppmax.uu.se/support/faq
Write to [email protected], if you have questions or comments.
(base) [bioinfomagician@rackham3 ~]$
layout: text-image media: '/assets/tools/issues/PIIS1934590923002886.png' caption: 'https://doi.org/10.1016/j.stem.2023.08.005'
Example from j.stem.2023.08.005:
- Findings backed by wet lab results.
- Distances in 2D projections of UMAP / t-SNE are not directly interpretable.
- Their loss functions are invariant with respect to rotations.
- More details at Understanding UMAP
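
The rotation argument can be checked numerically. The sketch below assumes that the loss depends on the embedding only through the pairwise distances between points; under that assumption, any rotated copy of a 2D embedding scores exactly the same, so the orientation of the plot carries no information. The coordinates are made up for illustration.

```python
import math


def rotate(points, angle):
    """Rotate 2D points around the origin by `angle` (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def pairwise_distances(points):
    """Euclidean distance for every unordered pair of points."""
    return [
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    ]


# Hypothetical embedding coordinates.
embedding = [(0.0, 0.0), (1.0, 0.5), (-2.0, 3.0), (4.0, -1.0)]
rotated = rotate(embedding, math.radians(73))

unchanged = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for a, b in zip(pairwise_distances(embedding), pairwise_distances(rotated))
)
print(unchanged)  # True: a distance-based loss cannot tell the two apart
```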
Example from 10.1038/s41586-020-2095-1:
- In this case, both the analysis strategy and the understanding of the methods used were inadequate.
- Overconfident broad generalization of the findings.
- More details at 10.1128/mbio.01607-23
- Plain text format
- Each read is represented by four consecutive lines:
- Sequence identifier and an optional description
- The sequence
- + (separator; repeating the identifier after it is optional)
- The base call quality
@SCILIFELAB:500:NGISTLM:1:1101:32832:1016 1:N:0:GCTTCAGGGT+AAGGTAGCGT
TCCCCCAACTTGATATTAATAACACTATAGACCACCGCCCCGAAGGGGACGAAAAATGGTTTTTAGAGAACGAGAAGACGGTTACGCAG
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFF
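
The four-line layout can be read with a few lines of Python. This is a minimal sketch that assumes the common Phred+33 quality encoding and unwrapped (single-line) sequences; the record used here is a shortened, hypothetical one, and `parse_fastq` is an illustrative helper rather than a library function.

```python
from typing import NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    qualities: list


def parse_fastq(lines):
    """Yield one record for every four consecutive lines."""
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header}")
        # Assumption: Phred+33, i.e. quality score = ASCII code - 33.
        scores = [ord(char) - 33 for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


# Hypothetical short read, not taken from real data.
example = ["@read1 demo", "GATTACA", "+", "F#F:FFF"]
records = list(parse_fastq(example))
print(records[0].identifier, records[0].qualities)
```

A `#` therefore encodes a score of 2 and an `F` a score of 37, which is why a single `#` stands out in an otherwise high-quality read.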
::left::
Base call quality is high along the full read
::right::
The base composition is balanced
::left::
Base call quality drops off dramatically
::right::
The base composition is heavily skewed
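
Per-position base composition, as plotted by quality control tools, amounts to counting bases column by column. The following sketch uses four made-up reads; `base_composition` is an illustrative helper, not the implementation of any particular tool.

```python
from collections import Counter


def base_composition(reads):
    """Fraction of A, C, G and T at every position across all reads."""
    longest = max(len(read) for read in reads)
    composition = []
    for position in range(longest):
        # Reads shorter than the current position are skipped.
        counts = Counter(read[position] for read in reads if position < len(read))
        total = sum(counts.values())
        # Other symbols (e.g. N) count towards the total but are not reported.
        composition.append({base: counts[base] / total for base in "ACGT"})
    return composition


# Hypothetical reads: positions 0 and 2 are fully skewed, position 1 is balanced.
reads = ["ACGT", "AGGT", "ATGA", "AAGC"]
composition = base_composition(reads)
for position, fractions in enumerate(composition):
    print(position, fractions)
```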
Find the exact origin of a short fragment in a long reference
Which reference is the most-likely origin?
Create a long reference from short fragments
::window::
>NC_001422.1 Escherichia phage phiX174
GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTT
GATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAA
ATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTG
TCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACTGGTTTA
GATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATC
TGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAATCCGTACGTT
TCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAGGCTTCTGCCGTTTTGGATTTAACCGAAGATGATTT
CGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCT
TGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCG
TCATTGCTTATTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCATGGAAGGCGCTGAATTTAC
GGAAAACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTGTA
CGCGCAGGAAACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGCGGAAGGAG
TGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTCCGCAGCCGTTGCGAGGTACT
AAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATTTTAATTGCAGGGGCTTCGGC
CCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCTTTCCCA
TCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGAC
TCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTA
CTGTAGACATTTTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAA
GGATGGTGTTAATGCCACTCCTCTCCCGACTGTTAACACTACTGGTTATATTGACCATGCCGCTTTTCTT
GGCACGATTAACCCTGATACCAATAAAATCCCTAAGCATTTGTTTCAGGGTTATTTGAATATCTATAACA
ACTATTTTAAAGCGCCGTGGATGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGC
TCGTTATGGTTTCCGTTGCTGCCATCTCAAAAACATTTGGACTGCTCCGCTTCCTCCTGAGACTGAGCTT
[...]
- Unique: ng spirits brig ghing all the w
- Multi-mapper: Jingle bel
- Base error: what pun it is
- Indels: Jingggge bls
- Within scaffold: Bells on bob tail open sleigh. Hey!
- Within reference: Sankta Lucia
::window::
Dashing through the snow
In a one-horse open sleigh
O'er the fields we go
Laughing all the way
Bells on bob tail ring
Making spirits bright
What fun it is to ride and sing
A sleighing song tonight! Oh!
Jingle bells, jingle bells,
Jingle all the way.
Oh! what fun it is to ride
In a one-horse open sleigh. Hey!
Jingle bells, jingle bells,
Jingle all the way;
Oh! what fun it is to ride
In a one-horse open sleigh.
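
The lyric analogy translates directly into a naive mapper. The sketch below slides each fragment along a shortened copy of the lyrics and counts positions with at most `max_mismatches` substitutions; real aligners use indexes and also handle indels, which this brute-force illustration deliberately ignores.

```python
def find_hits(read, reference, max_mismatches=0):
    """Start positions where `read` matches with at most `max_mismatches` substitutions."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


def classify(read, reference, max_mismatches=0):
    """Label a read by the number of positions it maps to."""
    hits = len(find_hits(read, reference, max_mismatches))
    if hits == 0:
        return "unmapped"
    return "unique" if hits == 1 else "multi-mapper"


# Shortened excerpt of the lyrics, used as the reference.
reference = (
    "Making spirits bright. Oh! what fun it is to ride. "
    "Jingle bells, jingle bells, Jingle all the way. "
    "Jingle bells, jingle bells, Jingle all the way."
)

print(classify("ng spirits brig", reference))                   # unique
print(classify("Jingle bel", reference))                        # multi-mapper
print(classify("what pun it is", reference))                    # unmapped
print(classify("what pun it is", reference, max_mismatches=1))  # unique
print(classify("Sankta Lucia", reference))                      # unmapped
```

Tolerating one mismatch rescues the read with the base error, while the fragment from a different song stays unmapped regardless.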
- Plain text format (SAM)
- Binary & compressed format (BAM/CRAM)
- Contains a header with metadata about reference and aligner
- Stores one alignment per line
- May contain secondary alignments
@HD VN:1.0 SO:coordinate
@SQ SN:chr1 LN:197195432
[...]
@PG ID:Bowtie VN:1.1.2 CL:"bowtie --wrapper basic-0 --threads 4 -v 2 -m 10 -a /ifs/mirror/genomes/bowtie/mm9 /dev/fd/63 --sam"
[...]
SRR2057595.665063_CGCCG 16 chr19 3486359 255 63M * 0 0 * * XA:i:0 MD:Z:63 NM:i:0 UG:i:0 BX:Z:CGCCG
SRR2057595.1043355_CGCCG 16 chr19 3486359 255 63M * 0 0 * * XA:i:0 MD:Z:63 NM:i:0 UG:i:0 BX:Z:CGCCG
SRR2057595.2024535_CGCCG 16 chr19 3486359 255 63M * 0 0 * * XA:i:0 MD:Z:63 NM:i:0 UG:i:0 BX:Z:CGCCG
SRR2057595.3828487_CGCCG 16 chr19 3486359 255 63M * 0 0 * * XA:i:0 MD:Z:63 NM:i:0 UG:i:0 BX:Z:CGCCG
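
Each alignment line has eleven mandatory tab-separated columns followed by optional `TAG:TYPE:VALUE` fields. The sketch below splits the first record shown above into named fields and decodes the strand from the FLAG; `parse_alignment` is an illustrative helper, and dedicated libraries should be preferred for real data.

```python
SAM_FIELDS = [
    "qname", "flag", "rname", "pos", "mapq", "cigar",
    "rnext", "pnext", "tlen", "seq", "qual",
]


def parse_alignment(line):
    """Split one SAM alignment line into a dictionary of fields."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    # Optional fields follow the pattern TAG:TYPE:VALUE.
    record["tags"] = {}
    for field in columns[11:]:
        name, value_type, value = field.split(":", 2)
        record["tags"][name] = int(value) if value_type == "i" else value
    # Bit 0x10 of the FLAG marks an alignment to the reverse strand.
    record["reverse_strand"] = bool(record["flag"] & 0x10)
    return record


line = "\t".join([
    "SRR2057595.665063_CGCCG", "16", "chr19", "3486359", "255", "63M",
    "*", "0", "0", "*", "*",
    "XA:i:0", "MD:Z:63", "NM:i:0", "UG:i:0", "BX:Z:CGCCG",
])
alignment = parse_alignment(line)
print(alignment["rname"], alignment["pos"], alignment["reverse_strand"])
```

All four records above share the same position, strand and `BX` tag value, which is the kind of pattern a deduplication step looks for.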
Mismatches that pile up at the same position may represent genuine individual variation (mind the ploidy)
Annotations (e.g. gene positions) aid the interpretation
- Basecalling
- De-multiplexing of samples
- Discriminate true signal from false positives
- Find nearby genes or regulatory elements
layout: text-image media: '/assets/bio/cgv/humangenomeproject.png' caption: 'Covers from the 2001 draft sequence release' reverse: true
- Are versioned in major (GRCh38, hg38) and minor releases (GRCh38.p14)
- Come in different flavors
- Used for most applications.
Long reads filled gaps and revealed inversions
- Scale analyses to a large number of samples
- Allow for parallel processing
- Agnostic of the compute infrastructure
- A sequence of interdependent processes
- Outputs are consumed by other steps
::window::
graph TD
A(STAR) -->|*.Aligned.out.bam| B
A -->|*.Aligned.toTranscriptome.out.bam| B
B(samtools sort - coordinate)
B -->|*.sorted.bam| C
B -->|*.transcriptome.sorted.bam| C
C(umi-tools dedup)
C -->|*.umi_dedup.sorted.bam| E
C -. *.umi_dedup.sorted.bam .-> D
C -->|*.umi_dedup.transcriptome.bam| D
C -->|*.umi_dedup.transcriptome.bam| P
D(samtools sort - name)
D -->|*.umi_dedup.transcriptome.sorted.bam| S
D -. *.umi_dedup.namesorted.bam .-> E
E(picard MarkDuplicates)
E -->|*.markdup.sorted.bam| F
F(featureCounts)
P(prepare-for-rsem.py)
P -->|*.umi_dedup.transcriptome.filtered.bam| R
R(RSEM)
S(salmon)
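
Because every step only consumes outputs of earlier steps, a valid execution order follows from the dependency graph alone. The sketch below encodes the edges of the diagram above and lets Python's standard `graphlib` module derive one such order; the step labels are shortened for readability.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose outputs it consumes.
dependencies = {
    "samtools sort (coordinate)": {"STAR"},
    "umi-tools dedup": {"samtools sort (coordinate)"},
    "samtools sort (name)": {"umi-tools dedup"},
    "picard MarkDuplicates": {"umi-tools dedup", "samtools sort (name)"},
    "featureCounts": {"picard MarkDuplicates"},
    "prepare-for-rsem.py": {"umi-tools dedup"},
    "RSEM": {"prepare-for-rsem.py"},
    "salmon": {"samtools sort (name)"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
for step in order:
    print(step)
```

Steps without a path between them, such as `salmon` and `RSEM`, may appear in either order, which is exactly the freedom a workflow manager exploits to run them in parallel.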
- One process
- Input is a list of three greetings
- The process is run for each input
::window::
#!/usr/bin/env nextflow
nextflow.enable.dsl=2

process sayHello {
    input:
    val x

    output:
    stdout

    script:
    """
    echo '$x world!'
    """
}

workflow {
    Channel.of('Bonjour', 'Hej', 'Hello') | sayHello | view
}
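
For readers more familiar with Python, the same idea of running one process once per input can be mirrored with a thread pool. This is only an analogy under stated assumptions: `say_hello` is a hypothetical stand-in for the process, and unlike the workflow above, `pool.map` returns results in input order, whereas a workflow manager is free to emit them as tasks finish.

```python
from concurrent.futures import ThreadPoolExecutor


def say_hello(greeting):
    """Stand-in for the process: one independent task per input value."""
    return f"{greeting} world!"


greetings = ["Bonjour", "Hej", "Hello"]

# Each greeting becomes its own task; the pool may run them concurrently.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(say_hello, greetings))

for line in results:
    print(line)
```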
Several hundred workflow systems exist, but in bioinformatics it boils down to these:
Honorable mention: Reflow, Workflow Description Language
- Isolated processes linked by dependencies (Directed acyclic graphs)
- Conceptually no dimension for time
- Snakemake, Nextflow ...
- Specify a sequence of steps explicitly
- Airflow, ...
- Each output is the materialization of a task sequence.
- Data assets are "aware" of their pedigree
- Dagster...
layout: iframe-right url: https://nf-co.re/pipelines class: text-sm
- Bioinformatic workflow community
- Nextflow pipelines
- 93 public pipelines
- More than 1000 modules
- A friendly Slack space for questions
- Watch Beginner's guide to nf-core
layout: iframe-right url: https://anvio.org class: text-sm
- Microbial 'omics community
- Built on top of Snakemake
- Reproducible exploratory analyses with artifacts and workflows
- Publish figures with provenance
- A friendly Discord space for questions
- Watch a video tutorial.
https://github.com/MatthiasZepper/Lecture-OmicsDataAnlysis