Commit

Merge pull request #36 from camilogarciabotero/orf-rehaul
Rename ORF to ORFI
camilogarciabotero authored Aug 9, 2024
2 parents 21ac919 + e872ca3 commit 4d71ffc
Showing 21 changed files with 396 additions and 338 deletions.
1 change: 1 addition & 0 deletions .vscode/settings.json
@@ -0,0 +1 @@
{}
56 changes: 28 additions & 28 deletions README.md
@@ -24,7 +24,7 @@

>This is a species-agnostic, algorithm extensible, sequence-anonymous (genome, metagenomes) *gene finder* library framework for the Julia Language.
The `GeneFinder` package aims to be a versatile module that enables the application of different gene finding algorithms to the `BioSequence` type, by providing a common interface and a flexible data structure to store the predicted ORF or genes. The package is designed to be easily extensible, allowing users to implement their own algorithms and integrate them into the framework.
The `GeneFinder` package aims to be a versatile module that enables the application of different gene finding algorithms to the `BioSequence` type, by providing a common interface and a flexible data structure to store the predicted ORFI or genes. The package is designed to be easily extensible, allowing users to implement their own algorithms and integrate them into the framework.

> [!WARNING]
This package is currently under development and is not yet ready for production use. The API is subject to change.
@@ -37,12 +37,12 @@ You can install `GeneFinder` from the julia REPL. Press `]` to enter pkg mode, a
add GeneFinder
```

## Finding complete and overlapped ORFs
## Finding complete and overlapped ORFIs

The main package function is `findorfs`. Under the hood, the `findorfs` function is an interface for different gene finding algorithms that can be plugged using the `finder` keyword argument. By default it uses the `NaiveFinder` algorithm, which is a simple algorithm that finds all (non-outbounded) ORFs in a DNA sequence (see the [NaiveFinder](https://camilogarciabotero.github.io/GeneFinder.jl/dev/api/#GeneFinder.NaiveFinder-Union{Tuple{Union{BioSequences.LongDNA{N},%20BioSequences.LongSubSeq{BioSequences.DNAAlphabet{N}}}},%20Tuple{N}}%20where%20N) documentation for more details).
The main package function is `findorfs`. Under the hood, the `findorfs` function is an interface for different gene finding algorithms that can be plugged using the `finder` keyword argument. By default it uses the `NaiveFinder` algorithm, which is a simple algorithm that finds all (non-outbounded) ORFIs in a DNA sequence (see the [NaiveFinder](https://camilogarciabotero.github.io/GeneFinder.jl/dev/api/#GeneFinder.NaiveFinder-Union{Tuple{Union{BioSequences.LongDNA{N},%20BioSequences.LongSubSeq{BioSequences.DNAAlphabet{N}}}},%20Tuple{N}}%20where%20N) documentation for more details).

> [!NOTE]
The `minlen` kwarg in the `NaiveFinder` method has been set to 6nt, so it will catch random ORFs that are not necessarily genes; thus it might consider `dna"ATGTGA"` -> `aa"M*"` as a plausible ORF.
The `minlen` kwarg in the `NaiveFinder` method has been set to 6nt, so it will catch random ORFIs that are not necessarily genes; thus it might consider `dna"ATGTGA"` -> `aa"M*"` as a plausible ORFI.
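
The `dna"ATGTGA"` case from the note can be reproduced with a plain-Julia sketch. This is not the package's implementation: it assumes a `String` input and the forward strand only, `toy_orf_ranges` is a hypothetical helper, and the regex is just one illustrative way to express "start codon, whole codons, first in-frame stop".

```julia
# Minimal sketch (assumptions: plain String input, '+' strand only).
# `toy_orf_ranges` is a hypothetical helper, not part of GeneFinder.
function toy_orf_ranges(seq::AbstractString; minlen::Int=6)
    # ATG, then as few whole codons as possible, then the first in-frame stop codon.
    pattern = r"ATG(?:[ACGT]{3})*?T(?:AG|AA|GA)"
    ranges = UnitRange{Int}[]
    # overlap=true lets a new match start inside a previous one (nested ORFs).
    for m in eachmatch(pattern, seq; overlap=true)
        loc = m.offset:(m.offset + ncodeunits(m.match) - 1)
        length(loc) >= minlen && push!(ranges, loc)
    end
    return ranges
end

toy_orf_ranges("ATGTGA")              # [1:6], the 6 nt case from the note
toy_orf_ranges("CCATGAAATAGATGTGA")   # [3:11, 12:17]
toy_orf_ranges("ATGTGA"; minlen=9)    # empty: shorter than minlen
```

Raising `minlen` is the simplest way to drop such very short hits when only gene-sized candidates are of interest.
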
Here is an example of how to use the `findorfs` function with the `NaiveFinder` algorithm:

@@ -54,22 +54,22 @@ seq = dna"AACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAACAGCACTGGCA

orfs = findorfs(seq, finder=NaiveFinder) # use finder=NaiveCollector as an alternative

12-element Vector{ORF{4, NaiveFinder}}:
ORF{NaiveFinder}(29:40, '+', 2)
ORF{NaiveFinder}(137:145, '+', 2)
ORF{NaiveFinder}(164:184, '+', 2)
ORF{NaiveFinder}(173:184, '+', 2)
ORF{NaiveFinder}(236:241, '+', 2)
ORF{NaiveFinder}(248:268, '+', 2)
ORF{NaiveFinder}(362:373, '+', 2)
ORF{NaiveFinder}(470:496, '+', 2)
ORF{NaiveFinder}(551:574, '+', 2)
ORF{NaiveFinder}(569:574, '+', 2)
ORF{NaiveFinder}(581:601, '+', 2)
ORF{NaiveFinder}(695:706, '+', 2)
12-element Vector{ORFI{4, NaiveFinder}}:
ORFI{NaiveFinder}(29:40, '+', 2)
ORFI{NaiveFinder}(137:145, '+', 2)
ORFI{NaiveFinder}(164:184, '+', 2)
ORFI{NaiveFinder}(173:184, '+', 2)
ORFI{NaiveFinder}(236:241, '+', 2)
ORFI{NaiveFinder}(248:268, '+', 2)
ORFI{NaiveFinder}(362:373, '+', 2)
ORFI{NaiveFinder}(470:496, '+', 2)
ORFI{NaiveFinder}(551:574, '+', 2)
ORFI{NaiveFinder}(569:574, '+', 2)
ORFI{NaiveFinder}(581:601, '+', 2)
ORFI{NaiveFinder}(695:706, '+', 2)
```

The `ORF` structure displays the location, frame, and strand, but currently does not include the sequence *per se*. To extract the sequence of an `ORF` instance, you can use the `sequence` method directly on it, or you can also broadcast it over the `orfs` collection using the dot syntax `.`:
The `ORFI` structure displays the location, frame, and strand, but currently does not include the sequence *per se*. To extract the sequence of an `ORFI` instance, you can use the `sequence` method directly on it, or you can also broadcast it over the `orfs` collection using the dot syntax `.`:

```julia
sequence.(orfs)
@@ -89,7 +89,7 @@ sequence.(orfs)
ATGCAACCCTGA
```
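
The broadcasting step can be pictured with a plain-Julia sketch. `ToyORF` and `toy_sequence` are hypothetical stand-ins used only for illustration; they assume the interval stores just a location and that the source sequence is passed explicitly, which is not necessarily how `ORFI` and `sequence` work internally.

```julia
# Hypothetical stand-in for an ORF interval: only a location on the '+' strand.
struct ToyORF
    location::UnitRange{Int}
end

# Extract the subsequence an interval points to (assumption: String source sequence).
toy_sequence(dnaseq::AbstractString, orf::ToyORF) =
    SubString(dnaseq, first(orf.location), last(orf.location))

dnaseq = "CCATGAAATAGATGTGA"
toyorfs = [ToyORF(3:11), ToyORF(12:17)]

# Ref(dnaseq) keeps the sequence fixed while the dot broadcasts over the collection.
toy_sequence.(Ref(dnaseq), toyorfs)   # ["ATGAAATAG", "ATGTGA"]
```
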

Similarly, you can extract the amino acid sequences of the ORFs using the `translate` function.
Similarly, you can extract the amino acid sequences of the ORFIs using the `translate` function.

```julia
translate.(orfs)
@@ -109,17 +109,17 @@ translate.(orfs)
MQP*
```

## Let's score the ORFs
## Let's score the ORFIs

ORF sequences can be scored using different schemes that evaluate them in a biological context. There are two ways to do this: by adding a scoring method to the finder algorithm or by using a scoring method after predicting the ORFs. The first approach is likely more efficient, but the second approach is more flexible. We will showcase the second approach in this example.
ORFI sequences can be scored using different schemes that evaluate them in a biological context. There are two ways to do this: by adding a scoring method to the finder algorithm or by using a scoring method after predicting the ORFIs. The first approach is likely more efficient, but the second approach is more flexible. We will showcase the second approach in this example.

A commonly used scoring scheme for ORFs is the *log-odds ratio* score. This score is based on the likelihood of a sequence belonging to a specific stochastic model, such as coding or non-coding. The [BioMarkovChains](https://github.com/camilogarciabotero/BioMarkovChains.jl) package provides a `log_odds_ratio_score` method (currently imported), also known as `lors`, which can be used to score ORFs using the log-odds ratio approach.
A commonly used scoring scheme for ORFIs is the *log-odds ratio* score. This score is based on the likelihood of a sequence belonging to a specific stochastic model, such as coding or non-coding. The [BioMarkovChains](https://github.com/camilogarciabotero/BioMarkovChains.jl) package provides a `log_odds_ratio_score` method (currently imported), also known as `lors`, which can be used to score ORFIs using the log-odds ratio approach.

```julia
orfs = findorfs(seq, finder=NaiveFinder)
```

The `lors` method has been overloaded to take an ORF object and can be used later to calculate the score of the ORFs.
The `lors` method has been overloaded to take an ORFI object and can be used later to calculate the score of the ORFIs.

```julia
lors.(orfs)
@@ -139,11 +139,11 @@ lors.(orfs)
0.469404606944017
```

We can extend basically any method that scores a `BioSequence` to score an `ORF` object. To see more about scoring ORFs, check out the [Scoring ORFs](https://camilogarciabotero.github.io/GeneFinder.jl/dev/features/) section in the documentation.
We can extend basically any method that scores a `BioSequence` to score an `ORFI` object. To see more about scoring ORFIs, check out the [Scoring ORFIs](https://camilogarciabotero.github.io/GeneFinder.jl/dev/features/) section in the documentation.

## Writing ORFs into bioinformatic formats
## Writing ORFIs into bioinformatic formats

`GeneFinder` also now facilitates the generation of `FASTA`, `BED`, and `GFF` files directly from the found ORFs. This feature is particularly useful for downstream analysis and visualization of the ORFs. To accomplish this, the package provides the following functions: `write_orfs_fna`, `write_orfs_faa`, `write_orfs_bed`, and `write_orfs_gff`.
`GeneFinder` also now facilitates the generation of `FASTA`, `BED`, and `GFF` files directly from the found ORFIs. This feature is particularly useful for downstream analysis and visualization of the ORFIs. To accomplish this, the package provides the following functions: `write_orfs_fna`, `write_orfs_faa`, `write_orfs_bed`, and `write_orfs_gff`.

Functionality:

@@ -165,7 +165,7 @@ using BioSequences, GeneFinder
seq = dna"AACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAACAGCACTGGCAATCTGACTGTGGGCGGTGTTACCAACGGCACTGCTACTACTGGCAACATCGCACTGACCGGTAACAATGCGCTGAGCGGTCCGGTCAATCTGAATGCGTCGAATGGCACGGTGACCTTGAACACGACCGGCAATACCACGCTCGGTAACGTGACGGCACAAGGCAATGTGACGACCAATGTGTCCAACGGCAGTCTGACGGTTACCGGCAATACGACAGGTGCCAACACCAACCTCAGTGCCAGCGGCAACCTGACCGTGGGTAACCAGGGCAATATCAGTACCGCAGGCAATGCAACCCTGACGGCCGGCGACAACCTGACGAGCACTGGCAATCTGACTGTGGGCGGCGTCACCAACGGCACGGCCACCACCGGCAACATCGCGCTGACCGGTAACAATGCACTGGCTGGTCCTGTCAATCTGAACGCGCCGAACGGCACCGTGACCCTGAACACAACCGGCAATACCACGCTGGGTAATGTCACCGCACAAGGCAATGTGACGACTAATGTGTCCAACGGCAGCCTGACAGTCGCTGGCAATACCACAGGTGCCAACACCAACCTGAGTGCCAGCGGCAATCTGACCGTGGGCAACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAGC"
```

Once a `BioSequence` object has been created, the `write_orfs_fna` function proves useful for generating a `FASTA` file containing the nucleotide sequences of the ORFs. Notably, the `write_orfs*` methods support either an `IOStream` or an `IOBuffer` as an output argument, allowing flexibility in directing the output either to a file or a buffer. In the following example, we demonstrate writing the output directly to a file.
Once a `BioSequence` object has been created, the `write_orfs_fna` function proves useful for generating a `FASTA` file containing the nucleotide sequences of the ORFIs. Notably, the `write_orfs*` methods support either an `IOStream` or an `IOBuffer` as an output argument, allowing flexibility in directing the output either to a file or a buffer. In the following example, we demonstrate writing the output directly to a file.

```julia
outfile = "LFLS01000089.fna"
@@ -204,4 +204,4 @@ ATGTGTCCAACGGCAGCCTGA
ATGCAACCCTGA
```
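
The point about accepting either an `IOStream` or an `IOBuffer` can be sketched in plain Julia by writing against the abstract `IO` type. `toy_write_fna` and its header layout are hypothetical and only illustrate the pattern; they are not the `write_orfs_fna` implementation or its output format.

```julia
# Hypothetical writer: accepts any IO, so the same code serves files and buffers.
function toy_write_fna(io::IO, seq::AbstractString, locations::Vector{UnitRange{Int}})
    for (i, loc) in enumerate(locations)
        # The header layout is an illustrative choice, not GeneFinder's format.
        println(io, ">orf", i, " start=", first(loc), " stop=", last(loc))
        println(io, seq[loc])
    end
    return nothing
end

dnaseq = "CCATGAAATAGATGTGA"
locs = [3:11, 12:17]

# In-memory target.
buf = IOBuffer()
toy_write_fna(buf, dnaseq, locs)
fasta_text = String(take!(buf))

# File target: `open` with a do-block closes the stream afterwards.
path = tempname() * ".fna"
open(path, "w") do io
    toy_write_fna(io, dnaseq, locs)
end
```

Writing to a buffer first is convenient in tests or pipelines, while the file form matches the example above.
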

A `FASTA` file with the nucleotide sequences of the ORFs can be written in the same way using the `write_orfs_fna` function. The `BED` and `GFF` files are written similarly, using the `write_orfs_bed` and `write_orfs_gff` functions respectively.
A `FASTA` file with the nucleotide sequences of the ORFIs can be written in the same way using the `write_orfs_fna` function. The `BED` and `GFF` files are written similarly, using the `write_orfs_bed` and `write_orfs_gff` functions respectively.
10 changes: 5 additions & 5 deletions docs/src/api.md
@@ -7,27 +7,27 @@ end

## The Main ORF type

The main type of the package is `ORF` which represents an Open Reading Frame.
The main type of the package is `ORFI` which represents an Open Reading Frame Interval. It is a subtype of the `GenomicInterval` type from the `GenomicFeatures` package.

```@autodocs
Modules = [GeneFinder]
Pages = ["types.jl"]
```

## Finding ORFs
## Finding ORFIs

The function `findorfs` is the main function of the package. It is a generic method that can handle different gene finding methods.
The function `findorfs` serves as a method interface, as it is a generic method that can handle different gene finding methods.
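
The "method interface" idea can be sketched in plain Julia as one entry point that forwards to whichever finder type is passed. Every name below (`ToyFinderMethod`, `ToyNaive`, `ToyLongest`, `toy_findorfs`) is hypothetical and operates on plain strings; the sketch only illustrates the dispatch pattern, under the assumption that finders are types made callable on a sequence.

```julia
# Hypothetical types illustrating "one entry point, pluggable finder".
abstract type ToyFinderMethod end
struct ToyNaive <: ToyFinderMethod end
struct ToyLongest <: ToyFinderMethod end

# Each finder type is made callable on a sequence through an outer method.
function ToyNaive(seq::AbstractString; minlen::Int=6)
    pattern = r"ATG(?:[ACGT]{3})*?T(?:AG|AA|GA)"
    locs = [m.offset:(m.offset + ncodeunits(m.match) - 1)
            for m in eachmatch(pattern, seq; overlap=true)]
    return filter(loc -> length(loc) >= minlen, locs)
end

# A second strategy: keep only the longest interval per stop position.
function ToyLongest(seq::AbstractString; kwargs...)
    best = Dict{Int,UnitRange{Int}}()
    for loc in ToyNaive(seq; kwargs...)
        if !haskey(best, last(loc)) || length(loc) > length(best[last(loc)])
            best[last(loc)] = loc
        end
    end
    return sort!(collect(values(best)); by=first)
end

# The generic entry point forwards to the chosen finder along with its keywords.
toy_findorfs(seq; finder::Type{<:ToyFinderMethod}=ToyNaive, kwargs...) =
    finder(seq; kwargs...)

toy_findorfs("ATGAAAATGCCCTAG")                      # [1:15, 7:15]
toy_findorfs("ATGAAAATGCCCTAG"; finder=ToyLongest)   # [1:15]
```

With this layout, adding a new algorithm means defining a new subtype and its callable method, without touching the entry point.
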

```@autodocs
Modules = [GeneFinder]
Pages = ["findorfs.jl"]
```

## Finding ORFs using BioRegex and scoring
## Finding ORFs using BioRegex

```@autodocs
Modules = [GeneFinder]
Pages = ["algorithms/naivefinder.jl"]
Pages = ["algorithms/naivefinder.jl", "algorithms/naivecollector.jl"]
```

## Writing ORFs to files
70 changes: 35 additions & 35 deletions docs/src/features.md
@@ -1,6 +1,6 @@
## Scoring ORFs

The `ORF` type is designed to be flexible and can store various types of information about the ORF. A very neat feature is that it stores a view of the sequence that the ORF represents. Since the ORF sequence is stored in the struct, any method that can be applied to a sequence can be applied to the ORF sequence. This is useful for scoring the ORFs by overloading a method that calculates a score for a `BioSequence`. For instance, the `lors` function from the [BioMarkovChains.jl](https://camilogarciabotero.github.io/BioMarkovChains.jl/dev/) package can be used to calculate a score of the ORFs predicted for the phi genome.
The `ORFI` type is designed to be flexible and can store various types of information about ORFs. A very neat feature is that it stores a view of the sequence that the ORF represents. Since the ORFI sequence is stored in the struct, any method that can be applied to a sequence can be applied to the ORF sequence. This is useful for scoring the ORFs by overloading a method that calculates a score for a `BioSequence`. For instance, the `lors` function from the [BioMarkovChains.jl](https://camilogarciabotero.github.io/BioMarkovChains.jl/dev/) package can be used to calculate a score of the ORFs predicted for the phi genome.

Take the following example:

@@ -9,38 +9,38 @@ phi = dna"GTGTGAGGTTATAACGCCGAAGCGGTAAAAATTTTAATTTTTGCCGCTGAGGGGTTGACCAAGCGAAGCG

phiorfs = findorfs(phi, finder=NaiveFinder, minlen=75)

124-element Vector{ORF{4, NaiveFinder}}:
ORF{NaiveFinder}(9:101, '-', 3)
ORF{NaiveFinder}(100:627, '+', 1)
ORF{NaiveFinder}(223:447, '-', 1)
ORF{NaiveFinder}(248:436, '+', 2)
ORF{NaiveFinder}(257:436, '+', 2)
ORF{NaiveFinder}(283:627, '+', 1)
ORF{NaiveFinder}(344:436, '+', 2)
ORF{NaiveFinder}(532:627, '+', 1)
ORF{NaiveFinder}(636:1622, '+', 3)
ORF{NaiveFinder}(687:1622, '+', 3)
ORF{NaiveFinder}(774:1622, '+', 3)
ORF{NaiveFinder}(781:1389, '+', 1)
ORF{NaiveFinder}(814:1389, '+', 1)
ORF{NaiveFinder}(829:1389, '+', 1)
ORF{NaiveFinder}(861:1622, '+', 3)
124-element Vector{ORFI{4, NaiveFinder}}:
ORFI{NaiveFinder}(9:101, '-', 3)
ORFI{NaiveFinder}(100:627, '+', 1)
ORFI{NaiveFinder}(223:447, '-', 1)
ORFI{NaiveFinder}(248:436, '+', 2)
ORFI{NaiveFinder}(257:436, '+', 2)
ORFI{NaiveFinder}(283:627, '+', 1)
ORFI{NaiveFinder}(344:436, '+', 2)
ORFI{NaiveFinder}(532:627, '+', 1)
ORFI{NaiveFinder}(636:1622, '+', 3)
ORFI{NaiveFinder}(687:1622, '+', 3)
ORFI{NaiveFinder}(774:1622, '+', 3)
ORFI{NaiveFinder}(781:1389, '+', 1)
ORFI{NaiveFinder}(814:1389, '+', 1)
ORFI{NaiveFinder}(829:1389, '+', 1)
ORFI{NaiveFinder}(861:1622, '+', 3)
ORF{NaiveFinder}(4671:5375, '+', 3)
ORF{NaiveFinder}(4690:4866, '+', 1)
ORF{NaiveFinder}(4728:5375, '+', 3)
ORF{NaiveFinder}(4741:4866, '+', 1)
ORF{NaiveFinder}(4744:4866, '+', 1)
ORF{NaiveFinder}(4777:4866, '+', 1)
ORF{NaiveFinder}(4806:5375, '+', 3)
ORF{NaiveFinder}(4863:5258, '-', 3)
ORF{NaiveFinder}(4933:5019, '+', 1)
ORF{NaiveFinder}(4941:5375, '+', 3)
ORF{NaiveFinder}(5082:5375, '+', 3)
ORF{NaiveFinder}(5089:5325, '+', 1)
ORF{NaiveFinder}(5122:5202, '-', 1)
ORF{NaiveFinder}(5152:5325, '+', 1)
ORF{NaiveFinder}(5164:5325, '+', 1)
ORFI{NaiveFinder}(4671:5375, '+', 3)
ORFI{NaiveFinder}(4690:4866, '+', 1)
ORFI{NaiveFinder}(4728:5375, '+', 3)
ORFI{NaiveFinder}(4741:4866, '+', 1)
ORFI{NaiveFinder}(4744:4866, '+', 1)
ORFI{NaiveFinder}(4777:4866, '+', 1)
ORFI{NaiveFinder}(4806:5375, '+', 3)
ORFI{NaiveFinder}(4863:5258, '-', 3)
ORFI{NaiveFinder}(4933:5019, '+', 1)
ORFI{NaiveFinder}(4941:5375, '+', 3)
ORFI{NaiveFinder}(5082:5375, '+', 3)
ORFI{NaiveFinder}(5089:5325, '+', 1)
ORFI{NaiveFinder}(5122:5202, '-', 1)
ORFI{NaiveFinder}(5152:5325, '+', 1)
ORFI{NaiveFinder}(5164:5325, '+', 1)
```

We can now calculate a score using the `lors` (`log_odds_ratio_score`) scoring scheme (see [lors](https://github.com/camilogarciabotero/BioMarkovChains.jl/blob/533e53d97cf5951f1ca050454bce1423ec8d7c36/src/transitions.jl#L179) from the [BioMarkovChains.jl](https://camilogarciabotero.github.io/BioMarkovChains.jl/dev/) package).
@@ -96,9 +96,9 @@ In the `lors` case, the two models are the coding and non-coding models of the *

## Analysing Lambda ORFs

As mentioned above, `lors` calculates the log odds ratio of the ORF sequence given two Markov models (by default: [ECOLICDS](https://github.com/camilogarciabotero/BioMarkovChains.jl/blob/533e53d97cf5951f1ca050454bce1423ec8d7c36/src/models.jl#L3) and [ECOLINOCDS](https://github.com/camilogarciabotero/BioMarkovChains.jl/blob/533e53d97cf5951f1ca050454bce1423ec8d7c36/src/models.jl#L16)), one for the coding region and one for the non-coding region. By default, the `lors` function returns the base 2 logarithm of the odds ratio, so it is analogous to the bits of information that the ORF sequence is coding.
As mentioned above, `lors` calculates the log odds ratio of the ORFI sequence given two Markov models (by default: [ECOLICDS](https://github.com/camilogarciabotero/BioMarkovChains.jl/blob/533e53d97cf5951f1ca050454bce1423ec8d7c36/src/models.jl#L3) and [ECOLINOCDS](https://github.com/camilogarciabotero/BioMarkovChains.jl/blob/533e53d97cf5951f1ca050454bce1423ec8d7c36/src/models.jl#L16)), one for the coding region and one for the non-coding region. By default, the `lors` function returns the base 2 logarithm of the odds ratio, so it is analogous to the bits of information that the ORFI sequence is coding.
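
The base 2 log-odds idea can be illustrated with a toy first-order Markov chain in plain Julia. Everything here is an assumption made for illustration: `ToyChain`, `toy_logprob` and `toy_lors` are hypothetical names, the probabilities are made up rather than taken from `ECOLICDS` or `ECOLINOCDS`, and the package's own `lors` may differ in details such as normalization.

```julia
# Toy first-order Markov model: initial probabilities plus a transition matrix.
# Rows are the current base, columns the next base, both ordered A, C, G, T.
struct ToyChain
    init::Vector{Float64}
    tpm::Matrix{Float64}
end

base_index(b::Char) = findfirst(==(b), "ACGT")

# log2 probability of a sequence under one chain.
function toy_logprob(seq::AbstractString, model::ToyChain)
    lp = log2(model.init[base_index(seq[1])])
    for i in 2:lastindex(seq)
        lp += log2(model.tpm[base_index(seq[i-1]), base_index(seq[i])])
    end
    return lp
end

# Log-odds ratio in bits: positive values favour the first ("coding") model.
toy_lors(seq, coding::ToyChain, noncoding::ToyChain) =
    toy_logprob(seq, coding) - toy_logprob(seq, noncoding)

# Made-up numbers, chosen only so the arithmetic is easy to follow.
toy_coding = ToyChain(fill(0.25, 4),
                      [0.2  0.2  0.1  0.5;
                       0.25 0.25 0.25 0.25;
                       0.25 0.25 0.25 0.25;
                       0.2  0.2  0.5  0.1])
toy_noncoding = ToyChain(fill(0.25, 4), fill(0.25, 4, 4))

# (0.25 * 0.5 * 0.5) / (0.25 * 0.25 * 0.25) = 4, i.e. 2 bits
toy_lors("ATG", toy_coding, toy_noncoding)
```

A score of 0 means both models explain the sequence equally well; each additional bit doubles the odds in favour of the coding model.
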

Now we can even analyse how the ORFs' scores are distributed as a function of their lengths, compared to random sequences.
Now we can even analyse how the ORFIs' scores are distributed as a function of their lengths, compared to random sequences.

```julia
using FASTX, CairoMakie
@@ -154,4 +154,4 @@ f

![](assets/lors-lambda.png)

What this plot shows is that the ORFs in the lambda genome have higher scores than random sequences of the same length. The score measures how likely a sequence is under the coding model compared to the non-coding model. In other words, the higher the score, the more likely the sequence is coding. So, the plot shows that the ORFs in the lambda genome are more likely to be coding regions than random sequences. It also shows that the longer the ORF, the higher the score, which is expected since longer ORFs are more likely to be coding regions than shorter ones.
What this plot shows is that the ORFs in the lambda genome have higher scores than random sequences of the same length. The score measures how likely a sequence is under the coding model compared to the non-coding model. In other words, the higher the score, the more likely the sequence is coding. So, the plot shows that the ORFs in the lambda genome are more likely to be coding regions than random sequences. It also shows that the longer the ORFI, the higher the score, which is expected since longer ORFs are more likely to be coding regions than shorter ones.
4 changes: 2 additions & 2 deletions docs/src/index.md
@@ -62,8 +62,8 @@ add GeneFinder
author = {Camilo García},
title = {GeneFinder.jl},
url = {https://github.com/camilogarciabotero/GeneFinder.jl},
version = {v0.3.0},
version = {v0.6.0},
year = {2024},
month = {04}
month = {08}
}
```