Merge pull request #36 from camilogarciabotero/orf-rehaul

Rename ORF to ORFI
camilogarciabotero · Aug 9, 2024 · 4d71ffc
2 parents 21ac919 + e872ca3
commit 4d71ffc
Showing 21 changed files with 396 additions and 338 deletions.
diff --git a/.vscode/settings.json b/.vscode/settings.json
@@ -0,0 +1 @@
+{}
diff --git a/README.md b/README.md
@@ -24,7 +24,7 @@
 
 >This is a species-agnostic, algorithm extensible, sequence-anonymous (genome, metagenomes) *gene finder* library framework for the Julia Language.
 
-The `GeneFinder` package aims to be a versatile module that enables the application of different gene finding algorithms to the `BioSequence` type, by providing a common interface and a flexible data structure to store the predicted ORF or genes. The package is designed to be easily extensible, allowing users to implement their own algorithms and integrate them into the framework.
+The `GeneFinder` package aims to be a versatile module that enables the application of different gene finding algorithms to the `BioSequence` type, by providing a common interface and a flexible data structure to store the predicted ORFI or genes. The package is designed to be easily extensible, allowing users to implement their own algorithms and integrate them into the framework.
 
 > [!WARNING] 
   This package is currently under development and is not yet ready for production use. The API is subject to change.
@@ -37,12 +37,12 @@ You can install `GeneFinder` from the julia REPL. Press `]` to enter pkg mode, a
 add GeneFinder
 ```
 
-## Finding complete and overlapped ORFs
+## Finding complete and overlapped ORFIs
 
-The main package function is `findorfs`. Under the hood, the `findorfs` function is an interface for different gene finding algorithms that can be plugged using the `finder` keyword argument. By default it uses the `NaiveFinder` algorithm, which is a simple algorithm that finds all (non-outbounded) ORFs in a DNA sequence (see the [NaiveFinder](https://camilogarciabotero.github.io/GeneFinder.jl/dev/api/#GeneFinder.NaiveFinder-Union{Tuple{Union{BioSequences.LongDNA{N},%20BioSequences.LongSubSeq{BioSequences.DNAAlphabet{N}}}},%20Tuple{N}}%20where%20N) documentation for more details).
+The main package function is `findorfs`. Under the hood, the `findorfs` function is an interface for different gene finding algorithms that can be plugged using the `finder` keyword argument. By default it uses the `NaiveFinder` algorithm, which is a simple algorithm that finds all (non-outbounded) ORFIs in a DNA sequence (see the [NaiveFinder](https://camilogarciabotero.github.io/GeneFinder.jl/dev/api/#GeneFinder.NaiveFinder-Union{Tuple{Union{BioSequences.LongDNA{N},%20BioSequences.LongSubSeq{BioSequences.DNAAlphabet{N}}}},%20Tuple{N}}%20where%20N) documentation for more details).
 
 > [!NOTE] 
-  The `minlen` kwarg in the `NaiveFinder` method has been set to 6nt, so it will catch random ORFs that are not necessarily genes; thus it might consider `dna"ATGTGA"` -> `aa"M*"` as a plausible ORF.
+  The `minlen` kwarg in the `NaiveFinder` method has been set to 6nt, so it will catch random ORFIs that are not necessarily genes; thus it might consider `dna"ATGTGA"` -> `aa"M*"` as a plausible ORFI.
 
 Here is an example of how to use the `findorfs` function with the `NaiveFinder` algorithm:
 
@@ -54,22 +54,22 @@ seq = dna"AACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAACAGCACTGGCA
 
 orfs = findorfs(seq, finder=NaiveFinder) # use finder=NaiveCollector as an alternative
 
-12-element Vector{ORF{4, NaiveFinder}}:
- ORF{NaiveFinder}(29:40, '+', 2)
- ORF{NaiveFinder}(137:145, '+', 2)
- ORF{NaiveFinder}(164:184, '+', 2)
- ORF{NaiveFinder}(173:184, '+', 2)
- ORF{NaiveFinder}(236:241, '+', 2)
- ORF{NaiveFinder}(248:268, '+', 2)
- ORF{NaiveFinder}(362:373, '+', 2)
- ORF{NaiveFinder}(470:496, '+', 2)
- ORF{NaiveFinder}(551:574, '+', 2)
- ORF{NaiveFinder}(569:574, '+', 2)
- ORF{NaiveFinder}(581:601, '+', 2)
- ORF{NaiveFinder}(695:706, '+', 2)
+12-element Vector{ORFI{4, NaiveFinder}}:
+ ORFI{NaiveFinder}(29:40, '+', 2)
+ ORFI{NaiveFinder}(137:145, '+', 2)
+ ORFI{NaiveFinder}(164:184, '+', 2)
+ ORFI{NaiveFinder}(173:184, '+', 2)
+ ORFI{NaiveFinder}(236:241, '+', 2)
+ ORFI{NaiveFinder}(248:268, '+', 2)
+ ORFI{NaiveFinder}(362:373, '+', 2)
+ ORFI{NaiveFinder}(470:496, '+', 2)
+ ORFI{NaiveFinder}(551:574, '+', 2)
+ ORFI{NaiveFinder}(569:574, '+', 2)
+ ORFI{NaiveFinder}(581:601, '+', 2)
+ ORFI{NaiveFinder}(695:706, '+', 2)
 ```
 
-The `ORF` structure displays the location, frame, and strand, but currently does not include the sequence *per se*. To extract the sequence of an `ORF` instance, you can use the `sequence` method directly on it, or you can also broadcast it over the `orfs` collection using the dot syntax `.`:
+The `ORFI` structure displays the location, frame, and strand, but currently does not include the sequence *per se*. To extract the sequence of an `ORFI` instance, you can use the `sequence` method directly on it, or you can also broadcast it over the `orfs` collection using the dot syntax `.`:
 
 ```julia
 sequence.(orfs)
@@ -89,7 +89,7 @@ sequence.(orfs)
  ATGCAACCCTGA
 ```
 
-Similarly, you can extract the amino acid sequences of the ORFs using the `translate` function.
+Similarly, you can extract the amino acid sequences of the ORFIs using the `translate` function.
 
 ```julia
 translate.(orfs)
@@ -109,17 +109,17 @@ translate.(orfs)
  MQP*
 ```
 
-## Let's score the ORFs
+## Let's score the ORFIs
 
-ORF sequences can be scored using different schemes that evaluate them in a biological context. There are two ways to do this: by adding a scoring method to the finder algorithm or by using a scoring method after predicting the ORFs. The first approach is likely more efficient, but the second approach is more flexible. We will showcase the second approach in this example.
+ORFI sequences can be scored using different schemes that evaluate them in a biological context. There are two ways to do this: by adding a scoring method to the finder algorithm or by using a scoring method after predicting the ORFIs. The first approach is likely more efficient, but the second approach is more flexible. We will showcase the second approach in this example.
 
-A commonly used scoring scheme for ORFs is the *log-odds ratio* score. This score is based on the likelihood of a sequence belonging to a specific stochastic model, such as coding or non-coding. The [BioMarkovChains](https://github.com/camilogarciabotero/BioMarkovChains.jl) package provides a `log_odds_ratio_score` method (currently imported), also known as `lors`, which can be used to score ORFs using the log-odds ratio approach.
+A commonly used scoring scheme for ORFIs is the *log-odds ratio* score. This score is based on the likelihood of a sequence belonging to a specific stochastic model, such as coding or non-coding. The [BioMarkovChains](https://github.com/camilogarciabotero/BioMarkovChains.jl) package provides a `log_odds_ratio_score` method (currently imported), also known as `lors`, which can be used to score ORFIs using the log-odds ratio approach.
 
 ```julia
 orfs = findorfs(seq, finder=NaiveFinder)
 ```
 
-The `lors` method has been overloaded to take an ORF object and can be used later to calculate the score of the ORFs.
+The `lors` method has been overloaded to take an ORFI object and can be used later to calculate the score of the ORFIs.
 
 ```julia
 lors.(orfs)
@@ -139,11 +139,11 @@ lors.(orfs)
  0.469404606944017
 ```
 
-We can extend basically any method that scores a `BioSequence` to score an `ORF` object. To see more about scoring ORFs, check out the [Scoring ORFs](https://camilogarciabotero.github.io/GeneFinder.jl/dev/features/) section in the documentation.
+We can extend basically any method that scores a `BioSequence` to score an `ORFI` object. To see more about scoring ORFIs, check out the [Scoring ORFIs](https://camilogarciabotero.github.io/GeneFinder.jl/dev/features/) section in the documentation.
 
-## Writing ORFs into bioinformatic formats
+## Writing ORFIs into bioinformatic formats
 
-`GeneFinder` also now facilitates the generation of `FASTA`, `BED`, and `GFF` files directly from the found ORFs. This feature is particularly useful for downstream analysis and visualization of the ORFs. To accomplish this, the package provides the following functions: `write_orfs_fna`, `write_orfs_faa`, `write_orfs_bed`, and `write_orfs_gff`.
+`GeneFinder` also now facilitates the generation of `FASTA`, `BED`, and `GFF` files directly from the found ORFIs. This feature is particularly useful for downstream analysis and visualization of the ORFIs. To accomplish this, the package provides the following functions: `write_orfs_fna`, `write_orfs_faa`, `write_orfs_bed`, and `write_orfs_gff`.
 
 Functionality:
 
@@ -165,7 +165,7 @@ using BioSequences, GeneFinder
 seq = dna"AACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAACAGCACTGGCAATCTGACTGTGGGCGGTGTTACCAACGGCACTGCTACTACTGGCAACATCGCACTGACCGGTAACAATGCGCTGAGCGGTCCGGTCAATCTGAATGCGTCGAATGGCACGGTGACCTTGAACACGACCGGCAATACCACGCTCGGTAACGTGACGGCACAAGGCAATGTGACGACCAATGTGTCCAACGGCAGTCTGACGGTTACCGGCAATACGACAGGTGCCAACACCAACCTCAGTGCCAGCGGCAACCTGACCGTGGGTAACCAGGGCAATATCAGTACCGCAGGCAATGCAACCCTGACGGCCGGCGACAACCTGACGAGCACTGGCAATCTGACTGTGGGCGGCGTCACCAACGGCACGGCCACCACCGGCAACATCGCGCTGACCGGTAACAATGCACTGGCTGGTCCTGTCAATCTGAACGCGCCGAACGGCACCGTGACCCTGAACACAACCGGCAATACCACGCTGGGTAATGTCACCGCACAAGGCAATGTGACGACTAATGTGTCCAACGGCAGCCTGACAGTCGCTGGCAATACCACAGGTGCCAACACCAACCTGAGTGCCAGCGGCAATCTGACCGTGGGCAACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAGC"
 ```
 
-Once a `BioSequence` object has been created, the `write_orfs_fna` function proves useful for generating a `FASTA` file containing the nucleotide sequences of the ORFs. Notably, the `write_orfs*` methods support either an `IOStream` or an `IOBuffer` as an output argument, allowing flexibility in directing the output either to a file or a buffer. In the following example, we demonstrate writing the output directly to a file.
+Once a `BioSequence` object has been created, the `write_orfs_fna` function proves useful for generating a `FASTA` file containing the nucleotide sequences of the ORFIs. Notably, the `write_orfs*` methods support either an `IOStream` or an `IOBuffer` as an output argument, allowing flexibility in directing the output either to a file or a buffer. In the following example, we demonstrate writing the output directly to a file.
 
 ```julia
 outfile = "LFLS01000089.fna"
@@ -204,4 +204,4 @@ ATGTGTCCAACGGCAGCCTGA
 ATGCAACCCTGA
 ```
 
-This could also be done to write a `FASTA` file with the nucleotide sequences of the ORFs using the `write_orfs_fna` function. The same applies to `BED` and `GFF` files, using the `write_orfs_bed` and `write_orfs_gff` functions, respectively.
+This could also be done to write a `FASTA` file with the nucleotide sequences of the ORFIs using the `write_orfs_fna` function. The same applies to `BED` and `GFF` files, using the `write_orfs_bed` and `write_orfs_gff` functions, respectively.
diff --git a/docs/src/api.md b/docs/src/api.md
@@ -7,27 +7,27 @@ end
 
 ## The Main ORF type
 
-The main type of the package is `ORF` which represents an Open Reading Frame.
+The main type of the package is `ORFI` which represents an Open Reading Frame Interval. It is a subtype of the `GenomicInterval` type from the `GenomicFeatures` package.
 
 ```@autodocs
 Modules = [GeneFinder]
 Pages = ["types.jl"]
 ```
 
-## Finding ORFs
+## Finding ORFIs
 
-The function `findorfs` is the main function of the package. It is a generic method that can handle different gene finding methods.
+The function `findorfs` serves as a method interface: it is a generic method that can handle different gene finding methods.
 
 ```@autodocs
 Modules = [GeneFinder]
 Pages = ["findorfs.jl"]
 ```
 
-## Finding ORFs using BioRegex and scoring
+## Finding ORFs using BioRegex
 
 ```@autodocs
 Modules = [GeneFinder]
-Pages = ["algorithms/naivefinder.jl"]
+Pages = ["algorithms/naivefinder.jl", "algorithms/naivecollector.jl"]
 ```
 
 ## Writing ORFs to files

diff --git a/docs/src/features.md b/docs/src/features.md
@@ -1,6 +1,6 @@
 ## Scoring ORFs
 
-The `ORF` type is designed to be flexible and can store various types of information about the ORF. A very neat feature is that it stores a view of the sequence that the ORF represents. Since the ORF sequence is stored in the struct, any method that can be applied to a sequence can be applied to the ORF sequence. This is useful for scoring the ORFs by overloading a method that calculates a score for a `BioSequence`. For instance, the `lors` function from the [BioMarkovChains.jl](https://camilogarciabotero.github.io/BioMarkovChains.jl/dev/) package can be used to calculate a score of the ORFs predicted for the phi genome.
+The `ORFI` type is designed to be flexible and can store various types of information about ORFs. A very neat feature is that it stores a view of the sequence that the ORF represents. Since the ORFI sequence is stored in the struct, any method that can be applied to a sequence can be applied to the ORF sequence. This is useful for scoring the ORFs by overloading a method that calculates a score for a `BioSequence`. For instance, the `lors` function from the [BioMarkovChains.jl](https://camilogarciabotero.github.io/BioMarkovChains.jl/dev/) package can be used to calculate a score of the ORFs predicted for the phi genome.
 
 Take the following example:
 
@@ -9,38 +9,38 @@ phi = dna"GTGTGAGGTTATAACGCCGAAGCGGTAAAAATTTTAATTTTTGCCGCTGAGGGGTTGACCAAGCGAAGCG
 
 phiorfs = findorfs(phi, finder=NaiveFinder, minlen=75)
 
-124-element Vector{ORF{4, NaiveFinder}}:
- ORF{NaiveFinder}(9:101, '-', 3)
- ORF{NaiveFinder}(100:627, '+', 1)
- ORF{NaiveFinder}(223:447, '-', 1)
- ORF{NaiveFinder}(248:436, '+', 2)
- ORF{NaiveFinder}(257:436, '+', 2)
- ORF{NaiveFinder}(283:627, '+', 1)
- ORF{NaiveFinder}(344:436, '+', 2)
- ORF{NaiveFinder}(532:627, '+', 1)
- ORF{NaiveFinder}(636:1622, '+', 3)
- ORF{NaiveFinder}(687:1622, '+', 3)
- ORF{NaiveFinder}(774:1622, '+', 3)
- ORF{NaiveFinder}(781:1389, '+', 1)
- ORF{NaiveFinder}(814:1389, '+', 1)
- ORF{NaiveFinder}(829:1389, '+', 1)
- ORF{NaiveFinder}(861:1622, '+', 3)
+124-element Vector{ORFI{4, NaiveFinder}}:
+ ORFI{NaiveFinder}(9:101, '-', 3)
+ ORFI{NaiveFinder}(100:627, '+', 1)
+ ORFI{NaiveFinder}(223:447, '-', 1)
+ ORFI{NaiveFinder}(248:436, '+', 2)
+ ORFI{NaiveFinder}(257:436, '+', 2)
+ ORFI{NaiveFinder}(283:627, '+', 1)
+ ORFI{NaiveFinder}(344:436, '+', 2)
+ ORFI{NaiveFinder}(532:627, '+', 1)
+ ORFI{NaiveFinder}(636:1622, '+', 3)
+ ORFI{NaiveFinder}(687:1622, '+', 3)
+ ORFI{NaiveFinder}(774:1622, '+', 3)
+ ORFI{NaiveFinder}(781:1389, '+', 1)
+ ORFI{NaiveFinder}(814:1389, '+', 1)
+ ORFI{NaiveFinder}(829:1389, '+', 1)
+ ORFI{NaiveFinder}(861:1622, '+', 3)
  ⋮
- ORF{NaiveFinder}(4671:5375, '+', 3)
- ORF{NaiveFinder}(4690:4866, '+', 1)
- ORF{NaiveFinder}(4728:5375, '+', 3)
- ORF{NaiveFinder}(4741:4866, '+', 1)
- ORF{NaiveFinder}(4744:4866, '+', 1)
- ORF{NaiveFinder}(4777:4866, '+', 1)
- ORF{NaiveFinder}(4806:5375, '+', 3)
- ORF{NaiveFinder}(4863:5258, '-', 3)
- ORF{NaiveFinder}(4933:5019, '+', 1)
- ORF{NaiveFinder}(4941:5375, '+', 3)
- ORF{NaiveFinder}(5082:5375, '+', 3)
- ORF{NaiveFinder}(5089:5325, '+', 1)
- ORF{NaiveFinder}(5122:5202, '-', 1)
- ORF{NaiveFinder}(5152:5325, '+', 1)
- ORF{NaiveFinder}(5164:5325, '+', 1)
+ ORFI{NaiveFinder}(4671:5375, '+', 3)
+ ORFI{NaiveFinder}(4690:4866, '+', 1)
+ ORFI{NaiveFinder}(4728:5375, '+', 3)
+ ORFI{NaiveFinder}(4741:4866, '+', 1)
+ ORFI{NaiveFinder}(4744:4866, '+', 1)
+ ORFI{NaiveFinder}(4777:4866, '+', 1)
+ ORFI{NaiveFinder}(4806:5375, '+', 3)
+ ORFI{NaiveFinder}(4863:5258, '-', 3)
+ ORFI{NaiveFinder}(4933:5019, '+', 1)
+ ORFI{NaiveFinder}(4941:5375, '+', 3)
+ ORFI{NaiveFinder}(5082:5375, '+', 3)
+ ORFI{NaiveFinder}(5089:5325, '+', 1)
+ ORFI{NaiveFinder}(5122:5202, '-', 1)
+ ORFI{NaiveFinder}(5152:5325, '+', 1)
+ ORFI{NaiveFinder}(5164:5325, '+', 1)
 ```
 
 We can now calculate a score using the `lors` (`log_odds_ratio_score`) scoring scheme (see [lors](https://github.com/camilogarciabotero/BioMarkovChains.jl/blob/533e53d97cf5951f1ca050454bce1423ec8d7c36/src/transitions.jl#L179) from the [BioMarkovChains.jl](https://camilogarciabotero.github.io/BioMarkovChains.jl/dev/) package).
@@ -96,9 +96,9 @@ In the `lors` case, the two models are the coding and non-coding models of the *
 
 ## Analysing Lambda ORFs
 
-As mentioned above, `lors` calculates the log odds ratio of the ORF sequence given two Markov models (by default: [ECOLICDS](https://github.com/camilogarciabotero/BioMarkovChains.jl/blob/533e53d97cf5951f1ca050454bce1423ec8d7c36/src/models.jl#L3) and [ECOLINOCDS](https://github.com/camilogarciabotero/BioMarkovChains.jl/blob/533e53d97cf5951f1ca050454bce1423ec8d7c36/src/models.jl#L16)), one for the coding region and one for the non-coding region. By default the `lors` function returns the base 2 logarithm of the odds ratio, so it is analogous to the bits of information that the ORF sequence is coding.
+As mentioned above, `lors` calculates the log odds ratio of the ORFI sequence given two Markov models (by default: [ECOLICDS](https://github.com/camilogarciabotero/BioMarkovChains.jl/blob/533e53d97cf5951f1ca050454bce1423ec8d7c36/src/models.jl#L3) and [ECOLINOCDS](https://github.com/camilogarciabotero/BioMarkovChains.jl/blob/533e53d97cf5951f1ca050454bce1423ec8d7c36/src/models.jl#L16)), one for the coding region and one for the non-coding region. By default the `lors` function returns the base 2 logarithm of the odds ratio, so it is analogous to the bits of information that the ORFI sequence is coding.
 
-Now we can even analyse how the ORFs' scores are distributed as a function of their lengths, compared to random sequences.
+Now we can even analyse how the ORFIs' scores are distributed as a function of their lengths, compared to random sequences.
 
 ```julia
 using FASTX, CairoMakie
@@ -154,4 +154,4 @@ f
 
 ![](assets/lors-lambda.png)
 
-What this plot shows is that the ORFs in the lambda genome have higher scores than random sequences of the same length. The score is a measure of how likely a sequence is under the coding model compared to the non-coding model. In other words, the higher the score the more likely the sequence is coding. So, the plot shows that the ORFs in the lambda genome are more likely to be coding regions than random sequences. It also shows that the longer the ORF the higher the score, which is expected since longer ORFs are more likely to be coding regions than shorter ones.
+What this plot shows is that the ORFs in the lambda genome have higher scores than random sequences of the same length. The score is a measure of how likely a sequence is under the coding model compared to the non-coding model. In other words, the higher the score the more likely the sequence is coding. So, the plot shows that the ORFs in the lambda genome are more likely to be coding regions than random sequences. It also shows that the longer the ORFI the higher the score, which is expected since longer ORFs are more likely to be coding regions than shorter ones.
diff --git a/docs/src/index.md b/docs/src/index.md
@@ -62,8 +62,8 @@ add GeneFinder
 	author  = {Camilo García},
 	title   = {GeneFinder.jl},
 	url     = {https://github.com/camilogarciabotero/GeneFinder.jl},
-	version = {v0.3.0},
+	version = {v0.6.0},
 	year    = {2024},
-	month   = {04}
+	month   = {08}
 }
 ```