Commit f32900e

Update to yacrd 0.6

Author: Pierre Marijon
1 parent 08271fd

18 files changed: +307 -359 lines changed

Cargo.lock

Lines changed: 223 additions & 190 deletions
Generated file, diff not rendered.

Cargo.toml

Lines changed: 6 additions & 9 deletions
@@ -2,6 +2,7 @@
 name = "yacrd"
 version = "0.6.0"
 authors = ["Pierre Marijon <[email protected]>"]
+edition = '2018'

 exclude = ["image/*", "tests/*"]

@@ -10,30 +11,26 @@ homepage = "https://github.com/natir/yacrd"
 repository = "https://github.com/natir/yacrd"
 readme = "Readme.md"
 license = "MIT"
-keywords = ["bioinformatics", "chimera", "long-read"]
+keywords = ["bioinformatics", "chimera", "long-read", "scrubbing"]

 [badges]
 travis-ci = { repository = "natir/yacrd", branch = "master" }

 [dependencies]
 bio = "0.30"
-csv = "1"
-log = "0.4.0"
+csv = "1.1"
+log = "0.4"
 anyhow = "1.0"
 niffler = {git = "https://github.com/luizirber/niffler/", branch = "api_1.0"}
 thiserror = "1.0"
 structopt = "0.3"
 env_logger = "0.7"
-lazy_static = "1.0"
-serde_derive = "1.0"
-enum_primitive = "0.1.1"
-

 [dev-dependencies]
-tempfile = "3"
+tempfile = "3.1"

 [profile.release]
-debug = true # uncomment for proffiling
+# debug = true # uncomment for proffiling
 lto = 'thin'
 opt-level = 3
 overflow-checks = false

Readme.md

Lines changed: 17 additions & 23 deletions
@@ -1,11 +1,5 @@
-# README IN BETA JUMP TO 0.5.1 TAG
-
 # Yet Another Chimeric Read Detector for long reads

-[![build-status]][github-actions]
-
-![yacrd pipeline presentation](image/pipeline.svg)
-
 Using all-against-all read mapping, yacrd performs:

 1. computation of pile-up coverage for each read
@@ -16,7 +10,7 @@ Chimera detection is done as follows:
 1. for each region where coverage is smaller or equal than `min_coverage` (default 0), yacrd creates a _bad region_.
 2. if there is a _bad region_ that starts at a position strictly after the beginning of the read and ends strictly before the end of the read, the read is marked as `Chimeric`
 3. if total _bad region_ length > 0.8 * read length, the read is marked as `NotCovered`
-4. if read isn't `Chimeric` or `NotCovered` is `NotBad`
+4. if a read isn't `Chimeric` or `NotCovered` is `NotBad`

 ## Rationale
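Review note: the classification rules listed in this hunk can be sketched in Rust as below. This is an illustration under stated assumptions, not yacrd's actual code. The variant names mirror the labels in the README; the `classify` function, the `(begin, end)` representation of bad regions, and the choice to test `NotCovered` before `Chimeric` are hypothetical, since the README does not say which rule wins when both apply.

```rust
/// Labels taken from the README; the enum itself is illustrative.
#[derive(Debug, PartialEq)]
enum ReadType {
    Chimeric,
    NotCovered,
    NotBad,
}

/// Classify a read from its bad regions, given as (begin, end) pairs with begin <= end.
/// `not_covered` is the ratio threshold (0.8 in the README rules).
/// Assumption: the NotCovered rule is checked before the Chimeric rule.
fn classify(bad_regions: &[(usize, usize)], read_len: usize, not_covered: f64) -> ReadType {
    // Rule 3: total bad-region length compared to a fraction of the read length.
    let total_bad: usize = bad_regions.iter().map(|(b, e)| e - b).sum();
    if total_bad as f64 > not_covered * read_len as f64 {
        return ReadType::NotCovered;
    }
    // Rule 2: a bad region strictly inside the read (touching neither end).
    let internal = bad_regions.iter().any(|&(b, e)| b > 0 && e < read_len);
    if internal {
        ReadType::Chimeric
    } else {
        ReadType::NotBad
    }
}

fn main() {
    // Made-up reads of length 1000, for illustration only.
    println!("{:?}", classify(&[(400, 450)], 1000, 0.8)); // internal gap
    println!("{:?}", classify(&[(0, 900)], 1000, 0.8)); // mostly uncovered
    println!("{:?}", classify(&[(0, 50)], 1000, 0.8)); // gap only at the start
}
```

A bad region that touches either end of the read does not trigger the `Chimeric` rule, which is why the third call falls through to `NotBad`.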

@@ -84,7 +78,7 @@ yacrd -i overlap.paf -o reads.yacrd
 yacrd can perform some post-detection operation:

 - filter: for sequence or overlap file, record with reads marked as Chimeric or NotCovered isn't write in output
-- extract: for sequence or overlap file, record contain reads marked as Chimeric or NotCovered is write in output
+- extract: for sequence or overlap file, record contains reads marked as Chimeric or NotCovered is write in output
 - split: for sequence file bad region in middle of reads are removed, NotCovered read is removed
 - scrubb: for sequence file all bad region are removed, NotCovered read is removed

@@ -96,24 +90,24 @@ yacrd -i mapping.paf -o reads.yacrd split -i reads.fasta -o reads.split.fasta
 yacrd -i mapping.paf -o reads.yacrd scrubb -i reads.fasta -o reads.scrubb.fasta
 ```

-### Read scrubbing overlapping recommanded parameter
+### Read scrubbing overlapping recommended parameter

-For nanopore data, we recommand to use minimap2 with all-vs-all nanopore preset with maximal distance between seeds fixe to 500 (option `-g 500`) to generate overlap. We recommand to run yacrd with minimal coverage fixed to 4 (option `-c`) and minimal coverage of read fixed to 0.4 (option `-n`).
+For nanopore data, we recommend using minimap2 with all-vs-all nanopore preset with a maximal distance between seeds fixe to 500 (option `-g 500`) to generate overlap. We recommend to run yacrd with minimal coverage fixed to 4 (option `-c`) and minimal coverage of read fixed to 0.4 (option `-n`).

 This is an exemple of how run a yacrd scrubbing:
 ```
 minimap2 -x ava-ont -g 500 reads.fasta reads.fasta > overlap.paf
 yacrd -i overlap.paf -o report.yacrd -c 4 -n 0.4 scrubb -i reads.fasta -o reads.scrubb.fasta
 ```

-For pacbio P6-C4 data, we recommand to use minimap2 with all-vs-all pacbio preset with maximal distance between seeds fixe to 800 (option `-g 800`) to generate overlap. We recommand to run yacrd with minimal coverage fixed to 4 (option `-c 4`) and minimal coverage of read fixed to 0.4 (option `-n 0.4`).
+For pacbio P6-C4 data, we recommend to use minimap2 with all-vs-all pacbio preset with a maximal distance between seeds fixe to 800 (option `-g 800`) to generate overlap. We recommend to run yacrd with minimal coverage fixed to 4 (option `-c 4`) and minimal coverage of read fixed to 0.4 (option `-n 0.4`).

 ```
 minimap2 -x ava-pb -g 800 reads.fasta reads.fasta > overlap.paf
 yacrd -i overlap.paf -o report.yacrd -c 4 -n 0.4 scrubb -i reads.fasta -o reads.scrubb.fasta
 ```

-For pacbio Sequel data, we recommand to use minimap2 with all-vs-all pacbio preset with maximal distance between seeds fixe to 5000 (option `-g 5000`) to generate overlap. We recommand to run yacrd with minimal coverage fixed to 3 (option `-c 3`) and minimal coverage of read fixed to 0.4 (option `-n 0.4`).
+For pacbio Sequel data, we recommend to use minimap2 with all-vs-all pacbio preset with a maximal distance between seeds fixe to 5000 (option `-g 5000`) to generate overlap. We recommand to run yacrd with minimal coverage fixed to 3 (option `-c 3`) and minimal coverage of read fixed to 0.4 (option `-n 0.4`).

 ```
 minimap2 -x ava-pb -g 5000 reads.fasta reads.fasta > overlap.paf
@@ -133,7 +127,7 @@ yacrd use extension to detect format file if your filename contains (anywhere):

 #### Compression

-yacrd automaticly detect file if is compress or not (gzip, bzip2 and lzma compression is avaible). For post-detection operation if input is compress output have same compression.
+yacrd automatically detect file if is compress or not (gzip, bzip2 and lzma compression is available). For post-detection operation, if input is compressed output have the same compression format.

 #### Use yacrd report as input
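Review note: one common way to detect compression without trusting the file name is to look at the first bytes of the stream. The sketch below shows that general technique only; it is not claimed to be how yacrd or its niffler dependency implements detection. `Compression` and `sniff` are hypothetical names, and mapping the README's "lzma" to the xz container signature is an assumption.

```rust
/// Illustrative set of formats; `Plain` means no known signature matched.
#[derive(Debug, PartialEq)]
enum Compression {
    Gzip,
    Bzip2,
    Xz,
    Plain,
}

/// Guess the compression format from the leading bytes of a file.
fn sniff(head: &[u8]) -> Compression {
    if head.starts_with(&[0x1f, 0x8b]) {
        // gzip member header
        Compression::Gzip
    } else if head.starts_with(b"BZh") {
        // bzip2 stream header
        Compression::Bzip2
    } else if head.starts_with(&[0xfd, b'7', b'z', b'X', b'Z', 0x00]) {
        // xz container header (assumed here to stand in for "lzma")
        Compression::Xz
    } else {
        Compression::Plain
    }
}

fn main() {
    println!("{:?}", sniff(&[0x1f, 0x8b, 0x08]));
    println!("{:?}", sniff(b"BZh91AY"));
    println!("{:?}", sniff(b">read1\nACTG"));
}
```

Reusing the detected format when writing would give the "output has the same compression as the input" behaviour the README describes for post-detection operations.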

@@ -142,13 +136,13 @@ You can use yacrd report as input in place of overlap file, `ondisk` option are
 ## Output

 ```
-type_of_read id_in_mapping_file length_of_read length_of_gap,begin_pos_of_gap,end_pos_of_gap;length_of_gap,be…
+type_of_read id_in_mapping_file length_of_read length_of_gap,begin_pos_of_gap,end_pos_of_gap;length_of_gap,be…
 ```

 ### Example

 ```
-NotCovered readA 4599 3782,0,3782
+NotCovered readA 4599 3782,0,3782
 ```

 Here, readA doesn't have sufficient coverage, there is a zero-coverage region of length 3782bp between positions 0 and 3782.
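Review note: a report line in this format can be read back with a few string splits. The sketch below assumes whitespace-separated columns (the rendered diff does not show whether the separator is a tab or spaces) and gap triplets of the form `length,begin,end` joined by `;`. `Record` and `parse_line` are hypothetical names, not part of yacrd.

```rust
/// One line of a yacrd report, as described in the README.
#[derive(Debug, PartialEq)]
struct Record {
    read_type: String,
    id: String,
    length: usize,
    /// (length_of_gap, begin_pos_of_gap, end_pos_of_gap)
    gaps: Vec<(usize, usize, usize)>,
}

/// Parse one report line; returns None when a column is missing or not numeric.
fn parse_line(line: &str) -> Option<Record> {
    let mut cols = line.split_whitespace();
    let read_type = cols.next()?.to_string();
    let id = cols.next()?.to_string();
    let length = cols.next()?.parse().ok()?;
    let mut gaps = Vec::new();
    // Assumption: the gap column may be absent when a read has no bad region.
    if let Some(col) = cols.next() {
        for triplet in col.split(';').filter(|t| !t.is_empty()) {
            let mut fields = triplet.split(',');
            let len = fields.next()?.parse().ok()?;
            let begin = fields.next()?.parse().ok()?;
            let end = fields.next()?.parse().ok()?;
            gaps.push((len, begin, end));
        }
    }
    Some(Record { read_type, id, length, gaps })
}

fn main() {
    let rec = parse_line("NotCovered readA 4599 3782,0,3782").expect("well-formed line");
    let bad: usize = rec.gaps.iter().map(|g| g.0).sum();
    println!("{} {}: {} of {} bp in bad regions", rec.read_type, rec.id, bad, rec.length);
}
```

For the README example, 3782 of 4599 bp is about 0.82 of the read, which is above the 0.8 ratio in the README rules and consistent with the `NotCovered` label.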
@@ -170,13 +164,13 @@ Pierre Marijon, Rayan Chikhi, Jean-Stéphane Varré, yacrd and fpa: upstream too
 bibtex format:
 ```
 @article {Marijon2019,
-author = {Marijon, Pierre and Chikhi, Rayan and Varr{\'e}, Jean-St{\'e}phane},
-title = {yacrd and fpa: upstream tools for long-read genome assembly},
-elocation-id = {674036},
-year = {2019},
-doi = {10.1101/674036},
-URL = {https://www.biorxiv.org/content/early/2019/06/18/674036},
-eprint = {https://www.biorxiv.org/content/early/2019/06/18/674036.full.pdf},
-journal = {bioRxiv}
+author = {Marijon, Pierre and Chikhi, Rayan and Varr{\'e}, Jean-St{\'e}phane},
+title = {yacrd and fpa: upstream tools for long-read genome assembly},
+elocation-id = {674036},
+year = {2019},
+doi = {10.1101/674036},
+URL = {https://www.biorxiv.org/content/early/2019/06/18/674036},
+eprint = {https://www.biorxiv.org/content/early/2019/06/18/674036.full.pdf},
+journal = {bioRxiv}
 }
 ```

src/cli.rs

Lines changed: 16 additions & 22 deletions
@@ -22,45 +22,39 @@ SOFTWARE.

 #[derive(StructOpt, Debug)]
 #[structopt(
-    version = "0.6b Mew",
+    version = "0.6.0 Flareon",
     author = "Pierre Marijon <[email protected]>",
     name = "yacrd",
     about = "
 Yacrd use overlap between reads, to detect 'good' and 'bad' region,
-region with coverage over threshold is 'good' other are 'bad'.
-If read have a 'bad' region in middle this reads is mark as 'Chimeric'.
-If ratio of 'bad' region length on total read length is larger than threshold this reads is mark as 'Not_covered'.
+a region with coverage over the threshold is 'good' others are 'bad'.
+If read has a 'bad' region in middle this reads is mark as 'Chimeric'.
+If the ratio of 'bad' region length on total read length is larger than threshold this reads is mark as 'Not_covered'.

 Yacrd can make some other actions:
-- filter: for sequence or overlap file, record with reads marked as Chimeric or Not_covered isn't write in output
-- extract: for sequence or overlap file, record contain reads marked as Chimeric or Not_covered is write in output
-- split: for sequence file bad region in middle of reads are removed, Not_covered read is removed
-- scrubb: for sequence file all bad region are removed, Not_covered read is removed
+- filter: for sequence or overlap file, record with reads marked as Chimeric or NotCovered isn't written in the output
+- extract: for sequence or overlap file, record contains reads marked as Chimeric or NotCovered is written in the output
+- split: for sequence file bad region in the middle of reads are removed, NotCovered read is removed
+- scrubb: for sequence file all bad region are removed, NotCovered read is removed
 "
 )]
 pub struct Command {
     #[structopt(
         short = "i",
         long = "input",
         required = true,
-        help = "path to input file overlap (.paf|.m4) or yacrd report (.yacrd) format audetected input-format overide detection"
+        help = "path to input file overlap (.paf|.m4|.mhap) or yacrd report (.yacrd), format is autodetect and compression input is allowed (gz|bzip2|lzma)"
     )]
     pub input: String,

     #[structopt(
         short = "o",
         long = "output",
         required = true,
-        help = "path output file, yacrd format by default output-format can overide this value"
+        help = "path output file"
     )]
     pub output: String,

-    #[structopt(long = "input-format", possible_values = &["paf", "m4", "yacrd", "json"], help = "set the input-format")]
-    pub input_format: Option<String>,
-
-    #[structopt(long = "output-format", possible_values = &["yacrd", "json"], default_value = "yacrd", help = "set the output-format")]
-    pub output_format: String,
-
     #[structopt(
         short = "c",
         long = "coverage",
@@ -73,21 +67,21 @@ pub struct Command {
         short = "n",
         long = "not-coverage",
         default_value = "0.8",
-        help = "if ratio of bad region length on total lengh is lower that this value, all read is mark as bad"
+        help = "if the ratio of bad region length on total length is lower than this value, read is marked as NotCovered"
     )]
     pub not_coverage: f64,

     #[structopt(
         short = "d",
         long = "ondisk",
-        help = "if it set yacrd create tempory file, with value of this parameter as prefix, to reduce memory usage but increase the runtime, warning if prefix contain path separator (`/` for unix or `\\` for windows) directory is delete"
+        help = "yacrd switches to 'ondisk' mode which will reduce memory usage but increase computation time. The value passed as a parameter is used as a prefix for the temporary files created by yacrd. Be careful if the prefix contains path separators (`/` for unix or `\\` for windows) this folder will be deleted"
     )]
     pub ondisk: Option<String>,

     #[structopt(
         long = "ondisk-buffer-size",
         default_value = "64000000",
-        help = "with the default value yacrd in ondisk mode use around 800 MBytes, you can increase to reduce runtime but increase memory usage"
+        help = "with the default value yacrd in 'ondisk' mode use around 1 GBytes, you can increase to reduce runtime but increase memory usage"
     )]
     pub ondisk_buffer_size: String,

@@ -99,11 +93,11 @@ pub struct Command {
 pub enum SubCommand {
     #[structopt(about = "All bad region of read is removed")]
     Scrubb(Scrubb),
-    #[structopt(about = "Record mark as chimeric or Not_covered is filter")]
+    #[structopt(about = "Record mark as chimeric or NotCovered is filter")]
     Filter(Filter),
-    #[structopt(about = "Record mark as chimeric or Not_covered is extract")]
+    #[structopt(about = "Record mark as chimeric or NotCovered is extract")]
     Extract(Extract),
-    #[structopt(about = "Record mark as chimeric or Not_covered is split")]
+    #[structopt(about = "Record mark as chimeric or NotCovered is split")]
     Split(Split),
 }
109103

src/editor/extract.rs

Lines changed: 6 additions & 6 deletions
@@ -25,10 +25,10 @@ use anyhow::{anyhow, Context, Result};
 use bio::io::{fasta, fastq};

 /* local use */
-use editor;
-use error;
-use stack;
-use util;
+use crate::editor;
+use crate::error;
+use crate::stack;
+use crate::util;

 pub fn extract(
     input_path: &str,
@@ -225,8 +225,8 @@ where
 mod tests {
     use super::*;

-    use reads2ovl;
-    use reads2ovl::Reads2Ovl;
+    use crate::reads2ovl;
+    use crate::reads2ovl::Reads2Ovl;

     const FASTA_FILE: &'static [u8] = b">1
 ACTG
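
Review note: the `use crate::…` rewrites in this and the following files go with the `edition = '2018'` line added to Cargo.toml. In the 2015 edition, `use` paths were resolved from the crate root; in the 2018 edition they are resolved relative to the current module or an external crate name, so modules that live at the crate root are reached through the `crate::` prefix. A minimal, self-contained illustration follows; the modules `util` and `editor` and their functions are made-up stand-ins, not yacrd's real code.

```rust
mod util {
    /// Percentage of `part` in `total`.
    pub fn percent(part: usize, total: usize) -> f64 {
        100.0 * part as f64 / total as f64
    }
}

mod editor {
    // Reach a sibling module through the crate root.
    // A bare `use util;` here would be looked up relative to `editor` in the 2018 edition.
    use crate::util;

    pub fn describe(bad: usize, len: usize) -> String {
        format!("{:.1}% bad", util::percent(bad, len))
    }
}

fn main() {
    println!("{}", editor::describe(250, 1000));
}
```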

src/editor/filter.rs

Lines changed: 6 additions & 6 deletions
@@ -25,10 +25,10 @@ use anyhow::{anyhow, Context, Result};
 use bio::io::{fasta, fastq};

 /* local use */
-use editor;
-use error;
-use stack;
-use util;
+use crate::editor;
+use crate::error;
+use crate::stack;
+use crate::util;

 pub fn filter(
     input_path: &str,
@@ -225,8 +225,8 @@ where
 mod tests {
     use super::*;

-    use reads2ovl;
-    use reads2ovl::Reads2Ovl;
+    use crate::reads2ovl;
+    use crate::reads2ovl::Reads2Ovl;

     const FASTA_FILE: &'static [u8] = b">1
 ACTG

src/editor/mod.rs

Lines changed: 2 additions & 2 deletions
@@ -36,8 +36,8 @@ pub use self::split::*;
 use anyhow::{Context, Result};

 /* local use */
-use error;
-use util;
+use crate::error;
+use crate::util;

 #[derive(Debug, PartialEq)]
 pub enum ReadType {

src/editor/scrubbing.rs

Lines changed: 6 additions & 6 deletions
@@ -25,10 +25,10 @@ use anyhow::{anyhow, Context, Result};
 use bio::io::{fasta, fastq};

 /* local use */
-use editor;
-use error;
-use stack;
-use util;
+use crate::editor;
+use crate::error;
+use crate::stack;
+use crate::util;

 pub fn scrubbing(
     input_path: &str,
@@ -212,8 +212,8 @@ where
 mod tests {
     use super::*;

-    use reads2ovl;
-    use reads2ovl::Reads2Ovl;
+    use crate::reads2ovl;
+    use crate::reads2ovl::Reads2Ovl;

     const FASTA_FILE: &'static [u8] = b">1
 ACTGGGGGGACTGGGGGGACTG

src/editor/split.rs

Lines changed: 6 additions & 6 deletions
@@ -25,10 +25,10 @@ use anyhow::{Context, Result};
 use bio::io::{fasta, fastq};

 /* local use */
-use editor;
-use error;
-use stack;
-use util;
+use crate::editor;
+use crate::error;
+use crate::stack;
+use crate::util;

 pub fn split(
     input_path: &str,
@@ -202,8 +202,8 @@ where
 mod tests {
     use super::*;

-    use reads2ovl;
-    use reads2ovl::Reads2Ovl;
+    use crate::reads2ovl;
+    use crate::reads2ovl::Reads2Ovl;

     const FASTA_FILE: &'static [u8] = b">1
 ACTGGGGGGACTGGGGGGACTG
