Breaking changes for v1 #35

jakobnissen · 2023-12-30T21:19:54Z

This PR implements a thorough overhaul of Kmers.jl. There are too many changes to list exhaustively, but here are some important changes

Skipmers have been removed. They are, to my knowledge, not used widely in bioinformatics, and not in the bioinfo literature. I can add them if people want to.
There are no longer any Kmer constructors that guess the alphabet
There are no longer any Kmer constructors that infers the length of the sequence
There are no longer any functions to produce a shuffled kmer. I don't think this is widely used
There are no longer functionality to produce random kmers.

TODO

Wait for feedback from stakeholders
Review TODOs in code
Reach a code coverage level of 100 %
Review documentation
Release BioSymbols and BioSequences versions compatible
Implement testeable benchmarks and micro-optimise everything
Determine real compat of Random and BioSequences
Deploy preview doc build
Squash all changes into one mega commit

Closes #12
Closes #13 (with the answer "no")
Closes #17
Closes #23
Closes #28
Closes #31
Closes #34

cjprybol

Wow, nice work on this @jakobnissen. This is a big overhaul.

I skimmed all of the changes and think it looks nice. I can try to give it a detailed review late next week if you'd like a second set of eyes (to be honest I'm completely lost with the bit-code here and in biosequences but the rest of it makes sense).

I saw your note that people can conceptually think of kmers as SVectors, but they're still implemented as NTuples, correct? I had been using SVectors of BioSymbol nucleotides to close the gaps of what I wasn't able to do with Kmers.jl, mostly with Amino Acids but with DNA and RNA as well. I liked the design interface of Kmers as SVectors, but I assume there's some efficiency reason why that isn't the direction that Kmers.jl went with implementation.

Are there any Fw + Rv Kmer iterators? I know that the current iterator returns fw + rv. It's easy to just call reverse_complement so I won't miss it if you removed it, but I'm just trying to think through where I'll need to update my projects to handle these updates.

Thanks for all of your efforts on this and related projects!

jakobnissen · 2023-12-30T22:23:08Z

Thank you for the kind words, @cjprybol!

I'm planning to post on Discourse one of the coming days about this upcoming release and request feedback. There is still a little more work to do - for example, this PR needs changes to BioSequences to land first. Nonetheless - yes! I would like a review. To test it, dev the kmers_compat branch of BioSequences (there is an open PR), then dev Kmers.
The best review you can give is a usability one. What could be nicer, API wise? Any missing but important functionality? I'm pretty comfortable with the lower level implementation.

We don't back Kmers with SVector, because a) we want to avoid StaticArrays as a large dependency and b) we need to do most of the low-level bit twiddling ourselves anyway, since symbols are stored in a packed format in Kmers (e.g. only 2 bits for a nucleotide instead of whole byte).

A fw/rv iterator might be a good idea! I've currently implemented a canonical kmer iterator that only outputs the lexographically smaller of the two. But I think a fw/rv iterator that yields both may be more generically usable.

kescobo · 2023-12-31T12:32:16Z

I've currently implemented a canonical kmer iterator that only outputs the lexographically smaller of the two.

I smell MinHash!

In the discourse post, you note that a bunch of other more complicated convenience features were removed or not implemented. I infer that in most cases, the intent is for those things to be implemented in other packages, but are there any features that you think will belong here in a point release after the API is stable?

jakobnissen · 2023-12-31T13:26:09Z

@kescobo In the discourse post, I only mention skipmers, minimizers and k-min-mers. In the message of this PR, I mention other misc features such as random kmers.

I believe skipmers were an invention of Sebarina Ward's research group, which is why they were originally implemented in BioSequences.jl (and Kmers.jl). As far as I know, they have not had much usage in the wider bioinfo world. One notable exception is the bwa-mem algorithm of popular aligner bwa by Heng Li. However, minimizers have been shown to be superior to kmers and skipmers for most applications, and indeed Heng Li have created minimap2 using minimizers, which is mostly a successor to bwa (and is mostly just better though it's complicated). So, while it's a little awkward to remove something so explicitly added by the original author, I don't believe skipmers really belong in a generic package like Kmers.jl, though I could be persuaded otherwise.

Minimizers, and their even newer relative k-min-mers, are more useful, but the precise algorithm for extracting them differ from application to application, so I'm guessing users would want to create their own minimizer iterators. I'd rather provide functionality to more easily make custom iterators than to provide my own which is probably not optimal for most users.

Regarding the other functionality - shuffled and random kmers, I don't think they are that useful. I'll add them if someone needs them and makes a case for them. For the constructors that infer the length and alphabet, I don't think it's a good interface. Much better to be explicit.

I smell MinHash!

Yes! You can combine MinHash.jl with this iterator to have a pretty efficient minhash:

julia> using MinHash, BioSequences, Kmers, BenchmarkTools

julia> seq = randdnaseq(10_000_000);

julia> @btime sketch(CanonicalDNAMers{31}(seq), 100)
  35.612 ms (8 allocations: 6.88 KiB)

That's 3.5 nanoseconds per kmer. That could plausibly become even faster, as I'm planning to optimise hashing of kmers as much as humanly possible.

kescobo · 2023-12-31T13:53:01Z

Great! So I take this to mean that you consider Kmers.jl feature complete, barring someone making a strong affirmative case for something. Or put another way, there aren't any features you have explicitly planned to implement after the 1.0 release (at least not in this package).

FWIW, I think this is a good decision - since you're aggressively optimizing performance here, it makes sense to keep the surface as small as possible.

I'd rather provide functionality to more easily make custom iterators than to provide my own which is probably not optimal for most users.

This seems really great. Are there any other things that are (or should be planned to be) extensible like this? Is it possible for other packages to make their own kmer types, and if so, is the interface well-documented? (Noting that I haven't yet looked at the code or docs, if this is explained elsewhere, no need to repeat yourself).

So, while it's a little awkward to remove something so explicitly added by the original author, I don't believe skipmers really belong in a generic package like Kmers.jl

I don't think it's awkward. That code is still all there in the history, and can be revived if needed. I don't think she'd expect you to maintain her code forever. One consideration might be whether it is in-principle possible to reimplement in a package that extends this one, but even if not, I don't think you need to worry about it.

docs/src/translation.md

camilogarciabotero · 2024-01-02T17:08:52Z

I tried to make some changes but wrongfully created an alternative PR (already closed), here are some small typos I found in the docs

docs/src/faq.md

camilogarciabotero · 2024-01-02T21:30:51Z

docs/src/iteration.md

+SpacedAAMers
+```
+
+The convenience functions [`each_dna_codon`](@ref) and [`each_rna_codon`](@ref) return `SpacedKmers` with a K value of 3 and step size of 3:


This is very neat! Would it be useful that the each_dna_codon method has a kwarg for the position or frame?

camilogarciabotero · 2024-01-02T21:50:54Z

I was experimenting a little bit with the each_dna_codon method and found an unexpected behavior. This is from the docs:

julia> collect(each_dna_codon("TGACGATCGAC"))
3-element Vector{Kmer{DNAAlphabet{2}, 3, 1}}:
 TGA
 CGA
 TCG

But what if the input seq is of a BioSequence type:

collect(each_dna_codon(dna"TGACGATCGAC"))
3-element Vector{Kmer{DNAAlphabet{2}, 3, 1}}:
 TGA
 TGA
 TGA

Shouldn't it be allowed?

kescobo

I'll give you my review in spurts so that you can begin to address them if you want. So far,
I've made it through the documentation. Amazing work! It's very clear.

The code will probably take me longer to get through, and I will probably have less to say about it. As described in the docs, the API seems great, and I agree with your choices re: what to remove. As I said a bit earlier, it might be nice to write up a little bit about how you might implement a custom iterator (similar to the MinHash part), if only to expose whether there are any parts of the API that need tweaking to facilitate this.

I'd be comfortable with you tagging this prior to my review of the code, since that may take me some time. But I will endeavor to get that done in the next week or two.

🎉 This is a really great addition to the ecosystem, thanks again for all of this work. It's really impressive!

kescobo · 2024-01-02T17:17:03Z

docs/src/hashing.md

+!!! warning
+    The value of hashes are guaranteed to be reproducible for a given version
+    of Kmers.jl and Julia, but may __change__ in new minor versions of Julia
+    or Kmers.jl


Probably for MinHash.jl, but maybe worth revisiting hashing and compatibility with other implementations. See sourmash-bio/sourmash#2284 and links, esp BioJulia/BioSequences.jl#243 for prior discussion.

I don't think think there's anything to do here - we don't want to slow down Kmers.jl for the sake of compat, but if there's anything that would help enabling this (eg fast conversion to String if that's possible), it could be worth considering.

kescobo · 2024-01-02T17:22:08Z

.github/workflows/UnitTests.yml

+      - uses: actions/checkout@v2
+      - uses: julia-actions/setup-julia@latest
+        with:
+          version: '1'


Do we only want to test on release?

kescobo · 2024-01-03T01:19:16Z

Project.toml

+StringViews = "354b36f9-a18e-4713-926e-db85100087ba"

 [deps]
 BioSequences = "7e6ae17a-c86d-528c-b3b9-7f778a29fe59"
+BioSymbols = "3c28c6f8-a34d-59c4-9654-267d177fcfa9"
+StringViews = "354b36f9-a18e-4713-926e-db85100087ba"


Does it make sense that StringViews is both a dep AND a weak dep? Is there any advantage to that?

kescobo · 2024-01-03T01:20:13Z

README.md

-Kmers provides the type representing kmers as well as the implementations of
-the APIs specified by the
-[`BioSequences.jl`](https://github.com/BioJulia/BioSequences.jl) package.
+In Kmers.jl, the `Kmer` type is psrameterized by its length, and its data is stored in an `NTuple`. This makes `Kmers` bitstypes and highly efficient.


Suggested change

In Kmers.jl, the `Kmer` type is psrameterized by its length, and its data is stored in an `NTuple`. This makes `Kmers` bitstypes and highly efficient.

In Kmers.jl, the `Kmer` type is parameterized by its length, and its data is stored in an `NTuple`. This makes `Kmers` bitstypes and highly efficient.

kescobo · 2024-01-03T01:21:45Z

README.md

+## Usage
+### ⚠️ WARNING ⚠️
+`Kmer`s are parameterized by their length. That means any operation on `Kmer`s that change their length, such as `push`, `pop`, slicing, or masking (logical indexing) will be **type unstable** and hence slow and memory inefficient, unless you write your code in such as way that the compiler can use constant folding.


Suggested change

## Usage

### ⚠️ WARNING ⚠️

`Kmer`s are parameterized by their length. That means any operation on `Kmer`s that change their length, such as `push`, `pop`, slicing, or masking (logical indexing) will be **type unstable** and hence slow and memory inefficient, unless you write your code in such as way that the compiler can use constant folding.

## Usage

### ⚠️ WARNING ⚠️

`Kmer`s are parameterized by their length. That means any operation on `Kmer`s that change their length, such as `push`, `pop`, slicing, or masking (logical indexing) will be **type unstable** and hence slow and memory inefficient, unless you write your code in such as way that the compiler can use constant folding.

I've seen this blank-line-between-headers on markdown lint before. I don't really care, and this will be the only comment I leave about it. Feel free to disregard.

kescobo · 2024-01-03T01:51:04Z

docs/src/kmers.md

+can be created with the type aliases `DNAKmer`, `RNAKmer` and `AAKmer`:
+
+```jldoctest
+julia> DNAKmer{3}("tag")


Is there a reason that the {3} can't be elided here?

on second thought, the string literals below should maybe be considered the best way to do this.

kescobo · 2024-01-03T01:55:03Z

docs/src/kmers.md

+julia> seq = randseq(DNAAlphabet{2}(), 100_000_000);
+
+julia> @time count_aaas(seq)
+  0.193463 seconds (32.05 k allocations: 2.051 MiB, 21.88% compilation time)


It may be obvious to you, but I have no idea what I'd expect here if the code weren't type stable or if the parsing overhead was being paid.

If that's the point you're trying to make, you might need to show this for a smaller sequence and show a slow/fast version (or else explain what you would expect).

kescobo · 2024-01-03T02:00:37Z

docs/src/kmers.md

+However, type unstable functions may be type-stable, if the indexing value is
+known at compile time, and the Julia compiler uses constant folding:
+
+```jldoctest
+julia> f(x) = x[2:5]; # 2:5 is a compile time constant
+
+julia> Test.@inferred f(mer"UCGUAGC"r)
+RNA 4-mer:
+CGUA
+```


I usually think of things like !!! warning blocks to be asides - it's a bit strange to have this bit that's related to the warning be outside of it. Possible solutions (if you agree it's a problem):

Put the warning prior to the L3 header and add "see below for solutions to this problem" inside it. Then start just below the header with "Tough indexing operations are often type-unstable, functions may be..."

Nest another callout block inside the warning, like

!!! warning Type instability is bad! !!! tip But functions are good!

kescobo · 2024-01-03T02:01:44Z

docs/src/kmers.md

Stopped after this file for now. Will work on this progressively over the next few days.

kescobo · 2024-01-03T02:04:23Z

docs/src/translation.md

+Since `Kmer`s are immutable, the in-place `translate!` function is not implemented for `Kmers`.
+Also, remember that `Kmer`s are only efficient when short (at most a few hundred symbols). Hence, entire exons or genes should probably not ever be represented by a `Kmer`, but rather as a `LongSequence` or `LongSubSeq` from BioSequences.jl.
+
+### Reverse translation


This is pretty cool!

camilogarciabotero · 2024-01-08T22:57:51Z

I was playing around more. This is probably an undesired behavior or something not yet implemented... But as the interfaces of BioSequences and Kmers share some methods I was wondering if the find* methods will work in some ways between BioSequence and Kmer types:

Symbol finding with findall:

findall(DNA_T, dna"ATGCTG")
2-element Vector{Int64}:
 2
 5

will this work:

findall(mer"T"d,   dna"ATGCTG")

BioRegex finding with findfirst and findlast

findfirst(biore"ATG"dna, dna"ATGCTG", 1)
1:3

will this work:

findfirst(mer"T"d,   dna"ATGCTG")

I didn't see this in the in this PR, but ignore this comment if this is already something discussed. Best

cjprybol · 2024-01-20T11:33:29Z

test/translation.jl

+    @test_throws Exception translate(mer"CUGUAGUUGUCGC"r)
+    @test_throws Exception translate(mer"AGCGA"d)
+
+    # Cannot transla AA seq


transla -> translate

cjprybol

I'm happy with this whenever everyone else is. This is a very generous contribution! As long as the tests pass, I don't have any concerns with the implementation details and am excited to work with this

BeastyBlacksmith · 2024-02-14T13:16:32Z

Can't wait to have it 😍

jakobnissen added 24 commits June 23, 2023 20:55

Use JuliaFormatter

efd1d8b

Random cleanup

4abc5e9

More fixups - squash

45a9f38

More stuff

cfaacb3

Add revtrans and setindex

2c64877

Add translation

1ecb76c

Make EveryKmer iterator

7385d9b

Fixup README.md

1293e69

More stuff

82b4767

Fixup README

603c591

Begin EveryCanonicalKmer

a3ecda4

Rename: EveryKmer to FwKmers

6c1535e

Start SpacedKmers

ba774f1

Some refactoring

6584e2e

Refactor to trait objects

550dd6e

Extensive refactoring

726021b

More refactoring

db1f16c

Start tests

c7e5e80

More tests

2912114

Add Canonical and UnambiguousKmers

648c4bd

Update some docstrings

1192b72

Add some docs

30952eb

Add rest of the docs

c92db18

Remove old iterators

d916616

cjprybol reviewed Dec 30, 2023

View reviewed changes

jakobnissen added 3 commits December 31, 2023 10:20

Add FwRvIterator

47b4e03

Push doc preview

c280f48

Misc cleanup

78fe5dd

Add FxHash

320596b

camilogarciabotero reviewed Jan 2, 2024

View reviewed changes

docs/src/translation.md Outdated Show resolved Hide resolved

camilogarciabotero reviewed Jan 2, 2024

View reviewed changes

docs/src/faq.md Outdated Show resolved Hide resolved

jakobnissen added 5 commits January 2, 2024 20:10

Misc changes

9c79f6a

Typos and fixes

825bc73

Make minhash example smaller

8801e32

Optimise slicing

2e7553f

Fix bug in slicing

2dd7acb

camilogarciabotero reviewed Jan 2, 2024

View reviewed changes

kescobo approved these changes Jan 3, 2024

View reviewed changes

cjprybol reviewed Jan 20, 2024

View reviewed changes

cjprybol approved these changes Jan 20, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Breaking changes for v1 #35

Breaking changes for v1 #35

jakobnissen commented Dec 30, 2023 •

edited

Loading

cjprybol left a comment

jakobnissen commented Dec 30, 2023

kescobo commented Dec 31, 2023

jakobnissen commented Dec 31, 2023 •

edited

Loading

kescobo commented Dec 31, 2023 •

edited

Loading

camilogarciabotero commented Jan 2, 2024

camilogarciabotero Jan 2, 2024

camilogarciabotero commented Jan 2, 2024

kescobo left a comment

kescobo Jan 2, 2024

kescobo Jan 2, 2024

kescobo Jan 3, 2024

kescobo Jan 3, 2024

kescobo Jan 3, 2024

kescobo Jan 3, 2024

kescobo Jan 3, 2024

kescobo Jan 3, 2024

kescobo Jan 3, 2024

kescobo Jan 3, 2024

kescobo Jan 3, 2024

camilogarciabotero commented Jan 8, 2024

cjprybol Jan 20, 2024

cjprybol left a comment

BeastyBlacksmith commented Feb 14, 2024

	In Kmers.jl, the `Kmer` type is psrameterized by its length, and its data is stored in an `NTuple`. This makes `Kmers` bitstypes and highly efficient.
	In Kmers.jl, the `Kmer` type is parameterized by its length, and its data is stored in an `NTuple`. This makes `Kmers` bitstypes and highly efficient.

Breaking changes for v1 #35

Are you sure you want to change the base?

Breaking changes for v1 #35

Conversation

jakobnissen commented Dec 30, 2023 • edited Loading

TODO

cjprybol left a comment

Choose a reason for hiding this comment

jakobnissen commented Dec 30, 2023

kescobo commented Dec 31, 2023

jakobnissen commented Dec 31, 2023 • edited Loading

kescobo commented Dec 31, 2023 • edited Loading

camilogarciabotero commented Jan 2, 2024

Choose a reason for hiding this comment

camilogarciabotero commented Jan 2, 2024

kescobo left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

camilogarciabotero commented Jan 8, 2024

Choose a reason for hiding this comment

cjprybol left a comment

Choose a reason for hiding this comment

BeastyBlacksmith commented Feb 14, 2024

jakobnissen commented Dec 30, 2023 •

edited

Loading

jakobnissen commented Dec 31, 2023 •

edited

Loading

kescobo commented Dec 31, 2023 •

edited

Loading