Add LCS Based algorithm that finds similar strings. #50
Ph0enixKM wants to merge 3 commits into rapidfuzz:main
Conversation
@dguo, are these changes sufficient?
```rust
table[0] = table.pop().unwrap();
table.push(vec![0 as usize; left.len() + 1]);
```
This means we reallocate on each iteration. Instead, you should simply swap the rows: https://github.com/dguo/strsim-rs/blob/1d92c1d51c6118cd95d7417a6dcbd25abb9c36c0/src/lib.rs#L331
Then again, I feel like this should be possible using a single vector, as long as `table[0][col]` is stored from the previous iteration before it is overwritten. Similar to https://github.com/dguo/strsim-rs/blob/1d92c1d51c6118cd95d7417a6dcbd25abb9c36c0/src/lib.rs#L255-L257
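The single-vector idea could look roughly like the following. This is a hedged sketch, not the library's actual code: the row holds the previous DP row, and `prev` caches the diagonal cell (`table[i-1][col]`) before it is overwritten.

```rust
use std::cmp::max;

// Sketch: LCS length using a single rolling row instead of a 2 x (n+1) table.
// `prev` stores the upper-left (diagonal) cell before row[col + 1] is updated.
fn lcs_length(left: &str, right: &str) -> usize {
    let left: Vec<char> = left.chars().collect();
    let mut row = vec![0usize; left.len() + 1];
    for rletter in right.chars() {
        let mut prev = 0; // table[i-1][0] is always 0
        for (col, &lletter) in left.iter().enumerate() {
            let cur = row[col + 1]; // value of table[i-1][col + 1]
            row[col + 1] = if rletter == lletter {
                prev + 1
            } else {
                max(row[col + 1], row[col])
            };
            prev = cur;
        }
    }
    row[left.len()]
}
```

This keeps the O(min(n, m)) memory bound from the PR description while avoiding both the reallocation and the second row.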
```rust
#[inline]
fn lcs_length(left: impl AsRef<str>, right: impl AsRef<str>) -> usize {
    let (left, right) = get_shorter_longer_strings(left, right);
```
This leads to an extra allocation of two `String`s. You should be able to swap them without this. Then again, I am not sure we even want to swap them, since we have a large focus on binary size.
People who care about performance and not so much about binary size should use https://docs.rs/rapidfuzz/latest/rapidfuzz/distance/lcs_seq/index.html, which is significantly faster.
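One way the reordering could be done on the borrowed slices, so that no `String` is allocated (the helper name `shorter_longer` is hypothetical, not the PR's `get_shorter_longer_strings`):

```rust
// Sketch: return (shorter, longer) as borrowed slices; no allocation,
// and char count (not byte length) decides which string is shorter.
fn shorter_longer<'a>(left: &'a str, right: &'a str) -> (&'a str, &'a str) {
    if left.chars().count() <= right.chars().count() {
        (left, right)
    } else {
        (right, left)
    }
}
```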
```rust
/// assert_eq!(0.8, lcs_normalized("night", "fight"));
/// assert_eq!(1.0, lcs_normalized("ferris", "ferris"));
/// ```
pub fn lcs_normalized(left: impl AsRef<str>, right: impl AsRef<str>) -> f64 {
```
- This should follow the interface of the other functions in the library, which accept `&str`.
- Both the normalized and the non-normalized version should be public.
- Possibly name it `lcs_seq` instead of `lcs`, since LCS can mean both longest common subsequence and longest common substring, which are different metrics with the same abbreviation.
- Probably we would want a generic version of the algorithm, similar to the other metrics.
```rust
/// assert_eq!(1.0, lcs_normalized("ferris", "ferris"));
/// ```
pub fn lcs_normalized(left: impl AsRef<str>, right: impl AsRef<str>) -> f64 {
    let (len1, len2) = (left.as_ref().len(), right.as_ref().len());
```
Using `len` here is incorrect, since we operate on chars and not on bytes. It would need to use `.chars().count()`.
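A quick illustration of why the two counts differ for non-ASCII input (the helper name is illustrative):

```rust
// str::len counts UTF-8 bytes; str::chars().count() counts Unicode
// scalar values. For non-ASCII text these disagree.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}
```

For example, `"héllo"` is 5 chars but 6 bytes, because `'é'` is encoded as 2 bytes in UTF-8.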
```rust
#[inline]
fn lcs_length(left: impl AsRef<str>, right: impl AsRef<str>) -> usize {
    let (left, right) = get_shorter_longer_strings(left, right);
    let mut table = vec![vec![0 as usize; left.len() + 1]; 2];
```
This should use the char count as well.
```rust
if rletter == lletter {
    table[1][col + 1] = 1 + table[0][col];
} else {
    table[1][col + 1] = max(table[0][col + 1], table[1][col]);
}
```
In Rust I would probably use something like:

```rust
table[1][col + 1] = if rletter == lletter {
    1 + table[0][col]
} else {
    max(table[0][col + 1], table[1][col])
};
```

instead.
Thank you @maxbachmann for your time. I'll fix the code soon and resolve the conflicts.
This solution uses the length-finding variant of the LCS algorithm.
Time complexity: O(n * m)
Memory complexity: O(min(n, m))
The solution itself is based on a lightweight library that I created myself some time ago. I just realised that this great library exists and wanted to contribute my solution, as I see that you do not have an LCS-based algorithm in your arsenal.
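Judging from the doctest asserts above (0.8 for "night"/"fight", whose LCS "ight" has length 4 out of 5 chars), the normalization appears to divide the LCS length by the longer string's char count. A self-contained sketch of the whole metric under that assumption, using a single rolling row for the stated O(min(n, m)) memory (names are illustrative, not the PR's exact code):

```rust
use std::cmp::max;

// LCS length over chars with one rolling DP row; `prev` caches the
// diagonal cell before it is overwritten.
fn lcs_length(shorter: &str, longer: &str) -> usize {
    let shorter: Vec<char> = shorter.chars().collect();
    let mut row = vec![0usize; shorter.len() + 1];
    for c in longer.chars() {
        let mut prev = 0;
        for (col, &s) in shorter.iter().enumerate() {
            let cur = row[col + 1];
            row[col + 1] = if c == s {
                prev + 1
            } else {
                max(row[col + 1], row[col])
            };
            prev = cur;
        }
    }
    row[shorter.len()]
}

// Assumed normalization: LCS length / char count of the longer string.
fn lcs_normalized(left: &str, right: &str) -> f64 {
    let (n1, n2) = (left.chars().count(), right.chars().count());
    if max(n1, n2) == 0 {
        return 1.0; // two empty strings are identical
    }
    let (shorter, longer) = if n1 <= n2 { (left, right) } else { (right, left) };
    lcs_length(shorter, longer) as f64 / max(n1, n2) as f64
}
```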