Add LCS Based algorithm that finds similar strings. #50
Ph0enixKM wants to merge 3 commits into rapidfuzz:main
Conversation
@dguo, are these changes sufficient?
```rust
table[0] = table.pop().unwrap();
table.push(vec![0 as usize; left.len() + 1]);
```
This means we reallocate on each iteration. Instead, you should simply swap the rows: https://github.com/dguo/strsim-rs/blob/1d92c1d51c6118cd95d7417a6dcbd25abb9c36c0/src/lib.rs#L331
Then again, I feel like this should be possible using a single vector, as long as `table[0][col]` is stored from the previous iteration before it is overwritten. Similar to https://github.com/dguo/strsim-rs/blob/1d92c1d51c6118cd95d7417a6dcbd25abb9c36c0/src/lib.rs#L255-L257
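The single-vector idea could look roughly like the following. This is a hedged sketch, not the library's actual code: the row holds the previous DP row, and `prev` caches the diagonal cell (`table[i-1][col]`) before it is overwritten.

```rust
use std::cmp::max;

// Sketch: LCS length using a single rolling row instead of a 2 x (n+1) table.
// `prev` stores the upper-left (diagonal) cell before row[col + 1] is updated.
fn lcs_length(left: &str, right: &str) -> usize {
    let left: Vec<char> = left.chars().collect();
    let mut row = vec![0usize; left.len() + 1];
    for rletter in right.chars() {
        let mut prev = 0; // table[i-1][0] is always 0
        for (col, &lletter) in left.iter().enumerate() {
            let cur = row[col + 1]; // value of table[i-1][col + 1]
            row[col + 1] = if rletter == lletter {
                prev + 1
            } else {
                max(row[col + 1], row[col])
            };
            prev = cur;
        }
    }
    row[left.len()]
}
```

This keeps the O(min(n, m)) memory bound from the PR description while avoiding both the reallocation and the second row.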
```rust
#[inline]
fn lcs_length(left: impl AsRef<str>, right: impl AsRef<str>) -> usize {
    let (left, right) = get_shorter_longer_strings(left, right);
```
This leads to an extra allocation of two `String`s. You should be able to swap them without this. Then again, I am not sure we even want to swap them, since we have a large focus on binary size.
People who care about performance and not so much about binary size should use https://docs.rs/rapidfuzz/latest/rapidfuzz/distance/lcs_seq/index.html, which is significantly faster.
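One way the reordering could be done on the borrowed slices, so that no `String` is allocated (the helper name `shorter_longer` is hypothetical, not the PR's `get_shorter_longer_strings`):

```rust
// Sketch: return (shorter, longer) as borrowed slices; no allocation,
// and char count (not byte length) decides which string is shorter.
fn shorter_longer<'a>(left: &'a str, right: &'a str) -> (&'a str, &'a str) {
    if left.chars().count() <= right.chars().count() {
        (left, right)
    } else {
        (right, left)
    }
}
```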
```rust
/// assert_eq!(0.8, lcs_normalized("night", "fight"));
/// assert_eq!(1.0, lcs_normalized("ferris", "ferris"));
/// ```
pub fn lcs_normalized(left: impl AsRef<str>, right: impl AsRef<str>) -> f64 {
```
- This should follow the interface of the other functions in the library, which accept `&str`.
- Both the normalized and the non-normalized version should be public.
- Possibly name it `lcs_seq` instead of `lcs`, since LCS can mean both longest common subsequence and longest common substring, which are different metrics with the same abbreviation.
- Probably we would want a generic version of the algorithm, similar to the other metrics.
```rust
/// assert_eq!(1.0, lcs_normalized("ferris", "ferris"));
/// ```
pub fn lcs_normalized(left: impl AsRef<str>, right: impl AsRef<str>) -> f64 {
    let (len1, len2) = (left.as_ref().len(), right.as_ref().len());
```
Using `len` here is incorrect, since we operate on chars and not on bytes. It would need to use `.chars().count()`.
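A quick illustration of why the two counts differ for non-ASCII input (the helper name is illustrative):

```rust
// str::len counts UTF-8 bytes; str::chars().count() counts Unicode
// scalar values. For non-ASCII text these disagree.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}
```

For example, `"héllo"` is 5 chars but 6 bytes, because `'é'` is encoded as 2 bytes in UTF-8.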
```rust
#[inline]
fn lcs_length(left: impl AsRef<str>, right: impl AsRef<str>) -> usize {
    let (left, right) = get_shorter_longer_strings(left, right);
    let mut table = vec![vec![0 as usize; left.len() + 1]; 2];
```
This should use the char count as well.
```rust
if rletter == lletter {
    table[1][col + 1] = 1 + table[0][col];
} else {
    table[1][col + 1] = max(table[0][col + 1], table[1][col]);
}
```
In Rust I would probably use something like:

```rust
table[1][col + 1] = if rletter == lletter {
    1 + table[0][col]
} else {
    max(table[0][col + 1], table[1][col])
};
```

instead.
Thank you @maxbachmann for your time. I'll fix the code soon and resolve the conflicts.
This solution uses the length-finding variant of the LCS algorithm.
Time complexity: O(n * m)
Memory complexity: O(min(n, m))
The solution itself is based on a lightweight library that I created myself some time ago. I just realised that this great library exists and wanted to contribute my solution, as I see that you do not have an LCS-based algorithm in your arsenal.
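Judging from the doctest asserts above (0.8 for "night"/"fight", whose LCS "ight" has length 4 out of 5 chars), the normalization appears to divide the LCS length by the longer string's char count. A self-contained sketch of the whole metric under that assumption, using a single rolling row for the stated O(min(n, m)) memory (names are illustrative, not the PR's exact code):

```rust
use std::cmp::max;

// LCS length over chars with one rolling DP row; `prev` caches the
// diagonal cell before it is overwritten.
fn lcs_length(shorter: &str, longer: &str) -> usize {
    let shorter: Vec<char> = shorter.chars().collect();
    let mut row = vec![0usize; shorter.len() + 1];
    for c in longer.chars() {
        let mut prev = 0;
        for (col, &s) in shorter.iter().enumerate() {
            let cur = row[col + 1];
            row[col + 1] = if c == s {
                prev + 1
            } else {
                max(row[col + 1], row[col])
            };
            prev = cur;
        }
    }
    row[shorter.len()]
}

// Assumed normalization: LCS length / char count of the longer string.
fn lcs_normalized(left: &str, right: &str) -> f64 {
    let (n1, n2) = (left.chars().count(), right.chars().count());
    if max(n1, n2) == 0 {
        return 1.0; // two empty strings are identical
    }
    let (shorter, longer) = if n1 <= n2 { (left, right) } else { (right, left) };
    lcs_length(shorter, longer) as f64 / max(n1, n2) as f64
}
```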