The algorithm is as follows:
- Trim whitespace from each line in both files
- Perform a UNIX diff
- Map unchanged lines
- Analyzing the UNIX diff, let
leftLines
be the lines that are deleted from the left file, andrightLines
the lines that are added to the right file. - Map each
leftLine
to arightLine
if their distance is smaller than a predefined threshold. The distance is a combination of levenshtein distance of the two lines as well as the cosine similarity of the context around each line.
See this research paper for details about the algorithm.
The paper describes the use of simhash to improve performance. This implementation does not perform this optimization because the performance seems "good enough".
The paper also describes an option to detect line spliting. This is not implemented.