Architecture

The algorithm is as follows:

1. Trim whitespace from each line in both files.
2. Perform a UNIX diff.
3. Map unchanged lines.
4. From the diff output, let leftLines be the lines deleted from the left file and rightLines the lines added to the right file.
5. Map each leftLine to a rightLine if their distance is smaller than a predefined threshold. The distance combines the Levenshtein distance between the two lines with the cosine similarity of the context around each line.

See this research paper for details about the algorithm.

Differences from the research paper

The paper describes the use of simhash to improve performance. This implementation does not perform this optimization because the performance seems "good enough".
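
For readers unfamiliar with simhash, the sketch below shows the general technique: each token votes on every bit of a fixed-width fingerprint, and the Hamming distance between two fingerprints serves as a cheap proxy for how similar the token sets are. It is a generic illustration, not the paper's exact procedure and not part of this implementation; the function names, the use of MD5 as the token hash, and the 64-bit width are assumptions.

```python
import hashlib


def simhash(tokens, bits=64):
    """Compute a simhash fingerprint for an iterable of string tokens.

    bits must be a multiple of 8 and at most 128, because the token
    hash used here (MD5, an arbitrary choice) yields 16 bytes.
    """
    counts = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for bit in range(bits):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    # A bit is set in the fingerprint when its votes are net positive.
    return sum(1 << bit for bit in range(bits) if counts[bit] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Comparing two fingerprints costs a single XOR and a bit count, which is why the technique is attractive as a pre-filter before more expensive measures such as Levenshtein distance.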

The paper also describes an option to detect line splitting. This is not implemented.
