Add levenshtein distance. Fix #404 #567

Open
wants to merge 1 commit into
base: master
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -8,6 +8,7 @@ examples_files/
**/examples.rst
_build/
v2.1/docs/engine_files/
build
build/
Pipfile
Pipfile.lock
@@ -0,0 +1,66 @@
------
Syntax
------

**levenshtein (** op1 , op2 **)**

----------------
Input parameters
----------------
.. list-table::

   * - op1, op2
     - the operands


------------------------------------
Examples of valid syntaxes
------------------------------------
.. code-block::

   levenshtein(DS_1, DS_2)
   levenshtein("foo", "bar")

------------------------------------
Semantics for scalar operations
------------------------------------
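The Levenshtein distance is a string metric that measures the difference between two strings: it is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other.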

For example:

| ``levenshtein("foo", "bar")`` gives ``3``
| ``levenshtein("foo", "")`` gives ``3``
| ``levenshtein("foo", "foo")`` gives ``0``
| ``levenshtein("bar", "baz")`` gives ``1``
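As an illustration only, the scalar semantics above can be reproduced with the classic two-row dynamic-programming scheme. This is a minimal Python sketch, not the implementation used by any VTL engine; the function name simply mirrors the operator.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn ``a`` into ``b``."""
    # previous[j] = distance between the prefix of a handled so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free match
            ))
        previous = current
    return previous[-1]


# The four scalar examples listed above.
for left, right in [("foo", "bar"), ("foo", ""), ("foo", "foo"), ("bar", "baz")]:
    print(f'levenshtein("{left}", "{right}") gives {levenshtein(left, right)}')
```

Keeping only two rows of the table bounds the memory use by the length of the second operand.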

-----------------------------
Input parameters type
-----------------------------
op1, op2 ::

    dataset { measure<string> _+ }
    | component<string>
    | string

-----------------------------
Result type
-----------------------------
result ::

    dataset { measure<integer> _+ }
    | component<integer>
    | integer

-----------------------------
Additional Constraints
-----------------------------
Parameters cannot be omitted.

---------
Behaviour
---------

The operator has the behaviour of the “Operators applicable on two Scalar Values or Data Sets or Data Set Components”
(see the section “Typical behaviours of the ML Operators”).
@@ -0,0 +1,5 @@
Id_1,Id_2,Me_1,Me_2
1,A,"hello world","hello"
2,A,"say hello","hello"
3,A,"he","hello"
4,A,"hello!","hello"
@@ -0,0 +1,24 @@
{
"name": "DS_1",
"components": [
{
"name": "Id_1",
"role": "Identifier",
"data_type": "Integer"
},
{
"name": "Id_2",
"role": "Identifier",
"data_type": "String"
},
{
"name": "Me_1",
"role": "Measure",
"data_type": "String"
},
{
"name": "Me_2",
"role": "Measure",
"data_type": "String"
}
]}
@@ -0,0 +1,5 @@
Id_1,Id_2,Me_1,Me_2
1,A,"hi world","hello"
2,A,"say hi","hello"
3,A,"he","hello"
4,A,"hi!","hello"
@@ -0,0 +1,24 @@
{
"name": "DS_2",
"components": [
{
"name": "Id_1",
"role": "Identifier",
"data_type": "Integer"
},
{
"name": "Id_2",
"role": "Identifier",
"data_type": "String"
},
{
"name": "Me_1",
"role": "Measure",
"data_type": "String"
},
{
"name": "Me_2",
"role": "Measure",
"data_type": "String"
}
]}
@@ -0,0 +1,5 @@
Id_1,Id_2,Me_1,Me_2
1,A,4,0
2,A,4,0
3,A,0,0
4,A,4,0
@@ -0,0 +1,24 @@
{
"name": "DS_r",
"components": [
{
"name": "Id_1",
"role": "Identifier",
"data_type": "Integer"
},
{
"name": "Id_2",
"role": "Identifier",
"data_type": "String"
},
{
"name": "Me_1",
"role": "Measure",
"data_type": "Integer"
},
{
"name": "Me_2",
"role": "Measure",
"data_type": "Integer"
}
]}
@@ -0,0 +1 @@
DS_r := levenshtein(DS_1, DS_2);
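The expected result for this example can be cross-checked with a small Python sketch. It assumes the usual behaviour of a two-dataset operator: data points are matched on their identifiers and the operator is applied to measures with the same name. The helper `levenshtein_datasets` and the dictionary layout are hypothetical and only serve the illustration; the distance is written in its recursive form, memoised to keep it short.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def levenshtein(a: str, b: str) -> int:
    """Recursive definition of the edit distance (memoised)."""
    if not a or not b:
        return len(a) + len(b)            # only insertions or deletions remain
    if a[0] == b[0]:
        return levenshtein(a[1:], b[1:])  # first characters match, no cost
    return 1 + min(
        levenshtein(a[1:], b),      # delete a[0]
        levenshtein(a, b[1:]),      # insert b[0]
        levenshtein(a[1:], b[1:]),  # substitute a[0] with b[0]
    )


# Data points keyed by their identifiers (Id_1, Id_2); values are the measures.
DS_1 = {
    (1, "A"): {"Me_1": "hello world", "Me_2": "hello"},
    (2, "A"): {"Me_1": "say hello", "Me_2": "hello"},
    (3, "A"): {"Me_1": "he", "Me_2": "hello"},
    (4, "A"): {"Me_1": "hello!", "Me_2": "hello"},
}
DS_2 = {
    (1, "A"): {"Me_1": "hi world", "Me_2": "hello"},
    (2, "A"): {"Me_1": "say hi", "Me_2": "hello"},
    (3, "A"): {"Me_1": "he", "Me_2": "hello"},
    (4, "A"): {"Me_1": "hi!", "Me_2": "hello"},
}


def levenshtein_datasets(left, right):
    """Hypothetical helper: compare same-named measures of matching data points."""
    return {
        key: {name: levenshtein(left[key][name], right[key][name])
              for name in left[key]}
        for key in sorted(left.keys() & right.keys())
    }


DS_r = levenshtein_datasets(DS_1, DS_2)
for key, measures in DS_r.items():
    print(key, measures)
```

Under these assumptions the computed measures agree with the expected result file for this example (4, 4, 0, 4 for Me_1 and 0 for every Me_2).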
@@ -0,0 +1,5 @@
Id_1,Id_2,Me_1,Me_2,delta
1,A,"hello world","hello",6
2,A,"say hello","hello",4
3,A,"he","hello",3
4,A,"hello!","hello",1
@@ -0,0 +1,29 @@
{
"name": "DS_r",
"components": [
{
"name": "Id_1",
"role": "Identifier",
"data_type": "Integer"
},
{
"name": "Id_2",
"role": "Identifier",
"data_type": "String"
},
{
"name": "Me_1",
"role": "Measure",
"data_type": "String"
},
{
"name": "Me_2",
"role": "Measure",
"data_type": "String"
},
{
"name": "delta",
"role": "Measure",
"data_type": "Integer"
}
]}
@@ -0,0 +1 @@
DS_r := DS_1 [calc delta := levenshtein (Me_1, Me_2)];
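For the component-level invocation, the following Python sketch recomputes the `delta` column from the DS_1 data shown earlier in this diff. It is only a rough model of `calc`: the CSV is parsed with the standard `csv` module, identifiers are left as strings, and the full-matrix distance function is an illustrative stand-in rather than an engine implementation.

```python
import csv
import io


def levenshtein(a: str, b: str) -> int:
    """Full-matrix edit distance: dist[i][j] is the cost of turning a[:i] into b[:j]."""
    dist = [[i + j if i == 0 or j == 0 else 0 for j in range(len(b) + 1)]
            for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dist[i][j] = min(
                dist[i - 1][j] + 1,                            # deletion
                dist[i][j - 1] + 1,                            # insertion
                dist[i - 1][j - 1] + (a[i - 1] != b[j - 1]),   # substitution or match
            )
    return dist[-1][-1]


# Contents of the DS_1 data file from this pull request.
DS_1_CSV = '''Id_1,Id_2,Me_1,Me_2
1,A,"hello world","hello"
2,A,"say hello","hello"
3,A,"he","hello"
4,A,"hello!","hello"
'''

rows = list(csv.DictReader(io.StringIO(DS_1_CSV)))
for row in rows:
    # Rough equivalent of: calc delta := levenshtein(Me_1, Me_2)
    row["delta"] = levenshtein(row["Me_1"], row["Me_2"])

deltas = [row["delta"] for row in rows]
print(deltas)
```

The resulting values, 6, 4, 3 and 1, match the `delta` column of the expected result for this example.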
@@ -0,0 +1 @@
Given the operand datasets DS_1 and DS_2:
@@ -0,0 +1,5 @@
=============================================================================================
String distances: `levenshtein`
=============================================================================================
.. include:: ./content.rst
.. include:: ./examples.rst
@@ -12,3 +12,4 @@ VTL-ML - String Operators
String pattern replacement/index
String pattern location/index
String length/index
String distances/index
3 changes: 3 additions & 0 deletions v2.1/src/main/antlr4/org/sdmx/vtl/Vtl.g4
@@ -173,15 +173,18 @@ stringOperators:
| SUBSTR LPAREN expr (((COMMA startParameter=optionalExpr) (COMMA endParameter=optionalExpr))? | COMMA startParameter=optionalExpr ) RPAREN # substrAtom
| REPLACE LPAREN expr COMMA param=expr ( COMMA optionalExpr)? RPAREN # replaceAtom
| INSTR LPAREN expr COMMA pattern=expr ( COMMA startParameter=optionalExpr)? (COMMA occurrenceParameter=optionalExpr)? RPAREN # instrAtom
| LEVENSHTEIN LPAREN left=expr COMMA right=expr RPAREN # levenshteinAtom
;

stringOperatorsComponent:
op=(TRIM | LTRIM | RTRIM | UCASE | LCASE | LEN) LPAREN exprComponent RPAREN # unaryStringFunctionComponent
| SUBSTR LPAREN exprComponent (((COMMA startParameter=optionalExprComponent) (COMMA endParameter=optionalExprComponent))? | COMMA startParameter=optionalExprComponent ) RPAREN # substrAtomComponent
| REPLACE LPAREN exprComponent COMMA param=exprComponent ( COMMA optionalExprComponent)? RPAREN # replaceAtomComponent
| INSTR LPAREN exprComponent COMMA pattern=exprComponent ( COMMA startParameter=optionalExprComponent)? (COMMA occurrenceParameter=optionalExprComponent)? RPAREN # instrAtomComponent
| LEVENSHTEIN LPAREN leftComponent=exprComponent COMMA rightComponent=exprComponent RPAREN # levenshteinAtomComponent
;


numericOperators:
op=(CEIL | FLOOR | ABS | EXP | LN | SQRT) LPAREN expr RPAREN # unaryNumeric
| op=(ROUND | TRUNC) LPAREN expr (COMMA optionalExpr)? RPAREN # unaryWithOptionalNumeric
1 change: 1 addition & 0 deletions v2.1/src/main/antlr4/org/sdmx/vtl/VtlTokens.g4
@@ -140,6 +140,7 @@ lexer grammar VtlTokens;
RTRIM : 'rtrim';
INSTR : 'instr';
REPLACE : 'replace';
LEVENSHTEIN : 'levenshtein';
CEIL : 'ceil';
FLOOR : 'floor';
SQRT : 'sqrt';
6 changes: 6 additions & 0 deletions v2.1/src/test/resources/NegativeTests.vtl
@@ -273,6 +273,12 @@ length()
//more than one operands used
length(DS_1, "hi")

//no operands used
levenshtein()

//second operand missing
levenshtein(DS_1)

//second operand missing
DS_r := DS_1 +

4 changes: 4 additions & 0 deletions v2.1/src/test/resources/PositiveTests.vtl
@@ -246,6 +246,10 @@ DS_r := DS_2 [calc Me_10:= length(Me_1), Me_20:=length(Me_2)];

DS_r := length(DS_2);

DS_r := levenshtein(DS_1, "test");

DS_r := levenshtein(DS_1, DS_2);

DS_r := + DS_1;

DS_r := DS_1 [calc Me_3 := + Me_1 ];