|
| 1 | +# Akin Examples |
| 2 | + |
| 3 | +## Akin |
| 4 | + |
| 5 | +Akin is a collection of string comparison algorithms for Elixir. Algorithms can be called independently or combined to return a map of metrics. This library was built to facilitate the disambiguation of names but can be used to compare any two binaries. |
| 6 | + |
| 7 | +## Algorithms |
| 8 | + |
| 9 | +A utility function is provided to return all available algorithms. |
| 10 | + |
| 11 | +```elixir |
| 12 | +Akin.Util.list_algorithms() |
| 13 | +``` |
| 14 | + |
| 15 | +**Note**: Hamming Distance is excluded as it only compares strings of equal length. To use the Hamming Distance algorithm, call it directly (see: [Independent Algorithms](#independent-algorithms)). |
| 16 | + |
| 17 | +## Combined Algorithms |
| 18 | + |
| 19 | +### Metrics |
| 20 | + |
| 21 | +Results from all algorithms are returned as a map of metrics. |
| 22 | + |
| 23 | +<!-- livebook:{"break_markdown":true} --> |
| 24 | + |
| 25 | +#### Compare Strings |
| 26 | + |
| 27 | +Experiment by changing the value of the strings. |
| 28 | + |
| 29 | +```elixir |
| 30 | +a = "weird" |
| 31 | +b = "wierd" |
| 32 | + |
| 33 | +Akin.compare(a, b) |
| 34 | +``` |
| 35 | + |
| 36 | +### Options |
| 37 | + |
| 38 | +Comparison accepts options in a Keyword list. |
| 39 | + |
| 40 | +1. `algorithms`: algorithms to use in comparison. Accepts a list of algorithm names or a keyword list. Default is `algorithms/0`. |
| 41 | + 1. `metric`: algorithm metric. Default is both. |
| 42 | + * "string": uses string algorithms |
| 43 | + * "phonetic": uses phonetic algorithms |
| 44 | + 2. `unit`: algorithm unit. Default is both. |
| 45 | + * "whole": uses algorithms best suited for whole string comparison (distance) |
| 46 | + * "partial": uses algorithms best suited for partial string comparison (substring) |
| 47 | +2. `level`: level for double phonetic matching. Default is "normal". |
| 48 | + * "strict": both encodings for each string must match |
| 49 | + * "strong": the primary encoding for each string must match |
| 50 | + * "normal": the primary encoding of one string must match either encoding of other string (default) |
| 51 | + * "weak": either primary or secondary encoding of one string must match one encoding of other string |
| 52 | +3. `match_at`: an algorithm score equal to or above this value is considered a match. Default is 0.9. |
| 53 | +4. `ngram_size`: number of contiguous letters to split strings into. Default is 2. |
| 54 | +5. `short_length`: length at which a string qualifies as "short" to receive a shortness boost. Used by Name Metric. Default is 8. |
| 55 | +6. `stem`: boolean representing whether to compare the stemmed version of the strings; uses Stemmer. Default is `false`. |
| 56 | + |
| 57 | +```elixir |
| 58 | +opts = [algorithms: ["bag_distance", "jaccard", "jaro_winkler"]] |
| 59 | +Akin.compare(a, b, opts) |
| 60 | +``` |
| 61 | + |
| 62 | +```elixir |
| 63 | +opts = [algorithms: [metric: "phonetic", unit: "whole"]] |
| 64 | +Akin.compare(a, b, opts) |
| 65 | +``` |
| 66 | + |
| 67 | +```elixir |
| 68 | +Akin.compare(a, b, algorithms: [metric: "string", unit: "whole"], ngram_size: 1) |
| 69 | +``` |
| 70 | + |
| 71 | +#### n-gram Size |
| 72 | + |
| 73 | +The default n-gram size for the algorithms is 2. You can change it by setting |
| 74 | +`ngram_size` in opts. |
| 75 | + |
| 76 | +```elixir |
| 77 | +opts = [algorithms: ["sorensen_dice"]] |
| 78 | +Akin.compare(a, b, opts) |
| 79 | +``` |
| 80 | + |
| 81 | +```elixir |
| 82 | +opts = [algorithms: ["sorensen_dice"], ngram_size: 1] |
| 83 | +Akin.compare(a, b, opts) |
| 84 | +``` |
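
To see what `ngram_size` changes, the sketch below splits a string into contiguous n-grams in plain Elixir. `NgramSketch` is a hypothetical module written for this illustration, not part of Akin, and Akin's own tokenization may differ in detail.

```elixir
defmodule NgramSketch do
  # Hypothetical helper, not part of Akin: splits a string into
  # overlapping n-grams of `size` graphemes.
  def ngrams(string, size) when is_binary(string) and is_integer(size) and size > 0 do
    string
    |> String.graphemes()
    |> Enum.chunk_every(size, 1, :discard)
    |> Enum.map(&Enum.join/1)
  end
end

NgramSketch.ngrams("weird", 2)
# => ["we", "ei", "ir", "rd"]
NgramSketch.ngrams("wierd", 2)
# => ["wi", "ie", "er", "rd"]
NgramSketch.ngrams("weird", 1)
# => ["w", "e", "i", "r", "d"]
```

With a size of 2 the two spellings share only "rd", while with a size of 1 they share every letter. A set-based score such as Sorensen-Dice would therefore be expected to rise as the n-gram size shrinks.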
| 85 | + |
| 86 | +#### Match Level |
| 87 | + |
| 88 | +The default match strictness is "normal". You can change it by setting |
| 89 | +`level` in opts. Currently it only affects the outcomes of the `substring_set` and |
| 90 | +`double_metaphone` algorithms. |
| 91 | + |
| 92 | +```elixir |
| 93 | +left = "Alice in Wonderland" |
| 94 | +right = "Alice's Adventures in Wonderland" |
| 95 | + |
| 96 | +Akin.compare(left, right, algorithms: ["substring_set"]) |
| 97 | +``` |
| 98 | + |
| 99 | +```elixir |
| 100 | +Akin.compare(left, right, algorithms: ["substring_set"], level: "weak") |
| 101 | +``` |
| 102 | + |
| 103 | +```elixir |
| 104 | +left = "which way" |
| 105 | +right = "whitch way" |
| 106 | + |
| 107 | +Akin.compare(left, right, algorithms: ["double_metaphone"], level: "weak") |
| 108 | +``` |
| 109 | + |
| 110 | +```elixir |
| 111 | +Akin.compare(left, right, algorithms: ["double_metaphone"], level: "strict") |
| 112 | +``` |
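
The four `level` values listed under Options can be read as rules over a pair of phonetic encodings per string. The sketch below is one possible reading of those descriptions. `LevelSketch`, its `matches?/3` function, and the encodings are all hypothetical and are not taken from Akin's implementation.

```elixir
defmodule LevelSketch do
  # Hypothetical helper, not part of Akin. Each argument is a
  # {primary, secondary} pair of phonetic encodings.

  # "strict": both encodings of each string must match.
  def matches?({p1, s1}, {p2, s2}, "strict"), do: p1 == p2 and s1 == s2

  # "strong": the primary encodings must match.
  def matches?({p1, _s1}, {p2, _s2}, "strong"), do: p1 == p2

  # "normal": the primary encoding of one string matches either
  # encoding of the other string.
  def matches?({p1, s1}, {p2, s2}, "normal"), do: p1 in [p2, s2] or p2 in [p1, s1]

  # "weak": either encoding of one string matches either encoding
  # of the other string.
  def matches?({p1, s1}, {p2, s2}, "weak"), do: p1 in [p2, s2] or s1 in [p2, s2]
end

# Made-up encodings chosen only to exercise each level.
enc_left = {"AK", "AS"}
enc_right = {"AS", "AK"}

LevelSketch.matches?(enc_left, enc_right, "strict")
# => false
LevelSketch.matches?(enc_left, enc_right, "strong")
# => false
LevelSketch.matches?(enc_left, enc_right, "normal")
# => true
LevelSketch.matches?(enc_left, enc_right, "weak")
# => true
```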
| 113 | + |
| 114 | +#### Stems |
| 115 | + |
| 116 | +Compare the stemmed version of two strings. |
| 117 | + |
| 118 | +```elixir |
| 119 | +not_gerund = "write" |
| 120 | +gerund = "writing" |
| 121 | + |
| 122 | +Akin.compare(not_gerund, gerund, algorithms: ["bag_distance", "double_metaphone"]) |
| 123 | +``` |
| 124 | + |
| 125 | +```elixir |
| 126 | +Akin.compare(not_gerund, gerund, algorithms: ["bag_distance", "double_metaphone"], stem: true) |
| 127 | +``` |
| 128 | + |
| 129 | +### Preprocessing |
| 130 | + |
| 131 | +Before being compared, strings are downcased and Unicode-normalized, whitespace is standardized, non-text characters (like punctuation and emojis) are replaced, and accents are converted. The string is then composed into a struct representing the corpus of data used by the comparison algorithms. |
| 132 | + |
| 133 | +```elixir |
| 134 | +name = "Alice Liddell" |
| 135 | + |
| 136 | +Akin.Util.compose(name) |
| 137 | +``` |
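
The preprocessing steps described above can be approximated with the Elixir standard library alone. This is a rough sketch, not Akin's code: `PreprocessSketch` is a hypothetical module, and removing non-text characters outright (rather than replacing them with something else) is an assumption.

```elixir
defmodule PreprocessSketch do
  # Hypothetical helper, not part of Akin; it approximates the
  # described steps using only the Elixir standard library.
  def normalize(string) when is_binary(string) do
    string
    |> String.downcase()
    # Decompose accented letters into a base letter plus combining marks.
    |> String.normalize(:nfd)
    # Drop the combining marks, leaving the unaccented base letters.
    |> String.replace(~r/\p{Mn}/u, "")
    # Remove anything that is not a letter, digit, or whitespace (assumed behavior).
    |> String.replace(~r/[^\p{L}\p{N}\s]/u, "")
    # Collapse runs of whitespace and trim both ends.
    |> String.split()
    |> Enum.join(" ")
  end
end

PreprocessSketch.normalize("  Crème   Brûlée! ")
# => "creme brulee"
```

Letters without a canonical Unicode decomposition, such as "Ł", pass through this sketch unchanged, so a fuller implementation would need an explicit mapping for them.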
| 138 | + |
| 139 | +### Accents |
| 140 | + |
| 141 | +```elixir |
| 142 | +name_a = "Hubert Łępicki" |
| 143 | + |
| 144 | +Akin.Util.compose(name_a) |
| 145 | +``` |
| 146 | + |
| 147 | +```elixir |
| 148 | +name_b = "Hubert Lepicki" |
| 149 | + |
| 150 | +Akin.compare(name_a, name_b) |
| 151 | +``` |
| 152 | + |
| 153 | +### Phonemes |
| 154 | + |
| 155 | +```elixir |
| 156 | +Akin.phonemes(name) |
| 157 | +``` |
| 158 | + |
| 159 | +```elixir |
| 160 | +Akin.phonemes("wonderland") |
| 161 | +``` |
| 162 | + |
| 163 | +## Independent Algorithms |
| 164 | + |
| 165 | +Each algorithm can be called directly. Module names are CamelCase versions of the snake_case algorithm names returned by `list_algorithms/0`. |
| 166 | + |
| 167 | +```elixir |
| 168 | +a = Akin.Util.compose("weird") |
| 169 | +b = Akin.Util.compose("wierd") |
| 170 | +Akin.BagDistance.compare(a, b) |
| 171 | +``` |
| 172 | + |
| 173 | +Hamming Distance is excluded from `list_algorithms/0` and the combined algorithm metrics as it only compares strings of equal length. To use the Hamming Distance algorithm, call it directly. |
| 174 | + |
| 175 | +```elixir |
| 176 | +Akin.Hamming.compare("weird", "wierd") |
| 177 | +``` |
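
To illustrate why Hamming Distance needs strings of equal length, here is a minimal standalone sketch of the measure. `HammingSketch` is a hypothetical module, not Akin's implementation, and the return shape is an assumption; `Akin.Hamming.compare/2` may report its result differently.

```elixir
defmodule HammingSketch do
  # Hypothetical helper, not part of Akin: counts the positions at
  # which two equal-length strings differ.
  def distance(left, right) when is_binary(left) and is_binary(right) do
    a = String.graphemes(left)
    b = String.graphemes(right)

    if length(a) == length(b) do
      count = a |> Enum.zip(b) |> Enum.count(fn {x, y} -> x != y end)
      {:ok, count}
    else
      # Positions cannot be paired up, so the distance is undefined.
      {:error, :unequal_length}
    end
  end
end

HammingSketch.distance("weird", "wierd")
# => {:ok, 2}
HammingSketch.distance("weird", "weirdo")
# => {:error, :unequal_length}
```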