In this second blog post about Julia, I’ll share with you a Julia implementation of the Levenshtein algorithm.
The Levenshtein algorithm
The basic Levenshtein algorithm is used to count the minimum number of insertions, deletions and substitutions that are needed to convert one string into another. For instance, to convert English doubt into French doute, you need at least two operations. You could replace the b by a t and then replace the t by an e; or you could delete the b and then insert the e. As this example shows, there may be more than one way to convert one string into another using the minimum number of required operations, but this minimum number itself is unique for each pair of strings.
Implementation in Julia
I won’t cover the logic of the Levenshtein algorithm here. The following is a straightforward Julia implementation of the pseudocode found on Wikipedia, assuming a cost of 1 for all operations. The function takes two inputs (a string a that is to be converted to a string b) and outputs an array with the Levenshtein distances between all substrings of a on the one hand and all substrings of b on the other hand. The entry in the bottom right corner of this array is the Levenshtein distances between the full strings and this is output separately as well.
Let’s compute the Levenshtein distance between the German word Zyklus (‘cycle’) and its Swedish counterpart cykel. Note the use of ; at the end of the line to suppress the output.
However, since the levenshtein() function outputs two pieces of information (both the matrix with the distances between the substrings as well as the final Levenshtein distance), this vectorised call yields a tuple of three subtuples, each subtuple containing both a matrix and the corresponding final Levenshtein distance. This is why the output above looks so messy. If we wanted to obtain just the Levenshtein distances, we could write a for-loop to extract them. But I think an easier solution is to first write a wrapper around the levenshtein() function that outputs only the final Levenshtein distance and use the vectorised version of this wrapper instead:
We now know that we need four operations to transform Zyklus into cykel and five to transform zuster into Schwester. But which are the operations that you need for these transformations? The function lev_alignment() defined below outputs one possible set of operations that would do the job. (Unlike the minimum number of operations required to transform one string into another, the set of operations needed isn’t uniquely defined.)
functionlev_alignment(a::String, b::String) source =Vector{Char}() target =Vector{Char}() operations =Vector{Char}() lev_matrix =levenshtein(a, b)[1] row =size(lev_matrix, 1) col =size(lev_matrix, 2)while (row >1&& col >1)if lev_matrix[row -1, col -1] == lev_matrix[row, col] && lev_matrix[row -1, col -1] <=min( lev_matrix[row -1, col] , lev_matrix[row, col -1] ) row = row -1 col = col -1pushfirst!(source, a[row])pushfirst!(target, b[col])pushfirst!(operations, ' ')elseif lev_matrix[row -1, col] <=min(lev_matrix[row -1, col -1], lev_matrix[row, col -1]) row = row -1pushfirst!(source, a[row])pushfirst!(target, ' ')pushfirst!(operations, 'D')elseif lev_matrix[row, col -1] <= lev_matrix[row -1, col -1] col = col -1pushfirst!(source, ' ')pushfirst!(target, b[col])pushfirst!(operations, 'I')else row = row -1 col = col -1pushfirst!(source, a[row])pushfirst!(target, b[col])pushfirst!(operations, 'S')endendend# If first column reached, move up.while (row >1) row = row -1pushfirst!(source, a[row])pushfirst!(target, ' ')pushfirst!(operations, 'D')end# If first row reached, move left.while (col >1) col = col -1pushfirst!(source, ' ')pushfirst!(target, b[col])pushfirst!(operations, 'I')endreturnvcat(reshape(source, (1, :)) , reshape(target, (1, :)) , reshape(operations, (1, :)) )end
lev_alignment (generic function with 1 method)
I won’t cover the logic behind the algorithm as this is more about learning Julia that the Levenshtein algorithm. On the Julia side, note first how empty character vectors can be initialised. Moreover, notice that the pushfirst!()
function is decorated with a ! (a ‘bang’). This communicates to whoever is reading the code that this function changes some of its input. For instance, pushfirst!(source, a[row]) means that the current character of a (i.e., a[row]) is added to the front of the source vector. That is, this command changes the source vector. Finally, the source, target and operations vectors are all column vectors. In order to display them somewhat nicely, I converted each of them to a single-row matrix using reshape(). Then, the three resulting rows are concatenated vertically using vcat() to show how the two strings can be aligned and which operations are needed to transform one into the other.
So we substitute c for z, insert an e and delete the u and s. As I mentioned, this set of operations isn’t uniquely defined. Indeed, we could have also substituted c for z, e for l and l for u and then deleted the s. This also corresponds to a Levenshtein distance of four operations.
Normalised Levenshtein distances
Above, we computed raw Levenshtein distances. The problem with these is that longer string pairs will tend to have larger raw Levenshtein distances than shorter string pairs, even if they do seem more similar. To correct for this, we can computed normalised Levenshtein distances instead. There are various ways to compute these; one option is to divide the raw Levenshtein distance by the length of the alignment:
(Behind the scenes, we run the Levenshtein algorithm twice: once in lev_dist() and again in lev_alignment(). This seems wasteful - unless the Julia compiler is able to clean up the double work? I’m not sure.)
We obtain a normalised Levenshtein distance of about 0.57 for Zyklus - cykel:
norm_lev_dist("zyklus", "cykel")
0.5714285714285714
We can use a vectorised version of this function, too:
norm_lev_dist.(dutch, german)
(0.75, 0.5555555555555556, 0.5)
Of course, normalised Levenshtein distances are symmetric, so we obtain the same result when running the following command: