Monday, December 10, 2007

Boobs, Booty, and Booze: Adventures In Edit Distance

This week I got to thinking about word puzzles like this:


Fill in the four intermediate words. Each word is four letters long and differs by exactly one letter from the word above it and the word below it.

This is an example of a minimum edit distance problem. “Spry” is five steps away from “moat” when each step consists of changing a single letter.
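The single-step relation is easy to test directly: two words of the same length are one step apart when they differ in exactly one position. A minimal sketch (the function name is my own):

```python
def one_letter_apart(a, b):
    """True when equal-length words a and b differ in exactly one position."""
    if len(a) != len(b):
        return False
    return sum(1 for x, y in zip(a, b) if x != y) == 1
```

For example, `one_letter_apart("moat", "coat")` is true, while "spry" and "moat" differ everywhere and fail the test.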

How can puzzles like this be created? How can a puzzle's difficulty be evaluated? What does the space of possible puzzles look like?

To answer these questions, I picked up the 12dicts package, a collection of word lists compiled from twelve dictionaries. I chose the 2of12inf list, which contains every word listed in at least two of the twelve dictionaries, plus all of that word's inflections.

After filtering the list to words of a specific length, the next step is to build an adjacency graph. I tried brute force first: a doubly-nested loop over all the words, testing each pair to see if they were adjacent. This takes a very long time. As an example, with the 2,546 four-letter words there are 3,242,331 pairs of words to consider.
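The brute-force version might look like this sketch (my own framing of the doubly-nested loop, not the original code):

```python
from itertools import combinations

def build_graph_brute_force(words):
    """Test every pair of equal-length words: O(n^2) comparisons."""
    adjacent = {word: set() for word in words}
    for w1, w2 in combinations(words, 2):
        # Adjacent means differing in exactly one letter position.
        if sum(a != b for a, b in zip(w1, w2)) == 1:
            adjacent[w1].add(w2)
            adjacent[w2].add(w1)
    return adjacent
```

Every word pair is examined whether or not the words have anything in common, which is what makes this approach so slow on a few thousand words.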

Because the graph is pretty sparse there are much faster ways to build it. One way is to consider each letter position in turn, sorting the words into groups in which all the other letters are the same. All the words in a given group are neighbors of each other. Here's an implementation in Python:

adjacent_words = {}

for i in range(word_len):
    # Group words that are identical except (possibly) at position i.
    grouped_words = {}
    for word in words:
        group = word[:i] + word[i+1:]
        grouped_words.setdefault(group, []).append(word)

    # Any two distinct words in a group differ only at position i.
    for group_words in grouped_words.values():
        for word1 in group_words:
            for word2 in group_words:
                if word1 != word2:
                    adjacent_words.setdefault(word1, set()).add(word2)

Once we have the graph structure in hand we can begin analyzing it in various ways. The simplest is to identify words that have no adjacencies. For instance, “sexy” and “odor” stand alone. These words can't be used in puzzles.
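Because the grouping code above only creates an entry in `adjacent_words` when a word gains a neighbor, the loners are simply the words with no entry. A small sketch, assuming `words` and `adjacent_words` as built in the earlier listing:

```python
def find_loners(words, adjacent_words):
    """Words with no neighbors never acquire an entry in the adjacency map."""
    return [word for word in words if word not in adjacent_words]
```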

Finding these “loner” words is a special case of identifying the connected components of the graph: the groups of words that are interconnected. A puzzle needs to have both of its endpoints in the same connected component or it won't be solvable. The basic algorithm for identifying connected components is to flood-fill from every word that hasn't already been visited, keeping track of all the words collected on each fill.
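The flood-fill can be sketched as a breadth-first search from each unvisited word (a minimal version, assuming the `adjacent_words` map from the listing above):

```python
from collections import deque

def connected_components(words, adjacent_words):
    """Flood-fill (breadth-first) from each unvisited word."""
    visited = set()
    components = []
    for start in words:
        if start in visited:
            continue
        component = {start}
        visited.add(start)
        queue = deque([start])
        while queue:
            word = queue.popleft()
            for neighbor in adjacent_words.get(word, ()):
                if neighbor not in visited:
                    visited.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components
```

Loners show up naturally here as components of size one.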

It turns out that, for lengths of three to six letters, most words belong to a single large connected component. As words get longer, they become less and less interconnected and the number of components increases rapidly.


In the table above, Adjacencies is the number of edges in the graph. Loners counts connected components containing only a single word. Groups counts components with two or more words. Biggest is the size of the largest group.

The final column, Bridges, counts edges which, if removed, break a component into two separate pieces. These represent the word transformations which you might expect to use frequently, since they are the only way to get from words on one side to words on the other side.

As it happens, though, words tend to be very well connected, and so there are never more than 20 or so words on one side of a bridge. Most of the time there is only one word on the other side of the bridge, i.e. the word is a dead-end with only one neighbor. An example of a dead-end is the word SPRY in the puzzle above. There is only one possible word to go above it in the puzzle (at least in my dictionary).

The diagram below shows several bridges in the five-letter word graph.

The edge between boozy and booze represents the most common sort of bridge, dividing one word from the rest. In comparison, cutting the edge between booby and bobby isolates nine words. Other bridges are the edge from hobby to hubby, and from tubby to tabby.

The algorithm for finding bridges took me a while to understand. It's hinted at in the exercises of Introduction to Algorithms (Cormen, Leiserson, Rivest) and is built on a depth-first traversal of the graph. I found pseudocode for it, but it happened to be incorrect.

The gist of it is that, if you do a depth-first traversal, any edges you find that point at already-visited nodes are back edges: they close a loop, meaning that none of the edges on that loop can be bridges. Related to this, all bridge edges must be edges of the tree produced by the depth-first traversal. During the recursive descent of the depth-first traversal you can compute the highest (closest to root) position that each subtree points to. If a subtree does not point to a node outside of itself, then the edge to that subtree is a bridge.
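The idea above can be sketched as a low-link computation over the DFS tree (this is the standard bridge-finding algorithm; a recursive version like this one would hit Python's recursion limit on a really large component, so treat it as an illustration):

```python
def find_bridges(adjacent_words):
    """Find bridge edges via depth-first search with low-link values."""
    visit_order = {}   # DFS discovery time of each word
    low = {}           # earliest discovery time reachable from the subtree
    bridges = []
    counter = 0

    def dfs(word, parent):
        nonlocal counter
        visit_order[word] = low[word] = counter
        counter += 1
        for neighbor in adjacent_words.get(word, ()):
            if neighbor == parent:
                continue
            if neighbor in visit_order:
                # Back edge: this subtree can reach an earlier node.
                low[word] = min(low[word], visit_order[neighbor])
            else:
                dfs(neighbor, word)
                low[word] = min(low[word], low[neighbor])
                if low[neighbor] > visit_order[word]:
                    # The subtree under neighbor can't reach above it,
                    # so this tree edge is a bridge.
                    bridges.append((word, neighbor))

    for word in adjacent_words:
        if word not in visit_order:
            dfs(word, None)
    return bridges
```

A tree edge is a bridge exactly when the subtree below it has no back edge climbing past it, which is what the `low[neighbor] > visit_order[word]` test checks.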

Another measure of good words to know is the number of connections. For example, here are the top few words of length four, along with how many neighboring words each has:
ware: 25
pats: 24
wine: 24
bale: 23
bare: 23
care: 23
core: 23
mare: 23
mate: 23
paps: 23
tars: 23
bars: 22
caps: 22
cars: 22
done: 22
line: 22
lore: 22
male: 22
mats: 22
mine: 22
pale: 22
pare: 22
pits: 22
tons: 22
tots: 22

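Ranking words this way is a simple sort over the adjacency map (a sketch, using the `adjacent_words` structure from the earlier listing):

```python
def most_connected(adjacent_words, n=10):
    """Return the n words with the most neighbors, most-connected first."""
    counts = [(word, len(neighbors))
              for word, neighbors in adjacent_words.items()]
    counts.sort(key=lambda pair: pair[1], reverse=True)
    return counts[:n]
```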
I'm out of time now, and haven't gotten to the questions of generating puzzles or evaluating their difficulty. Difficulty might be a function of how many choices you have at each step of the puzzle, combined with the relative frequency of each of the words in the English language (and hence, how likely you are to know them). Nevertheless, I feel I understand the word space a lot better, and have managed to generate some interesting puzzles along the way. I find that five steps is about the maximum that I can reasonably do. There are minimum paths that get up to 17 and higher steps, though.


mike bayer said...

Someone gave me this entire problem as an interview question once, never having seen it before. Needless to say, I didn't get the job.

Here's one that I adapted from a C version in the "Programming Challenges" book. It doesn't do any graph theory, it just uses a matrix of letters, but that's because it isn't limited to real words, so it's not quite the same "edit distance" problem.

For this one, it comes up with: