Assume that we will be passed two Strings that represent the words we’d like to find. Assume “distance” means the number of characters between occurrences – that is, line breaks will be ignored. Assume the document is represented by a long string. We need to find the two occurrences of these words that are closest together.
Simple solution:
- Walk through the string and find each occurrence of word A. Store the middle index of this occurrence
- Walk through again and find each occurrence of word B. Store the middle index of this occurrence.
- Compare the arrays of indicies. Find the smallest difference between each middle position of A and each middle position of B.
- Running time = O(n + ab) where a and b are the number of occurrences of the input strings A and B.
Walk through the document, character by character, looking to match word. If we find such a match, put the index of the middle character into the array. Continue on our walk. Repeat for the second word. Now we have two arrays of integers. Our new task is to find the smallest difference between the elements in each array. Use the mergesort merge method, keeping track of the smallest difference. Return this distance.
Optimization:
Instead of walking one character at a time, when we find a mismatch, skip ahead by an intelligent number of characters – that is, to the next index at which we might find a good match. We can use the Boyer-Moore algorithm for this approach.
Further optimization:
Look for both words in one pass by using Boyer-Moore tables that account for both words. As we walk, keep track of the current shortest distance. Running time is O(n).
Does anyone have a simpler or clearer answer?