How would you implement Google spelling correction algorithms?
Anónimo
Building the database: Use Google’s bot as raw data - detect each page’s language and build a dictionary by language. for each language, Each word is scored according to the number of instances of this word in all pages together. a word that appears in less than N different pages gets a score of 0. The dictionary structure key: word value: linked list of (url, show count) We can save the dictionary per domain and do a map-reduce to get aggregated results. Then, when rescanning a domain, we can easily clear results for this domain and start-over. Pre-processing: Do a map-reduce to convert the dictionary to string->score Spell check Request if a word is in the dictionary - return OK Else try to replace each letter with similar letters and get scores. if not enough options, try to replace each letter with keyboard-close letters and get scores. if not enough options, try to replace each letter other letters and get scores. sort all alternative by scores and return results. *CACHE* the query and result with expiration date. Optimization: When possible, send the selected suggestion back to google. save all selections with score by number of selections: wrong-word -> sorted list of (# of selections, correct word). if a word is found in this dictionary, return top results from there.