Title: Mining the Web to Determine Similarity Between Short Text Fragments
Abstract: A common task underlying many information retrieval systems involves determining the similarity of very short text fragments. For example, in a search engine we may want to determine if two user queries are similar to each other (perhaps to aid users in formulating better queries) or identify phrases that may be topically related to each other (e.g., “NASA” and “space exploration”). For such a task, traditional methods, such as counting the term overlap between the short text fragments or employing the standard cosine similarity measure, often fail because the short texts may share no terms in common. We address this problem by introducing a novel method for measuring the similarity between short pieces of text (even those without any overlapping terms) by leveraging web search results to provide greater context for the short texts. We start by discussing the general problem, then define our new similarity kernel function, mathematically analyze some of its properties, and provide examples of its efficacy. We also show the use of this function in a large-scale system for suggesting related queries to search engine users.
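As a rough illustration of the failure mode the abstract describes, the sketch below computes plain bag-of-words cosine similarity for “NASA” and “space exploration”, first on the raw fragments and then on longer context strings. It is not the kernel function presented in the talk: the `cosine` helper is a hypothetical name, and the `context` strings are made-up stand-ins for text that would come from web search results.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words term-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


# Assumption: invented placeholder text, standing in for search-result context.
context = {
    "NASA": "nasa space agency missions space exploration rockets",
    "space exploration": "space exploration missions nasa rockets astronauts",
}

# The raw fragments share no terms, so their similarity is zero.
direct = cosine("NASA", "space exploration")

# The expanded contexts share several terms, so their similarity is positive.
expanded = cosine(context["NASA"], context["space exploration"])

print(f"direct={direct:.3f} expanded={expanded:.3f}")
```

With these placeholder strings the raw fragments score zero while the expanded contexts score well above zero, which is the intuition behind using search results to supply context for short texts.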
Bio: Mehran Sahami is a Professor and Associate Chair for Education in the Computer Science department at Stanford University. Prior to joining the Stanford faculty, he was a Senior Research Scientist at Google for several years. His research interests include computer science education, artificial intelligence, and web search. He has published over 40 technical papers and has over 20 patent filings, but he still hasn’t figured out how to get his kids to brush their teeth at bedtime without a fuss.