Pinterest is an online visual discovery tool that helps people all over the world discover, collect, and share what they love. Users “pin” images, video, and other media from the Internet or from their own uploads. A collection of pins on a theme forms a “pinboard,” the basis for planning a trip, sharing one’s passion, or organizing a wish list or event.
Pinterest’s global users expect to find what they are looking for, no matter the language. Linguistically intelligent text processing is particularly important for Chinese, Japanese, and Korean (CJK): Chinese and Japanese are written without spaces between words, and the spaces in Korean text separate phrases rather than individual words. Through Rosette Base Linguistics, Pinterest expands CJK searches for more accurate, comprehensive results.
Non-linguistic methods, such as n-gram indexing (dicing text into overlapping sequences of n characters), allow indexing and searching, but they bloat the index, slow performance, and increase false positives. Consider a Japanese search for 東京都美術館 (“Tokyo Metropolitan Art Museum”). Morphological analysis yields 東京都 (“Tokyo”) and 美術館 (“art museum”). Bigramming yields 東京 (“Tokyo”), 京都 (“Kyoto”), 都美 (not a word), 美術 (“art”), and 術館 (not a word). Because the bigram index contains 京都, a search for Kyoto would match this Tokyo museum, and the reverse is also true. When seeking images of the Tokyo Metropolitan Art Museum, Rosette ensures that art museums in Kyoto won’t be mixed in!
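The bigram example above can be reproduced in a few lines of Python. This is only a minimal sketch of character n-gram tokenization, not Rosette's or Pinterest's implementation: the function name `char_ngrams` is hypothetical, and the morphological tokens are hard-coded from the segmentation quoted above rather than produced by any analyzer.

```python
def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return every overlapping run of n characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


title = "東京都美術館"  # "Tokyo Metropolitan Art Museum"

# Tokens a bigram-based index would store for this title.
bigram_tokens = char_ngrams(title, 2)

# Assumption: copied by hand from the morphological segmentation in the text;
# no linguistic analyzer is called here.
morph_tokens = ["東京都", "美術館"]

query = "京都"  # "Kyoto"

# A query matches when it equals one of the indexed tokens.
bigram_hit = query in bigram_tokens
morph_hit = query in morph_tokens

print(bigram_tokens)
print("bigram index matches Kyoto:", bigram_hit)
print("morphological index matches Kyoto:", morph_hit)
```

With the bigram tokens, the Kyoto query finds a match inside the Tokyo museum's name, which is the false positive described above. With the two morphological tokens it does not. The bigram index also stores five tokens for this title where the morphological one stores two, which hints at where the index bloat comes from.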