The rise of social media is a worldwide phenomenon, and people are using many languages to interact online. Only half of all tweets last year were in English, and more than 75% of Facebook users are outside the U.S. Many applications have been developed to ingest and analyze data from various social media sources. Rosette®, a software development kit (SDK), enables these applications to work effectively on text in over 40 of the world’s major languages. Rosette integrates quickly with social media applications to give developers a head start in analyzing multilingual data from Twitter, Facebook, LinkedIn, and other social media channels.
The Rosette linguistics platform enables social media monitoring tools to identify the language of incoming feeds, analyze sentences for sentiment analysis, extract entities for metadata, and improve search results.
Basis Technology has been the industry choice for multilingual natural language processing, starting with major search engines, including Google, Yahoo!, Microsoft Bing, and Oracle Endeca. We have continued to refine our linguistic software components to meet the new wave of language challenges inherent in social media analysis. Contact us for a free evaluation of how Rosette can make your social media analysis software internationally ready.
Cleaning and aggregating social media content starts with language identification. However, location-based and user-specified language settings for posts can be unreliable. Our language identifier has been tuned for high throughput and accuracy and identifies 55 languages. The language identifier is designed to keep up with the Internet’s unprecedented flow of data—blog entries, product reviews, and the Twitter Firehose at over 140 million tweets a day.
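To illustrate why identifying language from the text itself can be more dependable than profile settings, here is a toy sketch that guesses a language by counting common function words. It is not how Rosette works; the `STOPWORDS` table and the `guess_language` function are hypothetical and cover only three languages, whereas a production identifier relies on statistical models.

```python
import re

# Tiny, hand-picked function-word lists (illustrative assumption, not real product data).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "es": {"el", "la", "y", "es", "de", "que", "en", "los"},
    "nl": {"de", "het", "en", "een", "is", "van", "dat", "niet"},
}

def guess_language(text):
    """Return the language whose function words occur most often, or None."""
    # Keep alphabetic tokens only (no digits, punctuation, or underscores).
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    # If no function word was seen at all, decline to guess.
    return best if scores[best] > 0 else None

samples = [
    "The weather is nice and the food is great",
    "El partido de hoy es el mejor de la temporada",
    "Het weer is mooi en de trein is niet op tijd",
]
guesses = [guess_language(s) for s in samples]
```

Short posts contain few function words, which is one reason a heuristic like this breaks down on tweets and why tuned statistical identification matters at scale.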
Semantic and sentiment analysis requires analyzing every word in a sentence. In languages such as English, Portuguese, Japanese, Spanish, and Dutch, Rosette’s linguistic analysis will:
Our entity extractor populates metadata for each post, article, and social conversation with extracted entities—e.g., people, places, companies, and product names. Social media monitoring applications can then filter data based on entities in the metadata. Rosette® Entity Extractor automatically generates metadata for 18 types of entities in over a dozen languages. Developers can customize the entity extractor to detect other entities.
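As a rough sketch of how a monitoring application might filter posts once entity metadata is attached, consider the following. The `Post` class, the `filter_by_entity` function, and the entity type labels are assumptions made for illustration; they are not the Rosette API.

```python
from dataclasses import dataclass, field

@dataclass
class Post:
    text: str
    # (entity_type, entity_text) pairs; the type labels here are hypothetical.
    entities: list = field(default_factory=list)

def filter_by_entity(posts, entity_type=None, name=None):
    """Return posts with at least one entity matching the given type and/or name."""
    wanted = name.lower() if name else None
    result = []
    for post in posts:
        for etype, etext in post.entities:
            if entity_type is not None and etype != entity_type:
                continue
            if wanted is not None and etext.lower() != wanted:
                continue
            result.append(post)
            break  # one matching entity is enough to keep the post
    return result

posts = [
    Post("Loving the new Acme phone", [("PRODUCT", "Acme phone"), ("ORGANIZATION", "Acme")]),
    Post("Flying to Paris next week", [("LOCATION", "Paris")]),
    Post("Paris Hilton spotted downtown", [("PERSON", "Paris Hilton")]),
]
places = filter_by_entity(posts, entity_type="LOCATION")
acme = filter_by_entity(posts, name="acme")
```

Because the filter works on typed entities rather than raw keywords, a query for the location "Paris" does not pull in posts about a person whose name contains the same word.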
Modern vendors of sentiment analysis ascribe sentiment to entities rather than to documents. This method provides a clearer view of what people are saying about brands, products, and their features. Rosette will supply any semantic or sentiment analysis system with accurate and comprehensive entity extraction in the major languages of the Americas, Europe, Asia, and the Middle East.
Social media content aggregators can offer a more rewarding experience to subscribers with Rosette’s document clustering. Give your users the ability to review groups of near-identical conversations or posts rather than read every one. The number of items in a group can also indicate trending topics and products, or expose incidents of social media spamming.
When indexing a high volume of tweets, clustering will detect nearly identical posts, such as retweets, to avoid unnecessary processing.
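One common way to detect near-identical posts is to compare sets of overlapping word sequences (shingles) using Jaccard similarity. The sketch below shows that general technique, not Rosette's clustering implementation; the function names, the retweet-prefix pattern, and the 0.7 threshold are assumptions chosen for the example.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word shingles for a normalized post."""
    # Lowercase and drop a leading retweet marker such as "RT @user:".
    text = re.sub(r"^rt\s+@\w+:?\s*", "", text.lower())
    words = re.findall(r"\w+", text)
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cluster_near_duplicates(posts, threshold=0.7):
    """Greedy single pass: a post joins the first cluster whose representative
    is similar enough, otherwise it starts a new cluster."""
    clusters = []  # list of (representative_shingles, [post indices])
    for idx, post in enumerate(posts):
        sh = shingles(post)
        for rep, members in clusters:
            if jaccard(sh, rep) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append((sh, [idx]))
    return [members for _, members in clusters]

posts = [
    "New phone launch today, the camera looks amazing",
    "RT @fan: New phone launch today, the camera looks amazing",
    "Traffic on the bridge is terrible this morning",
    "new phone launch today the camera looks amazing!!",
]
groups = cluster_near_duplicates(posts)
```

Here the original post, its retweet, and the punctuation variant land in one group, so an indexer could process a single representative and use the group size as a rough popularity signal.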
A data feed is only as useful as its search. For any language searched, adding linguistic processing at index and query time increases the number of relevant search results with little degradation to precision. Our morphological analyzers produce each word’s lemma (its dictionary form), which informs indexing. Other methods, such as stemming, look only at superficial commonalities between word forms, which can lead to unrelated results.
The language-aware approach of lemmatization is used by top enterprise and web search engines today.
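The difference between stemming and lemmatization can be seen in a toy comparison. The suffix list and the `LEMMAS` dictionary below are hypothetical stand-ins for illustration; a real morphological analyzer uses full lexicons and grammar rules for each language.

```python
# Toy lemma dictionary (hypothetical data, for illustration only).
LEMMAS = {
    "ran": "run", "running": "run", "runs": "run",
    "universities": "university",
    "universe": "universe",
    "universal": "universal",
}

def naive_stem(word):
    """Strip a few common English suffixes without consulting a dictionary."""
    for suffix in ("ities", "ity", "ing", "al", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lemmatize(word):
    """Look up the dictionary form; fall back to the word itself."""
    return LEMMAS.get(word, word)

unrelated = ["universities", "universe", "universal"]
related = ["ran", "running", "runs"]

# The stemmer collapses three unrelated words into one index key...
stem_keys_unrelated = {naive_stem(w) for w in unrelated}
# ...and fails to unify the forms of "run".
stem_keys_related = {naive_stem(w) for w in related}

lemma_keys_unrelated = {lemmatize(w) for w in unrelated}
lemma_keys_related = {lemmatize(w) for w in related}
```

In this sketch the stemmer indexes "universities", "universe", and "universal" under the same key while splitting "ran", "running", and "runs" across three keys; the lemma lookup does the opposite, which is the behavior a search index wants.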
Social media posts are notoriously casual and full of misspelled names and nicknames. Overcoming name variants is especially critical for reputation tracking or brand analysis. Our name matcher will find all relevant posts for “Madonna” even when her name is spelled “マドンナ,” “Madonna Ciccone,” or “Madona.” It handles nicknames, missing name components, spelling errors and variants, mixed order names, names in different languages, and more.
Sample name search result for “Steve Jobs” finds variations of his name, even in Arabic.
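A minimal sketch of fuzzy name matching, using only the Python standard library, shows how misspellings, extra name components, and reordered names can be tolerated. This is not Rosette's matcher: the `ALIASES` table is a hypothetical stand-in for real cross-script transliteration, and the 0.85 threshold is an arbitrary choice for the example.

```python
import difflib
import unicodedata

# Hypothetical alias table mapping a non-Latin spelling to a Latin form.
# A real system would use transliteration models rather than a lookup.
ALIASES = {"マドンナ": "madonna"}

def normalize(name):
    """Apply Unicode normalization, trim, lowercase, then resolve known aliases."""
    name = unicodedata.normalize("NFKC", name).strip().lower()
    return ALIASES.get(name, name)

def token_similarity(a, b):
    """Character-level similarity in [0, 1] from difflib."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def name_score(query, candidate):
    """Average, over query tokens, of the best match among candidate tokens.
    Order-insensitive; extra candidate tokens (such as a surname) are not penalized."""
    q_tokens = normalize(query).split()
    c_tokens = normalize(candidate).split()
    if not q_tokens or not c_tokens:
        return 0.0
    best = [max(token_similarity(q, c) for c in c_tokens) for q in q_tokens]
    return sum(best) / len(best)

def matches(query, candidate, threshold=0.85):
    return name_score(query, candidate) >= threshold

variants = ["Madona", "Madonna Ciccone", "マドンナ", "Lady Gaga"]
hits = [v for v in variants if matches("Madonna", v)]
```

Scoring each query token against its best candidate token is what lets "Jobs Steve" match "Steve Jobs" and "Madonna" match "Madonna Ciccone"; a stricter design would also penalize unmatched candidate tokens.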
For more information, download the Rosette for Social Media Monitoring solution brief or read the press release about how Rosette is used inside the social media technology of NetBase.