Rosette® Entity Extractor turns raw data into concepts. This named entity recognition software provides semantic tagging to find entities in text. It builds this metadata by analyzing the text with a hybrid model built from a deep, statistical analysis of the language and a collection of rules about which words represent entities.
Basis Technology has years of experience providing software tools for analyzing and extracting information from multilingual text. Government agencies, search engines, social networks and intelligence-gathering teams run software from Basis Technology to mark up their text and find the important documents for closer attention. They use the metadata to feed into applications for link analysis, alerting, fact extraction, social media monitoring, and more.
Rosette Entity Extractor works out of the box: it ships with pre-trained statistical models for 17 major languages, including Arabic, Chinese, and Russian. Many other solutions must be trained before they can identify salient words.
Rosette Entity Extractor works with your software to flag important entities such as names, locations, geographical coordinates, phone numbers, and dates. Your software passes it a stream of characters, and the extractor returns a collection of metadata indicating which words correspond to entities.
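To make the "characters in, metadata out" contract concrete, here is a minimal Python sketch of what such metadata could look like. It is not the Rosette API: the `EntityMention` class, the `mark_up` helper, the type labels, and the hand-written offsets are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityMention:
    """Hypothetical metadata record: a character span plus an entity type."""
    start: int        # offset of the first character (inclusive)
    end: int          # offset one past the last character (exclusive)
    entity_type: str  # illustrative label such as "PERSON" or "DATE"


def mark_up(text, mentions):
    """Wrap each mention in [TYPE ...] brackets.

    Mentions are applied right to left so that earlier offsets remain valid
    while the string grows.
    """
    out = text
    for m in sorted(mentions, key=lambda m: m.start, reverse=True):
        out = out[:m.start] + f"[{m.entity_type} {out[m.start:m.end]}]" + out[m.end:]
    return out


# Hand-written offsets standing in for what an extractor might return.
text = "Call Maria in Boston on 3 May."
mentions = [
    EntityMention(5, 10, "PERSON"),
    EntityMention(14, 20, "LOCATION"),
    EntityMention(24, 29, "DATE"),
]
print(mark_up(text, mentions))
```

Keeping the metadata as offsets into the original text, rather than modifying the text itself, lets downstream applications choose their own presentation.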
Web and enterprise search leaders rely on Rosette in their high-performance, multi-threaded environments. They choose Rosette for its reliability, scalability, and flexibility: it integrates as an SDK with APIs in C, C++, Java, and .NET for all major architectures and operating systems.
Rosette Entity Extractor (REX) is trained and tuned for each specific language and can be customized to meet the particular entity needs of vertical applications such as health, medicine, life sciences, finance, and manufacturing.
Our entity extractor leverages three industry-leading techniques. The type of entity determines which technique is most appropriate for accurate extraction. Dates follow regular patterns that make them easy to identify with rules, but names of people and places are highly ambiguous and require context-sensitive extraction. Rosette uses these approaches:
Statistical models provide (a) recognition of never-before-seen names and (b) the best answers when words can have multiple meanings. Analyzing how a word correlates with the words around it helps identify the correct context, such as deciding whether the word “Paris” is used as the name of a person or a city.
Regular expressions define entities with standard patterns like telephone numbers and email addresses. Users can add their own expressions easily.
Gazetteers (lists of entities) are used for well-defined words with little ambiguity about their meaning, such as the names of nations.
An algorithm balances the results from these three sources and produces a list of entities.
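The interplay of the three sources can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Rosette's algorithm: the patterns, the three-entry gazetteer, the confidence scores, and every function name are hypothetical, and `context_candidates` is a hand-written heuristic standing in for a real statistical model.

```python
import re

# Hypothetical patterns and gazetteer, for illustration only.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{6,}\d"),
}
GAZETTEER = {"France": "NATION", "Japan": "NATION", "Brazil": "NATION"}


def regex_candidates(text):
    """Entities with standard patterns, found by regular expressions."""
    for etype, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            yield (m.start(), m.end(), etype, 0.9)


def gazetteer_candidates(text):
    """Unambiguous names looked up in a fixed list."""
    for name, etype in GAZETTEER.items():
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            yield (m.start(), m.end(), etype, 0.8)


def context_candidates(text):
    """Toy stand-in for a statistical model: type 'Paris' by the preceding word."""
    for m in re.finditer(r"\b(\w+)\s+(Paris)\b", text):
        prev = m.group(1).lower()
        etype = "LOCATION" if prev in {"in", "to", "from"} else "PERSON"
        yield (m.start(2), m.end(2), etype, 0.6)


def merge(candidates):
    """Where spans overlap, keep only the highest-scoring candidate."""
    chosen = []
    for cand in sorted(candidates, key=lambda c: -c[3]):
        if all(cand[1] <= c[0] or cand[0] >= c[1] for c in chosen):
            chosen.append(cand)
    return sorted(chosen)  # back into document order


def extract(text):
    cands = [*regex_candidates(text), *gazetteer_candidates(text),
             *context_candidates(text)]
    return [(text[s:e], t) for s, e, t, _ in merge(cands)]


print(extract("She flew to Paris, France. Write to info@example.com."))
```

A production system would presumably weigh the sources with far more care than a fixed score per source; the sketch only shows why a balancing step is needed once several extractors can propose overlapping spans.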
The output from the entity extractor is used by search engines and search applications to make your data easier to navigate and explore. Most modern search engines provide faceted navigation, which allows users to filter results based on which entities were discovered in their initial search results. REX is easily integrated into open source search engines such as Apache Lucene or Solr to improve search accuracy and comprehensiveness. Entity extraction can also be fed into other processes, like document clustering, query auto-complete, or used by advanced search applications to boost the ranking of relevant results.
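As an illustration of how extracted entities can drive faceted navigation, the following Python sketch counts entities across a set of search results and then filters by one of them. The result format (a dict with `id` and `entities` keys) and both function names are assumptions for the example; search engines such as Solr provide their own faceting mechanisms.

```python
from collections import Counter, defaultdict


def build_facets(results):
    """Count, per entity type, how many results mention each entity."""
    facets = defaultdict(Counter)
    for doc in results:
        # set() so a document counts once per entity, however often it is mentioned
        for etype, name in set(doc["entities"]):
            facets[etype][name] += 1
    return facets


def filter_by_entity(results, etype, name):
    """Narrow the result list to documents that mention the chosen entity."""
    return [doc for doc in results if (etype, name) in doc["entities"]]


# Hypothetical search results, already annotated with (type, name) pairs.
results = [
    {"id": 1, "entities": [("LOCATION", "Boston"), ("PERSON", "Maria")]},
    {"id": 2, "entities": [("LOCATION", "Boston"), ("LOCATION", "Boston")]},
    {"id": 3, "entities": [("LOCATION", "Paris")]},
]

facets = build_facets(results)
print(facets["LOCATION"].most_common())
print([doc["id"] for doc in filter_by_entity(results, "LOCATION", "Boston")])
```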
The entity extractor is also part of our larger Rosette platform, which accepts raw text, identifies the language, tokenizes it, extracts the entities, and then provides name matching and name translation as needed.
For more information on Rosette Entity Extractor, download the product datasheet, request a product evaluation, or read the whitepaper Entity Extraction Enables “Discovery.” You may also refer to our presentations on this product: “A Gentle Introduction to Entity Extraction” and “Sit! Down! Extract! Teaching New Tricks to Your Entity Extractor.”