Instantly tag named entities from large quantities of text
Big Text represents the vast majority of the world’s big data. Hidden within that text is extremely valuable information that cannot be accessed unless someone reads it manually, a challenge compounded when foreign languages are involved. This hidden data often takes the form of entities: names, places, dates, and other words and phrases that establish the real meaning of the text.
Rosette® Entity Extractor (REX) instantly scans through huge volumes of multilingual, unstructured text and tags key data. REX uses multiple approaches to achieve the most accurate results: advanced statistical modeling, customizable rules, and pre-defined lists.
As linguistics experts with deep understanding at the intersection of language and technology, Basis Technology continually improves the Rosette product family with language additions, feature updates, and the latest innovations from the academic world.
- Simple API
- High Scalability and Throughput
- Industrial-strength Support
- Easy Installation
- Flexible and Customizable
- Integration: Java, C++, or Web Services
- Platform: Unix, Linux, Mac, Windows
- Component of the Rosette SDK
Statistical modeling with advanced linguistics solves two major problems:
- Overlap in the names of people, places, and organizations causes ambiguity. Consider the common surname Smith, compared with the business name Smith & Co., and the town of Smithfield, RI.
- Unique and new names with seemingly infinite formats and spelling variations.
Because of these problems, entity extraction for people, organizations, and locations requires a statistical engine. This solution uses machine learning, analyzing, annotating, and processing millions of news and blog articles from the web to train models on what is, and is not, an entity in a real-world, context-rich setting.
Entities can simply be matched against standard lists and user taxonomies. For example, weapon names are matched with a list-based extractor. A large collection of gazetteers is included; custom lists, such as a terror watch list, can be easily added.
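To make the list-based approach concrete, here is a minimal sketch in Python. It is not REX's API or implementation; the `GAZETTEER` entries, the type labels, and the `tag_from_list` function are hypothetical names chosen for illustration, and the matching is a plain case-insensitive lookup with longest-entry-first priority.

```python
import re

# Hypothetical gazetteer (assumption, not a REX data file):
# surface form in lower case -> entity type label.
GAZETTEER = {
    "smith & co.": "ORGANIZATION",
    "smithfield": "LOCATION",
    "ak-47": "WEAPON",
}

def tag_from_list(text, gazetteer=GAZETTEER):
    """Return sorted (start, end, matched_text, entity_type) tuples."""
    hits = []
    # Try longer entries first so multi-word names win over shorter ones.
    for entry in sorted(gazetteer, key=len, reverse=True):
        # Require non-word characters (or text edges) around the entry.
        pattern = r"(?<!\w)" + re.escape(entry) + r"(?!\w)"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            # Skip a match that overlaps a span already claimed.
            if any(m.start() < e and s < m.end() for s, e, _, _ in hits):
                continue
            hits.append((m.start(), m.end(), m.group(0), gazetteer[entry]))
    return sorted(hits)

if __name__ == "__main__":
    for hit in tag_from_list("Smith & Co. shipped an AK-47 to Smithfield."):
        print(hit)
```

Adding a custom list in this sketch amounts to merging more entries into the dictionary before calling `tag_from_list`; a production extractor would typically use a more efficient multi-pattern matcher.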
Rules, written as regular expressions or other patterns, may be used to detect entities such as dates, times, and email addresses. Many standard string patterns are included; customers can edit them or add their own rules to suit their specific needs.
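As an illustration of rule-based extraction, the following Python sketch tags a few pattern-shaped entities with regular expressions. The rule names, the deliberately simplified patterns, the `SKU-` format, and the `tag_with_rules` function are all assumptions made for this example; they do not reflect the rule syntax or the built-in patterns that ship with REX.

```python
import re

# Hypothetical rule set (assumption): entity type -> simplified pattern.
RULES = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),   # ISO-style dates only
    "TIME": re.compile(r"\b\d{1,2}:\d{2}\b"),       # e.g. 9:30 or 09:30
    "SKU": re.compile(r"\bSKU-\d{6}\b"),            # made-up user-defined rule
}

def tag_with_rules(text, rules=RULES):
    """Return (start, entity_type, matched_text) tuples in document order."""
    hits = []
    for entity_type, pattern in rules.items():
        for m in pattern.finditer(text):
            hits.append((m.start(), entity_type, m.group(0)))
    return sorted(hits)

if __name__ == "__main__":
    sample = "Email ops@example.com before 2024-03-15 at 09:30 about SKU-004217."
    for hit in tag_with_rules(sample):
        print(hit)
```

Customizing the rules in this sketch means editing an entry in `RULES` or adding a new one, which mirrors the idea of users adjusting standard patterns or supplying their own.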
Predefined Entity Types
REX natively supports the following entity types. User-defined entity types, such as SKU numbers, are also supported.
- Credit Card Number
- Geographic Coordinate
- Generic Number
- Personal ID Number
- Phone Number
- Email Address/URL