Basis Technology Blog 

Adapt Rosette Entity Extractor to Your Content for Increased Accuracy

  • November 10, 2014

Entity extraction is becoming a mission-critical tool for finding mentions of people, places, organizations, and products in massive quantities of text. In patent searches, law enforcement, voice-of-the-customer analysis, ad targeting, content recommendation, eDiscovery, and anti-fraud, entity extraction enables swift analysis of gigabytes of data.

Among named entity recognition systems, those that rely on machine learning to find entities, such as Rosette Entity Extractor (REX), have the advantage. They can find previously unknown entities. Furthermore, because statistical entity extractors are context sensitive, they can disambiguate between places like Paris and people named Paris.

Why Entity Extraction Needs To Be Flexible

When it comes to entity extraction, not all content is created equal. While most entity extractors are quite accurate out-of-the-box when working on well-formed text such as news articles, the high degree of content variation in blogs, restaurant reviews, financial documents, electronic medical records, legal contracts, and patent filings can limit the algorithms’ accuracy.

The Rosette Entity Extractor (REX) has an advantage in these cases. REX’s statistical model has been tuned to a wide range of content beyond simply published news. And for users with particularly quirky data (whether in format, style, or vocabulary) and for those who need every last bit of accuracy, REX includes robust field training capabilities with multiple mechanisms for adapting to your data’s idiosyncrasies, thus maximizing the accuracy of named entity extraction.

Using Field Training to Improve Accuracy

Level 1: Just Add Data

The easiest level of adaptation, called “Unsupervised Field Training,” can be almost completely user driven. REX provides access to a state-of-the-art clustering tool chain. You add any quantity of your own data, with no need for annotation: just any old documents you have lying around that are representative of the data you need to extract from. REX will then build you a new model adapted to the idiosyncrasies of your data, dramatically increasing the entity extraction accuracy.

This unsupervised process allows REX to more accurately locate entities in the genre, style, and vocabulary used by your data, based on the idea of word clusters, i.e., “similar words tend to appear in similar contexts.” Thus it might learn that the word “outturn” is used in financial documents the same way “outcome” is used in news articles, or that the words “Waltham”, “Atiak”, “Loveland”, “Svetogorsk”, “Yeisk” and “Descoberto” are all likely names of LOCATIONs, even though none were mentioned in the original “stock” annotated corpus. Consequently, REX will better understand the context surrounding unfamiliar words and, as a result, place them in existing, well-defined clusters.
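To make the “similar words tend to appear in similar contexts” idea concrete, here is a minimal sketch in plain Python. It is not REX’s clustering tool chain, whose internals this post does not describe. The toy corpus, the one-word context window, and the function names `context_vectors` and `cosine` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=1):
    """Map each word to a count of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical, unannotated toy corpus (invented for this example).
corpus = [
    "the office in waltham opened today",
    "the office in loveland opened today",
    "the outturn was better than forecast",
    "the outcome was better than forecast",
]

vectors = context_vectors(corpus)

# Words that share neighbors score high; words with no shared neighbors score zero.
sim_places = cosine(vectors["waltham"], vectors["loveland"])
sim_mixed = cosine(vectors["waltham"], vectors["outturn"])
print(f"waltham vs loveland: {sim_places:.2f}")
print(f"waltham vs outturn:  {sim_mixed:.2f}")
```

In this toy corpus “waltham” and “loveland” have identical neighbors, so they look alike without any annotation, while “waltham” and “outturn” share none. A real system would work from far more data and group words into clusters rather than compare them pairwise.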

Level 2: A Little Annotation Goes a Long Way

For even greater accuracy, you can annotate a small quantity of your data and actively teach REX the unique contexts for entities that are common to your documents. Only a few hundred annotated documents can create dramatic improvements in accuracy.

REX customers who have conducted field training report a drop in both false positives (increased precision) and false negatives (increased recall) from REX and a noticeable improvement in their overall analytics system.
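For readers who want to measure this on their own data, the sketch below shows how precision and recall relate to false positives and false negatives. It assumes exact-match scoring on (start, end, type) spans, which is one common convention and not necessarily how any given customer evaluates REX. The spans are made-up toy values, not REX output, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(gold, predicted):
    """Score predicted entity spans against gold spans using exact matching."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    false_positives = len(predicted - gold)  # extracted, but not in the gold set
    false_negatives = len(gold - predicted)  # in the gold set, but missed
    precision = true_positives / (true_positives + false_positives) if predicted else 0.0
    recall = true_positives / (true_positives + false_negatives) if gold else 0.0
    return precision, recall


# Hypothetical annotations: (start offset, end offset, entity type).
gold = {
    (0, 5, "PERSON"),
    (20, 27, "LOCATION"),
    (40, 56, "ORGANIZATION"),
    (60, 65, "LOCATION"),
}

# Invented output of a model before adaptation: one wrong type, one miss.
before = {(0, 5, "LOCATION"), (20, 27, "LOCATION"), (40, 56, "ORGANIZATION")}

# Invented output of a model after adaptation: the type is fixed, one miss remains.
after = {(0, 5, "PERSON"), (20, 27, "LOCATION"), (40, 56, "ORGANIZATION")}

for name, predicted in (("before", before), ("after", after)):
    p, r = precision_recall(gold, predicted)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

Fewer false positives raise precision and fewer false negatives raise recall, so scoring the same held-out annotated documents before and after field training gives a like-for-like comparison.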

Professional Support

Given that most of our customers welcome guidance in selecting data, building a new model, and evaluating the results, Basis Technology offers professional services to assist with field training. Whether you are just adding raw data to the REX model or annotating documents to teach it new contexts, we can help.

Contact us if you have more questions about the highly adaptable Rosette Entity Extractor (REX).
