Base Linguistics

Rosette Base Linguistics (RBL)

Rosette: Big Text Analytics


Improve the speed and accuracy of your search application with advanced linguistic analysis.

Search many languages with high accuracy

Every language, including English, poses its own challenges for search applications that aim to deliver relevant, precise results. Rosette® Base Linguistics (RBL) enables enterprise applications to search and process text in many languages by providing a complete set of linguistic services. RBL enriches the original text in its native language for best-in-class natural language processing, improving both speed and accuracy.

As linguistics experts with deep understanding at the intersection of language and technology, Basis Technology continually improves the Rosette product family with language additions, feature updates, and the latest innovations from the academic world.

40 Supported Languages

WESTERN EUROPE
  • Catalan*
  • Czech
  • Danish
  • Dutch
  • English
  • Finnish*
  • French
  • German
  • Greek
  • Italian
  • Norwegian
  • Portuguese
  • Spanish
  • Swedish

EASTERN EUROPE
  • Albanian*
  • Bulgarian*
  • Croatian*
  • Estonian*
  • Hungarian
  • Latvian*
  • Polish
  • Romanian
  • Russian
  • Serbian*
  • Slovak*
  • Slovenian*
  • Turkish
  • Ukrainian*

MIDDLE EAST
  • Arabic
  • Hebrew
  • Pashto
  • Persian
  • Urdu

ASIA
  • Chinese, Simplified
  • Chinese, Traditional
  • Indonesian
  • Japanese
  • Korean
  • Malay*
  • Thai

* Limited support

Code Base

  • C++
  • Web Services
  • Java
  • Microsoft .NET

Platform Support

  • Windows
  • Linux
  • Red Hat
  • Mac

KEY FEATURES

  • Simple API
  • High-scale and Throughput
  • Industrial-strength Support
  • Easy Installation
  • Flexible and Customizable
  • Integration: Java, C++, or Web Services
  • Platform: Unix, Linux, Mac, Windows
  • Component of the Rosette SDK
  • Customizable user dictionaries, Japanese orthographic normalization, and Chinese scripts

Advanced Morphological Features

Tokenization

Many search tools index overlapping character bigrams to handle languages written without spaces between words, such as Chinese and Japanese. This inflates the index and reduces relevancy. RBL, in contrast, accurately identifies and separates each word through advanced statistical modeling. This process, also known as segmentation, produces tokens that minimize index size, enhance search accuracy, and increase relevancy.

Tokenization Example
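
As a rough illustration of the difference, the Python sketch below compares character bigrams with word-level segmentation on a short Japanese phrase. It does not use RBL or its API. The names `char_bigrams`, `max_match`, and `LEXICON` are hypothetical, and greedy longest-match against a tiny word list stands in for the statistical model described above.

```python
def char_bigrams(text):
    """Overlapping character bigrams, as indexed by many search tools."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def max_match(text, lexicon):
    """Greedy longest-match segmentation against a word list.

    A simple stand-in for statistical segmentation, not RBL's method.
    Unknown characters fall through as single-character tokens.
    """
    longest = max(map(len, lexicon))
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical mini-lexicon, for illustration only.
LEXICON = {"東京", "大学", "に", "行く"}
text = "東京大学に行く"  # "go to Tokyo University"

bigrams = char_bigrams(text)
words = max_match(text, LEXICON)
print(len(bigrams), bigrams)
print(len(words), words)
```

The bigram list is longer than the word list, and it contains 京大, a common abbreviation for Kyoto University, so a bigram index would match this text for an unrelated query. The word-level tokens avoid that false match.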

Lemmatization

Most search engines rely on stemming, a crude method that chops characters off the end of a word in the hope of reaching its root form. Stemming often conflates unrelated words, which raises recall at the expense of precision. Instead, RBL finds the true dictionary form of each word, known as a lemma, by using vocabulary, context, and advanced morphological analysis. Indexing the lemma increases search relevancy and slims the search index, because inflected forms no longer need separate entries. Alternative lemmas are also made available to supplement indexing.

Lemmatization Example
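
The contrast can be sketched in a few lines of Python. This is a toy under stated assumptions, not RBL's analyzer: `crude_stem`, `lemmatize`, and the `LEMMAS` table are hypothetical, and a hand-written lookup keyed on part of speech stands in for real morphological analysis.

```python
def crude_stem(word):
    """Strip the first matching suffix, with no knowledge of the vocabulary."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


# Hypothetical mini-dictionary: (surface form, part of speech) -> lemma.
LEMMAS = {
    ("mice", "NOUN"): "mouse",
    ("went", "VERB"): "go",
    ("running", "VERB"): "run",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
}


def lemmatize(word, pos):
    """Look up the dictionary form; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)


for word, pos in [("running", "VERB"), ("mice", "NOUN"), ("saw", "VERB"), ("saw", "NOUN")]:
    print(word, pos, crude_stem(word), lemmatize(word, pos))
print(crude_stem("news"))
```

The stemmer turns "running" into the non-word "runn", leaves "mice" untouched, and reduces "news" to "new", merging two unrelated words. The lookup returns "run" and "mouse", and resolves "saw" to "see" or "saw" depending on its part of speech.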

Noun Phrase Extraction

Certain noun phrases, especially multiword proper names, can be difficult to identify as a single unit. RBL groups nouns with their modifiers, which is useful in document clustering and concept extraction.

Part-of-Speech Tagging

As part of the lemmatization process, statistical modeling is used to determine the correct part of speech, even with ambiguous words. Each token is then tagged for enhanced comprehension and search relevancy.

Decompounding

RBL breaks down compound words into sub-components and delivers each individual element to be indexed. This is especially useful for increasing search relevancy in languages such as German and Korean.

Example: German

Samstagmorgen is a compound word formed with Samstag (Saturday) and morgen (morning). Decompounding allows for an appropriate match when searching for “Samstag”.
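
A minimal sketch of the idea in Python follows. It is not RBL's implementation: `decompound`, `index_terms`, and `VOCAB` are hypothetical names, the splitter is a plain dictionary lookup, and it ignores the linking elements (such as a connecting "s") that real German compounds often contain.

```python
def decompound(word, vocabulary, min_len=3):
    """Return the first split of `word` into known parts, or None if none exists."""
    lower = word.lower()
    if lower in vocabulary:
        return [lower]
    for cut in range(min_len, len(lower) - min_len + 1):
        head = lower[:cut]
        if head in vocabulary:
            tail = decompound(lower[cut:], vocabulary, min_len)
            if tail:
                return [head] + tail
    return None


def index_terms(word, vocabulary):
    """Terms to index: the whole word plus any components found."""
    parts = decompound(word, vocabulary) or []
    return {word.lower(), *parts}


# Hypothetical mini-vocabulary, for illustration only.
VOCAB = {"samstag", "morgen", "tag"}

print(decompound("Samstagmorgen", VOCAB))
print(sorted(index_terms("Samstagmorgen", VOCAB)))
```

Because "samstag" is indexed alongside the full compound, a query for "Samstag" can match a document that contains only "Samstagmorgen".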

Sentence Detection

The start and end of each sentence are identified automatically, even when punctuation is ambiguous.
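
To see why punctuation is ambiguous, the Python sketch below compares a naive split on sentence-final punctuation with a version that consults a short abbreviation list. Both functions and the `ABBREVIATIONS` set are hypothetical and far simpler than a production sentence detector. They are not RBL's method.

```python
import re

# Hypothetical, deliberately incomplete abbreviation list.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def split_sentences(text):
    """Split on sentence-final punctuation unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


text = "Dr. Lee reviewed the index. It held 3.5 million terms."
print(naive_split(text))
print(split_sentences(text))
```

The naive version breaks after "Dr." and returns three fragments, while the abbreviation-aware version returns the two actual sentences. Neither splits inside "3.5", because the period there is not followed by whitespace.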


Contact us about integrating Rosette Base Linguistics into your search application.
