Rosette Base Linguistics

Text analytics fundamentals to prepare your data for analysis. Language-specific tools for tokenization, part-of-speech tagging, lemmatization, decompounding, and Chinese and Japanese readings for your input.



Search many languages with high accuracy

Every language, including English, presents unique challenges for search applications that need to deliver relevant and precise results. Rosette® Base Linguistics (RBL) enables enterprise applications to effectively search or process text in many languages by providing a complete set of linguistic services. RBL enriches the original text in its native language for best-in-class natural language processing, improving speed and accuracy.

What is base linguistics?

Base linguistics refers to the core morphological building blocks, including tokens, lemmas, and parts of speech, that prepare your text for further analysis and allow you to effectively search or process text in many languages.

The leaders in multilingual search

Intelligent, successful search is about semantics. People want to enter a real query in human language and get an answer. A word like ‘spoke’, meaning part of a bicycle wheel, is easily confused with the past tense of the verb ‘to speak’. While open source platforms now provide the basic framework for inverted full-text search engines, the challenges of accurate search are compounded as you add more languages to the queries and results. Rosette provides the tools you need to search across 30+ languages.

Product highlights

  • 32 supported languages
  • Sentence tagging
  • Tokenization
  • Lemmatization
  • Part-of-speech tagging
  • Decompounding
  • Chinese/Japanese readings
  • Fast and scalable
  • Cloud and enterprise deployments

Language-specific features

How It Works

Parts of Speech Tagging

As part of the lemmatization process, statistical modeling is used to determine the correct part of speech, even for ambiguous words. Each token is then tagged for enhanced comprehension and search relevancy. Because different languages have different grammars, the set of part-of-speech tags differs from language to language.

Our base linguistics supports the Universal POS Tag standard, from which developers can map to Penn Treebank or other POS tag systems.
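
As a rough illustration of such a mapping, the sketch below lists candidate Penn Treebank tags for a handful of Universal POS tags. The table is a hand-written assumption for illustration only, not a mapping shipped with Rosette. Because the Universal tag set is coarser than Penn Treebank, one Universal tag can correspond to several Penn tags, so the function returns candidates rather than a single answer.

```python
# Illustrative, incomplete mapping from Universal POS tags to candidate
# Penn Treebank tags. Assumption: hand-written for this sketch; a real
# application would need a fuller table and extra features (tense,
# number) to choose among the candidates.
UPOS_TO_PENN = {
    "NOUN": ("NN", "NNS"),
    "PROPN": ("NNP", "NNPS"),
    "VERB": ("VB", "VBD", "VBG", "VBN", "VBP", "VBZ"),
    "ADJ": ("JJ", "JJR", "JJS"),
    "ADV": ("RB", "RBR", "RBS"),
    "ADP": ("IN",),
    "NUM": ("CD",),
    "CCONJ": ("CC",),
    "INTJ": ("UH",),
}


def penn_candidates(upos):
    """Return the Penn Treebank tags that a Universal POS tag may map to."""
    return UPOS_TO_PENN.get(upos.upper(), ())


if __name__ == "__main__":
    for tag in ("NOUN", "VERB", "X"):
        print(tag, penn_candidates(tag))
```

Returning an empty tuple for unknown tags keeps the caller in control of how to handle tags the table does not cover.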


Decompounding Example

Decompounding breaks down compound words into sub-components and delivers each individual element to be indexed. This is especially useful for increasing search relevancy in languages such as German and Korean.

Example: German

Samstagmorgen is a compound word formed with Samstag (Saturday) and morgen (morning). Decompounding allows for an appropriate match when searching for “Samstag”.
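
To make the idea concrete, here is a minimal sketch of dictionary-based decompounding. It is a simplified stand-in, not Rosette's implementation: the tiny lexicon, the `decompound` and `index_terms` functions, and the minimum component length are all assumptions chosen for this example.

```python
# Toy lexicon of known German words (assumption: illustration only).
LEXICON = {"samstag", "morgen", "abend"}


def decompound(word, lexicon, min_part=3):
    """Split `word` into lexicon entries, preferring long components.

    Returns the lowercased word unchanged (as a one-item list) when no
    complete split into known components exists.
    """
    def split(rest):
        if not rest:
            return []
        # Try the longest possible head first, then shorter ones.
        for end in range(len(rest), min_part - 1, -1):
            head = rest[:end]
            if head in lexicon:
                tail = split(rest[end:])
                if tail is not None:
                    return [head] + tail
        return None

    lowered = word.lower()
    parts = split(lowered)
    return parts if parts else [lowered]


def index_terms(word, lexicon):
    """Index the compound itself plus each component, without duplicates."""
    terms = [word.lower()]
    for part in decompound(word, lexicon):
        if part not in terms:
            terms.append(part)
    return terms


if __name__ == "__main__":
    print(index_terms("Samstagmorgen", LEXICON))
```

Indexing both the compound and its components is what lets a query for “Samstag” match a document that only contains “Samstagmorgen”.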

Chinese & Japanese readings

Our base linguistics functionality understands the difference between Chinese and Japanese text when they are written in Han script, and accurately returns the pronunciation information. For Chinese text in hanzi, Rosette returns the pronunciation information in pinyin transcriptions. For Japanese content, Rosette returns furigana transcriptions in katakana. For example, if you call Rosette with “医療番組”, it will return this reading: “イリョウ”, “バングミ”.


Lemmatization Example

Most search engines utilize a crude method of chopping off characters at the end of a word in the hopes of finding the root form. This method, called stemming, often results in more recall, but poorer precision, associating unrelated words such as arsenic/arsenal which share a stem (arsen). Instead, our base linguistics tools find the true dictionary form of each word, known as a lemma, by using vocabulary, context, and advanced morphological analysis. Indexing the root form increases search precision and recall, and slims the search index by not indexing all inflected forms. Alternative lemmas are also made available to supplement indexing.

Example: English

Linguistic analysis is useful for every language; lemmatization for English improves recall and precision.

Challenge                                    Query     Stem    Lemma
Two unrelated words may share a stem         animals   anim    animal
Stemming may deliver unintended results      several   sever   several
Irregular verbs and nouns stump the stemmer  spoke     spoke   speak (v.), spoke (n.)
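
The contrast between the two approaches can be sketched in a few lines. Both functions below are toy assumptions written for this page: `toy_stem` is a deliberately crude suffix stripper (not a real stemming algorithm such as Porter's), and `lemmas` is a hand-made dictionary lookup standing in for the vocabulary and context-driven analysis described above.

```python
# Suffixes removed by the toy stemmer (assumption: illustration only).
SUFFIXES = ("al", "ic")

# Hand-made lemma dictionary; a word may have alternative lemmas.
LEMMAS = {
    "animals": ["animal"],
    "several": ["several"],
    "spoke": ["speak", "spoke"],  # verb reading, then noun reading
}


def toy_stem(word):
    """Chop a plural 's' and then one known suffix off the end of a word."""
    word = word.lower()
    if word.endswith("s") and len(word) > 3:
        word = word[:-1]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmas(word):
    """Look up dictionary forms; fall back to the lowercased word itself."""
    return LEMMAS.get(word.lower(), [word.lower()])


if __name__ == "__main__":
    for query in ("animals", "several", "spoke", "arsenic", "arsenal"):
        print(query, toy_stem(query), lemmas(query))
```

With these rules, “arsenic” and “arsenal” collapse to the same stem, while the lemma lookup keeps them apart and offers both readings of “spoke”.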


Tokenization Example

Many search tools use n-grams to break up text into overlapping strings of characters to create a search index in languages written without spaces between words. N-grams result in a larger index and reduced precision. Our tools, in contrast, accurately identify and separate each word through advanced statistical modeling. The resulting token output (also known as segmentation) minimizes index size, enhances search accuracy, and increases relevancy.
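
The difference in index size is easy to see in a small sketch. The code below compares character bigrams with word segmentation for the Japanese phrase “医療番組”. Note the assumption: the segmenter here is a simple greedy longest-match over a two-word toy lexicon, used only as a stand-in, and is not the statistical model described above.

```python
# Toy lexicon (assumption: illustration only).
LEXICON = {"医療", "番組"}


def char_ngrams(text, n=2):
    """Return every overlapping run of n characters in the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def max_match(text, lexicon, max_len=4):
    """Greedy longest-match segmentation.

    At each position, take the longest lexicon entry that matches;
    fall back to a single character when nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


if __name__ == "__main__":
    text = "医療番組"
    print(char_ngrams(text))         # three bigrams, one spans a word boundary
    print(max_match(text, LEXICON))  # two word tokens
```

The bigram index stores a term (“療番”) that straddles two words and matches nothing meaningful, which is where the extra index size and the loss of precision come from.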

Tech Specs

Availability and platform support

Deployment availability:

Supported languages

Arabic                English    Hungarian   Persian      Spanish
Catalan               Estonian   Italian     Polish       Swedish
Chinese, Simplified   Finnish    Japanese    Portuguese   Thai
Chinese, Traditional  French     Korean      Romanian     Turkish
Czech                 German     Latvian     Russian      Urdu
Danish                Greek      Norwegian   Serbian
Dutch                 Hebrew     Pashto      Slovak



Easy to use

Built for the most demanding text analytics applications and engineered to deliver high accuracy without sacrificing speed, Rosette Cloud is instantly accessible and offers a variety of plans to suit both startups and enterprises. The tokenization and sentences endpoints break your text into word components and sentences, and the morphological analysis endpoint provides POS tagging, lemmatization, decompounding, and Chinese/Japanese readings.
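
A call to these endpoints might look roughly like the sketch below, which uses only the Python standard library. Everything specific in it is an assumption to verify against the Rosette Cloud documentation before use: the base URL, the endpoint paths, the `X-RosetteAPI-Key` header name, and the `content` field of the request body. The official bindings on GitHub are likely the more convenient route.

```python
import json
import urllib.request

# Assumed values; confirm them in the Rosette Cloud documentation.
BASE_URL = "https://api.rosette.com/rest/v1"
ENDPOINTS = {
    "tokens": "/tokens",
    "sentences": "/sentences",
    "morphology": "/morphology/complete",
}


def build_request(endpoint, text, api_key):
    """Build (but do not send) a JSON POST request for one endpoint."""
    body = json.dumps({"content": text}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + ENDPOINTS[endpoint],
        data=body,
        headers={
            "X-RosetteAPI-Key": api_key,  # assumed header name
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )


def call(endpoint, text, api_key):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_request(endpoint, text, api_key)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    request = build_request("morphology", "医療番組", "YOUR_API_KEY")
    print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes it possible to inspect or log the request without spending API calls.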

Try base linguistics and the rest of Rosette Cloud’s endpoints. Sign up today for a 30-day free trial!

Get a Rosette Cloud Key

Quality documentation and support

Customers love our thorough and responsive support team. We also provide in-depth documentation that lists all the features and functions of the various Rosette Cloud endpoints alongside examples in the binding of your choice.

Visit our GitHub for the binding and documentation.

Enterprise ready

Evaluate Rosette’s functional fit with your business and data needs on Rosette Cloud knowing that scalable, customizable, enterprise deployments are available if you need them.



Customize and scale your text analytics on premise

For organizations with vast data quantities, unique integration needs, and data security restrictions, we provide Rosette Enterprise to be hosted on your internal servers.

Request product evaluation

If your organization requires an enterprise solution, we’re happy to work with you to meet your business’s unique needs. For more in-depth evaluations, please complete the form below and our Customer Engineering team will provide you with an evaluation package.

Drop us a line



Select Customers Include


Deep Search for Salesforce

AI-driven Search Application for Salesforce

KonaSearch is a best-in-class search application for Salesforce, enabling users to search every field, file, and object across multiple orgs and other data sources.

View on AppExchange
