Commercial-strength analysis of unstructured Arabic text

Rosette® Base Linguistics for Arabic is a multi-platform, high-performance linguistic engine that facilitates the analysis of documents written in Arabic. Designed to plug into mainstream search engines and data mining products, it performs orthographic and lexical normalization of Arabic text.

Traditionally an oral language, Arabic is not well suited to standard automatic analysis techniques that rely on a language’s written form. Arabic words frequently incorporate grammatical elements indicating attributes such as verb aspect, object, conjugation, person, number, and gender. For example, the definite article and possessive pronouns are not separate words as they are in English; they are attached to the words they modify (“their houses” is written as a single token, بُيُوتُهُمْ). The inconsistent use or absence of written vowels adds further ambiguity. Arabic text therefore requires significant pre-processing before it can be accurately indexed, searched, or otherwise processed.


  • Generates the linguistic stem form of a word
  • Identifies parts of speech
  • Performs orthographic normalization, including the removal of vowel and nunation signs, unification of hamza forms, and the removal of kashida (tatweel)
  • Normalizes irregular “broken” plural forms to the correct singular form
  • Normalizes Arabic numerical expressions to their Latin counterparts
  • Ignores user-identified stop words
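
As a rough illustration of the orthographic normalization steps listed above, here is a minimal Python sketch. It is not Rosette code and does not use the Rosette API; the function name `normalize_arabic`, the choice of code points, and the hamza mapping are assumptions made for this example, and it handles only digits rather than full numerical expressions.

```python
import re

# Vowel and nunation signs: fathatan through sukun (U+064B-U+0652),
# plus superscript alef (U+0670). The exact set is an assumption.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

# Kashida (tatweel), the elongation character.
TATWEEL = "\u0640"

# Unify hamza carriers with their base letters. This is one common
# convention, chosen here for illustration only.
HAMZA_MAP = str.maketrans({
    "\u0622": "\u0627",  # alef with madda above -> alef
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0624": "\u0648",  # waw with hamza above  -> waw
    "\u0626": "\u064A",  # yeh with hamza above  -> yeh
})

# Arabic-Indic (U+0660-U+0669) and Extended Arabic-Indic
# (U+06F0-U+06F9) digits mapped to ASCII digits.
DIGIT_MAP = {0x0660 + i: str(i) for i in range(10)}
DIGIT_MAP.update({0x06F0 + i: str(i) for i in range(10)})


def normalize_arabic(text: str) -> str:
    """Apply the four normalization steps in a fixed order."""
    text = DIACRITICS.sub("", text)    # drop vowel and nunation signs
    text = text.replace(TATWEEL, "")   # drop kashida
    text = text.translate(HAMZA_MAP)   # unify hamza forms
    return text.translate(DIGIT_MAP)   # digits to Latin (ASCII) form


if __name__ == "__main__":
    # The fully vowelled word for "their houses" from the text above.
    print(normalize_arabic("\u0628\u064F\u064A\u064F\u0648\u062A\u064F\u0647\u064F\u0645\u0652"))
```

A sketch like this covers only character-level cleanup. Stemming, part-of-speech tagging, and broken-plural normalization depend on lexical knowledge and cannot be reduced to a character mapping.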

Rosette Base Linguistics also supports Farsi (Persian) and Urdu.

Text Analytics


  • Simple API
  • High scalability and throughput
  • Industrial-strength support
  • Easy installation
  • Flexible and customizable
  • Java or C++
  • Component of the Rosette SDK
  • Customizable user dictionaries, Japanese orthographic normalization, and Chinese scripts
  • Cloudera certified

Contact us about integrating Rosette Base Linguistics into your search application.
