Supported Platforms

Windows, Linux, Solaris, AIX, HP-UX, and MacOS

Languages Supported by Rosette Base Linguistics

  • Albanian
  • Arabic
  • Bulgarian
  • Catalan
  • Chinese (Simplified)
  • Chinese (Traditional)
  • Croatian
  • Czech
  • Danish
  • Dutch
  • English
  • Estonian
  • Finnish
  • French
  • German
  • Greek
  • Hebrew
  • Hungarian
  • Indonesian
  • Italian
  • Japanese
  • Korean
  • Latvian
  • Malay
  • Norwegian
  • Pashto
  • Persian (Farsi / Dari)
  • Polish
  • Portuguese
  • Romanian
  • Russian
  • Serbian
  • Slovak
  • Slovenian
  • Spanish
  • Swedish
  • Thai
  • Turkish
  • Ukrainian
  • Urdu

Arabic Base Linguistics

Commercial-strength analysis of unstructured Arabic text

Rosette® Base Linguistics for Arabic is a multi-platform, high-performance linguistic engine that facilitates the analysis of documents written in Arabic. Designed to plug into mainstream search engines and data mining products, it performs orthographic and lexical normalization of Arabic text.

Traditionally an oral language, Arabic is not well suited to standard automatic analysis techniques that look at a language’s written form. Arabic words frequently incorporate grammatical elements indicating attributes such as verb aspect, object, conjugation, person, number, and gender. For example, the definite article and possessive pronouns are not separate words as they are in languages like English but are attached to the words they modify (for example, “their houses” is written as a single token, بُيُوتُهُمْ). The inconsistent use or absence of written vowels adds further ambiguity. Therefore, Arabic text requires significant pre-processing before it can be accurately indexed, searched, or otherwise processed.
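To make the single-token point concrete, here is a minimal Python sketch that strips the vowel marks from a token and splits off an attached possessive pronoun. It is an illustration only, not how Rosette works: the suffix table and the function names `strip_marks` and `split_possessive` are hypothetical, and a real analyzer would rely on a full lexicon rather than string matching.

```python
import unicodedata

# Hypothetical, deliberately tiny table of attached possessive pronouns.
# Keys are written as escapes so the code points are unambiguous.
POSSESSIVE_SUFFIXES = {
    "\u0647\u0645": "their",  # heh + meem
    "\u0647\u0627": "her",    # heh + alef
    "\u0646\u0627": "our",    # noon + alef
}

def strip_marks(token: str) -> str:
    """Drop combining marks (short vowels, sukun, shadda, nunation)."""
    return "".join(ch for ch in token if unicodedata.category(ch) != "Mn")

def split_possessive(token: str):
    """Return (stem, suffix, gloss); suffix and gloss are None if no match."""
    bare = strip_marks(token)
    for suffix, gloss in POSSESSIVE_SUFFIXES.items():
        # Require at least two letters of stem so very short words are left alone.
        if bare.endswith(suffix) and len(bare) > len(suffix) + 1:
            return bare[: -len(suffix)], suffix, gloss
    return bare, None, None

# "their houses", fully vowelled, as in the paragraph above
token = "\u0628\u064f\u064a\u064f\u0648\u062a\u064f\u0647\u064f\u0645\u0652"
stem, suffix, gloss = split_possessive(token)
print(stem, suffix, gloss)
```

String matching of this kind is easily fooled: a word such as درهم (dirham), which merely ends in the same two letters, would be split incorrectly. That is the sort of ambiguity that calls for lexicon-backed analysis.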

Features:

  • Generates the linguistic stem form of a word
  • Identifies parts of speech
  • Performs orthographic normalization including the removal of vowel and nunation signs, unification of hamza forms, and the removal of kashida (tatweel)
  • Normalizes irregular “broken” plural forms to the correct singular form
  • Normalizes Arabic numerical expressions to their Latin counterparts
  • Ignores user-identified stop words
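
The orthographic steps listed above can be sketched in a few lines of Python. This is a rough approximation under stated assumptions, not the product's implementation: the function name `normalize` is hypothetical, folding every hamza carrier to its bare letter is only one common convention, and the digit mapping assumes that "numerical expressions" refers to Arabic-Indic digits (U+0660 to U+0669).

```python
import re

# Vowel and nunation signs, shadda, sukun (U+064B-U+0652) and superscript alef (U+0670)
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # kashida, the elongation character

# Assumed hamza unification: fold each carrier to its bare letter.
HAMZA_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0622": "\u0627",  # alef with madda       -> alef
    "\u0624": "\u0648",  # waw with hamza        -> waw
    "\u0626": "\u064A",  # yeh with hamza        -> yeh
})

# Assumed digit normalization: Arabic-Indic digits to ASCII digits.
DIGIT_MAP = {0x0660 + i: str(i) for i in range(10)}

TABLE = {**HAMZA_MAP, **DIGIT_MAP}

def normalize(text: str) -> str:
    """Remove diacritics and tatweel, then unify hamza forms and digits."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    return text.translate(TABLE)

# A name written with hamza and kashida, followed by Arabic-Indic digits
sample = "\u0623\u062d\u0645\u0640\u0640\u062f \u0661\u0662\u0663"
print(normalize(sample))
```

Running the same function over both documents and queries means that vowelled, elongated, and plain spellings of a word all reduce to one indexable form.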

Rosette Base Linguistics also supports the Farsi (Persian) and Urdu languages.

For More Information

Fill out the form below, and we’ll contact you about your Arabic Base Linguistics questions.

Learn More

For more information about our Rosette Base Linguistics software, download the product datasheet, request a product evaluation, or browse our presentations about linguistic analysis and full-text search.