Arabic Base Linguistics
Leave the complexities of Arabic text normalization to us
Commercial-strength analysis of unstructured Arabic text
Arabic base linguistics facilitates the analysis of documents written in Arabic. Designed to plug into mainstream search engines and data mining applications, it performs orthographic and lexical normalization of Arabic text.
Why is Arabic text normalization necessary?
Traditionally an oral language, Arabic is not well suited to standard automatic analysis techniques that look at a language’s written form. Arabic words frequently incorporate grammatical elements indicating attributes such as verb aspect, object, conjugation, person, number, gender, and others. For example, the definite article and possessive pronouns are not separate words, as “the” and “their” are in English, but are attached to the words they modify (for example, “their houses” is written as a single token, بُيُوتُهُمْ).
Additional ambiguity arises because short vowels are written inconsistently or omitted altogether. Arabic text therefore requires significant pre-processing before it can be accurately indexed, searched, or otherwise processed.
- Generates the linguistic stem form of a word
- Identifies parts of speech
- Performs orthographic normalization, including the removal of vowel and nunation signs, unification of hamza forms, and the removal of kashida (tatweel)
- Normalizes irregular “broken” plural forms to the correct singular form
- Normalizes Arabic numerical expressions to their Latin counterparts
- Ignores user-identified stop words
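To give a rough idea of what the orthographic and digit normalization steps in the list above involve, here is a minimal Python sketch that uses only the standard library. It is an illustration, not the product's implementation: the function name `normalize_arabic` is hypothetical, the hamza folding table reflects one common convention chosen for this example, and digit handling is limited to mapping Arabic-Indic digits to ASCII digits. Stemming, part-of-speech tagging, and broken-plural handling need a lexicon and are not attempted here.

```python
import re
import unicodedata

# Vowel and nunation signs, shadda, and sukun (U+064B..U+0652),
# plus the superscript alef (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

# Kashida (tatweel) is a purely decorative elongation character.
TATWEEL = "\u0640"

# Assumed folding convention: map each hamza carrier to its bare letter.
HAMZA_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0622": "\u0627",  # alef with madda above -> alef
    "\u0624": "\u0648",  # waw with hamza above  -> waw
    "\u0626": "\u064A",  # yeh with hamza above  -> yeh
})

# Arabic-Indic (U+0660..U+0669) and Extended Arabic-Indic
# (U+06F0..U+06F9) digits mapped to ASCII 0-9.
DIGIT_MAP = str.maketrans(
    "".join(chr(c) for c in range(0x0660, 0x066A))
    + "".join(chr(c) for c in range(0x06F0, 0x06FA)),
    "0123456789" * 2,
)


def normalize_arabic(text: str) -> str:
    """Apply a simplified orthographic normalization to Arabic text."""
    # Compose any decomposed sequences (e.g. alef + combining hamza)
    # so that the hamza table below sees single code points.
    text = unicodedata.normalize("NFC", text)
    text = DIACRITICS.sub("", text)      # drop vowel and nunation signs
    text = text.replace(TATWEEL, "")     # drop kashida
    text = text.translate(HAMZA_MAP)     # unify hamza forms
    text = text.translate(DIGIT_MAP)     # convert digits to ASCII
    return text


# The fully vowelled "their houses" from the text above is reduced
# to its bare consonantal spelling.
print(normalize_arabic("بُيُوتُهُمْ"))
```

After this kind of folding, a vowelled word in a document and the unvowelled form a user types into a search box compare equal as strings, which is why such a step usually runs before indexing. Folding is lossy, so an application would typically keep the original text alongside the normalized form.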
We also support base linguistics for Persian (Farsi and Dari), Pashto, and Urdu.