Our Arabic chat alphabet translator software can be integrated into any application to convert words from Arabic chat alphabet to standard Arabic script. This functionality is a key step in monitoring Arabic social media.
For over two decades, web search giants—including Google, Yahoo!, and Bing—enterprise search vendors, and government agencies have turned to Basis Technology’s text analysis software to enable them to process and search text in the major languages of the world. Our products are trained on large data sets, which are refreshed and updated as we adopt new technologies for ever greater accuracy and broader capabilities.
Arabic chat, also called Arabizi, is widely used in social media (Twitter, blogs, chat) as an easy-to-type alternative to standard Arabic. However, until now, automated analysis of this writing has not been supported by commercial text analysis tools. Analysis is further complicated because the Arabic chat alphabet varies widely from writer to writer: Arabic characters are replaced with numerals or English (Latin) characters that sound like or resemble them.
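To make the substitution concrete, here is a minimal Python sketch of a few widely seen digit conventions in Arabic chat. The table is illustrative rather than exhaustive, the function name `replace_digits` is hypothetical, and none of this is taken from Rosette itself. Conventions differ by writer and region, which is exactly why a fixed lookup table like this one is not sufficient on its own.

```python
# A few commonly seen Arabizi digit conventions (illustrative, not exhaustive).
# Usage varies by writer and region; for example, some writers use "9" for qaf.
DIGIT_TO_ARABIC = {
    "2": "ء",  # hamza
    "3": "ع",  # ayn
    "5": "خ",  # kha
    "6": "ط",  # emphatic ta
    "7": "ح",  # ha
    "9": "ص",  # sad
}


def replace_digits(token: str) -> str:
    """Swap digits for the Arabic letters they commonly stand for.

    Latin letters are left untouched; this only shows the digit convention
    and is not a transliterator.
    """
    return "".join(DIGIT_TO_ARABIC.get(ch, ch) for ch in token)


if __name__ == "__main__":
    for word in ["7abibi", "3arabi"]:
        print(word, "->", replace_digits(word))
```

The Latin letters still need to be resolved, and many of them are ambiguous, so a real converter has to choose among several candidate spellings for each word.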
The Arabizi translator function can be integrated into any software environment as a Java class library or as a web service. It is designed for high performance and scales well: it can run in multiple threads or on multiple cores.
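As a rough illustration of the multi-threaded usage described above, the sketch below fans a batch of messages out over a thread pool. It is written in Python for brevity and uses a hypothetical stand-in, `translate_chat`, in place of any real translator call; the actual Rosette API is not shown in this document and is not reproduced here.

```python
from concurrent.futures import ThreadPoolExecutor


def translate_chat(message: str) -> str:
    """Hypothetical stand-in for a real translator call.

    It only swaps one digit so that the example runs on its own.
    """
    return message.replace("7", "ح")


def translate_batch(messages, max_workers=4):
    """Translate messages concurrently, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs.
        return list(pool.map(translate_chat, messages))


if __name__ == "__main__":
    print(translate_batch(["7ob", "salam", "sa7"]))
```

This pattern assumes the translator call is safe to invoke from several threads at once, which the description above suggests but which should be confirmed against the product documentation.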
A writer's dialect can indicate the country an Arabic speaker comes from. Pronunciation and vocabulary vary from region to region, and Arabic chat words, which are often written phonetically, reflect those differences. Rosette can detect dialectal chat and infer the writer's most likely country of origin, in addition to converting dialectal chat to natively written Arabic.
Arabic is used in over 25 different countries, so handling dialectal variations is key to accurate translation of the Arabic chat alphabet to standard Arabic text. The single word “conspiracy” is typed differently in the Arabic chat alphabet by writers in Egypt, Saudi Arabia, and Morocco.
Rosette Chat Translator can:
Unlike machine translation systems that rely on conventional dictionaries, Rosette Chat Translator is powered by an algorithmic and statistical approach. The algorithm analyzes the morphological components of each word to pick likely translation candidates. To help the algorithm rank those candidates, the statistical model is trained on a database of 300 million Arabic words collected from thousands of different websites.
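The generate-then-rank idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not Rosette's method: it swaps in a simple character-level candidate generator in place of morphological analysis, and the mapping table and word counts are hypothetical values made up for the example, where a real system would use statistics from a large corpus.

```python
from itertools import product

# Hypothetical mapping: each chat character may stand for several Arabic
# letters, or for nothing at all (short vowels are often left unwritten).
LETTER_OPTIONS = {
    "7": ["ح"],
    "o": ["و", ""],
    "a": ["ا", ""],
    "b": ["ب"],
    "k": ["ك"],
    "t": ["ت", "ط"],
}

# Toy frequency counts standing in for a trained statistical model.
TOY_COUNTS = {"حب": 120, "كتاب": 80, "كتب": 60}


def candidates(chat_word: str) -> set:
    """Expand a chat word into every spelling the mapping allows."""
    options = [LETTER_OPTIONS.get(ch, [ch]) for ch in chat_word.lower()]
    return {"".join(parts) for parts in product(*options)}


def rank(chat_word: str) -> list:
    """Order candidates by toy frequency, most frequent first."""
    scored = [(TOY_COUNTS.get(c, 0), c) for c in candidates(chat_word)]
    # Sort by descending count, then by spelling so ties are deterministic.
    return [c for _, c in sorted(scored, key=lambda p: (-p[0], p[1]))]


if __name__ == "__main__":
    for word in ["7ob", "ktab"]:
        print(word, "->", rank(word)[0])
```

Exhaustive expansion grows quickly with word length, which is one reason a production system would prune candidates with linguistic knowledge before scoring them.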
Combined with the full Rosette text analytics platform, the chat translator can feed the Arabic text it converts from the Arabic chat alphabet into the language identifier, the Arabic linguistic analysis component, and the entity extractor. The result is a robust platform for analyzing real-world Arabic text on the web, whether written in the Arabic chat alphabet or in standard Arabic.
To learn more about the Arabic chat alphabet, please refer to these articles.
“Text Message Transliteration Threatens Arabic - Linguists” The Jordan Times “Arab linguists were united in their view that transliteration poses a serious threat to the Arabic language.”
“W.H. Online Counterterrorism Woes” Politico.com “Officials know that buried in the vast network of online communications are messages from Al Qaeda and other militant groups seeking recruits to launch terrorist attacks.”
“Arabizi Is Destroying the Arabic Language” Arab News “Parents and teachers are becoming more concerned over the popularity of this new trend. Some see it as a threat to the Arabic language.”
“The Online Bastard: Transliterated Arabic.” Beirutspring.com “[Transliterated Arabic is] the reason why... spy agencies and law enforcement find it difficult to monitor Arab forums. It allows criminals to organize and peddle their wares... while staying completely outside of the radar.”
“Summary of Arabizi or Romanization: The dilemma of Writing Arabic Texts” Jīl Jadīd Conference, University of Texas at Austin “Researchers conducted a scan on the appearance of this way of writing in media and commercials. From their observations, this way of writing is increasing and more software is being developed to uncode it.”