Rosette Language Identifier (RLI)

Rosette: Big Text Analytics

Instantly identify and triage many languages within large volumes of text.

Identify languages and transform encodings

Rosette® Language Identifier (RLI) scans the text of documents to detect which written languages and character encodings they contain, and where, at high speed and with high accuracy. Automatic language identification streamlines the processing of large quantities of text, which applications that categorize, search, process, and store text in many languages depend on. Individual documents can be routed to language specialists or tagged automatically to improve workflow. Language identification can also be combined with language-specific search engine plug-ins (such as Rosette Base Linguistics) to improve the quality of search results.

RLI achieves its accuracy by combining proprietary algorithms with information-rich language profiles derived from statistical analysis. Basis Technology's linguistics experts work at the intersection of language and technology and continually improve the Rosette product family with new languages, feature updates, and the latest innovations from academic research.
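RLI's algorithms and profiles are proprietary, so the following is only a minimal sketch of the general idea behind a statistical language profile, not RLI's method. It assumes character-trigram counts compared by cosine similarity. The names `trigram_profile`, `cosine_similarity`, and `identify`, and the two tiny training samples, are illustrative assumptions.

```python
from collections import Counter
from math import sqrt


def trigram_profile(text: str) -> Counter:
    """Count overlapping character trigrams in lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


# Toy training samples (an assumption for illustration only; useful
# profiles are built from large corpora per language).
SAMPLES = {
    "English": "the quick brown fox jumps over the lazy dog and the cat",
    "French": "le renard brun saute par-dessus le chien paresseux et le chat",
}
PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}


def identify(text: str) -> str:
    """Return the language whose profile is most similar to the text."""
    query = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine_similarity(query, PROFILES[lang]))


print(identify("the lazy dog"))        # English
print(identify("le chien paresseux"))  # French
```

With profiles this small the sketch only works on text that closely resembles the samples. It is meant to show what "profile derived from statistical analysis" can look like in code, nothing more.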

Text Analytics


  • Simple API
  • High Scalability and Throughput
  • Industrial-strength Support
  • Easy Installation
  • Flexible and Customizable
  • Integration: Java, C++, or Web Services
  • Platform: Unix, Linux, Mac, Windows
  • Component of the Rosette SDK


Language/Encoding Pairs



  • Albanian: ISO-8859-1, Windows-1252
  • Arabic: ISO-8859-6, Windows-720, Windows-1256
  • Arabic (transliterated): ISO-8859-1, Windows-1252, Windows-1256
  • Bengali: ISCII-Bengali
  • Bulgarian: ISO-8859-5, Windows-1251, KOI8-R
  • Catalan: ISO-8859-1, Windows-1252
  • Chinese, Simplified: GB-2312, GB-18030, HZ-GB-2312, ISO-2022-CN
  • Chinese, Traditional: Big5, Big5-HKSCS
  • Croatian: Windows-1250
  • Czech: ISO-8859-2, Windows-1250
  • Danish: ISO-8859-1, Windows-1252
  • Dutch: ISO-8859-1, Windows-1252
  • English: ISO-8859-1, Windows-1252
  • Estonian: ISO-8859-13, Windows-1257
  • Finnish: ISO-8859-1, Windows-1252
  • French: ISO-8859-1, Windows-1252
  • German: ISO-8859-1, Windows-1252
  • Greek: ISO-8859-7, Windows-1253
  • Gujarati: ISCII-Gujarati
  • Hebrew: ISO-8859-8, Windows-1255
  • Hindi: ISCII-Hindi
  • Hungarian: ISO-8859-2, Windows-1250
  • Icelandic: ISO-8859-1, Windows-1252
  • Indonesian: ISO-8859-1, Windows-1252
  • Italian: ISO-8859-1, Windows-1252
  • Japanese: EUC-JP, ISO-2022-JP, Shift-JIS, Shift-JIS-2004 (JIS X 0213)
  • Kannada: ISCII-Kannada
  • Korean: EUC-KR, ISO-2022-KR
  • Kurdish: Windows-1256
  • Kurdish (transliterated): ISO-8859-1, Windows-1252, Windows-1256
  • Latvian: ISO-8859-13, Windows-1257
  • Lithuanian: ISO-8859-13, Windows-1257
  • Macedonian: ISO-8859-5, Windows-1251
  • Malay: ISO-8859-1, Windows-1252
  • Malayalam: ISCII-Malayalam
  • Norwegian: ISO-8859-1, Windows-1252
  • Pashto: ISO-8859-6, Windows-1256
  • Pashto (transliterated): ISO-8859-1, Windows-1252
  • Persian: ISO-8859-6, Windows-1256
  • Persian (transliterated): ISO-8859-1, Windows-1252, Windows-1256
  • Polish: ISO-8859-2, Windows-1250
  • Portuguese: ISO-8859-1, Windows-1252
  • Romanian: ISO-8859-2, Windows-1250
  • Russian: ISO-8859-5, Windows-1251, KOI8-R, IBM-866, Mac Cyrillic
  • Serbian: ISO-8859-5, Windows-1251
  • Serbian (transliterated): ISO-8859-2, Windows-1250
  • Slovak: Windows-1250
  • Slovenian: Windows-1250
  • Somali: ISO-8859-1, Windows-1252
  • Spanish: ISO-8859-1, Windows-1252
  • Swedish: ISO-8859-1, Windows-1252
  • Tagalog: ISO-8859-1, Windows-1252
  • Tamil: ISCII-Tamil
  • Telugu: ISCII-Telugu
  • Thai: Windows-874
  • Turkish: ISO-8859-9, Windows-1254
  • Ukrainian: ISO-8859-5, Windows-1251, KOI8-R
  • Urdu: ISO-8859-6, Windows-1256
  • Urdu (transliterated): ISO-8859-1, Windows-1252
  • Uzbek: ISO-8859-5, Windows-1251, KOI8-R
  • Uzbek (transliterated): Windows-1251
  • Vietnamese: TCVN, VIQR, VISCII, VNI, VPS

Code Base

  • Web Services
  • Microsoft .NET

Platform Support

  • Red Hat


Identification Features

  • Identifies the primary or dominant language of a document
  • Identifies the language scripts within the document, such as Latin and Cyrillic
  • Determines the languages and their percentages within multilingual documents
  • Works with languages that have been transliterated or written with more than one alphabet, such as Arabic chat (Arabic in Latin script)
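
As a rough illustration of the script-identification and percentage features above, the sketch below counts letters per script using only Python's standard `unicodedata` module. It is not RLI's API. Taking the first word of a character's Unicode name as its script is a simplifying assumption, and `script_of` and `script_percentages` are hypothetical helper names. It reports scripts only; telling apart languages that share a script needs statistical models.

```python
import unicodedata
from collections import Counter


def script_of(char: str) -> str:
    """Approximate a letter's script by the first word of its Unicode name."""
    return unicodedata.name(char, "UNKNOWN").split()[0]


def script_percentages(text: str) -> dict:
    """Return each script's share of the letters in text, as percentages."""
    counts = Counter(script_of(ch) for ch in text if ch.isalpha())
    total = sum(counts.values())
    if not total:
        return {}  # no letters at all, so nothing to report
    return {script: 100.0 * n / total for script, n in counts.items()}


shares = script_percentages("Hello мир")
print(shares)                       # {'LATIN': 62.5, 'CYRILLIC': 37.5}
print(max(shares, key=shares.get))  # LATIN (the dominant script)
```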

Language Boundary Locator


Digital text often mixes multiple languages within the same document, presenting a challenge to computers and humans alike. RLI marks the start and end of each language region in a multilingual document, even when all the languages are written in the same script (for example English, French, German, and Italian). It also detects the boundaries of each writing system, such as Latin, Cyrillic, Japanese kana, or Chinese hanzi.
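
The writing-system part of boundary location can be sketched with the standard library alone. The example below is an illustration under stated assumptions, not RLI's implementation: it approximates a letter's script by the first word of its Unicode name, and `script_of` and `script_runs` are hypothetical names. Boundaries between languages in the same script are out of its reach.

```python
import unicodedata


def script_of(char: str) -> str:
    """Approximate a letter's script by the first word of its Unicode name."""
    return unicodedata.name(char, "UNKNOWN").split()[0]


def script_runs(text: str) -> list:
    """Return (script, start, end) spans of same-script letters; end is exclusive."""
    runs = []
    for index, char in enumerate(text):
        if not char.isalpha():
            continue  # spaces, digits, and punctuation never open or split a run
        script = script_of(char)
        if runs and runs[-1][0] == script:
            # Same script as the current run: extend its end offset.
            runs[-1] = (script, runs[-1][1], index + 1)
        else:
            runs.append((script, index, index + 1))
    return runs


text = "Hello world, привет мир!"
for script, start, end in script_runs(text):
    print(script, start, end, repr(text[start:end]))
# LATIN 0 11 'Hello world'
# CYRILLIC 13 23 'привет мир'
```

Returning offsets rather than inserting markers into the text keeps the original document untouched, so a caller can choose how to annotate it.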

Encoding Conversion

Although modern text standards such as XML are defined in terms of Unicode, many existing applications, documents, websites, and data streams still use “legacy encodings,” such as ASCII, ISO-8859-1, Shift-JIS, and many others.

Rosette accurately converts large collections of text in these legacy encodings into a single, uniform Unicode representation. The converted text can then be processed the same way regardless of its language, which avoids data corruption and other problems caused by incompatible encodings.
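
Once the source encoding is known, the conversion step itself is simple. The following minimal sketch uses Python's built-in codecs rather than Rosette, and the helper name `to_utf8` is an assumption. It takes the encoding as given and does not attempt detection, which is the hard part that RLI addresses.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known legacy encoding and re-encode them as UTF-8."""
    # The default errors="strict" raises UnicodeDecodeError on bytes that are
    # invalid in source_encoding instead of silently corrupting the text.
    return raw.decode(source_encoding).encode("utf-8")


legacy_documents = [
    (b"caf\xe9", "iso-8859-1"),                 # "café" in ISO-8859-1
    ("日本語".encode("shift_jis"), "shift_jis"),  # Japanese in Shift-JIS
    ("Привет".encode("koi8-r"), "koi8-r"),      # Russian in KOI8-R
]

for raw, encoding in legacy_documents:
    print(encoding, to_utf8(raw, encoding).decode("utf-8"))
```

Note that single-byte encodings such as ISO-8859-1 accept every byte value, so decoding with the wrong one yields garbled text rather than an error. That is why reliable detection has to come before conversion.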

Contact us for more information about integrating RLI into your application.

