Language Identifier

Rosette Language Identifier (RLI)

Rosette: Big Text Analytics


Instantly identify and triage many languages within large volumes of text.

Identify languages and transform encodings

Rosette® Language Identifier (RLI) scans text within documents to determine and locate written languages and character encodings with extreme speed and very high accuracy. Automatic language identification streamlines the processing of large quantities of text, which is necessary for applications that categorize, search, process, and store text in many languages. Individual documents may be routed to language specialists or automatically tagged for improved workflow. This process may also be combined with language-specific search engine plug-ins (such as Rosette Base Linguistics) to improve the quality of search results.

RLI achieves its incredible accuracy through the use of proprietary algorithms with information-rich language profiles derived from statistical analysis. As linguistics experts with deep understanding at the intersection of language and technology, Basis Technology continually improves the Rosette product family with language additions, feature updates, and the latest innovations from the academic world.
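
To make the idea of a statistically derived language profile concrete, here is a minimal Python sketch. It is not RLI's proprietary algorithm: it assumes a simple character-trigram profile compared by cosine similarity, and the names `ngram_profile`, `similarity`, `identify`, and the tiny `SAMPLES` corpus are hypothetical and chosen only for illustration.

```python
from collections import Counter

def ngram_profile(text, n=3):
    """Return relative frequencies of character n-grams in text."""
    padded = f" {text.lower()} "
    grams = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def similarity(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(g, 0.0) for g, v in p.items())
    norm_p = sum(v * v for v in p.values()) ** 0.5
    norm_q = sum(v * v for v in q.values()) ** 0.5
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

# Illustrative samples only (an assumption); a usable profile needs a large corpus.
SAMPLES = {
    "English": "the quick brown fox jumps over the lazy dog and the cat sleeps in the sun",
    "French": "le renard brun saute par-dessus le chien paresseux et le chat dort au soleil",
    "German": "der schnelle braune fuchs springt über den faulen hund und die katze schläft",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}

def identify(text):
    """Return (language, score) pairs, best match first."""
    probe = ngram_profile(text)
    scores = {lang: similarity(probe, prof) for lang, prof in PROFILES.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `identify("le chat et le chien")` ranks French first with these samples. Short inputs and closely related languages are where a toy profile like this breaks down, which is why production profiles are trained on far more data.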

KEY FEATURES

  • Simple API
  • High-scale and Throughput
  • Industrial-strength Support
  • Easy Installation
  • Flexible and Customizable
  • Integration: Java, C++, or Web Services
  • Platform: Unix, Linux, Mac, Windows
  • Component of the Rosette SDK

  • 188 language/encoding pairs
  • 55 supported languages
  • 7 Latin script variants
  • 44 legacy encodings

  • Albanian: ISO-8859-1, Windows-1252
  • Arabic: ISO-8859-6, Windows-720, Windows-1256
  • Arabic (transliterated): ISO-8859-1, Windows-1252, Windows-1256
  • Bengali: ISCII-Bengali
  • Bulgarian: ISO-8859-5, Windows-1251, KOI8-R
  • Catalan: ISO-8859-1, Windows-1252
  • Chinese, Simplified: GB-2312, GB-18030, HZ-GB-2312, ISO-2022-CN
  • Chinese, Traditional: Big5, Big5-HKSCS
  • Croatian: Windows-1250
  • Czech: ISO-8859-2, Windows-1250
  • Danish: ISO-8859-1, Windows-1252
  • Dutch: ISO-8859-1, Windows-1252
  • English: ISO-8859-1, Windows-1252
  • Estonian: ISO-8859-13, Windows-1257
  • Finnish: ISO-8859-1, Windows-1252
  • French: ISO-8859-1, Windows-1252
  • German: ISO-8859-1, Windows-1252
  • Greek: ISO-8859-7, Windows-1253
  • Gujarati: ISCII-Gujarati
  • Hebrew: ISO-8859-8, Windows-1255
  • Hindi: ISCII-Hindi
  • Hungarian: ISO-8859-2, Windows-1250
  • Icelandic: ISO-8859-1, Windows-1252
  • Indonesian: ISO-8859-1, Windows-1252
  • Italian: ISO-8859-1, Windows-1252
  • Japanese: EUC-JP, ISO-2022-JP, Shift-JIS, Shift-JIS-2004 (JIS X 0213)
  • Kannada: ISCII-Kannada
  • Korean: EUC-KR, ISO-2022-KR
  • Kurdish: Windows-1256
  • Kurdish (transliterated): ISO-8859-1, Windows-1252, Windows-1256
  • Latvian: ISO-8859-13, Windows-1257
  • Lithuanian: ISO-8859-13, Windows-1257
  • Macedonian: ISO-8859-5, Windows-1251
  • Malay: ISO-8859-1, Windows-1252
  • Malayalam: ISCII-Malayalam
  • Norwegian: ISO-8859-1, Windows-1252
  • Pashto: ISO-8859-6, Windows-1256
  • Pashto (transliterated): ISO-8859-1, Windows-1252
  • Persian: ISO-8859-6, Windows-1256
  • Persian (transliterated): ISO-8859-1, Windows-1252, Windows-1256
  • Polish: ISO-8859-2, Windows-1250
  • Portuguese: ISO-8859-1, Windows-1252
  • Romanian: ISO-8859-2, Windows-1250
  • Russian: ISO-8859-5, Windows-1251, KOI8-R, IBM-866, Mac Cyrillic
  • Serbian: ISO-8859-5, Windows-1251
  • Serbian (transliterated): ISO-8859-2, Windows-1250
  • Slovak: Windows-1250
  • Slovenian: Windows-1250
  • Somali: ISO-8859-1, Windows-1252
  • Spanish: ISO-8859-1, Windows-1252
  • Swedish: ISO-8859-1, Windows-1252
  • Tagalog: ISO-8859-1, Windows-1252
  • Tamil: ISCII-Tamil
  • Telugu: ISCII-Telugu
  • Thai: Windows-874
  • Turkish: ISO-8859-9, Windows-1254
  • Ukrainian: ISO-8859-5, Windows-1251, KOI8-R
  • Urdu: ISO-8859-6, Windows-1256
  • Urdu (transliterated): ISO-8859-1, Windows-1252
  • Uzbek: ISO-8859-5, Windows-1251, KOI8-R
  • Uzbek (transliterated): Windows-1251
  • Vietnamese: TCVN, VIQR, VISCII, VNI, VPS

Code Base

  • C++
  • Web Services
  • Java
  • Microsoft .NET

Platform Support

  • Windows
  • Linux
  • Red Hat
  • Mac

Identification Features

  • Identifies the primary or dominant language of a document
  • Identifies the language scripts within the document, such as Latin and Cyrillic
  • Determines the languages and their percentages within multilingual documents
  • Works with languages that have been transliterated or written with more than one alphabet, such as Arabic chat (Arabic in Latin script)
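
As a rough illustration of script identification and percentage reporting, the Python sketch below tallies the share of each script among the letters of a text. It is only an approximation under stated assumptions: the standard `unicodedata` module does not expose the Unicode Script property, so the hypothetical helper `script_of` takes the first word of each character's Unicode name as a stand-in. Telling apart languages that share a script requires a statistical model and is not attempted here.

```python
import unicodedata
from collections import Counter

def script_of(ch):
    """Approximate a character's script from the first word of its Unicode name."""
    if not ch.isalpha():
        return None  # skip digits, punctuation, and whitespace
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None

def script_shares(text):
    """Return the percentage of letters belonging to each script."""
    counts = Counter(s for s in map(script_of, text) if s)
    total = sum(counts.values())
    return {script: round(100 * n / total, 1) for script, n in counts.items()}

print(script_shares("Hello мир"))  # {'LATIN': 62.5, 'CYRILLIC': 37.5}
```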

Language Boundary Locator

Digital text is often composed of multiple languages within the same document, presenting a challenge to computers and humans alike. RLI marks the start and end of each language region within a multilingual document, even when all the languages are written in the same script, as with English, French, German, and Italian. The boundaries of each writing system, such as Latin, Cyrillic, Japanese kana, or Chinese hanzi, are also detected.
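
The writing-system half of boundary detection can be sketched in a few lines of Python. This is an illustrative approximation, not RLI's method: the hypothetical function `script_runs` groups letters into runs by the first word of their Unicode name and reports start and end offsets. Boundaries between languages that share a script (for example, English next to French) cannot be found this way and need a statistical model.

```python
import unicodedata

def script_of(ch):
    """Approximate a letter's script from the first word of its Unicode name."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if ch.isalpha() and name else None

def script_runs(text):
    """Return (script, start, end) spans; end offsets are exclusive."""
    runs = []
    current, start, last = None, 0, 0
    for i, ch in enumerate(text):
        script = script_of(ch)
        if script is None:
            continue  # neutral characters neither open nor close a run
        if script != current:
            if current is not None:
                runs.append((current, start, last))
            current, start = script, i
        last = i + 1
    if current is not None:
        runs.append((current, start, last))
    return runs

print(script_runs("Hello мир, こんにちは"))
# [('LATIN', 0, 5), ('CYRILLIC', 6, 9), ('HIRAGANA', 11, 16)]
```

Spaces and punctuation between runs are left outside the spans, so each span covers only the letters of one script.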

Encoding Conversion

Although modern standards such as XML are defined in terms of Unicode, many existing applications, documents, websites, and data streams still use “legacy encodings,” such as ASCII, ISO 8859-1, Shift-JIS, and many others.

Rosette accurately converts large collections of text in these legacy encodings into a single, uniform Unicode format. The converted text can then be processed the same way regardless of its language, which avoids data corruption and other problems caused by incompatible encodings.
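
For readers who want to see what such a conversion involves, here is a minimal Python sketch using only the standard library. It assumes the source encoding is already known (detecting it is the harder problem that RLI addresses), and the function name `to_utf8` is hypothetical.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known legacy encoding and re-encode them as UTF-8."""
    # errors="strict" raises UnicodeDecodeError rather than silently corrupting text
    text = data.decode(source_encoding, errors="strict")
    return text.encode("utf-8")

# Sample inputs in three legacy encodings supported by Python's codec registry.
legacy_latin1 = "café".encode("iso-8859-1")
legacy_sjis = "日本語".encode("shift_jis")
legacy_koi8 = "мир".encode("koi8_r")

print(to_utf8(legacy_latin1, "iso-8859-1"))  # b'caf\xc3\xa9'
```

Decoding with the wrong source encoding either raises an error or yields the wrong characters, which is the kind of corruption that reliable encoding identification is meant to prevent.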


Contact us for more information about integrating RLI into your application.
