Rosette Language Identifier

“We’ve expanded our work with Basis Technology because of the company’s commitment to add support for the languages that are most important to our customers. Inktomi is dedicated to providing a leading global search solution and continues to address the information retrieval needs of the international market with relevant search results across multiple languages, formats and locations.”


— Troy Toman,
   Vice President and General Manager,
   Inktomi Enterprise Search Solutions

Automatically identify text in over 50 languages

Understanding the language and encoding of a given document is an essential step in working with unstructured multilingual text. Without this basic knowledge, applications such as information retrieval and text mining cannot process data accurately, and important information may be missed entirely or misrouted.

Any application that works with unknown and disparate text sources in multiple languages can benefit from the Rosette Language Identifier (RLI). Using RLI, applications can take a fully automated approach to processing unknown text by quickly and accurately determining both the language and encoding of incoming data. RLI can identify a single primary language in a document, or multiple languages within each document, recognizing a wide selection of Asian, European and Middle Eastern languages.

RLI can be used to determine:

  • A single primary document language
  • A list of different languages in a multilingual document
  • The start/end boundaries of languages in a multilingual document
  • The languages contained in a document and each language's percentage of the overall content
  • The start/end boundaries of different writing systems (scripts), such as Latin, Arabic and Kanji
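
To make the last two items concrete, here is a minimal Python sketch of what script boundaries and content percentages look like as data. It is not RLI's API or method: the function names script_of, script_runs and script_shares are hypothetical, and the script label is a rough heuristic based on Unicode character names. Identifying languages within a script (for example, telling French from German) needs statistical models well beyond this.

```python
import unicodedata
from collections import Counter

def script_of(ch):
    """Rough script label taken from the Unicode character name (heuristic only)."""
    if not ch.isalpha():
        return None  # spaces, digits and punctuation carry no script of their own
    name = unicodedata.name(ch, "")
    if name.startswith("CJK"):
        return "Han"  # Kanji and other Han ideographs
    # e.g. "LATIN SMALL LETTER A" -> "Latin", "ARABIC LETTER MEEM" -> "Arabic"
    return name.split(" ")[0].capitalize()

def script_runs(text):
    """Return (start, end, script) runs over text; end offsets are exclusive."""
    runs = []
    for i, ch in enumerate(text):
        script = script_of(ch)
        if script is None:
            continue
        if runs and runs[-1][2] == script:
            # Same script as the current run: extend it, absorbing any
            # neutral characters (spaces, punctuation) in between.
            runs[-1] = (runs[-1][0], i + 1, script)
        else:
            runs.append((i, i + 1, script))
    return runs

def script_shares(text):
    """Return each script's fraction of the alphabetic characters in text."""
    counts = Counter(s for s in map(script_of, text) if s)
    total = sum(counts.values())
    return {s: n / total for s, n in counts.items()} if total else {}

if __name__ == "__main__":
    sample = "Hello \u0645\u0631\u062d\u0628\u0627 \u65e5\u672c\u8a9e"  # Latin, Arabic, Kanji
    for start, end, script in script_runs(sample):
        print(start, end, script)
    print(script_shares(sample))
```

The runs give the start/end boundaries of each writing system, and the shares give each script's percentage of the text. A downstream application could use the same two structures to decide how to split and route a mixed document.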

Among other uses, RLI can be instrumental in correctly extracting and routing foreign-language text for further automated or manual processing and analysis.

RLI can be purchased separately or as part of the Rosette Linguistics Platform, which performs more in-depth analysis of text using language-specific rules and models.