Rosette® Language Identifier analyzes text and identifies both its language and its character encoding scheme. Detecting the language of documents is a critical first step in any process that handles multilingual text. Our software recognizes 55 languages and 45 encodings and processes files quickly and accurately.
Basis Technology has over 15 years of experience providing software tools for analyzing and extracting information from text in many languages. The language identifier is part of the Rosette platform, a comprehensive suite of advanced text analytics building blocks designed for processing multilingual text. The platform is used by major search engines, commercial applications, and government agencies.
The language identifier recognizes 55 languages including a wide selection of Asian, Indo-European, and Middle Eastern languages. It can accurately identify the language and encoding of text from documents, email messages, webpages, or any other source. Rosette’s accuracy comes from its proprietary algorithms, which are trained on gigabytes of hand-verified text.
This software was carefully engineered for the high performance demanded by large-scale, commercial deployments. It can process thousands of documents per second, is thread safe, and has a small memory footprint.
The language identifier is designed to be easily integrated into high-performance, large-scale applications as well as desktop applications. It has also been deployed in open source technologies such as Apache Solr and Lucene. Supported on Windows and Unix with fully documented APIs for C++, C, Java, and .NET, Rosette seamlessly fits into any application or system.
For texts that contain sections in several different languages, the software produces a complete list of all languages present. For these types of documents, Rosette indicates the location of the start and end boundaries of each distinct language or script region.
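To make the idea of region boundaries concrete, here is a minimal Python sketch that reports the start and end offsets of each script region in a string. It is not Rosette's algorithm or API: the function names `script_of` and `script_regions` are hypothetical, and the script label is only approximated by the first word of each character's Unicode name. It separates scripts (Latin, Cyrillic, and so on), not languages that share a script.

```python
import unicodedata
from typing import List, Tuple


def script_of(ch: str) -> str:
    """Rough script label (assumption: first word of the Unicode name)."""
    if not ch.isalpha():
        return ""  # spaces, digits, and punctuation are treated as neutral
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else ""


def script_regions(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, script) triples; end is exclusive."""
    regions: List[Tuple[int, int, str]] = []
    start, last_end, current = 0, 0, ""
    for i, ch in enumerate(text):
        script = script_of(ch)
        if not script:
            continue  # neutral characters never open or close a region
        if script != current:
            if current:
                regions.append((start, last_end, current))
            start, current = i, script
        last_end = i + 1
    if current:
        regions.append((start, last_end, current))
    return regions


if __name__ == "__main__":
    for start, end, script in script_regions("Hello мир"):
        print(start, end, script)
```

Neutral characters between two regions belong to neither, so each region ends at its last letter rather than at the following space.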
Rosette Language Identifier correctly identifies the text encoding. This capability is important for modern software systems, which often receive input in “legacy encodings” such as ASCII, ISO 8859-1 (also known as “Latin 1”), and Shift-JIS. The language identifier can be combined with an encoding conversion engine (such as Rosette Core Library for Unicode) to identify the encoding and then convert the text to Unicode, which simplifies further processing.
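The detect-then-convert flow can be illustrated with a short Python sketch. It assumes the encoding name has already been reported by a detector; the helper `to_unicode` is hypothetical, and the conversion uses Python's standard codecs rather than Rosette Core Library for Unicode.

```python
def to_unicode(raw: bytes, detected_encoding: str) -> str:
    """Decode legacy-encoded bytes into a Unicode string.

    detected_encoding is assumed to come from a separate detection step.
    Raises UnicodeDecodeError if the bytes are not valid in that encoding.
    """
    return raw.decode(detected_encoding)


if __name__ == "__main__":
    # Sample input: Japanese text stored as Shift-JIS bytes.
    legacy_bytes = "日本語".encode("shift_jis")

    # Convert to Unicode once the encoding is known, then re-encode as UTF-8
    # so that downstream components handle a single encoding.
    text = to_unicode(legacy_bytes, "shift_jis")
    utf8_bytes = text.encode("utf-8")
    print(text, len(legacy_bytes), len(utf8_bytes))
```

Decoding with the wrong encoding either raises an error or silently produces garbled text, which is why the detection step comes first.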
For more information on Rosette Language Identifier, download the product datasheet, request a product evaluation, or read the presentations “Language Identification: The First Step in Processing Intelligence” and “What Language is That? Using Rosette® Language Identifier.”