Supported Platforms

Windows, Linux, Solaris, AIX, HP-UX, and MacOS

Rosette Language Identifier

Automatically Detects the Language of Any Digital Text

Rosette® Language Identifier analyzes text and identifies both its language and its character encoding scheme. Detecting the language of a document is a critical first step in any process that handles multilingual text. Our software recognizes 55 languages and 45 encodings, and it processes files quickly and accurately.

Commercial Software Built by Experts in Natural Language Processing

Basis Technology has over 15 years of experience providing software tools for analyzing and extracting information from text in many languages. The language identifier is part of the Rosette platform, a comprehensive suite of advanced text analytics building blocks designed for processing multilingual text. The platform is used by major search engines, commercial applications, and government agencies.

Accurately Recognizes a Large Set of Languages

The language identifier recognizes 55 languages including a wide selection of Asian, Indo-European, and Middle Eastern languages. It can accurately identify the language and encoding of text from documents, email messages, webpages, or any other source. Rosette’s accuracy comes from its proprietary algorithms, which are trained on gigabytes of hand-verified text.

Processes Thousands of Documents Per Second

This software was carefully engineered for the high performance demanded by large-scale, commercial deployments. It can process thousands of documents per second, is thread safe, and has a small memory footprint.

Integrates into Any Software Product or System

The language identifier is designed to be easily integrated into high-performance, large-scale applications as well as desktop applications. It has also been deployed in open source technologies such as Apache Solr and Lucene. Supported on Windows and Unix with fully documented APIs for C++, C, Java, and .NET, Rosette seamlessly fits into any application or system.

Identifies All Languages in Multilingual Documents

For texts that contain sections in several different languages, the software produces a complete list of all languages present. For such documents, Rosette also reports the start and end boundaries of each distinct language or script region.
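To make the idea of region boundaries concrete, here is a minimal sketch in Python. It is not Rosette's algorithm or API: it only splits text by Unicode script, using the first word of each character's Unicode name as a coarse script label, and it does not identify languages. The function names `script_of` and `script_regions` are hypothetical and chosen for illustration.

```python
import unicodedata
from typing import List, Tuple


def script_of(ch: str) -> str:
    """Coarse script label: the first word of the Unicode character name.

    Illustrative assumption only; non-letters (spaces, digits,
    punctuation) return "" and are treated as script-neutral.
    """
    if not ch.isalpha():
        return ""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else ""


def script_regions(text: str) -> List[Tuple[str, int, int]]:
    """Return (script, start, end) triples, with end exclusive."""
    regions: List[Tuple[str, int, int]] = []
    current, start, last = "", 0, 0
    for i, ch in enumerate(text):
        script = script_of(ch)
        if not script:
            continue  # neutral characters never open or close a region
        if script != current:
            if current:
                regions.append((current, start, last))
            current, start = script, i
        last = i + 1  # region ends just past the last letter seen
    if current:
        regions.append((current, start, last))
    return regions


if __name__ == "__main__":
    for script, begin, end in script_regions("Hello мир"):
        print(script, begin, end)
```

Each triple gives the offsets of one run of same-script letters, so a caller can slice the original text with `text[start:end]` and hand each region to a language-specific step.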

Aids Conversion to Unicode by Identifying Legacy Encodings

Rosette Language Identifier correctly identifies the text encoding. This capability is important for modern software systems, which often receive input in “legacy encodings” such as ASCII, ISO 8859-1 (also known as “Latin 1”), and Shift-JIS. The language recognizer can be combined with an encoding conversion engine (such as Rosette Core Library for Unicode) to identify the encoding and then convert the text to the Unicode encoding standard, which simplifies further processing.
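The conversion step can be sketched with Python's standard codecs. This is an illustration only and does not use any Rosette API: it assumes the encoding name has already been reported by some identifier, and the function name `to_utf8` is hypothetical.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Convert bytes in a known legacy encoding to UTF-8.

    `source_encoding` is assumed to come from an earlier
    encoding-identification step; a wrong label may raise
    UnicodeDecodeError or silently produce the wrong characters.
    """
    text = data.decode(source_encoding)  # legacy bytes -> Unicode string
    return text.encode("utf-8")          # Unicode string -> UTF-8 bytes


if __name__ == "__main__":
    # Build sample legacy input in two of the encodings named above.
    shift_jis_bytes = "日本語".encode("shift_jis")
    latin1_bytes = "café".encode("iso-8859-1")

    print(to_utf8(shift_jis_bytes, "shift_jis").decode("utf-8"))
    print(to_utf8(latin1_bytes, "iso-8859-1").decode("utf-8"))
```

Once everything is in a single Unicode encoding, later steps no longer need to know which legacy encoding each input arrived in.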

Learn More

For more information on Rosette Language Identifier, download the product datasheet, request a product evaluation, or read the presentations “Language Identification: The First Step in Processing Intelligence” and “What Language is That? Using Rosette® Language Identifier.”