Rosette Language Identifier

Instantly identify languages within large volumes of text to prepare for further analysis


What is language detection?

Global-minded companies must be able to work with data and content in dozens of languages. Whether you are searching text or deciding up front which language experts to hire for an eDiscovery project, Rosette identifies the language automatically. Only by detecting the language of a search query can you search the correct database or return results in the right language.

Why do I need it?

Text analysis is most accurate when working natively within each language, so language identification is a prerequisite for applying the correct language-specific analyzer.

A critical two-line email may be written in French but carry a 12-line legal footer in English. Such an email might fool most language identifiers into tagging the whole message as English, yet only correctly tagging the language of each section will unlock the information inside.

Short texts are also challenging, but ubiquitous: in social media, image captions, news headlines, email subject lines, tweets, metadata, keywords, queries, files, logs, and more.

Basis Technology leads the pack

Our language identifier outperforms most on the market in detecting the language of:

  • Short texts: tweets and queries, from as few as 1-3 words to a full sentence.
  • Multilingual documents: Rosette recognizes the dominant language in a body of text, as well as smaller sections of text in different languages.
  • Transliterated texts: languages normally written in a non-Latin script (such as Arabic) may appear either in their native script or in Latin characters. Rosette recognizes both.
  • Language regions: Rosette flags the boundaries between languages within multilingual text.

Product highlights

  • 56 languages
  • 18 language scripts (e.g. Latin, Cyrillic)
  • 364 language/encoding pairs
  • Identifies the dominant language of a document
  • Identifies different language regions within multilingual documents
  • Delivers high accuracy based on as few as 1-3 words
  • Cloud and Enterprise deployments

How It Works

Superior coverage of language, encodings, and scripts

Our language identifier achieves its accuracy through proprietary algorithms and information-rich language profiles derived from statistical analysis, together with language-specific methods for short-text language detection.


The input data may be in any of 364 language-encoding-script combinations, involving 56 languages, 48 encodings, and 18 writing scripts. The language identifier uses an n-gram algorithm to detect language. Each of the 155 built-in profiles contains the quad-grams (i.e., four consecutive bytes) that are most frequently encountered in documents in a given language, encoding, and script. The default number of n-grams is 10,000 for double-byte encodings and 5,000 for single-byte encodings.
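As a rough illustration of what such a profile might look like, the sketch below counts four-byte n-grams in a sample and keeps the most frequent ones. It is a minimal sketch of the general n-gram technique only, not Rosette's proprietary implementation; the function name `build_profile` and its parameters are assumptions made for this example.

```python
from collections import Counter


def build_profile(data: bytes, n: int = 4, max_ngrams: int = 5000) -> list[bytes]:
    """Return the most frequent byte n-grams in `data`, most frequent first.

    Hypothetical helper: the real profile format is not described here.
    """
    # Slide a window of n bytes across the input and count each n-gram.
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    # Sort by descending count, then by byte value so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:max_ngrams]]


sample = "the cat and the hat".encode("utf-8")
profile = build_profile(sample, max_ngrams=3)
print(profile)
```

The `max_ngrams` default mirrors the 5,000 figure quoted above for single-byte encodings; a real profile would be built from a large corpus rather than a single sentence.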


When input text is submitted for detection, a similar n-gram profile is built from that data. The input profile is then compared with every built-in profile by calculating a vector distance between the two. The built-in language profiles are returned in ascending order of distance, so the most likely language (the profile with the shortest distance from the input text’s profile) comes first.
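The comparison step can be sketched as follows. The exact distance measure is not specified, so this example swaps in cosine distance over n-gram counts as one plausible choice; the function names and the tiny training strings are assumptions for illustration, not part of the product.

```python
import math
from collections import Counter


def ngram_counts(text: str, n: int = 4) -> Counter:
    """Count byte n-grams in the UTF-8 encoding of `text`."""
    data = text.encode("utf-8")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 1 - cosine similarity; 0.0 means identical direction."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty profile shares nothing with any other profile
    return 1.0 - dot / (norm_a * norm_b)


def rank_languages(text: str, profiles: dict[str, Counter]) -> list[tuple[str, float]]:
    """Return (language, distance) pairs in ascending order of distance."""
    query = ngram_counts(text)
    scored = [(lang, cosine_distance(query, prof)) for lang, prof in profiles.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Toy profiles built from one sentence each (illustrative only).
profiles = {
    "eng": ngram_counts("the quick brown fox jumps over the lazy dog and the cat"),
    "spa": ngram_counts("el rápido zorro marrón salta sobre el perro perezoso y el gato"),
}
ranking = rank_languages("the dog and the fox", profiles)
print(ranking[0][0])  # closest profile first
```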


Our language identifier returns a confidence score with each language result, ranging from 0 to 1. It is a measurement that you can use as a threshold for flagging results that are “too close to be sure.”
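One way to apply such a threshold is sketched below. The cut-off values (`min_confidence`, `min_margin`) and the function name are assumptions chosen for illustration; suitable values depend on your data. The first input reuses two confidence values from the sample response later on this page, and the second input is invented to show the confident case.

```python
def flag_uncertain(detections, min_confidence=0.5, min_margin=0.1):
    """Return (top_language, is_uncertain) for a list of detection dicts.

    A result is flagged when the top score is low, or when the runner-up
    is nearly as likely as the top candidate.
    """
    if not detections:
        return None, True
    ranked = sorted(detections, key=lambda d: d["confidence"], reverse=True)
    top = ranked[0]
    runner_up = ranked[1]["confidence"] if len(ranked) > 1 else 0.0
    too_low = top["confidence"] < min_confidence
    too_close = (top["confidence"] - runner_up) < min_margin
    return top["language"], too_low or too_close


close_call = [
    {"language": "spa", "confidence": 0.38719602327387076},
    {"language": "eng", "confidence": 0.32699986625091865},
]
clear_call = [
    {"language": "fra", "confidence": 0.93},  # invented values
    {"language": "ita", "confidence": 0.04},
]
print(flag_uncertain(close_call))
print(flag_uncertain(clear_call))
```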

Language Boundary Locator

Digital text is often composed of multiple languages within the same document, presenting a challenge to computers and humans alike. The Rosette Language Identifier (RLI) enriches the text with start and end markers for each language region in a multilingual document, even when all the languages (such as English, French, German, or Italian) are written in the same script. Boundaries between writing systems, such as Latin, Cyrillic, Japanese kana, or Chinese hanzi, are also detected.
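The simpler half of this task, finding boundaries between writing systems, can be approximated with the Python standard library alone. The sketch below is only an illustration under stated assumptions: it guesses a character's script from the first word of its Unicode name, and it does not attempt to separate languages that share a script, which requires statistical models like those described above. The function names are hypothetical.

```python
import unicodedata


def script_of(ch: str) -> str:
    """Approximate a character's script from the first word of its Unicode name."""
    if not ch.isalpha():
        return "COMMON"  # spaces, digits, and punctuation belong to no script
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def script_regions(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, script) regions; neutral characters join the region before them."""
    regions = []
    start, current = 0, None
    for i, ch in enumerate(text):
        script = script_of(ch)
        if script == "COMMON" or script == current:
            continue
        if current is not None:
            regions.append((start, i, current))
            start = i
        current = script
    if current is not None:
        regions.append((start, len(text), current))
    return regions


regions = script_regions("Hello мир")
print(regions)
```

Each tuple gives the start offset, the end offset (exclusive), and the approximate script name for one region.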

Short string language detection

For a number of languages, the language identifier uses additional proprietary algorithms for detecting the language of short strings (140 characters or fewer).

Tech Specs

Availability and platform support

Deployment availability:

Supported languages

Includes short string language identification, except for languages marked with *

Albanian                   German                     Latvian                    Slovak
Arabic                     Greek                      Lithuanian                 Slovenian
*Arabic (transliterated)   Gujarati                   Macedonian                 Somali
Bengali                    Hebrew                     Malay                      Spanish
Bulgarian                  Hindi                      Malayalam                  Swedish
Catalan                    Hungarian                  Norwegian                  Tagalog
Chinese, Simplified        Icelandic                  Pashto                     Tamil
Chinese, Traditional       Indonesian                 *Pashto (transliterated)   Telugu
Croatian                   Italian                    Persian                    Thai
Czech                      Japanese                   *Persian (transliterated)  Turkish
Danish                     Kannada                    Polish                     Ukrainian
Dutch                      Korean                     Portuguese                 Urdu
English                    Korean (North Dialect)     Romanian                   *Urdu (transliterated)
Estonian                   Korean (South Dialect)     Russian                    Uzbek
Finnish                    Kurdish                    Serbian                    Uzbek (transliterated)
French                     Kurdish (transliterated)   Serbian (transliterated)   Vietnamese

Try the Demo


Easy to use

Built for the most demanding text analytics applications and engineered to deliver high accuracy without sacrificing speed, Rosette Cloud is instantly accessible and offers a variety of plans to suit both startups and enterprises. The language ID endpoint identifies the dominant language within a document. For multilingual documents, send text through the sentence tagger endpoint and then feed a sentence at a time to the language ID endpoint. Or, ask about our enterprise deployments.
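The multilingual workflow described above (split into sentences, then identify each sentence) might be wired together as in the sketch below. The two callables are hypothetical stand-ins for the sentence tagger and language ID endpoints; the toy implementations exist only so the example runs offline and do not reflect how Rosette Cloud works internally.

```python
import re
from typing import Callable


def tag_sentences(
    text: str,
    split_sentences: Callable[[str], list[str]],
    detect_language: Callable[[str], str],
) -> list[tuple[str, str]]:
    """Split `text` into sentences and pair each one with its detected language."""
    return [(sentence, detect_language(sentence)) for sentence in split_sentences(text)]


# Toy stand-in for a sentence tagger: split after ., !, or ?
def toy_split(text: str) -> list[str]:
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


# Toy stand-in for a language identifier: count overlaps with tiny word lists.
TOY_WORDS = {"eng": {"the", "is", "this"}, "fra": {"le", "est", "ceci"}}


def toy_detect(sentence: str) -> str:
    words = set(re.findall(r"\w+", sentence.lower()))
    return max(TOY_WORDS, key=lambda lang: len(words & TOY_WORDS[lang]))


result = tag_sentences("Ceci est le contrat. This is the disclaimer.", toy_split, toy_detect)
for sentence, language in result:
    print(language, sentence)
```

In a real integration, the two stand-ins would be replaced by calls to the corresponding endpoints through the binding of your choice.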

Try the language identifier and the rest of Rosette Cloud’s endpoints: sign up today for a 30-day free trial!

Get a Rosette Cloud Key

Quality documentation and support

Customers love our thorough and responsive support team. We also provide in-depth documentation that lists all the features and functions of the various Rosette Cloud endpoints alongside examples in the binding of your choice.

Visit our GitHub for the binding and documentation.

Enterprise ready

Evaluate Rosette’s functional fit with your business and data needs on Rosette Cloud knowing that scalable, customizable, enterprise deployments are available if you need them.

  "languageDetections": [
    {
      "language": "spa",
      "confidence": 0.38719602327387076
    },
    {
      "language": "eng",
      "confidence": 0.32699986625091865
    },
    {
      "language": "por",
      "confidence": 0.05569054210624943
    },
    {
      "language": "deu",
      "confidence": 0.030069489878380328
    },
    {
      "language": "swe",
      "confidence": 0.027734757034048835
    }
  ]


Customize and scale your text analytics on premise

For organizations with vast data quantities, unique integration needs, and data security restrictions, we provide Rosette Enterprise to be hosted on your internal servers.

On-premise language identification can both identify the dominant language of an entire document and detect the language regions in multilingual documents.

Request product evaluation

If your organization requires an enterprise solution, we’re happy to work with you to meet your business’s unique needs. For more in-depth evaluations, please complete the form below and our Customer Engineering team will provide you with an evaluation package.

Drop us a line



Select Customers Include

Deep Search for Salesforce

AI-driven Search Application for Salesforce

KonaSearch is a best-in-class search application for Salesforce, enabling users to search every field, file, and object across multiple orgs and other data sources.
