Rosette Base Linguistics for Asian Languages
Our Asian language analyzers are used in some of the world’s most transaction-heavy environments, like Google’s search engine and Amazon’s e-commerce site. Rosette Base Linguistics for Chinese, Japanese and Korean are extremely accurate and reliable solutions to help complex applications process unstructured Asian language text by conquering some of these languages’ many challenges, such as the use of numerous scripts and absence of spaces between words. Using advanced morphological analysis, our Asian Base Linguistics perform functions critical for analyzing Asian text such as segmentation, lemmatization, noun decompounding, part-of-speech tagging, sentence boundary detection, and base noun phrase analysis.
Rosette Base Linguistics relies on dictionaries that are continually updated to keep pace with the evolution of each language. For further detail on the dictionaries, please download a datasheet.
Features Include:
- Segmentation and tokenization: Dividing Asian text, which is written without spaces between words, into individual word tokens

- Lemmatization: Providing the dictionary base form for an inflected verb or adjective

- Noun decompounding: Separating compound nouns into their component nouns

- Part-of-speech tagging: Identifying a word's part of speech, such as noun, verb or preposition
- Sentence boundary detection: Marking the boundaries of individual sentences
- Base noun-phrase analysis: Identifying groups of words, including a noun, that together form a single nominal expression
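To make the segmentation feature above more concrete, here is a minimal sketch of dictionary-driven segmentation using greedy forward maximum matching. It is not the Rosette implementation or API; the `segment` function, the `max_len` parameter, and the three-word toy lexicon are all assumptions made for illustration, and a production analyzer would rely on far larger dictionaries and morphological analysis.

```python
def segment(text, lexicon, max_len=4):
    """Split unspaced text into tokens by greedy forward maximum matching.

    Illustrative only: `lexicon` is a hypothetical toy word list, and any
    character not covered by the lexicon becomes a one-character token.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon, not taken from any Rosette dictionary.
LEXICON = {"国語", "辞典", "渡辺"}

print(segment("国語辞典", LEXICON))
```

Greedy matching like this is simple but can pick the wrong split when several dictionary words overlap, which is one reason dictionary quality matters for this kind of analysis.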
The Rosette Japanese Orthographic Analyzer (JOA) is a dictionary-driven software component that normalizes different orthographic forms of a Japanese word to a single canonical form. These forms are comparable to spelling variations in English, such as those seen in foreign words and names (e.g. Osama and Usama). Because purely algorithmic approaches are prone to error, the JOA dictionary consists of thousands of variations that lexicographers have observed in actual texts. The current JOA data set is focused on general-purpose web search, and JOA is designed to help searches find katakana orthographic variants as well as kanji variants.
The JOA data includes approximately 9,000 katakana terms.
- Katakana variant normalization: the canonical form is on the left, and the other forms are normalized to it. For example:
ダンスセラピー ← ダンスセラピ / ダンステラピ / ダンステラピー
- JOA supports new and old forms of kanji; there are over 89,000 entries containing words with old character forms, which are normalized to the canonical modern form. For example:
渡辺 ← 渡邊
国語 ← 國語
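The dictionary-driven normalization described above can be pictured as a lookup from each observed variant to its canonical form. The following is only a sketch under that assumption: the `VARIANTS` table and the `normalize` function are hypothetical names, the table holds only the examples listed above, and none of this reflects the actual JOA data format or API.

```python
# Variant -> canonical form, built only from the examples in this section.
# The table name and layout are assumptions made for illustration.
VARIANTS = {
    "ダンスセラピ": "ダンスセラピー",
    "ダンステラピ": "ダンスセラピー",
    "ダンステラピー": "ダンスセラピー",
    "渡邊": "渡辺",
    "國語": "国語",
}


def normalize(term, variants=VARIANTS):
    """Return the canonical form of `term`, or `term` itself if unknown."""
    return variants.get(term, term)


print(normalize("渡邊"))
print(normalize("ダンステラピー"))
```

Normalizing both the indexed text and the query in this way lets a search for one spelling match documents that use another.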
Also available is the Rosette Chinese Script Converter, for automatic conversion between Simplified Chinese (SC) and Traditional Chinese (TC) script. The Chinese Script Converter addresses the information retrieval issues that stem from the major differences between SC and TC, including character sets, encoding methods, orthography, vocabulary, and semantics.
For example, “taxi” is written as “出租汽车” in Simplified Chinese and “計程車” in Traditional Chinese. For more information, download the datasheet “Analyzing Text in Asian Languages.”
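The taxi example shows why script conversion cannot rely on character substitution alone. The sketch below is a hypothetical illustration, not the Rosette Chinese Script Converter API: `TERM_MAP`, `CHAR_MAP`, and `to_traditional` are assumed names, and the two tiny tables contain only what is needed for this one example.

```python
# Vocabulary-level mapping: whole SC terms that have a different TC word.
TERM_MAP = {"出租汽车": "計程車"}
# Character-level mapping: SC characters and their TC counterparts.
CHAR_MAP = {"车": "車"}


def to_traditional(text, term_map=TERM_MAP, char_map=CHAR_MAP):
    """Convert SC text to TC: replace known terms first, then characters."""
    for sc_term, tc_term in term_map.items():
        text = text.replace(sc_term, tc_term)
    return "".join(char_map.get(ch, ch) for ch in text)


# With the term table, "taxi" becomes the word used in Traditional Chinese.
print(to_traditional("出租汽车"))
# Without it, only the character shapes change and the vocabulary does not.
print(to_traditional("出租汽车", term_map={}))
```

The second call still yields the Simplified Chinese word written in traditional characters, so a search for “計程車” would miss it. That vocabulary gap is the kind of difference a converter has to handle beyond character sets.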



