Rosette Base Linguistics (RBL)

Rosette: Big Text Analytics


Chinese, Japanese, and Korean


Comprehensive morphological analysis of Chinese, Japanese and Korean text

Our CJK language analyzers are used in some of the world’s most transaction-heavy environments, like Google’s search engine and Amazon’s e-commerce site. Rosette Base Linguistics for Chinese, Japanese and Korean helps complex applications process unstructured CJK text accurately and reliably by handling challenges specific to these languages, such as the use of numerous scripts and the absence of spaces between words. Using advanced morphological analysis, Rosette Base Linguistics performs functions critical for analyzing CJK text: segmentation, lemmatization, noun decompounding, part-of-speech tagging, sentence boundary detection, and base noun phrase analysis.

Rosette Base Linguistics relies on dictionaries that are continually updated to keep pace with the evolution of each language. For further detail on the dictionaries, please download a datasheet.

Features Include:

  • Segmentation and tokenization: Segmenting CJK text into individual word tokens
  • Lemmatization: Providing the dictionary base form of an inflected verb or adjective
  • Noun decompounding: Separating compound nouns into their component nouns
  • Part-of-speech tagging: Identifying a word’s part of speech, such as noun, verb or preposition
  • Sentence boundary detection: Marking the boundaries of individual sentences
  • Base noun-phrase analysis: Identifying groups of words, including a noun, that together form a single nominal expression, such as “学生時代” (“school days”) and “前の年” (“previous year”)
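To make the segmentation problem concrete, here is a minimal sketch in Python of forward maximum matching, a simple dictionary-based technique. It is not Rosette's method, which uses morphological analysis, and it does not use the Rosette API. The function name `segment` and the tiny `LEXICON` are assumptions made up for illustration.

```python
def segment(text, lexicon):
    """Split unspaced text into tokens by forward maximum matching.

    At each position, take the longest lexicon entry that matches;
    if nothing matches, emit a single character and move on.
    """
    max_len = max(map(len, lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon, built from the example phrases above.
LEXICON = {"学生", "時代", "前", "の", "年"}

print(segment("学生時代の前の年", LEXICON))
```

A real analyzer also has to resolve ambiguous splits and handle words missing from the dictionary, which greedy matching does not.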

Rosette Japanese Orthographic Analyzer (JOA)

The Rosette Japanese Orthographic Analyzer (JOA) is a dictionary-driven software component that normalizes different orthographic forms of a Japanese word to a single canonical form. The problem resembles spelling variation in English, as seen in foreign words and names (e.g. Osama and Usama). Because purely algorithmic approaches are prone to error, the JOA dictionary consists of thousands of variations observed in actual texts by lexicographers. The current JOA data set is focused on general-purpose web search, and JOA is designed to help searches find Katakana orthographic variants as well as Kanji variants.

  • Katakana variant normalization: the canonical form is on the left, and the other forms are normalized to it: ダンスセラピー ← ダンスセラピ / ダンステラピ / ダンステラピー
  • JOA supports new and old forms of kanji, for example: 渡辺 ← 渡邊 and 国語 ← 國語
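The idea of dictionary-driven normalization can be sketched in a few lines of Python. This is an illustration only, not the JOA API: the table `VARIANT_TO_CANONICAL` and the function `normalize` are hypothetical names, and the entries are just the examples listed above.

```python
# Hypothetical variant table: each observed variant maps to its canonical form.
VARIANT_TO_CANONICAL = {
    "ダンスセラピ": "ダンスセラピー",
    "ダンステラピ": "ダンスセラピー",
    "ダンステラピー": "ダンスセラピー",
    "渡邊": "渡辺",
    "國語": "国語",
}


def normalize(token):
    """Return the canonical form of a token, or the token itself if unknown."""
    return VARIANT_TO_CANONICAL.get(token, token)


# Index and query terms are normalized the same way, so variants match.
print(normalize("ダンステラピー") == normalize("ダンスセラピ"))
```

Applying the same normalization at index time and at query time is what lets a search for one spelling find documents that use another.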

Rosette Chinese Script Converter

Also available is the Rosette Chinese Script Converter, for automatic conversion between Simplified and Traditional Chinese script. Chinese Script Converter solves the information retrieval issues stemming from the major differences between SC and TC, including character sets, encoding methods, orthography, vocabulary, and semantics. For example, “taxi” is written as “出租汽车” in Simplified Chinese and “計程車” in Traditional Chinese.
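Because the differences include vocabulary as well as characters, a character-by-character mapping is not enough. The following Python sketch, under the assumption of two tiny hand-made tables, tries phrase-level entries first and falls back to single characters. The names `PHRASES_SC_TO_TC`, `CHARS_SC_TO_TC` and `to_traditional` are hypothetical and do not reflect the actual Chinese Script Converter interface.

```python
# Hypothetical toy tables; a real converter relies on large curated dictionaries.
PHRASES_SC_TO_TC = {"出租汽车": "計程車"}           # vocabulary-level difference
CHARS_SC_TO_TC = {"车": "車", "语": "語", "国": "國"}  # character-level difference


def to_traditional(text):
    """Convert Simplified to Traditional: longest phrase match, then per character."""
    longest = max(map(len, PHRASES_SC_TO_TC))
    out = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in PHRASES_SC_TO_TC:
                out.append(PHRASES_SC_TO_TC[chunk])
                i += size
                break
        else:
            # No phrase matched here: map the single character, or keep it as-is.
            out.append(CHARS_SC_TO_TC.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(to_traditional("出租汽车"))
```

With only the character table, “出租汽车” would become “出租汽車” rather than the vocabulary-level equivalent “計程車”.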

Text Analytics

KEY FEATURES

  • Simple API
  • High-scale and Throughput
  • Industrial-strength Support
  • Easy Installation
  • Flexible and Customizable
  • Integration: Java, C++, or Web Services
  • Platform: Unix, Linux, Mac, Windows
  • Component of the Rosette SDK
  • Customizable user dictionaries, Japanese orthographic normalization, and Chinese script conversion

Whitepaper: Morphological Analysis for Chinese, Japanese and Korean

The best approach for indexing Chinese, Japanese and Korean text for search engines.


Contact us about integrating Rosette Base Linguistics
into your search application:
