Japanese Base Linguistics

Specialized text analytics tools for more efficient and accurate processing of Japanese text



Katakana spelling variations

Katakana words are Japanese phonetic approximations of foreign words, so different writers often spell the same word differently. Our Japanese base linguistics tools normalize these variations to a single spelling, so a search for “Venice” finds every occurrence, whether the author wrote ベニス, ベネツィア, ヴェネチア, or ヴェネツィア.
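To make the idea concrete, here is a minimal sketch of variant normalization at index and query time. It is an illustration only, not Rosette's implementation: the table `CANONICAL`, the choice of ヴェネツィア as the canonical form, the helper names, and the sample documents are all assumptions made for this example.

```python
# Hypothetical variant table: every known spelling maps to one canonical form.
# The entries are the "Venice" spellings mentioned above; the choice of
# canonical form is an assumption made for this sketch.
CANONICAL = {
    "ベニス": "ヴェネツィア",
    "ベネツィア": "ヴェネツィア",
    "ヴェネチア": "ヴェネツィア",
    "ヴェネツィア": "ヴェネツィア",
}


def normalize_katakana(token: str) -> str:
    """Return the canonical spelling for a known variant, else the token unchanged."""
    return CANONICAL.get(token, token)


def build_index(docs):
    """Map each normalized token to the set of document ids that contain it."""
    index = {}
    for doc_id, tokens in docs.items():
        for tok in tokens:
            index.setdefault(normalize_katakana(tok), set()).add(doc_id)
    return index


# Made-up, pre-tokenized documents that spell "Venice" differently.
docs = {
    1: ["ベニス", "旅行"],
    2: ["ヴェネチア", "映画祭"],
    3: ["東京", "旅行"],
}
index = build_index(docs)

# The query uses yet another spelling; normalizing it the same way as the
# indexed tokens lets it reach documents 1 and 2.
hits = index.get(normalize_katakana("ベネツィア"), set())
```

Because the same normalization is applied to both the indexed tokens and the query, the spelling used in the query does not need to match the spelling used in any document.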

Variations on “Bermuda” and “expo” in Japanese

Modern vs. old kanji

Japan borrowed Chinese ideographs (called “kanji”) from China centuries ago. Modern kanji are somewhat simplified compared with the older forms, but in a fair number of cases both the older and the modern versions remain in use, which leads to spelling variations. Our Japanese base linguistics normalizes older kanji to their modern versions.
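As a rough illustration, old-to-modern normalization can be thought of as a character-level mapping. The sketch below covers only a handful of well-known old-form/modern-form pairs; the table name `OLD_TO_MODERN` and the function `normalize_kanji` are assumptions for this example and say nothing about how Rosette stores or applies its mappings.

```python
# A few well-known old-form to modern-form kanji pairs. A production table
# would be far larger; this one exists only for illustration.
OLD_TO_MODERN = str.maketrans({
    "國": "国",
    "學": "学",
    "體": "体",
    "櫻": "桜",
    "澤": "沢",
    "廣": "広",
})


def normalize_kanji(text: str) -> str:
    """Replace each old-form kanji in the table with its modern form."""
    return text.translate(OLD_TO_MODERN)


modern = normalize_kanji("國立大學")
```

Characters that are not in the table pass through unchanged, so text already written in modern kanji is unaffected.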

Product Highlights

  • Normalize old kanji to modern kanji
  • Normalize katakana spelling variations
  • User dictionary to customize tokenization, lemmatization, and readings
  • Sentence tagging
  • Tokenization
  • Lemmatization
  • Part-of-speech tagging
  • Noun decompounding
  • Japanese readings

Older kanji variations converted to modern kanji

How It Works

Hybrid approach for high quality

Our base linguistics uses a combination of dictionaries, statistical modeling, and rules to achieve high quality results.

  • Tokenization: Japanese is written using three scripts with no spaces between words. Tokenizing on word boundaries improves search accuracy more than language-agnostic n-gram techniques do.
  • Lemmatization: Lemmatization (finding the dictionary form of a word) lets a system associate related words based on meaning, or index the lemma in place of the many surface forms. A given token may have more than one possible lemma; Rosette chooses the best candidate based on context.
  • Part-of-speech tagging: Rosette selects the most likely POS tag based on sentence context.
  • Noun decompounding: Splitting compound nouns into their component words lets search engines match queries on the components, which increases search recall.
  • Japanese readings: The pronunciation of a kanji varies with its context. Rosette provides possible readings for each token, which is useful for text-to-speech or input method editor programs.
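The tokenization point above can be illustrated with a small comparison of character bigrams and word-boundary tokenization. This is a sketch under stated assumptions: the tiny `LEXICON` and the greedy longest-match strategy are stand-ins chosen for brevity, not the hybrid dictionary, statistical, and rule-based approach described above.

```python
# Toy lexicon, assumed for this example only.
LEXICON = {"東京", "東京都", "京都", "都", "に", "住む"}
MAX_LEN = max(len(word) for word in LEXICON)


def char_bigrams(text):
    """Language-agnostic baseline: every overlapping pair of characters."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def longest_match_tokenize(text):
    """Greedy longest-match segmentation against LEXICON.

    Characters not covered by the lexicon are emitted as single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in LEXICON or length == 1:
                tokens.append(candidate)
                i += length
                break
    return tokens


text = "東京都に住む"  # "(I) live in Tokyo Metropolis"
bigrams = char_bigrams(text)
words = longest_match_tokenize(text)

# The bigram index contains 京都 (Kyoto) even though the sentence is about
# Tokyo, so a query for Kyoto would match it. The word tokens do not.
false_positive_with_bigrams = "京都" in bigrams
false_positive_with_words = "京都" in words
```

The bigram approach needs no dictionary, but it produces fragments that cross word boundaries; segmenting on word boundaries avoids that kind of false match.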

User customizable

Our on-premises SDK supports user dictionaries, which let users modify or correct Rosette's behavior by adding new words, noun decompounds, readings, and lemmas.
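Conceptually, a user dictionary acts as an override layer consulted before the built-in lexicon. The sketch below shows that idea only; the `Entry` fields, the `BASE` and `USER` tables, and the `lookup` function are hypothetical and do not reflect the SDK's actual dictionary format or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical dictionary record, invented for this sketch."""
    surface: str       # the word as written in text
    lemma: str         # dictionary form
    reading: str       # katakana reading
    parts: tuple = ()  # components, if the word is a compound noun


# Stand-in for a built-in lexicon.
BASE = {
    "携帯": Entry("携帯", "携帯", "ケイタイ"),
    "電話": Entry("電話", "電話", "デンワ"),
}

# Stand-in for user-supplied entries: a compound with its decompounding.
USER = {
    "携帯電話": Entry("携帯電話", "携帯電話", "ケイタイデンワ", ("携帯", "電話")),
}


def lookup(surface, user=USER, base=BASE):
    """Return the user entry if one exists, otherwise fall back to the base lexicon."""
    return user.get(surface) or base.get(surface)


compound = lookup("携帯電話")
```

Checking the user table first means a user entry can both add a word the base lexicon lacks and replace an entry the user considers wrong.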