Japanese Base Linguistics
Katakana spelling variations
Because katakana words are Japanese phonetic approximations of foreign words, different people often spell the same word differently. Our Japanese base linguistics tools normalize these variations to a single spelling, so a search for “Venice” finds every occurrence, whether the author wrote ベニス, ベネツィア, ヴェネチア, or ヴェネツィア.
Variations on “Bermuda” and “expo” in Japanese
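As a rough illustration of the idea, the four spellings of “Venice” mentioned above can be folded into one form with a few rewrite rules plus a small synonym table. This is a minimal sketch, not Rosette's implementation: the rules, the synonym table, the choice of ベネチア as the canonical spelling, and the function name `normalize_katakana` are all assumptions made for this example.

```python
import unicodedata

# Hypothetical rewrite rules: fold less common katakana sequences
# into a simpler equivalent. A real system would need many more.
REWRITE_RULES = [
    ("ヴェ", "ベ"),   # "ve" written with ヴ -> "be"
    ("ツィ", "チ"),   # "tsi" -> "chi"
]

# Hypothetical synonym table for spellings that come from a different
# source word (ベニス follows English "Venice", the others follow
# Italian "Venezia"), which rewrite rules alone cannot merge.
SYNONYMS = {
    "ベニス": "ベネチア",
}


def normalize_katakana(word: str) -> str:
    """Map a katakana spelling variant to one canonical spelling."""
    # Compose any decomposed voiced-sound marks first so that the
    # string comparisons below are reliable.
    word = unicodedata.normalize("NFC", word)
    for old, new in REWRITE_RULES:
        word = word.replace(old, new)
    return SYNONYMS.get(word, word)


variants = ["ベニス", "ベネツィア", "ヴェネチア", "ヴェネツィア"]
canonical = {normalize_katakana(v) for v in variants}
print(canonical)  # a single spelling: {'ベネチア'}
```

Indexing and querying on the normalized form means that a search typed with any one spelling matches documents written with any of the others.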
Modern vs. old kanji
Japan adopted Chinese ideographs (called “kanji”) centuries ago. Although modern kanji are somewhat simplified compared with the original forms, in a fair number of cases both the older and modern versions remain in use, leading to spelling variations. Our Japanese base linguistics normalizes older kanji to their modern forms.
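The normalization described above can be sketched as a character-for-character mapping from older forms to modern ones. The table below holds only a handful of well-known pairs and the name `modernize_kanji` is hypothetical; it illustrates the technique and does not reproduce Rosette's data or API.

```python
# A few older-form -> modern-form kanji pairs, for illustration only.
OLD_TO_MODERN = str.maketrans({
    "國": "国",
    "學": "学",
    "櫻": "桜",
    "澤": "沢",
    "體": "体",
})


def modernize_kanji(text: str) -> str:
    """Replace each older kanji in `text` with its modern form."""
    return text.translate(OLD_TO_MODERN)


print(modernize_kanji("櫻井"))  # 桜井
print(modernize_kanji("國學"))  # 国学
```

Applying the same mapping at index time and at query time lets a search written with modern kanji match text written with the older forms, and vice versa.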
- Normalize old kanji to modern kanji
- Normalize katakana spelling variations
- Dictionary to customize tokenization, lemmatization, readings
- Sentence tagging
- Part-of-speech tagging
- Noun decompounding
- Japanese readings
Older kanji variations converted to modern kanji
How It Works
Hybrid approach for high quality
Our base linguistics uses a combination of dictionaries, statistical modeling, and rules to achieve high-quality results.
- Tokenization: Japanese is written using three scripts (kanji, hiragana, and katakana) with no spaces between words. Tokenizing on word boundaries improves search accuracy more than language-agnostic n-gram techniques do.
- Lemmatization: Lemmatization (finding the dictionary form of a word) lets a system associate related words based on meaning, or index the lemma in place of its many surface forms. A given token may have more than one possible lemma; Rosette chooses the best candidate based on context.
- Part-of-speech tagging: Rosette selects the most likely POS tag based on sentence context.
- Noun decompounding: Splitting compound nouns into their component words increases search recall, because a query for one component can then match the compound.
- Japanese readings: The pronunciation of a kanji varies with its context. Rosette provides possible readings for each token, which is useful for text-to-speech or input method editor programs.
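To make the tokenization point concrete, the sketch below compares character bigrams with word tokens for the phrase 東京都に住む (“live in Tokyo”). The tokenizer is a toy greedy longest-match over a tiny hypothetical lexicon, swapped in for simplicity; it is not the combination of dictionaries, statistical modeling, and rules described above, and the names `char_bigrams` and `longest_match_tokenize` are assumptions for this example.

```python
def char_bigrams(text: str) -> list[str]:
    """Language-agnostic indexing: every overlapping 2-character window."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def longest_match_tokenize(text: str, lexicon: set[str]) -> list[str]:
    """Toy tokenizer: at each position take the longest lexicon entry,
    falling back to a single character when nothing matches."""
    max_len = max(len(word) for word in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical mini-lexicon, for illustration only.
LEXICON = {"東京都", "東京", "京都", "に", "住む"}

text = "東京都に住む"
bigrams = char_bigrams(text)
tokens = longest_match_tokenize(text, LEXICON)

print(bigrams)  # ['東京', '京都', '都に', 'に住', '住む']
print(tokens)   # ['東京都', 'に', '住む']

# A query for 京都 (Kyoto) falsely matches the bigram index,
# but not the word-level index.
print("京都" in bigrams)  # True
print("京都" in tokens)   # False
```

The bigram index reports a hit for Kyoto in a sentence about Tokyo, which is the kind of false match that word-boundary tokenization avoids.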
Our on-premise SDK supports user dictionaries, which let users modify or correct Rosette's behavior by adding new words, noun decompounds, readings, and lemmas.
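Conceptually, a user dictionary is a set of entries layered over the built-in lexicon, with user entries taking precedence. The sketch below shows that layering with a hypothetical `Entry` record and `build_lexicon` helper; it does not show the SDK's actual dictionary file format or API, and the field names are assumptions chosen to mirror the items listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical dictionary record (not the SDK's real format)."""
    surface: str                      # the word as written
    lemma: str                        # dictionary form
    reading: str                      # pronunciation in katakana
    components: tuple[str, ...] = ()  # noun decompounding, if any


def build_lexicon(base: dict[str, Entry], user_entries: list[Entry]) -> dict[str, Entry]:
    """Return a new lexicon in which user entries override base entries."""
    lexicon = dict(base)  # copy so the base lexicon is left untouched
    for entry in user_entries:
        lexicon[entry.surface] = entry
    return lexicon


# Stand-in for the built-in lexicon.
BASE = {"東京": Entry("東京", "東京", "トウキョウ")}

# A user adds a new compound noun with its reading and decompounding.
USER = [
    Entry(
        surface="東京スカイツリー",
        lemma="東京スカイツリー",
        reading="トウキョウスカイツリー",
        components=("東京", "スカイツリー"),
    ),
]

lexicon = build_lexicon(BASE, USER)
print(lexicon["東京スカイツリー"].components)  # ('東京', 'スカイツリー')
```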