Chinese Base Linguistics


Analyze all your Chinese text at once, whether written in simplified or traditional script

Overview

Chinese script conversion

Because of the proliferation of Chinese speakers in many countries and regions, the language has many variations. Chinese in China and Taiwan developed independently of each other from the 1950s onward. China chose to transition to a simplified version of Chinese ideographs (hanzi), while Taiwan maintained the traditional script. These variations apply in other parts of the Chinese-speaking world as well. For example, Singapore uses simplified Chinese, while Hong Kong uses traditional Chinese.

For applications that work with Chinese, text must be converted to a single form, either traditional or simplified, before it can be searched and processed correctly.

Levels of conversion: simplified vs. traditional Chinese

There are three levels of conversion, all of which Rosette supports.

  • Codepoint: These are cases where the character is unchanged, but because mainland China and Taiwan use different character encodings, the bytes have to be correctly interpreted and converted to the universal Unicode encoding.
  • Orthographic: In some cases, a single simplified character may map to one or more traditional characters. The correct destination character depends on word context.
  • Lexemic: In some cases—especially for modern objects and concepts—China and Taiwan chose different words to represent a foreign or new word such as “computer” or “Natalie Portman.”

Type of script conversion | Simplified Chinese | Traditional Chinese
Codepoint | 大 (“big”) | 大 (“big”)
Orthographic* | 出发 (“set off”) | 出發 (“set off”)
Orthographic* | 头发 (“hair”) | 頭髮 (“hair”)
Lexemic | 出租汽车 (“taxi”) | 計程車 (“taxi”)

*Note that the simplified Chinese words for “set off” and “hair” share the same second character (发), but the traditional Chinese equivalents use different second characters (發 and 髮).
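
The three levels can be illustrated with a minimal Python sketch. This is not Rosette's API or data: the function names and the tiny mapping tables are hypothetical, built only from the examples in the table above, and the greedy longest-match lookup stands in for whatever context analysis a production converter performs.

```python
# Illustrative only: hand-built tables from the examples above,
# not Rosette's data or API. All names here are hypothetical.

# Lexemic level: the two regions use entirely different words.
LEXEMIC_S2T = {"出租汽车": "計程車"}

# Orthographic level: word-level entries resolve the ambiguous
# simplified character 发, which maps to 發 or 髮 depending on the word.
ORTHOGRAPHIC_S2T = {"出发": "出發", "头发": "頭髮"}


def decode_to_unicode(raw: bytes, encoding: str) -> str:
    """Codepoint level: interpret legacy bytes (e.g. GB2312 or Big5) as Unicode."""
    return raw.decode(encoding)


def simplified_to_traditional(text: str) -> str:
    """Convert by greedy longest match against the merged word tables.

    Characters with no table entry are copied through unchanged,
    which is correct for characters such as 大 that are shared
    by both scripts.
    """
    table = {**ORTHOGRAPHIC_S2T, **LEXEMIC_S2T}
    max_len = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so whole words win over parts.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    # The same character has different bytes in each legacy encoding.
    print("大".encode("gb2312") != "大".encode("big5"))
    print(simplified_to_traditional("头发"))
    print(simplified_to_traditional("出租汽车"))
```

Looking up whole words rather than single characters is what lets the sketch choose between 發 and 髮; a character-by-character table could not make that distinction.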

Product Highlights

  • Chinese script conversion between traditional and simplified scripts
  • Dictionary to customize tokenization, lemmatization, and readings
  • Tokenization
  • Lemmatization
  • Part-of-speech tagging
  • Noun decompounding
  • Chinese readings
  • Sentence tagging

How It Works

Hybrid approach for high quality

Our base linguistics uses a combination of dictionaries, statistical modeling, and rules to achieve high-quality results.

  • Tokenization: Chinese is written without spaces between words. Identifying word boundaries improves search accuracy more than language-agnostic n-gram techniques do.
  • Lemmatization: Lemmatization (finding the dictionary form of a word) means a system can associate related words based on meaning, or index the lemma in place of the many surface forms. A given token may have more than one possible lemma.
  • Part-of-speech tagging: Rosette selects the most likely POS tag based on sentence context.
  • Chinese readings: Rosette provides the pronunciation (reading) of Chinese tokens—useful for text-to-speech or input method editor programs.
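
To make the tokenization point concrete, the sketch below contrasts character bigrams with a simple dictionary-based segmenter. It uses forward maximum matching, a deliberately basic technique chosen for illustration; it is not Rosette's hybrid method, and the function names and the small lexicon are hypothetical.

```python
# Illustrative only: a toy lexicon and a basic segmenter,
# not Rosette's hybrid approach. All names are hypothetical.


def char_bigrams(text: str) -> list[str]:
    """Language-agnostic baseline: every overlapping pair of characters."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def max_match(text: str, lexicon: set[str]) -> list[str]:
    """Forward maximum matching: at each position take the longest
    lexicon word, falling back to a single character."""
    longest = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            # A single character is always accepted so the loop advances.
            if size == 1 or chunk in lexicon:
                tokens.append(chunk)
                i += size
                break
    return tokens


if __name__ == "__main__":
    lexicon = {"我", "喜欢", "出租", "汽车", "出租汽车"}
    sentence = "我喜欢出租汽车"
    print(max_match(sentence, lexicon))
    print(char_bigrams(sentence))
```

On this sentence the segmenter yields three real words, while the bigram baseline yields six fragments, including pairs such as 欢出 that straddle a word boundary and can produce false matches at search time.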

User Customizable

Our on-premises SDK supports user dictionaries, which let you modify or correct Rosette's behavior by adding new words.