
East Asian Language Issues

Chinese

  • Chinese Text Analysis  This presentation surveys the problems associated with automatic processing of Chinese. It reviews the various Chinese character sets and encoding systems; input methods and transliteration; and the solutions offered by Basis Technology’s Chinese Language Analyzer and Named Entity Extractor.

  • Processing the Mosaic of Chinese Dialects  This presentation explores the taxonomy of modern Chinese and illustrates the difficulties of processing it through case studies of a dialect, Wu Chinese (spoken in the Shanghai area), and a Mandarin variant, Sichuanese (as spoken in Chengdu, the capital of Sichuan province).

  • The Web as a Corpus for Chinese Natural Language Processing  This presentation discusses how Basis Technology used the Web as a corpus for Chinese, the problems Basis overcame (or avoided), and how it all turned out, both as a problem of Chinese linguistics and as a challenge of downloading, filtering, and processing terabytes of raw web pages from the Internet.

    Presentation by John O’Neil.

  • Large Corpus Construction for Chinese Lexicon Development  The World Wide Web provides an important source of natural language data in many languages. However, it doesn’t include annotation about linguistic structure, so it’s necessary to use very large corpora to infer it. We developed a system for continuous, automatic acquisition of a Chinese lexicon. An up-to-date lexicon is needed for many applications, but Chinese is written without spaces between words, so determining word boundaries is the primary problem. We discuss our experience with using the Chinese Web for lexicon construction, focusing on both low-level details and problems we experienced during our initial proof-of-concept experiments, and on algorithmic issues.
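
The word-boundary problem mentioned in the last abstract can be illustrated with a minimal sketch of forward maximum matching, a textbook baseline that greedily takes the longest lexicon entry at each position. This is not presented as the method used in any of the presentations above; the function name, the `max_len` parameter, and the toy lexicon are all assumptions made for illustration.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy longest-match segmentation against a word list."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Toy lexicon, invented for illustration only.
TOY_LEXICON = {"北京", "大学", "北京大学", "学生", "生活"}

print(forward_max_match("北京大学生活", TOY_LEXICON))
print(forward_max_match("北京大学生", TOY_LEXICON))
```

The second call shows why the quality of the lexicon matters: the greedy match consumes 北京大学 and leaves 生 stranded, because the toy lexicon has no entry that would support another reading.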

Thai

  • Thai, the Tiger of Text Analysis: An Introduction to Thai Text Processing  In natural language processing (NLP), Arabic is known as a complex language, but the less-studied Thai poses even more intriguing challenges. Syllable boundaries are ambiguous because some vowels are written before a consonant, some above it, and some as combinations of the two. This variation makes it difficult to decide where a syllable boundary falls and, consequently, what sound a character represents, since a character’s pronunciation varies with its position in the syllable. Moreover, Thai has no explicit word boundary marker and makes productive use of compounds: conference “proceedings” is literally “book collect article about academic in meeting seminar.” As a result, many character strings cannot be segmented into words in a straightforward manner. This presentation discusses some previous NLP approaches to Thai word segmentation and also looks at related issues in Romanization, transliteration, and search technologies.
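
The point about vowel placement can be seen directly in how Thai text is stored. Unicode encodes Thai in visual order, so a vowel written before a consonant also comes first in the string even though it is pronounced after it. The sketch below, which uses only Python's standard `unicodedata` module, lists the code points of two short samples; the helper name `describe` and the choice of samples are assumptions made for illustration.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name, general category) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]


# Sample 1: vowel SARA E is stored before the consonant KO KAI.
# Sample 2: consonant KO KAI is followed by SARA I, which is drawn above it.
for sample in ("\u0e40\u0e01", "\u0e01\u0e34"):
    for row in describe(sample):
        print(row)
    print()
```

In the first sample the vowel is an ordinary letter (category `Lo`) that precedes its consonant, while in the second the vowel is a nonspacing mark (category `Mn`) that follows it, so a syllable boundary cannot be found just by looking for a consonant followed by a vowel.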