-
Chinese Text Analysis
This presentation surveys the problems associated
with automatic processing of Chinese. It reviews the various Chinese
character sets and encoding systems; input methods and transliteration; and
the solutions offered by Basis Technology’s Chinese Language Analyzer and
Named Entity Extractor.
Presentation by Joe Ho at Basis Technology’s Government Users
Conference in Chantilly, Virginia on June 8, 2010.
-
Processing the Mosaic of Chinese Dialects
This presentation explores the taxonomy
of modern Chinese and illustrates the difficulties it poses for automatic processing through
case studies of a dialect, Wu Chinese (spoken in the Shanghai area) and a
Mandarin variant, Sichuanese (as spoken in Chengdu, the capital of Sichuan
province).
Presentation by Benjamin Swanson at Basis Technology’s Government
Users Conference in College Park, MD on May 20, 2008.
-
The Web as a Corpus for Chinese Natural Language Processing
This presentation
discusses how Basis Technology built a Chinese corpus from the Web: how the
process worked, the problems Basis overcame (or avoided), and how it all
turned out, both as a problem of Chinese linguistics and as a challenge of
downloading, filtering, and processing terabytes of raw web pages.
Presentation by John O’Neil, Ph.D. at Basis Technology’s Government
Users Conference in Washington, D.C. on June 7, 2007.
-
Large Corpus Construction for Chinese Lexicon Development
The World Wide Web provides
an important source of natural language data in many languages. However, web
text carries no annotation of linguistic structure, so that structure must be
inferred from very large corpora. We developed a system for continuous,
automatic acquisition of a Chinese lexicon. An up-to-date lexicon is needed
for many applications, but Chinese is written without spaces between words,
so determining word boundaries is the primary problem. We discuss our
experience with using the Chinese Web for lexicon construction, focusing on
the low-level details and problems we encountered during our initial
proof-of-concept experiments, and on algorithmic issues.
Presentation by Thomas Emerson at the 29th Internationalization &
Unicode Conference in San Francisco, CA on March 7-9, 2006.