Whitepaper: N-Gram vs. Morphological Analysis for Searching Chinese, Japanese, and Korean

Supported Platforms

Windows, Linux, Solaris, AIX, HPUX, and MacOS

Languages Supported by Rosette Base Linguistics

  • Albanian
  • Arabic
  • Bulgarian
  • Catalan
  • Chinese (Simplified)
  • Chinese (Traditional)
  • Croatian
  • Czech
  • Danish
  • Dutch
  • English
  • Estonian
  • Finnish
  • French
  • German
  • Greek
  • Hebrew
  • Hungarian
  • Indonesian
  • Italian
  • Japanese
  • Korean
  • Latvian
  • Malay
  • Norwegian
  • Pashto
  • Persian (Farsi / Dari)
  • Polish
  • Portuguese
  • Romanian
  • Russian
  • Serbian
  • Slovak
  • Slovenian
  • Spanish
  • Swedish
  • Thai
  • Turkish
  • Ukrainian
  • Urdu

Rosette Base Linguistics

Sophisticated morphological analysis, segmentation, and tagging of Arabic, Asian, and European language text

Rosette Base Linguistics are featured in some of the world’s most widely used information retrieval and text mining applications, where they perform critical functions such as tokenization, decompounding, and part-of-speech (POS) analysis. What sets our Base Linguistics apart is the way they’ve been built.

“Basis Technology’s Rosette Base Linguistics analyzers have been essential in helping us build a strong customer base among large multinational companies. Incorporating these high-quality, high-performance linguistic engines into our enterprise search products has allowed us to meet the demands of organizations that need advanced search and navigation capabilities in multiple languages.” — Mark Watkins, Vice President of Development, Endeca

Rosette Base Linguistics rely on a morphological approach to analyzing text in different languages. This means that Base Linguistics work with the specific features of a given language: punctuation, actual words, word forms, and affixes. Analysis is further backed up by dictionary data for key functions such as lemmatization and POS tagging.

Key Features

  • Tokenization splits text into individual words, a requirement for automated analysis of languages lacking spaces between words, such as Chinese, Japanese, and Korean.
  • Lemmatization generates the dictionary form of each word, which increases search relevancy and slims the search index by indexing only lemmas (“cruise”) rather than all inflected forms (“cruising,” “cruised”).
  • Decompounding breaks compound words into sub-components to increase search relevancy for German, Dutch, Korean, and Scandinavian languages.
  • Part-of-Speech Tagging is used during lemmatization to select the correct dictionary form of ambiguous words, such as “spoke,” which can be a noun or the past tense of the verb “speak.”
  • Sentence Boundary Detection locates the start and end of sentences.
  • Noun Phrase Analysis groups nouns and their modifiers, useful in document clustering algorithms.
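
To make the lemmatization, part-of-speech, and decompounding features above concrete, here is a minimal Python sketch. It is not the Rosette API: the function names, the tiny LEMMAS table, the GERMAN_WORDS vocabulary, and the POS labels are all invented for illustration, and the decompounder ignores linking elements that a production analyzer would need to handle.

```python
# Toy dictionary data, invented for illustration (not Rosette's dictionaries).
# Keys are (surface form, part of speech); values are dictionary forms.
LEMMAS = {
    ("cruising", "VERB"): "cruise",
    ("cruised", "VERB"): "cruise",
    ("cruise", "VERB"): "cruise",
    ("spoke", "VERB"): "speak",   # past tense of "speak"
    ("spoke", "NOUN"): "spoke",   # part of a wheel
}

# Hypothetical vocabulary used by the decompounder below.
GERMAN_WORDS = {"haus", "tür", "schlüssel"}


def lemmatize(word, pos):
    """Return the dictionary form for (word, pos), or the lowercased word if unknown."""
    return LEMMAS.get((word.lower(), pos), word.lower())


def build_index(docs):
    """Map each lemma to the set of document ids whose tagged tokens contain it."""
    index = {}
    for doc_id, tagged_tokens in docs.items():
        for word, pos in tagged_tokens:
            index.setdefault(lemmatize(word, pos), set()).add(doc_id)
    return index


def decompound(word, vocabulary):
    """Split a compound into known vocabulary parts; return [word] if no full split exists."""
    word = word.lower()

    def split(rest):
        if rest in vocabulary:
            return [rest]
        for i in range(1, len(rest)):
            if rest[:i] in vocabulary:
                tail = split(rest[i:])
                if tail is not None:
                    return [rest[:i]] + tail
        return None

    parts = split(word)
    return parts if parts is not None else [word]


if __name__ == "__main__":
    docs = {
        1: [("cruising", "VERB")],
        2: [("cruised", "VERB"), ("spoke", "NOUN")],
        3: [("spoke", "VERB")],
    }
    index = build_index(docs)
    print(sorted(index))                                   # ['cruise', 'speak', 'spoke']
    print(decompound("Haustürschlüssel", GERMAN_WORDS))    # ['haus', 'tür', 'schlüssel']
```

In this sketch the two inflected forms of “cruise” collapse into one index entry, the POS tag decides whether “spoke” is indexed under “speak” or “spoke,” and the German compound becomes searchable by each of its parts.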

Because Base Linguistics are built on a deep understanding of each language, we can continually improve them over time by adding to the dictionaries and introducing new state-of-the-art linguistic methods specific to each language. We are also continuing to add languages.

Please contact info@basistech.com for more information about upcoming releases.

Learn More

For more information on Rosette Base Linguistics, download the product datasheet, request a product evaluation, or read a whitepaper entitled, Morphological Analysis Searches for You—Intelligently.