Whitepaper: N-Gram vs. Morphological Analysis for Searching Chinese, Japanese, and Korean

Supported Platforms

Windows, Linux, Solaris, and MacOS

Languages Supported by Rosette Base Linguistics

  • Albanian
  • Arabic
  • Bulgarian
  • Catalan
  • Chinese (Simplified)
  • Chinese (Traditional)
  • Croatian
  • Czech
  • Danish
  • Dutch
  • English
  • Estonian
  • Finnish
  • French
  • German
  • Greek
  • Hebrew
  • Hungarian
  • Indonesian
  • Italian
  • Japanese
  • Korean
  • Latvian
  • Malay
  • Norwegian
  • Pashto
  • Persian (Farsi / Dari)
  • Polish
  • Portuguese
  • Romanian
  • Russian
  • Serbian
  • Slovak
  • Slovenian
  • Spanish
  • Swedish
  • Thai
  • Turkish
  • Ukrainian
  • Urdu

Rosette Base Linguistics

Sophisticated morphological analysis, segmentation, and tagging of Arabic, Asian, and European language text

Rosette Base Linguistics is featured in some of the world’s most widely used information retrieval and text mining applications, where it performs critical functions such as tokenization, decompounding, and part-of-speech (POS) tagging. What sets this product apart is the way it’s been built.

“Basis Technology’s Rosette Base Linguistics analyzers have been essential in helping us build a strong customer base among large multinational companies. Incorporating these high-quality, high-performance linguistic engines into our enterprise search products has allowed us to meet the demands of organizations that need advanced search and navigation capabilities in multiple languages.” — Mark Watkins, Vice President of Development, Endeca

Rosette Base Linguistics relies on a morphological approach to analyzing text in different languages. This means that Rosette works with the specific features of a given language: punctuation, actual words, word forms and affixes. Analysis is further backed up by dictionary data for key functions such as lemmatization and POS tagging.
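To make the contrast with language-agnostic methods concrete, here is a minimal sketch of dictionary-backed word segmentation next to character bigrams. It is not Rosette's implementation: the greedy longest-match strategy, the function names, and the five-entry toy lexicon are all assumptions chosen for illustration.

```python
def bigrams(text):
    """Language-agnostic baseline: every overlapping pair of characters."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def segment(text, lexicon):
    """Greedy longest-match segmentation against a word list.

    At each position, take the longest lexicon entry that matches;
    fall back to a single character when nothing matches.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy lexicon; a real analyzer relies on a far larger dictionary.
TOY_LEXICON = {"東京", "東京都", "京都", "に", "住む"}

text = "東京都に住む"  # "live in Tokyo Metropolis"
print(bigrams(text))               # ['東京', '京都', '都に', 'に住', '住む']
print(segment(text, TOY_LEXICON))  # ['東京都', 'に', '住む']
```

In this toy case the bigram output contains 京都 (Kyoto), which would let a search for Kyoto match a document about Tokyo, while the dictionary-based output does not.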

Key Features

  • Tokenization segments text into words, a requirement for automated analysis of languages lacking spaces between words, such as Chinese, Japanese, and Korean.
  • Lemmatization generates the dictionary form of each word, increasing search relevancy and slimming the search index by indexing only lemmas (“cruise”) rather than all inflected forms (“cruising,” “cruised”).
  • Decompounding breaks compound words into sub-components to increase search relevancy for German, Dutch, Korean, and Scandinavian languages.
  • Part-of-Speech Tagging is used during lemmatization to select the correct dictionary form of ambiguous words, such as the noun or verb “spoke.”
  • Sentence Boundary Detection locates the start and end of sentences.
  • Noun Phrase Analysis groups nouns and their modifiers, useful in document clustering algorithms.
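The lemmatization and part-of-speech points above can be sketched with a toy index. This is an illustration under stated assumptions, not the product's API: the `TOY_LEMMAS` table, the tag names, and the pre-tagged documents are hypothetical, and a real system would obtain the tags from a POS tagger.

```python
from collections import defaultdict

# Hypothetical toy dictionary: (surface form, POS tag) -> lemma.
TOY_LEMMAS = {
    ("cruising", "VERB"): "cruise",
    ("cruised", "VERB"): "cruise",
    ("cruise", "VERB"): "cruise",
    ("spoke", "VERB"): "speak",   # past tense of "speak"
    ("spoke", "NOUN"): "spoke",   # part of a wheel
}


def lemmatize(word, pos):
    """Look up the dictionary form; the POS tag resolves ambiguous words."""
    key = (word.lower(), pos)
    return TOY_LEMMAS.get(key, word.lower())


def build_index(docs):
    """Map each lemma to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tagged_words in docs.items():
        for word, pos in tagged_words:
            index[lemmatize(word, pos)].add(doc_id)
    return dict(index)


# Hypothetical pre-tagged documents: id -> list of (word, POS) pairs.
DOCS = {
    1: [("cruising", "VERB")],
    2: [("cruised", "VERB")],
    3: [("spoke", "VERB")],
    4: [("spoke", "NOUN")],
}

index = build_index(DOCS)
print(sorted(index))            # ['cruise', 'speak', 'spoke']
print(sorted(index["cruise"]))  # [1, 2]
```

Four surface-form entries collapse to three index keys, a query for "cruise" reaches both inflected documents, and the two senses of "spoke" land under different lemmas.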

Because Rosette Base Linguistics is built on a deep understanding of each language it supports, we can continually improve it by adding dictionary entries and new linguistic methods specific to that language. We are also continuing to add languages.

Please contact info@basistech.com for more information about upcoming releases.

Learn More

For more information on Rosette Base Linguistics, download the product datasheet, request a product evaluation, or read a whitepaper entitled, Morphological Analysis Searches for You—Intelligently.