Improving search for users of Lucene
What is Lucene

Lucene is a high-performance, scalable, cross-platform search toolkit available as open source libraries from the Apache Software Foundation. It is widely used as a cost-effective way to add search to a website or software product. Dozens of organizations, including IBM, CNET, and Wikipedia, have put it to use in applications ranging from an online postage stamp shop, to a full-featured enterprise content manager, to a classified ad website for cars, homes, and jobs.

Lucene provides functions that integrators can easily tailor to their own applications, or they can deploy a pre-built Lucene-based engine such as Solr. The Lucene libraries include core search components such as a document indexer, index searcher, query parser, and text analyzer.


The same linguistic software that powers multilingual web search on Google, Microsoft Live Search, Yahoo! and leading enterprise search engines is now available for the Lucene open source search engine.

Like many search engines, Lucene’s standard text analyzer assumes that input text is in English or another European language and uses space characters to mark word boundaries. But in East Asian languages such as Chinese, Japanese and Korean, spaces do not reliably delimit words: Chinese and Japanese are written without spaces between words, and Korean spacing separates word-plus-particle units rather than individual words.
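To see why space-based word splitting breaks down, consider the minimal Java sketch below. It is an illustration only, not Lucene's actual analyzer: the class and method names are invented for this example, and the tokenizer does nothing more than split on whitespace.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustration only: a naive whitespace tokenizer (hypothetical class, not a Lucene API).
final class WhitespaceSplitDemo {

    // Splits text on runs of whitespace, as a space-delimited tokenizer would.
    static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // English: spaces mark word boundaries, so each word becomes its own token.
        System.out.println(tokenize("search engines index words"));
        // Chinese: the sentence contains no spaces, so it comes back as a single token.
        System.out.println(tokenize("我喜欢搜索引擎"));
    }
}
```

With the whole Chinese sentence indexed as one token, a query for a single word inside it has nothing to match against.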

Sending Asian text queries to a Lucene implementation running the standard text analyzer produces nonsense results and makes the search application useless for Asian language documents. The open source community has made available some limited language-specific solutions to this problem, including ones for Chinese, Japanese and Korean.
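One common language-specific workaround is to index overlapping pairs of characters (bigrams) instead of real words; Lucene's own CJKAnalyzer takes this general approach. The sketch below shows the idea in plain Java. The class and method names are hypothetical, and it is a simplification rather than the CJKAnalyzer implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: overlapping character bigrams (hypothetical class, not Lucene code).
final class BigramDemo {

    // Returns every pair of adjacent code points; a one-character input is returned as-is.
    static List<String> bigrams(String text) {
        int[] codePoints = text.codePoints().toArray();
        List<String> tokens = new ArrayList<>();
        if (codePoints.length == 1) {
            tokens.add(text);
            return tokens;
        }
        for (int i = 0; i + 1 < codePoints.length; i++) {
            tokens.add(new String(codePoints, i, 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "搜索引擎" (search engine) yields the bigrams 搜索, 索引 and 引擎.
        System.out.println(bigrams("搜索引擎"));
    }
}
```

The middle bigram, 索引, happens to be the word for "index", although the text contains only "search" (搜索) and "engine" (引擎). Spurious matches of this kind are one reason bigram indexing is a limited substitute for true word segmentation.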

For robust, full-fledged handling of Chinese, Japanese, Korean, Arabic and other languages (18 in all), Basis Technology's interface module for the Rosette® Linguistics Platform (RLP) adds extensive multilingual support to Lucene quickly and easily.

RLP is the same multilingual text analysis technology used by the leading commercial search engines including Google, Yahoo!, Ask, and Live.com Search. That means users can enjoy the same quality of experience with Lucene they have come to expect with their favorite web and enterprise search engines.

Multiple Analyzers in One

One reason RLP can analyze multiple languages is that it offers a robust collection of analyzer components, Rosette® Base Linguistics, for 18 languages, any or all of which can be made available in the same Lucene implementation. Different components support different languages, and more than one type of analysis may be involved in a search.

A built-in language identifier tells RLP which analysis tools to use for a given text. A Unicode converter puts all text into Unicode for uniform text processing.
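The two steps described above, converting to Unicode and then choosing an analyzer, can be sketched with the JDK alone. This is not how RLP identifies languages: the class, the method names, and the returned labels are all assumptions made for the example, and the routing rule looks only at Unicode scripts, which cannot tell apart languages that share a script.

```java
import java.lang.Character.UnicodeScript;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration only: script-based routing (hypothetical class, not the RLP language identifier).
final class ScriptRoutingSketch {

    // Decodes bytes from a known source encoding into a Java (Unicode) string.
    static String toUnicode(byte[] raw, Charset sourceEncoding) {
        return new String(raw, sourceEncoding);
    }

    // Picks a hypothetical analyzer label from the scripts that occur in the text.
    static String guessAnalyzer(String text) {
        boolean han = false, kana = false, hangul = false, arabic = false;
        for (int cp : text.codePoints().toArray()) {
            if (!Character.isLetter(cp)) {
                continue; // skip spaces, digits and punctuation
            }
            switch (UnicodeScript.of(cp)) {
                case HAN: han = true; break;
                case HIRAGANA:
                case KATAKANA: kana = true; break;
                case HANGUL: hangul = true; break;
                case ARABIC: arabic = true; break;
                default: break;
            }
        }
        if (kana) return "japanese";   // kana appears only in Japanese text
        if (hangul) return "korean";
        if (han) return "chinese";     // Han characters with no kana or Hangul
        if (arabic) return "arabic";
        return "space-delimited";      // fall back to a whitespace-based analyzer
    }

    public static void main(String[] args) {
        byte[] legacy = "café".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(toUnicode(legacy, StandardCharsets.ISO_8859_1));
        System.out.println(guessAnalyzer("検索エンジン"));
        System.out.println(guessAnalyzer("search engine"));
    }
}
```

A production identifier would rely on statistical evidence rather than script alone, for instance to separate English from German or Arabic from Persian.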

The base linguistics function of RLP is the starting point for building a search index and refining queries. Advanced linguistic features improve precision and recall of search results.

Segmentation and tokenization: Separates streams of text into individual word tokens as the first step towards building a search index.

Lemmatization: Provides the dictionary base form for an inflected word.

Noun decompounding: Separates compound words (such as those common in German and Dutch) into their component words.

Part-of-speech tagging: Identifies whether a word is a noun, verb, preposition, etc.
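Two of the functions listed above, lemmatization and noun decompounding, can be illustrated with a toy dictionary. The sketch below is not RLP code: the class, the tiny word lists, and the greedy longest-match rule are assumptions chosen to keep the example short. Real decompounding must also handle linking letters and ambiguous splits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustration only: dictionary-based lemmatization and decompounding (hypothetical class).
final class BaseLinguisticsSketch {

    // Toy lemma dictionary: inflected form -> dictionary base form.
    private static final Map<String, String> LEMMAS = new HashMap<>();
    // Toy list of German compound parts.
    private static final Set<String> PARTS =
            new HashSet<>(Arrays.asList("daten", "bank", "such", "maschine"));

    static {
        LEMMAS.put("mice", "mouse");
        LEMMAS.put("ran", "run");
        LEMMAS.put("searches", "search");
    }

    // Returns the base form if the dictionary knows the word, else the lowercased word.
    static String lemma(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        return LEMMAS.getOrDefault(lower, lower);
    }

    // Greedy longest-match split into known parts; returns the whole word if no full split exists.
    static List<String> decompound(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        List<String> parts = new ArrayList<>();
        int start = 0;
        while (start < lower.length()) {
            int end = lower.length();
            while (end > start && !PARTS.contains(lower.substring(start, end))) {
                end--;
            }
            if (end == start) {
                return Collections.singletonList(lower); // unknown segment: keep the word intact
            }
            parts.add(lower.substring(start, end));
            start = end;
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(lemma("Mice"));              // base form of an inflected word
        System.out.println(decompound("Datenbank"));    // German "database" = Daten + Bank
        System.out.println(decompound("Suchmaschine")); // German "search engine" = Such + Maschine
    }
}
```

Indexing the parts alongside the compound lets a query for "Bank" reach a document that mentions only "Datenbank", which improves recall.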

Entity extraction

RLP’s entity extraction functionality can automatically locate names of people, places and organizations in documents and build a dedicated index of these high-value phrases.

Language-specific advanced processing

To standardize spelling variations in Japanese, the dictionary-driven Rosette® Japanese Orthographic Analyzer normalizes different orthographic forms of a Japanese word to a single canonical form, so that a search for a katakana word also finds documents in which the same word is spelled slightly differently.
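A familiar example of such variation is a katakana loanword written with or without a final prolonged sound mark, as in コンピューター and コンピュータ ("computer"). The sketch below normalizes only that one case with a simple rule. It is an assumption-laden illustration with a hypothetical class name; the Rosette analyzer described above is dictionary-driven and is not implemented this way.

```java
import java.lang.Character.UnicodeScript;

// Illustration only: one rule-based katakana normalization (hypothetical class, not RLP code).
final class KatakanaVariantSketch {

    private static final char PROLONGED_SOUND_MARK = '\u30FC'; // the "ー" character

    // True if every character is katakana or the prolonged sound mark.
    static boolean isKatakanaWord(String word) {
        if (word.isEmpty()) {
            return false;
        }
        return word.codePoints().allMatch(cp ->
                cp == PROLONGED_SOUND_MARK || UnicodeScript.of(cp) == UnicodeScript.KATAKANA);
    }

    // Drops a trailing prolonged sound mark from katakana words of four or more characters.
    static String normalize(String word) {
        boolean endsWithMark = !word.isEmpty()
                && word.charAt(word.length() - 1) == PROLONGED_SOUND_MARK;
        if (word.length() >= 4 && endsWithMark && isKatakanaWord(word)) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    public static void main(String[] args) {
        // Both spellings of "computer" normalize to the same indexed form.
        System.out.println(normalize("コンピューター"));
        System.out.println(normalize("コンピュータ"));
    }
}
```

Applying the same normalization at index time and at query time is what lets either spelling retrieve documents that use the other.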

The Rosette® Chinese Script Converter lets readers of Simplified or Traditional Chinese locate information written in the other script. For example, users in Taiwan might wish to locate information in documents from Mainland China.
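The basic idea of script conversion can be sketched as a character mapping applied to both indexed text and queries. The table below holds only a handful of characters and the class name is hypothetical; it is not the Rosette converter. Real conversion is not strictly one-to-one and also has to deal with regional vocabulary differences.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: a tiny Traditional-to-Simplified mapping (hypothetical class, not RLP code).
final class ChineseScriptSketch {

    private static final Map<Character, Character> TRADITIONAL_TO_SIMPLIFIED = new HashMap<>();

    static {
        TRADITIONAL_TO_SIMPLIFIED.put('圖', '图');
        TRADITIONAL_TO_SIMPLIFIED.put('書', '书');
        TRADITIONAL_TO_SIMPLIFIED.put('館', '馆');
        TRADITIONAL_TO_SIMPLIFIED.put('電', '电');
        TRADITIONAL_TO_SIMPLIFIED.put('腦', '脑');
    }

    // Replaces each mapped Traditional character; all other characters pass through unchanged.
    static String toSimplified(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            out.append(TRADITIONAL_TO_SIMPLIFIED.getOrDefault(c, c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Traditional "library" and "computer" converted to their Simplified forms.
        System.out.println(toSimplified("圖書館"));
        System.out.println(toSimplified("電腦"));
    }
}
```

If both documents and queries are normalized to one script before indexing and searching, a query typed in Traditional characters can match a document written in Simplified characters, and vice versa.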

The involvement of multiple techniques and multilingual analysis tools is hidden from the application, which sees only one analyzer, presented through a single Java or C++ API. Since RLP consists of independent modules that can be selectively turned on or off based on an application’s specific needs, developers can start with the linguistics they need today and quickly add more advanced linguistics (or support for more languages) as needed tomorrow.
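The design described above, one analyzer facade over independently switchable modules, can be sketched in a few lines of Java. Everything here is hypothetical: the class, the module names, and the two sample stages are assumptions for illustration and do not reflect the actual RLP API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Illustration only: a single analyzer entry point over optional stages (hypothetical class).
final class ModularAnalyzerSketch {

    // Stages run in registration order; only enabled ones are applied.
    private final Map<String, UnaryOperator<List<String>>> modules = new LinkedHashMap<>();
    private final Set<String> enabled = new HashSet<>();

    ModularAnalyzerSketch register(String name, UnaryOperator<List<String>> stage) {
        modules.put(name, stage);
        return this;
    }

    ModularAnalyzerSketch enable(String name) {
        if (!modules.containsKey(name)) {
            throw new IllegalArgumentException("Unknown module: " + name);
        }
        enabled.add(name);
        return this;
    }

    ModularAnalyzerSketch disable(String name) {
        enabled.remove(name);
        return this;
    }

    // The only method the application calls; the stages behind it stay hidden.
    List<String> analyze(String text) {
        String trimmed = text.trim();
        List<String> tokens = new ArrayList<>();
        if (!trimmed.isEmpty()) {
            tokens.addAll(Arrays.asList(trimmed.split("\\s+")));
        }
        for (Map.Entry<String, UnaryOperator<List<String>>> module : modules.entrySet()) {
            if (enabled.contains(module.getKey())) {
                tokens = module.getValue().apply(tokens);
            }
        }
        return tokens;
    }

    // Builds an analyzer with two sample stages registered but not yet enabled.
    static ModularAnalyzerSketch demo() {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a", "of"));
        return new ModularAnalyzerSketch()
                .register("lowercase", tokens -> tokens.stream()
                        .map(t -> t.toLowerCase(Locale.ROOT))
                        .collect(Collectors.toList()))
                .register("stopwords", tokens -> tokens.stream()
                        .filter(t -> !stopWords.contains(t))
                        .collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        ModularAnalyzerSketch analyzer = demo().enable("lowercase");
        System.out.println(analyzer.analyze("The Quick Fox"));
        analyzer.enable("stopwords"); // add a second stage later without changing callers
        System.out.println(analyzer.analyze("The Quick Fox"));
    }
}
```

Because callers depend only on analyze, enabling another stage later changes the tokens produced without any change to the calling code.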

All You Need To Do

Integrating RLP into Lucene-enabled applications is straightforward. RLP is distributed as a downloadable SDK or runtime that includes executable RLP modules and the Rosette® Lucene Integration Module.

Lucene invokes RLP functions by calling an API and providing appropriate values, such as the location of documents to be indexed. The Lucene integration module enables RLP to connect to Lucene out of the box. Developers need no additional work for Lucene to search text in any language RLP supports. The Lucene integration module is included with RLP at no additional cost.

A software evaluation copy of RLP with the Rosette Lucene Integration Module is available on request.

Quality of Experience

Users, of course, see only the final result: a search experience similar to what they expect from Google or other major search engines, regardless of language. Developers can appreciate that this experience comes with the ROI advantages of the leading open source search technology combined with the leading linguistic search technology. Achieving that combination is straightforward: developers add Rosette and the Lucene integration module to their builds just as they would any Lucene text analyzer, and in doing so gain support for more languages and much more accurate search.