Essential Elements of Excellent Multilingual Search
Multilingual Search, Lucene & Solr
RBL, REX, RLI, RNT, RNI
In recent years, Apache Lucene and Solr have become a viable alternative to commercial search technologies. They are fast, scalable, and feature-rich, and because they are open source they are also transparent and extensible, which lowers implementation and maintenance costs.
Thousands of organizations, including Twitter, LinkedIn, Netflix, CNET, Apple, Wikipedia, and Zappos, leverage Lucene and Solr to power a wide range of search applications.
Out of the box, Lucene and Solr offer some language support; however, the level provided is insufficient to raise multilingual search quality to the standard demanded by commercial-caliber applications.
This is where Rosette comes in: Rosette’s pre-built JAR files, which integrate seamlessly with either pure Lucene or Lucene/Solr, provide sophisticated natural language processing technology as well as comprehensive dictionary data to fill in the gaps.
Rosette provides these linguistic advantages:
- Language identification for 55 languages and 45 encodings, to support indexing documents in many languages
- Accurate segmentation in languages without spaces—Chinese, Japanese, and Korean—for greater precision
- Decompounding words into sub-components for languages that freely create compounds—such as German, Dutch, and Korean—to boost recall
- Lemmatization for relevant query expansion to boost recall and precision
- Part-of-speech tagging to improve precision and recall
- Entity extraction to enable faceted search on key names and other entities in search results
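To make the recall benefit of decompounding concrete, here is a minimal sketch in plain Java. It is not Rosette's algorithm or API: the class name `Decompounder`, the five-word lexicon, and the handling of the German linking "s" are all assumptions chosen for illustration. It splits a compound against a small dictionary and emits the components as extra index terms, so that a query for "buch" can match a document containing "kinderbuchladen".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class Decompounder {
    // Tiny illustrative lexicon (hypothetical); a real system would use a full dictionary.
    private static final Set<String> LEXICON =
            Set.of("kinder", "buch", "laden", "arbeit", "zeit");

    /** Splits a lowercase compound into lexicon words; returns the word unchanged if no full split exists. */
    public static List<String> split(String word) {
        List<String> parts = new ArrayList<>();
        if (splitFrom(word, 0, parts)) {
            return parts;
        }
        return List.of(word);
    }

    // Backtracking search that tries the longest lexicon word at each position first.
    private static boolean splitFrom(String word, int start, List<String> parts) {
        if (start == word.length()) {
            return !parts.isEmpty();
        }
        for (int end = word.length(); end > start; end--) {
            String candidate = word.substring(start, end);
            if (!LEXICON.contains(candidate)) {
                continue;
            }
            parts.add(candidate);
            if (splitFrom(word, end, parts)) {
                return true;
            }
            // Assumption: allow a linking "s" between parts, as in "arbeit" + s + "zeit".
            if (end < word.length() && word.charAt(end) == 's'
                    && splitFrom(word, end + 1, parts)) {
                return true;
            }
            parts.remove(parts.size() - 1); // undo and try a shorter candidate
        }
        return false;
    }

    /** Index terms: the original word plus its components, so a query for a component can match. */
    public static List<String> indexTerms(String word) {
        List<String> terms = new ArrayList<>();
        terms.add(word);
        for (String part : split(word)) {
            if (!part.equals(word)) {
                terms.add(part);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(indexTerms("kinderbuchladen")); // [kinderbuchladen, kinder, buch, laden]
        System.out.println(indexTerms("arbeitszeit"));     // [arbeitszeit, arbeit, zeit]
        System.out.println(indexTerms("haus"));            // [haus]
    }
}
```

Keeping the original compound alongside its parts is one possible design choice: exact-match queries still hit the full word, while component queries gain recall. A production decompounder would also need a much larger lexicon and some way to rank ambiguous splits.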
The whitepaper includes technical details showing how Rosette integrates with Lucene or Solr.