The dtSearch Engine’s Language Analyzer API passes blocks of text to Rosette and receives back the words to index.
dtSearch quickly searches through terabytes of text across a desktop, network, Internet or intranet site. Its instant searching and file format support can also be embedded in other applications.
Basis Technology provides enterprise-quality linguistic analysis to the dtSearch Text Retrieval Engine via Rosette. Leading organizations use us for deep linguistic processing and highly accurate search results in many languages. This linguistic plug-in delivers quality multilingual search results in over 20 Asian, European, and Middle Eastern languages.
Basis Technology’s commercially supported text analytics platform for search is used by top search engines including Google, Yahoo!, and Bing to segment Chinese, Japanese, and Korean text, improve indexing through morphological analysis, and apply other language-specific features for better precision and greater recall in search results. With Rosette’s dtSearch connector, enterprise customers can access these tools for search-based applications, enterprise search, and other deployments.
The core developer component of the dtSearch product line, the dtSearch Engine, offers 25+ fielded and full-text federated search options. Its own file parsers highlight hits in popular file types, email types, and attachments, and its spider supports static and dynamic data. dtSearch comes with APIs for .NET, Java, C++, SQL, and more for Windows or Linux (native 64-bit and 32-bit).
Implementing Asian, European, and Middle Eastern languages can require several vendors and modules with different performance levels and features. Rosette gives dtSearch implementers high speed and accuracy for these languages from a single source, so that plugging in one language or 24 is easy and predictable. Basis Technology has been providing support for our customers around the world for over 15 years.
At index and query time, the Language Identifier component of Rosette swiftly detects the language and encoding of documents, identifying 55 languages and 45 encodings. The algorithms are based on statistical profiles and trained on gigabytes of hand-verified data.
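To give a feel for what profile-based identification means, here is a minimal Python sketch that compares character-trigram counts using cosine similarity. It is an illustration only and not Rosette's algorithm: the function names, the three tiny training sentences, and the choice of trigrams and cosine similarity are all assumptions made for this example, whereas a production identifier is trained on far larger, hand-verified data.

```python
from collections import Counter
import math


def trigram_profile(text):
    """Count overlapping character trigrams in lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy training data; a real system uses gigabytes per language.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et le chat dort dans la maison",
}
PROFILES = {lang: trigram_profile(text) for lang, text in TRAINING.items()}


def identify(text):
    """Return the language code whose profile is most similar to the text."""
    query = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))


if __name__ == "__main__":
    for sample in ("the dog sleeps in the house",
                   "der hund schläft im haus",
                   "le chat dort dans la maison"):
        print(sample, "->", identify(sample))
```

With training text this small, the sketch is only reliable on inputs that resemble the samples. Accuracy on short or mixed-language text depends on the size and quality of the profiles.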
Each of the world’s languages is unique, and search engines need to understand specific features of each language to deliver the best results. Rosette uses a combination of lexical data, heuristic rules, and statistical models to tokenize text, perform morphological analysis, extract entities, search for name variants, and more. We continually evaluate new approaches to linguistic analysis and update technologies or lexical data in our regular releases to enable our customers to focus on what they do best.
Our software has been extensively tested by major web and enterprise search providers, who adopted Rosette to provide quality search results in over 20 languages. Our technology has been tuned for high throughput and is highly scalable in the dtSearch environment. Most importantly, our knowledgeable technical staff can help you whether your problem is with searching in Japanese, Arabic, Russian, or any other language we support.
Rosette comes with source code for a dtSearch-compatible language analyzer to seamlessly integrate Rosette functionality. Using the included sample build environment for Windows, the developer can start using the resulting DLL as soon as it is dropped into the appropriate dtSearch language analyzer directory. Source code for the language analyzer gives the developer maximum flexibility for customizing it to the needs of the application. Request a free evaluation copy of Rosette today.
Rosette’s language analyzer for dtSearch has full access to all the language identification and base linguistics functions of Rosette at index and query time:
Language identification across 55 languages and 45 encodings, for indexing documents in many languages
Accurate tokenization in languages without spaces—Chinese, Japanese, and Korean—for greater precision
Decompounding words into sub-components for languages that freely create compounds—such as German, Dutch, and Korean—to boost recall
Lemmatization for relevant query expansion to boost recall and precision
Part-of-speech tagging to improve precision and recall
Entity extraction to find entities, enabling faceted search results
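The decompounding item in the list above can be illustrated with a small Python sketch showing why splitting compounds boosts recall: once "Fußballspiel" is also indexed as "fußball" and "spiel", a query for "Spiel" can match it. This is a toy dictionary-based splitter, not Rosette's method. The lexicon, the function name `decompound`, and the longest-prefix-first strategy are assumptions chosen for the example.

```python
# Hypothetical mini lexicon; a real analyzer relies on large lexical data.
LEXICON = {"fuß", "ball", "fußball", "spiel", "haus", "tür", "schlüssel"}


def decompound(word, lexicon=LEXICON, min_len=3):
    """Split a word into lexicon entries, trying the longest prefix first.

    Returns the lowercased word unsplit when no complete split exists.
    """
    word = word.lower()

    def split(rest):
        if not rest:
            return []
        for end in range(len(rest), min_len - 1, -1):
            prefix = rest[:end]
            if prefix in lexicon:
                tail = split(rest[end:])
                if tail is not None:
                    return [prefix] + tail
        return None  # no way to cover the remainder with lexicon entries

    parts = split(word)
    return parts if parts else [word]


if __name__ == "__main__":
    for compound in ("Fußballspiel", "Haustürschlüssel", "Computer"):
        print(compound, "->", decompound(compound))
```

An indexer would typically store both the full compound and its parts, so exact-match queries keep their precision while queries on a component still find the document.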