Integrating Advanced Text Analytics into Solr Text analytics enhances search whether it is handling documents one by one (e.g., identifying the language of each document), analyzing the components within a document (finding entities), or looking at documents within a collection (cross-document entity resolution and near-duplicate detection). This presentation steps through the ways text analytics can be integrated into search and delves into some of the technical implementation details.
Multilingual Search and Text Analytics with Solr Maximizing precision and recall in search engines for English and other languages requires attention to several issues: language identification, word breaking, and other linguistic analysis. This presentation discusses these issues, provides Solr configuration recommendations, and describes three options for indexing a document set containing multiple languages.
Language Support, Linguistics, and Text Analytics with Solr A look at how to hook language identification into the indexing stage of Solr, why lemmatization rather than stemming matters for slimming the index and increasing search relevancy, and how entity extraction can introduce faceting to search. The talk also discusses how best to store multilingual data within the Solr index, along with related issues.
Building a Global Listening Platform with Solr A listening platform is a content aggregation platform for online media that can be used for social or brand monitoring, or for open-source intelligence by government. This presentation describes a Solr-based listening platform that Kearns built in 3 months and that can process multilingual data from news articles and social media sites. Functionality discussed includes language identification, entity extraction, relationship extraction, classification, near-duplicate detection, and story tracking.
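As a rough illustration of near-duplicate detection, one of the functions listed above, the following minimal sketch compares documents by the overlap of their word shingles (Jaccard similarity). This is a generic textbook technique chosen for illustration; it is not taken from the presentation, and the function names, shingle size, and threshold are all hypothetical.

```python
# Toy near-duplicate check using word shingles and Jaccard similarity.
# Assumption: this is a generic illustration, not the platform's actual method.

def shingles(text, k=3):
    """Return the set of k-word shingles (as tuples) in lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_near_duplicate(doc_a, doc_b, threshold=0.5, k=3):
    """True when shingle overlap meets the (hypothetical) threshold."""
    return jaccard(shingles(doc_a, k), shingles(doc_b, k)) >= threshold

if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog"
    b = "the quick brown fox jumps over the lazy cat"
    print(jaccard(shingles(a), shingles(b)))  # 6 shared of 8 total shingles
    print(is_near_duplicate(a, b))
```

At collection scale, pairwise comparison like this becomes expensive, so production systems typically rely on hashing-based approximations instead.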
Language Identification, Language Support and Entity Extraction A technical presentation looking at the motivations for adding language identification and entity extraction to an Apache Solr search engine for searching in multiple languages. These slides include a discussion of stemming vs. lemmatization (looking up the dictionary form of a word) and explain how to integrate Rosette Language Identifier and Rosette Entity Extractor into Solr.
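To make the stemming vs. lemmatization contrast concrete, here is a minimal toy sketch. It assumes a crude suffix-stripping stemmer and a tiny hand-written lemma dictionary; both are hypothetical stand-ins and do not reflect how Rosette or Solr implement either technique.

```python
# Toy contrast between stemming and lemmatization.
# Assumption: SUFFIXES and LEMMAS are made-up illustration data.

SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

LEMMAS = {
    "ran": "run",
    "running": "run",
    "mice": "mouse",
    "studies": "study",
}

def lemmatize(word):
    """Look up the dictionary form; fall back to the word itself."""
    return LEMMAS.get(word, word)

if __name__ == "__main__":
    for w in ("running", "ran", "mice", "studies"):
        print(w, naive_stem(w), lemmatize(w))
```

The stemmer produces non-words such as "runn" and "studi" and leaves irregular forms like "ran" and "mice" untouched, while the dictionary lookup maps each form to a real headword.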
Linguistics 101: The Conceptual Base of Natural Language Processing If you are new to natural language processing (NLP) and text analytics, a good understanding of the characteristics of human languages and linguistic concepts is invaluable. The talk includes examples from the Germanic, Indo-European and Semitic languages to illustrate the important elements of textual analysis, including a general introduction to the philosophy and types of languages, the structure of words (morphology), the meaning of words (semantics), noun and verb phrases (constituents), and the structure of sentences (syntax). The talk wraps up with examples from natural language processing to show the role linguistics plays in text analytics. It is aimed at audiences who are new to natural language processing and text analytics or who want to “fill in the blanks” of their linguistic understanding.
Rapid Information Triage: A Practical Approach Our intelligence community routinely collects more data than we can effectively analyze. This means that we must use our linguistic and analytic resources as efficiently as possible. This talk surveys common workflows and shows how products from Basis Technology can be used to rapidly identify relevant documents and save valuable analyst time. We’ll take you on a behind-the-scenes walk-through and demonstration of an information retrieval application, which incorporates the full suite of text analytics available in the Rosette 7 platform.
Building Multilingual Search-Based Applications A look at the linguistic and language support issues facing search engines that process documents in many languages, and the types of questions an engineer should answer before embarking on such an endeavor.
Susan Feldman, IDC
Lucene and Solr for the Rest of the World Lucene is a popular open-source search engine library, used by a variety of commercial and non-commercial web sites. However, its built-in support for non-English languages is very limited, creating a significant barrier to sophisticated processing of data in certain languages. The Rosette linguistics platform helps overcome this barrier for a number of linguistically challenging languages such as Japanese and Arabic. This presentation explores how Rosette integrates with Lucene and what benefits it brings.
Adding Linguistics to a Lucene-based Application This presentation surveys the challenges of integrating complex linguistics into applications built on Lucene, the popular open-source search library, and the solutions to them.