-
Multilingual Search and Text Analytics with Solr
Maximizing precision and recall in search engines for English and other
languages requires attention to several issues: language identification,
word breaking, and other linguistic analysis. This presentation covers
these issues and provides Solr configuration recommendations, along with
three options for indexing a document set containing multiple
languages.
Presentation by Steve Kearns at Apache Lucene Eurocon, Barcelona, Spain
on October 19, 2011.
-
Language Support, Linguistics, and Text Analytics with Solr
A look at how to hook language identification into the indexing stage of Solr,
why lemmatization rather than stemming matters for slimming the index and
increasing search relevancy, and how entity extraction can be used to
introduce faceting to the search. The talk also discusses how best to store
multilingual data within the Solr index and related issues.
Presentation by Steve Kearns at the Boston Apache Lucene and Solr Meetup
on December 14, 2010.
-
Building a Global Listening Platform with Solr
A listening platform is a content aggregator for online media that can be
used for social or brand monitoring, or for open source intelligence by
government. This presentation describes a Solr-based listening platform that
Kearns built in three months and that is capable of processing multilingual
data from news articles and social media sites.
Functionality discussed includes language identification, entity extraction,
relationship extraction, classification, near-duplicate detection, and story
tracking.
Presentation by Steve Kearns at Lucene Revolution, Boston on October 7,
2010.
-
Language Identification, Language Support and Entity Extraction
A technical presentation looking at the motivations for adding language
identification and entity extraction to an Apache Solr search engine for
searching in multiple languages. These slides include a discussion of stemming
vs. lemmatization (looking up the dictionary form of a word) and explain how to
integrate Rosette Language Identifier and Rosette Entity Extractor into
Solr.
Presentation by Steve Kearns at the Stockholm Findability and Enterprise
Search in September 2010.
-
Linguistics 101: The Conceptual Base of Natural Language Processing
If you are new to natural language processing (NLP) and text analytics, a good
understanding of the characteristics of human languages and linguistic concepts
is invaluable. The talk uses examples from the Germanic, Indo-European, and
Semitic languages to illustrate the important elements of textual analysis,
including a general introduction to the philosophy and types of languages, the
structure of words (morphology), the meaning of words (semantics), noun and
verb phrases (constituents), and the structure of sentences (syntax). The talk
wraps up with examples from natural language processing that show the role
linguistics plays in text analytics. It is aimed at audiences who are new to
natural language processing and text analytics or who want to “fill in the
blanks” of their linguistic understanding.
Presentation by Zina Saadi at Basis Technology’s Government
Users Conference in Chantilly, VA on June 8-9, 2010.
-
Rapid Information Triage: A Practical Approach
Our intelligence community routinely collects more data than we can
effectively analyze. This means that we must use our linguistic and analytic
resources as efficiently as possible. This talk surveys common workflows and
shows how products from Basis Technology can be used to rapidly identify
relevant documents and save valuable analyst time. We’ll take you on a
behind-the-scenes walk-through and demonstration of the Odyssey Information
Navigator, an information retrieval application that incorporates the full
suite of text analytics available in the Rosette 7 platform.
Presentation by Steve Kearns at Basis Technology’s Government
Users Conference in Chantilly, VA on June 8-9, 2010.
-
Building Multilingual Search-Based Applications
A look at the linguistic and language support issues facing search engines
that process documents in many languages, and the types of questions an
engineer should answer before embarking on such an endeavor.
Presentation by Steve Kearns at ApacheCon Europe in May 2010.
-
Beyond Keyword Search
Susan Feldman, IDC, Day Two Keynote Address at Basis
Technology’s Government Users Conference on June 9, 2009.
-
Lucene and Solr for the Rest of the World
Lucene is a popular open-source search engine library, used by a variety of
commercial and non-commercial web sites. However, its built-in support for
non-English languages is very limited, creating a significant barrier to
sophisticated processing of data in certain languages. The Rosette linguistics
platform helps overcome this barrier for a number of linguistically challenging
languages such as Japanese and Arabic. This presentation explores how Rosette
integrates with Lucene and what benefits it brings.
Presentation by Teruhiko Kurosaka at Basis Technology’s
Government Users Conference on June 8, 2009.
-
Adding Linguistics to a Lucene-based Application
This presentation surveys the challenges of integrating complex linguistics
into applications built on Lucene, the popular open-source search library,
and solutions to those challenges.
Presentation by Chris Milner, Ph.D., and Steve Cohen at Basis
Technology’s Government Users Conference in Washington, D.C. on June 7,
2007.