For more information on the Open Source Search Conference:

Miryon Pak +1 (617) 386-2090

conference@basistech.com

Open Source Search Conference 2011

Presentations

Search Analytics: What? Why? How?

Otis Gospodnetić, Founder, Sematext

Search is increasingly the primary information access mechanism, so knowing how your search is doing often has direct business impact. You’ve indexed your data and people are searching it. But how do you know if they are happy with the results? How do you know if they are finding what they need? Regardless of whether you are using Solr, Lucene, or some other search solution, you should be paying attention to what your users are telling you through their queries and clicks.

In this talk we’ll discuss search analytics and how it can be used to answer questions like:

  • Are too many users getting the dreaded “no matches” results?

  • How deep into search results do people dig?

  • Which hits are they clicking on, or what percentage of them don’t click on any hits?

  • How much do they use the “Did You Mean” or “Auto-Complete” suggestions?

We’ll explore what specific search analytics reports tell us and what specific actions you should take based on those reports.
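As a rough illustration of the questions above, the following sketch computes three such metrics from a toy query log. The `SearchEvent` record and its fields are a hypothetical log format invented for this example. They are not taken from the talk or from any Solr API.

```python
from dataclasses import dataclass, field


@dataclass
class SearchEvent:
    """One search request (hypothetical log record, for illustration only)."""
    query: str
    num_hits: int  # number of results returned for the query
    clicked_ranks: list = field(default_factory=list)  # 1-based ranks clicked


def search_metrics(events):
    """Return zero-result rate, no-click rate, and mean deepest click rank."""
    total = len(events)
    if total == 0:
        return {"zero_result_rate": 0.0, "no_click_rate": 0.0,
                "mean_max_click_rank": None}
    zero = sum(1 for e in events if e.num_hits == 0)
    # Only searches that returned something can meaningfully be "not clicked".
    with_hits = [e for e in events if e.num_hits > 0]
    no_click = sum(1 for e in with_hits if not e.clicked_ranks)
    # Deepest clicked rank per search approximates how far users dig.
    deepest = [max(e.clicked_ranks) for e in events if e.clicked_ranks]
    return {
        "zero_result_rate": zero / total,
        "no_click_rate": no_click / len(with_hits) if with_hits else 0.0,
        "mean_max_click_rank": sum(deepest) / len(deepest) if deepest else None,
    }


events = [
    SearchEvent("solr facets", 120, [1, 3]),
    SearchEvent("lucene unicode", 40, []),
    SearchEvent("solrr", 0),
    SearchEvent("geohash", 15, [7]),
]
metrics = search_metrics(events)
```

A rising zero-result rate might suggest adding synonyms or spelling suggestions, while a high no-click rate on searches that do return hits may point to a relevance problem.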

What’s New in Solr?

Erik Hatcher, Lucid Imagination

Apache Solr, and its Apache Lucene heart, continue to evolve rapidly with powerful new capabilities, incredible performance and scalability gains, and—with boosted relevance to this audience—dramatically improved handling of all things related to Unicode, UTF-8, language analysis, and content processing pipelines. This talk will summarize, with succinct examples, the key new features available in the latest Solr releases.

Keynote: Building an OSINT Analysis Platform with Open Source Tools

Steve Kearns, Product Manager, Basis Technology

Open source tools like Apache Solr are incredibly powerful, but it isn’t always clear how they can be combined and applied to solve real-world problems. In this keynote, we will describe the process - and challenges - of building an open source intelligence analysis platform based on Apache Solr. The talk will cover all aspects of the platform - including web harvesting, content extraction, indexing, and building a supporting user interface. We will also discuss additional open and closed source analytic tools that may improve the suitability of the platform for analyzing large amounts of data.

Big Data Open Source Data Analytics

Christian Moen, Founder & CEO, Atilika Inc.

Today’s information society produces vast quantities of data – so-called big data – in volumes too large to manage practically with traditional tools. Many believe that the future belongs to those who understand how to collect and utilize this data.

Open source software is driving innovation in the field of big data management, creating tools that are replacing multi-million dollar commercial data warehousing software for an increasing number of applications. This open source software perhaps also enables applications for which commercial software is not equally well suited.

This talk gives an overview of how a big data analytics platform was built using state-of-the-art open source big data technologies, including:

  • Flume – a distributed and reliable data collection framework

  • Apache Hadoop – a scalable and reliable distributed computing framework

  • Apache Hive – a data warehouse built atop Apache Hadoop, providing an SQL-like interface to big data

The analytics platform is generic and has the capabilities of:

  • large-scale data collection from multiple sources

  • scalable and rich analysis of the collected data

  • interactive visualization and exploration of the analyzed data using a web browser

Migrating Search to Solr for the Office of the Law Revision Counsel

Paul Nelson, Chief Architect, Search Technologies

In 2009, Search Technologies migrated the search of the Office of the Law Revision Counsel (OLRC) from Personal Librarian Systems to Solr. The OLRC maintains the official codification of the general and permanent laws of the U.S. government, also known as the U.S. Code. Because OLRC is the originating author of the content, there were many interesting challenges not found in typical Solr installations:

  • Searching embedded fields

  • Handling complex queries with proximity operators

  • Highlighting in embedded fields

  • Case-sensitive and suffix-sensitive searches

  • Browsing “Table Of Contents” structure through Solr

  • Document metadata extraction and markup prior to indexing

This presentation will cover the start-to-finish architecture, design, and implementation of this search system and dive into the details, challenges encountered, and how each challenge was addressed in Solr (down to the plug-in level).

Geospatial Search Using Geohash Prefixes

David Smiley, Senior Software Developer, MITRE

Spatial search on documents that refer to a variable number of locations is an important capability for text mining in the intelligence community, where the locations are extracted from the documents themselves. This talk introduces geohashes as a basis for efficient spatial indexing that supports a variable number of points per document, a capability not otherwise found in Solr. Past approaches to spatial search involve indexing a Morton number (interleaved latitude and longitude bits), for example with the geohash encoding. The work presented supports a query shape that can be a box, point-radius, or even a polygon.

Geohashes, a latitude/longitude geocode system available in the public domain, are a character encoding of a Morton number. They are strings that incrementally narrow a latitude-longitude box on the Earth with each added character. By indexing each point at every intermediate geohash level, geospatial queries can efficiently cover millions of points. This talk is targeted at developers.
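To make the prefix idea concrete, here is a minimal sketch of the standard public-domain geohash encoding, followed by the list of prefixes that would be indexed for a single point. It is not the implementation presented in the talk or anything shipped with Solr. The function names `geohash_encode` and `geohash_prefixes` are made up for this example.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=8):
    """Encode a point by interleaving longitude and latitude bits."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # the first bit (and every other bit after) refines longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def geohash_prefixes(geohash):
    """Every intermediate level of a geohash, from coarsest to finest cell."""
    return [geohash[:i] for i in range(1, len(geohash) + 1)]


point_hash = geohash_encode(57.64911, 10.40744, precision=5)
indexed_terms = geohash_prefixes(point_hash)
```

Because each prefix names a larger enclosing cell, a query shape can be covered with a handful of cells of mixed sizes and matched against these terms with ordinary exact-term lookups, rather than by testing every point individually.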

Multi-Level Security with Solr: Searching Documents at Various Classifications

Scott Stults, Co-Founder, Open Source Connections

Almost any government search application requires a solution for securely searching and retrieving documents from a repository containing various classification levels. This presentation looks at how multi-level security (MLS) is implemented in Solr using ManifoldCF (formerly Apache Connectors Framework).

MLS is a common strategy used within the U.S. federal government and large enterprises to implement appropriate access control to restricted information. During the demonstration of MLS, a corpus of documents containing several classification levels and caveats will be indexed by Solr. After a brief overview of a simple security policy, we will show how query results are filtered by the clearance levels granted to a few principals. An overview of the existing authority and repository connectors present within ManifoldCF, as well as strategies for implementing complex security policies, will conclude the presentation.
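The filtering step can be pictured with a deliberately simplified sketch. The policy below (a principal may read a document when cleared to its level and holding every caveat on it) is an assumption made for illustration. So are the level names, the dictionary layout, and the function names. None of this reflects how ManifoldCF or Solr actually represent or enforce security.

```python
# Hypothetical ordering of classification levels, lowest to highest.
LEVELS = {"UNCLASSIFIED": 0, "CONFIDENTIAL": 1, "SECRET": 2, "TOP SECRET": 3}


def can_read(doc, principal):
    """Assumed policy: clearance must dominate the level and cover all caveats."""
    level_ok = LEVELS[principal["clearance"]] >= LEVELS[doc["level"]]
    caveats_ok = set(doc["caveats"]) <= set(principal["caveats"])
    return level_ok and caveats_ok


def filter_results(results, principal):
    """Drop every hit the principal is not allowed to see."""
    return [doc for doc in results if can_read(doc, principal)]


results = [
    {"id": "doc1", "level": "UNCLASSIFIED", "caveats": []},
    {"id": "doc2", "level": "SECRET", "caveats": ["ALPHA"]},
    {"id": "doc3", "level": "TOP SECRET", "caveats": []},
]
analyst = {"clearance": "SECRET", "caveats": ["ALPHA"]}
visitor = {"clearance": "UNCLASSIFIED", "caveats": []}

analyst_ids = [d["id"] for d in filter_results(results, analyst)]
visitor_ids = [d["id"] for d in filter_results(results, visitor)]
```

In a real deployment the check would normally be applied inside the search engine as part of the query, so that restricted documents never reach the result list, counts, or facets in the first place.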

This talk is targeted towards users.

Steps toward Open Government: A Discovery Center for Library, Archives and Museum Object Collections Made with Solr

Ching-hsien Wang, Manager, Library and Archives System Support Branch, Office of Chief Information Officer, Smithsonian Institution

The Smithsonian Institution has created a one-stop search center for its diverse object collections from 40 databases of libraries, archives and museums. Using Solr, this search center currently indexes 6.4 million documents and records, and also supports faceted searching and navigation. This presentation will demonstrate the functions of the system, describe the overall system architecture and highlight operational and workflow issues in maintaining and supporting the system.