The Sleuth Kit and Open Source Digital Forensics Conference 2011 (6/14/2011)
Open Source Search Conference 2011 (6/15/2011)

Michael J. Totten is a foreign correspondent and foreign policy analyst who has reported from the Middle East, the Balkans, and the Caucasus. He publishes lengthy first-person narrative dispatches on his website, and feature and analysis pieces for various newspapers and magazines.
He has visited Iraq seven times, and on three of his trips he embedded with United States military personnel – with the 82nd Airborne in Baghdad in the wake of the surge during the summer of 2007; with the Third Battalion Fifth Marine Regiment in Fallujah during the winter of that same year; and with the Army’s Fourth Infantry Division in Baghdad in late 2008. He is a former resident of Beirut and covered Israel’s Second Lebanon War in 2006 from the Israeli side of the Lebanese border.
He writes regularly for Commentary magazine, and his work has also appeared in the New York Times, the Wall Street Journal, the New York Daily News, City Journal, LA Weekly, the Jerusalem Post, Beirut’s Daily Star, Reason magazine, Azure magazine, and the Australian edition of Newsweek. The Week magazine named him blogger of the year in 2006 for his dispatches from the Middle East, and he won the annual Weblog Awards two years in a row, in 2007 and 2008, for the Best Middle East or Africa Blog.
Basis Technology develops and supports software products which address the foreign language needs of the defense, intelligence, and law enforcement communities. Our software has been applied to such missions as document and media exploitation; document triage; watch list management; and geospatial fusion.
This presentation will describe recent developments and additions to Basis Technology’s product line, and examples of how our technology is being used to deliver more powerful analytic capabilities to the warfighter and intelligence analyst.
When searching documents or analyzing text, often the most critical pieces of information are the names of people, places, and organizations. But how can you be sure that one name is the same as another, especially if it’s written in a different language or appears as some variant form? What if you want to translate that name into a language you can recognize and process?
Name analysis tools have recently expanded to handle both new languages, such as those of Afghanistan (e.g., Pashto and Dari), and new types of variation (e.g., nicknames, initials, and re-ordering). This talk will present an overall view of name relationships and will demonstrate the use of Basis Technology products to navigate those relationships.
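The kinds of variation mentioned above (nicknames, initials, and re-ordering) can be illustrated with a small sketch. It is not how any Basis Technology product works: the nickname table, the function names, and the matching rule are all hypothetical, and it only handles names already written in the Latin alphabet.

```python
# Toy name comparison covering nicknames, initials, and re-ordering.
# NICKNAMES and all function names are hypothetical, for illustration only.
NICKNAMES = {"mike": "michael", "sue": "susan"}


def normalize(name):
    """Lowercase, drop commas and periods, and expand known nicknames."""
    tokens = name.lower().replace(",", " ").replace(".", " ").split()
    return [NICKNAMES.get(t, t) for t in tokens]


def token_matches(a, b):
    """Tokens match if they are equal or one is an initial of the other."""
    if a == b:
        return True
    if len(a) == 1:
        return b.startswith(a)
    if len(b) == 1:
        return a.startswith(b)
    return False


def names_match(name1, name2):
    """Every token of the shorter name must match a distinct token of the
    longer name, in any order (greedy, so only a rough approximation)."""
    short, long_ = sorted((normalize(name1), normalize(name2)), key=len)
    remaining = list(long_)
    for a in short:
        for b in remaining:
            if token_matches(a, b):
                remaining.remove(b)
                break
        else:
            return False
    return True


print(names_match("Michael J. Totten", "Totten, Mike"))   # True
print(names_match("M. Totten", "Michael Totten"))         # True
print(names_match("Susan Feldman", "Michael Totten"))     # False
```

A rule this loose trades precision for recall: "M. Totten" would also match "Martin Totten", which is the kind of trade-off a real system has to tune.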
In Chinese, every ideograph can represent a word or concept. In automated text analysis, breaking Chinese ideographs into words is only the first step to mining text, extracting entities, and resolving names for the most widely spoken language in the world. This talk introduces the features and characteristics of the Chinese language as a prelude to automated analysis of Chinese texts. We will look at solutions to meet the difficulties of coping with the different character sets used in China, Taiwan and Hong Kong. We will also look at extracting named entities, searching Chinese text, and recognizing Chinese names expressed in other languages.
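As a rough illustration of the word-breaking step, the sketch below uses forward maximum matching, a simple textbook technique, against a toy dictionary. The dictionary, the three-entry Traditional-to-Simplified table, and the function names are assumptions made for the example. Production segmenters rely on much larger lexicons and statistical models, and real character-set conversion is not a strict one-to-one character mapping.

```python
# Forward maximum matching over a toy dictionary (illustrative only).
DICTIONARY = {"我们", "喜欢", "北京", "北京大学", "大学", "学生"}
MAX_LEN = max(len(word) for word in DICTIONARY)

# Tiny, hypothetical Traditional -> Simplified table for this example only.
TRAD_TO_SIMP = {"們": "们", "歡": "欢", "學": "学"}


def to_simplified(text):
    """Map characters one by one; unknown characters pass through."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)


def segment(text):
    """At each position take the longest dictionary word; fall back to a
    single character when nothing in the dictionary matches."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in DICTIONARY:
                words.append(candidate)
                i += size
                break
    return words


print(segment("我们喜欢北京大学"))                 # ['我们', '喜欢', '北京大学']
print(segment(to_simplified("我們喜歡北京大學")))  # same segmentation
```

Note that the greedy choice prefers 北京大学 (Peking University) over 北京 + 大学; whether that is the right reading depends on context, which is why dictionary matching alone is only a first step.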
Cell phones, BlackBerrys, and iPhones let people exchange pictures and messages, check email, surf the Web, and capture or watch videos, all in the palm of their hands. As reliance on handheld technology increases, these devices will develop further and will eventually be used in the same way as computers. It is critical for digital forensics analysts to know how to acquire, preserve, and effectively examine data seized from a handheld device.
This talk introduces naming practices in Afghanistan, following a primer on Pashto and Dari, the two major languages spoken in Afghanistan. We will explore the linguistic attributes of Pashto and Dari names, such as the influence of Arabic names, spelling variations, and morphology.
The proliferation of transliteration styles for Arabic names into Western languages is well known, but what are the factors that shape how names are represented across the Arab world? This talk will look at examples of names influenced by the formal and spoken languages of the region, as well as how these languages influence the orthography of the names in the Latin alphabet.
Finding the few emails among thousands that mention a specific person or concept may provide a needed missing link, but what if the emails are in a language you don’t speak? This language barrier can be bridged by making a search system cross-lingual. Doing so involves trading off properties like implementation ease, accuracy, and speed. This talk will explore some specific options to tackle these trade-offs as well as other challenges of enabling cross-lingual search.
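One of the simpler options is query translation: translate the search terms and leave the documents untouched. The sketch below assumes a tiny bilingual dictionary and three made-up Spanish messages; none of these names or data come from the talk. It is easy to implement and fast, but its accuracy is limited by dictionary coverage and by ambiguous terms.

```python
import re

# Hypothetical bilingual dictionary and document collection.
BILINGUAL = {
    "meeting": ["reunión", "encuentro"],
    "airport": ["aeropuerto"],
}
DOCS = {
    1: "La reunión será en el aeropuerto.",
    2: "Gracias por el informe.",
    3: "El encuentro fue cancelado.",
}


def translate_query(query):
    """Replace each query word with all of its known translations;
    untranslatable words are kept as typed."""
    terms = []
    for word in query.lower().split():
        terms.extend(BILINGUAL.get(word, [word]))
    return terms


def search(query):
    """Rank documents by how many translated terms they contain."""
    terms = translate_query(query)
    scores = {}
    for doc_id, text in DOCS.items():
        tokens = set(re.findall(r"\w+", text.lower()))
        score = sum(1 for term in terms if term in tokens)
        if score:
            scores[doc_id] = score
    return sorted(scores, key=lambda d: (-scores[d], d))


print(search("meeting airport"))  # [1, 3]
```

The alternative, translating every document into the searcher's language at index time, moves the cost from query time to indexing and storage.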
Lucene is a popular open-source search engine library, used by a variety of commercial and non-commercial web sites. However, its built-in support for non-English languages is very limited, creating a significant barrier to sophisticated processing of data in certain languages. The Rosette linguistics platform helps overcome this barrier for a number of linguistically challenging languages such as Japanese and Arabic. This talk explores how Rosette integrates with Lucene and what benefits it brings. We will demonstrate various linguistic features of Rosette using Solr, an open-source search engine based on Lucene.
Harmony is a Department of Defense system deployed at the National Ground Intelligence Center (NGIC) that provides a community of users with access to its extensive collection of records. These records include foreign, military, and public documents, electronic media, and translations of these materials. Links and descriptions of the records are stored in the Harmony database. Harmony provides consistent and simple access to this widely heterogeneous data collection, through a combination of keyword search and specialized metadata searches. Basis Technology components enable key capabilities in this application, including multilingual full-text search on the text of the foreign language collected material in Arabic and Farsi, interactive keyword translation for cross-lingual searches from English into Arabic content, and named entity extraction and word-by-word lexicographic analysis on demand for Arabic text. This talk will describe these National Harmony capabilities and their use of the Basis Technology components that support them, with specific emphasis on examples and test cases in Arabic.
In a cross-script search environment, proper nouns written in their native script are not difficult for native speakers or even for computers. But what happens when your user base is unfamiliar with the target language? This talk presents lessons learned from a multi-billion document, cross-script search system in which the majority of users are familiar with only the Latin alphabet. Even with a perfect F-score (a measure of search relevancy), users may skip over relevant documents or misinterpret results if high-quality name translation is unavailable.
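For readers unfamiliar with the F-score mentioned above, the standard definition combines precision and recall into a single number. The sketch below is the textbook formula; the function name and the counts passed to it are illustrative.

```python
def f_score(true_pos, false_pos, false_neg, beta=1.0):
    """F-beta score from retrieval counts; beta=1 gives the usual F1."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Every relevant document returned and nothing else: a perfect score.
print(f_score(10, 0, 0))            # 1.0
# 8 relevant hits, 2 irrelevant hits, 2 relevant documents missed.
print(round(f_score(8, 2, 2), 3))   # 0.8
```

The score only measures which documents come back. It says nothing about whether a user who cannot read the script can recognize a returned name, which is the gap the abstract points to.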
Talking points include language-specific challenges; the difficulty of “double-transliterated” names; the inverse relationship between name translation and name matching; and a brief overview of linguistic resources that are required to maximize the user experience.
Annotating large and complex collections of data is a very old problem, and through the years, there have been many proposed solutions to it. This talk looks at the different types of annotations that organizations need. It will also look at the many annotation methods used, from the pre-computer age through the present day, from the Dewey Decimal System and MEMEX, right up through HTML and entity extraction. Finally, it will answer the question: what should we expect from a state-of-the-art document mark-up system?
From thousands of documents, find all the references of this person of interest whether his name is written in English, Arabic or another language. Then, find the individual’s connections to related people, places and organizations -- again in English, Arabic or another language -- oh, and while you’re at it, translate the most pertinent documents to English. A pipe dream of a tired intelligence analyst?
Learn what such an end-to-end translingual search and exploitation system looks like in this talk, which explores a solution incorporating next-generation multilingual named-entity technology based on a voted-perceptron algorithm, along with next-generation name-indexing, name-search, and name-translation technologies.
Basis Technology’s Rosette linguistics platform is a feature-rich toolkit for building sophisticated text processing applications. An understanding of the full suite of platform components is an essential prerequisite for selecting the best technologies applicable for any particular information processing task. Participants will learn the capabilities of the Rosette components, how they are used, how they interact and how they can be tuned and customized. This tutorial is suitable for both a technical and non-technical audience.
No matter how well-trained an entity extraction system may be, it will always perform best on the type of text it was trained on, which frequently is not your text. This tutorial will detail step-by-step how to customize the Rosette Entity Extractor (REX) to achieve the extraction results you need on the text you must process, and will cover the features of REX for handling text from tables and databases.
We will cover writing new regular expressions to extract entities with regular patterns, creating gazetteer databases of entities, and configuring the redactor to return the desired entity when there is a possibility of more than one. This tutorial is targeted at engineers and developers.
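The three ideas in this tutorial (pattern-based extraction, gazetteer lookup, and choosing one entity when candidates overlap) can be sketched generically as follows. This does not use or imitate the REX API: the patterns, the gazetteer entries, the function name, and the "longest span wins" rule are all assumptions chosen for illustration.

```python
import re

# Hypothetical regular-expression patterns for regularly shaped entities.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

# Hypothetical gazetteer: known entity names grouped by type.
GAZETTEER = {
    "LOCATION": ["Baghdad", "Fallujah", "New York"],
    "ORGANIZATION": ["New York Times", "Basis Technology"],
}


def extract(text):
    """Return (start, end, label, surface) tuples in document order."""
    candidates = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            candidates.append((m.start(), m.end(), label, m.group()))
    for label, names in GAZETTEER.items():
        for name in names:
            for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                candidates.append((m.start(), m.end(), label, m.group()))

    # Resolve overlaps: consider longer spans first and keep a candidate
    # only if it does not overlap anything already chosen.
    candidates.sort(key=lambda c: (-(c[1] - c[0]), c[0]))
    chosen = []
    for c in candidates:
        if all(c[1] <= k[0] or c[0] >= k[1] for k in chosen):
            chosen.append(c)
    return sorted(chosen)


for entity in extract("He wrote for the New York Times from Baghdad on 2008-11-05."):
    print(entity[2], entity[3])
```

Here "New York" (a location) and "New York Times" (an organization) compete for the same text, and the longer match is returned. A different policy, such as preferring one source over another, would slot into the same resolution step.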
Learn how to use the latest release of Transliteration Assistant, a productivity-boosting application designed for analysts, linguists, and translators. Its key components are:
Participants in this hands-on tutorial will learn how to prepare reports using standardized transliterations, how to automatically translate lists of names, and how to exploit online reference materials. The first five participants to register for this tutorial will be able to use a Basis-supplied laptop computer during the session.
Susan Feldman is Vice President of Search and Discovery Technologies at IDC and is director of IDC’s Content Technologies Group. Ms. Feldman’s area of specialization includes market research on search engines, text analytics, unified access to information, categorization and other information retrieval technologies, as well as digital marketplace dynamics.
Ms. Feldman won the 2003 James Peacock Research award at IDC for her work on modeling and forecasting the search and retrieval technology markets, and an Innovation Award from IDC in 2007 for developing a new research program on the digital marketplace. Her current work includes creating an interactive model for the digital marketplace. She has written and edited numerous articles and books about the Internet and information retrieval technology for which she has won several national and international awards.
Arabic is spoken in over 25 countries in the world with dialects both regional and national. How do these dialects vary? What are the telltale characteristics of one dialect compared to another? This talk will discuss the similarities of many linguistic structures that define an Arabic dialect as well as the differences that draw non-geographical boundaries, and then show how this affects Arabic search.
Financial institutions across the globe face tightening compliance regulations in over 100 countries and a proliferation of watch lists covering terrorism, crime, fraud, and sanctions. The complexity is compounded by the fact that organizational and regulatory data may reside in different languages. How can an organization comply with international sanctions programs when its customer names are in one language and the sanctions list is in another? The problem is further magnified when the lists themselves contain multiple languages or transliterated data, like the U.S. Treasury Department’s OFAC List. There is no single standard for name transliteration, and manually translating names is time-consuming and error-prone. This talk will discuss these challenges and look at practical solutions for handling customer names in multiple languages, including Basis Technology’s Rosette Name Indexer and Rosette Name Translator.
Before you can search, you need text, so what do you do when most of your documents are document images (PDFs) and in Arabic? The very first challenge in analysis may be extracting text from PDFs.
This talk presents a solution to the problem of extracting Arabic text from PDFs through modifications of the open source software PDFBox (www.pdfbox.org). We will start by looking at the basics of PDF structure, then look at how Arabic is stored in PDF and how to get it out using a custom-modified PDFBox. Real-life examples will be brought in as appropriate.
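The abstract does not say which problems the modified PDFBox addresses. One issue sometimes seen in text pulled from PDFs, assumed here purely for illustration, is that Arabic comes out as positional glyph code points (Unicode Arabic Presentation Forms) in left-to-right visual order rather than as plain letters in reading order. The Python sketch below shows a post-processing step for that case; it is not PDFBox code, and the function name is hypothetical.

```python
import unicodedata


def to_logical(extracted, visual_order=False):
    """Convert presentation-form glyphs to base Arabic letters.

    If visual_order is True the run is reversed first. Reversal must come
    before normalization so that ligatures such as lam-alef expand in the
    correct letter order. This naive reversal only suits a run that is
    purely Arabic; mixed-direction text needs the full bidi algorithm.
    """
    if visual_order:
        extracted = extracted[::-1]
    return unicodedata.normalize("NFKC", extracted)


# Shaped glyphs in visual (left-to-right) order:
# meem isolated, lam-alef ligature final, seen initial.
glyphs = "\uFEE1\uFEFC\uFEB3"
word = to_logical(glyphs, visual_order=True)
print([f"U+{ord(ch):04X}" for ch in word])
# ['U+0633', 'U+0644', 'U+0627', 'U+0645']  (seen, lam, alef, meem)
```

After this step the text can be indexed and searched like any other Arabic string, since queries are typed with base letters rather than glyph forms.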
Nicknames express intimacy, a special relationship or even infamy. In Arab culture, the number of nicknames for a person may seem endless. You often see them in chat, emails, or in oral communication. Dealing with multiple nicknames is a tricky problem for fields such as compliance, intelligence gathering and name resolution, since they could be used as aliases. For example, a “nom de guerre” (i.e., war name) is used by resistance fighters, terrorists, and guerrilla fighters as a pseudonym to hide their identities and/or protect themselves and their families from harm. Understanding begins with looking at the usage of nicknames and nickname formation in Arabic. In this talk we will delve into the different types of Arabic nicknames. Osama Bin Laden, for instance, has the following aliases, among many others: Hajj, The Emir, The Prince, Sword of Islam, Sword of God, Samaritan, Imam Mehdi, and Abu Abdallah.
This talk will start with a tutorial dubbed Arabic Script for Dummies, a brief reminder of what is generally overlooked by industry and scholars alike. Then an overview of the Arabic-scripted world will be given, with a breakdown of the languages involved. Next will be a breakdown of the Arabic-scripted world according to preferred writing styles. Finally a demonstration will be given of a Nastaliq computer model that covers the whole of Unicode for Arabic-scripted languages, notably Arabic, Persian, Dari, Uyghur, Pashto, Baluchi, Kashmiri, Sindhi and Urdu - for the first time in the history of typography.
Using search, you can only find what you know exists, but with Attivio’s Active Intelligence, learn what is hidden in your data. The Active Intelligence Engine (AIE) combines structured data and unstructured data, revealing patterns for the user to explore and navigate in a structured way. AIE shifts the user from finding information to using information.
This talk includes a live demonstration of active intelligence technology integrating Attivio’s AIE and Basis Technology’s entity extraction and name translation technologies, to extend Active Intelligence to multiple languages.
Endeca’s unique Information Access Platform helps people find, analyze, and understand information in ways never before possible. This talk will describe how advanced Arabic linguistics—including name matching and name translation—have been incorporated into the Information Access Platform to create a powerful new analytic tool. This technology empowers intelligence analysts to expose connections and discover patterns in data which would otherwise be hidden from legacy search engines.
Our intelligence and military organizations continually struggle with a broad array of information, from highly structured data to unstructured, textual content. Basis Technology’s Rosette linguistics platform provides the means to create more structured information from sources that have free-form, natural-language text. Mark Logic provides an XML Server that integrates database and search capabilities for a broad array of data, metadata, and content. By integrating the Rosette linguistics platform with MarkLogic, we can take an approach of enhancing textual content, adding extracted named entities and name translations in-line with the original text. Applications built on the MarkLogic platform then present analysts and warfighters with a consistent, in-context view of the original textual source and extracted data, providing integrated query and visualization of time, location, people, organizations, etc.
Apache Solr, an open source search engine built on Lucene, powers search and findability in many enterprise and government sector applications. Solr features incredibly fast indexing and querying speeds, top-notch relevancy and scoring flexibility, faceting, spell checking, distributed search, and much more. Not only can Solr serve as a traditional search engine, it also provides a powerful development framework for textual analysis, especially with the Rosette Solr integration provided by Basis Technology.
This talk demonstrates Solr’s ease of use with live examples taking raw data to an attractive, usable interface in minutes, using only open source Solr and Basis Technology plug-ins. Several Solr-based application examples will also be presented.
Lucid Imagination is a venture-funded commercial entity exclusively dedicated to the Lucene/Solr technology. It provides SLA-based support, training, consulting, and value-add software to organizations already using or evaluating Lucene/Solr for their search solutions.
Even with a sustained data transfer rate of 80 megabytes per second, it still takes 3 hours and 28 minutes to read the contents of a 1TB hard drive, and that’s assuming that you can stream from one end to the other with no interruptions.
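The read-time figure can be checked with simple arithmetic. The sketch below assumes decimal units (1TB = 10^12 bytes, 1 megabyte = 10^6 bytes); the function names are illustrative.

```python
TB = 10**12  # decimal terabyte, as drive vendors count it
MB = 10**6   # decimal megabyte


def read_time_seconds(capacity_bytes, rate_bytes_per_second):
    """Time to stream the whole drive at a constant transfer rate."""
    return capacity_bytes / rate_bytes_per_second


def as_h_m_s(seconds):
    """Split a duration in seconds into whole hours, minutes, seconds."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return hours, minutes, secs


print(as_h_m_s(read_time_seconds(1 * TB, 80 * MB)))  # (3, 28, 20)
```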
This talk will present a new media analysis and exploitation technique based on the statistical sampling of drive sectors. Using this approach it is possible to make highly accurate statements about the contents of a 1TB disk with less than 10 seconds of analysis, and with a margin of error of less than 1%. Making these statements requires a number of new advances in recognition technology and fast database lookups which will be discussed in this talk.
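The statistical idea behind sampling can be sketched with the standard margin-of-error formula for an estimated proportion. This is generic survey-sampling arithmetic, not necessarily the speaker's method: the 512-byte sector size, the 95% confidence level, the simulated disk, and all function names are assumptions.

```python
import math
import random


def worst_case_margin(sample_size, z=1.96):
    """Largest possible margin of error (at p = 0.5) for a proportion
    estimated from a simple random sample, at about 95% confidence."""
    return z * math.sqrt(0.25 / sample_size)


def estimate_fraction(sector_is_match, total_sectors, sample_size, seed=0):
    """Estimate the fraction of sectors for which sector_is_match(index)
    is true by inspecting only a random sample of sector indices."""
    rng = random.Random(seed)
    picks = rng.sample(range(total_sectors), sample_size)
    hits = sum(1 for index in picks if sector_is_match(index))
    p = hits / sample_size
    margin = 1.96 * math.sqrt(p * (1 - p) / sample_size)
    return p, margin


total_sectors = 10**12 // 512  # 1TB drive, assuming 512-byte sectors

# Simulated drive in which every fourth sector holds data of interest.
p, margin = estimate_fraction(lambda i: i % 4 == 0, total_sectors, 10_000)
print(f"estimated fraction {p:.3f} +/- {margin:.3f}")
print(f"worst-case margin for 10,000 samples: {worst_case_margin(10_000):.4f}")
```

Under these assumptions about 10,000 sampled sectors are enough to bring the margin of error below 1%, regardless of the size of the drive. In practice the hard part is the `sector_is_match` step, which is where the recognition technology and fast database lookups mentioned in the abstract come in.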
BrightPlanet is the OSINT leader in harvesting high quality, “relevant” content from inaccessible Deep Web and Surface Web sources.
With over 10 years of Deep Web extraction expertise, the company provides access to Deep Data (“known”, “unknown”, “protected” or “hidden”) from the Open Source Public Web – information not found by conventional search technologies. BrightPlanet can then provide normalized, qualified content for analysts and analytics technology - the content needed to further refine unstructured data.
Open source Web content is notorious for being poorly structured, with small fragments of meaningless navigation text that cause entity extraction tools to produce a large number of false tags, which in turn require extensive tuning and filtering. This presentation will focus on the techniques developed by BrightPlanet to further normalize open source Web content (both Deep and Surface data) before using the Rosette Entity Extractor to yield higher quality extractions.
What are the next challenges for name processing? What languages, phenomena, or entity types are most important? What sorts of trade-offs need to be made among accuracy, speed, and integration complexity?
Share problems and experiences with other MEDEX and CELLEX specialists. What are the major challenges that are being faced? Is it speed, data formats, languages, multimedia, or something else? Are “fast scan” tools that are used for onsite analysis effective at finding actionable intelligence? What is the future of analysis in theater?
Here, our keynote speaker will discuss the variables for selecting language-based software, the types of applications, and the technologies themselves. Ms. Feldman will talk to the attendees about which applications they are already using, find out their plans, offer advice, and facilitate an open discussion between attendees so that they can learn from each other.