The Sleuth Kit and Open Source Digital Forensics Conference 2011 (6/14/2011)
Open Source Search Conference 2011 (6/15/2011)

Stephanie O’Sullivan was named Director for Science and Technology in August 2005. Since June 2003, Ms. O’Sullivan had been the Associate Deputy Director for Science and Technology. In that time, the DS&T focused on expanding technical and field support to HUMINT operations, delivering unique technical collection capabilities, building the CIA research cadre, and expanding the mission and application of open source intelligence (OSINT).
Ms. O’Sullivan will open the conference by drawing upon her 19 years of experience with technology R&D in the Intelligence Community.
Basis Technology develops software products which solve difficult problems in text analysis, content extraction, information retrieval, and identity resolution. Our products have been applied to a wide range of missions across the intelligence community in areas requiring human language technology (HLT), including DOMEX, CELLEX, HUMINT, SIGINT, and GEOINT.
This talk will survey the capabilities of Basis Technology’s latest product releases and discuss our directions for future development.
When searching documents or analyzing text, often the most critical pieces of information are the names of people, places, and organizations. But in a real-world environment, how can you be sure that one name is the same as another, especially if it’s written in a different script or language? How can you be sure that you’ve found all occurrences of names on a watch list or in a database? How can you translate a name into a language you can recognize and process?
This talk will explore challenges of multilingual name resolution, retrieval, and translation. We will also demonstrate Basis Technology products which enable rapid identification of names in multiple languages and automatic, high-accuracy translation of those names into English.
As criminal and counter-terror investigations cross national and language boundaries, the challenges include not only finding the right documents and evidence from among terabytes of data spread across thousands of hard drives, but also searching for keywords or names in different languages, and then interpreting search results in languages unfamiliar to the investigator.
R&D initiatives at Basis Technology are focused on these very problems. Our Digital Forensics initiative addresses the first half of the problem, and our Machine Translation initiative addresses the second. This talk will review both initiatives and connect them with Basis Technology’s broader text analytic and name matching solutions.
The information structures used to collect, share, and disseminate user-relevant information among government agencies and NGOs fighting the war on terror are woefully outdated. Semantic networking, a capability that guarantees critical information is immediately disseminated to the users who need it, has already provided a solution to this issue for national-level information, and has yielded prototypes of this solution for small tactical groups.
This talk will describe an integrated capability known as “Tango” which helps collect, organize, and deliver semantically connected information.
Countless competitions and numerous organizations have attempted to define metrics which characterize the quality of entity extraction tools, but what do those scores mean when these tools are applied to the real world? Do those extractors which score the best in controlled competitions operating on clean data deliver the best results in a production environment operating on dirty data?
This tutorial surveys the types of measurements used for entity extraction quality, and discusses techniques to better extract the data you’re looking for when general language models don’t fit your needs.
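As one illustration of the kind of measurement such a survey covers, the sketch below computes precision, recall, and F1 for an entity extractor by exact match against hand-labeled annotations. The function name, the tuple layout, and the sample annotations are assumptions made for this example; they are not taken from the tutorial or from any Basis Technology product.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted entities against gold entities by exact match."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: (start offset, end offset, entity type).
gold = {(0, 12, "PERSON"), (20, 26, "LOCATION"), (40, 56, "ORGANIZATION")}
predicted = {(0, 12, "PERSON"), (20, 26, "ORGANIZATION"), (40, 56, "ORGANIZATION")}

# The second span has the right boundaries but the wrong type, so it
# counts as both a false positive and a false negative under exact match.
p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Exact match is the strictest of the common scoring schemes. Partial-credit schemes that reward correct boundaries or correct types separately would rate the same output more generously, which is one reason scores from different evaluations are hard to compare.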
The combined strength of an enterprise-scale search engine based on the popular Lucene open-source search index core and Basis Technology’s multilingual natural language processing products represents a new opportunity in enterprise search. This talk will discuss the ins and outs of the burgeoning Lucene marketplace, and how Basis Technology’s Rosette linguistics platform puts multilingual search within easy grasp of Lucene users.
In a cross-script search environment, proper nouns written in their native script are not difficult for native speakers or even for computers. But what happens when your user base is unfamiliar with the target language? This talk presents lessons learned from a multi-billion document, cross-script search system in which the majority of users are familiar with only the Latin alphabet. Even with a perfect F-score (a measure of search relevancy), users may skip over relevant documents or misinterpret results if high-quality name translation is unavailable.
Talking points include language-specific challenges; the difficulty of “double-transliterated” names; the inverse relationship between name translation and name matching; and a brief overview of linguistic resources that are required to maximize the user experience.
Although people commonly speak of “Chinese,” the truth is that Mandarin is only one dialect among many mutually unintelligible spoken Chinese dialects, all of which share a common writing system. Yet even in their written form, Chinese dialects may use different words and characters to refer to the same ideas. Localized variants of Mandarin Chinese occupy a gray area between differences of accent and dialect.
These variants and dialects present a problem to statistical natural language processing (NLP) algorithms due to the addition of new words, dissimilar semantics for the same word, and differences in pronunciation and grammar. This talk will explore the taxonomy of modern Chinese and illustrate the aforementioned difficulties through case studies of a dialect, Wu Chinese (spoken in the Shanghai area), and a Mandarin variant, Sichuanese (as spoken in Chengdu, the capital of Sichuan province).
Identity resolution systems indicate if two individuals really are the same person. Identity retrieval systems help you find the individual you’re after. These systems appear anywhere from analysts’ desks to border crossings. But how can you tell if a system is any good before it’s deployed? You need to understand the problems it should tackle and how to measure how well it’s doing.
This talk will consider metrics and data for evaluating identity resolution and retrieval systems. It will also explore the linguistic challenges these systems face.
The rapid growth of Arabic content on the Internet has increased the need for Arabic-savvy search. The latest generation of Arabic search techniques draws on advances in natural language processing (NLP), taking search beyond simple string comparisons to a more intelligent search that can understand that kitaab (“book”) is similar to kutub (“books”) by analyzing the lemma of each word. This talk will demonstrate how a search engine with knowledge of the linguistic components of Arabic — the roots, lemmas and stems — can greatly boost the relevancy of search results.
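The idea of matching on lemmas rather than surface strings can be sketched with a toy inverted index. The lemma table below is a hand-written stand-in for a real Arabic morphological analyzer, and every token other than kitaab and kutub is a placeholder; none of this reflects the actual implementation demonstrated in the talk.

```python
from collections import defaultdict

# Toy lemma table (an assumption for illustration). A real system would
# obtain lemmas from an Arabic morphological analyzer.
LEMMAS = {"kitaab": "kitaab", "kutub": "kitaab"}


def lemma_of(token):
    """Return the lemma for a token, falling back to the token itself."""
    return LEMMAS.get(token, token)


def build_index(docs):
    """Map each lemma to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[lemma_of(token)].add(doc_id)
    return index


def search(index, query):
    """Look up a single-word query by its lemma."""
    return sorted(index.get(lemma_of(query.lower()), set()))


# Hypothetical documents written in transliterated placeholder tokens.
docs = {1: "kitaab jadiid", 2: "kutub qadiima", 3: "qalam jadiid"}
index = build_index(docs)

# A query for the plural also retrieves the document with the singular.
hits = search(index, "kutub")
print(hits)
```

A plain string comparison would return only document 2 for this query. Normalizing both the indexed text and the query to the lemma is what lets documents 1 and 2 match.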
Persian is a complex language with many dialects (including Farsi, Dari, and Tajiki) spoken in many countries (including Iran, Afghanistan, and Tajikistan). Understanding Persian has become increasingly important in the fields of text mining and analysis.
This talk presents a brief history of the language, its speakers, and its dialects. We will compare Persian to other Arabic script languages such as Arabic, Pashto, and Urdu. We will then delve into linguistic aspects of the language that are important to natural language processing and analysis applications, such as orthography, typography rules, phonology, and spelling variants.
Entity extraction has been widely deployed as a powerful technique for document triage and social network analysis. But what do you do if your documents are in a foreign language? Expensive “machine translation” systems frequently fail to produce output of acceptable quality and frequently fail to recognize names of key individuals, places, and organizations.
This tutorial will demonstrate how to rapidly construct an application which extracts names from foreign language documents, indexes those names, and automatically generates a high-quality translation into English according to the applicable agency transliteration standard. Real-world examples will be presented in Arabic, Chinese, Korean, Pashto, Persian, and Russian, for a total of six scripts and nine languages. This tutorial is appropriate for participants with a basic understanding of programming concepts.
Keyword search is a useful tool for identifying one critical document worthy of extra scrutiny from a collection of thousands. But what happens when keywords are unavailable or unknown? We will discuss a large-scale forensics system capable of ingesting a hard drive or flash memory device and answering abstract questions, such as “What makes this data different?” or “What about this drive is similar to other drives in our collection?”
This talk will discuss findings of research into automated Document and Media Exploitation (DOMEX) to develop tools which can automatically detect which hard drives and flash memory devices in a collection were previously used by members of terrorist networks.
With discovery, the search bar is the UI of last resort: it works if you know what you are looking for. But discovery is not just about exploring what you know; it’s about uncovering what the content can tell you. It means mining content for patterns and allowing you to explore and navigate through them. If you can then use the results to launch a process, notify the right people, or update a system, you have entered the world of active intelligence and are driving the shift from finding information to using information. In this discussion, we explore the latest discovery techniques and how you can exploit them for active intelligence. See a live demonstration of active intelligence technology integrating Attivio’s AIE (Active Intelligence Engine) and Basis Technology’s entity extraction and name translation.
While two entities may seem to have 23 degrees of separation to the human eye, FMS Advanced Systems Group’s Sentinel Visualizer may reveal a much closer existing relationship through its analysis functions, including identifying central players and hidden patterns, finding the shortest path between two entities, and performing timeline or geospatial analysis. Learn about the advantages of this tool for intelligence analysis and law enforcement when combined with name translation, name standardization and entity extraction technology of Basis Technology.
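Finding the shortest path between two entities is a standard graph problem. The sketch below uses breadth-first search over an unweighted link graph. It is a generic illustration and not a description of how Sentinel Visualizer works internally; the entity names and links are hypothetical.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Return the shortest list of nodes from start to goal, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # Records how each node was first reached.
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                if neighbor == goal:
                    # Walk the back-pointers from goal to start.
                    path = [goal]
                    while previous[path[-1]] is not None:
                        path.append(previous[path[-1]])
                    return path[::-1]
                queue.append(neighbor)
    return None


# Hypothetical link graph: each key is an entity, each value its contacts.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "E"],
    "D": ["B", "E"],
    "E": ["C", "D", "F"],
    "F": ["E"],
}

path = shortest_path(graph, "A", "F")
print(path)
```

Breadth-first search explores entities in order of distance from the start, so the first route it finds to the goal uses the fewest links. Weighted relationships would call for a different algorithm, such as Dijkstra's.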
This tutorial offers hands-on training with Basis Technology’s Arabic Desktop Suite, an integrated collection of productivity-boosting applications designed for analysts, linguists, and translators.
Transliteration Assistant — plug-in module for Microsoft Word, Excel, and Access which automatically standardizes names of people, places, and organizations into one of six formal transliteration systems, including the Congressionally-mandated IC transliteration standard for Arabic.
Knowledge Center — a single point of access for dictionaries, glossaries, gazetteers, name lists, and other reference materials which can be searched in English, Arabic, or transliterated Arabic.
Tutorial participants will learn to prepare reports using standardized transliterations, to automatically translate lists of names, and to exploit online reference materials.
This tutorial offers hands-on training with Basis Technology’s Arabic Desktop Suite, an integrated collection of productivity-boosting applications designed for analysts, linguists, and translators.
Geoscope — Access a library of high-resolution maps and pinpoint locations obtained from search queries in Arabic or via fuzzy matching of transliterated Arabic.
Arabic Editor — Rapidly compose, analyze, and edit Arabic documents using a standard Western keyboard with an input system which can be learned in less than one hour.
Participants in this hands-on tutorial will learn to access maps of the Middle East, to quickly identify locations on maps, and to type fully vocalized Arabic.
See the list of possible roundtable discussions. Based on feedback from our registration forms, we will select and announce the roundtable topics a few weeks prior to our conference.