Basis Technology’s linguistic software addresses national language problems in intelligence gathering, foreign name transliteration, and making the most of scarce linguists. CTO Benson Margulies discusses how the company is tackling larger pieces of the problems facing government users by delivering modules of functionality –– entity extraction, entity translation, name matching, and geospatial fusion –– either pre-assembled into desktop applications or as enterprise software.
When searching documents or analyzing text, often the most critical pieces of information are the names of people, places, and organizations. But in a real-world environment, how can you be sure that one name is the same as another, especially if it’s written in a different script or language? How can you be sure that you’ve found all occurrences of names on a watch list or in a database? How can you translate a name into a language you can recognize and process?
This year, Basis Technology is introducing two new products to help cope with these problems:
Rosette Name Translator (RNT) translates, transliterates, and normalizes Arabic, Chinese, Korean, Pashto, Farsi (Persian) and Urdu names.
Rosette Name Search Engine (RNSE) catalogs names in a variety of languages with intelligent handling of script, spelling, nicknames, and other variations.
This talk will explain how to build multilingual name search and translation capabilities into your application by leveraging these innovative products.
Criminal and counter-terror investigations are increasingly called upon to cross national and language boundaries. However, the most widely used tools for digital forensics and media exploitation fall short when called upon to analyze multilingual data.
Searching hard drives containing foreign-language text presents technical complexities of which most investigators are unaware: multiple encoding schemes, orthographic variations, spelling variations, and online “chat” dialects.
This talk will introduce a prototype system specifically designed to address these linguistic issues. It will also present a survey of Unicode and legacy code pages to introduce the issues facing investigators working in a multilingual environment, and conclude with a demonstration of the prototype operating on real-world data.
Dr. Carrier is the author of “The Sleuth Kit”, a widely-used, open-source forensics system, and author of the book File System Forensic Analysis.
Software developers are increasingly faced with the challenge of adapting information systems to handle data written in or derived from Arabic. This talk will provide a concise introduction to the Arabic language and writing system for those with no experience in the subject matter. Basic issues of the script (the writing system); Modern Standard Arabic (MSA) orthography (how the letters are used together); morphology (how words are composed); and grammar will be presented, as well as geographical and historical context (specifically, relation to other languages). Other topics include the basics of the representation of Arabic in computers and bidirectional text processing. This is a repeat presentation of a popular talk from last year’s conference.
The first step in analyzing Chinese text is to divide it into individual words, a process known as “segmentation.” This task is harder than it sounds since Chinese is written as a continuous sequence of characters, without any spaces between words. Most text processing systems — whether based on dictionaries, statistics, or some combination — are powered by a database of hand-segmented text prepared by a small army of Chinese linguists. Lacking such an army, we implemented a system that automatically segments Chinese text into words by using a large quantity of text mined from the Web.
Using the Web in this way is neither simple nor easy. This talk discusses how we made our process work, the problems we overcame (or avoided), and how it all turned out, both as a problem of Chinese linguistics and as a challenge of downloading, filtering, and processing terabytes of raw web pages from the Internet.
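To make the segmentation problem concrete, here is a minimal sketch of the classic dictionary-based baseline, forward maximum matching. The toy dictionary and sentence are illustrative only; the system described in the talk uses web-mined text rather than a hand-built lexicon.

```python
# Forward maximum matching: at each position, greedily take the longest
# dictionary word; fall back to a single character when nothing matches.

DICTIONARY = {"中国", "人民", "银行", "中国人", "人民银行"}
MAX_WORD_LEN = 4  # longest entry we will try to match

def segment(text: str) -> list[str]:
    words, i = [], 0
    while i < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in DICTIONARY:
                words.append(candidate)
                i += length
                break
    return words

print(segment("中国人民银行"))  # → ['中国人', '民', '银行']
```

Note that the greedy result is wrong here (the intended reading is 中国 / 人民 / 银行, “People’s Bank of China”), which illustrates exactly why statistical approaches of the kind the talk describes are needed.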
This talk is recommended for software developers, software architects, and technical program managers.
Unicode, the universal character set encoding, is used everywhere today: in web pages, text documents, and software products. Its use has increased the flexibility and power of applications, but has also increased the complexities that challenge natural language processing.
This talk begins with a look at how Unicode, established in 1991, has changed the way computers process text, with particular emphasis on Arabic, Chinese, Japanese, and Korean. For the non-programmer, this talk will briefly present foundational concepts of encodings, characters, glyphs, code points, and the design principles behind Unicode.
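The distinction between code points and encoded bytes, which the talk introduces, can be seen in a few lines of Python; the sample word is illustrative.

```python
# The same abstract code points yield different byte sequences under
# different Unicode encoding forms.

text = "عربي"  # "Arabic" in Arabic script: four code points

# Code points are abstract numbers assigned by the Unicode standard.
print([hex(ord(c)) for c in text])   # ['0x639', '0x631', '0x628', '0x64a']

# Encoding forms map code points to bytes; the lengths differ.
print(len(text.encode("utf-8")))     # 8 bytes (2 per Arabic letter)
print(len(text.encode("utf-16-le"))) # 8 bytes (2 per BMP code point)
print(len(text.encode("utf-32-le"))) # 16 bytes (4 per code point)
```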
Digital forensics and media exploitation have historically asked “What is on this hard drive?” Today, the question increasingly being asked is “Who does this hard drive know?” Just as law enforcement keeps track of “known associates” of suspects, CI and CT investigators must exploit large collections of captured hard drives to discover who has been exchanging files and e-mail with whom.
This talk describes correlation techniques for the analysis of large volumes of digital data, and presents results from ten years of research on real-world drives.
KiLLeH Mn O5OoYuH e93’eeR!! :)
Cat walking on a keyboard, or Romanized Arabic chat?
While transliterated Arabic poses its own issues of multiple standards and inconsistent use, asking linguistic software to make sense of Arabic chat is another matter entirely. How are words, parts of words, and sentence boundaries detected? What about non-linguistic expressions using mixed-case letters, dialectal differences, and emoticons?
This talk decodes the representation of Arabic sounds in the Romanized shorthand commonly used in chatrooms and blogs by presenting findings from field analyses of Egyptian, Gulf, Iraqi, and Levantine online dialects.
Understanding Persian name structure, its influences, and structural rules is critical to natural language processing applications which extract names from documents or which match them against lists of names.
This talk begins with the basics of Persian phonology and name morphology, and delves into the rich influences of other languages; cultural naming preferences (such as the decline of Arabic-based names after the fall of the Shah in Iran); historical roots; and regional customs.
Finally, we will see how this understanding of Persian names is used to increase the accuracy of natural language processing applications.
Criminal organizations and terrorist cells use chat rooms and other forms of computer mediated communication (CMC) to coordinate their activities. Yet while search engine technology for conventional Arabic text is relatively mature, online chat — which uses numbers, punctuation, and symbols to represent Arabic phonetics — has remained largely hidden from text mining or natural language processing applications.
Basis Technology’s Arabic Chatroom Reverse Transliterator (ACRT) pulls back the curtain, converting Romanized Arabic chat back to its original form and making it accessible to text mining applications. This talk discusses ACRT’s natural language processing techniques, such as N-gram analysis and hidden Markov models, which enable ACRT to work its magic.
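To give a flavor of the n-gram analysis mentioned above (ACRT’s actual models are not public), here is a toy example of choosing among candidate Arabic spellings for a chat token by scoring each candidate’s character bigrams against a small reference corpus. Corpus and candidates are invented for illustration.

```python
# Score candidate Arabic spellings with an add-one-smoothed character
# bigram model built from a (pretend) reference corpus, and keep the
# candidate whose bigrams are most familiar.

from collections import Counter
import math

def bigrams(word: str) -> list[str]:
    return [word[i:i + 2] for i in range(len(word) - 1)]

corpus = ["سلام", "كلام", "سلامة", "اسلام"]
counts = Counter(b for w in corpus for b in bigrams(w))
total = sum(counts.values())

def score(candidate: str) -> float:
    """Log-probability of the candidate's bigrams (add-one smoothing)."""
    return sum(math.log((counts[b] + 1) / (total + 1)) for b in bigrams(candidate))

# Two hypothetical candidates for the chat token "salam":
candidates = ["سلام", "صلام"]
print(max(candidates, key=score))  # → "سلام" (its bigrams occur in the corpus)
```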
Basis Technology’s Arabic Desktop Suite is an integrated collection of productivity-boosting applications designed for analysts, linguists, and translators. Its key components are:
Transliteration Assistant -- a Microsoft Office plug-in which automatically standardizes names of people, places, and organizations into one of six formal transliteration systems, including the Congressionally-mandated IC transliteration standard for Arabic.
Knowledge Center -- a single point of access for dictionaries, glossaries, gazetteers, name lists, and other reference materials which can be searched in English, Arabic, or transliterated Arabic.
Geoscope -- accesses a library of high-resolution maps and pinpoints locations obtained from search queries in Arabic or via fuzzy matching of transliterated Arabic.
Arabic Editor -- enables rapid composition, analysis, and editing of Arabic documents using a standard Western keyboard with an input system which can be learned in less than one hour.
Participants in this hands-on tutorial will learn how to prepare reports using standardized transliterations, how to automatically translate lists of names, how to identify places on maps, how to type fully-vocalized Arabic, and how to exploit online reference materials.
Arabic is one of the most difficult scripts to render. Font rendering engines still struggle just to achieve legibility.
Tasmeem provides a platform for both modern and traditional Arabic, capturing the accumulated expertise of past calligraphers and typographers, as a plug-in to applications such as the Middle Eastern version of Adobe InDesign. The result is extreme typographic and linguistic accuracy without compromising any existing functionality, including Unicode compliance.
Tasmeem’s typefaces are the first computer fonts in the true sense of the word. Rather than digitizing legacy technology, they are based on novel analysis from which exact Arabic logograms are synthesized. Used in combination with Basis Technology’s Arabic Editor, Tasmeem constitutes the most advanced Arabic typographic system currently available.
The Arabic script is used to write over a dozen languages, and each language uses the script in different ways.
This talk explores the history of the script across the languages that use it, the structure and characteristics of the Arabic alphabet, each language’s particular alphabet and phonological structure, its borrowings, and how it differs from Arabic. Languages that use the Arabic script include Arabic, Persian (and dialects, such as Dari), Kurdish, Pashto, Kashmiri, Urdu, Sindhi, Ottoman Turkish, Uyghur, Malay (Jawi), Hausa, and Swahili.
The proliferation of Unicode has both simplified multilingual computing and added complexity to Arabic natural language processing applications by increasing the number of ways users can use (or abuse) Arabic characters.
Since the Arabic script is used for many Middle Eastern languages – such as Arabic, Kurdish, Farsi (Persian), Sindhi, Uighur, and Urdu – users may mix characters specific to one language into text written in another language.
Arabic Unicode allows users to introduce orthographic variations into text that can affect the accuracy of natural language processing (NLP). This study was developed to (1) explore the occurrences of orthographic variations in Arabic, Farsi, and Urdu as seen in news corpora; (2) provide ways to normalize these variations for accurate retrieval; and (3) show whether normalizing characters enhances dictionary lookup. This talk suggests a multi-level normalization for handling the various Arabic script orthographic variations that appear in current news corpora.
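As an illustration of character-level normalization of this kind, the sketch below collapses a few well-known Arabic orthographic variants so that differently typed forms of a word compare equal. The rule set is a common minimal subset, not the multi-level scheme proposed in the talk.

```python
# Collapse common Arabic orthographic variants before comparison:
# strip diacritics and tatweel, unify alef forms, and fold
# alef maqsura and teh marbuta onto their frequent substitutes.

import re

ALEF_VARIANTS = "أإآٱ"                              # hamza/madda/wasla alefs
STRIP = re.compile("[\u064B-\u0652\u0640]")         # harakat + tatweel

def normalize(text: str) -> str:
    text = STRIP.sub("", text)
    for ch in ALEF_VARIANTS:
        text = text.replace(ch, "ا")                # unify alef forms
    text = text.replace("ى", "ي")                   # alef maqsura -> ya
    text = text.replace("ة", "ه")                   # teh marbuta -> ha
    return text

# Two spellings of "the Emirates" differing only in alef form:
print(normalize("الإمارات") == normalize("الامارات"))  # → True
```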
Jim Kerins, SVP of LexisNexis Special Services, will discuss the internationalization of their Information Management Supply Chain platform, which also incorporates Basis Technology’s name technology for extracting, normalizing and transliterating Arabic script names from documents.
Most commercial analysis and extraction tools are designed to work with clean, well-edited text, such as news stories and government documents. They perform poorly when applied to “dirty” data, such as informal message traffic. Saffron Technology teamed with Basis Technology to integrate customized text analytics to enable SaffronWeb to be deployed as an analytic tool for IED detection.
This talk will describe the approach the Basis/Saffron team used to create specialized extraction models for entities in message traffic and the results obtained. We will also describe how SaffronWeb’s associative memory platform uses the extracted information to discover relationships between the entities.
SAIC’s nLing, a multilingual full-text search engine based on Lucene, now incorporates Rosette Base Linguistics (RBL) to provide indexing and querying services in eighteen different languages.
This talk will survey the challenges and solutions to integrating complex linguistics into this popular open-source application.
The National Geospatial-Intelligence Agency (NGA) is the authoritative source of geographic data from around the world for the United States Government. Learn about how this data is collected and maintained, and what users can expect to find in it. This talk will also survey the challenges of recording names in multiple languages; the different ways NGA data is being used in practical applications; and future plans for development and refinement of this data.
Imagine looking for a certain abandoned airfield on a map of the Middle East. We don’t know where it’s located; the only information available is the name of a nearby village. The myriad complex ways of writing Arabic place names using the Latin alphabet can make locating places effectively impossible. Basis Technology’s Geoscope Map Viewer uses fuzzy match technology to locate places using transliterated queries. The underlying geographic smarts are provided by NGA’s detailed data resources. This same data is also employed by Basis Technology’s Transliteration Assistant to standardize the spelling of Arabic place names.
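Fuzzy matching of the kind described can be sketched with a generic similarity ratio (Basis Technology’s actual matcher is proprietary); the gazetteer entries, query, and threshold below are invented for illustration.

```python
# Rank gazetteer names by similarity to a transliterated query and
# return those above a cutoff, best first.

from difflib import SequenceMatcher

gazetteer = ["Mosul", "Musayyib", "Al Mawsil", "Basra", "Al Basrah"]

def fuzzy_lookup(query: str, names: list[str], threshold: float = 0.6) -> list[str]:
    scored = [(SequenceMatcher(None, query.lower(), n.lower()).ratio(), n)
              for n in names]
    return [n for s, n in sorted(scored, reverse=True) if s >= threshold]

print(fuzzy_lookup("Mosoul", gazetteer))  # "Mosul" ranks first despite
                                          # the variant transliteration
```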
This talk will explain how NGA’s data is presently exploited by the Arabic Desktop Suite and future directions.
BrightPlanet is the industry leader in the discovery, harvest, management, and qualification of ‘deep web’ document content. The deep web represents the vast number of high-value searchable databases on the Internet which can contain from 10 to 500 times more content than can be obtained through standard search engines. BrightPlanet has developed innovative software technology to identify, simultaneously query, characterize, and harvest from these dynamic sources, providing analysts and researchers the most comprehensive and extensive web content available.
This talk will introduce BrightPlanet’s product, technology, and unique placement in the ‘deep web’ space, and will also present the close relationship with Basis Technology. The Rosette linguistics platform has been an integral part of BrightPlanet’s flagship Deep Query Manager (DQM) product since 2004. Rosette’s linguistic technology is becoming more important as BrightPlanet significantly expands deeper into international and multilingual domains. Technologies of particular interest are tokenization, encoding detection, base linguistics, named entity extraction, and name matching over a broad range of languages.