“Arabizi”, an informal dialect of Arabic typed on mobile phones and computer keyboards using the Latin alphabet, took center stage in early 2011 during the Arab Spring uprisings. The writing is widely used in cellphone text messaging, social networks, chat rooms and other online media. Analyzing messages written in this dialect is a challenge for analysts in government and industry because of wide variations in spelling, grammar, and diction. This whitepaper explores the complexities of this online language and ways to analyze this voluminous data source for meaningful intelligence.
Search has gone from a convenience to an indispensable requirement of doing business. In enterprises, governments, and non-profits, software engineers are tasked with finding a search engine that is full-featured and can quickly retrieve relevant results in English or any other language. This whitepaper goes into technical detail about the whys and hows of one widely used solution to that problem: the dtSearch Engine integrated with the linguistic capabilities of Basis Technology’s Rosette.
We explore three scenarios (news search, real estate condominium search, and product mentions in customer feedback forms) where the searcher cannot know the exact search terms needed to find the desired information. Adding entity extraction to your search engine enables “discovery” in these cases by helping the searcher find information that cannot be fully specified at query time. We contrast the technologies behind this “next-generation” search technique with “first-generation” keyword-only search.
Learn how to achieve quality search with Rosette and Apache Lucene and Solr. Highlighted languages include German, with its compound words and plurals, and Japanese, where words must be found in text written without spaces between them. The Rosette platform is one technology that provides this linguistic analysis. Technical details are included to show how Rosette easily integrates into Lucene or Solr.
Explore the problem space of name matching, where a name could be written with different spellings or even in different scripts. Then learn about the pros and cons of matching names by comparing similar characters in two names versus taking a knowledge-based approach that examines linguistic patterns.
For languages without spaces between words, such as Chinese, Japanese and Korean, there are two common approaches to indexing these texts for a search engine: n-gram and morphological analysis. Learn about the pros and cons of each approach when used in a search engine.
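As a rough illustration of the n-gram side of this comparison, here is a minimal Python sketch of character-bigram indexing. The function names (`char_bigrams`, `build_index`, `search`) and the two sample documents are assumptions made for this example and do not come from any particular search engine. The query for 京都 (Kyoto) also matches the document that contains 東京都 (Tokyo Metropolis), which is the kind of false positive that morphological analysis aims to avoid.

```python
def char_bigrams(text):
    """Return overlapping two-character grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def build_index(docs):
    """Map each bigram to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for gram in char_bigrams(text):
            index.setdefault(gram, set()).add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every bigram of the query."""
    grams = char_bigrams(query)
    if not grams:
        return set()
    result = None
    for gram in grams:
        postings = index.get(gram, set())
        result = set(postings) if result is None else result & postings
    return result


# Hypothetical sample documents (assumed for illustration only).
docs = {
    1: "東京都に住む",  # "live in Tokyo Metropolis"
    2: "京都に行く",    # "go to Kyoto"
}
index = build_index(docs)
kyoto_hits = search(index, "京都")   # matches both documents
tokyo_hits = search(index, "東京")   # matches only document 1
```

The bigram index needs no dictionary and never misses a substring, but it cannot tell that 京都 inside 東京都 is not the word "Kyoto". A morphological analyzer would segment 東京都 as one unit (or as 東京 plus 都) and avoid that match, at the cost of depending on a lexicon.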
In today’s global economy, companies with international operations need sound legal discovery strategies that operate in multiple languages. E-discovery/e-disclosure technology is required to produce every single “relevant” document, even when the documents span several languages. Here we address the problems and solutions of e-discovery technology, ranging from determining the languages present in a set of documents to maximizing search accuracy and speed to save valuable legal staff time.
Explore the strengths and weaknesses of four name matching methods: common key (e.g., Soundex), lists of name variations, edit distance, and statistical similarity. These methods serve crucial de-duplication and name recognition needs in financial compliance, law enforcement, homeland security and other fields.
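To make two of the four methods concrete, the following Python sketch implements a common-key match with American Soundex and an edit-distance match with Levenshtein distance. It is an illustrative sketch only: the function names and sample names are assumptions for this example, it does not represent any vendor's implementation, and it does not cover name-variation lists or statistical similarity.

```python
# Soundex digit for each consonant group; vowels, H, W and Y have no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the four-character American Soundex key for a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (encoded + "000")[:4]


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


same_key = soundex("Smith") == soundex("Smyth")   # common-key match
distance = edit_distance("Smith", "Smyth")        # one substitution
```

A common key is cheap to index and look up, but it groups names coarsely: "Robert" and "Rupert" share the key R163. Edit distance gives a graded score, but it works on characters only, so it cannot compare names written in different scripts without a transliteration step first.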
Review the technologies available for monitoring multilingual news and information on persons and topics of interest to open source intelligence gathering, from global Internet sources including blogs, webpages, and tweets.
Explore the pros and cons of name matching systems used by financial institutions to comply with anti-money laundering laws and “know-your-customer” requirements. We compare and contrast a “naïve” approach with a “knowledge-based” approach to name matching.