The Microsoft/FAST Enterprise Search Platform (ESP) has limited support for processing Arabic content. Features such as lemmatization, named entity extraction, and word and name translation are not supported by ESP out of the box. Basis Technology’s Rosette linguistics platform provides these features and more.
“Arabizi”, an informal dialect of Arabic typed on mobile phones and computer keyboards using the Latin alphabet, took center stage in early 2011 during the Arab Spring uprisings. The writing is widely used in cellphone text messaging, social networks, chat rooms, and other online media. Analyzing messages written in this dialect is a challenge for analysts in government and industry because of wide variations in spelling, grammar, and diction. This whitepaper explores the complexities of this online language and ways to analyze this voluminous data source for meaningful intelligence.
Search has gone from a convenience to an indispensable requirement of doing business. In enterprises, governments, and non-profits, software engineers are tasked with finding a search engine that is full-featured and can quickly retrieve relevant search results in English or any other language. This whitepaper goes into technical detail about the whys and hows of one widely used solution to that problem: the dtSearch Engine integrated with the linguistic capabilities of Basis Technology’s Rosette.
We focus on the challenges of providing high quality search results in French, German, Italian, and Spanish, and the technology that makes this possible. Specifically, we look at lemmatization, a natural language processing (NLP) technique that identifies the dictionary form of a word, and compare it with stemming, a naïve alternative.
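The contrast between stemming and lemmatization can be sketched in a few lines of Python. This is a toy illustration only and does not reflect how Rosette or any production analyzer works: the suffix list, the lemma table, and the function names naive_stem and toy_lemmatize are all hypothetical and cover just the four example words.

```python
# Toy comparison of suffix-stripping (stemming) and dictionary lookup
# (lemmatization). All data below is hypothetical and hand-picked.

# Suffixes the naive stemmer strips, longest first.
SUFFIXES = sorted(["en", "es", "s", "x", "i", "e"], key=len, reverse=True)

# Tiny hand-written lemma table standing in for a real lexicon.
LEMMAS = {
    "chevaux": "cheval",  # French irregular plural
    "gingen": "gehen",    # German irregular past tense
    "uomini": "uomo",     # Italian irregular plural
    "hizo": "hacer",      # Spanish irregular past tense
}


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            return lower[: -len(suffix)]
    return lower


def toy_lemmatize(word: str) -> str:
    """Look the word up in the lemma table; fall back to the lowercased word."""
    lower = word.lower()
    return LEMMAS.get(lower, lower)


if __name__ == "__main__":
    for form in LEMMAS:
        print(form, "stem:", naive_stem(form), "lemma:", toy_lemmatize(form))
```

In this sketch the stemmer turns "chevaux" into "chevau" and "gingen" into "ging", neither of which matches the dictionary forms "cheval" and "gehen" that a searcher is likely to type, while the lookup returns the dictionary form directly.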
We explore three scenarios (news search, real estate condominium search, and product mentions in customer feedback forms) where the searcher cannot know the exact search terms needed to find the desired information. Adding entity extraction to your search engine enables “discovery” in these cases by helping the searcher find information that cannot be fully specified at query time. We contrast the technologies behind this “next-generation” search technique with “first-generation” keyword-only search.
Learn how to achieve quality search with Rosette and Apache Lucene and Solr. Highlighted languages include German, with its compound words and plurals, and Japanese, where words must be found in text written without spaces between them. The Rosette platform is one technology that provides this linguistic analysis. Technical details are included to show how Rosette integrates into Lucene or Solr.
Analysts need to find relevant information. More and more information is available to them, but the tools to locate it and correlate it with the problem at hand have not kept pace. Full text search is a powerful tool, but as the volume of data grows, it becomes increasingly difficult to provide analysts with relevant information purely by “Googling”.
Explore the problem space of name matching, where a name may be written with different spellings or even in different scripts. Then learn the pros and cons of matching names by comparing similar characters in two names versus a knowledge-based approach that examines linguistic patterns.
Online communication and social media are powerful enablers of change in today’s Arabic-speaking world. Decoding the orthography of Romanized Arabic is essential to understanding the content of this communication.
This whitepaper explores how the combination of entity extraction and name matching enables linkage between diverse data sources, and how that linkage is changing decision-making and problem-solving processes in e-discovery, social media monitoring, government intelligence, financial compliance, and publishing.
For languages without spaces between words, such as Chinese, Japanese and Korean, there are two common approaches to indexing these texts for a search engine: n-gram and morphological analysis. Learn about the pros and cons of each approach when used in a search engine.
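The difference between the two indexing approaches can be sketched with a three-character Japanese string. This is a minimal illustration under stated assumptions: the greedy longest-match segmenter is only a stand-in for real morphological analysis, and the three-entry lexicon and both function names are hypothetical.

```python
# Character bigrams versus a toy dictionary-based segmenter.
# The lexicon is hypothetical; real morphological analyzers use large
# dictionaries and statistical models rather than greedy matching.

def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return every overlapping run of n characters in the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def longest_match_segment(text: str, lexicon: set[str]) -> list[str]:
    """Greedily take the longest lexicon entry at each position.

    Characters that start no known word are emitted as single tokens.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return tokens


if __name__ == "__main__":
    text = "東京都"                      # "Tokyo Metropolis"
    lexicon = {"東京", "京都", "都"}     # Tokyo, Kyoto, metropolis
    print(char_ngrams(text))                     # bigram index terms
    print(longest_match_segment(text, lexicon))  # word-like index terms
```

With this input the bigram index contains "京都" (Kyoto), so a query for Kyoto would match a document about Tokyo, whereas the segmenter produces "東京" and "都" and avoids that false hit. The bigram approach, on the other hand, needs no dictionary at all.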
In today’s global economy, companies with international operations need sound legal discovery strategies that operate in multiple languages. E-discovery/e-disclosure technology is required to produce every single “relevant” document, even when those documents span several languages. Here we address the problems and solutions of e-discovery technology, ranging from determining the languages present in a set of documents to maximizing search accuracy and speed to save valuable legal staff time.
Explore the strengths and weaknesses of four name matching methods: common key (e.g., Soundex), lists of name variations, edit distance, and statistical similarity. These methods serve crucial de-duplication and name recognition needs in financial compliance, law enforcement, homeland security, and other fields.
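Two of the four methods named above, common key and edit distance, are simple enough to sketch in Python. The code below is an illustrative sketch, not the implementation discussed in the whitepaper: it uses the standard American Soundex rules and a textbook Levenshtein distance, and the sample names are chosen here only for demonstration.

```python
# Common key (American Soundex) and edit distance (Levenshtein),
# written from the textbook definitions for illustration.

CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the four-character Soundex key of an ASCII name."""
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = code
    return (key + "000")[:4]


def edit_distance(a: str, b: str) -> int:
    """Return the number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    for left, right in (("Mohammed", "Muhammad"), ("Catherine", "Katherine")):
        print(left, right, soundex(left), soundex(right),
              edit_distance(left.lower(), right.lower()))
```

The two pairs show one strength and one weakness of the common key approach: "Mohammed" and "Muhammad" share a key, while "Catherine" and "Katherine" do not, because Soundex keeps the first letter as is. Edit distance treats the second pair as one substitution apart, but it knows nothing about pronunciation.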
Review the technologies available for open source intelligence gathering: monitoring multilingual news and information about persons and topics of interest across global Internet sources, including blogs, webpages, and tweets.
Explore the pros and cons of name matching systems used by financial institutions to comply with anti-money laundering laws and “know-your-customer” requirements. We compare and contrast a “naïve” approach with a “knowledge-based” approach to name matching.