The prototype described below was built to show how our text analytics and digital forensics technologies can be integrated. It is a proof of concept that shows the range of custom development Basis Technology can perform and illustrates a use case for two of our highest-profile technology areas. This prototype is not available for general purchase; however, if you need a digital forensics solution that incorporates advanced natural language processing, a custom solution can be developed for you.
More and more, digital investigators are encountering hard disks with text in languages that they do not speak. This makes the challenge of finding and interpreting the relevant data greater than in a typical investigation because the investigator may not know about nuances in word variations, the analysis tool may not know about all file formats and text encodings, and the investigator may not be able to read text that he finds.
The Multilingual Keyword Search prototype tool helps to find and interpret the relevant documents by applying advanced linguistic processing techniques. The linguistic techniques allow the investigator to find files that would not be found using typical forensics tools.
Most languages have words whose spelling or inflected form varies while the meaning stays the same. In English, “color” and “colour” both refer to the visual attributes of something, and “carry” and “carried” both refer to holding something. When keyword searching, the investigator needs to take these variations into account.
When the keywords are in the investigator’s native language, he will be aware of the variations he needs to consider. However, he may not know the variations in the keyword language. For example, Arabic vocalizations may not always be present (لُغَوِيَّة versus لغوية) and the alef maksura character (ى) may be used interchangeably with the yeh character (ي). In Japanese, location names may be spelled differently depending on the region of Japan (三河槇原 versus 三河槙原).
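The two Arabic variations mentioned above can be folded with a few lines of code. The following is a minimal sketch of that idea only, not the Rosette implementation: the function name `normalize_arabic` is hypothetical, and the set of characters it strips (the short-vowel and shadda marks U+064B through U+0652) is an assumption chosen to match the example.

```python
# Hypothetical helper, for illustration only; not part of Rosette.
ALEF_MAKSURA = "\u0649"  # ى
YEH = "\u064A"           # ي

# Arabic vocalization marks (fathatan through sukun).
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)}


def normalize_arabic(text: str) -> str:
    """Drop vocalization marks and fold alef maksura into yeh."""
    stripped = "".join(ch for ch in text if ch not in HARAKAT)
    return stripped.replace(ALEF_MAKSURA, YEH)


vocalized = "\u0644\u064F\u063A\u064E\u0648\u0650\u064A\u0651\u064E\u0629"
unvocalized = "\u0644\u063A\u0648\u064A\u0629"
print(normalize_arabic(vocalized) == unvocalized)
```

Folding alef maksura into yeh trades precision for recall, since a few distinct words become identical after folding. Variants such as 槇 and 槙 are not covered by standard Unicode normalization and would need a separate mapping table.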
This prototype uses the Rosette® linguistics platform to preprocess multilingual text with its text normalization functions (see sidebar). It also uses the normalized Arabic, Chinese, Japanese, Korean, Farsi (Persian), and Urdu text to build a search index. In an operational system, an analyst would then type in search terms through a simple graphical interface to search this linguistically enhanced index. A single search allows them to find variations of their keyword, including numbers that were written in a different numbering system.
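As a rough illustration of how a normalized index lets one search match numbers written in a different numbering system, the sketch below converts every Unicode decimal digit to its ASCII form before indexing and before searching. The names `normalize_digits`, `build_index`, and `search` are hypothetical, and the whitespace tokenizer is a simplifying assumption; the prototype relies on Rosette for real tokenization and normalization.

```python
import unicodedata
from collections import defaultdict


def normalize_digits(text: str) -> str:
    """Replace any Unicode decimal digit (Arabic-Indic, Persian, fullwidth...) with 0-9."""
    return "".join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch
        for ch in text
    )


def build_index(documents: dict) -> dict:
    """Map each normalized token to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for token in normalize_digits(text).split():
            index[token].add(name)
    return index


def search(index: dict, keyword: str) -> set:
    """Normalize the keyword the same way as the indexed text, then look it up."""
    return index.get(normalize_digits(keyword), set())


docs = {
    "a.txt": "invoice \u0662\u0660\u0662\u0664",  # Arabic-Indic digits for 2024
    "b.txt": "invoice 2024",
    "c.txt": "memo 1999",
}
index = build_index(docs)
print(sorted(search(index, "2024")))
```

Because the same normalization is applied on both sides, a query typed with Western digits finds the Arabic-Indic spelling and vice versa.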
Finding additional files does not help solve the case if the file content cannot be interpreted. Frequently, translators are used to help with this process, but they can be both hard to find and expensive.
This prototype could help the investigator triage documents by identifying and translating names in a file. Rosette Entity Extractor is used to identify the names of people, places, and organizations in the file. Next, Rosette Name Translator is used to translate the names into Latin script. The translation of the name is highlighted and presented to the user. This process allows the investigator to identify the documents that are most relevant and that should be translated first.
The text normalization and name translations are the base features of this prototype, but other linguistic processing techniques can also be incorporated for custom solutions. Rosette Name Indexer can be incorporated to highlight names that are in a watch list or to allow the analyst to search for transliterated name variations.
Rosette Chinese Script Converter can be incorporated to convert between Traditional and Simplified Chinese text so that a search for a Simplified Chinese keyword would also find Traditional Chinese documents. A custom system could also convert between the Japanese kanji, hiragana, and katakana scripts.
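The kana part of such a conversion is simple enough to sketch, because the Unicode katakana block mirrors the hiragana block at a fixed offset. The example below is an assumption-laden illustration, not the Rosette converter: `katakana_to_hiragana` is a hypothetical name, and conversions that involve kanji or Traditional and Simplified Chinese need dictionary data that is not shown here.

```python
# Katakana U+30A1..U+30F6 map to hiragana U+3041..U+3096 at a fixed offset.
KATAKANA_START = 0x30A1
KATAKANA_END = 0x30F6
KANA_OFFSET = 0x60


def katakana_to_hiragana(text: str) -> str:
    """Fold katakana letters to hiragana; leave every other character unchanged."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if KATAKANA_START <= ord(ch) <= KATAKANA_END
        else ch
        for ch in text
    )


print(katakana_to_hiragana("カタカナ"))
```

Applying the same fold to both the indexed text and the query lets a keyword typed in either kana script match the other. The long vowel mark (ー) lies outside the mapped range and is left as is.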
This prototype analyzes both disk images and local files. When processing a disk image, it can use the NIST National Software Reference Library (NSRL) hash database to ignore known files. This saves indexing time and reduces false positives.
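Known-file filtering of this kind amounts to hashing each file and skipping it when the hash appears in the reference set. The sketch below assumes the NSRL SHA-1 values have already been loaded into an in-memory set of uppercase hexadecimal strings; the loader is omitted, and the names `sha1_hex` and `files_to_index` are hypothetical.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """SHA-1 digest as uppercase hex (assumed to match how the known set was loaded)."""
    return hashlib.sha1(data).hexdigest().upper()


def files_to_index(files: dict, known_hashes: set) -> list:
    """Return the names of files whose content hash is not in the known-file set."""
    return [
        name
        for name, content in files.items()
        if sha1_hex(content) not in known_hashes
    ]


# Stand-in data: one "known" operating system file and one user document.
known = {sha1_hex(b"os file")}
files = {"kernel32.dll": b"os file", "notes.txt": b"user notes"}
print(files_to_index(files, known))
```

Only the files that survive the filter are passed on to text extraction and indexing, which is where the time saving comes from.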
Disk images can be stored in either raw or Expert Witness (E01) format. This prototype uses The Sleuth Kit to process NTFS, FAT, Ext3, and UFS file systems and to recover deleted files. It uses a custom algorithm to analyze binary data and locate Unicode text in different languages.
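The custom algorithm is not described here, so the following is only a simple stand-in that shows the general idea of pulling Unicode text out of binary data: decode the bytes under a candidate encoding, then keep runs of letters and combining marks above a minimum length. The function name `find_text_runs`, the default encoding, and the `min_chars` threshold are all assumptions.

```python
import unicodedata


def find_text_runs(data: bytes, encoding: str = "utf-8", min_chars: int = 4) -> list:
    """Return runs of letters/marks found when data is decoded with the given encoding."""
    # Undecodable bytes become U+FFFD, which is not a letter and so ends a run.
    text = data.decode(encoding, errors="replace")
    runs, current = [], []
    for ch in text + "\x00":  # the sentinel flushes a run that reaches the end
        if unicodedata.category(ch)[0] in "LM":  # letters and combining marks
            current.append(ch)
        else:
            if len(current) >= min_chars:
                runs.append("".join(current))
            current = []
    return runs


blob = b"\x00\x01\xff\xfe" + "لغوية".encode("utf-8") + b"\x00\x00\xff"
print(find_text_runs(blob))
```

A practical scanner would try several encodings and byte alignments and would need to filter false positives, since arbitrary bytes decoded as UTF-16 often resemble CJK characters.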