Artificial Intuition Tool for Extraction of Meaning from Texts

Shmuel Bar, in: Information and Text, Vol. 21, No. 2, December 2014


Challenge A: (1) producing uniform output, regardless of the input language; (2) resolving ambiguities arising from differences in context, or in register and domain.

The IntuView solution (IntuScan):

1. Identifying the language and the register, as well as words transliterated from another language – and reconstructing the original word.
2. Statistical categorization by topic/domain/religious context, etc. – in order to reduce ambiguity.
3. Examining context, i.e. the text that appears before and after a given expression – in order to resolve ambiguity.
4. Linking lexical units to unambiguous ontological objects shared by the lexical units in all input languages – thus making it possible to produce uniform output.
5. Using the extracted objects to perform further statistical categorization (sub-categories).

Challenge B: matching entities by name, despite the variety of variants and languages in which names may appear.

The IntuView solution (Entity Matcher):

1. Using statistical algorithms and rules to identify entities (persons, organizations, places, etc.).
2. Reconstructing transliterated names into names in the source language.
3. Parsing the name and extracting additional information (gender, given name/family name/father's name, tribal and ethnic affiliation, etc.).
4. Aggregating all the variants and occurrences of the same entity.
5. Finding relations between entities – such as father and son, or a person and his place of origin.
6. Creating an identifier for each unified entity, including all its relations and occurrences.



Artificial Intuition Tool for Extraction of Meaning from Texts

Dr. Shmuel Bar, CEO IntuView

The Problem

Ever since the Tower of Babel, the human race has been occupied with translation to bridge the gap between languages, cultures, societies and nations. Translation serves many purposes: it enables us to broaden the scope of our cultural perspective, to see the world the way others – friends and foes – do, to retrieve ancient knowledge that would otherwise be lost to mankind, and to communicate with other people on a day-to-day basis. However, in a global environment flooded with enormous amounts of information, a challenge has arisen that cannot be solved by translation: the need to identify affinities and dis-affinities between semantic units in different languages in order to normalize streams of information and mine the “meaning” within them, regardless of their original language. When we look for information or generate alerts – particularly in domains that are global – we should not be restricted to streams of information in one language; when we are interested in information – be it alerts on terrorism, fraud, cyber attacks or financial developments – we should not care whether the origin is in English, French or Chinese. Methods to deal with this problem have generally been based on multilingual dictionaries that enable key-word spotting (given a key word in one language, the search engine can add the nominally corresponding terms in other languages) or on automated translation of texts and application of the search criteria in the target language.

Current technology for automated translation leaves much to be desired. Existing automated translation technology and tools for document categorization can provide guidelines for translators or rough categorization of texts in different languages; none, however, can provide the user with real-time, operable intelligence and a detailed understanding of the meaning of the document. The more esoteric the text, the less automated translation and categorization can extract the sentiment or the “hermeneutics” of the text. Therefore, current cross-language information extraction relies, in the end, on human translation and analysis of the content of each document before any operational decision. This process suffers from several major deficiencies: long processing times, a high percentage of false negatives and false positives, and loss of intelligence due to ignorance of central cultural nuances. Named entity recognition has also not achieved the holy grail of automated entity creation and relational mapping of all entities that may relate to the user’s requirements.

The need, therefore, is for technology that scans the entire gamut of information, identifies the language and the language register of the texts, performs domain and topic categorization, and matches the information conveyed in different languages in order to create normalized data for assessment of the scope and nature of a problem. In the absence of a “silver bullet” – one technology that can be applied to all domains – the solution is based on emulation of the “intuitive” links that domain experts draw between concatenations of lexical occurrences and the appearance of a document, and conclusions regarding the authorship, inner meaning and intent of that document. In essence, this approach looks at a document as a holistic entity and deduces, from combinations of statements, meanings that may not be apparent from any one statement. These meanings constitute the “hermeneutics” of the text, which are manifest to the initiated (the domain specialist or follower of the political stream that the document represents) but a closed book to the outsider. The crux of this concept is to extract not only the prima facie identification of a word or string of words in a text, but to expand the identification to include implicit context-dependent and culture-dependent information – the "hermeneutics" of the text. Thus, a word or quote in a text may "mean" something that even contradicts its ostensible definition.

A language is, in essence, a group of “dialects”. The decision to call Swedish, Danish and Norwegian separate languages, on the one hand, and Moroccan, Libyan, Saudi and Egyptian all “Arabic”, on the other, is political, not linguistic. Even within a language, the litmus test of mutual intelligibility is not always passed. For example, the correlation between the linguistic register of Shakespeare (or even of a 19th-century writer) and that of The New York Times may be no more than 60-70%. A modern English speaker would find it difficult to understand an “English speaker” from the Elizabethan era. The same is true of the register of a fatwa by al-Qaeda compared with that of a modern Kuwaiti newspaper.
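To make the notion of a register gap concrete, the toy sketch below compares two registers by raw vocabulary overlap. The two snippets are invented and far too small to be meaningful; this only illustrates the kind of measure the 60-70% figure alludes to, not the method behind it.

```python
# Toy sketch: comparing two language registers by raw vocabulary overlap.
# The snippets below are invented and far too small to be meaningful;
# they only illustrate the kind of measure referred to in the text.

elizabethan = set("thou art a villain and I do beseech thee hold thy tongue".lower().split())
modern = set("the senate committee voted to hold a hearing on the budget".lower().split())

shared = elizabethan & modern
overlap = len(shared) / len(elizabethan | modern)
print(f"shared words: {sorted(shared)}")
print(f"vocabulary overlap: {overlap:.0%}")
```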

Even within the same language register, words, quotations, idioms or historical references can be “polysemic”: they have different meanings according to the domain and the context of the surrounding text. Waterloo can be a place in Belgium, a reference to a final defeat, a London railway station, or a song by ABBA; it all depends on the context and on who is referring to it. A verse in the Quran may mean one thing to a moderate or mainstream Muslim and the exact opposite to a radical. The word “court” may have the following meanings: a royal court, a sports court, a school courtyard, a courthouse, or to court a person. In different languages, all these meanings may be – and usually are – represented by different words altogether. When a person reads the word in a document, the word is immediately “understood” as holding the meaning typical of that type of document, and the other meanings fade into the background. This process enables humans to “scan” texts without having to process each sentence in order to get the “gist” of the meaning and to identify the nature of the text. As the person accumulates more information through other features (statements, spelling, references) in the text, he either strengthens his confidence in the initial interpretation or changes it. A person who is accustomed to reading several newspapers – “The Telegraph”, “The Times” and “The Guardian” – may be shown reprints without the name of the newspaper and recognize which newspaper each is from by the fonts and alignment. Even if he were shown the articles retyped in the same font, he would most likely identify them by vocabulary, style and other features. Frequently, if asked, he would be hard put to explain his judgment.
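As a rough illustration of how surrounding context can resolve such polysemy, the sketch below scores candidate senses of “Waterloo” by counting cue words in a small context window. The sense inventory and cue lists are invented for this example; they are not IntuView's lexicon or algorithm.

```python
# Illustrative sketch: picking the most likely sense of a polysemous token
# ("Waterloo") from cue words in its surrounding context. The sense
# inventory and cue lists below are invented for illustration only.

SENSES = {
    "battle/defeat":  {"napoleon", "defeat", "battle", "1815", "met"},
    "london_station": {"train", "station", "platform", "underground", "eurostar"},
    "abba_song":      {"abba", "song", "eurovision", "lyrics", "album"},
    "belgian_town":   {"belgium", "town", "municipality", "walloon"},
}

def disambiguate(token_index, tokens, window=6):
    """Score each sense by counting cue words within +/- `window` tokens."""
    lo, hi = max(0, token_index - window), token_index + window + 1
    context = {t.lower().strip(".,") for t in tokens[lo:hi]}
    scores = {sense: len(cues & context) for sense, cues in SENSES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unresolved"

text = "The vote proved to be the minister's Waterloo , a final defeat ."
tokens = text.split()
print(disambiguate(tokens.index("Waterloo"), tokens))   # -> battle/defeat
```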

Extra-linguistic knowledge also provides us with implicit information about a name’s bearer: gender, ethnicity, tribal and family relations, nicknames, status, religious affiliation and even age. We expect a person by the name of Nigel or Alistair to be British rather than American. We would also expect an American woman by the name of “Ethel”, “Lois”, “Doris” or “Dorothy” to be elderly. Indeed, the first two names are far more common in the UK than in the US, and the female names were given to approximately 7% of the girls born in the United States in the 1930s and to less than 0.05% from the 1970s on.
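A minimal sketch of this kind of inference follows, using an invented per-decade frequency table (not real census data) to guess a name bearer's likely birth cohort:

```python
# Illustrative sketch: inferring a likely birth cohort from per-decade
# first-name frequencies. The numbers below are rough, made-up values in
# the spirit of the figures cited in the text, not real census data.

NAME_FREQ_BY_DECADE = {          # fraction of girls given the name
    "Dorothy": {"1930s": 0.07, "1950s": 0.01,  "1970s": 0.0005},
    "Doris":   {"1930s": 0.06, "1950s": 0.005, "1970s": 0.0003},
}

def likely_cohort(name):
    """Return the decade in which the name was most frequently given."""
    freqs = NAME_FREQ_BY_DECADE.get(name)
    if not freqs:
        return None
    return max(freqs, key=freqs.get)

print(likely_cohort("Dorothy"))   # -> 1930s, i.e. probably an elderly woman today
```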

IntuView has approached this problem by combining language-specific and language-register-specific NLP (Natural Language Processing) with domain-specific ontologies. The IntuView technology[1] extracts such implicit meaning – the hermeneutics of the text. It employs the relationship between lexical instances in the text and an ontology – a graph of unique, language-independent concepts and entities that defines the precise meaning and features of each element and maps the semantic relationships between them. As a result of these insights, the process of disambiguating meaning in texts is based on a number of stages (a schematic sketch follows the list):

  1. Identification of the “register” of the language. The register may represent a certain period of the language, a dialect, a social stratum, etc. In today's global world, however, it is not enough to identify languages; the world is replete with "hybrid languages" (e.g. “Spanglish”, written and spoken by Hispanics in the US; “Frarabe”, written and spoken by people of Lebanese and North African origin in France and Belgium) that are created when a person inserts a secondary language into a primary (host) language and transliterates according to his own literacy, accent, etc. It is necessary, therefore, to take the non-host-language tokens, discover their original language, back-transliterate them, and then find the ontological representation of each word and insert it back into the semantic map of the document. 
  2. Statistical categorization of a document as belonging to a certain domain, topic, or cultural or religious context in order to reduce ambiguity.
  3. Use of the lexical tokens that directly precede or follow the token or tokens in question – in order to provide additional disambiguating information.
  4. Based on the identification of the domain of the text, the lexical units (words, phrases, etc.) are linked to ontological instances with a unique meaning (as opposed to words, which may have different meanings in different contexts); these instances can be "ideas", "actions", "persons", "groups", etc. An idea may be composed of statements in different parts of the document, which come together to signify an ontological instance of that idea.
  5. The ontological digest of the document is then matched with pre-processed statistical models to perform categorization.
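The following Python skeleton illustrates how the five stages above could fit together. Every function in it is a deliberately naive stand-in introduced for illustration; none of the names or logic represent IntuView's actual interfaces.

```python
# Skeleton of the five stages above. Every function here is a deliberately
# naive stand-in (an assumption for illustration), not IntuView's real code.

def identify_register(text):                      # stage 1: language / register
    return ("en", "religious") if "fatwa" in text.lower() else ("en", "general")

def categorize_domain(tokens):                    # stage 2: statistical categorization
    return "jihadi" if {"fatwa", "jihad"} & set(tokens) else "general"

def resolve_sense(token, context, domain):        # stage 3: local-context disambiguation
    if token == "court" and "royal" in context:
        return "ROYAL_COURT"
    return token.upper()

def link_to_ontology(senses, domain):             # stage 4: unique ontological instances
    return [f"onto:{domain}/{s}" for s in senses]

def match_models(instances):                      # stage 5: match digest to models
    return ["incitement"] if any(i.endswith("/JIHAD") for i in instances) else []

def analyze(text):
    lang, register = identify_register(text)
    tokens = text.lower().split()
    domain = categorize_domain(tokens)
    senses = [resolve_sense(t, tokens, domain) for t in tokens]
    instances = link_to_ontology(senses, domain)
    return {"language": lang, "register": register, "domain": domain,
            "categories": match_models(instances)}

print(analyze("The fatwa calls for jihad"))
# -> {'language': 'en', 'register': 'religious', 'domain': 'jihadi', 'categories': ['incitement']}
```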

This approach, therefore, is not merely “data mining” but "meaning mining". The purpose is to extract meaning from the text and to create a normalized data set that allows us to compare the “meaning” extracted from a text in one language with that extracted from a text in another language.

This methodology applies to entity extraction as well. Here, the answer to Juliet’s question, “what’s in a name?”, is: quite a lot. A name can tell us gender, ethnicity, religion, social status, family relationships and even age or generation. In order to extract this information, however, we must first be able to resolve entities that do not look alike but may be the same entity (e.g. names of entities written in different scripts – English, Arabic, Devanagari, Cyrillic) and to disambiguate entities that look the same but may not be (different transliterations of the same name in a non-Latin source language, or culturally acceptable permutations of the same name). This entails statistical modeling of the names to identify possible ethnicity, in order to apply the appropriate naming conventions for disambiguation and matching; and then extraction of implicit and contextual information from the entity, and analysis and matching of names based on culture-specific naming conventions. This calls for identification of the likely ethnicity of the name (much the same way that humans intuitively understand that a name is Irish, Hispanic or Indian); back-transliterating it to the source language (if it is not written in its original language); validating the reconstructed names; parsing them to identify their constituent parts (given name, patronymics, etc.); finding all their alternatives by applying cultural naming conventions; and aggregating and matching the information implicit in them or in their context in order to discover links between the entities.
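As an illustration of the variant problem, the sketch below collapses several Latin-script spellings of one Arabic name onto a single normalized key using a few made-up transliteration rules. This is only a toy normalization; the approach described above would instead back-transliterate into the source script and apply full naming conventions.

```python
# Illustrative sketch: matching Latin-script variants of the same Arabic
# name by normalizing common transliteration differences. The rules below
# are a small, made-up subset, not a complete system.

import re

def normalize_arabic_name(name):
    n = name.lower()
    n = re.sub(r"^(al-|el-|ul-)", "", n)            # drop the definite article
    n = n.replace("oo", "u").replace("ee", "i")     # long-vowel digraphs
    n = n.replace("o", "u").replace("e", "a")       # short-vowel spelling variants
    n = n.replace("q", "k").replace("gh", "g")      # common consonant variants
    n = re.sub(r"(.)\1+", r"\1", n)                 # collapse doubled letters
    return n

variants = ["Mohammed", "Muhammad", "Mohamad", "Muhammed"]
print({v: normalize_arabic_name(v) for v in variants})
# all four variants collapse to the same normalized key, so they can be aggregated
```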

One of the basic tasks of text analytics is named entity recognition – extraction from unstructured text of entities (persons, groups, locations, ideas and actions) that are relevant to the user’s requirements, aggregation and matching of ostensibly separate entities, collection and validation of information about them, linking them to other entities of interest and adding them to the intelligence database for future use. The IntuView methodology automates this task while bridging the gap between languages. In essence, using algorithms based on knowledge of the patterns and conventions of the pertinent source languages and cultures, it recognizes entities within unstructured or structured texts, matches their occurrences in different sources and extracts information from the contexts in which they appear, creating a “virtual entity” which is uniquely identified by its definition as an instance within a domain-specific hierarchical ontology and by its internal attributes, which link it to other entities.

The tasks that the semantic analyzer performs include the following (a toy sketch of the aggregation and matching steps appears after the list):

1.       Identification of entities within a structured or unstructured text, using both statistical and rule-based algorithms to categorize them as possible persons, groups, locations, addresses (URLs, dates, bank accounts), ideas, actions – essentially every type defined in the ontology. The database entity or unstructured text is analyzed, using NLP tools or data mining in the database, to determine whether it contains relevant entities and what type of entities they represent.

2.       Reverse transliteration of names back into the source language in order to determine the original form or forms of the entity. The identified named entity, written in any given script (Latin, Cyrillic, Devanagari, Arabic, etc.), is analyzed to determine its possible linguistic origin and then processed to extract possible spelling variants in the name's language of origin. Thus, the name of the entity is restored to the original name in the source language or possible source languages. Information from the transliteration that provides further identification is retained for later analysis.

3.       Parsing of the named entity and culturally and linguistically sensitive analysis of its components. The analysis of these components provides further validation or reassessment of the categorization of the entity's type (an entity that may initially be considered a person entity may actually be a place entity named after that person). This engine fills the attribute slots of the virtual entity (gender, location, size, object, predicate, ethnicity, etc.) to provide further qualities for identification and matching. The name is also vetted to determine its validity (a possible name or a corrupt one).

4.       Aggregation of all the variants and aliases referring to an entity within the different inputs, while maintaining their source identity for possible regression.

5.       Matching of entities by finding relations between identified entities (family relations in person entities, person-location relationships etc.).

6.       Creation of a virtual “identity card” of an entity, with all the aggregated information collected about it from the inputs.
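The following minimal sketch shows what the “virtual entity” produced by steps 4-6 could look like, with assumed attribute slots, alias aggregation and relation links. The structure and field names are assumptions for illustration, not the actual IntuView data model.

```python
# Illustrative sketch of steps 4-6 above: aggregating variant mentions of
# one entity into a single "virtual identity card". The structure and the
# attribute slots are assumptions for illustration only.

from dataclasses import dataclass, field

@dataclass
class VirtualEntity:
    entity_id: str
    entity_type: str                                 # person / group / location ...
    canonical_name: str
    aliases: set = field(default_factory=set)        # all observed variants
    attributes: dict = field(default_factory=dict)   # gender, ethnicity, ...
    relations: list = field(default_factory=list)    # links to other entities
    sources: list = field(default_factory=list)      # provenance, for regression

    def add_mention(self, surface_form, source, attributes=None):
        self.aliases.add(surface_form)
        self.sources.append(source)
        if attributes:
            self.attributes.update(attributes)

# Hypothetical example data, invented for illustration.
entity = VirtualEntity("E-0001", "person", "Muhammad ibn Abd al-Rahman")
entity.add_mention("Mohammed Abdel Rahman", "doc-17.txt",
                   {"gender": "male", "ethnicity": "Arab"})
entity.add_mention("أبو عمر", "doc-42.txt", {"kunya": "Abu Omar"})
entity.relations.append(("father_of", "E-0002"))
print(entity)
```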

Ultimately, we should be able to differentiate between situations in which we need to “translate” foreign-language documents and those in which we only need to extract information from them. Translation is a tool – not an end unto itself – that should facilitate the ultimate goal of identifying relevant intelligence and sending it to where it may bring the most value. We believe that for a large part of intelligence tasks, hermeneutic analysis, automated entity creation and matching, and automated summarization will play a critical role in expediting the work.

Application of this concept to identification of threats in cyberspace would be based on the following steps (a toy sketch of the digest-matching stages appears after the list):

1.       Pre-generation of lexically based domain models for the different supported languages and for various threat types, based on previously identified texts. These models would be applied by crawlers to Internet traffic in order to perform initial triage.

2.       Analysis of the triaged texts to extract ontological digests.

3.       Matching the ontological digests with models of corpora of texts that have been pre-categorized as reflecting different features.

4.       Matching the ontological digests against each other to identify threads of communication (since the matching is performed on the ontological level, it is cross-linguistic). These threads are then batched together.

5.       Identification of a suspect entity in one document in the thread can then serve as the basis for unraveling the entire thread.
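The sketch below illustrates steps 2-4 in miniature: each triaged document is reduced to a language-independent “ontological digest” (here simply a set of concept identifiers), and documents whose digests are sufficiently similar are batched into a thread. The digests, the similarity measure and the threshold are assumptions for illustration only.

```python
# Illustrative sketch of steps 2-4 above: representing each triaged text as
# a language-independent "ontological digest" (a set of concept IDs) and
# batching documents whose digests overlap into threads. The digests and
# the similarity threshold are invented for illustration.

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

digests = {
    "doc_en_01": {"onto:attack", "onto:bank_x", "onto:malware_y", "onto:date_0614"},
    "doc_ru_07": {"onto:attack", "onto:bank_x", "onto:malware_y"},
    "doc_fr_03": {"onto:recipe", "onto:travel"},
}

def build_threads(digests, threshold=0.5):
    """Group documents whose ontological digests are similar enough."""
    threads, assigned = [], set()
    for doc, dig in digests.items():
        if doc in assigned:
            continue
        thread = [doc]
        for other, other_dig in digests.items():
            if other != doc and other not in assigned and jaccard(dig, other_dig) >= threshold:
                thread.append(other)
                assigned.add(other)
        assigned.add(doc)
        threads.append(thread)
    return threads

print(build_threads(digests))   # -> [['doc_en_01', 'doc_ru_07'], ['doc_fr_03']]
```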

Application of the IntuView technology to the identification of cyber-threats would greatly facilitate international cooperation between law enforcement and regulatory bodies against groups of international cyber-criminals.

 



[1] See US Patent US 8078551 B2, "Decision-support expert system and methods for real-time exploitation of documents in non-English languages", PCT/IL2006/001017.
