Challenge A: (1) producing uniform output, regardless of the input language; (2) resolving ambiguities that arise from differing contexts or from differences of register and domain.
IntuView's solution (IntuScan):
1. Identification of the language and the register, and of words transliterated from another language, with reconstruction of the original word.
2. Statistical categorization by topic, domain, religious context etc., in order to reduce ambiguity.
3. Examination of context, i.e. the text that appears before and after a given expression, in order to resolve ambiguity.
4. Linking of lexical units to unambiguous ontological objects that are shared by the lexical units in all input languages, so that uniform output can be produced.
5. Use of the extracted objects for further statistical categorization (sub-categories).
Challenge B: matching entities by their names, despite the variety of variants and languages in which the names can appear.
IntuView's solution (Entity Matcher):
1. Use of statistical algorithms and rules to identify entities (persons, organizations, places etc.).
2. Reconstruction of transliterated names into their source-language form.
3. Parsing of the name and extraction of additional information (gender, given name/family name/patronymic, tribal and ethnic affiliation etc.).
4. Aggregation of all the variants and occurrences of the same entity.
5. Discovery of relations between entities, such as father and son, or a person and his place of origin.
6. Creation of an identifier for each unified entity, including all its relations and occurrences.
Artificial Intuition Tool for Extraction of Meaning from Texts
Dr. Shmuel Bar, CEO
IntuView
The Problem
Ever since the Tower of Babel, the human race has been occupied
with translation to bridge the gap between languages, cultures, societies and
nations. Translation serves many purposes: it enables us to broaden the scope
of our cultural perspective, to see the world in a way that others – friends
and foes – do, to retrieve ancient knowledge that, otherwise, would be lost to
mankind and to communicate between people on a day to day basis. However, in a
global environment flooded with enormous amounts of information, a challenge
has arisen that translation alone cannot solve. This is the need to identify
affinities and dis-affinities between semantic units in different languages in
order to normalize streams of information and mine the “meaning” within them
regardless of their original language. When we look for information or generate
alerts – particularly in domains that are global – we should not be restricted
to streams of information in one language; when we are interested in
information – be it alerts on terrorism, fraud, cyber attacks or financial
developments – we should not care if the origin is in English, French or
Chinese. Methods of dealing with this problem have generally been based on multilingual dictionaries that enable keyword spotting (given a keyword in one language, the search engine can add the nominally corresponding terms in other languages) or on automated translation of texts and application of the search criteria in the target language.
Current technology for automated translation leaves much to
be desired. Existing automated translation technology and tools for document
categorization can provide guidelines for translators or rough categorization
of texts in different languages; none however can provide the user with
real-time operable intelligence and a detailed understanding of the meaning of
the document. The more esoteric the text is, the less automated translation and
categorization can extract the sentiment or the “hermeneutics” of the text.
Therefore current cross-language information extraction relies, in the end, on
human translation and analysis of the content of each document before any
operational decision. This process suffers from several major deficiencies:
long processing times, high percentage of false negatives and false positives,
and loss of intelligence due to the neglect of key cultural nuances. Named entity recognition has likewise not achieved the holy grail of automated entity creation and relational mapping of all the entities that may be relevant to the user's requirements.
The need, therefore, is for technology that scans the entire gamut of information, identifies the language and register of the texts, performs domain and topic categorization, and matches the information conveyed in different languages in order to create normalized data for assessing the scope and nature of a problem. In the absence of a "silver bullet" (a single technology that can be applied to all domains), the solution is based on emulating the "intuitive" links that domain experts draw between combinations of lexical occurrences and features of a document and conclusions regarding the authorship, inner meaning and intent of that document. In essence, this approach treats a document as a holistic entity and deduces, from combinations of statements, meanings that may not be apparent from any single statement. These
meanings constitute the “hermeneutics” of the text, which is manifest to the
initiated (domain specialist or follower of the political stream that the
document represents) but is a closed book to the outsider. The crux of this
concept is to extract not only the prima facie identification of a word or
string of words in a text, but to expand the identification to include implicit
context-dependent and culture-dependent information or "hermeneutics"
of the text. Thus, a word or quote in a text may "mean" something that even contradicts its ostensible definition.
A language is, in essence, a group of "dialects". The decision to call Swedish, Danish and Norwegian separate languages on the one hand, and Moroccan, Libyan, Saudi and Egyptian varieties all "Arabic" on the other, is political, not linguistic. Even within a single language, the litmus test of mutual intelligibility is not always passed. For example, the overlap between the linguistic register of Shakespeare (or even a 19th-century writer) and that of The New York Times may be no more than 60-70%; a modern English speaker would find it difficult to understand an "English speaker" from the Elizabethan era. The same holds for the register of an al-Qaeda fatwa as against that of a modern Kuwaiti newspaper.
Even within the same language
register, words, quotations, idioms or historic references can be “polysemic”;
they have different meanings according to the domain and the context of the
surrounding text. Waterloo can be a place in Belgium, a reference to a final defeat, a London railway station, or a song by ABBA; it all depends on the context and on who is referring to it. A verse in the Quran may mean one thing to a moderate or mainstream Muslim and the exact opposite to a radical. The word "court" may denote a royal court, a sports court, a school courtyard, or a courthouse, or serve as the verb "to court" a person. In different languages, these meanings are usually represented by different words altogether. When a person reads the word in a document, the word is immediately "understood" as carrying the meaning typical of that type of document, and the other meanings fade into the background. This process enables humans to "scan" texts without having to process each sentence, to get the "gist" of the meaning and to identify the nature of the text. As the person accumulates more information through other features of the text (statements, spelling, references), he either strengthens his confidence in the initial interpretation or changes it. A person who is accustomed to reading certain newspapers ("The Telegraph", "The Times" and "The Guardian") may be shown reprints without the name of the newspaper and recognize which newspaper each is from by the fonts and layout. Even if he were shown the articles retyped in the same font, he would most likely identify them by vocabulary, style and other features; yet, if asked, he would frequently be hard put to explain his judgment.
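To make this concrete, here is a minimal sketch of how such context-driven sense selection could work. The sense inventory, cue words and scoring below are invented for illustration; they are not IntuView's actual model:

```python
# Toy sense inventory: each sense of a polysemous token carries a domain
# label and a set of typical context words. The chosen sense maximizes a
# simple score combining the document's domain prior and local context cues.

SENSES = {
    "court": [
        {"sense": "royal_court",  "domain": "history", "cues": {"king", "queen", "palace"}},
        {"sense": "sports_court", "domain": "sports",  "cues": {"tennis", "ball", "match"}},
        {"sense": "courthouse",   "domain": "law",     "cues": {"judge", "trial", "verdict"}},
    ]
}

def disambiguate(token, context_words, domain_scores):
    """Pick the sense with the best domain prior plus context-cue overlap."""
    best, best_score = None, float("-inf")
    for cand in SENSES.get(token, []):
        prior = domain_scores.get(cand["domain"], 0.0)    # from document categorization
        overlap = len(cand["cues"] & set(context_words))  # local context evidence
        score = prior + overlap
        if score > best_score:
            best, best_score = cand["sense"], score
    return best

# Usage: the document was statistically categorized as mostly "law".
print(disambiguate("court",
                   context_words=["the", "judge", "entered", "the", "court"],
                   domain_scores={"law": 2.0, "sports": 0.3}))  # -> "courthouse"
```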
Extra-linguistic knowledge also provides us with implicit information about a name's bearer: gender, ethnicity, tribal and family relations, nicknames, status, religious affiliation and even age. We expect a person named Nigel or Alistair to be British rather than American. We would likewise expect an American woman named "Ethel", "Lois", "Doris" or "Dorothy" to be elderly. Indeed, the first two names are far more common in the UK than in the US, and the female names were given to approximately 7% of the girls born in the United States in the 1930s but to less than 0.05% from the 1970s on.
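The same kind of inference can be sketched in code. The frequency table below is invented (a real system would use census name statistics); it only illustrates reading a likely birth cohort off a name:

```python
# Toy per-decade frequency tables for first names (numbers are invented).
NAME_FREQ_BY_DECADE = {
    # name: {decade: share of girls given that name}
    "Dorothy": {1930: 0.070, 1950: 0.010, 1970: 0.0004, 1990: 0.0001},
    "Emma":    {1930: 0.002, 1950: 0.001, 1970: 0.0020, 1990: 0.0150},
}

def likely_birth_decade(name):
    """Return the decade in which the name was relatively most popular."""
    table = NAME_FREQ_BY_DECADE.get(name)
    if not table:
        return None
    return max(table, key=table.get)

print(likely_birth_decade("Dorothy"))  # -> 1930 (an elderly bearer today)
```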
IntuView has approached this problem by combining language-specific and register-specific NLP (Natural Language Processing) with domain-specific ontologies. The IntuView technology[1] extracts this implicit meaning, the hermeneutics of the text. It exploits the relationship between lexical instances in the text and an ontology: a graph of unique, language-independent concepts and entities that defines the precise meaning and features of each element and maps the semantic relationships between them. On the basis of these insights, disambiguation of meaning in texts proceeds in a number of stages (a schematic sketch follows the list):
- Identification of the "register" of the language. The register may represent a certain period of the language, a dialect, a social stratum, etc. In today's global world, however, it is not enough to identify languages; the world is replete with "hybrid languages" (e.g. "Spanglish", written and spoken by Hispanics in the US, or "Frarabe", written and spoken by people of Lebanese and North African origin in France and Belgium) that are created when a person inserts a secondary language into a primary (host) language, transliterating according to his own literacy, accent, etc. It is necessary, therefore, to take the non-host-language tokens, discover their original language, back-transliterate them, find the ontological representation of each word and insert it back into the semantic map of the document.
- Statistical categorization of a document as belonging to a
certain domain, topic, or cultural or religious context in order to reduce
ambiguity.
- Use of the lexical tokens that directly precede or follow
the token or tokens in question – in order to provide additional
disambiguating information.
- Based on the identification of the domain of the text, the
lexical units (words, phrases etc.) are linked to ontological instances with
a unique meaning (as opposed to words which may have different meanings in
different contexts) that can be "ideas", "actions",
"persons" "groups" etc. An idea may be composed of
statements in different parts of the document, which come together to signify
an ontological instance of that idea.
- The ontological digest of the document is then matched against pre-processed statistical models to perform categorization.
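The following toy sketch illustrates the last two stages: lexical tokens in different languages are mapped to language-independent ontology concept IDs, and the resulting ontological digest is matched against pre-computed category models. All names and data are invented for illustration; this is not IntuView's code:

```python
from collections import Counter

# Toy lexicon: (language, surface form) -> ontology concept ID.
LEXICON = {
    ("en", "judge"): "PERSON.JUDGE",
    ("en", "court"): "PLACE.COURTHOUSE",
    ("fr", "juge"):  "PERSON.JUDGE",
    ("fr", "tribunal"): "PLACE.COURTHOUSE",
}

# Toy category models: category -> typical concept distribution.
CATEGORY_MODELS = {
    "legal":  Counter({"PERSON.JUDGE": 5, "PLACE.COURTHOUSE": 5}),
    "sports": Counter({"EVENT.MATCH": 8, "PLACE.STADIUM": 2}),
}

def digest(tokens, lang):
    """Map tokens to concept IDs; the digest is language-independent."""
    return Counter(LEXICON[(lang, t)] for t in tokens if (lang, t) in LEXICON)

def categorize(d):
    """Match a digest against the category models by simple overlap."""
    return max(CATEGORY_MODELS, key=lambda c: sum((d & CATEGORY_MODELS[c]).values()))

# Two texts in different languages yield comparable digests:
en = digest("the judge entered the court".split(), "en")
fr = digest("le juge est entre au tribunal".split(), "fr")
print(en == fr, categorize(en))  # -> True legal
```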
This approach, therefore, is not merely “data mining” but "meaning
mining". The purpose is to extract meaning from the text and to create a
normalized data set that allows us to compare the "meaning" extracted from a text in one language with that extracted from a text in another.
This methodology applies also to entity extraction. Here, the answer to Juliet's question, "What's in a name?", is: quite a lot. A name can tell us gender, ethnicity, religion, social status, family relationships and even age or generation. In order to extract this information, however, we must first be able to resolve entities that do not look alike but may be the same entity (e.g. names written in different scripts: Latin, Arabic, Devanagari, Cyrillic) and to disambiguate entities that look the same but may not be (different transliterations of the same name in a non-Latin source language, or culturally acceptable permutations of the same name). This entails statistical modeling of the names to identify their likely ethnicity, so that the appropriate naming conventions can be applied for disambiguation and matching, and then extraction of implicit and contextual information from the entity, with analysis and matching of names based on culture-specific naming conventions. Concretely, this calls for identification of the likely ethnicity of the name (much as humans intuitively recognize a name as Irish, Hispanic or Indian); back-transliteration to the source language (if the name is not written in its original script); validation of the reconstructed names; parsing of the name to identify its constituent parts (given name, patronymic etc.); derivation of all its alternatives by application of cultural naming conventions; and aggregation and matching of the information implicit in the name or its context in order to discover links between entities.
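A toy sketch of the back-transliteration and matching steps follows; the variant table is invented, and this is not IntuView's algorithm:

```python
# Invented variant table: canonical Arabic-script name -> common Latin spellings.
CANONICAL_VARIANTS = {
    "محمد": {"muhammad", "mohammed", "mohamed", "muhammed", "mohammad"},
    "حسين": {"hussein", "husain", "hussain", "husayn"},
}

def canonicalize(latin_name):
    """Map a Latin-script spelling back to its canonical source-script form."""
    key = latin_name.lower()
    for canonical, variants in CANONICAL_VARIANTS.items():
        if key in variants:
            return canonical
    return None

def same_person_name(name_a, name_b):
    """Two spellings match if they share a canonical back-transliteration."""
    a, b = canonicalize(name_a), canonicalize(name_b)
    return a is not None and a == b

print(same_person_name("Mohammed", "Muhammad"))  # -> True
print(same_person_name("Mohammed", "Hussein"))   # -> False
```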
One of the basic tasks of text analytics is named entity recognition - extraction from
unstructured text of entities (persons, groups, locations, ideas and actions)
which are relevant to the user’s requirements, aggregation and matching of
ostensibly separate entities, collection and validation of information on them,
linking them to other entities of interest and adding them to the intelligence database
for future use. The IntuView methodology automates this task while bridging the
gap between languages. In essence, using algorithms based on knowledge of
patterns and conventions of pertinent source languages and cultures, it
recognizes entities within unstructured or structured texts, matches their
occurrences in different sources and extracts information from the contexts in
which they appear, creating a “virtual entity” which is uniquely identified by
its definition as an instance within a domain-specific hierarchical ontology and
by its internal attributes, which link it to other entities.
The tasks that the semantic analyzer performs include:
1. Identification of entities within structured or unstructured text, using both statistical and rule-based algorithms, to categorize them as possible persons, groups, locations, addresses (URLs, dates, bank accounts), ideas, actions, and in fact every type defined in the ontology. The database record or unstructured text is analyzed, using NLP tools or data mining in the database, to determine whether it contains relevant entities and what types of entities they represent.
2. Reverse transliteration of names back into the source language
in order to determine the original form or forms of the entity. The identified
named entity, written in any given script (Latin, Cyrillic, Devanagari, Arabic, etc.), is analyzed to determine its possible linguistic origin and then
processed to extract possible spelling variants in the language of origin of
the name. Thus, the name of the entity is restored to the original name in the
source language or possible source languages. Information from the transliteration,
which provides further identification, is retained for later analysis.
3. Parsing of the named entity and culture- and language-sensitive analysis of its components. The analysis of these components provides further validation or reassessment of the categorization of the entity's type (an entity initially considered a person entity may actually be a place entity named after that person). This engine fills the attribute slots of the virtual entity (gender, location, size, object, predicate, ethnicity etc.) to provide further qualities for identification and matching. The name is also vetted to determine its validity (a possible name or a corrupt one).
4. Aggregation of all the variants and aliases referring to an
entity within the different inputs, while maintaining their source identity for
possible regression.
5. Matching of entities by finding relations between identified
entities (family relations in person entities, person-location relationships
etc.).
6. Creation of a virtual “identity card” of an entity with all the
aggregated information that is collected in the inputs about it.
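The virtual "identity card" built up by steps 1-6 can be pictured as a simple record; the field names below are illustrative, not IntuView's schema:

```python
from dataclasses import dataclass, field

@dataclass
class VirtualEntity:
    entity_id: str                                    # unique ID of the unified entity
    entity_type: str                                  # ontology type, e.g. "PERSON"
    attributes: dict = field(default_factory=dict)    # gender, ethnicity, etc. (step 3)
    variants: list = field(default_factory=list)      # (spelling, source) pairs
    relations: list = field(default_factory=list)     # (relation, other_entity_id)

    def add_variant(self, spelling, source):
        """Aggregate a new spelling while retaining its source (step 4)."""
        self.variants.append((spelling, source))

    def relate(self, relation, other_id):
        """Record a relation such as father-of or born-in (step 5)."""
        self.relations.append((relation, other_id))

# Usage: the unified "identity card" of step 6.
e = VirtualEntity("E-1042", "PERSON", {"gender": "male", "ethnicity": "arabic"})
e.add_variant("Mohammed Hussein", "doc_17.txt")
e.add_variant("Muhammad Husayn", "doc_23.txt")
e.relate("father_of", "E-1043")
print(e.entity_id, len(e.variants), e.relations)
```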
Ultimately, we should be able to differentiate between
situations in which we need to “translate” foreign language documents and those
in which we only need to extract information from them. Translation is a tool, not an end unto itself, that should facilitate the ultimate goal of identifying relevant intelligence and sending it to where it may bring the most value. We believe that for a large part of intelligence tasks, hermeneutic analysis, automated entity creation and matching, and automated summarization will play a critical role in expediting the work.
Application of this concept to the identification of threats in cyberspace would be based on the following steps (a toy sketch of step 4 follows the list):
1. Pre-generation of lexically based domain models for the different supported languages and for various threat types, based on previously identified texts. These models are applied by crawlers to Internet traffic in order to perform initial triage.
2. Analysis of the triaged texts to extract ontological digests.
3. Matching of the ontological digests with models of corpora of texts that have been pre-categorized as reflecting different features.
4. Matching of the ontological digests against each other to identify threads of communication (since the matching is performed on the ontological level, it is cross-linguistic). These threads are then batched together.
5. Identification of a suspect entity in one document in a thread, which can then serve as the basis for unraveling the whole thread.
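In the sketch below (invented thresholds and data), digests are compared pairwise with cosine similarity and similar documents are batched into a cross-lingual thread:

```python
import math
from collections import Counter

def cosine(d1, d2):
    """Cosine similarity between two concept-count digests."""
    dot = sum(d1[k] * d2[k] for k in d1 if k in d2)
    norm = math.sqrt(sum(v * v for v in d1.values())) * \
           math.sqrt(sum(v * v for v in d2.values()))
    return dot / norm if norm else 0.0

def thread(digests, threshold=0.8):
    """Greedily batch documents with similar digests into threads."""
    threads = []
    for doc_id, d in digests.items():
        for t in threads:
            if cosine(d, digests[t[0]]) >= threshold:
                t.append(doc_id)
                break
        else:
            threads.append([doc_id])
    return threads

# Digests are language-independent concept counts, so an English and an
# Arabic document about the same event can fall into the same thread.
digests = {
    "doc_en": Counter({"ACTION.ATTACK": 3, "PLACE.AIRPORT": 2}),
    "doc_ar": Counter({"ACTION.ATTACK": 2, "PLACE.AIRPORT": 2}),
    "doc_fr": Counter({"TOPIC.FINANCE": 4}),
}
print(thread(digests))  # -> [['doc_en', 'doc_ar'], ['doc_fr']]
```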
Application of the IntuView technology to identification of
cyber-threats would greatly facilitate international cooperation of law
enforcement and regulatory bodies against groups of international
cyber-criminals.
[1] See US patent "Decision-support expert system and methods for real-time exploitation of documents in non-English languages", US 8078551 B2, PCT/IL2006/001017.