Scientio has been working for several years on using the space of concepts rather than words to perform various text mining applications. See, for instance, this paper.
Using the tools we’ve created, you can search for phrases in large volumes of text based on meaning, mine sentiment, text mine, and categorize documents using the concepts implied in the text rather than unwieldy word frequencies. This technique combines the best parts of “bag of words” text mining and Natural Language Processing, and opens up new fields of research.
A Concept is a somewhat nebulous idea. What we mean by it is a shared meaning that is normally language independent and often common to several words. It is the meaning intended for a word in a piece of text, though that meaning may be obscured by ambiguity.
To give you an example, the noun “post” can be a piece of wood or metal, concept 1, or the mail, concept 2, or a record in a log, concept 3. If we consider its use as a verb, to post, there are even more meanings.
Various attempts have been made to classify all words in a given language into a set of concepts. The one that we make use of is WordNet, created by Princeton University. There are now WordNets for almost all the world’s languages. A WordNet is a giant thesaurus and dictionary, and one can look up the concepts associated with any word, along with other important information.
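If you want to see this for yourself, here is a minimal sketch using NLTK’s interface to the Princeton WordNet (it assumes NLTK is installed and the WordNet corpus has been downloaded; the sense names and definitions come straight from the WordNet data):

```python
# Minimal sketch: list the noun concepts ("synsets") WordNet records for "post".
# Assumes NLTK is installed and the WordNet corpus has been fetched via nltk.download().
from nltk.corpus import wordnet as wn

for synset in wn.synsets('post', pos=wn.NOUN):
    print(synset.name(), '-', synset.definition())

# The verb "to post" gets its own, separate set of concepts:
print(len(wn.synsets('post', pos=wn.VERB)), 'verb senses of "post"')
```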
Scientio has concentrated on a particular property of concepts that others have not made much use of: they tend to form trees.
There are several relationships that WordNet tracks, all of which have long grammatical names. The important ones to us are the “is a kind of” relationship, known as hypernymy; the “is a part of” relationship, known as meronymy; and the “is opposite to” relationship, known as antonymy.
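As a rough sketch of how these links can be inspected with NLTK’s WordNet interface (the example words and the naive “first sense” selection are purely for illustration):

```python
# Sketch of reading WordNet's relationship links with NLTK (assumes the
# WordNet corpus is downloaded). Senses are picked naively via the first synset.
from nltk.corpus import wordnet as wn

car = wn.synsets('car', pos=wn.NOUN)[0]   # the automobile sense of "car"
print(car.hypernyms())                    # hypernymy: what a car "is a kind of"
print(car.part_meronyms())                # meronymy: concepts that are parts of a car

good = wn.synsets('good', pos=wn.ADJ)[0]
# antonymy is recorded between lemmas (word forms) rather than whole synsets
print([ant.name() for lemma in good.lemmas() for ant in lemma.antonyms()])
```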
Almost every noun concept is involved in a hypernymy relationship, and they form massive trees, with a small number of root nodes representing concepts that cannot be further simplified or made more abstract or general. In these trees of noun concepts the children are more specific examples of the parent.
To give you an example of one path through a tree from root to tip, consider the following:
- A Palomino is a kind of pony.
- A pony is a kind of horse.
- A horse is a kind of ungulate.
- An ungulate is a kind of animal.
- An animal is a kind of entity.
The same kinds of structures apply to adjectives and verbs too.
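Here is a hedged sketch of walking one of these chains with NLTK; it picks whichever sense of “pony” sits below the animal sense of “horse” and prints WordNet’s own path, which is rather longer than the simplified list above:

```python
# Sketch: print the hypernym path(s) for the animal sense of "pony".
# horse.n.01 is WordNet's equine-animal sense of "horse"; the chain printed
# here is WordNet's own, from the root concept (entity.n.01) down to "pony".
from nltk.corpus import wordnet as wn

horse = wn.synset('horse.n.01')
for sense in wn.synsets('pony', pos=wn.NOUN):
    # keep only the sense(s) of "pony" whose ancestors include the animal "horse"
    if horse in sense.closure(lambda s: s.hypernyms()):
        for path in sense.hypernym_paths():
            print(' -> '.join(s.name() for s in path))
```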
So, what’s the use of this? Well, words are unordered, other than alphabetically, and it is this unordered nature that makes text mining difficult and computationally expensive. Text mining, search, and the like are concerned with the frequencies of large numbers of different words. The space of concepts, by contrast, has structure because of these trees, and so we can find ways to compare and order concepts that are much more compact than working with words.
The drawback, as you’ll have guessed, is that which concept is meant for a given word in a given sentence is often ambiguous.
So we can convert a sentence to a string of concepts just by looking the words up in WordNet, but there will be uncertainty in two areas: (1) the part of speech (POS) associated with each word, and (2) the concept intended for each word.
Concept Strings
Scientio’s approach is to introduce a new data structure, the Concept String, which holds all the ambiguity associated with a piece of text. In creating Concept Strings, Scientio’s software does its best to reduce any ambiguity, for instance by using word order to infer POS, but it keeps every concept that might reasonably be intended for each word, and thus all the possible alternative readings of a piece of text.
The figure above illustrates the structure of a Concept String; the red arrows indicate one particular reading.
To make life easier a long piece of text is usually broken into sentences or phrases, and these are processed into individual Concept Strings.
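To make the idea concrete, here is a minimal sketch of what such a structure might look like, using NLTK’s stock tokenizer and tagger as a stand-in for our own POS inference. The names WordConcepts and build_concept_string are illustrative only, not our production API; the point is simply that every plausible concept for every content word is kept.

```python
# Illustrative sketch of a Concept String: one entry per content word, holding
# every WordNet concept that word might mean. Assumes the NLTK tokenizer,
# tagger and WordNet resources have been downloaded via nltk.download().
from dataclasses import dataclass

import nltk
from nltk.corpus import wordnet as wn

# Map the first letter of Penn Treebank tags to WordNet parts of speech.
PENN_TO_WN = {'N': wn.NOUN, 'V': wn.VERB, 'J': wn.ADJ, 'R': wn.ADV}

@dataclass
class WordConcepts:
    word: str
    pos: str          # part of speech inferred from word order
    candidates: list  # every WordNet concept the word might reasonably mean

def build_concept_string(sentence):
    """Build a Concept String: one WordConcepts entry per content word."""
    concept_string = []
    for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
        wn_pos = PENN_TO_WN.get(tag[0])
        if wn_pos is None:
            continue                               # determiners, pronouns, etc.
        candidates = wn.synsets(word, pos=wn_pos)  # all surviving readings
        if candidates:
            concept_string.append(WordConcepts(word, wn_pos, candidates))
    return concept_string

for wc in build_concept_string("I'm running to the bus"):
    print(wc.word, wc.pos, [s.name() for s in wc.candidates])
```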
This gives us something very powerful: the ability to look at two pieces of text and determine whether they might, in one of their interpretations, mean the same thing.
Comparing Concept Strings
Comparison between two Concept Strings is much more complicated than comparing ordinary strings. First we check whether the parts of speech agree, then whether there is a common concept in each word’s list of possible concepts, and then, much more subtly, using the trees we discussed above, whether there are matches further up the tree.
In this case “I’m moving to the bus” would match “I’m running to the bus”, “I’m jogging to the bus”, and “I’m walking to the bus”, as well, of course, as “I’m running to the coach”.
This is because running, jogging, and walking are all kinds of moving.
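Here is a hedged sketch of that “further up the tree” test over raw NLTK synsets; subsumes and concepts_match are illustrative names, not our actual implementation:

```python
# Sketch of the hypernym-aware match: two words' candidate concept lists agree
# if they share a concept outright, or if some sense of one is an ancestor of
# some sense of the other. Expected output is True for each pair under
# WordNet 3.x, since senses of "run", "walk" and "jog" descend from move/travel.
from nltk.corpus import wordnet as wn

def subsumes(general, specific):
    """True if `general` is the same concept as, or an ancestor of, `specific`."""
    return general == specific or general in specific.closure(lambda s: s.hypernyms())

def concepts_match(candidates_a, candidates_b):
    """True if any reading of one word agrees with any reading of the other."""
    return any(subsumes(a, b) or subsumes(b, a)
               for a in candidates_a for b in candidates_b)

moving  = wn.synsets('moving',  pos=wn.VERB)   # all verb senses of "move"
running = wn.synsets('running', pos=wn.VERB)   # all verb senses of "run"
walking = wn.synsets('walking', pos=wn.VERB)
jogging = wn.synsets('jogging', pos=wn.VERB)

print(concepts_match(moving, running))   # running is a kind of moving
print(concepts_match(moving, walking))   # walking is a kind of moving
print(concepts_match(moving, jogging))   # jogging is a kind of moving
```

A full Concept String comparison then only needs this word-level test to succeed position by position once the parts of speech have been lined up.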
Now, again, as you’ll have guessed, the comparison above relies on a particular ordering of parts of speech. It’s possible to say the same thing with lots of different orderings, but at least we have simplified things dramatically. It is now possible to search large amounts of text for important statements, such as “the bomb is on the plane”, using just a couple of templates, whereas to do the same thing in the space of words would require the specification of a large number of alternatives.
In my next blog I’ll look at structures we’ve found for efficiently indexing Concept Strings, and at some applications.