Thursday 26 January 2012

Introduction to Concept Strings

Scientio has been working for several years on using the space of concepts rather than words to perform various text mining applications.  See, for instance, this paper.

Using the tools we’ve created you can search large volumes of text for phrases based on meaning, mine sentiment, mine text and categorize documents using the concepts implied in the text rather than unwieldy word frequencies. This technique combines the best bits of “bag of words” text mining and Natural Language Processing, and opens new fields of research.

A Concept is a somewhat nebulous idea. What we mean by it is a meaning that is normally language independent and is often shared by several words. It is the meaning intended for a word in a piece of text, though that meaning may be obscured by ambiguity.

To give you an example, the noun “post” can be a piece of wood or metal (concept 1), the mail (concept 2), or a record in a log (concept 3). If we consider its use as a verb, to post, there are even more meanings.

Various attempts have been made to classify all words in a given language into a set of concepts. The one that we make use of is WordNet, created by Princeton University. WordNets now exist for many of the world’s languages. A WordNet is a giant thesaurus and dictionary in which one can look up the concepts associated with any word, along with other important information.

Scientio has concentrated on a particular property of concepts that others have not made much use of. They tend to form into trees.

WordNet tracks several relationships, all of which have long grammatical names. The important ones to us are the “is a kind of” relationship, known as hypernymy, the “is a part of” relationship, known as meronymy, and the “is opposite to” relationship, known as antonymy.

Almost every noun concept is involved in a hypernymy relationship, and they form massive trees, with a small number of root nodes representing concepts that cannot be further simplified or made more abstract or general. In these trees of noun concepts the children are more specific examples of the parent.

To give you an example of one path through a tree from root to tip, consider the following:

  • A Palomino is a kind of pony.
  • A pony is a kind of horse.
  • A horse is a kind of ungulate.
  • An ungulate is a kind of animal.
  • An animal is a kind of entity.
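The path above can be sketched in a few lines of Python. This is only an illustration and not Scientio's code: the dictionary is a hand-made stand-in for the hypernym links a real system would read from WordNet, and the function names are invented for the example.

```python
# Toy "is a kind of" (hypernym) links copied from the example path.
# Illustrative only: a real system would load these from WordNet.
HYPERNYM = {
    "palomino": "pony",
    "pony": "horse",
    "horse": "ungulate",
    "ungulate": "animal",
    "animal": "entity",
}

def path_to_root(concept):
    """Return the chain from `concept` up to the root of its tree."""
    path = [concept]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path

def is_a_kind_of(child, ancestor):
    """True if `ancestor` sits anywhere above `child` in the tree."""
    return ancestor in path_to_root(child)[1:]

print(" -> ".join(path_to_root("palomino")))
print(is_a_kind_of("pony", "animal"))
```

Because every concept has at most one parent in this toy tree, walking to the root is a simple loop. WordNet itself allows a concept to have more than one hypernym, so a fuller version would have to follow several paths.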

The same kinds of structures apply to adjectives and verbs too.

So, what’s the use of this? Well, words are unordered, other than alphabetically, and it is this unordered nature that makes text mining difficult and computationally expensive. Text mining, search, etc. are concerned with the frequencies of large numbers of different words. The space of concepts has structure, because of these trees, and so we can find ways to compare and order concepts that are much more compact than the equivalent methods based on words.
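One simple way to see how tree structure gives an ordering is to count how many levels two concepts share on their way down from the root. The Python sketch below is an assumption made for illustration, not a description of Scientio's method; the extra concepts ("cow", "sparrow", "bird") and the function names are invented.

```python
# Toy hypernym links: child concept -> parent concept (illustrative only).
HYPERNYM = {
    "pony": "horse",
    "horse": "ungulate",
    "cow": "ungulate",
    "ungulate": "animal",
    "sparrow": "bird",
    "bird": "animal",
    "animal": "entity",
}

def root_path(concept):
    """Path from the root of the tree down to `concept`."""
    path = [concept]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path[::-1]

def shared_depth(a, b):
    """Number of leading tree levels that the two concepts have in common."""
    depth = 0
    for x, y in zip(root_path(a), root_path(b)):
        if x != y:
            break
        depth += 1
    return depth

# Order some concepts by how closely related they are to "pony".
others = ["sparrow", "cow", "horse"]
ranked = sorted(others, key=lambda c: shared_depth("pony", c), reverse=True)
print(ranked)
```

Words on their own offer nothing like this: "pony" and "horse" are no closer as strings than "pony" and "sparrow".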

The drawback, as you’ll have guessed, is that it is often ambiguous which concept is meant by a given word in a given sentence.

So we can convert a sentence to a string of concepts just by looking them up in WordNet, but there will be uncertainty in two areas: (1) the part of speech (POS) associated with each word, and (2) the concept intended for each word.

Concept Strings

Scientio’s approach is to invent a new data structure, the Concept String, that holds all the ambiguity associated with a piece of text. In creating Concept Strings, Scientio’s software does its best to reduce any ambiguity, for instance by using word order to infer POS, but it holds all the concepts for each word that might reasonably be intended, and thus all the possible alternate readings for a piece of text.

[Image: structure of a concept string]

The above illustrates the structure of a concept string, where the red arrows indicate one particular reading.
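To make the idea concrete, here is a minimal Python sketch of such a structure. It is an assumption about one possible layout, not Scientio's actual implementation: the class names, the concept identifiers and the choice to keep function words as single-candidate slots are all invented for the example.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Candidate:
    pos: str      # part of speech, e.g. "noun" or "verb"
    concept: str  # hypothetical concept identifier

@dataclass
class Slot:
    word: str
    candidates: list  # every Candidate that might reasonably be intended

def readings(concept_string):
    """Yield each alternate reading: one Candidate chosen per slot."""
    return product(*(slot.candidates for slot in concept_string))

def keep_pos(slot, pos):
    """Drop candidates whose part of speech has been ruled out."""
    return Slot(slot.word, [c for c in slot.candidates if c.pos == pos])

# Toy concept string for "post the letter" (identifiers are made up).
concept_string = [
    Slot("post", [Candidate("verb", "send_by_mail"),
                  Candidate("verb", "publish"),
                  Candidate("noun", "upright_pole")]),
    Slot("the", [Candidate("det", "definite")]),
    Slot("letter", [Candidate("noun", "written_message"),
                    Candidate("noun", "alphabet_character")]),
]

all_readings = list(readings(concept_string))
print(len(all_readings))  # 3 * 1 * 2 candidates

# Suppose word order tells us "post" is a verb here: fewer readings remain.
pruned = [keep_pos(concept_string[0], "verb")] + concept_string[1:]
print(len(list(readings(pruned))))
```

Each reading corresponds to one set of arrows through the structure. Nothing is thrown away unless it has been ruled out, which is what allows a later comparison to succeed on any plausible interpretation.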

To make life easier a long piece of text is usually broken into sentences or phrases, and these are processed into individual Concept Strings.

This gives us something very powerful, the ability to look at two pieces of text and to determine if they might, in one of their interpretations, mean the same thing.

Comparing Concept Strings

[Image: comparison of two concept strings]

Comparing two concept strings is much more complicated than comparing normal strings. First we check whether the parts of speech agree, then whether there is a common concept in each word’s list of possible concepts and, much more subtly, using the trees we discussed above, whether there are matches further up the tree.

In this case “I’m moving to the bus” would match with “I’m running to the bus”, “I’m jogging to the bus” and “I’m walking to the bus”, as well, of course, as “I’m running to the coach”.

This is because running, walking and jogging are all kinds of moving.
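The comparison can be sketched roughly as follows. This Python example is a simplification under stated assumptions, not Scientio's algorithm: each slot is a (part of speech, set of concepts) pair, two concepts match when they are equal or one lies above the other in the tree, and all concept names and helper functions are invented.

```python
# Toy hypernym links: child concept -> parent concept (illustrative only).
HYPERNYM = {"run": "move", "jog": "move", "walk": "move"}

def ancestors(concept):
    """The concept itself plus everything above it in the tree."""
    chain = [concept]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return set(chain)

def concepts_match(a, b):
    """Equal, or one concept is a more general form of the other."""
    return a in ancestors(b) or b in ancestors(a)

def slots_match(slot_a, slot_b):
    """Parts of speech must agree and at least one concept pair must match."""
    pos_a, concepts_a = slot_a
    pos_b, concepts_b = slot_b
    if pos_a != pos_b:
        return False
    return any(concepts_match(a, b) for a in concepts_a for b in concepts_b)

def strings_match(cs_a, cs_b):
    """Slot-by-slot comparison; assumes the same ordering of parts of speech."""
    return len(cs_a) == len(cs_b) and all(
        slots_match(a, b) for a, b in zip(cs_a, cs_b))

def sentence(verb_concepts, noun_concepts):
    """Build a toy concept string for "I'm <verb> to the <noun>"."""
    return [("pronoun", {"speaker"}), ("verb", set(verb_concepts)),
            ("prep", {"towards"}), ("noun", set(noun_concepts))]

moving_bus = sentence({"move"}, {"bus"})
running_bus = sentence({"run", "operate"}, {"bus"})        # "run" is ambiguous
running_coach = sentence({"run", "operate"}, {"bus", "trainer"})
sitting_bus = sentence({"sit"}, {"bus"})

print(strings_match(moving_bus, running_bus))
print(strings_match(moving_bus, running_coach))
print(strings_match(moving_bus, sitting_bus))
```

Here “coach” matches “bus” because the two words share a concept, while “running” matches “moving” only by climbing the tree. Whether two siblings such as running and walking should also match each other is a design choice; this sketch takes the stricter view.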

Now, again, as you’ll have guessed, the comparison above relies on a particular ordering of parts of speech. It’s possible to say the same thing with lots of different orderings of these, but at least we have simplified things dramatically. It is now possible to search large amounts of text for important statements, such as “the bomb is on the plane” using just a couple of templates, whereas to do the same thing in the space of words would require the specification of a large number of alternatives.

In my next blog I’ll look at structures we’ve found for efficiently indexing concept strings and applications.

Sunday 15 January 2012

Azure HPC Scheduler–integrating into an existing website

Scientio is a creator of text mining, data mining, rule based and time series analysis software. Although our products are designed to be as quick as possible, they are still potentially large scale consumers of processing power, especially if applied to large data sets. We’ve been looking for several years at offering access to these products as a service, but the costs have always been prohibitive or the available technology too slow. Finally it looks like technology has caught up in the shape of Microsoft Azure HPC Scheduler, which offers the opportunity to run large computing clusters in the cloud. (Get the SDK here.) We’re just at the start, but we hope to be able to permit registered users to upload data to our blob storage and then run potentially large and lengthy tasks with our products on the HPC cluster, via either the existing HPC web based interface or the REST API.

At the time of writing the Azure HPC Scheduler software is very new and the documentation is skimpy. Microsoft have provided an example service that runs Linq-HPC, MPI and SOA examples, but not much in the way of documentation apart from that. The following are a few notes on integrating HPC into your own Azure hosted site. You should try running the sample service first: it will make the following a bit clearer, and it will create the database you need.

I’ll look initially at just getting the composite site going – in later blogs I’ll look at issues like controlling customer access, provisioning customers, logging, billing etc.

Configuration

There’s lots to configure with the HPC scheduler. The approach taken with the sample service was to create a WPF application that collected information from the user about accounts etc., dynamically created the Azure configuration files and then uploaded the whole thing to Azure. This won’t do if, like Scientio, you have an existing site that you are integrating HPC into.

Also, since this service is experimental, we wanted to cut down on our Azure bill by not having a separate instance running as a head node.  I should explain: HPC requires three types of instances, the web front end, the head node (responsible for scheduling jobs) and worker nodes (which do the work). It’s possible to configure HPC to combine the front end and the head node. It’s not clear yet at what point you have to have an independent head node as you increase the number of workers. Anyway, we wanted to start without, and the configuration app in the sample service doesn’t do this.

Finally, the HPC front end requires secure sockets access, and we’ve already got an SSL certificate for our domain, so we want the HPC front end to use that and to be accessible as <domain name>/Portal/.

So how to achieve all this? There are several stages:

1) configure the existing site to accept the HPC front end

2) write an application that fills in the azure configuration

3) write HPC-friendly wrappers for each Scientio product.

 

Modifying the site

The first thing is to stop the master site’s web.config settings from being inherited by the scheduler front end, which runs as a child application:

<location path="."  inheritInChildApplications="false">
…..
</location>


Place the above around the system.web element in the web.config, and separately around the system.webServer element.



This specifically prevents any dll clashes.



Next you need to edit the ServiceDefinition.csdef file in the Azure project. Here’s an example:



<?xml version="1.0" encoding="utf-8"?>
<ServiceDefinition name="<your service name>" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
  <WebRole name="<web site name>" vmsize="Medium">
    <Sites>
      <Site name="Web">
        <VirtualApplication name="Portal" physicalDirectory="C:\Program Files\Windows Azure HPC Scheduler SDK\v1.6\hpcportal" />
        <Bindings>
          <Binding name="HttpIn" endpointName="HttpIn" />
          <Binding name="HPCWebServiceHttps" endpointName="Microsoft.Hpc.Azure.Endpoint.HPCWebServiceHttps" />
        </Bindings>
      </Site>
    </Sites>
    <ConfigurationSettings>
      <Setting name="DiagnosticsConnectionString" />
      <Setting name="DataConnectionString" />
    </ConfigurationSettings>
    <Certificates>
      <Certificate name="<your https certificate name>" storeLocation="LocalMachine" storeName="My" />
    </Certificates>
    <Endpoints>
      <InputEndpoint name="HttpIn" protocol="http" port="80" />
    </Endpoints>
    <Imports>
      <Import moduleName="Diagnostics" />
      <Import moduleName="HpcWebFrontEndHeadNode" />
      <Import moduleName="RemoteAccess" />
      <Import moduleName="RemoteForwarder" />
    </Imports>
  </WebRole>
  <WorkerRole name="ComputeNode" vmsize="Small">
    <Imports>
      <Import moduleName="Diagnostics" />
      <Import moduleName="HpcComputeNode" />
      <Import moduleName="RemoteAccess" />
    </Imports>
  </WorkerRole>
</ServiceDefinition>



There are several things to note: first, the web role instance is size “Medium”. Anything less is unreliable at start up. This seems to be due to memory limitations.



Secondly, unlike the example, we have no separate head node role. Instead the web role imports “HpcWebFrontEndHeadNode”, which combines front end and head node.



Filling in the configuration



The Azure HPC SDK supplies a class, ClusterConfig, that you can use to fill in the configuration fields.



I’ve created a command line application that calls this and modifies the configuration directly. It reads the definition file above to work out what to modify.



using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.Hpc.Azure.ClusterConfig;
using System.Security.Cryptography.X509Certificates;

namespace UpdateAzureHPCConfig
{
    class Program
    {
        static void Main(string[] args)
        {
            ClusterConfig config = new ClusterConfig();

            // Read the service definition to work out which settings are needed
            config.SetCsdefFile(@"C:\<path to the service definition>\ServiceDefinition.csdef");

            config.EnableSOA();

            // Fill in the Azure account info
            config.SubScriptionId = new Guid("{<your azure account subscription id>}");
            config.ServiceName = "<the service name>";
            config.SetStorage("<storage account name>", "<storage account key>");

            // Fill in the SQL Azure account info
            config.DBServer = "<sql server name>";
            config.DBUser = "<sql user name>";
            config.DBPassword = "<sql user password>";
            config.DBName = "<database name>";

            // Fill in the certificate thumbprints
            // (the same SSL certificate is used for both purposes)
            X509Certificate2 sslcert = CertificateHelper.LoadCert(@"<ssl certificate>.pfx", "<cert password>");

            config.AddCertificate("Microsoft.Hpc.Azure.Certificate.SSLCert", sslcert);
            config.AddCertificate("Microsoft.Hpc.Azure.Certificate.PasswordEncryptionCert", sslcert);

            // You can override some preconfigured settings
            config.ClusterName = "scientio";

            // Setup the built-in cluster admin account
            config.ClusterAdmin = "<cluster admin name>";
            config.ClusterAdminEncryptedPassword = CertificateHelper.EncryptWithCertificate("<password>", sslcert);

            // Write the generated settings into the service configuration
            config.Generate(@"C:\<path to configuration>\ServiceConfiguration.cscfg", @"C:\<path to configuration>\ServiceConfiguration.cscfg");
        }
    }
}


There is a bug in the ClusterConfig class: it renames the ServiceName in the service configuration. Just rename it back.



You’ll note there’s a SQL server database involved. I created one of these using the sample service and then re-used it.



There are various Azure storage Blob containers and tables used – these should be generated automatically.



You will need to create a ComputeNode project. This is just an empty worker; all the clever stuff is done with the imports.



Wrappers for the products



The standard form of application you can run on HPC is a command line app. The great big gotcha at the moment is that these must be compiled with .NET 3.5. When asked, Microsoft wouldn’t say when .NET 4.0 would be available.



Rather than accessing the local file system these can be configured to access Azure blob storage. If you look at the Linq-HPC example in the sample service you can see how to do this.



As has been publicised, Microsoft has decided not to continue with Linq-HPC and the underlying Dryad distributed storage. Instead Microsoft is going with Hadoop on Azure. There’s obviously a good fit between our products and a Hadoop cluster, especially with our text and concept mining products, so we’ll be investigating this soon.



I hope this helped you to get underway with creating your own Azure HPC clusters.