Intelligence as a Service
This blog is about the next wave in SaaS - intelligence as a service. Although there are lots of web services you can use to create mashups, they are mostly data services. This blog concentrates on creating and using internet resources that supply intelligence, like data mining, text mining, expert systems etc.
Thursday 28 July 2016
The Introvert Entrepreneur
Friday 16 May 2014
Azure API Management - Almost there
We downed tools when we saw the release of what had been Apiphany's offering re-launched as a part of Azure, wondering whether we should ditch our infrastructure and use it.
The answer is no, but maybe in future. Here's why, for the benefit of any Microsoft marketing people:
Documentation
Documenting your API is vital. Our web site is an MVC 5.1 site using Web API. This comes with an excellent help generator that creates documentation from the controller code. Not only does this document the calls but, more importantly since ours are quite complex, it also documents the JSON/XML structures returned and expected.
The API manager in Azure imports WADL or Swagger. There's recentish code on NuGet for Swagger with no instructions for use, and old code for WADL.
These don't seem viable options. Doing it by hand seems tedious, with no options to document the structures. So if we were to use this, we'd provide a considerably worse set of documentation.
Billing integration
There isn't any. So we'd have to respond to requests to register by email and lead customers to a second site to do it. The site permits quick initial trial sign up, but no more.
Marketing
Again, there isn't any. It's unclear whether, if you sign up for this service, you'd get everybody and his dog viewing your APIs because it's part of a popular site, or nobody.
The structure
There are APIs, Products and Applications. The former seem limited to one address. Products are really service levels, and applications are you or other vendors who have created an app that makes use of your API.
Our idea of products, like Concept Forms or Concept Strings, is that each is composed of several APIs.
You sign up to a product, start at a free level of service and move up.
What we need to do does not therefore seem to line up with the API manager.
The look of the website and extensibility
Having spent months making a website (hopefully) consistent, with necessary features like the ability to download tooling, we find the built-in site very constricted. I understand that Apiphany felt they needed this, but as part of the Azure ecosystem, with loads of different web CMS platforms available, the built-in system looks stopgap and doesn't fit our needs.
The good bit is the intermediation - though even there, there is no advice given as to how to prevent users just bypassing the manager.
We'd be happy to use this if the above issues were addressed, as I'm sure they will be in the next couple of months. Right now you'd have to be a very particular kind of vendor to want it.
Well done to Apiphany though, I hope they are sitting back counting the money...
Wednesday 31 July 2013
How to monitor threats and abuse on the Internet with minimal effect to civil liberties
Overview
Thursday 26 January 2012
Introduction to Concept Strings
Scientio has been working for several years on using the space of concepts rather than words to perform various text mining applications. See, for instance, this paper.
Using the tools we’ve created you can search for phrases in large volumes of text based on meaning, and you can sentiment mine, text mine and categorize using the concepts implied in text rather than unwieldy word frequencies. This technique combines the best bits of “bag of words” text mining and Natural Language Processing, and opens new fields of research.
A Concept is a somewhat nebulous idea. What we mean by it is a common meaning that is normally language independent and often shared by several words. It is the meaning intended for a word in a piece of text, though that meaning may be obscured by ambiguity.
To give you an example, the noun “post” can be a piece of wood or metal, concept 1, or the mail, concept 2, or a record in a log, concept 3. If we consider its use as a verb, to post, there are even more meanings.
Various attempts have been made to classify all words in a given language into a set of concepts. The one that we make use of is WordNet, created by Princeton University. There are now WordNets for almost all the world’s languages. A WordNet is a giant thesaurus and dictionary, and one can look up the concepts associated with any word, along with other important information.
Scientio has concentrated on a particular property of concepts that others have not made much use of. They tend to form into trees.
There are several relationships that WordNet tracks, that have long grammatical names. The important ones to us are the “is a kind of” relationship, known as hypernymy, the “is a part of” relationship, known as meronymy, and the “is opposite to” relationship, known as antonymy.
Almost every noun concept is involved in a hypernymy relationship, and they form massive trees, with a small number of root nodes representing concepts that cannot be further simplified or made more abstract or general. In these trees of noun concepts the children are more specific examples of the parent.
To give you an example of one path through a tree from root to tip, consider the following:
- A Palomino is a kind of pony.
- A pony is a kind of horse.
- A horse is a kind of ungulate.
- An ungulate is a kind of animal.
- An animal is a kind of entity.
The same kinds of structures apply to adjectives and verbs too.
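To make the tree idea concrete, here is a minimal sketch of the path above as a child-to-parent map. The dictionary and the function names are my own for illustration; this is not WordNet's API or Scientio's code, and a real WordNet holds vastly more links than these five.

```python
# Toy "is a kind of" (hypernym) links: child concept -> parent concept.
# The entries mirror the example path in the text; everything here is illustrative.
HYPERNYM = {
    "palomino": "pony",
    "pony": "horse",
    "horse": "ungulate",
    "ungulate": "animal",
    "animal": "entity",
}


def path_to_root(concept, parents=HYPERNYM):
    """Return the chain of concepts from `concept` up to its root."""
    path = [concept]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path


def is_a_kind_of(concept, ancestor, parents=HYPERNYM):
    """True if `ancestor` is `concept` itself or lies above it in the tree."""
    return ancestor in path_to_root(concept, parents)


print(" -> ".join(path_to_root("palomino")))
print(is_a_kind_of("pony", "animal"))
```

Because every concept has a path to a root, questions like "is a pony a kind of animal?" become a short walk up the tree rather than a comparison of unrelated words.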
So, what’s the use of this? Well, words are unordered, other than alphabetically, and it is this unordered nature that makes text mining difficult and computationally expensive. Text mining, search, etc. are concerned with the frequencies of large numbers of different words. The space of concepts has structure, because of these trees, and so we can find ways to compare and order concepts that are much more compact than anything based on words.
The drawback, as you’ll have guessed, is that which concept is meant for a given word in a given sentence is often ambiguous.
So we can convert a sentence to a string of concepts just by looking them up in WordNet, but there will be uncertainty in two areas: (1) the part of speech (POS) associated with each word, and (2) the concept intended for each word.
Concept Strings
Scientio’s approach is to invent a new data structure, the Concept String, that holds all the ambiguity associated with a piece of text. In creating Concept Strings, Scientio’s software does its best to reduce any ambiguity, for instance by using word order to infer POS, but it holds all the concepts for each word that might reasonably be intended, and thus all the possible alternate readings for a piece of text.
The above illustrates the structure of a concept string, where the red arrows indicate one particular reading.
To make life easier a long piece of text is usually broken into sentences or phrases, and these are processed into individual Concept Strings.
This gives us something very powerful, the ability to look at two pieces of text and to determine if they might, in one of their interpretations, mean the same thing.
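As a rough illustration of the idea, a Concept String can be pictured as a list of word slots, each holding every (part of speech, concept) candidate still in play. The class names and concept labels below are invented for this sketch; they are not Scientio's actual data structure or real WordNet identifiers.

```python
from dataclasses import dataclass
from itertools import product
from math import prod


@dataclass(frozen=True)
class Candidate:
    """One possible reading of a word: a part of speech plus a concept label."""
    pos: str       # e.g. "noun", "verb"
    concept: str   # hypothetical label, not a real WordNet identifier


@dataclass
class Slot:
    """A word together with every reading that might reasonably be intended."""
    word: str
    candidates: list


# "post letters", with invented concept labels for illustration only.
concept_string = [
    Slot("post", [Candidate("verb", "send_by_mail"),
                  Candidate("verb", "publish"),
                  Candidate("noun", "upright_pole")]),
    Slot("letters", [Candidate("noun", "written_message"),
                     Candidate("noun", "alphabet_character")]),
]


def reading_count(cs):
    """Number of alternate readings held by the concept string."""
    return prod(len(slot.candidates) for slot in cs)


def readings(cs):
    """Iterate over the readings, each a tuple with one candidate per word."""
    return product(*(slot.candidates for slot in cs))


print(reading_count(concept_string))
for reading in readings(concept_string):
    print([c.concept for c in reading])
```

The number of readings is the product of the candidate counts, which is why reducing ambiguity per word (for instance by inferring POS from word order) pays off so quickly.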
Comparing Concept Strings
Comparison between two concept strings is much more complicated than comparing normal strings. Firstly we look to see if the parts of speech agree, then whether there is a common concept in each word’s list of possible concepts, and then, much more subtly, using the trees we discussed above, whether there are matches further up the tree.
In this case “I'm moving to the bus” would match with “I'm running to the bus”, “I’m jogging to the bus”, “I’m walking to the bus”, as well, of course, as “I’m running to the coach”.
This is because running, walking and jogging are all kinds of moving.
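The three checks described above can be sketched in a few lines. This is a simplified illustration under my own assumptions, not Scientio's matcher: the concept labels and the tiny tree are invented, only the content words are kept, and a template concept is taken to match a text concept when it is the same concept or one of its ancestors.

```python
# Toy hypernym links (child -> parent); labels are invented for illustration.
PARENT = {
    "run": "move", "jog": "move", "walk": "move",
    "bus": "vehicle",
}


def ancestors(concept):
    """Return the concept followed by everything above it in the tree."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def slot_matches(template_slot, text_slot):
    """True if some pair of candidates agrees on part of speech and the
    template concept equals the text concept or sits above it in the tree."""
    return any(
        t_pos == x_pos and t_concept in ancestors(x_concept)
        for t_pos, t_concept in template_slot
        for x_pos, x_concept in text_slot
    )


def matches(template, text):
    """Compare two concept strings slot by slot, in order."""
    return len(template) == len(text) and all(
        slot_matches(t, x) for t, x in zip(template, text)
    )


# Each slot lists (part of speech, concept) candidates for one content word.
template = [[("verb", "move")], [("noun", "bus")]]                      # moving ... bus
running_bus = [[("verb", "run"), ("noun", "run_score")], [("noun", "bus")]]
running_coach = [[("verb", "run")], [("noun", "bus"), ("noun", "trainer")]]
eating_bus = [[("verb", "eat")], [("noun", "bus")]]

print(matches(template, running_bus))
print(matches(template, running_coach))
print(matches(template, eating_bus))
```

Here "coach" matches "bus" because one of its candidate concepts is the bus concept, while "run" matches "move" only by walking up the tree.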
Now, again, as you’ll have guessed, the comparison above relies on a particular ordering of parts of speech. It’s possible to say the same thing with lots of different orderings of these, but at least we have simplified things dramatically. It is now possible to search large amounts of text for important statements, such as “the bomb is on the plane” using just a couple of templates, whereas to do the same thing in the space of words would require the specification of a large number of alternatives.
In my next blog I’ll look at structures we’ve found for efficiently indexing concept strings and applications.
Sunday 15 January 2012
Azure HPC Scheduler–integrating into an existing website
Scientio is a creator of text mining, data mining, rule based and time series analysis software. Although our products are designed to be as quick as possible, they are still potentially large scale consumers of processing power, especially if applied to large data sets. We’ve been looking for several years at offering access to these products as a service. The costs have always been prohibitive or the available technology too slow. Finally it looks like technology has caught up in the shape of Microsoft Azure HPC Scheduler, which offers the opportunity to run large computing clusters in the cloud. (Get the SDK here.) We’re just at the start, but we hope to be able to permit registered users to upload data to our blob storage and then run potentially large and lengthy tasks with our products on the HPC cluster, using the existing HPC web based interface or the REST API.
At the time of writing the Azure HPC Scheduler software is very new and the documentation is skimpy. Microsoft have provided an example service that runs Linq-HPC, MPI and SOA examples. They’ve not provided much in the way of documentation apart from that. The following are a few notes on integrating HPC into your own Azure hosted site. You should try running the sample service first; it will make the following a bit clearer, and create the database you need for you.
I’ll look initially at just getting the composite site going – in later blogs I’ll look at issues like controlling customer access, provisioning customers, logging, billing etc.
Configuration
There’s lots to configure with the HPC scheduler. The approach taken with the sample service was to create a WPF application that collected information from the user about accounts etc. and then dynamically created the Azure configuration files and uploaded the whole thing to Azure. This won’t do if you have an existing site, like Scientio’s, that you are integrating HPC into.
Also, since this service is experimental, we wanted to cut down on our Azure bill by not having a separate instance running as a head node. I should explain: HPC requires three types of instances, the web front end, the head node (responsible for scheduling jobs) and worker nodes (which do the work). It’s possible to configure HPC to combine the front end and the head node. It’s not clear yet at what point you have to have an independent head node as you increase the number of workers. Anyway, we wanted to start without, and the configuration app in the sample service doesn’t do this.
Finally, the HPC front end requires secure sockets access, and we’ve already got an SSL certificate for our domain, so we want to make the HPC front end use that, accessible as <domain name>/Portal/
So how to achieve all this? There are several stages:
1) configure the existing site to accept the HPC front end
2) write an application that fills in the Azure configuration
3) write HPC-friendly wrappers for each Scientio product.
Modifying the site
The first thing is to switch off the web config for the master site to stop it affecting the scheduler front end:
<location path="." inheritInChildApplications="false">
…..
</location>
Place the above round the system.web element in the web.config, and separately around the system.webServer element.
This specifically prevents any dll clashes.
Next you need to edit the ServiceDefinition.csdef file in the Azure project. Here’s an example:
<?xml version="1.0" encoding="utf-8"?>
<ServiceDefinition name="<your service name>" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
  <WebRole name="<web site name>" vmsize="Medium">
    <Sites>
      <Site name="Web">
        <VirtualApplication name="Portal" physicalDirectory="C:\Program Files\Windows Azure HPC Scheduler SDK\v1.6\hpcportal" />
        <Bindings>
          <Binding name="HttpIn" endpointName="HttpIn" />
          <Binding name="HPCWebServiceHttps" endpointName="Microsoft.Hpc.Azure.Endpoint.HPCWebServiceHttps" />
        </Bindings>
      </Site>
    </Sites>
    <ConfigurationSettings>
      <Setting name="DiagnosticsConnectionString" />
      <Setting name="DataConnectionString" />
    </ConfigurationSettings>
    <Certificates>
      <Certificate name="<your https certificate name>" storeLocation="LocalMachine" storeName="My" />
    </Certificates>
    <Endpoints>
      <InputEndpoint name="HttpIn" protocol="http" port="80" />
    </Endpoints>
    <Imports>
      <Import moduleName="Diagnostics" />
      <Import moduleName="HpcWebFrontEndHeadNode" />
      <Import moduleName="RemoteAccess" />
      <Import moduleName="RemoteForwarder" />
    </Imports>
  </WebRole>
  <WorkerRole name="ComputeNode" vmsize="Small">
    <Imports>
      <Import moduleName="Diagnostics" />
      <Import moduleName="HpcComputeNode" />
      <Import moduleName="RemoteAccess" />
    </Imports>
  </WorkerRole>
</ServiceDefinition>
There are several things to note: first, the web role instance is size “Medium”. Anything less is unreliable at start up. This seems to be due to memory limitations.
Secondly, we have no head node, unlike the example, but import “HpcWebFrontEndHeadNode” which combines front end and head node.
Filling in the configuration
The Azure HPC SDK supplies a class, ClusterConfig, that you can use to fill in the configuration fields.
I’ve created a command line application that calls this and modifies the configuration directly. It reads the definition file above to work out what to modify.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.Hpc.Azure.ClusterConfig;
using System.Security.Cryptography.X509Certificates;

namespace UpdateAzureHPCConfig
{
    class Program
    {
        static void Main(string[] args)
        {
            // Point the helper at the service definition shown above
            ClusterConfig config = new ClusterConfig();
            config.SetCsdefFile(@"C:\<path to the service definition>\ServiceDefinition.csdef");
            config.EnableSOA();

            // Fill in the Azure account info
            config.SubScriptionId = new Guid("{<your azure account subscription id>}");
            config.ServiceName = "<the service name>";
            config.SetStorage("<storage account name>", "<storage account key>");

            // Fill in the SQL Azure account info
            config.DBServer = "<sql server name>";
            config.DBUser = "<sql user name>";
            config.DBPassword = "<sql user password>";
            config.DBName = "<database name>";

            // Fill in the certificate thumbprints (the same SSL certificate is used for both)
            X509Certificate2 sslcert = CertificateHelper.LoadCert(@"<ssl certificate>.pfx", "<cert password>");
            config.AddCertificate("Microsoft.Hpc.Azure.Certificate.SSLCert", sslcert);
            config.AddCertificate("Microsoft.Hpc.Azure.Certificate.PasswordEncryptionCert", sslcert);

            // You can override some preconfigured settings
            config.ClusterName = "scientio";

            // Setup the built-in cluster admin account
            config.ClusterAdmin = "<cluster admin name>";
            config.ClusterAdminEncryptedPassword = CertificateHelper.EncryptWithCertificate("<password>", sslcert);

            // Write the updated service configuration
            config.Generate(@"C:\<path to configuration>\ServiceConfiguration.cscfg", @"C:\<path to configuration>\ServiceConfiguration.cscfg");
        }
    }
}
There is a bug in the ClusterConfig class in that it renames the serviceConfiguration ServiceName – just rename it back.
You’ll note there’s a SQL server database involved. I created one of these using the sample service and then re-used it.
There are various Azure storage Blob containers and tables used – these should be generated automatically.
You will need to create a ComputeNode project. This is just an empty worker; all the clever stuff is done with the imports.
Wrappers for the products
The standard form of application you can run on HPC is a command line app. The great big gotcha at the moment is that these must be compiled with .NET 3.5. Microsoft, when asked, wouldn’t say when .NET 4.0 would be available.
Rather than accessing the local file system these can be configured to access Azure blob storage. If you look at the Linq-HPC example in the sample service you can see how to do this.
As has been publicised, Microsoft has decided not to continue with Linq-HPC and the underlying Dryad distributed storage. Instead Microsoft is going with Hadoop on Azure. There’s obviously a good fit between our products and a Hadoop cluster, especially with our text and concept mining products, so we’ll be investigating this soon.
I hope this helped you to get underway with creating your own Azure HPC clusters.
Thursday 22 December 2011
Tolkien on Engineering and Invention
In Tolkien’s Silmarillion he provides a backstory to The Lord Of The Rings and The Hobbit, and talks about the demigods that form the Valar, the controllers of the world.
Here is what he has to say about the smith god Aulë:
“but the delight and pride of Aulë is in the deed of making and in the thing made, and neither in possession nor in his own mastery; wherefore he gives and hoards not, and is free from care, passing ever on to some new work.”
Doesn’t that sum up our profession, or at least how it ought to be?
Friday 2 December 2011
Automated Medical Diagnosis and XmlMiner
Scientio is getting towards the end of a successful collaboration with a medical devices start up. Basically the product works, and barring some tinkering and approvals the initial development phase is over. I’m not going to talk about this company; there’ll be a separate splash when they are ready to publicise things, but, obviously, having built up this expertise we’d like to reuse it.
My feeling is that there are other enterprises like this, who may not be aware of what we do, or that what we do can be done. I don’t intend to break any confidences so I’m going to talk about our experiences in general terms in this post.
This company had a unique way of interpreting and conditioning a kind of sensor that is frequently used. They also had a set of tests built around this and other sensors, and an expert who could detect a range of conditions using this set up. Obviously with only one expert and only so many hours in a day the earning potential of this idea was limited, so how could they automate and reproduce this idea, so it would be available across America?
Scientio’s interest in this was the automation of the expert’s diagnostic knowledge, and the provision of this as a central cloud based diagnosis engine. The result is that this diagnostic method is now leveraged so that thousands of tests can be handled in the time required for one manual test. This previous post talks about the architecture we used.
We’ve discovered through this process that Scientio’s engine is ideal for such tasks.
First of all, in an environment where approvals and compliance are important, the rules, though stored as XML, are easily displayed as English language if…then text, so the function of the system can be easily verified.
The rules are testable, either as a complete functional block or individually, and we supply software in our Lacuna product that can find any unintentional gaps in the rule sets, i.e. combinations of inputs that ought to produce a valid result but don’t.
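To show what a "gap" means here, the sketch below brute-forces every combination of inputs for a tiny rule set and reports the ones no rule covers. It is only an illustration of the idea: the variables, values and rules are invented, and I am not claiming this is how Lacuna works internally.

```python
from itertools import product

# Hypothetical input variables and the values each can take.
DOMAINS = {
    "reading": ["low", "normal", "high"],
    "trend": ["steady", "rising"],
}

# Each rule is (conditions, conclusion). A variable that is missing from the
# conditions matches any value.
RULES = [
    ({"reading": "normal", "trend": "steady"}, "ok"),
    ({"reading": "high"}, "refer"),
    ({"reading": "low", "trend": "rising"}, "refer"),
    ({"reading": "normal", "trend": "rising"}, "retest"),
]


def rule_fires(conditions, case):
    """True if every condition of the rule holds for this combination of inputs."""
    return all(case[name] == value for name, value in conditions.items())


def find_gaps(domains, rules):
    """Return every input combination for which no rule produces a result."""
    names = list(domains)
    gaps = []
    for values in product(*(domains[name] for name in names)):
        case = dict(zip(names, values))
        if not any(rule_fires(conditions, case) for conditions, _ in rules):
            gaps.append(case)
    return gaps


print(find_gaps(DOMAINS, RULES))
```

Exhaustive enumeration like this only works for small, discrete input spaces; it is here to make the definition of a gap concrete, nothing more.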
When you add a new fact to a conventional expert system you have no idea how long it will process before stable results are generated.
XmlMiner uses pre-processing to format the rules for straight through processing. The run time is defined and exceedingly speedy.
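I won't describe XmlMiner's internals here, but one general way to get a defined run time is to sort the rules by dependency once, up front, so that a single pass evaluates each rule exactly once. The sketch below shows that generic technique with invented rule names and Python's standard graphlib module; treat it as an illustration of straight through processing rather than a description of our engine.

```python
from graphlib import TopologicalSorter

# Hypothetical rules: name -> (names of the inputs it needs, function computing it).
RULES = {
    "alert": (("risk", "stale"), lambda risk, stale: risk > 0.5 and not stale),
    "risk": (("reading",), lambda reading: reading / 100.0),
    "stale": (("age_s",), lambda age_s: age_s > 60),
}


def compile_order(rules):
    """Pre-processing step: order rules so each runs after the rules it depends on."""
    graph = {
        name: [dep for dep in deps if dep in rules]
        for name, (deps, _) in rules.items()
    }
    return list(TopologicalSorter(graph).static_order())


def run(rules, order, facts):
    """Single pass over the pre-ordered rules; each rule is evaluated exactly once."""
    values = dict(facts)
    for name in order:
        deps, fn = rules[name]
        values[name] = fn(*(values[dep] for dep in deps))
    return values


ORDER = compile_order(RULES)
result = run(RULES, ORDER, {"reading": 80, "age_s": 5})
print(ORDER)
print(result["alert"])
```

The cost of a run is then fixed by the number of rules, instead of depending on how many times an engine has to loop before the facts stop changing.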
The power of fuzzy logic also makes it easier to transfer the expert’s knowledge to a set of rules. Scientio’s fuzzy logic inference engine is entirely capable of handling competing solutions in a rational way. Fuzzy logic makes for very expressive rules: we were amazed how small the set of rules used in the final product was.
Smaller rule sets mean lower maintenance costs and easier approvals.
XmlMiner can tell you when a set of input data is outside of the circumstances the rule set was created to handle. This means it’s easy to flag exceptional circumstances for human supervision or monitoring.
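For readers who haven't met fuzzy rules, here is a minimal sketch of both points: two competing rules blended by how strongly each applies, and an input that no rule covers being flagged rather than guessed at. The fuzzy sets, thresholds and risk values are invented for illustration, and the weighted-average scheme is a textbook simplification, not XmlMiner's inference engine.

```python
def triangular(x, left, peak, right):
    """Membership in a triangular fuzzy set: 0 outside (left, right), 1 at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical fuzzy sets over a sensor reading on a 0-100 scale.
NORMAL = (20.0, 40.0, 60.0)
ELEVATED = (50.0, 70.0, 90.0)

# "if reading is NORMAL then risk is 0.1" and "if reading is ELEVATED then risk is 0.9"
RULES = [(NORMAL, 0.1), (ELEVATED, 0.9)]


def infer(reading):
    """Blend the rule outputs by how strongly each rule applies.

    Returns None when no rule applies at all, i.e. the reading lies outside
    the circumstances the rule set was written for.
    """
    weighted = [(triangular(reading, *fuzzy_set), risk) for fuzzy_set, risk in RULES]
    total = sum(weight for weight, _ in weighted)
    if total == 0.0:
        return None
    return sum(weight * risk for weight, risk in weighted) / total


for reading in (40, 55, 70, 10):
    risk = infer(reading)
    print(reading, "needs human review" if risk is None else round(risk, 2))
```

A reading of 55 is partly "normal" and partly "elevated", so both rules contribute; a reading of 10 belongs to neither set, which is exactly the case you would route to a person.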
So, if you are trying to make that jump from a human expert based process to an automatic, semi-automatic or human supervised process, contact Scientio. We’d be happy to hear from you.