The client – Minerva Intelligence
This project was a six-month contract with Minerva Intelligence Inc. (“Minerva”), of Vancouver, BC, to act as the company’s Chief Semantics Officer (CSO) while the company was preparing to go public with its products and services and to list on the TSX Venture Exchange (TSX-V). The contract ran from April to September 2018, and the company listed in Oct. 2019. Minerva provides Artificial Intelligence (AI) solutions mainly for the Mining & Minerals industry. Its AI platform is used in this industry, and in the Earth Sciences generally, because the available data in these fields is insufficient for machine learning applications: machine learning only works where sufficient, and growing, quantities of data are fed into the system. Having a human do this kind of input and analysis by hand is impracticable, even unfeasible. A computer can do it far faster and without human error.
At the time this project took place, Minerva’s CEO, Clinton Smyth (now the Chief Technology Officer), and his fellow director Prof. David Poole of the Dept. of Computer Science at the University of British Columbia, had been developing their AI software for at least two decades. By 2018, with world-wide standardization of computer languages (through the W3C) and the predominance of Internet-based data repositories, AI had become the hottest thing in computing.
Chief Semantics Officer role
Marthe Bijman was given this contract because of her M.A. in Linguistics, her background of working in the Mining industry, and her exposure to Geology and IT systems. Smyth commissioned her to produce a small and complicated part of the earliest foundational substructure of the Minerva products, which was in fact the very beginning of some of them. These products have since been refined to the point that Minerva’s potential clients are presented with a “Black Box” whose functionality and structure they neither need nor want to know, their only interest being the results of the queries that Minerva’s AI can produce.
Minerva’s product suites in 2020 are the TERRA Mining AI Suite and the GAIA Natural Hazard AI Suite. What these products have in common is that they are built using the same process for development, and that they address the same type of problems with domain-specific data: standardization, accessibility, interoperability and classification.
People joke when they say that when they join a new company, they have to learn the “company-speak” – but they are not wrong. Each business has its own “language” and its own performance data and Intellectual Properties (IP). How this information is stored, used and shared is a key indicator of business maturity and success.
What actually happens in machine learning
Minerva’s system combines human domain expertise (for example, mineral exploration or landslides) with information from public and private databases in a computer reasoning system. Domain knowledge is knowledge of a specific, specialized discipline or field, in contrast to general knowledge, or domain-independent knowledge. It’s the specialized information that people learn about a job. For a doctor, for instance, it could be anatomy, for an epidemiologist, it could be virology.
The purpose of harvesting this information is to use computers to carry out the complicated locating, sorting, comparing, argumentation, authentication and ranking of information, because computers are faster and more thorough than human beings. The more information is fed into a machine learning program, the better it works. Moreover, computers do not use vague, personalized, random, colloquialism-littered, inference-riddled, plain wrong language. Humans do, even when they talk business. But that causes difficulties with converting human language into machine language.
Why do businesses need taxonomies?
A key element of the effective operation of a business is the domain knowledge base and models that it uses. Thus, an engineering firm would have a domain such as wastewater treatment, which would constitute all their information about wastewater treatment – all the terms, phrases, models, calculations, plans, processes, diagrams, standards, client specs etc. pertaining to what they do in wastewater treatment. In the old days we used to call this the company’s “IP”. If, for instance, a company stores its information on a SharePoint site, and a user searches the database for, let’s say, phosphate removal, up should pop every available document with that term in it. The problem is that in most cases, this information is neither electronic, nor on an internet platform, nor standardized nor ordered. Mostly it’s a stack of paper plans sitting in a store room or a bunch of electronic folders on someone’s PC.
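To make the search example concrete, here is a minimal sketch in Python of the kind of lookup described above: every document is indexed by the words it contains, and a search returns the documents that contain all the words of the query. The file names and texts are invented for illustration, and this is not how SharePoint or Minerva's software works internally.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in text.lower().split():
            index[word.strip(".,;:()")].add(name)
    return index

def search(index, phrase):
    """Return the documents that contain every word of the phrase."""
    words = phrase.lower().split()
    hits = [index.get(word, set()) for word in words]
    return sorted(set.intersection(*hits)) if hits else []

# Hypothetical documents standing in for a company's knowledge base.
documents = {
    "plant_design.txt": "Phosphate removal stage for the wastewater treatment plant.",
    "client_spec.txt": "Client specification for sludge handling.",
    "lab_notes.txt": "Notes on removal of phosphate by chemical precipitation.",
}

index = build_index(documents)
print(search(index, "phosphate removal"))
```

A search like this only works once the documents are electronic and in one place, which is exactly the condition that most company information does not meet.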
Today, the information that a company owns really is money. Companies like Minerva and PoolParty get revenue from building semantic suites for businesses that let the businesses “implement semantic search and build applications that understand user intent, making information easy and quick to find.” Any business worth its salt should have a proper semantic suite of all its domain data. It should be web-based, not based in someone’s brain.
The role of a CSO
The project allocated to Marthe Bijman as Minerva’s temporary CSO turned out to be a series of small, intricate steps at the start of an extremely long process: converting masses of information written in human language into a machine language (code, in other words) so that it can be imported into a software suite and then filtered and manipulated to produce answers. In short, it was creating the basis for a web-based semantic network.
CSO is a bit of a rare title. Look it up and you will find that it is someone who deals with the Semantic Web.
The Semantic Web, as Tim Berners-Lee put it way back in 2001, is “…an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation.” It means that the words on the World Wide Web are given specific meanings, particular contexts, and positions in classification systems. And in order for the words to be findable on the Web, they need to go through a series of transformations.
For example, when you look up, say, “Granny Smith apple” on Google, you will get about 50,400,000 results (!) in a few seconds. The best-known factor in what you see first, and see next, etc., is Google’s PageRank algorithm, which ranks a page largely by how many other pages link to it. This means that what you see first is not necessarily the most important information, but the most linked-to information. So, businesses, governments and organizations who want to organize and access their information need to develop their own Semantic Webs. Which is a terrible, brain-breaking job that takes a heck of a long time.
This is also why an AI company working with the Semantic Web needs a CSO, Semantics being the study of meaning in language. An example of an AI company that explains its semantic suite services quite well is Vienna, Austria-based PoolParty.
Semantics and Syntax
To create a Semantic Suite you have to know more than Semantics. The focus is actually Syntax, since the process starts with parsing, which is the breaking down of sentences into functional units: verbs, nouns, pronouns, and so on. The “values”, as these words are called, are then functionally reorganized and classified to create taxonomies, and the taxonomies are then combined to form ontologies. The ontologies are then converted into the Web Ontology Language (OWL). The OWL data is then loaded into an ontology editor, in this case the Java-based, open source Protégé. From that stage on, the data is in the form of code and in the hands of Minerva’s geek squad. The data was converted from one language into others in roughly these stages: plain words, parsed values, taxonomies, ontologies, OWL, and finally Protégé.
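The later stages of that conversion, from taxonomy to OWL, can be sketched in a few lines of Python. This is a minimal illustration and not Minerva's actual workflow: the namespace `http://example.org/landuse#`, the function names and the sample labels are all assumptions made up for the example, and the labels are placeholders rather than entries copied from any real codelist. The output uses Turtle, one of the standard text syntaxes for OWL.

```python
BASE = "http://example.org/landuse#"  # hypothetical namespace for this sketch

def to_identifier(label):
    """Turn a human label such as 'Primary production' into 'PrimaryProduction'."""
    return "".join(part.capitalize() for part in label.split())

def taxonomy_to_turtle(taxonomy):
    """Render {label: parent label or None} as OWL classes in Turtle syntax.

    Labels are assumed to contain no double quotes or backslashes.
    """
    lines = [
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        f"@prefix : <{BASE}> .",
        "",
    ]
    for label, parent in taxonomy.items():
        statement = f':{to_identifier(label)} a owl:Class ;\n    rdfs:label "{label}"'
        if parent is not None:
            # The hierarchy is carried by rdfs:subClassOf links.
            statement += f" ;\n    rdfs:subClassOf :{to_identifier(parent)}"
        lines.append(statement + " .")
    return "\n".join(lines)

# Illustrative placeholder taxonomy: each value points to its parent class.
sample = {
    "Primary production": None,
    "Agriculture": "Primary production",
    "Forestry": "Primary production",
}

turtle = taxonomy_to_turtle(sample)
print(turtle)
```

Saved to a `.ttl` file, output in this form should be readable by ontology editors that accept Turtle, which is the point at which words have become code.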
The project was specifically started to provide proof of concept and document the workflow of getting the domain-specific data from A – just words – to Z – computer code. The added complication was that Smyth wanted the data not only to be in an internationally accepted format, but also “Aristotelianized”. This obscure term means that the data is hierarchically organized. Therefore, if someone uses Minerva’s program to find a particular piece of information, the results would be organized and prioritized in terms of importance, relevance or attribution and correctly associated with related information.
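What hierarchical organization buys you can be shown with a small Python sketch. It assumes the simplest possible reading of the idea, namely that every concept has exactly one broader parent class. The concept names and the three helper functions are hypothetical and only serve to illustrate how results can be related to one another and ordered.

```python
# Illustrative hierarchy: each concept maps to its broader parent class.
PARENT = {
    "Land use": None,
    "Primary production": "Land use",
    "Agriculture": "Primary production",
    "Forestry": "Primary production",
    "Residential use": "Land use",
}

def path_to_root(concept, parent=PARENT):
    """List the concept followed by each broader class up to the root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = parent[concept]
    return path

def siblings(concept, parent=PARENT):
    """Concepts that share the same immediate parent class."""
    return sorted(
        c for c, p in parent.items() if p == parent[concept] and c != concept
    )

def rank_by_specificity(concepts, parent=PARENT):
    """Order concepts from most specific (deepest) to most general."""
    return sorted(concepts, key=lambda c: (-len(path_to_root(c, parent)), c))

print(path_to_root("Agriculture"))
print(siblings("Agriculture"))
print(rank_by_specificity(["Land use", "Forestry", "Primary production"]))
```

With the hierarchy in place, a query for one concept can also return its broader classes and its related neighbours, and a set of results can be ordered by how specific each one is.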
The specific project – HILUCS Land Use
The data that was used in the proof of concept exercise was the European HILUCS Land Use database. The European Commission uses INSPIRE data specifications for the codelist of the Hierarchical INSPIRE Land Use Classification System, HILUCS. The INSPIRE (Infrastructure for Spatial Information in Europe) Directive aims to create a European Union spatial data infrastructure for the purposes of EU environmental policies and policies or activities which may have an impact on the environment.
Smyth selected HILUCS as a starting point because it was freely available, and because Canada is lagging somewhat behind Europe in the standardization of spatial information. Therefore, Minerva looked to European organizations for cooperation. INSPIRE includes thematic domains such as Statistics & Health, Land Cover & Use, Environmental Monitoring & Observations, Facilities & Utilities. Land Cover and Use includes domain specific terms and definitions about all aspects from Land Cover (what is on top of the ground) to Land Use (what the things on the land are used for).
The Land Use Taxonomy eventually produced 803 unique values (or concepts) distributed amongst 98 classes of values. It is a relatively small taxonomy compared to, for instance, mineral resources, though it took something like 640 man-hours to create. Obviously there were many stops and starts and returns to the starting point, since no-one on the team knew how to do all of it. Each had some idea of how to do parts of it, but some could explain only in Mathematics terms, others only in Programming language, and others only in Geography-speak.
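Figures like "unique values" and "classes" come from a simple tally, sketched below in Python. The rows are invented placeholders and the `summarize` function is hypothetical. The sketch only shows how duplicates collapse once values are normalized, which is one of the reasons such counts keep shifting while a taxonomy is being built.

```python
from collections import defaultdict

def summarize(rows):
    """Return (number of unique values, number of classes, values per class)."""
    by_class = defaultdict(set)
    for class_name, value in rows:
        # Normalize so that 'Agriculture' and 'agriculture ' count once.
        by_class[class_name].add(value.strip().lower())
    unique_values = set().union(*by_class.values())
    per_class = {name: len(values) for name, values in by_class.items()}
    return len(unique_values), len(by_class), per_class

# Illustrative (class, value) rows, not taken from the real taxonomy.
rows = [
    ("Primary production", "Agriculture"),
    ("Primary production", "Forestry"),
    ("Primary production", "agriculture "),  # duplicate after normalization
    ("Residential use", "Permanent residential use"),
]

value_count, class_count, per_class = summarize(rows)
print(value_count, class_count, per_class)
```

Applied to the figures quoted above, 803 values in 98 classes works out to an average of roughly eight values per class.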
Software applications that are as raw and as early in development as this, are very difficult to describe. One of the responsibilities of Marthe Bijman was to produce promotional materials for these projects. This caused much disagreement between the staff members since the very words used were problematic – the terms were either contentious, confusing, vague or made up since what they were describing had not existed ‘til then. The brochures and presentations were eventually not used, but probably moved the company a little way forward in its branding and positioning in that it gave the team numerous “scratch pads” to fiddle with.
The experience of the project development process was much like product development as depicted in the TV show “Silicon Valley”: at times surreal, mostly confusing, and very intense and difficult. The video, below, by Red Pennant, was an early attempt at describing what Minerva does.
Two years on, Minerva’s brand in all its iterations is much more sophisticated and visually appealing.
Marthe Bijman did, during the final days of her contract, produce a Land Use Taxonomy which could be programmed into Protégé, and which did deliver the expected results when the program was run for the first time. In short, it worked, and it could be replicated (with difficulty) for other knowledge bases and domains. Smyth and his team used the lessons learned about the conversion of human language into machine language to present their theories to international organizations and to refine their products and services. The process, in this instance, was probably more useful than the product.
Smyth has grown his business at an impressive pace, securing investors, making new appointments, listing on the TSX-V and finding ways to explain Minerva’s esoteric products with words that ordinary humans can understand.