Unless you're a member of an isolated ancient tribe living under one of the six remaining trees in what used to be the Amazon rainforest, you have almost certainly heard the term "Machine Learning" floating past within the last few years.
This blog post will explain the main ideas behind machine learning and try to show you why you should care about any of them in the first place.
[update] The example problem used throughout this blog post has been changed in response to readers' feedback to which we couldn't turn a blind eye.
A computer is capable of remarkable feats. In fact, personally, I'm convinced that if humanity doesn't eradicate itself prematurely, there won't be anything left humans can do that can't be done much better, faster and cheaper by a suitably designed and programmed computer (or a network of them).
Even though a computer can do just about anything, making it do what you want it to do can be very hard indeed. Depending on the problem at hand a programmer needs to be rather ingenious and resourceful.
Machine learning is mainly concerned with clever ways of programming a computer - although for most (if not all) techniques, optimised hardware can significantly improve performance.
To highlight the value of machine learning properly, an example should help. I'm going with one any cat owner (slave) should be familiar with. So, let's imagine you've got a cat, called Ada.
One fine Sunday afternoon, you're reading a novel while Ada's sitting on a bookshelf in front of you, staring idly into the middle distance. She hasn't moved for at least ten minutes. Next to her on the shelf is a small stone you like and have put there as an ornament.
Now you notice that some time during the last few seconds, she has moved her paw to a new position. While it was previously resting delicately on the shelf next to its sibling, it's now suddenly hovering above the shelf, just behind the stone.
You've only got time to yell 'nooooo' before she gently pats the stone on the back a few times, shoving it from the shelf onto the ground. She then returns to her earlier peaceful posture - except that this time you hear a soft, contented purr, whereas she had been completely silent before.
What is Ada trying to tell you, exactly? What does that purr mean?
A multitude of possible answers present themselves to you:
Now which is it? You just don't know, and I certainly don't either.
However, computers are doing all sorts of stuff you wouldn't know how to do yourself these days, like parsing PDF files and running Facebook. So can't anyone build software to solve this problem for you?
Might it be possible to write software that understands anything your cat is actually trying to have you come up with all by yourself based on the sounds she makes and her behaviour?
To state the question more clearly, as you wish she would:
Is it possible to create software that can interpret and decode anything your cat does, translating it to a message that actually tells us what she wants from us?
I'll call this the 'Ada Interpretation Problem' or 'AIP'. (Which might also mean 'Artificial Intelligence Problem'. This must, no doubt, have some sort of cosmic significance).
In order to write software that can solve the Ada Interpretation Problem, an old-school programmer would have to come up with all the rules (in perfect detail!) that will take the problem specification of any sound your cat makes, and what she was doing while she made it, as input and end up with what we need to know: a sentence communicating the actual message she is sending.
Solving the problem comes down to correctly reconciling a host of very complex aspects:
These are some aspects of the problem - I will have overlooked some, cause who knows, really? - in which we hope lie the clues that should suffice to decode the actual message.
That means that these aspects should be included, in some way or other, as inputs to any software program (or app, as all those hip youngsters call it nowadays) that is tasked with doing the translation for you, in order to provide it with enough information to do its job.
Now, the programmer's job is to map any specific problem instance - an occurrence, a specific sound your cat made plus all the circumstantial details of tone, volume and body language, like the 'stone shove' example above - to a solution:
In short, the programmer would have to understand your cat perfectly in order to teach a computer to reach the same 'understanding' if he wants to succeed using a classic approach.
The rules for interpreting sound volume alone would have to be dauntingly complex, let alone those for correctly evaluating a series of actions like the one in our example in combination with purr volume and the exact total time the stone has been on that shelf.
Let's say our programmer assigned a number to each sound category he could recognise, and then built lots of rules based on that assignment. What happens the next time Ada makes a weird sound nobody's ever heard before (and she will, rest assured), for which he consequently doesn't have a number yet?
Also, the programmer is bound to have overlooked some aspect of Ada's behaviour which is of paramount importance in decoding her messages. Not having incorporated this aspect as input to his algorithms will prevent them from ever producing correct answers to start with - they just won't have the information needed to arrive at one.
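To make that failure mode a bit more tangible, here is a minimal sketch of what such a hand-written rule book might look like. Every sound category, behaviour name and message in it is invented for illustration; the only point is to show what happens when Ada does something the programmer never anticipated.

```python
# Hypothetical hand-written rules: every name and message here is invented
# for illustration only.
SOUND_CODES = {"purr": 1, "meow": 2, "hiss": 3}

RULES = {
    (1, "stone_shoved"): "I am pleased with my work.",
    (2, "sitting_by_door"): "Open the door.",
    (3, "tail_puffed"): "Back off.",
}


def interpret(sound, behaviour):
    """Look up a message using only the rules the programmer thought of."""
    code = SOUND_CODES.get(sound)
    if code is None:
        # A sound nobody anticipated: the rule book has nothing to say.
        return "unknown sound, no rule available"
    return RULES.get((code, behaviour), "known sound, but no matching rule")


print(interpret("purr", "stone_shoved"))   # a case the programmer foresaw
print(interpret("chirp", "stone_shoved"))  # a brand new sound
print(interpret("purr", "lying_on_back"))  # a known sound in a new situation
```

Only the first call gets a real answer. Covering the other two means writing yet more rules by hand, and there is no end to that list.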
It should be obvious, then, that a classic approach to solving the Ada Interpretation Problem will fail miserably. If you don't know how to program you wouldn't know:
And if you do know how to program, you wouldn't know any of those things either. The rules involved are just too complex.
A programmer has no chance at all of capturing every hard rule, soft rule, nuance and exception in a formal playbook that can be formulated and specified using computer code resulting in a program that can figure out The Hidden Message for any Ada Interpretation Problem instance one cares to throw at it.
However, there is a slight glimmer of hope on the horizon and its name is Machine Learning. A programmer has no chance at all of keeping track of all the rules and complexities involved in solving your problem, but a computer might.
The crux of programming software to solve a problem is nothing more than mapping questions/problems (input) to answers/solutions (output). In figuring out the steps in between lies the challenge and for many problems a programmer can't be expected to rise to it.
In order to solve those tough problems, here's a clever bit:
"Why not have the computer figure out how that mapping should be done?"
In other words: instead of telling the computer exactly how to solve a problem, why don't we tell the computer to figure it out for us? Why don't we just tell the thing about our problems and have it deal with them while we go and complain about them to a bartender?
That is what Machine Learning is all about. In general, it comes down to giving a computer lots of data about the problem at hand. Feed it everything you know (with some nuance, see Feature selection and extraction) in a form computers can work with, and adopt some algorithm that goes to town on that data and figures out the mapping from problem instances to solutions for you.
Why does a computer have better chances of coming up with correct mapping rules? Because it is much faster than you, has a perfect memory, and doesn't get bored or a headache as easily.
The first step in having a computer solve your problems is to offer said problems to it in a suitable form.
Problem instances can be listed as a table. Each row of the table is a single problem instance and each column contains one of the aspects of your problem expressed as numbers, so an algorithm can do calculations with it - it can't do that on a bit of text, for example. These columns are called features of the problem domain. This table (or matrix) representation is very convenient and efficient to work with for most machine learning algorithms.
For the Ada Interpretation Problem this might look something like the following:
An important point to notice here is that pretty much everything you, as a human observer, know about how your cat behaved within the context of each problem instance is also in the data, since we don't make any attempt to categorise sounds or facial expressions ourselves. We didn't just pick a couple of facial expressions we could recognise and represent those as numbers.
Instead we use a recording of the complete situation:
The sounds Ada made, and what she looked like or was doing at the time, are in this list in the form of pixels (photo or video) and spectrograms (basically a graphical representation of sound, a 'photo' of a sound fragment, which are pixels as well).
This way of describing a problem instance digitally makes sure we don't lose subtle information embedded in the sounds or posture of Ada - aspects we would have missed using a classic approach of choosing a few characteristics we could recognise, as a human being, and assigning numbers to them.
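As a rough sketch of how recordings end up as rows in such a table: the tiny 'photo' and 'spectrogram' grids below are made-up numbers (real ones would hold thousands of values), and the helper names are hypothetical. Each recording is simply unrolled into one long list of numbers.

```python
def flatten(grid):
    """Unroll a 2D grid of numbers into one flat list, row by row."""
    return [value for row in grid for value in row]


def to_feature_row(photo, spectrogram):
    """One problem instance = photo pixels followed by spectrogram pixels."""
    return flatten(photo) + flatten(spectrogram)


# Two made-up recordings: a 2x2 greyscale photo and a 2x3 spectrogram each.
recordings = [
    ([[0, 255], [128, 64]], [[0.1, 0.9, 0.3], [0.0, 0.4, 0.2]]),
    ([[12, 240], [100, 70]], [[0.2, 0.8, 0.1], [0.1, 0.5, 0.3]]),
]

# The table: one row per problem instance, one column per feature.
table = [to_feature_row(photo, spec) for photo, spec in recordings]

for row in table:
    print(row)
```

Every row has the same number of columns, which is what most algorithms expect from their input.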
You might argue a 'classical' programmer might use these digital representations as well rather than trying to recognise facial expressions himself, but this would have been even harder to do. Could you list the exact steps you'd take - by looking at individual pixels! - to recognise a cat in a photo?
An important thing to watch out for is irrelevant data. An algorithm that has to spend a lot of time trying to take into account factors that don't have any bearing on the problem at all will just make progress slower and less efficient.
In the context of your Ada Interpretation Problem: after yet another destructive event Ada has made happen especially for you, you might wonder whether or not an airplane to Barbados will take off within the next few days. But including that information will not help anyone (human or computer) figure out what your cat was actually trying to tell you. It's irrelevant data in the quest for understanding.
The more features you feed to an algorithm the more work it has. Feeding an algorithm too many features to chew might bring it down to its knees or flat on its belly, face in the sand on a beautiful beach in what will hopefully turn out to be Barbados (the curse of dimensionality).
So it is important to explore the data you have and make an effort to give only relevant information to your algorithm, in other words: only those features that have an actual contribution to make in determining the solution to your problem. Barbados: irrelevant. Stone shattered: hardly relevant. Third whisker from top right slightly elevated: almost certainly very relevant indeed.
Features can be created as well, for example: sometimes instead of including two features (= two numbers) you have gathered to describe your problem you might include their ratio (= one number) instead and leave the separate numbers out.
Often, some features carry the same - or about the same - information and including them all will not improve your algorithm's solution while discarding all but one will yield equivalent (or better, even!) results.
Identifying good features - finding a set of as few features as possible carrying enough relevant information to discriminate problem instances - is usually the main challenge of a successful machine learning project.
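Two of the ideas above, creating a ratio feature and discarding features that carry the same information, can be sketched in a few lines. This is only one simple way of going about it: the feature names and numbers are invented, and 'carrying the same information' is approximated here by a high Pearson correlation, which is an assumption of this sketch rather than a general rule.

```python
from math import sqrt


def ratio_feature(numerators, denominators):
    """Replace two features by their ratio (one number instead of two)."""
    return [n / d for n, d in zip(numerators, denominators)]


def pearson(xs, ys):
    """Pearson correlation; assumes neither column is constant."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def drop_redundant(columns, threshold=0.95):
    """Keep a feature only if it is not nearly a copy of one already kept."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


# Made-up features for four problem instances.
columns = {
    "purr_seconds": [2.0, 4.0, 6.0, 8.0],
    "purr_milliseconds": [2000.0, 4000.0, 6000.0, 8000.0],  # same information
    "stone_weight_grams": [50.0, 20.0, 80.0, 30.0],
}

print(list(drop_redundant(columns)))
print(ratio_feature([6.0, 3.0], [2.0, 3.0]))
```

The milliseconds column is dropped because it tells the algorithm nothing the seconds column didn't already.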
Many algorithms work better if all features are expressed as numbers within similar ranges. If one feature is a number between 0 and 1, and a second feature is a number between -1,000,000 and 1,000,000, many algorithms will work a lot more slowly (or not at all) because of the imbalance in the orders of magnitude of these features.
To mitigate this problem, feature scaling is applied to each feature that requires it. This is a procedure that maps the range of values for a feature to a 'more reasonable' interval, like the interval [0, 1], so that each feature is always a number within a similar range to all the others.
This ensures each feature throws about the same weight in a machine learning algorithm's scales and basically avoids one moderately relevant feature which is represented by comparatively large values drowning out much more relevant features with smaller ranges.
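One common way to do this is min-max scaling, sketched below. It is not the only scaling method around, and the example values are made up.

```python
def min_max_scale(values):
    """Map a feature's values linearly onto the interval [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant feature has no range to stretch; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


big_feature = [-1_000_000, 0, 250_000, 1_000_000]
print(min_max_scale(big_feature))
```

After scaling, this feature lives in the same range as one that was between 0 and 1 to begin with, so neither drowns out the other purely because of its size.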
For every problem, there is a large set of machine learning algorithms that can be applied to it. Some algorithms might fit a particular problem well, and others might not be suited at all.
When solving a particular problem:
There are three major classes of machine learning algorithms that can be discerned based on what the computer knows about possible solutions beforehand:
A supervised learning algorithm is supervised in the sense that it is given examples of solved problem instances it should learn from, where a human expert has labelled a number of cases with a correct answer. The algorithm is learning in the sense that it is becoming increasingly better at performing a given task by looking at examples of what it is supposed to achieve.
What would a supervised Machine Learning algorithm do then, to make sense of your Ada Interpretation Problem's features? It would train a model for you. (Not the kind of model that earns money from having light bounce off of herself and through a lens, sometimes on a beach in Barbados.)
A model in this context refers to a rule or set of rules which takes a problem instance as input and comes up with an answer. This model can be arbitrarily complex and represents the very rules the programmer couldn't figure out himself.
The model is the solution for your problem in general, which gives answers to problem instances specifically.
In other words, the model answers specific cases from your problem domain:
Each algorithm starts with a 'raw' model, a blueprint which yields 'solutions' that initially are not even close to what they should be. The model has a number of nuts and bolts (model parameters) that determine the answers the model gives.
It's these nuts and bolts the machine learning algorithm will tweak, molding the model until it gives reasonably accurate answers (what it deems reasonable is determined by the programmer).
A machine learning algorithm sculpts a model that fits your data.
It does this by looking at each example you have given it (called the training set) and trying it out on the model. It will then compare the answer the model gives to the correct answer, and tweak the nuts and bolts of the model in order to make it produce a slightly better answer than it did.
This is repeated for each example in the training set (sometimes multiple times over), and in small steps the algorithm will arrive at a model that gives good answers to all of the training examples.
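The loop described above can be sketched with about the simplest model imaginable: a straight line with two 'nuts and bolts', a weight and a bias. The training set, the learning rate and the number of passes are all assumptions picked for this illustration; real models have many more parameters, but the tweak-compare-repeat rhythm is the same.

```python
# Hypothetical training set: each pair is (input, correct answer).
# The answers happen to follow y = 2x + 1, but the algorithm is not told that.
training_set = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]


def train(examples, learning_rate=0.05, epochs=1000):
    """Fit the model `answer = weight * x + bias` with small corrective steps."""
    weight, bias = 0.0, 0.0  # the 'raw' model: its answers start out wrong
    for _ in range(epochs):  # go over the training set many times
        for x, correct in examples:
            error = (weight * x + bias) - correct  # compare with correct answer
            weight -= learning_rate * error * x    # nudge each parameter a bit
            bias -= learning_rate * error
    return weight, bias


weight, bias = train(training_set)
print(round(weight, 3), round(bias, 3))
```

The weight and bias end up very close to 2 and 1, even though nobody ever told the program that rule.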
The idea is now that the model that was trained on the training set and gives good answers to the examples it was given will be generally applicable to the problem, meaning it will also give good answers to examples it was not trained on.
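Whether that hope comes true is usually checked by keeping some labelled examples apart and only using them for testing. Below is a small sketch of that idea. It uses a nearest-neighbour rule as a stand-in model because it fits in a few lines, and the features (say, purr volume and tail speed) and labels are entirely made up.

```python
def nearest_neighbour_predict(train, point):
    """Answer with the label of the training example closest to `point`."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(train, key=lambda example: squared_distance(example[0], point))
    return label


def accuracy(train, held_out):
    """Fraction of held-out examples the model answers correctly."""
    correct = sum(
        nearest_neighbour_predict(train, features) == label
        for features, label in held_out
    )
    return correct / len(held_out)


# Invented examples: (purr volume, tail speed) -> mood.
training_set = [
    ((0.9, 0.1), "content"), ((0.8, 0.2), "content"),
    ((0.1, 0.9), "annoyed"), ((0.2, 0.8), "annoyed"),
]
# Examples the model has never seen.
test_set = [((0.85, 0.15), "content"), ((0.15, 0.7), "annoyed")]

print(accuracy(training_set, test_set))
```

A high score on examples the model never saw is the best evidence we have that it learned something general rather than memorising its training set.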
Supervised learning is mostly used for teaching a computer to perform tasks of which we have lots of examples to which we know the answers we're expecting, but we can't really (or at all) say which steps to take to reach those correct answers.
This type of algorithm is most prevalent and you have most likely made use of its applications on many occasions.
Some common applications include:
Unsupervised algorithms are given no answers at all, and are supposed to find structure in data all by themselves. An algorithm like this is fed lots of examples without knowing what any answer is supposed to be.
These types of algorithms usually come up with a new representation of the data they were given, and point out patterns, groups, similarities or deviations.
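A classic example of finding groups without being given any answers is k-means clustering. The sketch below is a stripped-down version for a single feature and two groups; starting the two centres at the smallest and largest value is a simplification made here to keep the example deterministic, and the purr durations are invented.

```python
def k_means_1d(values, iterations=10):
    """Split numbers into two groups around two centres (k-means, k = 2)."""
    centres = [min(values), max(values)]  # simplistic, deterministic start
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            # Assign each value to the centre it is closest to.
            nearest = min(range(2), key=lambda i: abs(v - centres[i]))
            groups[nearest].append(v)
        # Move each centre to the average of its group (if it has one).
        centres = [
            sum(g) / len(g) if g else c for g, c in zip(groups, centres)
        ]
    return centres, groups


# Made-up purr durations in seconds; nobody labelled them.
purr_durations = [1.0, 1.2, 0.8, 9.0, 9.5, 10.5]
centres, groups = k_means_1d(purr_durations)
print(groups)
```

The algorithm discovers that there seem to be short purrs and long purrs. What either of them means is still anyone's guess.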
This is a bit of a special beast. Reinforcement learning is a method for training agents (think of robots, or software bots) to perform a task by having them modify their actions based on a reward signal coming from their environment.
Basically, an agent will have an internal model (called a policy) selecting actions to perform based on what it can perceive from its environment. It starts acting according to that model, and will be rewarded if it does well at the task at hand and possibly punished if it fails.
That reward (or lack of it) will be used to change the policy for selecting actions in a way that will optimise the reward signal. The policy itself can be extremely complex and involve an arbitrary number of machine learning sub-problems on its own.
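Here is a deliberately tiny sketch of that reward loop. The environment is a made-up table of rewards, the policy is nothing more than a running average per action, and the agent uses an epsilon-greedy strategy (mostly pick the best-looking action, occasionally try a random one). All names and numbers are assumptions for the sake of illustration; real reinforcement learning problems also involve states and delayed rewards, which this sketch leaves out.

```python
import random

# Hypothetical environment: the reward Ada hands out for each action.
# These actions and numbers are invented for the sake of the example.
REWARDS = {"ignore": -1.0, "open_door": 0.0, "offer_food": 1.0}


def learn_policy(steps=1000, epsilon=0.2, seed=0):
    """Epsilon-greedy learning of which action earns the highest reward."""
    rng = random.Random(seed)
    actions = list(REWARDS)
    value = {a: 0.0 for a in actions}  # current estimate of each action's reward
    count = {a: 0 for a in actions}
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(actions)          # explore: try something random
        else:
            action = max(actions, key=value.get)  # exploit: best guess so far
        reward = REWARDS[action]                  # the environment responds
        count[action] += 1
        # Move the estimate towards the observed reward (running average).
        value[action] += (reward - value[action]) / count[action]
    return value


value = learn_policy()
best_action = max(value, key=value.get)
print(best_action)
```

The agent is never told which action is best. It finds out by trying things and keeping score.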
This is a very popular research topic. It's been used to have software master a whole host of Atari games, which it figured out how to play by being fed the game screen's pixels (environment) and a number indicating its current score (reward signal).
Currently, attempts are made to crack the problem of playing StarCraft II - the famous RTS game - on a competitive level (and you can help).
Reinforcement learning has found its way into practical applications as well, for example:
Machine learning is in its infancy - about to get its first tooth or so - and papers detailing new advances are published all the time. Many of these focus on a specific subfield of machine learning called Deep Learning, which I believe deserves special mention in this overview article.
To give an adequate overview of what deep learning entails would require a lengthy separate post. Suffice it to say it involves building what are called neural networks, and it is taking center stage in most of machine learning.
Actual neurons found, hopefully, in abundance in your own brain and nervous system are rather complicated cells. Your cat's cells are no more complicated than yours; most differences between your brain and your cat's are caused by the number of cells and the connections between those cells.
An artificial neural network uses a thoroughly stripped down and stylised version of its organic counterpart as a building block to create networks in which these units are connected to each other.
Such an artificial neuron looks something like the following:
When combining multiple of these neurons by connecting their inputs to the outputs of others, and vice versa, the result is called a neural network.
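In code, a common textbook version of such an artificial neuron is a weighted sum of its inputs pushed through a squashing function. The sketch below uses the sigmoid function for that, and wires three neurons into a very small network. The weights and biases are arbitrary numbers chosen for illustration, not values anyone trained.

```python
from math import exp


def sigmoid(x):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-x))


def neuron(inputs, weights, bias):
    """Weigh each input, add everything up, then squash the total."""
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return sigmoid(total)


def tiny_network(inputs):
    """Two hidden neurons feeding one output neuron (arbitrary weights)."""
    hidden = [
        neuron(inputs, [0.5, -0.5], 0.0),
        neuron(inputs, [-1.0, 1.0], 0.5),
    ]
    # The outputs of the hidden neurons are the inputs of the output neuron.
    return neuron(hidden, [1.0, 1.0], -1.0)


print(neuron([0.0, 0.0], [1.0, 1.0], 0.0))
print(tiny_network([1.0, 0.0]))
```

Change a weight and the output changes with it. That is the handle a learning algorithm gets to pull on.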
A typical (smallish) neural network may be structured like this:
Now, usually it's the output of such a network which is of interest to you when you want it to solve problems. What makes a network yield outputs that happen to be solutions to a problem you're interested in?
There are two important ingredients here:
Machine learning algorithms can tweak and tune these properties of a network in order to have them yield solutions to a particular problem by changing input weights of the neurons in the network and/or the network's topology.
We can give a supervised learning algorithm a neural network to play with and some examples of what we want it to achieve, and the algorithm will manipulate the network until it does what we wanted it to do.
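To see that manipulation happen on the smallest possible scale, the sketch below trains a single neuron to behave like a logical AND using the classic perceptron learning rule. The choice of task, the on/off (step) output and the number of passes are assumptions made for this example; networks used in practice are trained with more refined methods.

```python
# Truth table for logical AND: the examples of what we want the neuron to do.
EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]


def fire(inputs, weights, bias):
    """A neuron with an on/off output: 1 if the weighted sum is positive."""
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0


def train_neuron(examples, epochs=20):
    """Perceptron rule: shift the weights whenever the answer is wrong."""
    weights, bias = [0, 0], 0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - fire(inputs, weights, bias)  # -1, 0 or 1
            weights = [w + error * i for w, i in zip(weights, inputs)]
            bias += error
    return weights, bias


weights, bias = train_neuron(EXAMPLES)
for inputs, target in EXAMPLES:
    print(inputs, "->", fire(inputs, weights, bias), "(wanted", target, ")")
```

Nobody typed in the final weights. The algorithm arrived at them by being wrong a few times first.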
And so, neural networks can play the role of machine learning models (mentioned earlier in this post). They've turned out to be stunning Oscar-worthy actors indeed and I hope they'll all get a part in the next Game of Thrones season.
Neural networks have been used in all areas of machine learning - supervised, unsupervised and reinforcement learning - with spectacular results and there is absolutely no indication of our having exhausted their possibilities. This field has been especially successful in having computers perform tasks that previously only humans could bring to a satisfactory conclusion, like recognising objects (including cancer cells!) in images, turning speech into text, ...
What makes these networks so remarkable is that we actually have very little theoretical knowledge of how and why they work, but we do have the practical knowledge to train them in order to perform tasks previously considered impossible to achieve by computers. This is pretty much what makes machine learning so interesting in the first place, as I've mentioned in the first part of this article: we can make computers perform tasks without having to tell them how exactly.
As impressive as results in deep learning have been so far, they are just a few pretty snowflakes residing on the tip of a humongous iceberg. A tiny selection of what this remarkable field has achieved in recent years follows.
You have no doubt noticed Facebook correctly suggesting tags for people in your photos. The same face is recognised across photos using an unsupervised learning approach that detects a cluster of occurrences of the same face, and once a single instance of that face is tagged by a user (supervised learning) Facebook can tag that face everywhere it encounters it (the whole cluster of occurrences).
So basically, with just one human intervention their algorithms are capable of recognising you in any photo any Facebook user may upload. This is sometimes called semi-supervised learning because the amount of human input needed is so tiny.
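The 'tag once, recognise everywhere' step can be sketched as follows. This is not Facebook's actual implementation, which isn't described here; it is a toy illustration that assumes the clustering has already been done, with made-up photo names and cluster numbers.

```python
def propagate_tags(cluster_of, tags):
    """Spread each user-supplied tag to every photo in the same cluster."""
    name_of_cluster = {cluster_of[photo]: name for photo, name in tags.items()}
    return {
        photo: name_of_cluster.get(cluster)  # None if nobody tagged that cluster
        for photo, cluster in cluster_of.items()
    }


# Output of a (hypothetical) unsupervised step: which face cluster each photo is in.
cluster_of = {"beach.jpg": 0, "party.jpg": 0, "office.jpg": 1, "wedding.jpg": 0}
# The single human intervention: one photo tagged by a user.
tags = {"beach.jpg": "Alice"}

suggested = propagate_tags(cluster_of, tags)
print(suggested)
```

One tag covers all three photos in cluster 0, while the untagged cluster stays anonymous until someone labels one of its photos.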
DeepMind is an artificial intelligence research lab owned by Google, which has made some truly fascinating advances in deep learning.
One of their accomplishments has recently made worldwide headlines. Researchers at DeepMind trained an algorithm that beat the world champion at the game of Go, a task far more difficult than winning at chess, for example. This achievement is made all the more impressive because even the best Go players can't really explain many of the moves they make because they rely heavily on 'intuition'. This suggests that at some level, the algorithm must have acquired something akin to what we call intuition (which might be just a word for rules too complex for human consciousness to make sense of) of its own.
This achievement even made the cover of one of the most prestigious science journals on the planet (Nature).
Differentiable Neural Computers
Recently they've also published a new approach that combines neural networks with a memory, allowing them to remember what they have learned and apply it to related problems. Called Differentiable Neural Computers, these models have been trained to navigate a number of transit systems and proved capable of applying that knowledge to get around London using the Underground.
Generative Adversarial Networks (GANs) form an incredibly exciting subfield of deep learning in which two types of neural networks interact, for example:
These are pitched against each other, both networks getting better at their job - locked into very much the same sort of evolutionary arms race that yielded insects that look exactly like poisonous plants, for example.
This approach has yielded astonishing results in which textual descriptions of an object are converted to a photorealistic (but entirely fake!) image of that object:
Another mind-blowing application is able to modify images into related images after having learned a 'make my image look like this' type of rule. What on earth am I talking about? How about this:
This model has learned what winter and summer look like and applies that to unseen photos in order to translate winter images into summer images and vice versa.
A popular app that has GANs at its core is Prisma, which takes any photograph you care to throw at it and turns it into a Van Gogh or a pencil sketch. Basically, the app employs a Deep Neural Network that is trained to recognise a Van Gogh and modifies your image so that network can't distinguish the modified image from a Van Gogh anymore.
Nobody had to sit down and figure out the rules for converting any image, pixel by pixel, into a Van Gogh.
Recently, researchers associated with Adobe (of Photoshop fame) have developed a similar trick to apply the visual style of one photograph to another, allowing you to take, for example, a boring day-time photo of one city and a beautiful night-time photo of a different city, and turn your day-time photo into a similar-looking night-time photo. You have to see it to understand how spectacular this is:
I'm going to leave a host of extra exclamation marks right here to point out how incredible all this is: !!!!!!
Hopefully by now you have some idea of how machine learning might prevail over a classical approach on some types of problems and why it might help you figure out whether or not to open the door for Ada next time she seems to require this of you but you're sure she will definitely refuse to go through.
Sadly, writing software that understands your cat is even harder than you might have guessed. How might each of the three approaches to machine learning listed above be of any help in your quest?
A supervised learning approach would require a list of solved problem instances. The bad news is embodied by the word 'solved', obviously. In order to label any example correctly we would have to ask your cat what she actually meant in that particular case, the answer to which would fail to present itself in any way, shape or form.
With unsupervised learning, we might conceivably be able to discover associations between some body language and certain aspects of Ada's purring, or find hidden patterns in her communication, providing at least some insight into how she operates, but to generate a sentence from this type of information that communicates her actual message seems at the very least A Challenge and possibly impossible.
Reinforcement learning might help you to find the correct response to anything your cat does. Each time she makes a sound or just goes ahead and lies there on her back for a good while, you pick one of the answers that occur to you and act on it, and have your 'environment' either reward or punish you in order to be able to make a better guess next time - hoping that eventually you'll start guessing correctly every time.
You won't understand why, but at least she will think you do. Which at least solves the problem from her perspective, and yours by extension. Also, this is the approach most people owned by a cat live by.
These insights might render you despondent. On top of that, there's the very real possibility that you can't figure out what Ada wants because she is generating insufficient information to make a successful interpretation in the first place, in which case writing any sort of software to do the job in your stead would be an extremely foolish fool's errand.
All this just to say: Don't despair! You're on your own.
Machine Learning is a very powerful trick which humanity's only just started to poke at with a stick. The ideas at its heart are not very new, but only recently has hardware become powerful enough to convert theoretical potential into practical applications that can be sold to people - which is, for the foreseeable future, still the main requirement to make large-scale advances in any industry.
Now, it's here, and not going anywhere. We've only just learned to crawl. In the near future we will be tripping over the carpet, hitting our knees against tables and stumbling into closets, but in time we will be running towards the horizon at top speed with the wind of revolution at our backs.
Science fiction is turning into reality at a staggering rate and the sky will not be a limit whatsoever.