Sunday, August 21, 2016

Machine Learning and the Importance of 'Cat Face'

From the London Review of Books:

The Concept of ‘Cat Face’
Over the course of a week in March, Lee Sedol, the world’s best player of Go, played a series of five games against a computer program. The series, which the program AlphaGo won 4-1, took place in Seoul, while tens of millions watched live on internet feeds. Go, usually considered the most intellectually demanding of board games, originated in China but developed into its current form in Japan, enjoying a long golden age from the 17th to the 19th century. Famous contests from the period include the Blood Vomiting game, in which three moves of great subtlety were allegedly revealed to Honinbo Jowa by ghosts, enabling him to defeat his young protégé Intetsu Akaboshi, who after four days of continuous play collapsed and coughed up blood, dying of TB shortly afterwards. Another, the Ear Reddening game, turned on a move of such strength that it caused a discernible flow of blood to the ears of the master Inoue Genan Inseki. That move was, until 13 March this year, probably the most talked about move in the history of Go. That accolade probably now belongs to move 78 in the fourth game between Sedol and AlphaGo, a moment of apparently inexplicable intuition which gave Sedol his only victory in the series. The move, quickly named the Touch of God, has captured the attention not just of fans of Go but of anyone with an interest in what differentiates human from artificial intelligence.

DeepMind, the London-based company behind AlphaGo, was acquired by Google in January 2014. The £400 million price tag seemed large at the time: the company was mainly famous for DQN, a program devised to play old Atari video games from the 1980s. Mastering Space Invaders might not seem, on the face of it, much to boast about compared to beating a champion Go player, but it is the approach DeepMind has taken to both problems that impressed Google. The conventional way of writing, say, a chess program has been to identify and encode the principles underpinning sound play. That isn’t the way DeepMind’s software works. DQN doesn’t know how to repel an invasion. It doesn’t know that the electronic signals it is processing depict aliens – they are merely an array of pixels. DeepMind searches the game data for correlations, which it interprets as significant features. It then learns how those features are affected by the choices it makes and uses what it learns to make choices that will, ultimately, bring about a more desirable outcome. After just a few hours of training, the software is, if not unbeatable, then at least uncannily effective. The algorithm is almost completely generic: when presented with a different problem, that of manipulating the parameters controlling the cooling systems at one of Google’s data centres with the aim of improving fuel efficiency, it was able to cut the electricity bill by up to 40 per cent.

Demis Hassabis, the CEO of DeepMind, learned to play chess at the age of four. When he was 12 he used his winnings from an international tournament to buy a Sinclair ZX Spectrum computer. At 17 he wrote the software for Theme Park, a hugely successful simulation game. He worked in games for ten more years before studying for a PhD in cognitive neuroscience at UCL, then doing research at Harvard and MIT. In 2011 he founded DeepMind with, he has said, a two-step plan to ‘solve intelligence, and then use that to solve everything else’.

In 1965 the philosopher Hubert Dreyfus published a critique of artificial intelligence, later worked up into a book called What Computers Can’t Do, in which he argued that computers programmed to manipulate symbolic representations would never be able to complete tasks that require intelligence. His thesis was unpopular at the time, but by the turn of the century, decades of disappointment had led many to accept it. One of the differences Dreyfus identified between human intelligence and digital computation is that humans interpret information in contexts that aren’t explicitly and exhaustively represented. Someone reading such sentences as ‘the girl caught the butterfly with spots,’ or ‘the girl caught the butterfly with a net,’ doesn’t register their ambiguity. Our intuitive interpretation in such cases seems to arise from the association of connected ideas, not by logical inference on the basis of known facts about the world. The idea that computers could be programmed to work in a similar way – learning how to interpret data without the programmer’s having to provide an explicit representation of all the rules and concepts the interpretation might require – has been around for almost as long as the kind of symbol-based AI that Dreyfus was so scathing about, but it has taken until now to make it work. It is this kind of ‘machine learning’ that is behind the recent resurgence of interest in AI.

The best-known example of early machine-learning was the Perceptron, built at Cornell in 1957 to simulate a human neuron. Neurons function as simple computational units: each receives multiple inputs but has only a single output – on or off. Given numerical data about examples of a particular phenomenon, the Perceptron could learn a rule and use it to sort further examples into sets. Imagine that the Perceptron was trained using data on credit card transactions, some of which were known to be fraudulent and the rest above board. To begin with, each element of information fed to the Perceptron – it might be the size of the transaction, the time since the previous transaction, the location, or information about the vendor – is assigned a random weight. The machine submits the weighted values of the elements in each case to an algorithm – in the simplest case, it might just add them up. It then classifies the cases (fraud or not fraud) according to whether the total reaches an arbitrary threshold. The results can then be checked to find out whether the machine has assigned the example to the right or wrong side of the threshold. The weights given to the various inputs can then gradually be adjusted to improve the machine’s success rate.
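
The training loop described above can be sketched in a few lines of Python. This is only an illustration, not the original Perceptron: the transaction figures are invented, both features are assumed to be rescaled to the range 0 to 1, the threshold is folded into a `bias` term, and the function names and learning rate are hypothetical choices.

```python
import random

def weighted_sum(weights, bias, features):
    # Add up the weighted inputs; the bias plays the role of a negated threshold.
    return sum(w * x for w, x in zip(weights, features)) + bias

def classify(weights, bias, features):
    # Single on/off output: 1 means "fraud", 0 means "not fraud".
    return 1 if weighted_sum(weights, bias, features) >= 0 else 0

def train_perceptron(examples, labels, max_epochs=1000, learning_rate=0.1, seed=0):
    rng = random.Random(seed)
    # Start from random weights, as in the description above.
    weights = [rng.uniform(-1, 1) for _ in examples[0]]
    bias = rng.uniform(-1, 1)
    for _ in range(max_epochs):
        mistakes = 0
        for features, label in zip(examples, labels):
            error = label - classify(weights, bias, features)
            if error != 0:
                # Nudge each weight towards the correct side of the threshold.
                mistakes += 1
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * error
        if mistakes == 0:
            break  # every training example is now classified correctly
    return weights, bias

# Invented data: (amount, distance from home), both scaled to 0..1.
# In this toy set, fraud (label 1) occurs only with large purchases.
examples = [(0.1, 0.2), (0.2, 0.9), (0.3, 0.5), (0.8, 0.1), (0.9, 0.7), (0.7, 0.4)]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = train_perceptron(examples, labels)
predictions = [classify(weights, bias, x) for x in examples]
print(predictions)
```

Because fraud in this toy set depends only on the size of the purchase, a single straight line separates the two classes and the loop settles on weights that classify every training example correctly.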

Given enough data and a well-structured problem the Perceptron could learn a rule that could then be applied to new examples. Unfortunately, even very simple problems turned out to have a structure that is too complex to be learned in this way. Imagine that only two things are known about credit card transactions: their amount, and where they take place (since both must be expressed as numbers, let’s assume the location is expressed as the distance from the cardholder’s home address). If fraud is found to occur only with large purchases or only with distant ones, the Perceptron can be trained to distinguish fraudulent from bona fide transactions. However, if fraud occurs in small distant purchases and in large local ones, as in Figure 1, the task of classification is too complex. The approach only works with problems that are ‘linearly separable’ and, as should be clear from Figure 1, no single straight line will separate the fraud cases from the rest.
Figure 1
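
The failure case can be checked with a small experiment. The sketch below uses four invented transactions, one per corner of the plot, with features assumed to be scaled to 0..1, and the same kind of single-unit update rule. The function names and settings are hypothetical. However long it trains, at least one transaction remains misclassified.

```python
def predict(weights, bias, amount, distance):
    # Single threshold unit: one straight line through the amount/distance plane.
    total = weights[0] * amount + weights[1] * distance + bias
    return 1 if total >= 0 else 0

def count_errors(weights, bias, points, labels):
    return sum(predict(weights, bias, a, d) != label
               for (a, d), label in zip(points, labels))

def fewest_errors_seen(points, labels, epochs=200, learning_rate=0.1):
    weights, bias = [0.0, 0.0], 0.0
    best = len(points)
    for _ in range(epochs):
        for (amount, distance), label in zip(points, labels):
            error = label - predict(weights, bias, amount, distance)
            weights[0] += learning_rate * error * amount
            weights[1] += learning_rate * error * distance
            bias += learning_rate * error
        # Record the best result any single line has achieved so far.
        best = min(best, count_errors(weights, bias, points, labels))
    return best

# Invented corner cases: (amount, distance), label 1 = fraud.
points = [(0.1, 0.9),   # small and distant: fraud
          (0.9, 0.1),   # large and local: fraud
          (0.1, 0.1),   # small and local: bona fide
          (0.9, 0.9)]   # large and distant: bona fide
labels = [1, 1, 0, 0]

best = fewest_errors_seen(points, labels)
print("fewest misclassified transactions:", best)
```

The count never reaches zero because the pattern is not linearly separable. This is the same arrangement as the classic XOR problem.
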
Interest in the approach faded for a while, but at the end of the 1970s people worked out how to tackle more complex classification tasks using networks of artificial neurons arranged in layers, so that the outputs of one layer formed the inputs of the next. Consider the network in Figure 2.
Figure 2
Imagine the two nodes in the input layer are used to store, respectively, the size and location of each credit card transaction. If the left-hand node in the middle layer can be trained to detect just the cases in the top left of Figure 1 (which are linearly separable) and the right-hand node can be trained to detect only the cases to the bottom right, the two inputs to the output layer would measure the extent to which a case is a) small and distant, and b) large and local. Bona fide transactions will score low on both measures, fraud transactions will score highly on one or the other, so the two classes can now be divided by a straight line. The difficult part of all this is that the network has to identify the concepts to be captured in the hidden middle layer on the basis of information about how changing the weights on the links between the middle and output layers affects the final classification of transactions as fraud or bona fide. The problem is solved by computing a measure of how a change in the final set of weights changes the rate of errors in the classification and then propagating that measure backwards through the network.
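
To make the role of the middle layer concrete, here is a minimal sketch of a two-layer network for the same four invented corner cases. The weights are set by hand purely for illustration. They are assumptions, not values learned by backpropagation, and a trained network would find its own. Each middle-layer node draws one straight line, and the output node then separates the two resulting scores with a third.

```python
def step(total):
    # On/off output of a single artificial neuron.
    return 1 if total >= 0 else 0

def small_and_distant(amount, distance):
    # Left-hand middle node: fires only near the top left of the plot.
    return step(-1.0 * amount + 1.0 * distance - 0.5)

def large_and_local(amount, distance):
    # Right-hand middle node: fires only near the bottom right of the plot.
    return step(1.0 * amount - 1.0 * distance - 0.5)

def is_fraud(amount, distance):
    # Output node: fires if either middle-layer feature is present.
    a = small_and_distant(amount, distance)
    b = large_and_local(amount, distance)
    return step(1.0 * a + 1.0 * b - 0.5)

# Invented corner cases: (amount, distance), both scaled to 0..1.
points = [(0.1, 0.9),   # small and distant: fraud
          (0.9, 0.1),   # large and local: fraud
          (0.1, 0.1),   # small and local: bona fide
          (0.9, 0.9)]   # large and distant: bona fide

outputs = [is_fraud(amount, distance) for amount, distance in points]
print(outputs)
```

With these hand-picked weights all four cases come out right, which no single threshold unit can manage. The hard part described above is getting the network to discover suitable middle-layer weights from the classification errors alone.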

For a while multi-layer networks were a hot topic, not least because people were excited by the explicit analogy with human perception, which depends on a network of cells that compute features in a hierarchy of increasing abstraction. But, as before, early promise gave way to disappointment. The backwards propagation of errors seemed a hopelessly inefficient training algorithm if more than one or two layers separated the input and output layers. Shallow networks couldn’t be programmed to complete challenging tasks in vision or speech recognition, and given simpler tasks they were outperformed by other approaches to machine learning....MORE
May 2015
Baidu Artificial Intelligence Beats Google, Microsoft In Image Recognition
January 2015
"Inside Google’s Massive Effort in Deep Learning" (GOOG)
Dec 14 2014
Deep Learning: "A Common Logic to Seeing Cats and Cosmos"

And way back in 2012 we saw "Artificial Intelligence: Why There is No Reason to Fear The Singularity/HAL 9000":
Google researchers and Stanford scientists have discovered that if you show a large enough computing system millions of images from random YouTube videos for three days, the computer will teach itself to recognize ... cats.