From Quanta:
Big Data’s Mathematical Mysteries
At a dinner I attended some years ago, the distinguished differential
geometer Eugenio Calabi volunteered to me his tongue-in-cheek
distinction between pure and applied mathematicians. A pure
mathematician, when stuck on the problem under study, often decides to
narrow the problem further and so avoid the obstruction. An applied
mathematician interprets being stuck as an indication that it is time to
learn more mathematics and find better tools.
I have always loved this point of view; it explains how applied
mathematicians will always need to make use of the new concepts and
structures that are constantly being developed in more foundational
mathematics. This is particularly evident today in the ongoing effort to
understand “big data” — data sets that are too large or complex to be understood using traditional data-processing techniques.
Our current mathematical understanding of many techniques
that are central to the ongoing big-data revolution is inadequate, at
best. Consider the simplest case, that of supervised learning, which has
been used by companies such as Google, Facebook and Apple to create
voice- or image-recognition technologies with a near-human level of
accuracy. These systems start with a massive corpus of training samples —
millions or billions of images or voice recordings — which are used to
train a deep neural network to spot statistical regularities. As in
other areas of machine learning, the hope is that computers can churn
through enough data to “learn” the task:
Instead of being programmed with the detailed steps necessary for the
decision process, the computers follow algorithms that gradually lead
them to focus on the relevant patterns.
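As a cartoon of an algorithm that "gradually leads" a program toward the relevant pattern, here is a minimal sketch assuming gradient descent (the standard choice; the article names no specific algorithm), recovering a hidden linear rule from example inputs and outputs alone:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training corpus: each input x comes with its correct output y.
x = rng.uniform(-3.0, 3.0, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=200)   # hidden pattern plus noise

# Rather than being told the rule "y = 2x + 1", the program nudges two
# parameters, step by step, in whatever direction shrinks its errors.
w, b = 0.0, 0.0
learning_rate = 0.05
for _ in range(1000):
    errors = (w * x + b) - y
    w -= learning_rate * np.mean(errors * x)   # gradient-descent updates
    b -= learning_rate * np.mean(errors)
# After training, (w, b) lands close to the hidden (2.0, 1.0).
```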
In mathematical terms, these supervised-learning systems are given a
large set of inputs and the corresponding outputs; the goal is for a
computer to learn the function that will reliably transform a new input
into the correct output. To do this, the computer breaks down the
mystery function into a number of layers of unknown functions called
sigmoid functions. These S-shaped functions look like a street-to-curb
transition: a smoothed step from one level to another, where the
starting level, the height of the step and the width of the transition
region are not determined ahead of time.
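A minimal sketch of such a sigmoid, with the undetermined quantities exposed as free parameters (the names are mine, for illustration):

```python
import numpy as np

def sigmoid_step(x, start=0.0, height=1.0, center=0.0, width=1.0):
    """A smoothed step: rises from `start` to `start + height`,
    centered at `center`, over a transition region set by `width`."""
    return start + height / (1.0 + np.exp(-(x - center) / width))

# A step from level 2 up to level 5, centered at x = 1, with a gentle transition:
curb = sigmoid_step(np.linspace(-4, 6, 11), start=2.0, height=3.0, center=1.0, width=1.5)
```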
Inputs enter the first layer of sigmoid functions, which spits out
results that can be combined before being fed into a second layer of
sigmoid functions, and so on. This web of resulting functions
constitutes the “network” in a neural network. A “deep” one has many
layers.
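To make the layering concrete, here is a toy forward pass in the same spirit. The sizes and random weights are placeholders; a trained network would have learned these values from the training samples:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, n_out):
    """One layer: linearly combine the inputs, then apply sigmoids."""
    W = rng.normal(size=(n_out, x.size))   # placeholder weights
    b = rng.normal(size=n_out)             # placeholder biases
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))

x = np.array([0.5, -1.2, 3.0])   # an input with three features
h1 = layer(x, 8)                 # first layer of sigmoid functions
h2 = layer(h1, 8)                # second layer, fed the combined results
y = layer(h2, 1)                 # output; a "deep" net stacks many such layers
```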
Decades
ago, researchers proved that these networks are universal, meaning
that, with enough nodes, they can approximate essentially any function
as closely as desired. Other researchers later
proved a number of theoretical results about the unique correspondence
between a network and the function it generates. But these results
assume networks that can have extremely large numbers of layers and of
function nodes within each layer. In practice, neural networks use
anywhere between two and two dozen layers.
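For reference, one classical form of this universality result is Cybenko's 1989 theorem: a single hidden layer already suffices, provided the number of units N may be arbitrarily large. In my notation:

```latex
% Universal approximation (Cybenko, 1989): for any continuous sigmoidal
% \sigma, any continuous target f on the unit cube, and any tolerance
% \varepsilon > 0, there exist N, \alpha_j, w_j, b_j such that
\[
  G(x) = \sum_{j=1}^{N} \alpha_j \,\sigma\!\left( w_j^{\top} x + b_j \right)
  \qquad\text{satisfies}\qquad
  \sup_{x \in [0,1]^n} \bigl| G(x) - f(x) \bigr| < \varepsilon .
\]
```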
Because of this limitation, none of the classical results come close to
explaining why neural networks and deep learning work as spectacularly
well as they do.
It is the guiding principle of many applied mathematicians that if
something mathematical works really well, there must be a good
underlying mathematical reason for it, and we ought to be able to
understand it. In this particular case, it may be that we don’t even
have the appropriate mathematical framework to figure it out yet. (Or,
if we do, it may have been developed within an area of “pure”
mathematics from which it hasn’t yet spread to other mathematical
disciplines.)
Another machine-learning technique, unsupervised learning, is used to
discover hidden connections in large data sets. Let's
say, for example, that you’re a researcher who wants to learn more about
human personality types. You’re awarded an extremely generous grant
that allows you to give 200,000 people a 500-question personality test,
with answers that vary on a scale from one to 10. Eventually you find
yourself with 200,000 data points in 500 virtual “dimensions” — one
dimension for each of the original questions on the personality quiz.
These points, taken together, form a lower-dimensional “surface” in the
500-dimensional space in the same way that a simple plot of elevation
across a mountain range creates a two-dimensional surface in
three-dimensional space....MORE
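To see the kind of structure being described, here is a small sketch using principal component analysis (my choice of method; the excerpt stops before naming one). On synthetic survey data driven by a handful of hidden factors, the flat lower-dimensional "surface" shows up as a few directions soaking up nearly all the variance:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the survey (scaled down from the article's 200,000
# respondents for a quick demo): answers on a 1-to-10 scale, secretly
# driven by a few latent "personality" factors.
n_people, n_questions, n_factors = 20_000, 500, 5
latent = rng.normal(size=(n_people, n_factors))
loadings = rng.normal(size=(n_factors, n_questions))
noise = rng.normal(scale=0.5, size=(n_people, n_questions))
answers = np.clip(5.5 + latent @ loadings + noise, 1, 10)

# PCA: eigenvectors of the covariance matrix give the directions of the
# flat lower-dimensional surface that best fits the cloud of points.
centered = answers - answers.mean(axis=0)
cov = centered.T @ centered / (n_people - 1)    # 500 x 500
eigvals = np.linalg.eigvalsh(cov)[::-1]         # largest first
print(eigvals[:8] / eigvals.sum())              # variance concentrates in ~5 directions
```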