From Logic Magazine:
You can’t understand the AI systems that are transforming our world without understanding the datasets they are built on.
On the night of March 18, 2018, Elaine Herzberg was walking her bicycle across a dark desert road in Tempe, Arizona. After she had crossed three lanes of a four-lane highway, she was struck by a "self-driving" Volvo SUV traveling at thirty-eight miles per hour. Thirty minutes later, she was dead. The SUV had been operated by Uber, part of a fleet of self-driving car experiments running across the state. A report by the National Transportation Safety Board determined that the car's sensors had detected an object in the road six seconds before the crash, but the software "did not include a consideration for jaywalking pedestrians." In the moments before the car hit Elaine, its AI software cycled through several potential identifiers for her—including "bicycle," "vehicle," and "other"—but ultimately it was unable to recognize her as a pedestrian whose trajectory put her directly in the vehicle's collision path.
How did this happen? The particular kind of AI at work in autonomous vehicles is called machine learning. Machine learning enables computers to “learn” certain tasks by analyzing data and extracting patterns from it. In the case of self-driving cars, the main task that the computer must learn is how to see. More specifically, it must learn how to perceive and meaningfully describe the visual world in a manner comparable to humans. This is the field of computer vision, and it encompasses a wide range of controversial and consequential applications, from facial recognition to drone strike targeting.
Unlike traditional software developers, machine learning engineers do not write explicit rules that tell a computer exactly what to do. Rather, they enable a computer to "learn" what to do by discovering patterns in data. The information used for teaching computers is known as training data. Everything a machine learning model knows about the world comes from the data it is trained on. Say an engineer wants to build a system that predicts whether an image contains a cat or a dog. If their cat-detector model is trained only on cat images taken inside homes, the model will have a hard time recognizing cats in other contexts, such as in a yard. Machine learning engineers must constantly evaluate how well a computer has learned to perform a task, which in turn helps them tweak the code so the computer learns better. In the case of computer vision, think of an optometrist evaluating how well you can see. Depending on what they find, you might get a new glasses prescription to help you see better.
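To make the "learning from data" idea concrete, here's a minimal toy sketch in Python using scikit-learn. The random arrays are stand-ins for real image features and cat/dog labels (our assumption, purely for illustration), but the mechanics are the same: fitting the model *is* the learning.

```python
# Toy sketch of the "learning from patterns in data" step.
# The random arrays below stand in for real, preprocessed image data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.random((200, 64))    # 200 "images", 64 features each
y_train = rng.integers(0, 2, 200)  # labels: 1 = cat, 0 = dog

# "Learning" means fitting the model's parameters to the training data;
# no explicit cat-detecting rules are ever written by hand.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# The model now predicts labels for new images -- but only as well as its
# training data allows. Indoor-cat-only training data means indoor-cat-only skill.
print(model.predict(X_train[:5]))
```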
To evaluate a model, engineers expose it to another type of data known as testing data. For the cat-detector model, the testing data might consist of both cats and other animals. The model would then be evaluated based on how many of the cats it correctly identified in the dataset. Testing data is critical to understanding how a machine learning system will operate once deployed in the world. However, the evaluation is always limited by the content and structure of the testing data. For example, if there are no images of outdoor cats within the testing data, a cat-detector model might do a really good job of recognizing all the cats in the testing data, but still do poorly if deployed in the real world, where cats might be found in all sorts of contexts. Similarly, evaluating Uber’s self-driving AI on testing data that doesn’t contain very many jaywalking pedestrians will not provide an accurate estimate of how the system will perform in a real-world situation when it encounters one.
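Again in toy form, evaluation against held-out testing data looks something like this (the data is random filler, so the accuracy number here means nothing beyond illustrating the mechanics):

```python
# Toy sketch of evaluation on held-out testing data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((200, 64))    # stand-in image features
y = rng.integers(0, 2, 200)  # stand-in labels: 1 = cat, 0 = dog

# Hold out a quarter of the data as testing data the model never trains on.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2%}")

# The article's caveat, in code terms: this number is only meaningful for
# the kinds of images X_test contains. No outdoor cats in the test set
# means no evidence about outdoor cats in the world.
```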
Finally, a benchmark dataset is used to judge how well a computer has learned to perform a task. Benchmarks are special sets of training and testing data that allow engineers to compare their machine learning methods against each other. They are measurement devices that provide an estimate of how well AI software will perform in a real-world setting. Most are circulated publicly, while others are proprietary. The AI software that steered the car that killed Elaine Herzberg was most likely evaluated on several internal benchmark datasets; Uber has named and published information on at least one. More broadly, benchmarks guide the course of AI development. They are used to establish the dominance of one approach over another, and ultimately influence which methods get utilized in industry settings.
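In code terms, a benchmark is essentially a fixed train/test split that competing methods are all scored against, so the numbers are directly comparable. A toy sketch, again with stand-in data:

```python
# Toy sketch of what a benchmark does: fix one dataset and one split,
# then score competing methods against it.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 64))
y = rng.integers(0, 2, 200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Because both methods face the identical test set, their scores can be
# ranked -- which is exactly how benchmark leaderboards work.
for name, method in [("logistic regression", LogisticRegression(max_iter=1000)),
                     ("k-nearest neighbors", KNeighborsClassifier())]:
    method.fit(X_train, y_train)
    print(f"{name}: {accuracy_score(y_test, method.predict(X_test)):.2%}")
```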
The single most important benchmark in the field of computer vision, and perhaps AI as a whole, is ImageNet. Created in the late 2000s, ImageNet contains millions of pictures—of people, animals, and everyday objects—scraped from the web. The dataset was developed for a particular computer vision task known as “object recognition.” Given an image, the AI should tag it with labels, such as “cat” or “dog,” describing what it depicts.
It is hard to overstate the impact that ImageNet has had on AI. It inaugurated an entirely new era in the field, centered on the collection and processing of large quantities of data, and it elevated the benchmark to a position of great influence. Benchmarks have become the way to evaluate the performance of an AI system, as well as the dominant mode of tracking progress in the field more generally. Those who have developed the best-performing methods on the ImageNet benchmark in particular have gone on to occupy prestigious positions in industry and academia. Meanwhile, the AI systems built atop ImageNet are being used for purposes as varied as refugee settlement mapping and the identification of military targets—including the technology that powers Project Maven, the Pentagon's algorithmic warfare initiative.
The assumption that lies at the root of ImageNet’s power is that benchmarks provide a reliable, objective metric of performance. This assumption is widely held within the industry: startup founders have described ImageNet as the “de-facto image dataset for new algorithms,” and most major machine learning software packages offer convenient methods for evaluating models against it. As the death of Elaine Herzberg makes clear, however, benchmarks can be misleading....
....MUCH MORE
Related, 2016's "Machine Learning and the Importance of 'Cat Face'" and many, many more:
Disrupting Surveillance Capitalism
....Poisoning the Well
Most machine learning models are constructed according to the following general procedure:
- Collect training data.
- Run a machine learning algorithm, such as a neural network, over the training data to learn from it.
- Integrate the model into your service.
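In toy form, with the data collection and the "service" reduced to hypothetical stubs (our assumption, for illustration only), the three steps look something like this:

```python
# Toy sketch of the three-step procedure above.
import numpy as np
from sklearn.neural_network import MLPClassifier

# 1. Collect training data. Here: a stub returning random stand-in
#    per-user browsing features and a clicked-the-ad yes/no label.
def collect_training_data():
    rng = np.random.default_rng(0)
    X = rng.random((500, 10))
    y = rng.integers(0, 2, 500)
    return X, y

# 2. Run a machine learning algorithm (here, a small neural network)
#    over the training data to learn from it.
X, y = collect_training_data()
model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X, y)

# 3. Integrate the model into your service: call it at request time.
def serve_prediction(user_features):
    return int(model.predict(user_features.reshape(1, -1))[0])

print(serve_prediction(np.random.default_rng(1).random(10)))
```

The sabotage described next goes after step one: the data collection.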
Many websites collect training data with embedded code that tracks what you do on the internet. This information is supposed to identify your preferences, habits, and other facets of your online and offline activity. The effectiveness of this data collection relies on the assumption that browsing habits are an honest portrayal of an individual.
A simple act of sabotage is to violate this assumption by generating "noise" while browsing. You can do this by opening random links, so that it's unclear which are the "true" sites you've visited—a process automated by Dan Schultz's Internet Noise project, available at makeinternetnoise.com. Because your data is not only used to make assumptions about you, but about other users with similar browsing patterns, you end up interfering with the algorithm's conclusions about an entire group of people.....
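For the curious, the gist of the noise trick can be sketched in a few lines of Python. The decoy URL list and the timing below are illustrative assumptions on our part, not the Internet Noise project's actual code (which runs in the browser):

```python
# Toy sketch of browsing "noise": interleave requests to decoy sites so
# trackers can't tell which visits reflect genuine interest.
import random
import time
import urllib.request

# Hypothetical decoy list; the real project opens random pages in tabs.
DECOY_URLS = [
    "https://en.wikipedia.org/wiki/Special:Random",
    "https://example.com",
    "https://example.org",
]

def make_noise(n_requests=5):
    for _ in range(n_requests):
        url = random.choice(DECOY_URLS)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                print(f"visited {url} ({resp.status})")
        except OSError as exc:
            print(f"failed {url}: {exc}")
        # Random pauses so the pattern looks less mechanical.
        time.sleep(random.uniform(1, 5))

if __name__ == "__main__":
    make_noise()
```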
Bank of England: "Opening the machine learning black box"
"The Fundamental Limits of Machine Learning"
Actually, we have so many posts on this stuff that you could probably pass this course:
"Why Is Machine Learning (CS 229) The Most Popular Course At Stanford?"
which was a 2013 post.
And which led, naturally enough, to 2014's "Deep Learning is VC Worthy".
Here's a search of the blog for Machine Learning and another for Adversarial, as in adversarial networks. If we search for the more general AI at the GOOG we see 651 hits including some old favorites.