From the University of Chicago Booth School of Business's Capital Ideas magazine:
We use language to describe, instruct, argue, praise, woo,
debate, joke, gossip, relate, compare, reassure, berate, suggest,
appease, threaten, discuss, forgive, respond, propose, inspire,
complain, interject, boast, agree, soothe, harangue, confess, question,
imply, express, verify, interrupt, lecture, admonish, report, direct,
explain, persuade.
Every day, we express ourselves in 500 million tweets and 64
billion WhatsApp messages. We perform more than 250 million searches on
eBay. On Facebook, 864 million of us log in to post status updates,
comment on news stories, and share videos.
Researchers have recognized something in all this text:
data. Aided by powerful computers and new statistical methods, they are
dissecting newspaper articles, financial analyst reports, economic
indicators, and Yelp reviews. They are parsing fragments of language,
encountering issues of syntax, tone, and emotion—not to mention
emoticons—to discern what we are saying, what we mean when we say it,
and what the relationship is between what we say and what we do.
They
are making new discoveries. Businesses may be able to learn about a
product defect before anyone calls customer service. Economists could
pinpoint the start of a financial crisis and determine which policy
remedies are most effective. Political junkies can use text to
understand why the phrase “mashed potato” boded ill for Newt Gingrich’s
presidential aspirations—and learn from that, too. Investors can also
benefit from analyzing text (see “Turn text into $$$$$”).
“In econometrics textbooks, data arrives in ‘rectangular’
form, with N observations and K variables, and with K typically a lot
smaller than N,” write Liran Einav and Jonathan D. Levin for the
National Bureau of Economic Research, in their survey of how economists
are using big data. In contrast to those simple textbook models, text
data—where observations might be tweets, or phrases from the
Congressional Record—are unstructured. They have what researchers call
“high dimensionality,” meaning there can be a huge number of variables,
and an enormous number of ways to organize them in a form that can be
analyzed.
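To see what that means in practice, here is a minimal sketch, assuming the scikit-learn library and a three-document toy corpus invented for illustration, of how the number of variables K quickly outstrips the number of observations N once every word and two-word phrase becomes its own column:

```python
# A toy corpus: N = 3 "observations" (think of them as tweets or phrases).
# The documents and vocabulary are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "repeal the death tax",
    "protect the undocumented worker",
    "cut taxes and create jobs",
]

# Count every single word and every two-word phrase in the corpus.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

# Even here, the number of columns (K) already exceeds the number of rows (N).
print(X.shape)                                  # (3, K), with K larger than N
print(vectorizer.get_feature_names_out()[:5])   # the first few word/phrase columns
```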
With the advent of cloud computing, the data can be stored on thousands
or millions of machines. It’s an engineering feat simply to ensure that
all those computers are communicating properly with one another. Einav
and Levin suggested in 2013 that economists must begin to study
computer-programming languages and machine-learning algorithms if they
hope to tackle cutting-edge questions. Two years later, researchers are
increasingly doing just that.
Newspapers reveal their biases
One of the pioneers of text analysis is Matthew Gentzkow,
Richard O. Ryan Professor of Economics and Neubauer Family Faculty
Fellow at Chicago Booth, who, as a graduate student, first became interested in using text analysis to tease out the economics of the media industry. An important vein of his research seeks to uncover economic
reasons behind seemingly ideological choices, such as whether newspapers
choose political affiliations to differentiate from their competitors,
or whether papers in markets that skew politically liberal or
conservative tend to use the words and phrases favored by their readers.
Gentzkow’s work won him the 2014 John Bates Clark Medal, given annually to the American economist under age 40 whom a committee of the American Economic Association judges to have made the most significant contribution to economic thought and knowledge.
He began developing his ideas about the economics of
media—and about the process of text analysis—just as technology was
beginning to give researchers far more access to text, through online
databases and internet archives that could be analyzed with keyword
searches and other methods. “I realized there was a ton of data for this
industry that people hadn’t really exploited before,” he says. It
wasn’t for lack of trying: as recently as the early 1990s, researchers
used a laborious process to transform text into a usable data set.
Frequently, graduate students assisting in research projects burrowed
through stacks of newspapers and checked off every time a word was used
in an article. Their work revealed interesting patterns, but even with
this monastic devotion, they could analyze only a small number of
newspapers in a year, compared with the number available.
Within a few years, economists had gained access to databases of
newspaper articles, as well as scanners that use optical character
recognition, which allowed them to digitize hard copies of sources such
as, say, newspaper directories dating to the 19th century. Researchers
also began to hire people in inexpensive labor markets, including India,
to combine optical character recognition with hand searches. The work
was still tedious and painstaking—but it was speeding up.
Computers can obviously read text far faster than humans
can. But unlike humans, they have to be taught to infer meaning. A
researcher trying to teach a machine to do this must provide enough
examples, over and over, of how to categorize certain patterns, until
the computer can begin to classify the text itself. Think of your email
spam filter, which learns from the messages you choose to block. Each
time you mark an email as spam, you give the filter a new example that
helps it become more accurate.
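As a rough illustration of that learning loop, here is a minimal sketch, assuming scikit-learn and a handful of invented messages; it is a toy stand-in, not any email provider's actual filter:

```python
# Toy labeled examples, invented for illustration: the "training data"
# a spam filter accumulates every time a user marks a message as spam.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "win a free prize now",           # marked as spam
    "limited time offer click here",  # marked as spam
    "meeting moved to 3pm today",     # kept (not spam)
    "draft of the report attached",   # kept (not spam)
]
labels = ["spam", "spam", "ham", "ham"]

# Word counts become the variables; Naive Bayes learns which counts signal spam.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

# Classify two new messages the model has never seen.
print(model.predict(["claim your free prize", "agenda for the 3pm meeting"]))
```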
When Gentzkow and Brown University's Jesse M. Shapiro wanted to measure
evidence of media slant in newspapers, they did something similar. To
classify newspapers as Republican or Democratic, they started with a
body of training text comprising a sample set of articles, and searched
for political buzzwords and phrases, such as “death tax” and
“undocumented worker,” to see which were widespread.
Gentzkow and Shapiro had originally thought they could
train the computer using the political platforms produced by each party.
But that didn’t work due to idiosyncratic differences in the platform
text that had nothing to do with partisanship. Using the text of
presidential debates didn’t work, either. But around the same time
Gentzkow and Shapiro were working on this puzzle, Tim Groseclose and
Jeffrey Milyo, researchers at UCLA and the University of Missouri,
respectively, were searching the Congressional Record to count how many
times Republicans and Democrats cited particular think tanks in their
congressional speeches, and comparing their counts to how often certain
newspapers cited those think tanks.
Influenced by that work, Gentzkow and Shapiro decided to train their
computer on the Congressional Record, which captures all the official
proceedings and debates of Congress. They wrote computer scripts to
scrape all the text from the searchable database. Research assistants,
the unsung heroes of text analysis, organized those messy chunks of text
in a process Gentzkow compares to painstakingly reconstructing
fragments of DNA.
Because the record identifies each speaker, the researchers
trained the computer program to recognize the differences in rhetoric
between elected Republicans and Democrats. The next step, Gentzkow
explains, was to examine the overall news content of a newspaper to
determine, “If this newspaper were a speech in Congress, does it look
more like a Republican or a Democratic speech?”
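A heavily simplified sketch of that idea, not the authors' actual estimator, might count the two-word phrases each party favors in a toy stand-in for the Congressional Record and then score a piece of newspaper text by which side's phrases it leans on. All speeches and phrases below are invented for illustration:

```python
# Toy illustration only: invented speeches, invented newspaper sentences.
from collections import Counter

def phrase_counts(texts):
    """Count two-word phrases across a list of documents."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        counts.update(zip(words, words[1:]))
    return counts

# A toy stand-in for the Congressional Record, labeled by the speaker's party.
republican_speeches = ["repeal the death tax", "secure the border now"]
democratic_speeches = ["protect the undocumented worker",
                       "tax cuts for the wealthy must end"]

rep = phrase_counts(republican_speeches)
dem = phrase_counts(democratic_speeches)

def slant_score(article_text):
    """Positive: reads more like a Republican speech; negative: more Democratic."""
    return sum(rep[p] - dem[p] for p in phrase_counts([article_text]))

print(slant_score("the senator vowed to repeal the death tax"))   # positive
print(slant_score("a bill to help the undocumented worker"))      # negative
```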
To do that, Gentzkow and Shapiro identified how often politically
charged phrases occurred in different newspapers in 2005. They
constructed an index of media slant and compared it with information
about the political preferences of the papers’ readers and the political
leanings of their owners. The aim was to find out if the Washington
Post, for example, primarily reports the slant preferred by its owner
(currently Amazon chief executive Jeff Bezos) or responds to the biases
of its customers.
Ultimately, the researchers find that customer demand—measured by circulation data that show the politics of readers in a particular zip code—accounts for a large share of the variation in slant in news coverage, while the preferences of owners account for little or none. Gentzkow and Shapiro conclude that newspapers use
certain terms because readers prefer them, not because the paper’s owner
dictates them.
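One way to picture that comparison is as a regression of each paper's slant index on the politics of its readers and an indicator for its owner. The sketch below uses invented numbers and the statsmodels library purely to show the shape of such a comparison, not the study's data or specification:

```python
# Invented numbers, purely illustrative. In this fake data, slant is driven by
# reader politics, so the reader coefficient is large and the owner indicator
# contributes almost nothing, echoing the pattern described above.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
reader_share_rep = rng.uniform(0.2, 0.8, n)   # readers' politics, by market
owned_by_chain_a = rng.integers(0, 2, n)      # hypothetical owner indicator
slant = 0.9 * reader_share_rep + rng.normal(0, 0.05, n)

X = sm.add_constant(np.column_stack([reader_share_rep, owned_by_chain_a]))
fit = sm.OLS(slant, X).fit()

print(fit.params)     # constant, reader coefficient (near 0.9), owner (near 0)
print(fit.rsquared)   # most of the variation is explained by reader politics
```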
In another study, Gentzkow, Booth PhD student Nathan Petek, Shapiro, and
Wharton’s Michael Sinkinson examined more than 23,000 pages of old newspaper directories to
determine whether political parties in the late 19th and early 20th
centuries influenced the press, measured by the number of newspapers
supporting each party, and the newspapers’ size and content.
The era they examine, 1869 to 1928, gives plenty of reason
to believe that politicians were manipulating the media for their own
advantage. Into the 1920s, half of US daily newspapers were explicitly
affiliated with a political party. State officeholders gave printing
contracts to loyal newspapers and bailed out failing ones that shared
their political agenda.
The researchers noted whether each newspaper reported a Republican or Democratic affiliation. They also looked at subscription prices, because
papers that were more popular with readers would have been able to
charge more. Did Republican newspapers increase in number and
circulation when control of a state governorship or either house of the
state legislature switched from Democratic to Republican?
In the researchers’ sample, political parties had no
significant impact on the political affiliations of the newspapers, with
one notable exception: the Reconstruction South. In nearly all of the
places where Republicans controlled state government for an extended
time, Republican newspapers reached a meaningful share of both daily and
weekly circulation while Republicans were in power. Republican shares
of weekly circulation rose to 50 percent or more in Arkansas, Florida,
and Louisiana while Republicans were in control. But Republicans’ share
declined sharply in those states and elsewhere when Democrats regained
power.
“Even if market forces discipline government intervention in most times
and places, this does not prevent governments from manipulating the
press when the market is particularly weak and the political incentives
are especially strong,” Gentzkow, Petek, Shapiro, and Sinkinson write.
Another study by Gentzkow and his coauthors, in which they
again examine newspapers’ political affiliations and circulation
figures, suggests that market competition increases newspapers’
ideological diversity. They also find, separately, that the popular
notion of an internet “echo chamber,” where people segregate themselves
by ideology, has been overblown. Writing in 2011, Gentzkow and Shapiro
assert that most online news consumption is dominated by a small number
of websites that express relatively centrist political views. In fact,
people segregate themselves according to politics much less online than
they do in face-to-face interactions with neighbors or coworkers. (For
more on whether the web causes political polarization, watch
the Big Question episode from August 2013.)
Other researchers are using large-scale text analysis to parse bodies of
language ranging from financial statements and company documents to
eBay product descriptions to the Google Books corpus, Gentzkow says. “It
seems like [text analysis] is going to keep getting bigger as the
methodology improves.”...
MUCH MORE