Sunday, May 3, 2015

"Why words are the new numbers"

From the University of Chicago's Booth School of Business' Capital Ideas Magazine:


We use language to describe, instruct, argue, praise, woo, debate, joke, gossip, relate, compare, reassure, berate, suggest, appease, threaten, discuss, forgive, respond, propose, inspire, complain, interject, boast, agree, soothe, harangue, confess, question, imply, express, verify, interrupt, lecture, admonish, report, direct, explain, persuade.

Every day, we express ourselves in 500 million tweets and 64 billion WhatsApp messages. We perform more than 250 million searches on eBay. On Facebook, 864 million of us log in to post status updates, comment on news stories, and share videos.
Researchers have recognized something in all this text: data. Aided by powerful computers and new statistical methods, they are dissecting newspaper articles, financial analyst reports, economic indicators, and Yelp reviews. They are parsing fragments of language, encountering issues of syntax, tone, and emotion—not to mention emoticons—to discern what we are saying, what we mean when we say it, and what the relationship is between what we say and what we do.
They are making new discoveries. Businesses may be able to learn about a product defect before anyone calls customer service. Economists could pinpoint the start of a financial crisis and determine which policy remedies are most effective. Political junkies can use text to understand why the phrase “mashed potato” boded ill for Newt Gingrich’s presidential aspirations—and learn from that, too. Investors can also benefit from analyzing text.

“In econometrics textbooks, data arrives in ‘rectangular’ form, with N observations and K variables, and with K typically a lot smaller than N,” write Liran Einav and Jonathan D. Levin for the National Bureau of Economic Research, in their survey of how economists are using big data. In contrast to those simple textbook models, text data—where observations might be tweets, or phrases from the Congressional Record—are unstructured. They have what researchers call “high dimensionality,” meaning there can be a huge number of variables, and an enormous number of ways to organize them in a form that can be analyzed.
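To make the contrast concrete, here is a minimal sketch of how unstructured text becomes a “rectangular” matrix whose column count explodes. It uses Python’s scikit-learn with a few invented toy tweets; none of this is the researchers’ own code or data.

```python
# Toy illustration of "high dimensionality" in text data: every distinct
# word and two-word phrase becomes a column, so K (columns) quickly
# dwarfs N (documents).
from sklearn.feature_extraction.text import CountVectorizer

tweets = [
    "the fed raised rates again",
    "rates fell after the jobs report",
    "great mashed potatoes at this diner",
]

# Count every unigram and bigram appearing in the corpus.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(tweets)

print(X.shape)  # (3, ~29): K is already roughly ten times N
```

With only three short tweets, the matrix already has nearly thirty columns; a real corpus of millions of tweets can have millions of distinct phrases.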
With the advent of cloud computing, the data can be stored on thousands or millions of machines. It’s an engineering feat simply to ensure that all those computers are communicating properly with one another. Einav and Levin suggested in 2013 that economists must begin to study computer-programming languages and machine-learning algorithms if they hope to tackle cutting-edge questions. Two years later, researchers are increasingly doing just that.

Newspapers reveal their biases
One of the pioneers of text analysis is Matthew Gentzkow, Richard O. Ryan Professor of Economics and Neubauer Family Faculty Fellow at Chicago Booth, who first became interested, as a graduate student, in using text analysis to tease out the economics of the media industry. An important vein of his research seeks to uncover economic reasons behind seemingly ideological choices, such as whether newspapers choose political affiliations to differentiate themselves from their competitors, or whether papers in markets that skew politically liberal or conservative tend to use the words and phrases favored by their readers. Gentzkow’s work won him the 2014 John Bates Clark Medal, awarded annually to the American economist under age 40 who, in the judgment of a committee of the American Economic Association, has made the most significant contribution to economic thought and knowledge.


He began developing his ideas about the economics of media—and about the process of text analysis—just as technology was beginning to give researchers far more access to text, through online databases and internet archives that could be analyzed with keyword searches and other methods. “I realized there was a ton of data for this industry that people hadn’t really exploited before,” he says. It wasn’t for lack of trying: as recently as the early 1990s, researchers used a laborious process to transform text into a usable data set. Frequently, graduate students assisting in research projects burrowed through stacks of newspapers and checked off every time a word was used in an article. Their work revealed interesting patterns, but even with this monastic devotion, they could analyze only a small number of newspapers in a year, compared with the number available.

Within a few years, economists had gained access to databases of newspaper articles, as well as scanners that use optical character recognition, which allowed them to digitize hard copies of sources such as newspaper directories dating to the 19th century. Researchers also began to hire people in inexpensive labor markets, including India, to combine optical character recognition with hand searches. The work was still tedious and painstaking, but it was speeding up.

Computers can obviously read text far faster than humans can. But unlike humans, they have to be taught to infer meaning. A researcher trying to teach a machine to do this must provide enough examples, over and over, of how to categorize certain patterns, until the computer can begin to classify the text itself. Think of your email spam filter, which learns from the messages you choose to block. Each time you mark an email as spam, you give the filter a new example that helps it become more accurate.
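A toy version of that learn-by-example loop, assuming scikit-learn and a handful of made-up messages, might look like the following. It is a sketch of the idea, not how any production spam filter is actually built.

```python
# Each labeled message is one more training example; the model's guesses
# improve as examples accumulate. (Illustrative only.)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = [
    "win a free prize now",        # spam
    "lowest price guaranteed",     # spam
    "meeting moved to 3pm",        # not spam
    "draft attached for review",   # not spam
]
labels = ["spam", "spam", "ham", "ham"]

# Turn each message into word counts, then fit a Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

print(model.predict(["free prize inside"]))  # -> ['spam']
```

Marking one more email as spam amounts to appending one more example to the training set and refitting, which is why the filter keeps getting more accurate over time.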

When Gentzkow and Brown University's Jesse M. Shapiro wanted to measure evidence of media slant in newspapers, they did something similar. To classify newspapers as Republican or Democratic, they started with a body of training text comprising a sample set of articles, and searched for political buzzwords and phrases, such as “death tax” and “undocumented worker,” to see which were widespread.

Gentzkow and Shapiro had originally thought they could train the computer using the political platforms produced by each party. But that didn’t work due to idiosyncratic differences in the platform text that had nothing to do with partisanship. Using the text of presidential debates didn’t work, either. But around the same time Gentzkow and Shapiro were working on this puzzle, Tim Groseclose and Jeffrey Milyo, researchers at UCLA and the University of Missouri, respectively, were searching the Congressional Record to count how many times Republicans and Democrats cited particular think tanks in their congressional speeches, and comparing their counts to how often certain newspapers cited those think tanks. 

Influenced by that work, Gentzkow and Shapiro decided to train their computer on the Congressional Record, which captures all the official proceedings and debates of Congress. They wrote computer scripts to scrape all the text from the searchable database. Research assistants, the unsung heroes of text analysis, organized those messy chunks of text in a process Gentzkow compares to painstakingly reconstructing fragments of DNA.
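The scripts themselves are not reproduced in the article, but a stripped-down sketch of that kind of scraper, with a placeholder URL and invented page markup standing in for the real database, could look like this:

```python
# Hedged sketch of a text-scraping script; the URL and HTML structure
# below are hypothetical, not the actual Congressional Record source.
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.gov/congressional-record"  # placeholder

def fetch_speech_text(page: int) -> str:
    """Download one results page and keep only the speech text."""
    resp = requests.get(BASE_URL, params={"page": page}, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Hypothetical markup: each speech sits in a <div class="speech">.
    return "\n".join(div.get_text(" ", strip=True)
                     for div in soup.find_all("div", class_="speech"))

# Pull a few pages of raw text for later cleaning and assembly.
corpus = [fetch_speech_text(page) for page in range(1, 4)]
```

The scraping is the easy half; as Gentzkow's DNA comparison suggests, most of the labor goes into reassembling and cleaning the messy fragments the scraper returns.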

Because the record identifies each speaker, the researchers trained the computer program to recognize the differences in rhetoric between elected Republicans and Democrats. The next step, Gentzkow explains, was to examine the overall news content of a newspaper to determine, “If this newspaper were a speech in Congress, does it look more like a Republican or a Democratic speech?”

To do that, Gentzkow and Shapiro identified how often politically charged phrases occurred in different newspapers in 2005. They constructed an index of media slant and compared it with information about the political preferences of the papers’ readers and the political leanings of their owners. The aim was to find out if the Washington Post, for example, primarily reports the slant preferred by its owner (currently Amazon chief executive Jeff Bezos) or responds to the biases of its customers.
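Their actual estimator is regression-based, but a stylized sketch conveys the logic: score each phrase by how disproportionately one party uses it, then average those scores over the phrases a newspaper prints. All counts below are invented for illustration.

```python
# Stylized slant score (not Gentzkow and Shapiro's actual estimator).
# Hypothetical uses of each phrase per 10,000 words of congressional
# speech, by party.
phrase_rates = {
    "death tax":           {"R": 8.0, "D": 0.5},
    "undocumented worker": {"R": 0.4, "D": 6.0},
}

def partisanship(phrase: str) -> float:
    """Positive = Republican-leaning phrase, negative = Democratic-leaning."""
    r, d = phrase_rates[phrase]["R"], phrase_rates[phrase]["D"]
    return (r - d) / (r + d)

def slant(newspaper_counts: dict) -> float:
    """Average the partisanship of the phrases a paper actually uses."""
    total = sum(newspaper_counts.values())
    return sum(n * partisanship(p) for p, n in newspaper_counts.items()) / total

# A paper that prints "death tax" far more often scores Republican-leaning.
print(round(slant({"death tax": 30, "undocumented worker": 5}), 2))  # ~ +0.63
print(round(slant({"death tax": 5, "undocumented worker": 30}), 2))  # ~ -0.62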

Ultimately, the researchers find that customer demand—measured by circulation data that show the politics of readers in a particular zip code—accounted for a large share of the variation in slant in news coverage, while the preferences of owners accounted for little or none. Gentzkow and Shapiro conclude that newspapers use certain terms because readers prefer them, not because the paper’s owner dictates them.
In another study, Gentzkow, Booth PhD student Nathan Petek, Shapiro, and Wharton’s Michael Sinkinson examined more than 23,000 pages of old newspaper directories to determine whether political parties in the late 19th and early 20th centuries influenced the press, as measured by the number of newspapers supporting each party and by the newspapers’ size and content.
The era they examined, 1869 to 1928, gives plenty of reason to believe that politicians were manipulating the media for their own advantage. Into the 1920s, half of US daily newspapers were explicitly affiliated with a political party. State officeholders gave printing contracts to loyal newspapers and bailed out failing ones that shared their political agenda.

The researchers noted whether each newspaper reported a Republican or Democratic affiliation. They also looked at subscription prices, because papers that were more popular with readers would have been able to charge more. Did Republican newspapers increase in number and circulation when control of a state governorship or either house of the state legislature switched from Democratic to Republican?
In the researchers’ sample, political parties had no significant impact on the political affiliations of newspapers, with one notable exception: the Reconstruction South. In nearly all of the places where Republicans controlled state government for an extended period, Republican newspapers reached a meaningful share of both daily and weekly circulation. Republican shares of weekly circulation rose to 50 percent or more in Arkansas, Florida, and Louisiana while Republicans were in control, then declined sharply in those states and elsewhere when Democrats regained power.

“Even if market forces discipline government intervention in most times and places, this does not prevent governments from manipulating the press when the market is particularly weak and the political incentives are especially strong,” Gentzkow, Petek, Shapiro, and Sinkinson write.

Another study by Gentzkow and his coauthors, in which they again examine newspapers’ political affiliations and circulation figures, suggests that market competition increases newspapers’ ideological diversity. They also find, separately, that the popular notion of an internet “echo chamber,” where people segregate themselves by ideology, has been overblown. Writing in 2011, Gentzkow and Shapiro assert that most online news consumption is dominated by a small number of websites that express relatively centrist political views. In fact, people segregate themselves according to politics much less online than they do in face-to-face interactions with neighbors or coworkers. (For more on whether the web causes political polarization, watch the Big Question episode from August 2013.)

Other researchers are using large-scale text analysis to parse bodies of language ranging from financial statements and company documents to eBay product descriptions to the Google Books corpus, Gentzkow says. “It seems like [text analysis] is going to keep getting bigger as the methodology improves.”...
...MUCH MORE