March 26, 2014
Using big data in finance: Example of sentiment-extraction from news articles
Nitish Sinha
There is much discussion and research in finance on using "big data" to understand market "sentiment." In this note, I will draw on some of my own research in behavioral finance--Sinha (2010) and Heston and Sinha (2013)--to share my perspective on the current state of affairs in this area, particularly on the meaning of "sentiment" in the context of big data research.1
The Meaning of "Sentiment" among Finance and Computer Science Researchers
Let me start with some possible confusion caused by a simple word that is used quite differently in the two disciplines that meet in this particular area--finance and computer science. In finance, the word "sentiment" is generally understood to be an irrational belief about future cash flows.2 The key test of sentiment is outlined in Tetlock (2007), where he points out that "The sentiment theory predicts short-horizon returns will be reversed in the long run, whereas the information theory predicts they will persist indefinitely." In other words, "sentiment" is a good short-term contrarian indicator. In finance, sentiment has been measured using days of sunshine in New York City and wins in soccer games, among other proxies.3 The common feature these measures share is that the short-term positive returns associated with them tend to be reversed over the next few days.
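To make the reversal test concrete, here is a minimal sketch (not taken from any of the cited papers) of how one might check whether a sentiment proxy behaves as a short-term contrarian indicator. It assumes a pandas DataFrame `df` with a daily sentiment proxy in a column named "sent" and daily market returns in a column named "ret"; the column names and data are illustrative assumptions.

```python
# Hypothetical sketch of a short- vs. long-horizon reversal test.
# Assumes a DataFrame `df` with columns "sent" (daily sentiment proxy)
# and "ret" (daily returns); names and data are illustrative only.
import pandas as pd
import statsmodels.api as sm

def horizon_beta(df: pd.DataFrame, horizon: int) -> float:
    """Slope from regressing the next `horizon`-day cumulative return on today's sentiment."""
    fwd_ret = df["ret"].rolling(horizon).sum().shift(-horizon)  # return over days t+1 ... t+horizon
    sample = pd.concat([fwd_ret.rename("fwd_ret"), df["sent"]], axis=1).dropna()
    fit = sm.OLS(sample["fwd_ret"], sm.add_constant(sample["sent"])).fit()
    return fit.params["sent"]

# Under the sentiment story, the short-horizon slope is positive but fades or
# reverses at longer horizons; under the information story, it persists.
# short_beta, long_beta = horizon_beta(df, 1), horizon_beta(df, 20)
```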
Computer scientists and computational linguists use the word "sentiment" differently. For example, Pang and Lee (2008) explain that over time "sentiment" has come to mean opinion or subjective information. It is not clear whether market participants' opinions should be treated as uninformative. There are many places where investors express their opinions. Of course, the traded price is where the rubber meets the road and people vote with their bank accounts. But it is possible that somebody with an informed opinion and perhaps a thinner wallet might lose to somebody else with an uninformed opinion and a thicker wallet.4
The other venues where investors express opinions are newspaper articles, news wires, op-eds, and now Twitter feeds. In "Underreaction to News in the U.S. Stock Market," I explore "opinion extraction" and find that market prices tend to under-react to textual information appearing in news articles. This finding differs from other related findings in the finance literature, such as Tetlock (2007) and Loughran and McDonald (2010). In "News versus Sentiment," a paper co-authored with Steve Heston at the University of Maryland, we set out to find why some evidence points to the market under-reacting to information contained in news articles, while other evidence points to the market properly reacting to the same type of information, and still other evidence suggests a tendency for the market to overreact. I will say a bit more about this paper, since it also provides a window into one approach to working with big data.
One reason for the apparently disparate results on the reaction of market prices to textual information could be that researchers have started from different texts. Press releases are written by firms themselves and might not be impartial. Journalists are trained to write in an impartial fashion and could deliberately blunt the opinions in their articles. Tweets are short, although they can be quite opinionated and dense in information; they also present a parsing challenge, since tweeters often use a different vocabulary. We use a common corpus of news articles for all text-processing techniques, which allows us to isolate the effect of the text-processing technology itself. In my research with Heston, we chose all news articles written by Thomson Reuters journalists because news from a wire service has some distinct advantages. First, wire services cater primarily to the investment needs of their subscribers; stories appear not because they are catchy but because they are economically relevant.5 Also, at a wire service, a story does not need to out-compete all the other news articles to get published, a potential source of truncation of information in newspapers. Comedian Jerry Seinfeld is credited with observing, "It's amazing that the amount of news that happens in the world every day always just exactly fits the newspaper." One might equally wonder about the news that never makes it into the newspaper because the newspaper was already full.6 Given all these considerations, we expect wire services to have fewer of these biases. Thomson Reuters itself provides a measure of the tone of news articles, based on a neural network applied on top of some linguistic analysis. (Please consult Sinha (2010) for the methodological details.)
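As an illustration of what such an article-level tone measure can feed into, the sketch below shows one hypothetical way to aggregate per-article tone scores into a daily, firm-level series. The DataFrame `articles` and its columns ("date", "ticker", "prob_pos", "prob_neg") are assumptions made for the example, not the actual layout of the Thomson Reuters data.

```python
# Hypothetical aggregation of article-level tone into a daily firm-level series.
# Assumes a DataFrame `articles` with one row per article and columns
# "date", "ticker", "prob_pos", "prob_neg"; column names are illustrative only.
import pandas as pd

def daily_tone(articles: pd.DataFrame) -> pd.DataFrame:
    """Average net tone (P(positive) - P(negative)) per firm per day."""
    articles = articles.assign(net_tone=articles["prob_pos"] - articles["prob_neg"])
    return (
        articles.groupby(["ticker", "date"])["net_tone"]
        .mean()
        .rename("daily_tone")
        .reset_index()
    )
```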
A Machine-Learning Algorithm for Classifying News: The Big Picture
Perhaps a digression into the area of "machine learning" is useful here, since it is potentially relevant to extracting "sentiment" from big data. I will somewhat oversimplify some of the issues in the interest of brevity. Machine learning, as used in the context of big data, is really a method for classifying data into different categories. It can be further subdivided into supervised and unsupervised learning. In an unsupervised learning method, the researcher lets the data dictate the categories into which the data will be classified; a trivial example would be to allow as many categories as there are observations. The methodology of supervised machine learning can be thought of as comprising three steps--"Tag," "Train," and "Classify."

In the case of classifying the textual information in news articles, the first step, "Tag," requires the researcher to carefully select some articles considered likely to be representative of the broader group that will need to be classified for the project. She then tags the articles into the desired categories, for example "good news" and "bad news." There is some temptation to tag an article as positive if the return following its publication is positive, and negative if the return following its publication is negative. Similarly, one might want to tag an article as pertinent if there was any market reaction following its publication, and as not relevant if there was no reaction. Such tags would capture how the market reacted to the article in light of all of the other information the market had at the time of its release. For example, if an article was written on a Friday, or on a day when the market was otherwise engaged with thoughts of sunny days or the Super Bowl, the researcher would run the risk of classifying the article as not even pertinent.7 In our case, the tagging was done by humans, one of many reasonable choices of classifier.

The second step, "Train," follows the tagging: the researcher feeds these tagged articles to the classifier while holding out some of the tagged articles. The classifier "learns," or establishes, a relationship between an article's attributes and its category. The researcher tests the classifier on the held-out tagged articles until she is satisfied with the learning process. Once the machine has learned the relationship between document attributes and document categories, it is ready to classify all the articles in the corpus in the final step, "Classify."
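The sketch below walks through the "Tag," "Train," and "Classify" steps using scikit-learn's bag-of-words features and a Naive Bayes classifier. The tiny hand-tagged sample and the choice of classifier are illustrative assumptions, not the human-tagged corpus or the method used in the papers discussed above.

```python
# A minimal sketch of the "Tag / Train / Classify" workflow described above.
# The tagged examples and the bag-of-words + Naive Bayes pipeline are
# illustrative stand-ins, not the actual data or classifier from the paper.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Step 1 ("Tag"): a human assigns categories to a small, representative sample.
tagged_articles = [
    ("Company beats earnings forecast and raises guidance", "good"),
    ("Regulator opens probe into accounting irregularities", "bad"),
    ("Firm announces dividend increase and share buyback", "good"),
    ("Profit warning issued after weak quarterly sales", "bad"),
]
texts, labels = zip(*tagged_articles)

# Step 2 ("Train"): fit on most tagged articles, holding some out for validation.
train_texts, held_out_texts, train_labels, held_out_labels = train_test_split(
    list(texts), list(labels), test_size=0.25, random_state=0
)
classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(train_texts, train_labels)
print("held-out accuracy:", classifier.score(held_out_texts, held_out_labels))

# Step 3 ("Classify"): apply the trained classifier to the full corpus.
corpus = ["Merger talks collapse amid financing concerns"]
print(classifier.predict(corpus))
```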