Saturday, November 12, 2016

"The Devil in the Polling Data" (correlated error)

Our boilerplate intro to Quanta Magazine from a 2014 post:
The Simons Foundation is Jim Simons' baby. Originally conceived to fund research in mathematics (Mr. Simons' specialty), the foundation has since expanded its areas of interest to all of basic science.
They put out a magazine called Quanta which we last visited in "Data Driven: The New Big Science: Chapter 4 Biology in the Era of Big Data".

Mr. Simons was the premier global macro quant with his Renaissance Technologies Medallion Fund averaging 35% per annum (after fees, which are a post unto themselves) since 1989.
Mr. Simons retired in 2009.
In 2013 RenTech was up 18%, trailing the S&P by 12 points but stomping on the average hedge fund's 7.4% return.
All that being said, it is actually Marilyn Simons who is the motive force. 
I see I have to update the performance numbers.

One interesting point about the recent election is that while Mr. and Mrs. Simons' politics skew left (99.99% this cycle), the politics of the firm's co-CEO Robert Mercer, and of his daughter, skew right (100%); in this election cycle the two families were the fifth and ninth largest contributors nationally, at $23,539,900 (Mercer) and $19,734,650 (Simons).

From Quanta Magazine:

The same problem that caused the 2007 financial crisis also tripped up the polling data ahead of this year’s presidential election.
Illustration by Lucy Reading-Ikkanda for Quanta Magazine
The devil in the data that left election forecasters with egg on their face this week has a familiar name — it’s the same villain that tripped up the banks that financed subprime mortgages back in 2007, causing the financial crisis. Its name is “correlated error.”

Prediction models can make very accurate forecasts based on many not-so-accurate data points, but they depend on a crucial assumption — that the data points are all independent. In election forecasting, the data points are polls, which are clearly imperfect. Every individual poll has a relatively large margin of error amounting to several percentage points, sometimes favoring one candidate, sometimes the other, all skewed by hundreds of small things — the specific respondents chosen, the means of contact, the phrasing of questions, the representation of voter demographics and so on. These errors can be magically smoothed out by poll aggregation, giving a much more accurate mean polling number — provided the errors in individual polls were all due to different causes, and were therefore independent and uncorrelated. We saw this magic in the accurate predictions made by forecasters like Nate Silver and Sam Wang in the 2012 elections.
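The "magic" of aggregation described above is just the statistics of averaging: if each poll's error is independent, the standard error of the mean shrinks like 1/√n. A minimal simulation sketch (all numbers here are illustrative assumptions, not real polling figures):

```python
import numpy as np

rng = np.random.default_rng(0)
true_margin = 2.0            # assumed true margin, percentage points
n_polls, n_trials = 20, 10_000
sigma = 3.0                  # assumed per-poll error of several points

# Each trial averages 20 polls whose errors are drawn independently.
polls = true_margin + rng.normal(0.0, sigma, size=(n_trials, n_polls))
avg_error = polls.mean(axis=1) - true_margin

print(f"single-poll error SD : {sigma:.2f}")
print(f"aggregate error SD   : {avg_error.std():.2f}")  # ~ sigma/sqrt(20) ≈ 0.67
```

With 20 independent polls, an aggregate that is four to five times more accurate than any single poll falls straight out of the arithmetic, which is why the 2012 forecasts looked so good.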

But this year we saw something different: almost all the swing state polls overstated Clinton's numbers by two to six percentage points. This error is called "systematic" or "correlated error." Since it affected most or all polls, it was probably caused by some common disrupting factor or factors outside the well-established and hitherto reliable poll methodology itself. It was this correlated error that completely threw off the prediction models. Likewise, leading up to the 2007 crisis, financial institutions misjudged the probability of massive subprime loan defaults because they failed to realize that the chances of individual defaults were correlated, not independent.
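The failure mode above can be sketched by adding a shared error term that hits every poll in a trial identically: averaging shrinks the independent component but leaves the shared component untouched. Again, the error magnitudes below are illustrative assumptions, not estimates of the actual 2016 polling error:

```python
import numpy as np

rng = np.random.default_rng(1)
true_margin, n_polls, n_trials = 2.0, 20, 10_000
indep_sd, shared_sd = 3.0, 2.0   # assumed split of per-poll error

# The shared (correlated) component is drawn once per trial and
# applied to all 20 polls; the independent component varies by poll.
shared = rng.normal(0.0, shared_sd, size=(n_trials, 1))
indep = rng.normal(0.0, indep_sd, size=(n_trials, n_polls))
avg_error = (shared + indep).mean(axis=1)

# Aggregate error SD ≈ sqrt(shared_sd**2 + indep_sd**2 / n_polls),
# i.e. it is floored at shared_sd no matter how many polls you add.
print(f"aggregate error SD: {avg_error.std():.2f}")
```

No amount of aggregation averages away the shared term, which is the sense in which a two-to-six-point correlated error can defeat a model built on dozens of individually noisy polls.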

What could have caused this correlated error to skew all the polls in 2016? That's a subject pollsters are trying hard to research and pinpoint right now, and it will take several months for their findings to be released. But here are two speculative causes that could explain it. I have to give credit for these to Michael Moore, who back in July wrote an amazingly prescient article predicting exactly how Trump would win, in excruciating detail. Both of these causes are ultimately related to the well-documented enthusiasm gap in this election, just as there was in the Brexit vote, where a similarly large polling error took place.

1) Emotional voters: All of us are familiar with the situation where our minds incline one way and our hearts tug another. Answering a poll is a boring intellectual exercise, while casting a ballot in the solitude of a voting booth is an empowering, emotional one. It is easy to imagine somewhat conflicted voters who answered "Clinton" to a pollster but in a fit of emotion cast their vote for Trump. If a small but consistent proportion of Trump voters acted this way, it would have affected all polls and given them all the same correlated error.

2) Depressed voters: Most pollsters try to determine how likely a respondent is to vote and factor that into their final numbers. If there were sizable numbers of Clinton voters who told pollsters that they fully intended to vote but on election day did not find the will or enthusiasm to actually go cast a ballot, that could also explain some of the correlated error. As Moore put it back in August, "If people could vote from their sofa via their Xbox or remote control, Hillary would win in a landslide."

Other factors like the inability to contact rural voters have been proposed, but it seems to me that good pollsters should have been able to overcome those kinds of problems.

So even the best of the pollsters have a lot to learn. How about the modelers?

I think modelers need to make some changes too....