How to lie with polls

Referenda – love them or hate them, they are a mark of modern democracies. In the next few weeks, the United Kingdom will vote on whether to leave or remain in the European Union. It’s a historic vote, with significant repercussions not just for the UK, but also for the future of the European project.

Of course, when the stakes are high, the prognosticators get called in. Much has been made of the very tight race in the polls, with newspapers often lauding the results of the latest poll as the final say in the debate.

But any observer with a passing interest in statistics will know this is a misguided conclusion. Just because a poll is more recent doesn’t mean it’s more accurate. Polls vary in their quality and representativeness, and each is subject to natural sampling variation, which is why seasoned observers follow the trend across a population of polls instead.

As a quantitatively-minded person, I take particular offence at the way polls are being displayed in summary fashion, particularly this heinous graphic at the Telegraph:

Polling, as it turns out, is a fantastic example of how easy it is to misguide, misdirect or downright lie with statistics. So let us look at the different messages we can glean from the EU referendum polls. As our source, we will take every leave/remain poll result from the comprehensive tracker made available by the Financial Times, going back to 2010.
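If you want to follow along, here is a minimal sketch of how the data could be loaded in Python. The file name and the column names (date, pollster, remain, leave, undecided) are stand-ins of my own for whatever export of the FT tracker you have to hand, not the FT’s actual schema.

```python
import pandas as pd

# Hypothetical CSV export of the FT poll tracker; the file and column names
# are assumptions made for illustration.
polls = pd.read_csv("ft_eu_polls.csv", parse_dates=["date"])
polls = polls.sort_values("date")                 # oldest poll first
polls = polls[polls["date"] >= "2010-01-01"]      # keep the 2010 onwards window

print(polls[["date", "pollster", "remain", "leave", "undecided"]].tail())
```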

First, we want to try and go beyond just parroting what the latest poll tells us. Using the most basic summary metric, the arithmetic mean, we can get an idea of what a population of polls tells us about public opinion over a six-year span. We can see the remain camp leads across polls by 2 percentage points. Good news for Europhiles.
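As a sketch, assuming the `polls` DataFrame loaded above, the calculation is nothing more than a column average:

```python
# Unweighted average of each camp's share across every poll in the sample.
mean_remain = polls["remain"].mean()
mean_leave = polls["leave"].mean()

print(f"Remain {mean_remain:.1f}%, Leave {mean_leave:.1f}%, "
      f"lead {mean_remain - mean_leave:+.1f} percentage points")
```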

However, this is a misleading conclusion. Firstly, we are ignoring how much opinion varies from one poll to the next – this could be due to the biases of particular polling companies, to the method used, or simply to noise in the sampling process. Secondly, it’s opinion now, not four years ago, that decides a referendum, so we may want to apply a weighting scheme where older polls count less than recent ones. If we include such a weighting, and add the standard deviation of opinion across polls, the story becomes a bit more muddled – neither camp has a clear advantage above the variance in the data.
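One way to implement such a weighting is sketched below; the exponential decay and the one-year half-life are arbitrary choices of mine, not the scheme behind the article’s figures.

```python
import numpy as np

# Recency weighting: a poll's weight halves for every year of age.
age_days = (polls["date"].max() - polls["date"]).dt.days
weights = 0.5 ** (age_days / 365.0)

weighted_remain = np.average(polls["remain"], weights=weights)
weighted_leave = np.average(polls["leave"], weights=weights)

# Spread of opinion across polls, as a yardstick for any apparent lead.
spread_remain = polls["remain"].std()
spread_leave = polls["leave"].std()

print(f"Remain {weighted_remain:.1f}% (sd {spread_remain:.1f}), "
      f"Leave {weighted_leave:.1f}% (sd {spread_leave:.1f})")
```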


Public opinion, of course, changes with time. A traditional way of displaying poll results is to show answers to the same polling question across an arbitrary timespan. This has the advantage of revealing any strong trends over time, but it also leads us to over-emphasise the most recent results, whether they reflect a genuine shift in the will of the people or merely a spurious trend. With this approach, we can tell a very different story – the leave camp appears to lead in the most recent polls.
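A rolling average is the usual way to draw such a trend without letting any single poll dominate; the ten-poll window below is an arbitrary choice of mine.

```python
import matplotlib.pyplot as plt

# Smooth each camp's share with a rolling mean over the last ten polls.
trend = polls.set_index("date")[["remain", "leave"]].rolling(10).mean()

ax = trend.plot(title="EU referendum polling, 10-poll rolling mean")
ax.set_ylabel("% of respondents")
plt.show()
```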

Taking a different approach, we can also look at polling as an additive exercise. If we take the difference between leave and remain responses in each poll, it gives us a net plus-or-minus indicator of public opinion. We can then plot these differences as a cumulative sum over time, to estimate whether a given camp gains ‘momentum’ over a sustained period of time (this approach has garnered significant favour amongst news outlets covering the US presidential primaries):
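The ‘momentum’ curve is just a running total of the per-poll margin, something along these lines:

```python
import matplotlib.pyplot as plt

# Net margin at each poll, accumulated over time: a rising curve means
# leave has been out-polling remain over the recent run of polls.
polls["net_leave"] = polls["leave"] - polls["remain"]
momentum = polls.set_index("date")["net_leave"].cumsum()

ax = momentum.plot(title="Cumulative leave-minus-remain margin")
ax.set_ylabel("Cumulative margin (percentage points)")
plt.show()
```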

In this case the leave camp not only comes out on top in recent polls, but is also shown as having gained considerable momentum over the past few months – usually taken as indicative of a step change in popular opinion. A very different story from our original poll average.

The problem with these past approaches is that they fail to encapsulate uncertainty in a meaningful way. We can take yet another treatment of the data, ask what our population of polls shows across all samples, and fit a mathematical model that allows us to describe uncertainty.
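A simple version of such a model, sketched under the same assumptions as before, fits a normal distribution to each camp’s polled share:

```python
from scipy import stats

# Fit a Gaussian to each camp's share across all polls in the sample.
mu_remain, sd_remain = stats.norm.fit(polls["remain"])
mu_leave, sd_leave = stats.norm.fit(polls["leave"])

for camp, mu, sd in [("Remain", mu_remain, sd_remain), ("Leave", mu_leave, sd_leave)]:
    lo, hi = stats.norm.interval(0.95, loc=mu, scale=sd)
    print(f"{camp}: {mu:.1f}%, 95% range {lo:.1f}-{hi:.1f}%")
```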

Here, we see a histogram of individual poll outcomes and a simple Gaussian model of the responses, across six years of polling. While there is significant spread in responses, overall the remain camp has an advantage, but it still sits within the confidence zone of the leave camp; in other words, it’s pretty close. But what this, and most polling trackers, often fail to acknowledge is the large number of undecided voters who could swing the referendum either way. On average, 17% of those polled were undecided, with 12% still undecided in the last month. If we include the uncertainty of undecided voters in our simple model, we see a vast widening of our confidence margin:
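One crude way to fold the undecideds into the model above (my own assumption about how to do it, not necessarily the treatment behind the article’s chart) is to let each camp’s final share pick up anywhere between none and all of the undecided vote, and add that extra spread to the Gaussian’s variance:

```python
import numpy as np
from scipy import stats

# mu_remain, sd_remain, mu_leave, sd_leave come from the previous snippet.
# Average share of undecided respondents across the sample.
undecided = polls["undecided"].mean()

# If undecideds split anywhere between 0% and 100% towards a camp (a uniform
# assumption), the extra mean is undecided/2 and the extra sd is undecided/sqrt(12).
extra_mu = undecided / 2
extra_sd = undecided / np.sqrt(12)

for camp, mu, sd in [("Remain", mu_remain, sd_remain), ("Leave", mu_leave, sd_leave)]:
    wide_sd = np.sqrt(sd**2 + extra_sd**2)
    lo, hi = stats.norm.interval(0.95, loc=mu + extra_mu, scale=wide_sd)
    print(f"{camp}: widened 95% range {lo:.1f}-{hi:.1f}%")
```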

And no significant advantage to either the leave or remain camps. What this exercise demonstrates is that data literacy goes beyond being sceptical of statistics in the news. Interpretation is not just dependent on knowing what you are being shown, but also on understanding that different data-crunching approaches will favour different interpretations. We should be mindful not just of what data is shown, but of how it is presented – data visualisation plays a large role in guiding us towards an interpretation.

—————————————————————————————————

Poll trackers have been made available by the BBC, Economist, Telegraph and Financial Times.

Analysing outcome likelihoods in the real world is a risky business. But if all else fails, you can always rely on the one interest group with a consistent stake in accurate outcome prediction – betting companies. OddsChecker currently has best odds for a leave vote at 11/4 (a 27% implied probability) and a remain vote at 1/3 (75%). Make of that what you will.
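For reference, converting fractional odds into the implied probabilities quoted above is a one-liner: odds of a/b mean staking b to win a, so the implied probability is b/(a+b).

```python
def implied_probability(a: int, b: int) -> float:
    """Implied probability of fractional odds a/b (stake b to win a)."""
    return b / (a + b)

print(f"Leave at 11/4 -> {implied_probability(11, 4):.0%}")   # about 27%
print(f"Remain at 1/3 -> {implied_probability(1, 3):.0%}")    # 75%
```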

This article was originally published on 10 June 2016.