Wednesday, August 3, 2016

My Election Simulation for 2016

In 2008, I used a simple election simulation to identify battleground states and decide where to spend my time getting out the vote. Eight years later, there exist plenty of authoritative simulations (Sam Wang, FiveThirtyEight, The Upshot). All of them will eventually agree on where the battlegrounds are. Still, I like computer programming! I also like being able to run custom analyses that I can't get from the big players. So I'll be simulating the 2016 election in my own amateur way.

How the Simulation Works

1) For each state, I produce a probability distribution (d, r, l, g) telling how likely it is that the state's electoral votes will go Democratic, Republican, Libertarian, or Green.

2) I then simulate the election 100,000 times, using the probabilities in (1) to assign electoral votes for each state.

3) Using the ensemble of simulation outcomes, I can determine a couple of key quantities:

   a) The fraction of simulations in which each candidate wins. This is the headline win probability estimate. 

   b) The fraction of simulations in which state X goes for candidate Y when candidate Y wins the presidency.

I can use (b) to determine candidate priorities, which I define intuitively as "states that a candidate needs but doesn't yet have." In other words, if the simulations have state X going for candidate Y in a pretty high percentage of candidate Y's wins, yet currently the probability of candidate Y winning state X is pretty low, then I consider state X a priority for candidate Y. The battleground states are those that appear on both candidates' priority lists.

So far no states seem likely to go for third-party candidates, so quantity (b) and the ensuing priorities analysis are restricted to the D and R candidates.
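To make steps 1 through 3 concrete, here is a minimal sketch of the whole loop in Python. The three-state list and its probabilities are placeholders rather than my actual inputs, and the majority threshold is computed from the toy list rather than fixed at 270.

    import random

    # Placeholder inputs: (electoral votes, (d, r, l, g)) per entity.
    # The real run uses all 56 electoral-vote-granting entities and a
    # 270-vote threshold.
    STATES = {
        "New York":   (29, (0.75, 0.25, 0.0, 0.0)),
        "Illinois":   (20, (0.75, 0.25, 0.0, 0.0)),
        "Washington": (12, (0.75, 0.25, 0.0, 0.0)),
    }
    PARTIES = ("D", "R", "L", "G")
    N_SIMS = 100_000
    NEED = sum(ev for ev, _ in STATES.values()) // 2 + 1  # 270 in real life

    wins = {p: 0 for p in PARTIES}   # outright wins
    house = 0                        # nobody reaches a majority
    # carried_in_win[p][s]: simulations where p won AND carried state s
    carried_in_win = {p: {s: 0 for s in STATES} for p in PARTIES}

    for _ in range(N_SIMS):
        ev = {p: 0 for p in PARTIES}
        carried = {}
        for state, (votes, probs) in STATES.items():
            winner = random.choices(PARTIES, weights=probs)[0]
            ev[winner] += votes
            carried[state] = winner
        leader = max(PARTIES, key=ev.get)
        if ev[leader] >= NEED:
            wins[leader] += 1                           # quantity (a)
            for state, w in carried.items():
                if w == leader:
                    carried_in_win[leader][state] += 1  # quantity (b)
        else:
            house += 1  # election goes to the House

    for p in ("D", "R"):
        if wins[p]:
            print(p, "wins", wins[p] / N_SIMS, "of simulations")
            for s in STATES:
                print("  ", s, "participation:",
                      round(carried_in_win[p][s] / wins[p], 2))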

Results So Far

I've run the simulation four times since July 23. Here is today's output:

[Histogram: distribution of electoral-vote outcomes across today's 100,000 simulations]

For what it's worth, the D candidate wins in 72% of today's simulations, while the R candidate wins in 27% of today's simulations. The election goes to the House of Representatives in about 1% of today's simulations.

What I'm really after, however, is the battleground analysis, which will inform my own decision making about how to participate in the election come fall. Right now, the battlegrounds according to my simulation are:

Battleground (EV)   D Participation   D Win Prob.   R Participation   R Win Prob.
New York (29)       82%               0.75          44%               0.25
Illinois (20)       79%               0.75          37%               0.25
Washington (12)     78%               0.75          32%               0.25
(as of 2016-08-03)

Currently I'm arbitrarily defining a priority to be a state for which the participation rate exceeds the party's probability of winning the election as a whole, but for which the win probability is less than 1. The three states above show up on both parties' priority lists. (For example, New York is a battleground because 82% > 72%, 0.75 < 1, 44% > 27%, and 0.25 < 1.)
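In code, the rule is a one-liner; the numbers below are New York's figures from today's table.

    def is_priority(participation, state_win_prob, national_win_prob):
        """A state the party 'needs but doesn't yet have': it appears in
        more of the party's wins than the party's overall national win
        probability, yet the state itself is not a lock."""
        return participation > national_win_prob and state_win_prob < 1.0

    assert is_priority(0.82, 0.75, 0.72)  # New York is a priority for D
    assert is_priority(0.44, 0.25, 0.27)  # ...and for R, too
    # On both lists, so New York is a battleground.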

We've been hearing a lot about Pennsylvania and Ohio—why aren't they on my list? In my simulation, Ohio and Pennsylvania each participate in only 56% of D victories, so they aren't urgent enough to make it onto the D priority list according to the thresholds I set. My list is intentionally short, because I can't be everywhere; if on the other hand you're a well-funded political campaign with thousands of volunteers to deploy, then you probably do include Ohio and Pennsylvania on your list of must-haves.

Every time I run the simulation, I get a histogram like the one above. To show the history of the simulation, I can turn all of my extant histograms into frames for a movie. The problem is that because I don't run the simulation every day, or on any kind of regular schedule, the histograms aren't evenly spaced in time. To fix this, I use linear interpolation on pixel values to morph one histogram into the next, in such a way that the resulting movie has one frame per day. Here is what the movie looks like now, at the beginning of the season:

[Video: the histogram movie so far, one interpolated frame per day]
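For the curious, here is a minimal sketch of the morphing step, assuming two same-sized PNG histograms and using the numpy and imageio libraries for illustration; the file names are made up.

    import numpy as np
    import imageio

    def daily_frames(img_a, img_b, n_days):
        """Linearly interpolate pixel values from img_a to img_b,
        yielding one in-between frame per day."""
        a, b = img_a.astype(float), img_b.astype(float)
        return [((1 - k / n_days) * a + (k / n_days) * b).astype(np.uint8)
                for k in range(n_days + 1)]

    # Hypothetical file names: two runs eleven days apart.
    a = imageio.imread("hist_2016-07-23.png")
    b = imageio.imread("hist_2016-08-03.png")
    imageio.mimsave("simulation_movie.gif", daily_frames(a, b, 11), fps=4)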


Eventually I will write a blog post specifically to hold simulation results, and I'll probably update that post from time to time. So as the movie gets longer, that post is where you can go to watch it. 


Details (Important)

Turning state polling data (D%, R%, L%, G%) into state win probabilities (d, r, l, g)

Suppose you had a trustworthy poll telling you the percentage of likely voters planning to vote for candidates D, R, L, and G. How would you turn that information into win probabilities for the candidates? 

There are principled ways to do this, but using them would have required me to learn something before getting started, and in my fever to begin writing code I chose a less principled path. Here's what it looks like. Suppose for example that we have polling data (46.3%, 44.8%, 4%, 0.5%). What I do first is normalize the percentages so they sum to 100%. That gives (48.43%, 46.86%, 4.18%, 0.52%). The idea behind this is, first, that we ought indeed to allocate undecided voters in some way—the polls are supposed to be restricted to likely voters—and, second, that it seems reasonable to allocate undecided voters proportionally, according to what is known about the preferences of those who have made up their minds. (You could hold out hope that undecideds will break for your candidate, but what is the polling evidence that they will?)
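In code, the allocation is just a renormalization:

    poll = (46.3, 44.8, 4.0, 0.5)   # D%, R%, L%, G% among likely voters
    decided = sum(poll)             # 95.6; the remainder are undecided
    normalized = tuple(round(100 * p / decided, 2) for p in poll)
    print(normalized)               # (48.43, 46.86, 4.18, 0.52)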

Now, this poll we are talking about is based on a sample, typically around 1,000 voters. So it is an imperfect predictor of what will happen when the real "poll" happens at the polling place on election day. In the simulation, I allow that multiple outcomes on election day are possible. Specifically, I allow that the vote share for each candidate on election day might be anywhere from 6 percentage points less than the poll result to 6 percentage points greater. (Of course, I don't allow percentages to be less than zero or greater than 100!) I chose the number 6 to be two standard deviations, where one standard deviation is 3 percentage points, based on a polling sample of around 1,000. 

So in our example, the D candidate might end up with as little as 42.43% of the vote or as much as 54.43% of the vote. Of course, given the poll data we have, neither of these outcomes should be considered likely. If an outcome differs from the poll result by z percentage points, then I assign it a probability proportional to e^(−z²/18), again corresponding to a standard deviation of 3 percentage points (the exponent is just −z²/(2·3²)).

Working to a mesh resolution of 0.1, I generate a list of all possible four-tuples (D%, R%, L%, G%) consistent with the ±6 point realm of possibility. The probability of any one of these four-tuples is proportional to the product of the four single-candidate probabilities calculated using the exponential factor above. In this way, I produce a probability distribution over election-day outcomes for the state in question.

Now I choose one of the four-tuples to be the election day result (choosing randomly but according to the probability distribution), and simply look at the four-tuple to see which candidate got the highest vote share that day. I do that 10,000 times. The win probability for each candidate is defined as the fraction of the simulations in which that candidate wins.
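Here is a sketch of that state-level calculation. One shortcut: the probability of a mesh four-tuple, as I compute it, is a product of four independent per-candidate factors, so drawing each candidate's share separately from its own discretized bell curve is equivalent to drawing whole four-tuples from the product distribution. The sketch exploits that and never materializes the full mesh.

    import math
    import random

    def share_distribution(poll_pct, step=0.1, sd=3.0, width=6.0):
        """Possible election-day shares within +/- width points of the
        poll number (clipped to [0, 100]), on a mesh of resolution step,
        weighted by exp(-z^2 / (2 * sd^2))."""
        values, weights = [], []
        for k in range(-int(width / step), int(width / step) + 1):
            v = poll_pct + k * step
            if 0.0 <= v <= 100.0:
                z = v - poll_pct
                values.append(v)
                weights.append(math.exp(-z * z / (2.0 * sd * sd)))
        return values, weights

    def state_win_probs(poll, n_sims=10_000):
        """Monte Carlo estimate of the win probabilities (d, r, l, g)."""
        dists = [share_distribution(p) for p in poll]
        wins = [0, 0, 0, 0]
        for _ in range(n_sims):
            outcome = [random.choices(v, weights=w)[0] for v, w in dists]
            wins[outcome.index(max(outcome))] += 1
        return tuple(round(w / n_sims, 2) for w in wins)

    # Should land near the (0, 0.31, 0.69, 0) figures reported below.
    print(state_win_probs((0.0, 49.0, 51.0, 0.0)))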

Examples:
  • Given hypothetical polling data (25%, 25%, 25%, 25%), a recent debugging run produced candidate win probabilities (0.2512, 0.252, 0.2451, 0.2517). In several test runs using (25%, 25%, 25%, 25%) as an input, the output probabilities were always 0.25 when rounded to the nearest hundredth.
  • Given hypothetical polling data (0, 49, 51, 0), a recent debugging run produced candidate win probabilities (0, 0.31, 0.69, 0).
As you can see, having a 51-49 lead in a trustworthy poll is a good position to be in. It gives you roughly 2-to-1 odds of winning.

(By now I probably could have learned enough to produce these numbers in a principled fashion, but oh well.)

It takes 1–2 minutes for my laptop to turn a single state poll (D%, R%, L%, G%) into a set of win probabilities (d, r, l, g). I have to do it for all 56 electoral-vote-granting entities, which takes about 2 hours. In case you are wondering, the 56 electoral-vote-granting entities are the 48 winner-take-all states, the District of Columbia, Maine-At-Large, Maine Congressional District 1, Maine Congressional District 2, Nebraska-At-Large, Nebraska Congressional District 1, Nebraska Congressional District 2, and Nebraska Congressional District 3.

Once I have the state-by-state probabilities, simulating the national election 100,000 times is fast. It wouldn't be hard to avoid the simulation approach by using the polynomial method, but the very elegance of that method means that it suppresses individual state scenarios, which I'm interested in. (The histograms are also fun to look at.)
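For reference, the polynomial method is just generating functions: each entity contributes a factor ((1 − p) + p·x^EV), and the coefficients of the product give the exact distribution of a candidate's electoral-vote total. A toy sketch, with made-up inputs:

    import numpy as np

    def ev_distribution(entities):
        """Coefficients of prod((1 - p) + p * x**ev): entry k is the
        probability that the candidate wins exactly k electoral votes."""
        dist = np.array([1.0])   # before any entity: 0 EV with certainty
        for ev, p in entities:
            factor = np.zeros(ev + 1)
            factor[0], factor[ev] = 1.0 - p, p
            dist = np.convolve(dist, factor)
        return dist

    dist = ev_distribution([(29, 0.75), (20, 0.75), (12, 0.75)])
    print(dist.sum())       # 1.0
    print(dist[61])         # P(sweep) = 0.75**3 = 0.421875
    print(dist[29:].sum())  # P(29 or more electoral votes)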

I should say that I'm accounting for the L and G candidates not because they have an appreciable chance of becoming president, but because they might conceivably affect state-level results. I think there are some states where the L candidate could attract a significant share of the vote. As for G, well, if there's a Nadering on the way it would be helpful to see it coming. In other words, I am including L and G not to estimate their win probabilities on either the state or national level, but in order to be able to model their indirect effect on the D and R probabilities at the state and hence national level. One of the reasons I built my simulator (apart from liking to code) is that I'm not sure I have faith yet in the big players' approach to modeling the effects of third-party candidates this cycle. 


What if you don't have state polling data?

Right now, polling data for each state isn't conveniently available. I don't know where to look for a trustworthy polling breakdown (D%, R%, L%, G%) for each state. Many of the polls you see reported in the news don't include L or G. And there's a whole set of decisions to make about which polls to pay attention to, based on how recent they are, who conducted them, how those pollsters performed in predicting past elections, and so on.

If I can get better at consuming the information available online, I will start using poll data. In the meantime, I have a workaround. It's similar to the approach I used in my Senate race simulation. Here I look up the RealClearPolitics rating for each state and simply replace the rating with a probability for that state. Specifically,
  • "Solid" for a candidate means I assign win probability 0.95 for that candidate.
  • "Likely" for a candidate means I assign win probability 0.75 for that candidate.
  • "Leans" for a candidate means I assign win probability 0.60 for that candidate.
  • "Toss Up" means 50/50 chance for D and for R.
(There are no third parties in the workaround.) I chose the numbers 0.60, 0.75, and 0.95 more or less arbitrarily. I didn't want there to be any certain outcomes, so I made "Solid" 0.95 instead of 1. I didn't want "Leans" to lean too far, so I made it 0.6 instead of something like 0.67. (There didn't seem much point in making the "Leans" probability less than 0.6, since it seemed that if the probability were in the fifties, RCP would just call it a toss-up.) As for the 0.75 assignment in the case of "Likely," one could argue this number should be as low as 0.67 or as high as around 0.85. I went sort of middle of the road. I haven't done a sensitivity analysis on these numbers.
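As a lookup table (rating labels paraphrased from RCP's categories):

    # Fiat mapping from RealClearPolitics ratings to (d, r) probabilities;
    # no third parties in the workaround.
    RCP_RATING_TO_PROBS = {
        "Solid D":  (0.95, 0.05),
        "Likely D": (0.75, 0.25),
        "Leans D":  (0.60, 0.40),
        "Toss Up":  (0.50, 0.50),
        "Leans R":  (0.40, 0.60),
        "Likely R": (0.25, 0.75),
        "Solid R":  (0.05, 0.95),
    }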

At any rate, with state-by-state probabilities in hand, we proceed as described in the original method, simulating the national election 100,000 times to determine priorities and battleground states. This is the method I'm currently using, and it is the method that led to the results shown above (note for example the probabilities 0.75 and 1 − 0.75 = 0.25 in the table of battleground states, which are fiat values as noted in the bullets immediately above).
