Wednesday, August 10, 2016

The Rectangle Area Formula

In the rectangle area formula A = × W,
  • A is the number of unit squares needed to tile the rectangle
  • L is the number of length units when you measure one side of the rectangle
  • W is the number of length units when you measure an adjacent side of the rectangle.

The reason the formula works has to do with what the × symbol means.

Today let's confine our analysis of the formula to whole numbers, because I'm thinking about the formula today from the perspective of a young student who hasn't yet absorbed fractional quantities or fraction operations. For whole numbers, the simple interpretation of m × n is that it stands for the number of things in m groups of n things each.

In the context of a rectangle with, say, length 7 units and width 5 units, what are the groups, and what are the things?

If one side of the rectangle consists of 7 length units, then we can draw lines to divide the rectangle equally into 7 strips. There are as many strips in the rectangle as there are length units in the side.

We might have done the same thing on either of the two adjacent sides, which would have resulted in the number of strips being 5.

If we draw both sets of lines, then each of the strips gets divided into squares (question: why perfect squares?). For example, the 7 strips would turn into 7 groups of 5 squares each. Remembering what × means, the total number of squares must then equal 7 × 5.

This argument is pretty obviously sound for any rectangle whatsoever with whole-number side lengths, so we can say that for any such rectangle,
The number of unit squares needed to tile a rectangle equals the number of length units that fit along one side multiplied by the number of length units that fit along an adjacent side.
This is what is meant by "Area equals length times width."

Some drawbacks I have seen in the way some math textbooks teach A = L × W:

  • Doing a poor job establishing what m × n means altogether. Then the book has no chance of doing a proper job with area, or with anything else that relies on multiplication.
  • Doing a poor job with basic measurement concepts. Then it isn't clear what a length measurement means, or what an area measurement means. You can't relate two ideas you don't grasp individually.
  • Being unclear about the difference between area and length. An example of this would be blurring the distinction between iterations of length units along the boundary and iterations of area units just inside the boundary. One book says, "The number of squares in one row is the same as the length in unit squares of the rectangle." What is an object's "length in unit squares"? Unit squares are area units, not length units. A correct way to say it would be, "The number of squares in one row is the same as the number of length units in the edge of the rectangle." Maybe it sounds convoluted, but in mathematics, words mean things; you have to combine the words into statements that are true, or at least not hostile to the distinction at hand.
  • Taking too long to orchestrate the reasoning involved.

Sunday, August 7, 2016

Points In A Circle Problem: A Wild Guess for N = 3

With reference to the optimization problems in my last post, I found myself most interested by the N = 3 case:
Place 3 points in the interior of a unit disk so that the largest triangle not containing any of the points has least possible area.
I should actually clean up my language here, because the maximum area could be attained by two or more different triangles. It ought to say, "a largest triangle," not "the largest triangle." Or perhaps I could say:
Place 3 points in the interior of a unit disk so that the largest area among all triangles not containing any of the points is as small as possible.

In today's post, I thought I'd share my first wild guess at a solution to the N = 3 problem:

The dots are each located about 0.321 units from the center of the disk, and the configuration has 120-degree rotational symmetry.

Given my opening move, what is player two's best response? What is the largest area of any triangle that doesn't contain any of my points within its interior? Feel free to email me a sketch at

Remember, it's OK if the boundary of your triangle touches one or more of the points, and it's OK if one or more of your triangle's vertices lies on the boundary of the disk.

So far, the greatest area I've been able to enclose with a triangle that doesn't contain the above three points is about 0.829 square units. But I haven't searched systematically or at length, so I'm far from certain that this is the greatest area possible. If any reader sketch appears to exceed that value, I'll let you know.

One more reflection on the problem: is it well posed? There are infinitely many triangles not containing the three given points, so one must consider the possibility that among them no greatest area exists. I guess I doubt that happens here—basically, because we allow touching of boundaries—but I haven't proven it.

Saturday, August 6, 2016

An Optimization Problem

Place a point in the interior of a unit disk so that the largest triangle not containing that point has least possible area. 
To describe the problem another way, suppose you and a friend are playing a game. You begin by drawing a dot somewhere inside the disk. Your friend then tries to draw the largest possible triangle (in terms of area) that doesn't contain the dot you drew. No point of your friend's triangle may lie outside the disk, but it's fine if one or more vertices of the triangle touch the boundary of the disk. It's also OK if the boundary of the triangle touches the dot—the dot just can't lie within the triangle's interior. Your objective is to place your dot in such a way that your friend's best triangle will be as small as possible.

Looked at from a distance, the goal here is to minimize a maximum quantity. Problems like that are common in many fields, including game theory (where the goal is to give your opponent only bad options from which to choose), investments, engineering, and quantum mechanics (in my research, I once used a min-max technique to shed a little light on the Heisenberg uncertainty principle).

Getting back to the problem at hand, I think I have the answer, but I haven't made sure of it.

You might also try some generalizations—these I haven't even begun to think about:

  • Place 2 points in the interior of a unit disk so that the largest triangle not containing either of the points has least possible area.
  • Place 3 points in the interior of a unit disk so that the largest triangle not containing any of the points has least possible area.
  • Place N points in the interior of a unit disk so that the largest triangle not containing any of the points has least possible area.
  • Place N points in the interior of a [choose shape] so that the largest [choose shape] not containing any of the points has least possible [length, area, volume].
  • Place N points randomly in the interior of a unit disk. What is the expected value of the area of the largest triangle not containing any of the points?

Feel free to share your results and partial progress on any of the problems. If the problems are standard among researchers (as they may well be), feel free to post some citations as well.


Friday, August 5, 2016

Current Simulation Results

UPDATE October 25, 2016

This page will always contain my most recent simulation results. A shortcut to this page is available at Methodology for the simulation is here.

Note: The final list of battleground states for the 2016 presidential election is at Everything below is left over from the October 17, 2016 update. I won't be updating this page again.

I did test my process for converting poll data into probability; it agreed fairly closely with both PEC and FiveThirtyEight, and aggregating the recent state-by-state poll data was a lot of work. So from now on, I will just piggyback on the poll aggregation labors of others. (I'm generally using PEC probabilities because those are stored online in a .csv file that I can extract with one keystroke; using FiveThirtyEight probabilities would require me to click through dozens of pages and type the data by hand.)

Now that the state-by-state probabilities are being derived from poll data, I have started including a table of battleground states. These are selected on the basis of high need and low probability, for both parties simultaneously.

Most recent distribution of outcomes

Most recent battleground list

Probabilities from PEC

Battleground (EV) D Participation D Win Prob. R Participation R Win Prob.
Florida (29) 87.3% 0.87 96.7% 0.13
Ohio (18) 81% 0.81 82.6% 0.19
Wisconsin (10) 87.1% 0.87 46.7% 0.13
Nevada (6) 81.1% 0.81 43.5% 0.19
Mon 17 Oct 2016

Detailed state-by-state results

Probabilities from PEC

State (EV) D Participation D Win Prob. R Participation R Win Prob.
Alabama (9) 0% 0 100% 1
Alaska (3) 12.9% 0.13 85.9% 0.87
Arizona (11) 4% 0.04 100% 0.96
Arkansas (6) 0% 0 100% 1
California (55) 100% 1 0% 0
Colorado (9) 92.2% 0.92 35.9% 0.08
Connecticut (7) 100% 1 0% 0
District Of Columbia (3) 100% 1 0% 0
Delaware (3) 100% 1 0% 0
Florida (29) 87.3% 0.87 96.7% 0.13
Georgia (16) 13.1% 0.13 100% 0.87
Hawaii (4) 100% 1 0% 0
Idaho (4) 0% 0 100% 1
Illinois (20) 100% 1 0% 0
Indiana (11) 0% 0 100% 1
Iowa (6) 27.9% 0.28 90.2% 0.72
Kansas (6) 0% 0 100% 1
Kentucky (8) 0% 0 100% 1
Louisiana (8) 0% 0 100% 1
Maine (2) 100% 1 0% 0
MaineCD1 (1) 100% 1 0% 0
MaineCD2 (1) 100% 1 0% 0
Maryland (10) 100% 1 0% 0
Massachusetts (11) 100% 1 0% 0
Michigan (16) 99% 0.99 19.6% 0.01
Minnesota (10) 96.1% 0.96 16.3% 0.04
Mississippi (6) 0% 0 100% 1
Missouri (10) 4.1% 0.04 100% 0.96
Montana (3) 0% 0 100% 1
Nebraska (2) 0% 0 100% 1
NebraskaCD1 (1) 0% 0 100% 1
NebraskaCD2 (1) 0% 0 100% 1
NebraskaCD3 (1) 0% 0 100% 1
Nevada (6) 81.1% 0.81 43.5% 0.19
New Hampshire (4) 92.1% 0.92 18.5% 0.08
New Jersey (14) 100% 1 0% 0
New Mexico (5) 100% 1 0% 0
New York (29) 100% 1 0% 0
North Carolina (15) 61.2% 0.61 82.6% 0.39
North Dakota (3) 0% 0 100% 1
Ohio (18) 81% 0.81 82.6% 0.19
Oklahoma (7) 0% 0 100% 1
Oregon (7) 100% 1 0% 0
Pennsylvania (20) 98% 0.98 40.2% 0.02
Rhode Island (4) 100% 1 0% 0
South Carolina (9) 13.2% 0.13 96.7% 0.87
South Dakota (3) 0% 0 100% 1
Tennessee (11) 0% 0 100% 1
Texas (38) 0% 0 100% 1
Utah (6) 0% 0 100% 1
Vermont (3) 100% 1 0% 0
Virginia (13) 100% 1 0% 0
Washington (12) 100% 1 0% 0
West Virginia (5) 0% 0 100% 1
Wisconsin (10) 87.1% 0.87 46.7% 0.13
Wyoming (3) 0% 0 100% 1
Mon 17 Oct 2016

Cumulative Animation of Simulation Results

Wednesday, August 3, 2016

My Election Simulation for 2016

In 2008, I used a simple election simulation to identify battleground states and decide where to spend my time getting out the vote. Eight years later, there exist plenty of authoritative simulations (Sam Wang, FiveThirtyEight, The Upshot). All of them will eventually agree on where the battlegrounds are. Still, I like computer programming! I also like being able to run custom analyses that I can't get from the big players. So I'll be simulating the 2016 election in my own amateur way.

How the Simulation Works

1) For each state, I produce a probability distribution (d, r, l, g) telling how likely it is that the state's electoral votes will go Democratic, Republican, Libertarian, or Green.

2) I then simulate the election 100,000 times, using the probabilities in (1) to assign electoral votes for each state.

3) Using the ensemble of simulation outcomes, I can determine a couple of key quantities:

   a) The fraction of simulations in which each candidate wins. This is the headline win probability estimate. 

   b) The fraction of simulations in which state X goes for candidate Y when candidate Y wins the presidency.

I can use (b) to determine candidate priorities, which I define intuitively as "states that a candidate needs but doesn't yet have." In other words, if the simulations have state X going for candidate Y in a pretty high percentage of candidate Y's wins, yet currently the probability of candidate Y winning state X is pretty low, then I consider state X a priority for candidate Y. The battleground states are those that appear on both candidates' list of priorities. 

So far no states seem likely to go for third-party candidates, so quantity (b) and the ensuing priorities analysis is restricted to the D and R candidates. 

Results So Far

I've run the simulation four times since July 23. Here is today's output:

For what it's worth, the D candidate wins in 72% of today's simulations, while the R candidate wins in 27% of today's simulations. The election goes to the House of Representatives in about 1% of today's simulations.

What I'm really after however is the battleground analysis, which will inform my own decision making about how to participate in the election come Fall. Right now, the battlegrounds according to my simulation are:

Battleground (EV) D Participation D Win Prob. R Participation R Win Prob.
New York (29) 82% 0.75 44% 0.25
Illinois (20) 79% 0.75 37% 0.25
Washington (12) 78% 0.75 32% 0.25

Currently I'm arbitrarily defining a priority to be a state for which the participation rate exceeds the party's probability of winning the election as a whole, but for which the win probability is less than 1. The three states above show up on both party's priority lists. (For example, New York is a battleground because 82% > 72%, 0.75 < 1, 44% > 27%, and 0.25 < 1.)

We've been hearing a lot about Pennsylvania and Ohio—why aren't they on my list? In my simulation, Ohio and Pennsylvania both participate in only 56% of D victories, so they aren't urgent enough to make it onto the D priority list according to the threshholds I set. My list is intentionally short, because I can't be everywhere; if on the other hand you're a well funded political campaign with thousands of volunteers to deploy, then you probably do include Ohio and Pennsylvania on your list of must-haves.

Every time I run the simulation, I get a histogram like the one above. To show the history of the simulation, I can turn all of my extant histograms into frames for a movie. The problem is that because I don't run the simulation every day, or on any kind of regular schedule, the histograms aren't evenly spaced in time. To fix this, I use linear interpolation on pixel values to morph one histogram into the next, in such a way that the resulting movie has one frame per day. Here is what the movie looks like now, at the beginning of the season:

Eventually I will write a blog post specifically to hold simulation results, and I'll probably update that post from time to time. So as the movie gets longer, that post is where you can go to watch it. 

Details (Important)

Turning state polling data (D%, R%, L%, G%) into state win probabilities (d, r, l, g)

Suppose you had a trustworthy poll telling you the percentage of likely voters planning to vote for candidates D, R, L, and G. How would you turn that information into win probabilities for the candidates? 

There are principled ways to do this, but using them would have required me to learn something before getting started, and in my fever to begin writing code I chose a less principled path. Here's what it looks like. Suppose for example that we have polling data (46.3%, 44.8%, 4%, 0.5%). What I do first is normalize the percentages so they sum to 100%. That gives (48.43%, 46.86% 4.18%, 0.52%). The idea behind this is, first, that we ought indeed to allocate undecided voters in some way—the polls are supposed to be restricted to likely voters—and, second, it seems reasonable to allocate undecided voters proportionally according to what is known about preferences among those who have made up their minds. (You could hold out hope that undecideds will break for your candidate, but what is the polling evidence that it will happen?)

Now, this poll we are talking about is based on a sample, typically around 1,000 voters. So it is an imperfect predictor of what will happen when the real "poll" happens at the polling place on election day. In the simulation, I allow that multiple outcomes on election day are possible. Specifically, I allow that the vote share for each candidate on election day might be anywhere from 6 percentage points less than the poll result to 6 percentage points greater. (Of course, I don't allow percentages to be less than zero or greater than 100!) I chose the number 6 to be two standard deviations, where one standard deviation is 3 percentage points, based on a polling sample of around 1,000. 

So in our example, the D candidate might end up with as little as 42.43% of the vote or as much as 54.43% of the vote. Of course, given the poll data we have, neither of these outcomes should be considered likely. If an outcome differs from the poll result by z%, then I assign it a probability proportional to ez2/18, again corresponding to a standard deviation of 3 percentage points. 

Working to a mesh resolution of 0.1, I generate a list of all possible four-tuples (D%, R%, L%, G%) consistent with the ±6 point realm of possibility. The probability of any one of these four-tuples is proportional to the product of the four single-candidate probabilities calculated using the exponential factor above. In this way, I produce a probability distribution over election-day outcomes for the state in question.

Now I choose one of the four-tuples to be the election day result (choosing randomly but according to the probability distribution), and simply look at the four-tuple to see which candidate got the highest vote share that day. I do that 10,000 times. The win probability for each candidate is defined as the fraction of the simulations in which that candidate wins.

  • Given hypothetical polling data (25%, 25%, 25%, 25%), a recent debugging run produced candidate win probabilities (0.2512, 0.252, 0.2451, 0.2517). In several test runs using (25%, 25%, 25%, 25%) as an input, the output probabilities were always 0.25 when rounded to the nearest hundredth.
  • Given hypothetical polling data (0, 49, 51, 0), a recent debugging run produced candidate win probabilities (0, 0.31, 0.69, 0).
As you can see, having 51-49 lead in a trustworthy poll is a good position to be in. It gives you 2-to-1 odds of winning.

(By now I probably could have learned enough to produce these numbers in a principled fashion, but oh well.)

It takes 1–2 minutes for my laptop to turn a single state poll (D%, R%, L%, G%) into a set of win probabilities (d, r, l, g). I have to do it for all 56 electoral-vote-granting entities, which takes about 2 hours. In case you are wondering, the 56 electoral-vote-granting entities are the 48 states, the District of Columbia, Maine-At-Large, Maine Congressional District 1, Maine Congressional District 2, Nebraska-At-Large, Nebraska Congressional District 1, Nebraska Congressional District 2, and Nebraska Congressional District 3.

Once I have the state-by-state probabilities, simulating the national election 100,000 times is fast. It wouldn't be hard to avoid the simulation approach by using the polynomial method, but the very elegance of that method means that it suppresses individual state scenarios, which I'm interested in. (The histograms are also fun to look at.)

I should say that I'm accounting for the L and G candidates not because they have an appreciable chance of becoming president, but because they might conceivably affect state-level results. I think there are some states where the L candidate could attract a significant share of the vote. As for G, well, if there's a Nadering on the way it would be helpful to see it coming. In other words, I am including L and G not to estimate their win probabilities on either the state or national level, but in order to be able to model their indirect effect on the D and R probabilities at the state and hence national level. One of the reasons I built my simulator (apart from liking to code) is that I'm not sure I have faith yet in the big players' approach to modeling the effects of third-party candidates this cycle. 

What if you don't have state polling data?

Right now, polling data for each state isn't conveniently available. I don't know where to look for a trustworthy polling breakdown (D%, R%, L%, G%) for each state. Many of the polls you see reported in the news don't include L or G. And there's a whole set of decisions to make about which polls to pay attention to based on how recent they are, who conducted it and how they performed in predicting past elections, etc. 

If I can get better at consuming the information available online, I will start using poll data. In the meantime, I have a workaround. It's similar to the approach I used in my Senate race simulation. Here I look up the RealClearPolitics rating for each state and simply replace the rating with a probability for that state. Specifically,
  • "Solid" for a candidate means I assign win probability 0.95 for that candidate.
  • "Likely" for a candidate means I assign win probability 0.75 for that candidate.
  • "Leans" for a candidate means I assign win probability 0.60 for that candidate.
  • "Toss Up" means 50/50 chance for D and for R.
(There are no third parties in the workaround.) I chose the numbers 0.60, 0.70, and 0.95 more or less arbitrarily. I didn't want there to be any certain outcomes, so I made "solid" 0.95 instead of 1. I didn't want "lean" to lean too far, so I made it 0.6 instead of something like 0.67. (There didn't seem much point to making the "lean" probability less than 0.6, since it seemed that if the probability was in the fifties, RCP would just make it a toss-up.) As for the 0.75 assignment in the case of "Likely," one could argue this number should be as low as 0.67 or as high as around 0.85. I went sort of middle of the road. I haven't done a sensitivity analysis on these numbers.

At any rate, with state-by-state probabilities in hand, we proceed as described in the original method, simulating the national election 100,000 times to determine priorities and battleground states. This is the method I'm currently using, and it is the method that led to the results shown above (note for example the probabilities 0.75 and 1 − 0.75 = 0.25 in the table of battleground states, which are fiat values as noted in the bullets immediately above).