*x*,

*y*) points of a data set will never form a perfect polygon in the coordinate plane…but once that fanciful thought occurs to you, you can't help wondering what the best-fit line would look like.

So, imagine that the

*n*data points (

*x*

_{1},

*y*

_{1}), … (

*x*

_{n},

*y*

_{n}) form the

*n*vertices of a regular

*n*-gon. Then the best-fit line turns out to be easy to describe:

- The best-fit line will pass through the center of the regular
*n*-gon. This is because the coordinates of the center point can be found by averaging the data values—and the best-fit line always passes through the point whose coordinates are the averages of the data values.

- For any number of vertices, and for any location of the center, and for any orientation of the
*n*-gon, the slope of the best-fit line will be zero.

To illustrate this, I made an animation with 7 data points. In the first phase of the animation, the data points are moving around chaotically. Over time, they coalesce into the form of a regular 7-gon. Then the 7-gon rotates for a while. After that, the animation simply reverses: the 7-gon begins to rotate in the opposite direction, and then the points chaotically drift apart. For each frame of the animation, a best-fit line to the 7 data points is shown. You'll see that whenever the points form a 7-gon, the best-fit line is horizontal.

There's no cheating here—for each frame of the animation, the computer calculates the best-fit line based on where the data points are. It would have made for a nicer animation if I had also imparted an overall motion to the center of the heptagon…maybe some other time. I didn't go the trouble because the location of the center doesn't actually affect the slope. The slope of a best-fit line is invariant under an overall translation of the data.

How does the math work out? Here is one view of the problem. The image below shows 7 data points that form a heptagon. A candidate best-fit line through the center is shown. The total squared error will be the sum of the squared lengths of the vertical segments.

The winning candidate however is the horizontal line through the center:

This is fine I guess, but to my eye it isn't obvious that the total squared-error of the horizontal line is best. I don't have an insightful argument for why, in the case of

*n*points equally spaced around a circle, the total distance-squared to a line through the center is minimized by making the line horizontal. What I'll give instead is a calculation leading to that conclusion.

Let the

*n*-gon be inscribed in a circle with radius

*R*and center (

*h*

_{x},

*h*

_{y}). Then the vertices of the

*n*-gon will have coordinates

(

*h*_{x}+*R*cos(θ + 2π*k*/*n*),*h*_{y}+*R*sin(θ + 2π*k*/*n*))for

*k*= 0, …

*n*− 1. The parameter θ allows for different orientations of the

*n*-gon.

The total distance-squared to a line of slope

*m*that passes through (

*h*

_{x},

*h*

_{y}) can be calculated as a function

*E*(

*m*), quadratic in the variable

*m*. (The parameters

*h*

_{x}and

*h*

_{y}drop out, as expected.) Minimizing

*E*leads to the optimum value for

*m*, expressed as a ratio of sums:

\[m_{\rm best} = \frac{\sum_0^{n-1}{\cos\lambda_k\sin\lambda_k}}{\sum_0^{n-1}{\cos^2\lambda_k}}\]

where \(\lambda_k \equiv \theta + \frac{2\pi k}{n}\). The vanishing of the slope amounts to the proposition that the

*n*-gon's horizontal and vertical coordinates have zero correlation.

A straightforward evaluation of the sum in the numerator shows that it vanishes:

\[m_{\rm best} \propto \sum_0^{n-1}{\cos\lambda_k\sin\lambda_k}\]

\[ m_{\rm best} \propto \sum_0^{n-1}{\sin 2\lambda_k}\,.\]

It should be possible to see intuitively why that sum vanishes, but let's keep going to the end:

\[ m_{\rm best} \propto \sum_0^{n-1}{\left(e^{2i\lambda_k}-e^{-2i\lambda_k}\right)}\]

\[ = e^{2i\theta}\sum_0^{n-1}{e^{\frac{4\pi i k}{n}}} + e^{-2i\theta}\sum_0^{n-1}{e^{\frac{-4\pi i k}{n}}}\]

\[ = e^{2i\theta}\cdot \frac{1-(e^{4\pi i/n})^n}{1-e^{4\pi i/n}} + e^{-2i\theta}\cdot \frac{1-(e^{-4\pi i/n})^n}{1-e^{-4\pi i/n}}\]

\[ = e^{2i\theta}\cdot \frac{1-1}{1-e^{4\pi i/n}} + e^{-2i\theta}\cdot \frac{1-1}{1-e^{-4\pi i/n}}\]

\[ = 0\,.\]

So

*m*_{best}= 0 and the best-fit line to a polygon is always horizontal. If you have an explanation that's more insightful than mechanically summing a finite series of complex exponentials, please pass it along!
## No comments:

Post a Comment