Thursday, April 11, 2013

Part 5.

Previously, I had been attempting to derive and apply corrections to PITCHf/x data under the theory that an affine transformation of the PITCHf/x coordinate system in a given stadium (Tampa Bay in the example I was using) would approximate the transformation from the PITCHf/x coordinate system in that stadium to the PITCHf/x coordinate system in another (Baltimore).  This required that I derive 9 parameters, and that the 9x9 application 'matrix' to a vector of pitch parameters be block diagonal with the same 3x3 block on the diagonal.
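To make the structure of that 9x9 'matrix' concrete, here is a minimal sketch in Python with NumPy.  It is not the code behind the earlier posts.  The ordering of the nine pitch parameters, the helper name shared_block_matrix, and the numbers in the example block and pitch are all assumptions made up for illustration.

```python
import numpy as np

# Assumed (hypothetical) ordering of the nine pitch parameters:
# [x0, y0, z0, vx0, vy0, vz0, ax, ay, az]


def shared_block_matrix(block):
    """Return the 9x9 block-diagonal matrix with one 3x3 block repeated three times."""
    block = np.asarray(block, dtype=float)
    # kron(I3, block) places `block` on the diagonal and zeros everywhere else.
    return np.kron(np.eye(3), block)


# Made-up 3x3 block: a one-degree rotation about the vertical axis.
theta = np.deg2rad(1.0)
block = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])

M = shared_block_matrix(block)

# Made-up pitch: position (ft), velocity (ft/s), acceleration (ft/s^2).
pitch = np.array([-2.0, 50.0, 6.0, 5.0, -130.0, -5.0, -10.0, 28.0, -20.0])
corrected = M @ pitch
```

Because the same block multiplies the position, velocity, and acceleration triples, only the 9 entries of that one block are free.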

It turns out that this approximation can closely correct for movement OR for release point, but not both at the same time, and is thus insufficient for our needs.

In an email exchange, Alan Nathan suggested* that I should instead allow each 3x3 block to be independent.  I both like and dislike this suggestion at the same time.  First, by going from 9 parameters to 27 to define a correction, we run the very real risk of overfitting the data, so we are going to need a strategy to deal with that.  It is very likely that many of the terms will be superfluous, so we are going to want to cut down as much as we can on the number of significant terms.  Second, this now means that we are explicitly trying to approximate a nonlinear transformation with a linear transformation.  That's not so bad in and of itself...the problem could very well be locally linear, and thus a close enough approximation to use...but it makes me a little uneasy.
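As a rough illustration of what the 27-parameter version looks like, here is a minimal sketch that fits the three 3x3 blocks independently by ordinary least squares.  The function names, the row-by-row pairing of pitches between the two parks, and the synthetic data are assumptions for the example, not a description of the actual fit.

```python
import numpy as np


def fit_independent_blocks(src, dst):
    """Fit three independent 3x3 blocks by least squares.

    src, dst: (n, 9) arrays; row i of dst is the pitch that row i of src
    should map to (this row-by-row pairing is an assumption).
    """
    blocks = []
    for k in range(3):
        cols = slice(3 * k, 3 * k + 3)
        # Solve src[:, cols] @ X ~= dst[:, cols]; the block acting on
        # column vectors is the transpose of X.
        X, *_ = np.linalg.lstsq(src[:, cols], dst[:, cols], rcond=None)
        blocks.append(X.T)
    return blocks


def assemble(blocks):
    """Place three 3x3 blocks on the diagonal of a 9x9 matrix."""
    M = np.zeros((9, 9))
    for k, b in enumerate(blocks):
        M[3 * k:3 * k + 3, 3 * k:3 * k + 3] = b
    return M


# Synthetic, noise-free data generated from known blocks.
rng = np.random.default_rng(0)
true_blocks = [np.eye(3) + 0.05 * rng.normal(size=(3, 3)) for _ in range(3)]
src = rng.normal(size=(200, 9))
dst = src @ assemble(true_blocks).T

fitted = fit_independent_blocks(src, dst)
M_fit = assemble(fitted)
n_params = sum(b.size for b in fitted)  # 3 blocks x 9 entries
```

With noise-free synthetic data the fit recovers the generating blocks exactly, which is a useful sanity check on the bookkeeping and says nothing about how well real pitch data behaves.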

So naturally I went ahead and tried it.

I expected some overfitting, and the following figure is what my toy model spit out.  In this figure (and all following), green dots represent pitches thrown on July 24, 2012 at Tropicana, red dots represent pitches thrown August 4, 2012 in Baltimore, and blue dots represent pitches thrown at Tropicana, "corrected" by this model.

We seem to do somewhat better.  Both movements and release points are at least mostly centered in the right places, with some possible nits to pick on Chen's slider and Hellickson's curve (but not as bad as before).  However, the locations of pitches seem to move quite a bit, and in a manner that is not easily discernible.  Also, it is especially apparent that Hellickson's "corrected" release point (at 50 ft), and Chen's too, have much smaller variances than the raw data at either park.  Both the effect on release and the effect on final position are likely artifacts of overfitting.  They possibly also arise because I am creating a mapping between each pitch of a given type in one park to every other pitch of that type in the other park.
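One way to see why the all-pairs mapping could shrink the variance: if every pitch in one park is paired with every pitch of that type in the other, the least squares problem has the same solution as pairing every source pitch with the mean of the target pitches.  The fit is then pulling all corrected pitches toward a single point.  The sketch below checks this on synthetic data for a single pitch type; the three-component "pitch parameters", their centers, and their spreads are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 3-component pitch parameters for one pitch type in each park.
src = rng.normal(loc=[-2.0, 5.0, 6.0], scale=0.3, size=(40, 3))
dst = rng.normal(loc=[-1.8, 5.0, 6.2], scale=0.3, size=(30, 3))

# All-pairs design: every source pitch is paired with every target pitch.
src_pairs = np.repeat(src, len(dst), axis=0)  # each source row repeated 30 times
dst_pairs = np.tile(dst, (len(src), 1))       # target rows cycled 40 times
X_pairs, *_ = np.linalg.lstsq(src_pairs, dst_pairs, rcond=None)

# Same problem, but each source pitch is paired with the target mean.
dst_mean = np.tile(dst.mean(axis=0), (len(src), 1))
X_mean, *_ = np.linalg.lstsq(src, dst_mean, rcond=None)

corrected = src @ X_pairs

raw_total_var = src.var(axis=0).sum()
corrected_total_var = corrected.var(axis=0).sum()
```

In this toy setup X_pairs and X_mean agree to numerical precision, and the total variance of the corrected pitches comes out smaller than that of the raw ones.  Real data mixes several pitch types and pitchers in one fit, so the collapse would be less complete there.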

So how can we reduce this overfitting?  Well, one common method is regularization.  What we have been doing all along is solving a least squares problem that finds the parameters of our transformation matrix that minimize a cost function: the sum of the squared differences between our corrected pitch parameters and the mapped pitch parameters in the park we are correcting to.  To regularize this problem we simply add a competing term to our cost function.  There are two somewhat obvious choices for a regularization term that I can pull off the top of my head.  The first comes from the argument that we should expect our transformation to be not too different from the identity matrix (no correction), because we expect (and want) our corrections to be small.  In this case we would apply a penalty to the cost function that is proportional to the squared difference between our transformation matrix parameters and the identity matrix.  The second choice would add a penalty term proportional to the square of the distance between the corrected and raw final positions.  This would serve to ensure that the changes in final positions were small.
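The first choice has a closed-form solution, which may be part of why it is easy to code.  Here is a minimal sketch, assuming the cost is ||S X - D||^2 + lam * ||X - I||^2 with S the raw pitch parameters (one pitch per row), D the target parameters, and lam a hypothetical penalty weight.  Setting the gradient to zero gives (S^T S + lam I) X = S^T D + lam I.  The function name and the synthetic data are made up for illustration.

```python
import numpy as np


def fit_toward_identity(src, dst, lam):
    """Minimize ||src @ X - dst||^2 + lam * ||X - I||^2 (Frobenius norms).

    src and dst must have the same number of columns so that X is square.
    """
    p = src.shape[1]
    eye = np.eye(p)
    # Normal equations of the penalized problem.
    return np.linalg.solve(src.T @ src + lam * eye, src.T @ dst + lam * eye)


# Synthetic data: a small perturbation of the identity plus noise.
rng = np.random.default_rng(2)
src = rng.normal(size=(50, 3))
true_X = np.eye(3) + 0.1 * rng.normal(size=(3, 3))
dst = src @ true_X + 0.05 * rng.normal(size=(50, 3))

lams = (0.0, 1.0, 100.0, 1e6)
fits = [fit_toward_identity(src, dst, lam) for lam in lams]

# Distance of each fitted matrix from the identity (no correction).
distances = [float(np.linalg.norm(X - np.eye(3))) for X in fits]
```

With lam = 0 this reduces to the plain least squares fit, and as lam grows the fitted matrix is pulled toward the identity.  How to pick lam is a separate question that this sketch does not address.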

As a first test, I chose to go with the first choice for regularization (in no small part because it is much easier to code).

So here are the key results of the same 27 parameter fit with a regularization term designed to keep the transformation close to the identity matrix (in the same format as before):

That seems much better.  There is still a potential nit to pick on Chen's slider, but otherwise this looks somewhat promising.

I also ran this "correction" scheme with the matrix calculated from only one pitcher or the other.  Surprisingly, the correction for the pitcher that was left out was not horrible.  It wasn't as nice-looking, primarily with respect to the final positions of pitches, but the movements and velocities were mostly there.  I have a feeling that the drop-off in this case has more to do with the fact that I am using one righty and one lefty.  I expect that a matrix computed using only pitchers from one side or the other will be farther from "truth" than one computed with at least one pitcher of each handedness included.  (If you want to see the plots from leaving one or the other out, let me know...I'm leaving them out in the interest of space...4 more plots like the ones above.)
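The mechanics of that leave-one-pitcher-out check can be sketched as follows.  Everything here is synthetic: the two "pitchers" are clouds of made-up three-component parameters on opposite sides of the rubber, the "true" transformation is exactly linear by construction, and the penalty weight lam=1.0 is arbitrary.  So this shows only how the check is wired up, not what real data would do.

```python
import numpy as np


def fit_toward_identity(src, dst, lam):
    """Minimize ||src @ X - dst||^2 + lam * ||X - I||^2 (Frobenius norms)."""
    eye = np.eye(src.shape[1])
    return np.linalg.solve(src.T @ src + lam * eye, src.T @ dst + lam * eye)


def rms_error(a, b):
    """Root mean square of the per-pitch Euclidean distance between a and b."""
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))


rng = np.random.default_rng(3)

# Made-up "true" park-to-park transformation, close to the identity.
true_X = np.eye(3) + np.array([
    [0.02, -0.03, 0.01],
    [0.03, 0.01, -0.02],
    [-0.01, 0.02, 0.03],
])


def make_pitcher(center, n=100):
    """Synthetic raw and target parameters for one hypothetical pitcher."""
    src = rng.normal(loc=center, scale=0.3, size=(n, 3))
    dst = src @ true_X + rng.normal(scale=0.02, size=(n, 3))
    return src, dst


righty_src, righty_dst = make_pitcher([-2.0, 5.0, 6.0])
lefty_src, lefty_dst = make_pitcher([2.0, 5.0, 6.0])

# Fit on the righty only, then evaluate on the held-out lefty.
X = fit_toward_identity(righty_src, righty_dst, lam=1.0)
err_raw = rms_error(lefty_src, lefty_dst)
err_held_out = rms_error(lefty_src @ X, lefty_dst)
```

Comparing err_held_out against err_raw gives a single number for how much the correction helps a pitcher it never saw.  In this toy world the held-out correction helps because the underlying map really is linear; with real pitches that is exactly the assumption being tested.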

I still need to run a few other sanity checks, because I'm still a bit afraid of overfitting and other things.  If they don't show everything going to the crapper, then I think I may at least have a framework from which to start working out corrections that cover an entire season.  I'd also like to try out the other regularization method I mentioned above.

We'll see.

Let me know what you think, either in comments or via email.

*Suggestion is probably not the most appropriate descriptor of Alan's email.  Actually, he seemed to be slightly misinterpreting what I was previously doing (probably due to my writing it out poorly) as requiring 27 parameters when in fact I was only requiring 9.  However, without that exchange I probably would have gone down a much different path before this one, because it just feels like too damn many free parameters.  So, while it may not have been (and probably wasn't) his intention to suggest this, I went ahead and took it as such.
