Saturday, March 16, 2013

A not so quick note on corrections


I just wanted to post a quick note about corrections to the pitchf/x data. Several articles have detailed some of the quirkiness in the pitchf/x data, and I will look up and append as many of them as I can to the bottom of this post. But before I do, I want to get the following off my brain, as I think it will form the starting point from which I proceed on this topic. I am going to simplify the problem greatly, but try to capture the essence of what we are trying to solve.


Imagine we live in a world that contains just two baseball stadiums outfitted with pitchf/x camera systems. For fun, let us say that those two stadiums are Coors Field and Safeco Field. Further, we have pitchf/x data records for at least one pitcher who has thrown at each field within a reasonable amount of time. And let's say we have good reason to assume that at least one of those pitchers had no noticeable change in his abilities between outings in those parks. He threw with the same velocities, the same spins, and painted the black all night long. Say, Mariano Rivera pre-knee injury. But when we look at the data, we see noticeable differences between his pitches at those two parks, in all attributes: different release points, different velocities, different movements. The differences are not terribly large, but they are large enough to ask whether something is changing with the pitcher, whether something is different between the two measurement systems, or whether something else is going on.


The first thing we can try to eliminate is the something else. The two obvious candidates are both atmospheric effects: air density and wind. Safeco and Coors sit at vastly different elevations, and one expects a large difference in air density between the two. So we can take pitches thrown at Safeco, subtract gravity from the z direction of acceleration, scale the remaining accelerations (all caused by aerodynamic forces) by the ratio of the air density at Coors to that at Safeco, and then add gravity back in. Do the accelerations look similar? No? Well, what about wind? That's actually a little tougher to sort out. A lot tougher. While we can get weather reports that give the prevailing wind speed and direction, that direction is usually only good to about 20 degrees or so, and we don't know anything about gusts. But even worse than that, we have absolutely no idea how that prevailing wind plays out on the field between the mound and home plate. Stadiums are pretty big obstacles to wind, and without detailed wind studies, there's just no way for someone looking at the data to know. So we can't rule out wind as the cause of at least some of our park-to-park differences, mainly through changes in the accelerations. Since we can't always eliminate wind as a source of systematic error, let's simplify this problem by stating that the data we have was all collected on still days (at least for this toy problem I am laying out). While we are going to ignore the effects of wind for the time being, it's important to remember that it still exists, and that its effects are unpredictable without much more information than is publicly available.
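The density adjustment described above can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions: the function names are made up for this note, density comes from the standard-atmosphere barometric formula with humidity ignored, accelerations are taken to be in ft/s^2 with z pointing up, and the elevations and temperature in the usage lines are ballpark guesses rather than measured game conditions.

```python
G_FT_S2 = 32.174  # magnitude of gravitational acceleration, ft/s^2


def air_density(altitude_m, temp_c):
    """Approximate dry-air density in kg/m^3 (humidity ignored).

    Pressure comes from the standard-atmosphere barometric formula,
    density from the ideal gas law for dry air.
    """
    pressure_pa = 101325.0 * (1.0 - 0.0065 * altitude_m / 288.15) ** 5.25588
    return pressure_pa / (287.05 * (temp_c + 273.15))


def rescale_accel(ax, ay, az, rho_from, rho_to):
    """Move a pitch's accelerations from one air density to another.

    Gravity is removed from az, the aerodynamic part is scaled by the
    density ratio, and gravity is added back in.
    """
    ratio = rho_to / rho_from
    az_aero = az + G_FT_S2  # gravity contributes -g to az, so add g to remove it
    return ax * ratio, ay * ratio, az_aero * ratio - G_FT_S2


# Assumed conditions: Safeco near sea level, Coors at roughly 1580 m, both at 21 C.
rho_safeco = air_density(5.0, 21.0)
rho_coors = air_density(1580.0, 21.0)

# Made-up accelerations for a pitch recorded at Safeco, re-expressed for Coors air.
coors_like = rescale_accel(-8.0, 28.0, -18.0, rho_safeco, rho_coors)
```

If the rescaled Safeco accelerations still disagree with what was recorded at Coors, air density alone does not explain the difference.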


So if we have little reason to suspect the pitchers in our sample of turning in very different performances, and we can rule out atmospheric effects for most of the systematic differences, we are left with the two camera systems being somehow inconsistent with one another. So how can we describe that inconsistency, and then try to adjust things such that the data at our two locations are in some sort of agreement?  Well, first let's try to frame the problem.

Imagine one of the pitchers in our sample throws at least one of his pitches with remarkable consistency.  We'll call this pitch p0.  In ideal conditions, he always has more or less the same release point, velocity, and imparts the same spin so the accelerations (in our ideal conditions of 70 degrees at sea level) are more or less constant.
In matrix/vector form, we might write this the following way:
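A sketch of that vector, assuming it simply stacks the release point, the initial velocity, the acceleration, and a trailing 1 into a single 10-element column (the exact layout is an assumption):

```latex
\[
p_0 =
\begin{bmatrix}
x_0 \\ v_0 \\ a \\ 1
\end{bmatrix}
\]
```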









Where x0 represents the xyz coordinates of the release point, v0 the three components of the initial velocity, and a the three components of the acceleration. The 1 tacked on to the end is a little bit of a hack that goes by the name of homogeneous coordinates. It allows us to do things with matrix multiplication that we can't normally do, like adding a constant to a vector.
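To see why the trailing 1 is useful, here is a small pure-Python sketch. The pitch values are made up, and the element ordering (release point, then velocity, then acceleration, then 1) is an assumption carried over from the description above. Putting a constant in the last column of an otherwise identity matrix adds that constant to one element of the vector.

```python
def identity(n):
    """n-by-n identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matvec(m, v):
    """Multiply matrix m (list of rows) by column vector v."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


# Made-up pitch: release point (ft), initial velocity (ft/s),
# acceleration (ft/s^2), and the homogeneous 1 at the end.
p0 = [-1.5, 50.0, 6.0, 5.0, -130.0, -5.0, -8.0, 28.0, -18.0, 1.0]

# An identity matrix with one extra entry in the last column.
# Row 2 is the release height, column 9 multiplies the trailing 1,
# so this matrix adds 0.5 ft to the release height and changes nothing else.
shift_z = identity(10)
shift_z[2][9] = 0.5

shifted = matvec(shift_z, p0)
```

Without the trailing 1, a plain 9-by-9 matrix could only scale and mix the nine pitch parameters, never offset them.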

Now, consider that our pitcher then goes and throws this pitch in one of our stadiums and has that pitch recorded by the pitchf/x system. What does our output look like? Well, we hope to get back exactly the parameters that would describe p0. Except that we won't. Even if the system is perfect, there is still the effect of the (probably) non-ideal air density. We can calculate the air density given the altitude, temperature and humidity (although for most purposes we can ignore the humidity, as its effect is much smaller than the other two), and we can write down what we should expect to see. The air density simply acts as a multiplier on all accelerations except for gravity. So we subtract out gravity, throw our multiplier at the accelerations, and add gravity back in. Or, something like this:
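One way to write that chain of operations, reading right to left. The symbol p for the real-world pitch is an assumed name; the operators are the ones defined in the next paragraph:

```latex
\[
p = G \, R \, G^{-1} \, p_0
\]
```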






Where G represents the operation of adding in gravity, G⁻¹ represents the operation of subtracting it, and R represents the operation of multiplying all accelerations by a factor of ρ/ρ0 (ρ0 = air density at sea level and ρ = air density at the ballpark). Those are the "real world" parameters of the pitch when it is thrown in a non-ideal air density. But that may or may not be what we get out of pitchf/x when it records the pitch.
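As a concrete sketch, the three operators can be built as 10-by-10 matrices acting on a homogeneous pitch vector. Everything specific here is an assumption for illustration: the element ordering (release point, velocity, acceleration, 1), units of ft/s^2 with z pointing up, the helper names, the made-up pitch, and the density ratio of 0.82.

```python
G_FT_S2 = 32.174  # magnitude of gravitational acceleration, ft/s^2
AZ, ONE = 8, 9    # assumed indices of az and of the trailing 1


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def gravity_op(sign):
    """G for sign=+1 (adds gravity to az), G^-1 for sign=-1 (removes it)."""
    m = identity(10)
    m[AZ][ONE] = -sign * G_FT_S2  # gravity points down, so G adds -g to az
    return m


def density_op(ratio):
    """R: scales the three accelerations by rho/rho0, leaves the rest alone."""
    m = identity(10)
    for i in (6, 7, 8):
        m[i][i] = ratio
    return m


def real_world_operator(ratio):
    """G R G^-1: remove gravity, scale the aerodynamic part, restore gravity."""
    return matmul(gravity_op(+1), matmul(density_op(ratio), gravity_op(-1)))


# Made-up ideal pitch p0 and an assumed thin-air density ratio.
p0 = [-1.5, 50.0, 6.0, 5.0, -130.0, -5.0, -8.0, 28.0, -18.0, 1.0]
p = matvec(real_world_operator(0.82), p0)
```

The release point and velocity pass through untouched; only the accelerations change, and az changes by less than the ratio alone would suggest because the gravity part is not scaled.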

Instead, the pitchf/x system at each park gets its own operator. For the first park under consideration, we'll call that S, and for the second, we'll call it T. These operators perform an unknown transform on the parameters of the real-world pitch before giving us some output.
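Written out under the same conventions, the recorded output at each park might look like the following. The names p_S and p_T for the recorded pitches, and R_S and R_T for the density scaling at each park, are assumed labels rather than established notation:

```latex
\[
p_S = S \, G \, R_S \, G^{-1} \, p_0
\qquad
p_T = T \, G \, R_T \, G^{-1} \, p_0
\]
```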









So the problem we want to solve is to estimate the parameters of S and T, or at least come as close as we can to something that could be considered a reasonable guess.

As this is getting rather long, and it's rather late, I'll leave this right here for now.

A few links on PITCHf/x calibration errors:
Mike Fast's seminal warning to all PITCHf/x analysts: The Internet cried a little when you wrote that on it
More Mike Fast
Max Marchi
Marchi again. 
Jeremy Greenhouse 
Jon Roegele 
Roegele again.

These are just some that I remember reading at one time or another.  If there are important works I have missed, please let me know, and I'll append them too.
