Tuesday, April 22, 2008

PitchF/X: Accelerations and Corrections, and why you should care.

So I've been wanting to post something about corrections to the 2007 pitchf/x data I've been working on for a while now, and time constraints keep getting in the way. So, here goes.

First let me start by saying that as of now, the only corrections I know of are those defined by Josh Kalk. The method he uses is a very creative one. He essentially uses additive corrections to force the initial parameters of pitches in each park to some league average. That's a really short and simplified explanation. Yes, I know what I just said makes it sound like he makes every pitch have the same trajectory, and no, that's not what he is doing. For each park he calculates an additive correction to be applied to each pitch's initial parameters. His method, though, leaves me a bit unsatisfied, even though it does seem to improve the quality of the data by some amount. The reasons I am unsatisfied are the following:

A) Josh has made some remarks in some places that leave me confused. First, he changes the initial parameters of each pitch by some amount depending on which park it was thrown in. But, at the same time he claims he likes the data near the plate so much that he doesn't apply any correction to those points. But then this means that there is a disconnect somewhere. He changes the properties of a pitch trajectory. Doing so will most likely alter the final location and velocity of the pitch when you propagate that trajectory in time. So I'm really not sure what he means by this. It seems that he actually has two final locations for each pitch.

B) While his corrections can be applied across any given season, correcting across multiple seasons under this method becomes a very thorny issue. Some pitchers change their mechanics from year to year. Typically pitchers will lose some velocity from year to year. It gets very complicated very quickly. If the good people at Sportvision decide to alter their cameras between seasons by some amount, the values of "league average" can change from year to year, making corrected data from year A not very compatible with year B's corrected data. This is important for multi-year studies, say if you want to develop pitcher aging curves for velocity or such. Or if you want to glean whether or not your favorite pitcher has changed his mechanics in the offseason by looking at the release point.

C) His corrections make no attempt to determine what is actually happening from park to park. Yes, I know that nobody really wants to get into the technical details of what could be going on to cause these things to happen....but someone should. I don't know if Sportvision has the manpower to do the consistency checks necessary to ensure that the data from every park look the same. They have some calibration routine, and I imagine that to them, everything looks fine with their cameras after running the routine. But there is obviously something going on. The people at Sportvision are smart people. Although I generally believe them when they say that their measurements are accurate to within an inch or two, even smart people can overlook things, especially if the thing being overlooked is a subtle thing. Which is what I think is going on at most parks.

I should probably note here that I am not trying to disparage Josh. I think he is a fine upstanding guy, and a good physicist as well as a good sabermetrician. I enjoy reading his commentary at THT, and look forward to more of his articles. I only take issue with his method of corrections. For the most part and for a lot of the analyses out there, corrections aren't really a necessary thing. But to really get the most out of the data, corrections are needed. And those corrections need to be robust enough to handle the issues I raise here for many of the more ambitious projects I can imagine with the pitchf/x data. Perhaps my issues with this are due to my working for too long in an environment where nothing works the way it's supposed to at first, but everyone out there should keep in mind that there are issues with the data when they do an analysis, and they should seriously question whether those issues will or won't affect their findings. It's my feeling that many in the sabermetric world don't ever really consider data quality outside of sample sizes to be an issue. Usually they don't have to. A ball in play is either a hit or it's not, right? But people should at least give it a thought.

So, what could it be that's causing the disparity between parks? One thing I had noted earlier, and posted to my Livejournal here, was that the thing we'd really like to do to determine what's going on is to measure in the data something we already know. I put forward the idea of measuring the "Coefficient of Drag" as that something. Even though we don't know exactly what the value of Cd is for a baseball, we can at least use it to do consistency checks from park to park. I've shown previously that there are significant differences between some parks in Cd. For the rest of this post, I'm going to focus on the differences between two parks: PETCO Park, and Dodger Stadium. (Previously, on my LJ, I had compared PETCO to Anaheim, but those two don't have many pitchers in common. Two NL stadiums will have many more pitchers that have pitched in both).
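For anyone who wants to try the Cd consistency check themselves, here is a rough sketch of one way to back Cd out of a fitted velocity and acceleration. It assumes the standard drag relation a_drag = (rho * A / 2m) * Cd * v^2, and the ball mass, circumference and air density below are nominal values I am assuming for illustration (they are not taken from the pitchf/x data, and real air density varies by park and by day). The function name and the choice of which velocity sample to use are also just illustrative.

```python
import math

# Assumed nominal constants, for illustration only.
MASS_KG = 0.145           # roughly 5.125 oz baseball
CIRCUMFERENCE_M = 0.2318  # roughly 9.125 in
AIR_DENSITY = 1.2         # kg/m^3, roughly sea-level air
G_FTS2 = 32.174           # gravitational acceleration, ft/s^2
FT_TO_M = 0.3048

def drag_coefficient(vx, vy, vz, ax, ay, az):
    """Estimate Cd from one velocity (ft/s) and acceleration (ft/s^2) sample."""
    speed = math.sqrt(vx * vx + vy * vy + vz * vz)
    # Gravity acts along -z; add it back so only the aerodynamic part remains.
    az_aero = az + G_FTS2
    # Drag opposes the velocity, so project the aerodynamic acceleration
    # onto the direction opposite to the velocity vector.
    a_drag = -(ax * vx + ay * vy + az_aero * vz) / speed
    radius = CIRCUMFERENCE_M / (2.0 * math.pi)
    area = math.pi * radius ** 2
    k = AIR_DENSITY * area / (2.0 * MASS_KG)  # units of 1/m
    return (a_drag * FT_TO_M) / (k * (speed * FT_TO_M) ** 2)
```

Because the same constants are used for every park, the absolute value of Cd that comes out matters less than whether it agrees from park to park, which is the whole point of the check.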

Through some communication with the Sportvision people and Alan Nathan, in which I posited many ideas that could be checked that might be responsible for the Cd discrepancies (and many of those ideas were quickly shot down...they were mostly semi-obvious things....I did mention above that the Sportvision people were smart, right? Yeah, they had thought of these things too. But it's always good to ask.), I learned a great deal about what is going on, and I am still learning more (it's really very fascinating...you should see for yourself).

It's all about the Accelerations.

As it turns out, my first inclinations were totally wrong. Those were primarily that there might be something fishy with either the timing of frames, or that there might be something fishy in the length scales. These would be the easiest things to contribute to a mismeasurement of Cd, but they would also tend to mismeasure other things. Like velocities and positions. By a significant amount. And those kinds of mismeasurements simply aren't seen in the data. At least, not at the levels that would be expected from an incorrect length scale or time scale large enough to explain the differences in Cd. So there's something else going on. Something more subtle.


Almost simultaneously, Alan and I put forth two separate ideas. It turns out that Alan's explains the problem at PETCO almost entirely. Mine may still exist at other parks but it will be more difficult to see I think. Anyway, the two ideas, starting with Alan's:
1. That somehow, there is some distortion in one or both cameras that has been improperly measured.
2. Slightly more complex (and perhaps as a result of 1.), that systematically there are some parts of the trajectory that are not included in the fit to a trajectory. Specifically, in reality, the acceleration a baseball experiences will vary greatly over the flight between the mound and home plate (and this relationship is almost linear with distance travelled). The model used to reconstruct the pitch trajectory is one of constant acceleration. In this model, the acceleration that is found is usually near the mid-point of initial and final accelerations. If somehow points at one end or the other of the pitch's trajectory are being systematically missed at certain parks (tossed out because of not so good agreement between the two cameras for instance), then the acceleration our model finds in the fitting procedure will be shifted up or down because the 'measured' initial or final accelerations won't be included in the mean.
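The second idea is easy to play with in a toy model. The sketch below is not the actual PitchF/X fit; it is a one-dimensional ball with a deceleration of K times speed squared and no gravity, sampled at an assumed 60 frames per second, with a constant-acceleration parabola fit to all the frames and then to subsets with one end chopped off. The values of K and S0 and the frame rate are my assumptions for illustration.

```python
import numpy as np

K = 0.0019   # assumed drag constant in 1/ft (roughly Cd ~ 0.35 for a baseball)
S0 = 132.0   # assumed release speed in ft/s (about 90 mph)

def distance(t):
    """Exact distance travelled when deceleration = K * speed^2 (1-D, no gravity)."""
    return np.log1p(K * S0 * t) / K

def fitted_deceleration(t):
    """Fit d(t) = d0 + v*t + 0.5*a*t^2 by least squares and return -a."""
    coeffs = np.polyfit(t, distance(t), 2)
    return -2.0 * coeffs[0]

# 24 frames at an assumed 60 Hz, covering about 0.38 s of flight.
t_full = np.arange(24) / 60.0

full = fitted_deceleration(t_full)
early_only = fitted_deceleration(t_full[:16])  # last third of the frames missing
late_only = fitted_deceleration(t_full[8:])    # first third of the frames missing

print(f"all frames:        {full:.2f} ft/s^2")
print(f"late frames lost:  {early_only:.2f} ft/s^2")
print(f"early frames lost: {late_only:.2f} ft/s^2")
```

Since the drag is strongest right out of the hand, losing the late frames pushes the fitted deceleration up and losing the early frames pushes it down, which is the shift described in idea 2.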

The nice thing about these two ideas is that when you think about them in relationship to PETCO, it's possible to determine if one or the other dominates. In order to get systematically high values of acceleration at PETCO, if we are missing points in the fit, then those points must be in the latter half of the trajectory, so initial velocities between PETCO and some other stadium (we chose Dodger Stadium) with more reasonable acceleration values should be about the same, but at the plate, pitches should have a much lower final velocity according to PitchF/X. If Alan is correct, and the distortion is greatest near the mound, then the initial velocities at PETCO should be higher than for, in our case, Dodger Stadium, and the final velocities should be about the same.

So to test this, we looked at pitches from Jake Peavy (because who doesn't like watching him pitch?), through June 2007 (more on why later) in both parks. Below are the results: In the top left are the initial velocities of his pitches, the top right are the final velocities, the bottom left is the z value of the release point, and the bottom right is the x value of the release point. Pitches at PETCO are in red, pitches in Dodger Stadium in blue (Dodger Stadium data has been scaled up so that it appears that there are an equal number of pitches at both stadiums):



As you can see, Alan was totally right. The final velocities are almost the same, but the initial velocities are drastically different! But that's not all the evidence that a miscalibrated camera distortion can drastically change the accelerations. First, are we sure that PETCO has a miscalibrated camera distortion? To answer that, Rand Pendleton of Sportvision was kind enough to look into the PETCO camera registration information for us. What he found sealed the deal. At least for PETCO specifically.



In this plot, Rand has plotted the value used for spherical camera distortion (to get an idea of what spherical camera distortion is, think of a fish-eye lens for an extreme example) versus pitch number at PETCO. You can see that sometime during the season, the value for this factor changed dramatically for one camera, but not the other. He then plotted the measured values of Cd for each of these two "epochs".



As you can see, the shift in k1 (the distortion factor for one camera) does in fact correspond to a shift in Cd, and therefore a shift in the measured accelerations by that camera. Neat! It turns out if you simply plot average Cd values by date at PETCO, you can see the same effect:


Here the x axis is simply days after March 1, 2007. The largest values occur roughly through June, and I've fit a straight line through them. Compare this to Dodger Stadium. The error bars on this plot are also extended a bit to capture the total spread in Cd values rather than just the RMS of the mean. Alan has shown that this spread can be accounted for mostly by small fluctuations in measurements, as well as variations in the properties of the baseball.



Again, these values are relatively stable through June, but afterward, it looks like there might be some strange effects going on. For now though, we will just look at what happened through June of 2007 at Dodger and PETCO.

So, now we'd like to quantify the effect at PETCO and correct for it. Here is the method I am going to try:
Since it would appear that there is little difference in the data at home plate between PETCO and Dodger Stadium I am going to start from home plate. At home plate, I will multiply the Y and Z components of acceleration by the ratio of Cd between Dodger Stadium and PETCO (according to the lines I have fit to those data through June, this multiple is about 0.83). I leave the X component alone for the moment, simply because it appears that the release point data in X are OK (and because I believe it is the camera not behind home plate that has the largest disparity in spherical distortion, and it would contribute most to measurements of Y and Z). After 'correcting' the accelerations in this manner, I then propagate the trajectory back to the release point from home plate and get back the new "initial" parameters.
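In case the recipe is hard to follow in words, here is a minimal sketch of it under the constant-acceleration model. The pitch values are made up, the plate distance of 1.417 ft and all the function names are my own assumptions, and the Z component is scaled exactly as written above (gravity included), so treat it as an illustration of the steps rather than the code actually used for the plots.

```python
import math

Y_PLATE = 1.417   # assumed distance of the front of home plate, in ft
CD_RATIO = 0.83   # Dodger Stadium / PETCO ratio quoted in the post

def time_to_y(y, vy, ay, y_target):
    """Root of y + vy*t + 0.5*ay*t^2 = y_target closest to t = 0 (needs ay != 0)."""
    disc = math.sqrt(vy * vy - 2.0 * ay * (y - y_target))
    roots = ((-vy - disc) / ay, (-vy + disc) / ay)
    return min(roots, key=abs)

def propagate(state, t):
    """Advance (x, y, z, vx, vy, vz, ax, ay, az) by t seconds at constant acceleration."""
    x, y, z, vx, vy, vz, ax, ay, az = state
    return (x + vx * t + 0.5 * ax * t * t,
            y + vy * t + 0.5 * ay * t * t,
            z + vz * t + 0.5 * az * t * t,
            vx + ax * t, vy + ay * t, vz + az * t,
            ax, ay, az)

def correct_pitch(state, ratio=CD_RATIO, y_plate=Y_PLATE):
    """Scale ay and az at the plate, then run the trajectory back to the start."""
    y_start = state[1]
    # 1. Go forward to home plate using the original fit.
    at_plate = propagate(state, time_to_y(state[1], state[4], state[7], y_plate))
    # 2. Scale the Y and Z accelerations; X is left alone.
    x, y, z, vx, vy, vz, ax, ay, az = at_plate
    scaled = (x, y, z, vx, vy, vz, ax, ay * ratio, az * ratio)
    # 3. Go backward (negative time) to the original starting distance.
    return propagate(scaled, time_to_y(y, vy, ay * ratio, y_start))

# Hypothetical fastball: positions in ft, velocities in ft/s, accelerations in ft/s^2.
pitch = (-2.0, 50.0, 6.0, 6.0, -130.0, -5.0, -8.0, 30.0, -18.0)
corrected = correct_pitch(pitch)
print("original  vy0, z0:", pitch[4], pitch[2])
print("corrected vy0, z0:", round(corrected[4], 2), round(corrected[2], 2))
```

By construction the corrected pitch arrives at the plate with the same position and velocity as the original one; only the "initial" parameters (and the Y and Z accelerations) change.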

So, let's see what this does for the data on Peavy at PETCO. The plot below is exactly the same as the first plot in this post, but with the above correction method applied:



The agreement here, in initial velocities and release point, is almost staggering. Note that in addition to 'fixing' the initial velocities, we have also, somewhat, 'fixed' the z position of the release point simply by correcting accelerations. I've repeated this for all pitchers with at least 50 pitches in each of these stadiums (through June 2007), and while not all of them show such great agreement in the 'corrected' data, there is a very marked improvement in the data for all of them by applying these corrections. But since this post is going a bit long, I'm not going to show them here, but I'll be happy to show them to anyone that wants to see them.

Now, the task is to try to apply this to all other parks. This is more challenging, because I think we got lucky with the comparison between PETCO and Dodger Stadium. It's not always the case that the final velocities at any two parks are in such good agreement to begin with. So I have to find some way to handle that condition. And there still could be framing issues involved (where we are systematically missing part of the trajectory in the fit), so I'll have to find some way of dealing with that too. And lastly, I think I also got lucky that only one camera had an issue. It could very well have been both, and if both cameras have an issue like this, it could very well be that the correction factor to apply is different for each component, and not so easy to derive as it was here. These are the issues I am looking into these days.

So to recap: For this one special case at least, where I think I understand the issue with the cameras, I can completely derive corrections from a ratio of drag coefficients to make one park look like another in the data. Making one park look like another is really just a check to make sure we are doing what we think we should be doing. In reality, I can choose any value of Cd to be the "real" value, and make corrections from a ratio of the park I want to correct for to the 'real' value I have decided upon. As measurements of Cd advance, no "reference park" is needed...we can just choose the best 'true' value of Cd and correct to that. This type of correction is more robust than a utilization of league averages for all parks because the values we are correcting to can be fixed across seasons with ease...Allowing for instance, corrections to both 2007 and 2008 data such that they are compatible with one another. There are still issues to be resolved however when generalizing this method. It is my hope that as I learn more, (perhaps at the PitchF/X summit!), a way of resolving these issues will become clearer.

While I'm not entirely sure that I can account for all parks with a method like this, or a similar one, I think I'm on the right track. Let me know what you think. Especially if you might have an idea for getting around any of the issues mentioned in the previous paragraph. Or if you think I'm full of crap.

6 comments:

pobguy said...

Nice work, Ike. It looks like a step in the right direction, and I look forward to hearing more about it at the summit. Before then, I will try to get my simulation going to see how changing the distortion correction affects the trajectory.

pobguy said...

One more thing: This is Alan Nathan (pobguy).

Ike H said...

Thanks Alan,

I hope this is a step in the right direction. As I said, there are other issues that need a solution before this is complete, and I don't know if they'll get solved before the summit. If they don't, hopefully I can get some good ideas there.

Josh Kalk said...

Hi Ike,

I had only been paying attention to your live journal so I missed this post for a bit. I just want to take a few minutes to clarify a few of your concerns about my system.

First, once I make my corrections but before I normalize the data (pretend that every pitch is thrown at sea level with standard temp) the location of the pitch does seem to stay the same. You are 100% right that I am messing with the trajectory but messing with it in a way that doesn't change the final position.

Second, you are also 100% right that my corrections could have difficulties porting from year to year if sportvision changed the league average. In fact, all my reported numbers are just corrected to the average camera and if that average camera isn't the true value then my corrections will be systematically off by that amount.

Lastly, yes in my method I don't care all that much about what is actually going on from park to park.

I really look forward to when you have your corrections up and running. I have been wanting someone else to do a different method so I can check my results now for quite some time so once you have some nice results to compare please let me know and we can start by looking at a handful of pitchers and see what the results are.

josh

Ike H said...

Hi Josh.

Thanks for those clarifications!

So the thing I wonder the most about your first point is that there may be some velocity dependence in the difference between the "new" final location of a pitch trajectory and the old final location. Specifically, I think things are probably going to be alright for fastballs, but I'm suspicious about what happens to curveballs, and to a slightly lesser extent, changeups.

In my LJ, it was pretty apparent that at least when I applied your corrections (which I'm not entirely sure I believe the numbers from...the release points seemed consistent with your reported numbers though), that the coefficient of drag curve changed significantly for slower pitches. At the higher velocities, all parks seemed to converge toward one value, but at lower velocities, the curves for each park started to diverge strongly, and that's why I'm suspicious of final locations for slower pitches...

Anyway, I look forward to seeing you in San Fran (assuming you are going to be there), and talking with you more about these things.

Harry Pavlidis said...

Josh, will you be there?