Monday, May 12, 2008

Lewiston Bound!

I'm a little late in posting this, and for those of you only interested in my work related to PITCHf/x, you may want to skip this and go to my two previous posts today.  But it needs to be said.  

The OCU Stars are heading back to Lewiston.  If you know nothing about NAIA college baseball, then you probably have no idea what I'm talking about.  For those that do though, you'll know that Lewiston, ID plays host to the NAIA world series, and my Oklahoma City University Stars have been a regular attendee in recent years, and won it all in 2005.  They are getting another shot at it this year, and I couldn't be more proud of these guys.

Many people had all but counted them out this year.  They dropped a few games they probably shouldn't have, and even got swept in the regular season by top-ranked conference rival Lubbock Christian.  LCU was just about everyone's pick to storm out of the conference and roll on to the championship game in Lewiston.  The Chaps had rolled through the conference schedule with only one loss on the year, and swept their way through the conference tournament as well.  I'll admit even I had my doubts that the Stars would get back to the big dance.  LCU just looked too formidable.

But if there's one thing that can be said about OCU Baseball...Win, Lose, or Draw, they've got a lot of fight in them.  Everyone from the coaches, players, and alumni to even the casual fan knows that the expectations of them are high, and that when the season is on the line, they won't just lie down quietly.  

In addition to winning out in the regional, with a 5-0 shutout of LCU in the semis and a 14-8 slugfest in the finals, Coach Crabaugh also achieved his 10th straight 50+ win season with their first win in the regional.  I had no idea he was approaching that mark, and just doing the math, that means that my senior season there was the first season of that streak.  I feel greatly honored to be a part of that.  Crabaugh runs a hell of a program, and it's my opinion that he deserves every accolade that they can give him.  In addition to that, this year's Stars team broke two more school records that were set during my senior season.  We set a lot of records back then, including 4 of them that were set by yours truly (and broken during the National Championship season of 2005).  They broke the team home run record of 141 dingers that our 1999 team set (in addition to the 1988 team...which was also the national NAIA record), and are now at 143, having hit 3 long balls in the regional championship, including a grand slam off the bat of Landon Camp that tied the old record.  Not only that, but David Dennis tied the school record of 27 homers in a single season set by Rick Nadeau during my senior year, and Nick Klusaw in 2005.

It's hard to say if this year's team ranks up there with the 2005 team or other teams that made it far in the NAIA WS.  They've stumbled a bit along the way.  But the way they've been playing this week goes a long way to convincing me that they are firing on all cylinders.  And if they keep that up, I think they have it in them to bring home a ring.  I couldn't be happier for them right now.  

Go Stars!

Spherical Distortion and Acceleration

While I await a release of Alan Nathan's PITCHf/x simulator, so that I can play with it and determine as best I can the relationships between different types of miscalibration and the results we see in the PITCHf/x data, I wanted to make a "toy" simulator of spherical distortion to see how that really affects pitch trajectory. This is an extremely crude thing that I whipped up in a few minutes one night while attending the PHENO conference in Madison, WI. The results I got were somewhat confusing, so I didn't touch on this at all at the summit. However, I think I now know how to interpret what I found, and I'll describe it below.

First though, the method:
For the moment, I just wanted to isolate the effect of mis-measuring spherical distortion in one dimension. At two dimensions (a real camera) there is some coupling of the two dimensions that I wanted to ignore for the time being. At three dimensions (two cameras -- the real PITCHf/x system), there is a coupling effect between two 2D things that gets even more complex. I wanted the simplest case, so I just imagined we had a ruler that exhibited the effect of "spherical" distortion centered somewhere on that ruler.

What does this mean?
Imagine that we are throwing a baseball straight up into the air with a giant ruler standing next to us. This ruler, though, is incorrect. It exhibits a radial distortion from some point, governed by the following equation:

r' = r*(1+k1*r^2)

where r is the true distance from the "center" of the distortion, k1 is a coefficient of distortion, and r' is the measured distance from the center of the distortion.

Since the unit of measurement I am using is simply distance (feet), the coefficient of distortion I choose has no meaning relative to the coefficients of distortion actually seen in the cameras (which typically use pixels as a unit of measurement). So I've chosen to use a value of k1 = 0.00001. It seems small, but it is "reasonable" and actually not that small. I then assume that we are throwing a pitch at 90 mph straight up in the absence of air, so that it has an initial velocity of 90 mph (-132 fps) upward, and an acceleration downward equal to g (32 ft/s^2).
I then plot distance (as measured by our faulty ruler) as a function of time in 1/60 second intervals, and fit those points to the expected Newtonian equation of motion in 1 dimension. This gives me my "pitch" trajectory that I think I just measured. I do this over 60 ft, and I do this three times: once with the center of distortion at the release point, once with the center of distortion 60 ft away from release (the end of the tracking), and once with the center of distortion at the center of the tracking (30 ft). I then repeated this for a negative value of k1. To make things somewhat consistent with everyone's understanding of PITCHf/x, the release point is at 60 ft, and the end of the tracking occurs at 0 ft. So "up" is negative, just the same way that pitches travel in the minus y direction toward home plate.
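
The procedure above can be sketched in a few lines of Python. This is a minimal reconstruction rather than the original code: the function names (`true_positions`, `measured`, `fit_trajectory`), the use of NumPy's `polyfit`, the treatment of r as a signed distance, and the choice to stop sampling once the true position reaches 0 ft are all assumptions made for illustration.

```python
import numpy as np

G = 32.0         # ft/s^2; positive because "up" is the negative direction
V0 = -132.0      # ft/s; 90 mph toward 0 ft
Y0 = 60.0        # ft; release point
DT = 1.0 / 60.0  # s; sampling interval


def true_positions():
    """True positions sampled every 1/60 s until the ball reaches 0 ft."""
    t = np.arange(0.0, 1.0, DT)
    y = Y0 + V0 * t + 0.5 * G * t**2
    keep = y >= 0.0
    return t[keep], y[keep]


def measured(y, center, k1):
    """Apply r' = r*(1 + k1*r^2), with r the signed distance from `center`."""
    r = y - center
    return center + r * (1.0 + k1 * r**2)


def fit_trajectory(center, k1):
    """Fit p0 + p1*t + p2*t^2 to the distorted readings; return (p0, p1, p2)."""
    t, y = true_positions()
    p2, p1, p0 = np.polyfit(t, measured(y, center, k1), 2)
    return p0, p1, p2


if __name__ == "__main__":
    for center in (0.0, 60.0, 30.0):
        p0, p1, p2 = fit_trajectory(center, 1e-5)
        print(f"center={center:4.0f} ft  p0={p0:7.2f}  p1={p1:8.2f}  accel={2 * p2:6.2f}")
```

With k1 = 0 the fit simply returns the inputs (60 ft, -132 fps, and p2 = 16), which is the same sanity check described below. The fitted acceleration is 2*p2.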

So the results:
First, just as a sanity check, I make sure that everything works like I think it's working, so I set k1=0...meaning our ruler is correct:



So, here you see the points in distance vs time using our correct ruler, and the results of the fit in the box above. p0 indicates where we think the release point is (60 ft), p1 corresponds to initial velocity, and p2 corresponds to acceleration divided by two (16). You can see that everything is spot on, as we should expect it to be.

So now, let's apply a distortion, centered at 0 (or at the end of the tracking).


Here you can see that we now think that the first point we tracked occurred at about 62 feet instead of 60. This is expected because we have distorted our ruler so that it's not correct any more except at 0 feet. In addition, our initial velocity is now at 142 fps (96 mph), and the acceleration we measure has nearly doubled to 60 ft/s^2. So we see that in this case, we get a 3% increase in our measured initial position, a 7% increase in our measured initial velocity, and a whopping 88% increase in acceleration! That's huge!
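
One way to see why the acceleration is hit so much harder than the position or the velocity is to differentiate the distortion equation with respect to time. This is a side calculation under the toy model's assumptions (distortion centered at 0 ft, so r is just the position), not something read off the fit itself:

```latex
\begin{align*}
r' &= r\,(1 + k_1 r^2) \\
\dot{r}' &= (1 + 3 k_1 r^2)\,\dot{r} \\
\ddot{r}' &= (1 + 3 k_1 r^2)\,\ddot{r} + 6 k_1 r\,\dot{r}^2
\end{align*}
```

At release (r = 60 ft, velocity -132 fps, acceleration 32 ft/s^2, k1 = 0.00001) the factor 1 + 3*k1*r^2 is 1.108, so the apparent velocity is about -146 fps and the apparent acceleration is about 35.5 + 62.7 = 98 ft/s^2. At r = 0 both reduce to their true values. The extra term that depends on velocity squared is what dominates, and the single acceleration returned by the fit has to land somewhere between 32 and 98, which is consistent with the roughly 60 ft/s^2 quoted above.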

So now what happens if we center the distortion at the other end?



So here now we get the initial release point of the ball about right, but now we are slightly underestimating the initial velocity by only about 2 fps (~1.4 mph), and grossly underestimating the acceleration (5 ft/s^2 instead of 32). Wow. We also think that the final location of the ball is somewhere past 0 feet...at some negative number.

OK, but in reality, this isn't what happens ever. The cameras are centered about at the midpoint of a pitch trajectory. So what happens if our distortion is centered there, as we expect to happen with the cameras?


This is really interesting! We get everything about right, even though we know our ruler is pretty much wrong! Weird. This happens due to the fact that we would overestimate the acceleration on one side of the distortion, and underestimate it on the other side. Our fitting only has room for one acceleration, so it makes the best guess possible, which appears to be equal amounts of over and under estimation and comes out with the inputs being equal to the outputs. Completely by accident!

If I repeat this experiment with a negative k1 factor, I get the same results, but reversed. So in addition to the overestimation or underestimation of spherical distortion of our cameras, the location of the camera axis relative to the midpoint of our tracking makes a huge difference in how much we over or under estimate the acceleration of a given pitch. In reality, we should be close to the midpoint, but being off by a bit, combined with a registration that uses an incorrect value of k1, will impact the measured accelerations. The magnitudes of both effects (the incorrectness of k1 and the distance from camera axis to tracking midpoint) greatly affect the mismeasurement of acceleration.

PITCHf/x summit

I just spent a weekend in beautiful San Francisco talking baseball, visiting with old friends, and making some new friends. I wish it could have lasted longer. Having never been to the Bay Area before, I wasn't sure what to expect. Suffice it to say, I loved everything I saw there. From downtown SF, to AT&T Park, to Half Moon Bay. Just a beautiful place to be, and I can't wait to go back sometime.

The purpose of my visit was to attend the 1st Annual PITCHf/x Summit, an event where sabermetricians, Sportvision, MLBAM, and team reps gathered to discuss the uses of this great new data set, and potential future data sets. I gave a talk largely centered around my previous post on corrections, and how measurements of C_d can be an indicator of data quality and a quantity from which we might be able to derive corrections, perhaps even on a game-to-game basis. I was really impressed with Sportvision's willingness to discuss every aspect of the system, and they gave us the chance to see it in action, which was really key for me in understanding how things work.

I had a few misconceptions that are now clear. For instance, I had thought that the two cameras recorded pitch locations in a synchronous mode, when in fact the recording is asynchronous. They also reinforced some things I had thought, but wasn't sure of. For instance, it's absolutely true that the data near home plate are more accurate than the data near the mound. When they do their camera registrations, there are a whole host of calibration points near home plate, and only a few near the mound. This makes perfect sense for a number of reasons. A) Being accurate near the plate is what really adds value to a broadcast or webcast of a game. Much more so than being as accurate near the mound. B) They don't always have a lot of time to do these calibrations. Groundskeepers run a very tight ship with their fields and getting on the field to do registrations isn't something they can take their time with. I hadn't thought of this before, but I'm not so surprised. While you'd really like to set up a whole grid-like structure in the region between home and the mound, there just isn't always the time to do it. The method they have in place does a fairly good job in most cases, but leaves the mound less constrained than home plate.

I don't know if there's any way around all of that at the time of registration. However, it may be that through measurements of the drag coefficient over the course of a game, it might be possible to fine-tune the registration a little more, and if not correct things on the fly, at least back-correct the data after the game.

Anyway, for those of you that didn't attend the meeting, my slides and the slides of everyone else that presented at the Summit can be found in the link above. Shortly, I'm going to post a little something about some things I left out of my talk on corrections. I left them out because the results I was getting were a bit confusing. But through discussion at the summit, and seeing the system in action at AT&T Park, I think I now know how to properly interpret these results...

Stay tuned.

Some other thoughts:

Ross Paul, of MLBAM, gave a nice talk on the method in place for real-time pitch classification. Analysts have the benefit of 20/20 hindsight when it comes to pitch classification, which is not always something you have when trying to do the job in real-time. Especially when you need your method to be general and contain as little information about the pitcher as possible (because you want it to perform just as well on a minor league call-up as it would with a veteran pitcher). He chose to use an artificial neural net because he believes it to perform better than a decision tree with continuous variables. While I am not an expert in DTs or ANNs, there are people I work with who are. From what I seem to remember, ANNs can be very fickle about useless variables, and DTs less so. I'm going to check on that though. At some point, I may want to try to do my own "pseudo-real-time" classifier with a Decision Tree, if for no other reason than to learn a bit more about them. I also would like to see how big of a gain I can get with boosting a tree (a technique we use here). There are other potential baseball analyses that I'd like to do with a DT (but that I'm not going to mention at this point) or some other machine learning technique, and I think pitch classification would be a great way to start learning about these methods.

HITf/x, a potential new system to track batted ball trajectories, was also discussed at great length. This would be another giant leap forward in available quantitative baseball data. Being a former pitcher, I'm philosophically less interested in such a system than I am in pitched ball trajectories, but as a physicist, it is highly intriguing. Having such a system as a complement to PITCHf/x would be an enormous boon to the field of Sabermetrics.

Tuesday, April 22, 2008

PitchF/X: Accelerations and Corrections, and why you should care.

So I've been wanting to post something about corrections to the 2007 pitchf/x data I've been working on for a while now, and time constraints keep getting in the way. So, here goes.

First let me start by saying that as of now, the only corrections I know of are those defined by Josh Kalk. The method he uses is a very creative one. He essentially uses additive corrections to force the initial parameters of pitches in each park to some league average. That's a really short and simplified explanation. Yes, I know what I just said makes it sound like he makes every pitch have the same trajectory, and no, that's not what he is doing. For each park he calculates an additive correction to be applied to each pitch's initial parameters. His method though leaves me a bit unsatisfied, even though it does seem to improve the quality of the data by some amount. The reasons I am unsatisfied are the following:

A) Josh has made some remarks in some places that leave me confused. First, he changes the initial parameters of each pitch by some amount depending on which park it was thrown in. But, at the same time he claims he likes the data near the plate so much that he doesn't apply any correction to those points. But then this means that there is a disconnect somewhere. He changes the properties of a pitch trajectory. Doing so will most likely alter the final location and velocity of the pitch when you propagate that trajectory in time. So I'm really not sure what he means by this. It seems that he actually has two final locations for each pitch.

B) While his corrections can be applied across any given season, correcting across multiple seasons under this method becomes a very thorny issue. Some pitchers change their mechanics from year to year. Typically pitchers will lose some velocity from year to year. It gets very complicated very quickly. If the good people at Sportvision decide to alter their cameras between seasons by some amount, the values of "league average" can change from year to year, making corrected data from year A not very compatible with year B's corrected data. This is important for multi-year studies, say if you want to develop pitcher aging curves for velocity or such. Or if you want to glean whether or not your favorite pitcher has changed his mechanics in the offseason by looking at the release point.

C) His corrections make no attempt to determine what is actually happening from park to park. Yes, I know that nobody really wants to get into the technical details of what could be going on to cause these things to happen....but someone should. I don't know if Sportvision has the manpower to do the consistency checks necessary to ensure that the data from every park look the same. They have some calibration routine, and I imagine that to them, everything looks fine with their cameras after running the routine. But there is obviously something going on. The people at Sportvision are smart people. Although I generally believe them when they say that their measurements are accurate to within an inch or two, even smart people can overlook things, especially if the thing being overlooked is a subtle thing. Which is what I think is going on at most parks.

I should probably note here that I am not trying to disparage Josh. I think he is a fine upstanding guy, and a good physicist as well as a good sabermetrician. I enjoy reading his commentary at THT, and look forward to more of his articles. I only take issue with his method of corrections. For the most part and for a lot of the analyses out there, corrections aren't really a necessary thing. But to really get the most out of the data, corrections are needed. And those corrections need to be robust enough to handle the issues I raise here for many of the more ambitious projects I can imagine with the pitchf/x data. Perhaps my issues with this are due to my working for too long in an environment where nothing works the way it's supposed to at first, but everyone out there should keep in mind that there are issues with the data when they do an analysis, and they should seriously question whether those issues will or won't affect their findings. It's my feeling that many in the sabermetric world don't ever really consider data quality outside of sample sizes to be an issue. Usually they don't have to. A ball in play is either a hit or it's not, right? But people should at least give it a thought.

So, what could it be that's causing the disparity between parks? One thing I had noted earlier, and posted to my Livejournal here, was that what we'd really like to do to determine what's going on is to measure in the data something we already know. I put forward the idea of measuring the "Coefficient of Drag" as that something. Even though we don't know exactly what the value of Cd is for a baseball, we can at least use it to do consistency checks from park to park. I've shown previously that there are significant differences between some parks in Cd. For the rest of this post, I'm going to focus on the differences between two parks: PETCO Park, and Dodger Stadium. (Previously, on my LJ, I had compared PETCO to Anaheim, but those two don't have many pitchers in common. Two NL stadiums will have many more pitchers that have pitched in both).
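
For readers who want to try this kind of consistency check, here is one common way to back out a drag coefficient from a pitch's fitted velocity and acceleration. It is a sketch under stated assumptions and not necessarily the calculation behind the numbers in this post: the function name `drag_coefficient`, the sea-level air density, the regulation-midpoint ball mass and circumference, and the choice of which velocity along the trajectory to feed in are all assumptions.

```python
import numpy as np

# Assumed constants (not from the post): regulation-ball midpoints, sea-level air.
MASS = (5.125 / 16.0) / 32.174            # slugs (5.125 oz ball)
RADIUS = (9.125 / (2.0 * np.pi)) / 12.0   # ft (9.125 in circumference)
AREA = np.pi * RADIUS**2                  # ft^2, cross-sectional area
RHO = 0.0023769                           # slug/ft^3, standard sea-level air
GRAVITY = np.array([0.0, 0.0, -32.174])   # ft/s^2, with z pointing up


def drag_coefficient(velocity, acceleration, rho=RHO):
    """Estimate Cd from one velocity and acceleration vector (ft/s, ft/s^2).

    Gravity is removed, and the remaining acceleration is projected onto the
    direction opposite to the velocity. Any force perpendicular to the
    velocity (such as the Magnus force) drops out of that projection.
    """
    v = np.asarray(velocity, dtype=float)
    a = np.asarray(acceleration, dtype=float)
    speed = np.linalg.norm(v)
    drag_decel = -np.dot(a - GRAVITY, v) / speed
    return 2.0 * MASS * drag_decel / (rho * AREA * speed**2)
```

Because the air density differs from park to park, passing a park-specific `rho` would matter for any real comparison; the default here is only a placeholder.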

Through some communication with the Sportvision people and Alan Nathan, in which I posited many ideas that could be checked that might be responsible for the Cd discrepancies (and many of those ideas were quickly shot down...they were mostly semi-obvious things....I did mention above that the Sportvision people were smart, right? Yeah, they had thought of these things too. But it's always good to ask.), I learned a great deal about what is going on, and I am still learning more (it's really very fascinating...you should see for yourself).

It's all about the Accelerations.

As it turns out, my first inclinations were totally wrong. Those were primarily that there might be something fishy with either the timing of frames, or that there might be something fishy in the length scales. These would be the easiest things to contribute to a mismeasurement of Cd, but they would also tend to mismeasure other things. Like velocities and positions. By a significant amount. And those kinds of mismeasurements simply aren't seen in the data. At least, not at the levels that would be expected from an incorrect length scale or time scale large enough to explain the differences in Cd. So there's something else going on. Something more subtle.


Almost simultaneously, Alan and I put forth two separate ideas. It turns out that Alan's explains the problem at PETCO almost entirely. Mine may still exist at other parks but it will be more difficult to see I think. Anyway, the two ideas, starting with Alan's:
1. That somehow, there is some distortion in one or both cameras that has been improperly measured.
2. Slightly more complex (and perhaps as a result of 1.), that systematically there are some parts of the trajectory that are not included in the fit to a trajectory. Specifically, in reality, the acceleration a baseball experiences will vary greatly over the flight between the mound and home plate (and this relationship is almost linear with distance travelled). The model used to reconstruct the pitch trajectory is one of constant acceleration. In this model, the acceleration that is found is usually near the mid-point of initial and final accelerations. If somehow points at one end or the other of the pitch's trajectory are being systematically missed at certain parks (tossed out because of not so good agreement between the two cameras for instance), then the acceleration our model finds in the fitting procedure will be shifted up or down because the 'measured' initial or final accelerations won't be included in the mean.
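
The second idea can be illustrated with a one-dimensional toy, sketched below. It assumes pure drag with deceleration K*v^2 and no gravity or spin; the value of K, the 0.4 s flight time, and the names `distance` and `fitted_acceleration` are illustrative assumptions, not values taken from the PITCHf/x system.

```python
import numpy as np

K = 0.002        # 1/ft; assumed drag constant, so deceleration = K * v^2
V0 = 132.0       # ft/s; initial speed (90 mph)
DT = 1.0 / 60.0  # s; sampling interval


def distance(t):
    """Closed-form distance travelled for dv/dt = -K*v^2 starting at V0."""
    return np.log1p(K * V0 * t) / K


def fitted_acceleration(t):
    """Constant acceleration implied by a quadratic fit to distance vs. time."""
    p2, _, _ = np.polyfit(t, distance(t), 2)
    return 2.0 * p2


t_all = np.arange(0.0, 0.4 + DT / 2, DT)  # about 0.4 s of flight
half = len(t_all) // 2

a_full = fitted_acceleration(t_all)               # every point kept
a_first = fitted_acceleration(t_all[: half + 1])  # late points missing
a_last = fitted_acceleration(t_all[half:])        # early points missing

print(f"full: {a_full:.2f}  late points dropped: {a_first:.2f}  "
      f"early points dropped: {a_last:.2f}  (ft/s^2)")
```

In this toy, dropping the late points makes the fitted deceleration larger in magnitude, and dropping the early points makes it smaller, because the ball decelerates hardest right after release.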

The nice thing about these two ideas is that when you think about them in relationship to PETCO, it's possible to determine if one or the other dominates. In order to get systematically high values of acceleration at PETCO, if we are missing points in the fit, then those points must be in the latter half of the trajectory, so initial velocities between PETCO and some other stadium (we chose Dodger Stadium) with more reasonable acceleration values should be about the same, but at the plate, pitches should have a much lower final velocity according to PitchF/X. If Alan is correct, and the distortion is greatest near the mound, then the initial velocities at PETCO should be higher than for, in our case, Dodger Stadium, and the final velocities should be about the same.

So to test this, we looked at pitches from Jake Peavy (because who doesn't like watching him pitch?), through June 2007 (more on why later) in both parks. Below are the results: In the top left are the initial velocities of his pitches, the top right are the final velocities, the bottom left is the z value of the release point, and the bottom right is the x value of the release point. Pitches at PETCO are in red, pitches in Dodger Stadium in blue (Dodger Stadium data has been scaled up so that it appears that there are an equal number of pitches at both stadiums):



As you can see, Alan was totally right. The final velocities are almost the same, but the initial velocities are drastically different! But that's not all the evidence that a miscalibrated camera distortion can drastically change the accelerations. First, are we sure that PETCO has a miscalibrated camera distortion? To answer that, Rand Pendleton of Sportvision was kind enough to look into the PETCO camera registration information for us. What he found sealed the deal. At least for PETCO specifically.



In this plot, Rand has plotted the value used for spherical camera distortion (to get an idea of what spherical camera distortion is, think of a fish-eye lens for an extreme example) versus pitch number at PETCO. You can see that sometime during the season, the value for this factor changed dramatically for one camera, but not the other. He then plotted the measured values of Cd for each of these two "epochs".



As you can see, the shift in k1 (the distortion factor for one camera) does in fact correspond to a shift in Cd, and therefore a shift in the measured accelerations by that camera. Neat! It turns out if you simply plot average Cd values by date at PETCO, you can see the same effect:


Here the x axis is simply days after March 1, 2007. The largest values occur roughly through June, and I've fit a straight line through them. Compare this to Dodger Stadium. The error bars on this plot are also extended a bit to capture the total spread in Cd values rather than just the RMS of the mean. Alan has shown that this spread can be accounted for mostly by small fluctuations in measurements, as well as variations in the properties of the baseball.



Again, these values are relatively stable through June, but afterward, it looks like there might be some strange effects going on. For now though, we will just look at what happened through June of 2007 at Dodger and PETCO.

So, now we'd like to quantify the effect at PETCO and correct for it. Here is the method I am going to try:
Since it would appear that there is little difference in the data at home plate between PETCO and Dodger Stadium I am going to start from home plate. At home plate, I will multiply the Y and Z components of acceleration by the ratio of Cd between Dodger Stadium and PETCO (according to the lines I have fit to those data through June, this multiple is about 0.83). I leave the X component alone for the moment, simply because it appears that the release point data in X are OK (and because I believe it is the camera not behind home plate that has the largest disparity in spherical distortion, and it would contribute most to measurements of Y and Z). After 'correcting' the accelerations in this manner, I then propagate the trajectory back to the release point from home plate and get back the new "initial" parameters.
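
The method just described might look something like the following. This is a sketch under assumptions, not the code used for the plots: the function name `back_propagate`, the default release distance of y = 50 ft, and the choice to run the trajectory backward until it reaches that distance (rather than for a fixed flight time) are all hypothetical. It also assumes the pitch travels in the minus y direction with a nonzero, positive y acceleration.

```python
import math


def back_propagate(final_pos, final_vel, accel, scale=0.83, y_release=50.0):
    """Scale the y and z accelerations, then run the trajectory backward
    from home plate until it reaches y = y_release.

    All inputs are (x, y, z) tuples in feet, ft/s, and ft/s^2.
    Returns (initial_pos, initial_vel, corrected_accel).
    """
    ax, ay, az = accel
    ay, az = scale * ay, scale * az  # the x component is left alone
    xf, yf, zf = final_pos
    vxf, vyf, vzf = final_vel

    # Time tau before the plate at which the ball was at y_release:
    #   y_release = yf - vyf*tau + 0.5*ay*tau^2
    disc = vyf**2 - 2.0 * ay * (yf - y_release)
    tau = (vyf + math.sqrt(disc)) / ay

    pos0 = (
        xf - vxf * tau + 0.5 * ax * tau**2,
        y_release,
        zf - vzf * tau + 0.5 * az * tau**2,
    )
    vel0 = (vxf - ax * tau, vyf - ay * tau, vzf - az * tau)
    return pos0, vel0, (ax, ay, az)
```

With `scale=1.0` the function just recovers the original initial parameters, which makes for a handy round-trip check. With a scale below 1, the data at the plate are untouched while the initial speed comes out lower.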

So, let's see what this does for the data on Peavy at PETCO. The plot below is exactly the same as the first plot in this post, but with the above correction method applied:



The agreement here, in initial velocities and release point is almost staggering. Note that in addition to 'fixing' the initial velocities, that we have also, somewhat, 'fixed' the z position of the release point simply by correcting accelerations. I've repeated this for all pitchers with at least 50 pitches in each of these stadiums (through June 2007), and while not all of them show such great agreement in the 'corrected' data, there is a very marked improvement in the data for all of them by applying these corrections. But since this post is going a bit long, I'm not going to show them here, but I'll be happy to show them to anyone that wants to see them.

Now, the task is to try to apply this to all other parks. This is more challenging, because I think we got lucky with the comparison between PETCO and Dodger Stadium. It's not always the case that the final velocities at any two parks are in such good agreement to begin with. So I have to find some way to handle that condition. And there still could be framing issues involved (where we are systematically missing part of the trajectory in the fit), so I'll have to find some way of dealing with that too. And lastly, I think I also got lucky that only one camera had an issue. It could very well have been both, and if both cameras have an issue like this, it could very well be that the correction factor to apply is different for each component, and not so easy to derive as it was here. These are the issues I am looking into these days.

So to recap: For this one special case at least, where I think I understand the issue with the cameras, I can completely derive corrections from a ratio of drag coefficients to make one park look like another in the data. Making one park look like another is really just a check to make sure we are doing what we think we should be doing. In reality, I can choose any value of Cd to be the "real" value, and make corrections from a ratio of the park I want to correct for to the 'real' value I have decided upon. As measurements of Cd advance, no "reference park" is needed...we can just choose the best 'true' value of Cd and correct to that. This type of correction is more robust than a utilization of league averages for all parks because the values we are correcting to can be fixed across seasons with ease...Allowing for instance, corrections to both 2007 and 2008 data such that they are compatible with one another. There are still issues to be resolved however when generalizing this method. It is my hope that as I learn more, (perhaps at the PitchF/X summit!), a way of resolving these issues will become clearer.

While I'm not entirely sure that I can account for all parks with a method like this, or a similar one, I think I'm on the right track. Let me know what you think. Especially if you might have an idea for getting around any of the issues mentioned in the previous paragraph. Or if you think I'm full of crap.

Wednesday, April 16, 2008

Welcome

Welcome to my baseball analysis blog.  Since I want to share some of my baseball analysis with the rest of the world, I decided that it would be better to have a place to do it.  So here we are.  All baseball, all the time.

So who am I?  I'm a physicist, and a ballplayer who is just getting into the field of sabermetrics.  It's a big field, and at this point, I have a long way to go and a lot to read to catch up to the rest of the field.  Confession:  I haven't read the Bill James Abstracts.  And I probably won't for a while.  And that's not because he's a Red Sox guy.  I have however read "The Book", and many an internet article on sabermetrics.  And there are a whole lot of acronyms out there that won't all fit into my head now.

Anyway, being new to this field, I'll probably take a few missteps here and there, and I'm almost sure that I'll do a little needless analysis at some point, replicating or nearly replicating someone else's work simply because I didn't know it existed.  Hopefully I don't, but more and more, it seems that most of the exciting analysis is spread out over the web or in books I would never think to buy (without someone pointing them out to me).  

So if you think I overlooked something at some point, don't be afraid to point it out.  Like I said, I'm just getting started, and trying to figure out how to balance this whole new thing with everything else in my life (like, you know, work).