Tuesday, May 21, 2013

Oklahoma City

This is a very non-baseball post.  This is also a very personal post.  It really doesn't belong here at all, but these words have to come out, and I don't really have another place to put them right now.

It's 2:26 am.  I can't sleep. I'm typing this up on my phone because there is no power here, and I can recharge in my car if I need to. There are at least two helicopters that have been buzzing just north of us since we got home at 10ish. They are not the reason I can't sleep.  My mind is racing in a million different directions right now.  I might cover all of them here, because I might not be able to sleep if I don't.  I may ramble a bit. I may not go in chronological order. I'm not entirely sure of the chronology of all of this myself yet. As this post is mostly for my own sanity, if you happen to be reading this, and are not me, please forgive me any transgressions against basic English composition. If you can't, I'll blame most of them on autocorrect anyways. 

At about 3-ish today (yesterday really I guess...It's hard for me to increment the days without a sleep between them), a big fucking tornado tore through the area immediately to the north of my house. To answer the first round of questions you might have, we are all fine. Our house stands with no visible-in-moonlight damage.  We were all out near the State Capitol when the tornado hit.  I was at a continuing-ed seminar being hosted by the Geophysical Society of Oklahoma City in the Oklahoma history center, while my wife had taken the kids and our babysitter to the state Capitol to inform some of our legislators just how bad some pieces of proposed legislation really are. I really dig how she's not afraid to tell powerful people that they are full of it.  Especially when she's right.  And she's usually right. I also really dig that she is determined to show our kids just how easy it is to do so, and takes them with her on trips like these. 

I'm especially glad of that last fact today. 

You see, most of the buildings at the Capitol complex have some very large basements. Also, this time they happened to be very much away from the destructive parts of the storm.  However, as a matter of policy all people in those buildings were herded into the basements. You can never be too careful with these things.  The destructive power of some tornadoes is matched only by their randomness. A random walk with a sledgehammer.  This one had a big fucking sledgehammer. 

Down in those basements, some people quickly self-segregated. Not by race, or gender or any of the other lines we often misguidedly draw upon society though.  AT&T customers over here, Verizon over there. No, none of us really have any dog in the hunt with regards to the carrier wars. Instead, several users of each cellular network sought out and found those small areas of the large basement that let their signals through.  Even though those signals were delivering data to their users at a rate that would make a pimply-faced teen in 1994 proud of his 2400 baud modem (yes, I had one of those then, and it was outdated then too). 

Slowly the news trickled in. "Holy shit, Moore High school is gone!" I heard someone exclaim. Where this shard of knowledge originated, or its veracity, I have no idea.  A few minutes later the same was said about Westmoore High. "Please don't let the next person say Southmoore High" I said to no one in particular. 

Guess what someone said a few minutes after that?  I don't feel I need to answer that question. 

I should note though that currently I believe all Moore High Schools are probably still standing. I could be very wrong about that. When we finally got out the radio announcers seemed to be focusing on two elementary schools. One of those though is about a half mile north of Southmoore.  

For some reference, I live about a mile south of Southmoore. 

We were down in the basements for what seemed like an eternity.  It was 4 something when they gave us the all clear. I called my wife. Our babysitter's parents were kind enough to welcome us into their home while we tried to figure out whether or not there was a home to go home to and how we would get there with all of the destruction between us and home. And also as we checked up on friends and family, and let those who might have been worried about us know that we were all ok. 

I'm happy to say that so far, as far as we can tell, all of our loved ones are safe. Although there are still some spots of worry on that front. A good friend of ours, a single mother of two autistic children, lives in an apartment complex just off of 19th. A mutual friend of ours out in San Francisco mentioned on Facebook that she heard they were safe, but had not heard directly from her. (Wireless Internet is currently all but useless at my house, so if there have been any updates on her status since 10pm, I have no idea what they are right now).  Another friend of ours, a mother of a 2-year-old, lost her house. She lives in a neighborhood just north of 19th and Santa Fe. I had reported on Twitter earlier that she and her son had ridden out the storm in the bathtub. I'm happy to say that that report was not entirely accurate. Had they done so, there is a very good chance they would not be with us anymore.  Instead, as they were hunkered down in the tub waiting for the storm to hit or pass, she panicked. She put her son in the car and drove away. Thankfully she chose a good direction to go, as her last update on Facebook noted that when she got back to the house after the storm had passed, her tub was filled with debris.  Apparently she got out about 10 minutes ahead of the storm.  Her husband works at OU so we are pretty sure he is fine, and they have some family to stay with for the time being...

Most of the other updates we have gotten through various social media, texts and phone calls have all been positive.

Staying put is hard. 

Staying here when you know there are still kids missing in a demolished elementary school is gut wrenching. Going however may do more harm than good. I'm not trained in any life saving anything. (Note to myself #1:  change that).  The road just leading to my neighborhood is crazy clogged with cars. Adding to that might be adding to the time it takes someone to get to medical professionals.

I don't think I'll be going to work tomorrow. Instead we'll try to meet up with other members of a local group we belong to and do whatever we can find that needs doing. 

I must have said enough. I'm feeling a bit sleepy. I have other thoughts. Even some ideas that might be useful in the future. But these can wait. At least until tomorrow.  

For now, be safe.  Hug your loved ones tight and let them know in no uncertain terms that they are loved. Also, hug a stranger.  They might need one too. 

Saturday, April 13, 2013

Part 6: More thoughts

I thought I would post a little note explaining why I am approaching corrections in the manner I am currently pursuing.  Way back in 2008, when I originally started messing around with PITCHf/x data, I noticed a curiosity that I spoke a bit about then: measurements of the coefficient of drag (Cd) varied by quite a bit at some parks.  This absolutely should not happen.  While at least one variation in Cd was traced back to an error in the spherical distortion parameter on a camera at PETCO Park in 2007, my conversations with some of the people at Sportvision at the first PITCHf/x summit in 2008 led me to believe that for the most part, they had a good handle on these spherical distortion parameters.  I tried a few things that used the error in Cd to estimate correction factors, but those were utter failures.  Not long after that, my attention turned away from PITCHf/x for a while.  At the time, Josh Kalk had derived some correction factors that simply added a constant to each pitch parameter based on the park it was thrown in.  Mike Fast had used a variation of this method to correct initial and final positions at the game level (but to my knowledge never applied this method to velocities or accelerations).  I was never too satisfied with this method, although on some level it did work.  The main reason I found it unsatisfactory was that it was very difficult to find a justification for adding a constant value to velocities and accelerations.

So when I decided to come back to PITCHf/x I noticed two things:  1) Josh's correction method never really gained widespread adoption, possibly because it was pulled down when he was hired by the Rays, and 2) the calibration issues still had not been completely eliminated.  It was Jon Roegele's article detailing issues with horizontal movement at Tampa Bay that got me thinking about this again.  And I thought that if we could describe the systematic issues as a transformation of the world coordinates, then we could potentially derive corrections that have a justification behind them.  So the simplest case to look at first was an affine transform.  This is the class of linear transformations that consist of rotations, translations, scaling and shearing.  Specifically, I thought there might be something to a shear transform.  It has a few properties that could fit the problem well.  First, if you had a miscalibration that produced a shear that was entirely due to a tilt of the z-axis, then a large fraction of the calibrations would come out looking just fine to the user.  It would only be when re-calibrating the z-axis that this would possibly get noticed at all.  Secondly, this same kind of shear could also conceivably be responsible for the differences I saw across parks in the coefficient of drag.  If one measured gravity in a coordinate system that had a z-axis shear, then gravity would have an x or y component to it.  And it turns out it wouldn't take much to make an impact, as the typical drag force on a fastball is somewhere in the ballpark of 1g.
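To put a rough number on that last point, here is a minimal sketch of what a z-axis shear does to gravity.  Everything in it is an assumption made for illustration: the 2 degree tilt is hypothetical, the shear is taken to act directly on the true acceleration vector, and `z_axis_shear` is just a helper name invented here.

```python
import numpy as np

G_FT_S2 = 32.174  # standard gravity, ft/s^2 (PITCHf/x works in feet)


def z_axis_shear(tilt_x_deg, tilt_y_deg=0.0):
    """Shear matrix that keeps the x/y plane fixed and tips the z-axis over."""
    sx = np.tan(np.radians(tilt_x_deg))
    sy = np.tan(np.radians(tilt_y_deg))
    return np.array([[1.0, 0.0, sx],
                     [0.0, 1.0, sy],
                     [0.0, 0.0, 1.0]])


tilt_deg = 2.0  # hypothetical tilt, chosen only for illustration
shear = z_axis_shear(tilt_deg)

true_gravity = np.array([0.0, 0.0, -G_FT_S2])
# Gravity as the sheared coordinate system would report it
measured_gravity = shear @ true_gravity

# Fraction of g that leaks into the horizontal (x) direction
horizontal_fraction = abs(measured_gravity[0]) / G_FT_S2
print(measured_gravity, horizontal_fraction)
```

With that hypothetical 2 degree tilt, about 3.5% of g shows up as a horizontal acceleration (the fraction is just the tangent of the tilt angle), while the vertical component is unchanged.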

So that was why I tried an affine transform first.  It almost worked.  But not quite.

But in trying I was able to easily modify my code to the method I mentioned in part 5.  In thinking more about what I did the other day, I keep coming back to basically where I was in 2008.  That it's really the accelerations that matter (those are the only matrix elements that are more than 5% different from the identity matrix...actually, one other velocity component was significant too).  Allow me to litter the screen a little bit with some plots now:  These are the results of this fit using only the left-handed pitcher, Chen.  Again, in these plots green=Tropicana, red=Camden, blue=Tropicana made to look like Camden:

Chen's Trop pitches look much more like his pitches at Camden, and the final locations don't move a whole lot.  Hellickson's pitches mostly look better, but the final locations seemed to move in the wrong direction.

Let's go the other way, using only Hellickson:

Now the situation is reversed:  The final locations of Chen's pitches look terrible, but Hellickson's don't move much.

For reference then, from my previous post, when you basically average those two corrections (include both pitchers in the fit):

I think this brings us back to that place I didn't want to go.  Spherical Distortion.  I think that not only is it necessary here to correct the accelerations, but that that correction will actually be release-point dependent.  (Although we might be able to call the average of a righty/lefty correction a close enough approximation.)  Also, it's entirely possible that we could still limit ourselves to 9 parameters plus a small number of other parameters.  It's likely that there is one Y location that is close to being a fixed point.  If we can find that location to apply corrections to acceleration only, we likely don't need all the other terms.  I'm not sure if it's possible to find that though.  Actually, it need not be spherical distortion, but some sort of effect which makes the PITCHf/x coordinate system a little bit non-uniform.  Spherical distortion is simply the first cause of that I can imagine.  There may be others.

How to implement something like that without exposing ourselves to smoothing over changes in performance is something I need to think about.  I had ideas before that were dependent on the correction factors not being release point dependent.  This kind of blows some of them out of the water for a moment.

Thursday, April 11, 2013

Part 5.

Previously, I had been attempting to derive and apply corrections to PITCHf/x data under the theory that an affine transformation of the PITCHf/x coordinate system in a given stadium (Tampa Bay in the example I was using) would approximate the transformation from the PITCHf/x coordinate system in that stadium to the PITCHf/x coordinate system in another (Baltimore).  This required that I derive 9 parameters, and that the 9x9 application 'matrix' to a vector of pitch parameters be block diagonal with the same 3x3 block on the diagonal.
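As a minimal sketch of what that block diagonal structure means in practice: the same 3x3 block acts separately on the position, velocity and acceleration triples of a pitch.  The ordering of the 9-element pitch vector, the numbers in the block and in the example pitch, and the helper name `block_diagonal_operator` are all assumptions made for illustration, not values from any fit.

```python
import numpy as np


def block_diagonal_operator(block):
    """Build the 9x9 operator that applies one 3x3 block to each
    (x, y, z) triple: position, velocity and acceleration."""
    return np.kron(np.eye(3), block)


# Hypothetical 3x3 block: a small shear of the z-axis toward x
a3 = np.array([[1.0, 0.0, 0.05],
               [0.0, 1.0, 0.00],
               [0.0, 0.0, 1.00]])
op9 = block_diagonal_operator(a3)

# Hypothetical pitch, assumed ordering: position (ft), velocity (ft/s),
# acceleration (ft/s^2)
pitch = np.array([-2.0, 50.0, 6.0,
                  5.0, -130.0, -5.0,
                  -10.0, 28.0, -20.0])
corrected = op9 @ pitch
print(corrected)
```

Only the 9 entries of `a3` are free parameters here.  Letting the three diagonal blocks differ from each other is what raises the count to 27.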

It turns out that this approximation can either closely correct for movement OR can closely correct for release point.  But not both at the same time, and is thus insufficient for our needs.

In an email exchange, Alan Nathan suggested* that I should instead allow each 3x3 block to be independent.  I both like and dislike this suggestion at the same time.  First, by going from 9 parameters to define a correction to 27, we run the very real risk of overfitting the data, so we are going to need a strategy to deal with that.  It is very likely that many of the terms will be superfluous.  So we are going to want to cut down as much as we can on the number of significant terms.  Secondly, this now means that we are explicitly trying to approximate a nonlinear transformation with a linear transformation.  That's not so bad in and of itself...the problem could very well be locally linear, and thus a close enough approximation to use....but it makes me a little uneasy.

So naturally I went ahead and tried it.

I expected some overfitting, and the following figure is what my toy model spit out.  In this figure (and all following), green dots represent pitches thrown on August 4, 2012 at Tropicana, red dots represent pitches thrown July 24, 2012 in Baltimore, and blue dots represent pitches thrown at Tropicana, "corrected" by this model. (click to embiggen...if it's working as I think it should)

We seem to do somewhat better.  Both movements and release points are at least mostly centered in the right places, with some possible nits to pick on Chen's slider and Hellickson's curve (but not as bad as before).  However the locations of pitches seem to move quite a bit, and in a manner that is not easily discernible.  Also, it is especially apparent that Hellickson's "corrected" release point (at 50 ft), and Chen's too, have much smaller variances than the raw data at either park.  Both the effect on release and final position are likely relics of overfitting.  They possibly also arise because I am creating a mapping between each pitch of a given type in one park to every other pitch of that type in the other park.

So how can we reduce this overfitting?  Well, one common method is through regularization.  What we have been doing all along is solving a least squares problem that finds the parameters of our transformation matrix that minimize a cost function:  The sum of the squared differences of our corrected pitch parameters and the mapped pitch parameters in the park we are correcting to.  To regularize this problem we simply add a competing term to our cost function.  There are two somewhat obvious choices for a regularization term that I can pull off the top of my head.  The first comes from the argument that we should expect our transformation to be not too different from the identity matrix (no correction), because we expect (and want) our corrections to be small.  In this case we would apply a penalty to the cost function that is proportional to the squared difference between our transformation matrix parameters and the identity matrix.  The second choice we could make would add a penalty term proportional to the square of the distance between the corrected and raw final positions.  This would serve to ensure that the changes in final positions were small.

As a first test, I chose to go with the first choice for regularization (in no small part because it is much easier to code).
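For reference, a penalty that pulls the parameters toward the identity matrix has a closed-form solution.  What follows is a minimal sketch of that idea, not the code behind the plots in this post: the function name `fit_toward_target`, the penalty weights and the synthetic data are all assumptions made for illustration.  It minimizes ||X a - y||^2 + lam * ||a - a_target||^2, which leads to (X^T X + lam I) a = X^T y + lam a_target.

```python
import numpy as np


def fit_toward_target(X, y, a_target, lam):
    """Least squares with a penalty lam * ||a - a_target||^2.

    Setting the gradient to zero gives
    (X^T X + lam * I) a = X^T y + lam * a_target.
    """
    n_params = X.shape[1]
    lhs = X.T @ X + lam * np.eye(n_params)
    rhs = X.T @ y + lam * a_target
    return np.linalg.solve(lhs, rhs)


# Synthetic stand-in data (hypothetical, just to exercise the function)
rng = np.random.default_rng(0)
a_identity = np.eye(3).ravel()          # flattened 3x3 identity block
a_true = a_identity + 0.05 * rng.standard_normal(9)
X = rng.standard_normal((60, 9))
y = X @ a_true + 0.5 * rng.standard_normal(60)

a_unreg = fit_toward_target(X, y, a_identity, lam=0.0)   # plain least squares
a_reg = fit_toward_target(X, y, a_identity, lam=25.0)    # moderate penalty
a_stiff = fit_toward_target(X, y, a_identity, lam=1e9)   # essentially "no correction"

print(np.linalg.norm(a_unreg - a_identity), np.linalg.norm(a_reg - a_identity))
```

The penalty weight `lam` is the knob: at zero this is ordinary least squares, and as it grows the fitted parameters are pulled back toward the identity.  For the 27 parameter case the target would be three flattened identity blocks stacked together instead of one.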

So here are the key results of the same 27 parameter fit with a regularization term designed to keep the transformation close to the identity matrix (in the same format as before):

That seems much better.  There is still a potential nit to pick on Chen's slider, but otherwise this looks somewhat promising.

I also ran this "correction" scheme by calculating based on only one pitcher or the other.  Surprisingly, the correction then for the pitcher that was left out was not horrible.  It wasn't as nice-looking, primarily with respect to the final positions of pitches.  But the movements and velocities were mostly there.  I have a feeling that in this case it has more to do with the fact that I am using one righty and one lefty.  I expect that a matrix computed using only pitchers from one side or the other will be a poorer approximation of "truth" than if at least one pitcher of each handedness is included.  (If you want to see the plots from leaving one or the other out, let me know...I'm leaving them out in the interest of space...4 more plots like the ones above.)

I still need to run a few other sanity checks, because I'm still a bit afraid of overfitting and other things.  If they don't show everything going to the crapper, then I think from here I may at least have a framework from which to start working out corrections that cover an entire season.  I'd also like to try out the other regularization method I mentioned above.

We'll see.

Let me know what you think, either in comments, or via email. ( ike.hall@gmail.com )

*Suggestion is probably not the most appropriate descriptor of Alan's email.  Actually, he seemed to be slightly misinterpreting what I was previously doing (probably due to my writing it out poorly) as requiring 27 parameters when in fact I was only requiring 9.  However, without that exchange I probably would have gone down a much different path before this one, because it just feels like too damn many free parameters.  So, while it may not have been (and probably wasn't) his intention to suggest this, I went ahead and took it as such.

Thursday, March 28, 2013

Corrections, Part 4: Methods and things

In my previous post, I showed a few plots representing the effect on movement and velocity only of applying a linear-operator-as-park-corrector to "correct" one game in Tampa to one game in Baltimore. This operator showed some improvement, but also created at least one more problem.  Specifically, it overcorrected the horizontal movement of Jeremy Hellickson's curveball to be much less than it should have been.  In that post, I went into almost no detail about how this operator was derived.  That's what this post is for.

The details so far are as follows (and note that there are many here that may change in the future).  The goal is to derive one operator that can be applied to positions, velocities and accelerations.  At this point, I apply that operator somewhere along the pitch trajectory, then propagate the pitch backwards to 55 ft to get the initial velocity and release parameters.  That's the plan for the use of this operator.
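The propagation step can be sketched as follows, assuming the usual PITCHf/x description of a pitch as a constant-acceleration trajectory.  The function names and the example numbers are hypothetical, and the root selection assumes the ball is moving toward the plate (negative y velocity).

```python
import numpy as np


def time_to_reach_y(y0, vy, ay, y_target):
    """Time offset at which a constant-acceleration pitch is at y_target.

    Solves 0.5*ay*t^2 + vy*t - (y_target - y0) = 0 and returns the root
    nearest t = 0, assuming vy < 0 (ball travelling toward the plate).
    """
    dy = y_target - y0
    if abs(ay) < 1e-12:
        return dy / vy
    disc = vy * vy + 2.0 * ay * dy
    if disc < 0.0:
        raise ValueError("trajectory never reaches y_target")
    return (-vy - np.sqrt(disc)) / ay


def propagate_to_y(pos, vel, acc, y_target=55.0):
    """Move a pitch state along its trajectory to the plane y = y_target."""
    t = time_to_reach_y(pos[1], vel[1], acc[1], y_target)
    new_pos = pos + vel * t + 0.5 * acc * t * t
    new_vel = vel + acc * t
    return new_pos, new_vel, t


# Hypothetical pitch state at y = 50 ft (ft, ft/s, ft/s^2)
pos = np.array([-2.0, 50.0, 6.0])
vel = np.array([5.0, -130.0, -5.0])
acc = np.array([-10.0, 28.0, -20.0])

pos55, vel55, t_back = propagate_to_y(pos, vel, acc, y_target=55.0)
print(pos55, vel55, t_back)
```

Because 55 ft is behind the 50 ft starting point in this example, the time offset comes out negative; the acceleration itself does not change along the trajectory.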

Recall in my first post the form I used to write down what we wind up seeing from PITCHf/x at any two given parks:

Here p0 represents an identical pitch as it leaves the hand of a given player at either stadium S or stadium T.  The GRG^-1 term is the correction that must be applied to account for a difference in air density: R is a diagonal matrix with rho/rho_0 on the diagonal elements that correspond to ax, ay, and az, where rho is the air density at that particular park and rho_0 is our standard air density.  T and S are the (unknown) operators that transform an actual pitch that was thrown into the data we receive at their respective stadiums, and dt and ds are simply the data we receive.

Since the first toy experiment was designed to ignore the effects of air density, we can simplify to write:
dt = T p0
ds = S p0

If then we have a situation where the same pitch is recorded by our two different systems, T and S, then we can say:

p0 = T^-1 dt
and thus
ds = S T^-1 dt

Since both S and T are unknown, we can only solve for the combination S T^-1, or T S^-1 explicitly.  We'll call S T^-1 = A.  This A is the operator I am solving for in my toy experiment.

Now we simply assume that fastballs thrown by a given pitcher in one park are equivalent to fastballs he threw in the other park.  And the same for changeups, curves, etc, etc.  There are a few ways to approach this.  For now, I simply went with the brute force approach.  We "assume" that each pitch in a category thrown at one park maps exactly to each pitch in the same category thrown at the other park. (This actually works out to be nearly identical to using averages of the parameters, but with a weighting that prefers larger clusters, and although it seems like it should, it does not significantly add to computational complexity).  Doing so gives a set of equations of the following form for each mapping defined

a11*xt + a12*yt + a13*zt = xs
a21*xt + a22*yt + a23*zt = ys
a31*xt + a32*yt + a33*zt = zs

where xs, ys, zs can represent acceleration components for pitches thrown in S and similar for pitches thrown in T.  Note though: We could also use velocities.  And release points.  Also note, in the case of accelerations, it turns out that when we cannot ignore air density, there is a 4th translation term in those equations, and it is proportional to the difference in air density between the two parks...but since we are ignoring air density for the time being, that term has been dropped.  You'll also note here that I have left out any scale factors.  For the moment, I am limiting myself to affine transformations without translations.
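The brute force pairing described above (every pitch of a type in one park mapped to every pitch of that type in the other) can be sketched like this.  The function name, the array layout and the pitch-type labels are assumptions for illustration only.

```python
import numpy as np


def pair_within_type(params_t, types_t, params_s, types_s):
    """Pair every park-T pitch with every park-S pitch of the same type.

    params_* are (n_pitches, 3) arrays (e.g. acceleration components),
    types_* are arrays of pitch-type labels.  Returns two arrays with one
    row per mapping.
    """
    rows_t, rows_s = [], []
    for pitch_type in sorted(set(types_t) & set(types_s)):
        idx_t = np.flatnonzero(types_t == pitch_type)
        idx_s = np.flatnonzero(types_s == pitch_type)
        # Cartesian product of the two index sets
        tt, ss = np.meshgrid(idx_t, idx_s, indexing="ij")
        rows_t.append(params_t[tt.ravel()])
        rows_s.append(params_s[ss.ravel()])
    return np.vstack(rows_t), np.vstack(rows_s)


# Tiny hypothetical example: 3 pitches in park T, 4 in park S
types_t = np.array(["FF", "FF", "CU"])
types_s = np.array(["FF", "CU", "CU", "FF"])
params_t = np.arange(9, dtype=float).reshape(3, 3)
params_s = 10.0 + np.arange(12, dtype=float).reshape(4, 3)

pairs_t, pairs_s = pair_within_type(params_t, types_t, params_s, types_s)
print(pairs_t.shape, pairs_s.shape)
```

A cluster with m pitches in one park and n in the other contributes m*n mappings, which is where the weighting toward larger clusters comes from.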

Next we just rewrite this set of 3 equations for each pitch mapping in matrix form.

Xa = y

where X is a 3N x 9 matrix (N here represents the number of pitch mappings we have, not the number of pitches thrown in each park.  In my toy model I have something like 5000 mappings) composed of pitch parameters for each pitch thrown in park T (Tampa in my previous post) and y is a 3N vector composed of pitch parameters for each pitch thrown in park S (Baltimore in my previous post), and a is a vector composed of the elements of A.  This is the classic inverse problem with the well known least squares solution of:
a = (X^T X)^-1 X^T y
(although computing this without loss of precision can be a bit of a headache, but there are ways around that)
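A minimal sketch of setting up and solving this system with NumPy follows.  The synthetic data and the helper name `build_system` are hypothetical.  It uses `np.linalg.lstsq`, which solves the least squares problem without explicitly forming (X^T X)^-1; that is one of the ways around the precision headache, though not necessarily the one used for the results in these posts.

```python
import numpy as np


def build_system(pairs_t, pairs_s):
    """Stack N pitch mappings into X (3N x 9) and y (3N,).

    Each mapping says A @ p_t = p_s, with a = A flattened row by row.
    """
    n = pairs_t.shape[0]
    X = np.zeros((3 * n, 9))
    for row in range(3):
        # Equation `row` of every mapping uses matrix elements a[3*row : 3*row+3]
        X[row::3, 3 * row:3 * row + 3] = pairs_t
    y = pairs_s.reshape(-1)
    return X, y


# Synthetic mappings generated from a known, hypothetical operator
rng = np.random.default_rng(1)
A_true = np.array([[1.0, 0.0, 0.10],
                   [0.0, 1.0, 0.02],
                   [0.0, 0.0, 1.00]])
pairs_t = rng.standard_normal((200, 3))
pairs_s = pairs_t @ A_true.T + 0.01 * rng.standard_normal((200, 3))

X, y = build_system(pairs_t, pairs_s)
a, *_ = np.linalg.lstsq(X, y, rcond=None)
A_fit = a.reshape(3, 3)
print(A_fit)
```

On this well-behaved synthetic data the result agrees with the normal-equations formula; the difference only starts to matter when X^T X is close to singular.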

Now, my first rounds of doing this were remarkably unstable.  So to stabilize things, I also included in X and y the initial velocity components of the pitch, and the results of that produced the movement plots you saw in the previous post.  As it was late when I got around to making plots, I didn't check the release points, or I would have shown you how badly they got screwed up.  The separation in release points got bigger.  In my tests so far today, I included release point data, and that seemed to fix the release points, but also let a little bit of the original movement separation back in.  However it's progress.  I should also note that the matrices looked almost the way I thought they might look.  More like a shear mapping than anything else.  Imagine the x/y plane staying mostly fixed and the z-axis falling over a little bit in some direction.  Also, this was not a large tilt by any stretch of the imagination.  About 6 degrees.  It's hard to notice a picture frame that's off by 6 degrees.

So anyway, that's where all this is at right now.  It's not as pretty as I hoped, and I'm not showing plots now because I don't feel like wasting space with them, but it's not a complete loss yet.  (If you'd like to see more plots, just let me know.  I may make a big plots post.  I've got about a million of them now.) I may have to open things up and compare more than just two games.  Two series maybe.  We'll see.  I may also need to open it up to the possibility of a projective transformation on the coordinate system.  I hope not though.  There's still more to try, and I'm encouraged that at least a few things are starting to converge.

Corrections part 3: Almost there...maybe?

In my last two posts, I talked about the idea of applying corrections as a linear operator on the pitch parameters, with the trick being how to derive said operators.  I've tried a few methods out, and one of them has shown a hint of promise.  There is still more work to be done, as will be shown in a minute, but the initial findings leave me hopeful that I am on the right track.

I wanted to make things as simple as possible for this first test, so I took a stadium that is known to deviate from the league norm strongly in at least one observable parameter.  Tampa Bay in 2012 fit that bill, as was described by Jon Roegele here.  Next, I wanted to find 2 games that fit the following criteria:
i - The same starting pitchers went against each other in each game.
ii - One game was played at Tropicana Field, and the other at some other park.
iii - Each starting pitcher went long enough to throw at least 80 pitches.
iv - The 'some other park' of condition ii was at an elevation near Tropicana's (15 ft).

That last requirement was made so that I would, for now, be able to mostly ignore the effects of air density.  We'll tackle that problem at a later date, although my cursory glance at that problem so far indicates that the air density problem could be as much of a help to us as it is also a headache for us...it may not be too...it depends on a few other things that I haven't completely worked out yet, so we'll just leave it alone for now.

The easiest choices to go hunting around for games that meet those criteria then are games played within the American League East.  Except maybe Toronto (~300 ft).

New York and Boston however both seem to have some interesting effects of their own.  Perhaps we should leave them alone for now.  So how about Baltimore.  As it turns out, there weren't a whole lot of pairs of games in 2012 between the Rays and the O's in which the same two pitchers dueled it out into the late innings.  But there was at least one pair that met my conditions.
Those two games occurred within 2 weeks of each other as well.
On July 24 Jeremy Hellickson and Wei-Yin Chen both pitched into the 7th at Camden Yards in a 3-1 Rays victory.  11 days later those two met again in Tampa in a game that saw Chen throw 7 shutout innings in a 4-0 Orioles win.  Hellickson only lasted 4 innings in that game, but also managed to throw 88 pitches in those 4 innings.

Below are plots of vertical movement vs. horizontal movement (left) and velocity vs. horizontal movement (right) for each pitcher.  Pitches thrown in Tampa are represented with blue dots, while pitches thrown in Baltimore are in red.

As you can see, movements appear to be shifted to the right, and that shift seems to grow as we go farther to the right.  Also, relative to Baltimore, Tampa might be giving a very slight boost to fastball velocities.

Next, through a least-squares method, I derived a linear operator that would supposedly take the Tampa pitches and make them more like the Baltimore pitches.  As it's getting very late here, I will save the discussion of how I derived this (and talk about improvements that can be made) for tomorrow.  For now, I'm just going to show the plots, talk briefly about them (even though I think it's very obvious what this "correction" has done well and what it has not), and then go to bed.  So without further ado...

So as you can see, for the most part, the Tampa clusters are closer to their Baltimore counterparts.  The glaring exception is Hellickson's curveball.  It's clearly been overcorrected.  Instead of having more horizontal movement away from a righty as before, it now has less.  Also note that Chen's slider and curveball both seem to have lost a little bit more velocity than they probably should have.

So there are the results of my first stab at this.  The positives:  For the fastballs at least, I'm pretty happy with the results.  The negatives:  Well, it's not perfect.

A little later I'll talk about how I got to this point, and where I can go from here (I have not even come close to exhausting all options on this method).  But I wanted to get this up at least, because hey, this is kinda cool.

Friday, March 22, 2013

Thursday, March 21, 2013


It has been brought to my attention that the comments here are broken.  I'm trying to fix it.  Please bear with me.  Also, feel free to test that by posting a comment.  I can't seem to fix that right now, so send comments on the twitter to @TweeterDeMonkey.