I just spent a weekend in beautiful San Francisco talking baseball, visiting with old friends, and making some new friends. I wish it could have lasted longer. Having never been to the Bay Area before, I wasn't sure what to expect. Suffice it to say, I loved everything I saw there, from downtown SF, to AT&T Park, to Half Moon Bay. It's just a beautiful place to be, and I can't wait to go back sometime.
The purpose of my visit was to attend the 1st Annual PITCHf/x Summit, an event where sabermetricians, Sportvision, MLBAM, and team reps gathered to discuss the uses of this great new data set, and potential future data sets. I gave a talk largely centered around my previous post on corrections, and how measurements of C_d can be an indicator of data quality and a quantity from which we might be able to derive corrections, perhaps even on a game-to-game basis. I was really impressed with Sportvision's willingness to discuss every aspect of the system. They also gave us the chance to see it in action, which was really key for me in understanding how things work.
I had a few misconceptions that have now been cleared up. For instance, I had thought that the two cameras recorded pitch locations in a synchronous mode, when in fact the recording is asynchronous. The Sportvision folks also confirmed some things I had suspected but wasn't sure of. For instance, it's absolutely true that the data near home plate are more accurate than the data near the mound. When they do their camera registrations, there are a whole host of calibration points near home plate, and only a few near the mound. This makes perfect sense for a number of reasons. A) Being accurate near the plate is what really adds value to a broadcast or webcast of a game, much more so than being as accurate near the mound. B) They don't always have a lot of time to do these calibrations. Groundskeepers run a very tight ship with their fields, and getting on the field to do registrations isn't something they can take their time with. I hadn't thought of this before, but I'm not so surprised. While you'd really like to set up a whole grid-like structure in the region between home and the mound, there just isn't always the time to do it. The method they have in place does a fairly good job in most cases, but leaves the mound less constrained than home plate.
I don't know if there's any way around all of that at the time of registration. However, through measurements of the drag coefficient over the course of a game, it might be possible to fine-tune the registration a little more, and if not correct things on the fly, at least back-correct the data after the game.
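For anyone wondering how a drag coefficient comes out of the data in the first place, here is a rough sketch of one way to estimate C_d for a single pitch. It assumes the nine-parameter constant-acceleration fit (initial velocity and acceleration components in ft/s and ft/s^2, z pointing up), removes gravity, and takes the part of the remaining acceleration that opposes the velocity as drag, since the Magnus force acts perpendicular to the velocity. The ball mass, ball radius, and default air density are nominal values I am assuming here, and the function name is made up for illustration. This is not Sportvision's procedure, and it is not the correction method from my talk.

```python
import math

FT_TO_M = 0.3048
BALL_MASS_KG = 0.145      # assumed nominal mass (about 5.125 oz)
BALL_RADIUS_M = 0.0369    # assumed nominal radius (about 9.125 in circumference)
GRAVITY_FT_S2 = 32.174


def drag_coefficient(vx0, vy0, vz0, ax, ay, az, air_density=1.2):
    """Estimate C_d from one pitch's fitted velocity and acceleration.

    Velocities are in ft/s and accelerations in ft/s^2, with z up.
    air_density is in kg/m^3; 1.2 is an assumed sea-level-like value.
    """
    # Remove gravity, which contributes -g to the fitted z acceleration.
    a = (ax, ay, az + GRAVITY_FT_S2)
    v = (vx0, vy0, vz0)
    speed = math.sqrt(sum(c * c for c in v))

    # Drag opposes the velocity, so project the non-gravitational
    # acceleration onto the direction opposite to v.
    a_drag = -sum(ac * vc for ac, vc in zip(a, v)) / speed

    # Convert to SI and apply F_drag = 0.5 * rho * A * C_d * v^2.
    speed_m = speed * FT_TO_M
    a_drag_m = a_drag * FT_TO_M
    area = math.pi * BALL_RADIUS_M ** 2
    return 2.0 * BALL_MASS_KG * a_drag_m / (air_density * area * speed_m ** 2)
```

Because the fit treats the acceleration as constant over the whole flight, the number this returns is best read as an average over the trajectory. The real air density depends on temperature, altitude, and humidity, so a fixed value like the default here will bias the estimate from park to park.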
Anyway, for those of you who didn't attend the meeting, my slides and the slides of everyone else who presented at the Summit can be found in the link above. Shortly, I'm going to post a little something about some things I left out of my talk on corrections. I left them out because the results I was getting were a bit confusing. But through discussion at the Summit, and seeing the system in action at AT&T Park, I think I now know how to properly interpret these results...
Stay tuned.
Some other thoughts:
Ross Paul, of MLBAM, gave a nice talk on the method in place for real-time pitch classification. Analysts have the benefit of 20/20 hindsight when it comes to pitch classification, which is not always something you have when trying to do the job in real time, especially when you need your method to be general and contain as little information about the pitcher as possible (because you want it to perform just as well on a minor league call-up as it would with a veteran pitcher). He chose to use an artificial neural net because he believes it performs better than a decision tree with continuous variables. While I am not an expert in DTs or ANNs, there are people I work with who are. From what I seem to remember, ANNs can be very fickle about useless variables, and DTs less so. I'm going to check on that though. At some point, I may want to try to do my own "pseudo-real-time" classifier with a decision tree, if for no other reason than to learn a bit more about them. I also would like to see how big of a gain I can get from boosting a tree (a technique we use here). There are other potential baseball analyses that I'd like to do with a DT (but that I'm not going to mention at this point) or some other machine learning technique, and I think pitch classification would be a great way to start learning about these methods.
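To give a feel for what a decision tree does with continuous variables, here is a bare-bones sketch of a Gini-impurity tree in plain Python. It is not the MLBAM method (that one is a neural net), and it is not a classifier I have actually built or tested on real pitches. The function names, the three features (speed in mph, horizontal and vertical break in inches), and every number in the toy table are invented purely for illustration. It also leaves out boosting entirely.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 means pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Find the (feature, threshold) pair with the lowest weighted impurity."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        # Candidate thresholds sit midway between neighboring values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [lab for r, lab in zip(rows, labels) if r[f] <= t]
            right = [lab for r, lab in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Return a leaf label, or a node (feature, threshold, left, right)."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(set(labels)) == 1:
        return majority
    split = best_split(rows, labels)
    if split is None:  # all rows identical, nothing left to split on
        return majority
    _, f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return (
        f,
        t,
        build_tree([rows[i] for i in left], [labels[i] for i in left],
                   depth + 1, max_depth),
        build_tree([rows[i] for i in right], [labels[i] for i in right],
                   depth + 1, max_depth),
    )


def classify(tree, row):
    """Walk from the root to a leaf and return that leaf's label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


# Invented toy pitches: (speed mph, horizontal break in, vertical break in).
TOY_ROWS = [
    (93.1, -5.2, 9.8), (94.5, -4.1, 10.5), (91.8, -6.0, 8.9),
    (77.2, 5.5, -6.1), (75.9, 4.8, -7.3), (78.4, 6.1, -5.0),
    (83.0, -7.5, 4.9), (84.2, -8.1, 5.6), (82.1, -6.9, 4.2),
]
TOY_LABELS = ["FF", "FF", "FF", "CU", "CU", "CU", "CH", "CH", "CH"]

toy_tree = build_tree(TOY_ROWS, TOY_LABELS)
```

Calling `classify(toy_tree, (94.0, -5.0, 9.5))` walks the tree for one new pitch. Real pitches overlap far more than these tidy clusters, which is exactly where the choice between a tree, a boosted tree, and a neural net would start to matter.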
HITf/x, a potential new system to track batted ball trajectories, was also discussed at great length. This would be another giant leap forward in available quantitative baseball data. Being a former pitcher, I'm philosophically less interested in such a system than I am in pitched ball trajectories, but as a physicist, I find it highly intriguing. Having such a system as a complement to PITCHf/x would be an enormous boon to the field of sabermetrics.
Monday, May 12, 2008

2 comments:
Ike,
Let me know if you have any success with the decision trees; they are definitely a good way to attack pitch classification. If you are planning on learning more about the various machine learning techniques, I highly recommend the excellent book by Tom Mitchell called "Machine Learning". It's a fabulously dense book which gives great overviews of many techniques.
Thanks Ross! I will let you know what happens with the DTs. I don't know for sure if they will do better or worse, or about the same. It's mostly for my own curiosity, but either way, I was planning on letting you be the first to know.