One of the books I received over Christmas and just had a chance to read is A Mathematician at the Ballpark: Odds and Probabilities for Baseball Fans by Ken Ross. In this book Ross introduces the reader to probability and statistics using a good dose of baseball along with gambling, cards, and other topics. The book that most closely resembles it is Curve Ball, although the goal of this book is definitely to teach probability rather than using probability and statistics to elucidate relationships in baseball as the authors of Curve Ball do.
Some of the topics covered include converting probabilities to odds and vice versa, understanding combinations (how many ways are there to select n elements from a set of k elements), probability and Bernoulli trials, correlation, and linear regression. Although many of the examples are related to baseball there are plenty more related strictly to the lottery, casino games, and betting on baseball. Since I'm not much of a gambler these latter examples kind of lost my interest.
Correlation and Offense
What did pique my interest was his discussion of correlation and offensive statistics on pages 129-131. Here Ross introduces the notion of correlation and notes several correlations of offensive team statistics and winning percentage for 2003. Those he provides are:
Interestingly, OBP correlates more strongly with winning percentage than does either OPS (OBP+SLG) or BRA (Batter Run Average = OBP*SLG). None of them, however, has a "strong correlation," defined as an r value greater than .70. Keep in mind that correlation in this case is simply a measure of the linear relationship between winning percentage and these other statistics. In other words, winning percentage rises and falls in a more consistently linear way with OBP than it does with BRA, OPS, and the others.
I thought it would be interesting to calculate the correlation between team runs scored with not only these offensive measures but also several of the run estimators I discussed in a recent series:
A Brief History of Run Estimation
Run Estimation: Runs Created
Run Estimation: Batting Runs
Run Estimation: Estimated Runs Produced
Run Estimation: Base Runs
To do this I loaded the new 5.2 version of the Lahman database into SQL Server. I then plugged the run estimators into a query and then loaded the results into Excel. I used Excel's CORREL() function to calculate the correlation coefficient, r, you see below.
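For readers without Excel, the correlation coefficient can be computed by hand. The following is a minimal sketch of the standard Pearson formula, which is the quantity CORREL() returns; the function name `pearson_r` and the sample team values are made up for illustration and are not taken from the Lahman database.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient, the quantity Excel's CORREL() returns."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up team values for illustration only (not real season data)
team_obp = [0.320, 0.335, 0.348, 0.360]
team_runs = [690, 745, 800, 870]
r = pearson_r(team_obp, team_runs)
print(f"r = {r:.3f}")
```

An r near 1 means the two columns rise together in nearly a straight line, an r near -1 means one falls as the other rises, and an r near 0 means there is little linear relationship.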
These are the formulas I used:
- Runs Created (RC) - I used the version found in The 2005 Bill James Handbook and talked about here. I also calculated the basic version (RC-Basic or RC-B) for comparison.
- Batting Runs (BR) - I used the formula found in the 2004 Baseball Encyclopedia but instead of using the ABF factor I used a custom out value of -.10. I did this so that Batting Runs would calculate total runs instead of marginal runs above the league average
- Estimated Runs Produced (ERP) - rather than use Paul Johnson's formulas I used two of Jim Furtado's eXtrapolated Runs formulas - XR and XRR or eXtrapolated Runs Reduced
- BaseRuns (BsR) - I used the version of the formula found on Tangotiger's site
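To give a feel for what one of these estimators looks like in code, here is a small sketch of the basic Runs Created formula as it is commonly stated, (H + BB) * TB / (AB + BB). This is only the RC-Basic version; the more elaborate Handbook version and the other estimators are not reproduced here. The function name and the team totals are hypothetical, chosen just to show the arithmetic.

```python
def runs_created_basic(hits, walks, total_bases, at_bats):
    """Basic Runs Created as commonly stated: (H + BB) * TB / (AB + BB)."""
    opportunities = at_bats + walks
    if opportunities == 0:
        raise ValueError("at_bats + walks must be greater than zero")
    # On-base component (H + BB) times advancement component (TB),
    # divided by opportunities (AB + BB)
    return (hits + walks) * total_bases / opportunities

# Hypothetical team season totals, for illustration only
rc = runs_created_basic(hits=1500, walks=550, total_bases=2400, at_bats=5500)
print(f"Estimated runs: {rc:.0f}")
```

Because the on-base and advancement components are multiplied together, this is the kind of formula referred to as multiplicative, in contrast to linear formulas that assign a fixed run value to each event.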
So here are the results. In ascending order of correlation (r value) with runs scored, my data set revealed:
As you can see all of the statistics except stolen bases and walks correlate strongly with runs scored, that is, they have an r value greater than .70. In fact, stolen bases actually have a negative r value, meaning that there is essentially no, or a slightly negative, correlation between teams that score a lot of runs and those that steal a lot of bases. At first glance you can imagine that this is because some teams that cannot hit homeruns and doubles must resort to stolen bases in order to try and manufacture some offense while some teams that are more proficient in extra base hits also include some speedy players. In that sense it's not likely that stolen bases actually inhibit run production but it does reveal that it is not necessary to steal bases in order to score a lot of runs. In this data set teams that scored more than 800 runs stole an average of 90 bases while those that scored fewer than 800 runs stole 94.
However, once you get past slugging percentage you can see that the remainder of the measures starting with OPS produce r values clustered between .955 and .964. In other words, all of them are very closely correlated with run scoring and so as the measures go up and down, so does run scoring. This is why sabermetricians prefer to use these other measures to evaluate offensive production rather than the standard batting average (official since 1876) or homeruns. There are two additional points here.
First, of these measures OPS, although at the bottom of the cluster, is by far the easiest to calculate, requiring only the addition of two readily available statistics. That's why many prefer to use it instead of batting average. For myself, although OPS is a much more informative number than batting average alone, I'd prefer to see all three standard offensive measures, since they convey more information when viewed as a group than OPS does by itself. In addition, BRA, as mentioned by Ross, is not distorted by players who have a low SLG and a high OBP or vice versa. For example, a player with a .250 SLG and a .475 OBP will have an OPS of .725 while a player with a .350 SLG and a .370 OBP will have an OPS of .720. However, when you calculate their BRA the first player's is .119 while the second player's is .130. Since BRA is more closely correlated with run scoring I'd have to conclude that BRA is the better measure.
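The arithmetic for those two players can be checked with a few lines of code. This is just a sketch: the helper names are my own, and I use Python's decimal module with round-half-up so that the three-place figures match hand rounding.

```python
from decimal import Decimal, ROUND_HALF_UP

def three_places(value):
    """Round to three decimal places the way a stat line is usually shown."""
    return value.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)

def ops(obp, slg):
    """On-base plus slugging: OBP + SLG."""
    return three_places(obp + slg)

def bra(obp, slg):
    """Batter Run Average: OBP * SLG."""
    return three_places(obp * slg)

# The two players from the example, as (OBP, SLG)
players = {
    "high-OBP, low-SLG": (Decimal("0.475"), Decimal("0.250")),
    "balanced": (Decimal("0.370"), Decimal("0.350")),
}
for label, (obp, slg) in players.items():
    print(f"{label}: OPS {ops(obp, slg)}, BRA {bra(obp, slg)}")
# high-OBP, low-SLG: OPS 0.725, BRA 0.119
# balanced: OPS 0.720, BRA 0.130
```

The sum barely separates the two players, while the product favors the balanced one, because multiplying penalizes a lopsided pair of values.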
Secondly, because the run estimators, starting with RC-Basic, actually do attempt to estimate runs, the formulas can be applied to individuals in the hopes that they approximate the run contribution of the individual. So given this list it appears that Furtado's eXtrapolated Runs is the best measure to use since it correlates more closely with run scoring than any of the others (even if the difference is minute).
However, there are two other ways to measure these run estimators. First, we can find the average error for each estimator, or the average number of runs each was off.
Judging from these numbers we would assume that BaseRuns is the most accurate since it produces the smallest average error. But is this better than using the r value? Remember that the r value measures the strength of the linear relationship between runs and the thing measured, while the smallest average error tells us which estimator comes closest on average. So this could mean that while XR is better at ranking, BsR is actually more accurate.
There is actually a third way to rank these estimators and that is to use the standard deviation.
And doing so gives us yet a third winner, RC, whose standard deviation is about a run better than any of the others. This means that the spread of errors for Runs Created is smaller than for any of the others; in other words, it does not tend to make big mistakes. This is easily seen by noticing that RC-B misses the mark by 102 runs for the 2000 White Sox while the biggest error RC makes is 72 for the 2002 Phillies.
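The average error and the standard deviation measure different things, and a short sketch may make that clearer. It rests on assumptions of mine: that "average error" means the mean absolute miss, and that the standard deviation is taken over the signed errors using the population formula. The function name and the run totals are invented for illustration.

```python
from statistics import mean, pstdev

def error_summary(actual_runs, estimated_runs):
    """Summarize how far an estimator's team totals fall from actual runs."""
    if len(actual_runs) != len(estimated_runs) or not actual_runs:
        raise ValueError("need two non-empty sequences of equal length")
    # Signed error for each team season: positive means an overestimate
    errors = [est - act for act, est in zip(actual_runs, estimated_runs)]
    abs_errors = [abs(e) for e in errors]
    return {
        "mean_abs_error": mean(abs_errors),  # typical size of a miss
        "error_std_dev": pstdev(errors),     # spread of the misses
        "largest_miss": max(abs_errors),     # worst single team season
    }

# Invented totals for four team seasons, for illustration only
actual = [700, 800, 900, 750]
estimated = [710, 790, 905, 750]
summary = error_summary(actual, estimated)
print(summary)
```

An estimator can have a small average miss and still produce an occasional very large one, which is why the spread and the largest miss are worth looking at alongside the average.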
So we can conclude that XR gives us the best correlation, BsR gives us the smallest average error and RC gives us the smallest error distribution. So which one should we use? Since all of these (except the basic Runs Created formula) are so close, in practice it rarely matters given the restricted range of offensive levels at which major league baseball is played. As mentioned in my series the linear formulas (BR, XR, and XRR) tend to underestimate run production at higher levels while multiplicative formulas (RC-B and RC) tend to overestimate run production at higher levels (although James has corrected for this in RC in recent years). BsR alone seems to provide a better mix in either environment since it is an intuitively more accurate way of modeling run production.
The other aspect of the book that caught my attention was the author's discussion of streaks or "hot hands" in chapter 8. After reviewing the standard studies done on the subject as they relate to baseball and basketball, Ross reviews a study done by Reid Dorsey-Palmateer and Gary Schmidt that looked at professional bowlers. This study showed pretty convincingly that the proportion of strikes after strikes was higher, to a statistically significant degree, than the proportion of strikes after non-strikes. Ross then uses this study to conclude that "I must now assert what I have long intuited: Sports players do get in 'the groove', or have 'hot hands'". To be fair, he also acknowledges that this is very difficult to detect in complex games like baseball and basketball.
I'm not so sure that he's correct when it comes to baseball. Couldn't it just be that bowling is different than baseball or basketball by the fact that the trials (frames) occur more frequently and under more uniform conditions? In other words, maybe what allows bowlers to repeat their performance is the fact that their adrenalin doesn't diminish between frames as it does for baseball players between at bats. My intuition is that these are apples to oranges comparisons and that the result for bowlers cannot be extrapolated to mean that there is a "hot hand" phenomenon in baseball.
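The comparison at the heart of the bowling study, the proportion of strikes after strikes versus the proportion after non-strikes, is simple to compute. The sketch below is my own illustration and not the method of the study itself: it treats a game as a plain sequence of strike or non-strike outcomes, uses an invented sequence, and does not include the significance test that the study's conclusion depends on.

```python
def conditional_strike_rates(frames):
    """Return (strike rate after a strike, strike rate after a non-strike).

    frames is a sequence of booleans in playing order, True meaning a strike.
    A rate is None when there are no attempts of that kind.
    """
    pairs = list(zip(frames, frames[1:]))  # (previous outcome, current outcome)
    after_strike = [cur for prev, cur in pairs if prev]
    after_miss = [cur for prev, cur in pairs if not prev]

    def rate(outcomes):
        return sum(outcomes) / len(outcomes) if outcomes else None

    return rate(after_strike), rate(after_miss)

# Invented sequence of ten attempts, for illustration only
frames = [True, True, False, True, True, True, False, False, True, False]
after_strike_rate, after_miss_rate = conditional_strike_rates(frames)
print(after_strike_rate, after_miss_rate)
```

A "hot hand" would show up as the first rate being reliably higher than the second over a large number of attempts; with a handful of frames like this, any gap is just noise.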
In addition to the above sections I appreciated the author's inclusion of summaries of some of the interesting work done by my fellow SABR members and published in By The Numbers, the newsletter of the statistical committee.
Overall, this is a good book for those wishing to learn more about statistics and those interested in odds and betting. If you're already sabermetrically literate you'll probably not find a whole lot that's new.