The February 2005 issue of By The Numbers, the newsletter of the SABR Statistical Analysis Committee, has been published (for those readers who are not members of The Society for American Baseball Research, click here to join).
Of particular interest were articles by Jim Albert and Phil Birnbaum, both of which relate to Bill James' article in the 2005 Baseball Research Journal titled "Underestimating the Fog".
The Fog Reviewed
I wrote a short review of James' article back in March and at that time noted that:
"James argues that some of the best known negative sabermetric conclusions should not really be viewed as conclusions at all, but rather simply as non-answers to questions under study.
James goes on to criticize the common technique employed in the various sabermetric studies typically cited to 'prove' these conclusions - for example Dick Cramer's famous 1977 Baseball Research Journal article on clutch hitting and James' own look at platoon differentials in the 1988 Baseball Abstract. That technique involves the search for recurrence or persistence of the phenomenon being studied. In other words...studies were done that attempted to determine if the effect...persisted across seasons. In each case repeated studies have shown that it doesn't - therefore the effect is, in the words of James, 'transient' and not 'persistent'. That which is not persistent is then assumed not to be real.
James then argues that in many of these cases the negative conclusion - the phenomena is not real - is flawed because there is too much instability in the data used to make the conclusion...The randomness involved in such a small sample size tends to swamp the differential itself, thereby making the results meaningless. James notes that Cramer's original study of clutch hitting was flawed for the same reason."
In his article, titled "Comments on 'Underestimating the Fog'", Albert brings a bit more clarity to what James was saying and talks specifically about the amount of luck and the amount of ability reflected in various baseball statistics (a topic that he and co-author Jay Bennett discussed in their excellent book Curve Ball). He notes that given a set of players with 500 at bats, about 50% of the variation in batting average will be due to luck while the other 50% can be attributed to differences in their hitting ability. He then notes, as does James, that the amount of luck will vary with sample size. He has also discussed this topic in another online article that I referred to in a post on the general topic of luck and batting average a few months back.
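Albert's roughly 50/50 split can be sketched with a quick simulation. The ability spread used below (a standard deviation of about 20 points of batting average) is an assumption chosen for illustration, not Albert's exact figure; the point is that binomial noise over 500 at bats contributes about as much variance as a plausible spread in true hitting ability.

```python
import random
import statistics

random.seed(1)

N_PLAYERS = 500
AB = 500
ABILITY_MEAN = 0.270
ABILITY_SD = 0.0205   # assumed spread of true hitting ability (about 20 points)

# Draw a true ability for each player, then simulate 500 binomial at bats.
abilities = [random.gauss(ABILITY_MEAN, ABILITY_SD) for _ in range(N_PLAYERS)]
observed = [sum(random.random() < p for _ in range(AB)) / AB for p in abilities]

var_ability = statistics.pvariance(abilities)      # the "skill" component
var_luck = ABILITY_MEAN * (1 - ABILITY_MEAN) / AB  # binomial noise over 500 AB
var_observed = statistics.pvariance(observed)

print(f"ability variance:  {var_ability:.6f}")
print(f"luck variance:     {var_luck:.6f}")
print(f"observed variance: {var_observed:.6f}")
print(f"luck share of observed variance: {var_luck / (var_ability + var_luck):.0%}")
```

Shrink AB and the luck share grows, which is the sample-size point both Albert and James make.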
Albert goes on, however, to disagree with James' assertion that using offshoot statistics (the difference between two statistics) to compare players has the effect of summing the luck components of the individual statistics. He says this notion is simply "nonsense" and uses a simple example to show why. Albert then illustrates his methodology of testing against a simple model and concludes that there is evidence for a platoon difference (using SO/AB), as he and Bennett concluded in chapter 4 of Curve Ball. Platoon differences are one of the sabermetric conclusions that James called into question in his original article.
In conclusion Albert notes that what James is saying is simply a restatement of the logic statisticians use when they test models meant to approximate reality.
"When we assess the goodness of fit of a model, there are two possible conclusions: either there is significant evidence to reject the model or there is insufficient evidence. Saying there is not enough evidence to reject the model doesn't say the model is true, but it does say that we just can't provide evidence to say that it is false. Statisticians don't prove models are true - that's why we are careful to say things like 'we have insufficient evidence to reject a model.'"
His opinion, then, is that phenomena like those mentioned by James including clutch hitting may indeed exist, but it is likely they have a very small effect and are difficult to pick up with a statistical test.
The more controversial of the two articles in By The Numbers is Phil Birnbaum's "Clutch Hitting and the Cramer Test". In this article Birnbaum attacks James' notion that "random data proves nothing - and it cannot be used as proof of nothingness" and says it is "certainly false". His position is that if one can show that a study is well designed, then a conclusion of "no effect" can reasonably be offered.
Birnbaum goes on to apply the method used in Cramer's 1977 article to data from 14 pairs of seasons from 1974 through 1990 and shows the correlations in the table reproduced here.
Seasons      r      r^2     f
74-75      .0155   .0002   .86
75-76      .0740   .0055   .37
76-77      .0712   .0051   .40
77-78      .0629   .0040   .44
78-79     -.1840   .0339   .02
79-80      .0038   .0000   .96
82-83     -.0250   .0006   .75
83-84      .0456   .0021   .60
84-85      .0222   .0005   .79
85-86      .0728   .0053   .38
86-87      .0189   .0004   .82
87-88      .0034   .0000   .97
88-89      .0829   .0069   .33
89-90      .0373   .0014   .67
He then creates a model in which clutch hitting exists and compares the model's results with the actual results. His conclusion is that while the Cramer test could easily detect clutch hitting as a real phenomenon if the ability were normally distributed with a standard deviation of 30 points, the method starts to fail once the standard deviation falls below about 7.5 points. In other words, he is in agreement with Albert that if clutch hitting is real, its effect is small.
He then combines the data for his 14 seasons into a single regression and concludes that the Cramer test could pick up the effect at a standard deviation of 15 points but begins to fail at a standard deviation of about 10 points.
In his set of conclusions Birnbaum acknowledges that the Cramer test cannot completely disprove the existence of clutch hitting, but it does put a strong upper bound on the magnitude of the possible effect: the effect is certainly less than 15 points of batting average; "that is, at least two-thirds of all players are expected to clutch hit within .015 of their non-clutch batting average." And so his conclusion is that given the small effect it is probably impossible to distinguish good clutch hitters from bad.
These articles then generated a response from Bill James called "Mapping the Fog", which he posted to the SABR Statistical Analysis Committee Yahoo group. That was quickly followed by a response from Phil Birnbaum published in the same forum. Both responses have now been published, along with a lively discussion led by Chris Dial, on The Baseball Think Factory.
In the meat of his response James creates a model where clutch hitting exists (a model league where 20% of the hitters are clutch performers at the level of 25 points) and then uses the Cramer test to try to detect the effect. The test does not reliably detect it (it would do so only 65% of the time), and so he concludes that even under these ideal circumstances the Cramer test has nothing to say as to whether clutch hitting is a real phenomenon or not.
"Even when we know that the clutch effect does exist within the data, even when we give that effect an unreasonably clear chance to manifest itself, there is still a 35% chance that it will entirely disappear under this type of scrutiny."
James goes on to respond specifically to statements made in both articles and notes that in his opinion "there is an immense amount of work to be done before we really begin to understand this issue."
In his response Birnbaum is critical of the way James applied the Cramer test to his model, using a simpler sign-based test rather than an actual regression. He goes further, however, and actually applies the regression analysis to James' model, concluding that using 14 separate seasons of data the Cramer test works in that it would pick up clutch hitting under James' model.
I find this entire discussion important and fascinating since it comes so close to the heart of how many sabermetric studies have been done - the most recent and famous being the conclusions drawn in favor of DIPS.
For the most part I have to side with Albert and Birnbaum. While these statistical tests can never prove that clutch hitting or platoon differentials do not exist, using data over a number of seasons or an entire career (after all, the number of true clutch plate appearances is small, around 50 per 600) they can surely indicate that if the effects exist, they are small and may even fall within the normal variation of the statistics used to measure them. As a result - and here is the actionable thing - they needn't be considered when making personnel decisions over the long haul.
As an aside I thought the most interesting comment made in the discussion of this topic on TBTF was that it could be the case that clutch hitting is actually a "negative ability". In other words, it could be the case that most hitters actually perform worse in clutch situations and that good clutch hitters can be defined as those who maintain their performance level. It would be interesting to see what a model would look like to test for this.