
Sunday, November 21, 2004

A Brief History of Run Estimation: Base Runs

In previous articles in this series on the history of run estimation I've looked at Runs Created (RC), Batting Runs (BR), and Estimated Runs Produced (ERP). Although all of these formulas are intuitive in that they are based on a model of how the game is played, one difference between RC on the one hand and BR and its variant ERP on the other is that RC is a non-linear formula. In other words, RC essentially says that the whole is greater than the sum of the parts where offensive events are concerned, while BR and ERP take a linear approach that assigns fixed weights to the various offensive events. In this installment of the series I'll look at another intuitive but non-linear run estimation formula called Base Runs (BsR).

History
Base Runs (BsR) was developed by David Smyth in the early 1990s and since then much discussion of it has occurred on the Strategy and Sabermetrics forum. Like RC and BR this formula is based on team scoring, which is then applied to individuals. The basic formula for BsR is:

BsR = (BaseRunners * ScoreRate) + HR

In other words, runs can be estimated if you know how many base runners there are, what their chance of scoring is, and how many homeruns the team hit. To many, this is the strength of BsR: it captures a more intuitive and accurate model of how runs are created. Like RC, the formula is then further broken down into components like so:

BsR = (A * (B/(B+C))) + D

Here A represents baserunners, defined as hits plus walks minus homeruns; B is the advancement of those runners; C is outs, defined as at bats minus hits; and D is homeruns. So you can see that the ScoreRate is defined as the advancement value divided by the advancement plus the outs consumed.

ScoreRate = B/(B+C)

The complexity of the formula is centered in the calculation of B. I've seen several versions of the B factor including:

B = (.8*1B) + (2.1*2B) + (3.4*3B) + (1.8* HR)+(.1*BB)
B = .8*(1B+SB) + (2.2*2B) + (3.5*3B) + (2* HR)
B = (2.5*TB - H - 5*HR + 2*SB + .05*(BB+HBP))*X

As you can see in the first two formulas (found on Tangotiger's site and posted by Smyth himself in a blog discussion of run estimators), the B factor is much like the Linear Weights values, except that the weights are larger and triples outweigh homeruns. This seems intuitively wrong, but it is because homeruns are a special case already counted in the D factor. The third formula, used by Brandon Heipp in his article on BsR for the August 2001 issue of the SABR journal By The Numbers, takes a different approach and can be expanded to:

B = ((1.5*1B) + (4*2B) + (6.5*3B) + (4*HR) + (2*SB) + .05*(BB+HBP))*X

In this version the weights are basically doubled, stolen bases and hit by pitch are both included, and an "X" factor is introduced to adjust for the team or league being tested. This value is historically around .535, which brings the weights back roughly in line with the first two versions (1.5 * .535 is about .8, for example). It is used in the same way as the varying value for outs in the Batting Runs formula, to calibrate the formula for a league.
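The expansion of Heipp's B factor can be checked numerically. The following is a small Python sketch, not code from Heipp's article; the function names, argument names, and the sample stat line are my own, and X defaults to the historical .535 mentioned above.

```python
def b_compact(singles, doubles, triples, hr, sb, bb, hbp, x=0.535):
    """B factor in the total-bases form: (2.5*TB - H - 5*HR + 2*SB + .05*(BB+HBP)) * X."""
    hits = singles + doubles + triples + hr
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    return (2.5 * total_bases - hits - 5 * hr + 2 * sb + 0.05 * (bb + hbp)) * x


def b_expanded(singles, doubles, triples, hr, sb, bb, hbp, x=0.535):
    """The same B factor written out event by event."""
    return (1.5 * singles + 4 * doubles + 6.5 * triples + 4 * hr
            + 2 * sb + 0.05 * (bb + hbp)) * x


# Hypothetical team totals, used only to compare the two forms.
sample = dict(singles=980, doubles=280, triples=30, hr=160, sb=90, bb=550, hbp=50)
difference = b_compact(**sample) - b_expanded(**sample)
```

For any stat line the two functions should agree up to floating point rounding, since the second is just the first with total bases and hits written out.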

The full formula using the first B factor above is then:

BsR = ((H + BB - HR) * (((.8*1B) + (2.1*2B) + (3.4*3B) + (1.8*HR) + (.1*BB)) / ((.8*1B) + (2.1*2B) + (3.4*3B) + (1.8*HR) + (.1*BB) + (AB - H)))) + HR
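The full formula translates fairly directly into code. What follows is a minimal Python sketch rather than Smyth's own implementation: the function name base_runs and its arguments are my own labels, singles are assumed to be hits minus doubles, triples, and homeruns, and the handling of an empty stat line is my own choice.

```python
def base_runs(ab, h, doubles, triples, hr, bb):
    """Estimate runs as A * B / (B + C) + D, using the first B factor above."""
    singles = h - doubles - triples - hr   # assumption: 1B derived from hits
    a = h + bb - hr                        # A: baserunners
    b = (0.8 * singles + 2.1 * doubles     # B: advancement
         + 3.4 * triples + 1.8 * hr + 0.1 * bb)
    c = ab - h                             # C: outs
    d = hr                                 # D: homeruns
    if b + c == 0:
        # Empty stat line: ScoreRate is undefined, so avoid dividing by zero.
        return float(d)
    return a * b / (b + c) + d


# A team with 97 homeruns and three outs: no baserunners, so only D counts.
all_homers = base_runs(ab=100, h=97, doubles=0, triples=0, hr=97, bb=0)  # 97.0
```

Because homeruns are removed from A and added back as D, a lineup that does nothing but homer is credited with exactly one run per homerun.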

To illustrate that this formula is more accurate at the extremes, consider a thought experiment with two imaginary teams. Team A hits 97 homeruns and makes three outs, while Team B draws 100 walks and makes no outs. In the case of Team A, common sense says the team will score 97 runs. In the case of Team B, common sense says they'll score 97 runs and leave the bases loaded. BsR compares with RC and BR as follows:

       Team A   Team B
BsR        97       97
RC        376        0
BR        136       33

Obviously, at these extremes BsR does a better job, both because it doesn't overvalue the homerun as RC does and because, being a non-linear formula, it takes into account the offensive context, as BR does not.
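The RC row of the table can be reproduced with the basic Runs Created formula, (H + BB) * TB / (AB + BB). That the table was built with this basic version is my assumption, and the function name below is my own; treat this as a quick Python check rather than the calculation actually used.

```python
def runs_created_basic(ab, h, bb, total_bases):
    """Basic Runs Created: (H + BB) * TB / (AB + BB)."""
    return (h + bb) * total_bases / (ab + bb)


# Team A: 97 homeruns (388 total bases) and three outs in 100 at bats.
team_a_rc = runs_created_basic(ab=100, h=97, bb=0, total_bases=388)

# Team B: 100 walks, no at bats, no total bases.
team_b_rc = runs_created_basic(ab=0, h=0, bb=100, total_bases=0)
```

Team A comes out at roughly 376 because RC multiplies a very high on-base factor by a very high total-bases factor, while Team B comes out at zero because walks add no total bases.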

But of course major league baseball is not played at the extremes. This is why all three formulas can be relied upon to approximate the nearly straight-line relationship between offensive events and runs within the range of event frequencies actually seen in major league baseball.

That being said, Heipp did calculate the Root Mean Square Error for each of the formulas and found that BsR had a smaller error than RC but an ever-so-slightly larger error than Furtado's Extrapolated Runs (XR). All three, however, were in the range of 23.7 to 25.8 runs.
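For readers unfamiliar with the measure, Root Mean Square Error can be sketched in a few lines of Python. This is a generic illustration, not Heipp's calculation: the function name is mine, the sample figures are made up, and dividing by the plain number of teams (rather than adjusting for degrees of freedom) is an assumption.

```python
import math


def rmse(estimated, actual):
    """Root Mean Square Error between estimated and actual team run totals."""
    if len(estimated) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(e - a) ** 2 for e, a in zip(estimated, actual)]
    return math.sqrt(sum(squared_errors) / len(actual))


# Hypothetical estimates and actual runs for two teams, each off by 10 runs.
example_error = rmse([710, 690], [700, 700])  # 10.0
```

An RMSE of about 24 runs therefore means a typical miss of roughly that size per team season, with larger misses counting more heavily than smaller ones.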

A challenge for BsR is that, like RC, it is in essence a run estimator for teams, so when it is applied to the statistics of an individual the formula automatically forces the player's offensive events to interact with themselves. A player never creates runs in the context of a team of clones, however, and the result is that BsR overestimates runs for players with high slugging and on base percentages. It doesn't do so as much as RC, since homeruns are isolated. In order to negate this effect, the RC formula now contextualizes a player's individual statistics with eight mythical average players (see my article on RC for the details). The same approach can be used with BsR, as Heipp shows in this article.
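One way to picture the eight-average-players idea is to estimate runs for a lineup of the player plus eight average hitters and subtract the estimate for the eight average hitters alone. The Python sketch below is my own reading of that idea, not Heipp's method; the first B factor from earlier in the article is used, and both stat lines are hypothetical.

```python
def base_runs(ab, singles, doubles, triples, hr, bb):
    """BsR = A * B / (B + C) + D with the first B factor from the article."""
    hits = singles + doubles + triples + hr
    a = hits + bb - hr
    b = 0.8 * singles + 2.1 * doubles + 3.4 * triples + 1.8 * hr + 0.1 * bb
    c = ab - hits
    return a * b / (b + c) + hr


def runs_in_team_context(player, average):
    """Runs added when the player joins eight average hitters (an assumed approach)."""
    teammates = {stat: 8 * value for stat, value in average.items()}
    lineup = {stat: teammates[stat] + player[stat] for stat in teammates}
    return base_runs(**lineup) - base_runs(**teammates)


# Hypothetical season lines, for illustration only.
average = dict(ab=600, singles=105, doubles=30, triples=3, hr=17, bb=55)
slugger = dict(ab=500, singles=90, doubles=40, triples=2, hr=45, bb=120)

standalone = base_runs(**slugger)                    # player as a team of clones
in_context = runs_in_team_context(slugger, average)  # player among average hitters
```

With these made-up lines the in-context figure comes out lower than the standalone figure, which is the direction of the correction described above. An exactly average player gets the same value either way.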

A second challenge for BsR is to more accurately approximate the ScoreRate. The formula B/(B+C) may not in fact be the best way to do this and there has been much discussion on this topic recently. Of course, one way of coming up with empirical weights for the offensive events is to calculate the ScoreRate for entire leagues using the known number of baserunners, outs, and runs and then run a regression on the offensive events. That may in fact be what Smyth did although I don't know.
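The first step of that empirical approach can be sketched by solving the BsR formula backwards: with runs, baserunners, outs, and homeruns known for a league, ScoreRate = (R - HR) / A, and since ScoreRate = B / (B + C), the implied B is C * ScoreRate / (1 - ScoreRate). The Python below is only an illustration under those definitions; the function name and league totals are hypothetical, and the regression of B on the individual events is not shown.

```python
def implied_b(runs, ab, h, hr, bb):
    """Solve R = A * B / (B + C) + HR for B, given league totals."""
    a = h + bb - hr                  # A: baserunners
    c = ab - h                       # C: outs
    score_rate = (runs - hr) / a     # share of baserunners who scored
    if not 0 <= score_rate < 1:
        raise ValueError("ScoreRate must be in [0, 1) to solve for B")
    return c * score_rate / (1 - score_rate)


# Hypothetical league totals, for illustration only.
league_b = implied_b(runs=730, ab=5500, h=1450, hr=160, bb=550)
```

Repeating this for many leagues would give one implied B per league, and those values could then be regressed on singles, doubles, triples, homeruns, and walks to produce weights like the ones listed earlier.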
