Thursday, October 07, 2004

A Brief History of Run Estimation: Runs Created

As promised, we'll start our look at run estimation formulas with what is perhaps the best-known formula: Bill James' Runs Created.

James, who coined the term "sabermetrics", introduced the formula in an early Baseball Abstract (1979 I believe). It is one of a class of run estimation formulas which I'll discuss in this series that Albert and Bennett in their book Curve Ball call "intuitive" formulas because they are based not on rigorous statistical models such as regression analysis but rather on a common sense model of how the game of baseball actually works.

In its basic incarnation the Runs Created formula James initially published (although he confesses in the 1984 Baseball Abstract that he developed and discarded 30 or 40 such formulas - a sure sign that this is an intuitive formula) simply consists of three components:

RC = (A*B)/C

where A = H+BB or the number of runners on base, B = Total Bases or what is done to move runners along, and C = AB+BB, or the context in which A and B occur. So in total the basic formula was:

RC = ((H+BB)*TB)/(AB+BB)

The main advantage of this formula is that it is simple to calculate and based on counting statistics such as hits, at bats, and walks that are readily available. The other advantage is that the scale is the same as runs batted in or runs scored and so 100 Runs Created is an excellent season. To give an example, Aramis Ramirez in 2004 had 174 hits, 49 walks, 316 total bases, and 547 at bats. That gives him ((174+49)*316)/(547+49) = 118.2 Runs Created, a fine season.
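
To make the arithmetic concrete, here is a minimal Python sketch of the basic formula applied to the Ramirez line above. The function and parameter names are chosen here for illustration only and do not come from James.

```python
def basic_runs_created(h, bb, tb, ab):
    """Basic Runs Created: RC = ((H+BB)*TB)/(AB+BB)."""
    a = h + bb   # A factor: runners on base
    b = tb       # B factor: advancement (total bases)
    c = ab + bb  # C factor: opportunities
    return a * b / c


# Aramis Ramirez, 2004, using the figures quoted in the text.
ramirez = basic_runs_created(h=174, bb=49, tb=316, ab=547)
print(round(ramirez, 1))  # 118.2
```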

The basic premise, or the intuitive model, behind the formula is that offense is essentially the product of getting on base and advancing runners through extra base hits within a particular offensive context. Typically, this formula is accurate to within 1% for a given league for a year.

In The 1984 Baseball Abstract James introduced two additional versions of the formula, a stolen base version, and a technical version.

The stolen base version uses the following formulas for A, B, and C:

A = H+BB-CS

B = TB+.55*SB

C = AB+BB

So the complete formula is:

RC = ((H+BB-CS)*(TB+(.55*SB)))/(AB+BB)

In the 1984 Abstract James explains that he lowered the weight for stolen bases from the .70 figure published in the 1983 Abstract to .55 and removed caught stealing from the C factor of the equation, since doing so could be shown to make logical sense.
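
As a rough sketch, the stolen base version can be written in Python as follows. The function name and the stolen base and caught stealing totals in the example are hypothetical and used only to show how the two extra inputs enter the formula; the hits, walks, total bases, and at bats are the Ramirez figures quoted earlier.

```python
def stolen_base_runs_created(h, bb, tb, ab, sb=0, cs=0):
    """Stolen base version: RC = ((H+BB-CS)*(TB+.55*SB))/(AB+BB)."""
    a = h + bb - cs     # caught stealing removes a baserunner
    b = tb + 0.55 * sb  # each steal is credited .55 of a base
    c = ab + bb         # caught stealing is not part of C
    return a * b / c


# With no steals or caught stealing this reduces to the basic formula.
no_running = stolen_base_runs_created(h=174, bb=49, tb=316, ab=547)

# Hypothetical running game: 10 steals, 2 caught stealing.
with_running = stolen_base_runs_created(h=174, bb=49, tb=316, ab=547, sb=10, cs=2)
print(round(no_running, 1), round(with_running, 1))
```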

The technical version (also called Tech-1) simply expands the stolen base version by including all the available counting statistics. The A,B, and C factors then become:

A = H+BB+HBP-CS-GIDP

B = TB+.26*(BB-IBB+HBP)+.52*(SB+SF+SH)

C = AB+BB+HBP+SH+SF

As you can see, hit batsmen (HBP) and grounded into double plays (GIDP) are now included in the A factor, since the former adds a runner to the bases and the latter subtracts one. In the B factor intentional walks (IBB) are subtracted from walks and hit batsmen are added, the reasoning being that non-intentional walks and hit batsmen both have some advancement value, here weighted at .26. Sacrifice flies (SF) and sacrifice bunts (SH) are also included along with stolen bases and given a weight of .52, slightly lower than .55 since one advantage of the stolen base, that it helps prevent double plays, is already captured in the A factor. The C factor then expands to cover the entire offensive context, encompassing all plate appearances.
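
A minimal Python sketch of the technical version follows. The B factor is taken directly from the text. The A and C factors are an assumption here: they follow the commonly published Tech-1 definition (A = H+BB+HBP-CS-GIDP, C = AB+BB+HBP+SH+SF), which matches the description above. The function name and the extra counting statistics in the example are hypothetical.

```python
def tech1_runs_created(h, bb, tb, ab, hbp=0, ibb=0, sb=0, cs=0,
                       gidp=0, sf=0, sh=0):
    """Tech-1 Runs Created (A and C as commonly published; an assumption)."""
    a = h + bb + hbp - cs - gidp                              # net baserunners
    b = tb + 0.26 * (bb - ibb + hbp) + 0.52 * (sb + sf + sh)  # advancement
    c = ab + bb + hbp + sh + sf                               # plate appearances
    return a * b / c


# Hypothetical batting line, for illustration only.
rc = tech1_runs_created(h=174, bb=49, tb=316, ab=547,
                        hbp=5, ibb=4, sb=3, cs=1, gidp=15, sf=6, sh=0)
print(round(rc, 1))
```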

These three versions of Runs Created remain among those used most often by sabermetricians because of their ease of use. However, James kept innovating and, for example, the authors of Total Baseball in 1989 included 13 additional technical versions of the formula (Tech-2 through Tech-14) introduced in the 1988 version of The Bill James Historical Baseball Abstract that adjusted the weights and included the counting statistics that were available in the period from 1900-1954. These are variations of Tech-1 and include:

  • Tech-2: 1954; B factor drops IBB
  • Tech-3: AL 1940-53, NL 1951-53; SF is dropped from C factor; B factor changes to 1.025*TB+.26*(BB+HBP)+.52*(SH+SB)
  • Tech-4: AL 1939; B factor becomes TB+.26*(BB+HBP)+.52*(SH+SB)
  • Tech-5: AL 1931-38; A factor becomes .96*(H+BB+HBP-CS)
  • Tech-6: AL 1928-30, 1920-26, NL 1920-25; B factor weights for SH and SB change from .52 to .51
  • Tech-7: AL 1927, NL 1926-30; A factor changes to .93*(H+BB+HBP) and B becomes TB+.26*(BB+HBP)+.46*SH
  • Tech-8: AL 1913,1917-19, NL 1913-14, 1917-1919; A factor becomes H+BB+HBP-.02*AB and B becomes TB+.85*(SH+SB)
  • Tech-9: AL 1914-16, NL 1915-16; A becomes H+BB+HBP-CS
  • Tech-10: AL NL 1908-12; B becomes 1.025*(TB+SB)+.75*SH
  • Tech-11: AL NL 1900-1907; A becomes H+BB+HBP
  • Tech-12: NL 1939-50; A becomes H+BB+HBP-GIDP; B becomes TB+.26*(BB+HBP)+.52*SH
  • Tech-13: NL 1933-38; B factor becomes 1.025*TB+.26*(BB+HBP)+.52*SH
  • Tech-14: NL 1931-32; A becomes .95*(H+BB+HBP)

Most recently James included a newer version in his 2004 edition of The Bill James Handbook and his 2002 book Win Shares where:

A = H+BB+HBP-CS-GIDP

B = TB+.24*(BB-IBB+HBP)+.62*SB+.5*(SH+SF)-.03*SO

C = AB+BB+HBP+SH+SF

While all three factors remain relatively intact (the B factor now gives different weights to stolen bases versus sacrifice flies and bunts and even includes a bit of a penalty for striking out), the relationship between the factors has also changed over time from the simple (A*B)/C and become a bit more complicated:

RC = (((2.4*C+A)*(3*C+B))/(9*C))-(.9*C)

The basic structure remains, but as James described in Win Shares, the calculation has been modified to address one of the criticisms of earlier versions of the formula. Essentially, since the previous versions of the formula simply multiplied the A and B factors, "it presented the player as if his offensive elements were interacting with one another". This leads to estimates of runs created that are too high for players and teams with high slugging percentages and high on base percentages. Albert and Bennett note this in Curve Ball when they discuss how "product models" like Runs Created "tend to be unrealistic for players at either end of the offensive production spectrum." In fact, a player's offensive elements interact with those of the other players on his own team. However, James concluded that if you attempt to calculate runs created by calculating the runs created for the player's team and then determining how many runs they would have created without the player, the player would rate slightly differently on good offensive teams and bad. His solution is to evaluate the player "as if he played in a context of eight other players of average skill, each having the same number of plate appearances." For average skill he simply used a player with a .300 on base percentage and a .400 slugging percentage. Albert and Bennett advocate using this same idea but instead place the player in the context of an average team for the league and year in which he played.

With this information you can see that the A factor of the equation has been modified to include 8 other players with a .300 on base percentage (8 * .300 = 2.4 per plate appearance). The B factor has been augmented with the advancement of those same 8 players. Note that the coefficient of 3 corresponds to a rate of .375 per plate appearance (8 * .375 = 3) rather than exactly .400, since 8 * .400 would be 3.2. The C factor then includes the plate appearances for all 9 players (9*C). After performing the (A*B)/C, the runs created by the other 8 players are removed by subtracting .9 times the plate appearances of the one player. This works since the runs created by the 8 typical players equal 10% of the combined plate appearances of all nine players (.10 * 9*C = .9*C), a quirk of the rates chosen.
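
The behavior of the newer formula can be checked with a short Python sketch. The function names are hypothetical. The check assumes a player whose A and B rates per plate appearance are .300 and .375, the rates implied by the 2.4 and 3 coefficients; for such a player the newer formula should return the same value as the simple (A*B)/C.

```python
def simple_runs_created(a, b, c):
    """Original structure: RC = (A*B)/C."""
    return a * b / c


def context_runs_created(a, b, c):
    """Newer structure: RC = (((2.4*C+A)*(3*C+B))/(9*C))-(.9*C)."""
    team_a = 2.4 * c + a  # player plus eight typical teammates on base
    team_b = 3 * c + b    # player plus eight typical teammates advancing
    team_c = 9 * c        # plate appearances for all nine players
    return team_a * team_b / team_c - 0.9 * c


# A typical player: 600 plate appearances at the assumed rates.
c = 600
a = 0.300 * c
b = 0.375 * c
print(simple_runs_created(a, b, c), context_runs_created(a, b, c))

# An extreme hitter is credited with fewer runs than the simple product gives.
print(simple_runs_created(300, 400, c) > context_runs_created(300, 400, c))
```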

There are additional adjustments when home runs with men on base and batting average with runners in scoring position are available.


Here HRRISP and ABRISP are the homeruns and at bats for the player with runners in scoring position while HRROB and ABROB are the homeruns and at bats with runners on base. Essentially, D is the number of hits with runners in scoring position plus homeruns with men on base above that which would be expected given the player's typical performance. D is then added to the number of runs created found through the previous equation apparently on the theory that each is worth about a run.

Once Runs Created is calculated it can be used to create derivative statistics that are appropriate for making comparisons between players. This is necessary since, like other counting statistics such as hits or runs scored, Runs Created is heavily influenced by the opportunity the player has. Even a very good hitter with 100 plate appearances will create fewer runs than a very bad hitter with 600 plate appearances.

The first is Runs Created per 27 outs or RC/27 (also sometimes called Runs Created per Game or RC/G). Simply put, this formula estimates the number of runs per game that a team made up of nine copies of the same player would score. To do so you divide the number of Runs Created by the number of outs the player consumed and then multiply this by 27.

RC/27 = (RC / (AB-H+SH+SF+CS+GIDP)) * 27

Early versions of the formula by James used 27 outs; later versions, however, use the league average outs per game, like so:

RC/27 = ((RC*3*LgIP)/(2*LgG))/(AB-H+SH+SF+CS+GIDP)

where LgIP is the number of innings pitched in the league and LgG is the number of games played in the league.
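
Both forms of RC/27 can be sketched in Python as below. The function names and the inputs in the example are hypothetical. The sketch also assumes that LgG counts each league game once, so that 2*LgG is the number of team-games; under that reading the league version equals the fixed 27-out version whenever teams average exactly 9 innings per game.

```python
def outs_consumed(ab, h, sh=0, sf=0, cs=0, gidp=0):
    """Outs used by the player: AB-H+SH+SF+CS+GIDP."""
    return ab - h + sh + sf + cs + gidp


def rc_per_27(rc, outs):
    """RC/27 = (RC / outs) * 27."""
    return rc / outs * 27


def rc_per_game_league(rc, outs, lg_ip, lg_g):
    """RC/27 = ((RC*3*LgIP)/(2*LgG))/outs, using league outs per game."""
    return (rc * 3 * lg_ip / (2 * lg_g)) / outs


# Hypothetical player: 100 Runs Created while consuming 400 outs.
outs = outs_consumed(ab=550, h=170, sh=2, sf=5, cs=3, gidp=10)
print(outs)                  # 400
print(rc_per_27(100, outs))  # 6.75

# Hypothetical league: 1000 games, each team pitching exactly 9 innings.
print(rc_per_game_league(100, outs, lg_ip=9 * 2 * 1000, lg_g=1000))  # 6.75
```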

Of course, this statistic produces a rate and so is ideal for comparison. For example, here are the season leaders in RC/G from Lee Sinins' Sabermetric Encyclopedia, where he calculates Runs Created using the appropriate technical version:

1 Barry Bonds 2004 22.08
2 Barry Bonds 2002 21.23
3 Ted Williams 1941 19.14
4 Barry Bonds 2001 18.65
5 Babe Ruth 1920 18.41
6 Babe Ruth 1921 17.90
7 Babe Ruth 1923 17.31
8 Barry Bonds 2003 16.75
9 Ted Williams 1957 16.44
10 Nap Lajoie 1901 15.78

A second derivative statistic is OWP or Offensive Winning Percentage. OWP is a calculation of the winning percentage of a hypothetical team made up of nine of a particular player on offense and league average pitching and defense. It is based on RC/27 and the Pythagorean Formula which estimates how many games a team should win based on their runs scored and runs allowed. The formula is:

OWP = ((RC/27)^2)/(((RC/27)^2)+((LgR/G)^2))

This produces a winning percentage that can easily be used to compare players. The advantage it has over RC/27 is that it takes into account the league context in which the player played by using the league runs scored per game (LgR/G). Notice the differences in the top 10 list in RC/27 and the top 10 in OWP shown below:

1 Barry Bonds 2004 .958
2 Barry Bonds 2002 .942
3 Barry Bonds 2001 .922
4 Mickey Mantle 1957 .915
5 Babe Ruth 1920 .913
6 Ted Williams 1941 .908
7 Babe Ruth 1923 .896
8 Babe Ruth 1921 .891
9 Ted Williams 1957 .891
10 Babe Ruth 1926 .883
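
The OWP calculation itself is short enough to sketch in Python. The function name and the inputs are hypothetical, and the exponent of 2 is the one used in the formula above. The example values are round numbers chosen for illustration, not figures for any real player or league.

```python
def offensive_winning_pct(rc27, lg_runs_per_game):
    """OWP = (RC/27)^2 / ((RC/27)^2 + (LgR/G)^2)."""
    scored = rc27 ** 2               # the player's team on offense
    allowed = lg_runs_per_game ** 2  # league average pitching and defense
    return scored / (scored + allowed)


# A lineup that scores exactly the league average plays .500 ball.
print(offensive_winning_pct(5.0, 5.0))   # 0.5

# Doubling the league scoring rate gives 100/(100+25).
print(offensive_winning_pct(10.0, 5.0))  # 0.8
```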

Finally, Runs Created can also be used for comparisons to average players. For example, Lee Sinins has created RCAA or Runs Created Above Average, which estimates how many runs a player contributed beyond a league average player given the same number of outs consumed. As a result RCAA can take negative values, much like Batting Runs, as we'll discuss in our next post in this series.


WilliamKF said...

Is D correct? I see it elsewhere as first term being hits with runners in scoring positions not hr with runners in scoring position. Which is correct?

