
Sunday, November 28, 2004

Clutch Hitting: Fact or Fiction?

Back from the Thanksgiving holiday, I was alerted by a fellow SABR member to this article on clutch hitting. Notably, the author finds support for the existence of a clutch hitting effect using Retrosheet data. There are two interesting things about this study.

First, the size of the clutch hitting effect that was found is described by the author:

"Over the course of a season, an average hitter will get approximately 150 clutch plate appearances, in which he will get on base 49 times with a standard deviation due to randomness of 5.7. The difference between a 'one standard deviation good' and an average clutch hitter amounts to only 1.1 successful appearances, while the difference between a good and an average overall hitter amounts to 3.9 successful plate appearances."

In other words, the clutch hitting effect that was found was relatively small (28% of the difference between a good and an average hitter), smaller than most baseball people would assume, I think. This means that a hitter who normally hits .240 might hit .255 in clutch situations, or a hitter who normally hits .270 might hit .285. Over the course of a season the random effects (note that the standard deviation is much larger than the difference between an average and a good clutch hitter) would thus swamp clutch ability, which is why clutch hitting has been so difficult to measure.
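As a quick sanity check on the quoted numbers, the standard deviation of 5.7 is just what you'd expect if on-base success in clutch plate appearances were purely random. This sketch treats each plate appearance as an independent binomial trial, which is a simplifying assumption on my part, not the study's exact method:

```python
import math

# 150 clutch plate appearances, reaching base 49 times (the study's figures)
n = 150
p = 49 / n  # observed on-base rate in clutch situations

# standard deviation of the number of successes in n independent trials
sd = math.sqrt(n * p * (1 - p))
print(round(sd, 1))  # 5.7, matching the figure quoted above
```

Note how the randomness (5.7) dwarfs the 1.1 successful appearances separating a good clutch hitter from an average one.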

While this study is important it doesn't seem to me to contradict the basic conclusion of previous sabermetric studies that clutch hitting is largely a myth. See the following for more information:

Clutch Hitting And Experience
Clutch Hitting and Statistical Tests
Hitting with Runners in Scoring Position
Does Clutch Hitting Exist?

A second important point in the study is that the author found that power hitters tended to be poor clutch performers while contact hitters tended to be good clutch hitters. On this point the author notes:

"There was a great deal of discussion [on the Baseball Primer site in February 2004] as to whether or not the tendencies of power hitters was evidence that what I am measuring is not "clutch", but rather tendencies of various types of hitters against the types of pitching one would find in clutch situations. It is unclear how to prove or disprove this theory without knowing what types and quality of pitches each batter saw during his at-bat, but the fact that this correlation is less with higher-pressure situations would seem to suggest that the important part of the variation is not a function of hitter profile."

I'm not so sure I buy this argument. To me it seems that the correlation may well be higher in pressure situations because closers and other good relief pitchers who are leveraged in such situations have skills designed to get power hitters out. In other words, maybe relief pitchers are selected at some level because of their ability to retire hitters who can turn the game around with one swing. The fact that they're used in these situations may then account for much of the clutch ability that was found in the study.

Tuesday, November 23, 2004

The Thinking Fan's Guide to Baseball

I mentioned in a previous post that I had picked up a copy of Leonard Koppett's The Thinking Fan's Guide to Baseball: Revised and Updated last week. Koppett originally wrote the book in 1966 and subsequently updated it in 1991 and 2001. After his death in 2003 the book was once again updated and reissued with the help of Pat Gillick, former GM of the Toronto Blue Jays, Baltimore Orioles, and Seattle Mariners.

Koppett is generally credited with being one of the most statistically savvy media members and routinely used statistics in his columns, which stretched from 1948 through 1975 while he wrote for The New York Times and the Oakland Tribune, among others. He was also a member of SABR and appreciated much of what SABR does (SABR paid him tribute at its convention in Denver in 2003). The book moves through the various aspects of the game, beginning with the activity on the field (hitting, pitching, fielding, baserunning, managing, umpires, etc.) and then moving on to the behind-the-scenes view that Koppett knew so well, including the media, road trips, scouts, scoring, and the business aspects of the game. He concludes with a section titled "The Whole Ballgame," a series of essays on other aspects of the game, including expansion, changes to the ball, spring training, and other more or less random topics.

What immediately grabbed my interest of course was chapter 15, "Statistics". Here Koppett's attitude towards statistics can be summed up in the following quote:

"Even with all these things in mind [adjustments such as contextualizing and understanding relationships between statistics], however, the fiend for statistics can be led to totally incorrect conclusions, because there is one more fundamental flaw in the way standard statistics are kept. They record how much, but not when, and in the winning and losing of ball games, the when is all-important."

Given his reputation as a bit of an innovator with statistics, this surprised me somewhat. He then goes on to give several examples of particular game scenarios where the "when" is all-important. Of course, what his analysis misses (although he acknowledges it a bit later in the chapter in a different context) is that statistics are by their very nature aggregate, and because of this can only be interpreted when applied to the larger context in which they were created. In other words, he's absolutely correct that official baseball statistics merely record what happened and not when, and are therefore an abstraction. But what they assume is that the "whens" even out across the hundreds or thousands of "whats" of a season or career. By doing so they become meaningful as a measure of a player's true ability. This assumption, along with the fact that studies have shown clutch hitting to be largely a myth, means that when properly analyzed, statistics can be a very useful tool indeed. In other words, Koppett vastly overestimates the variability in the performance of players in different situations.

Because of this "fundamental flaw" Koppett goes on to argue that baseball statistics cannot be used to predict, they cannot be used to prove a point but only convince in an argument, and they cannot be used to compare. The reasons he gives for his third point are not controversial. He points out the different contexts in which statistics are counted including ballparks and the problem of using very small sample sizes to make decisions. He does not, however, talk about how those problems can be overcome using park factors and confidence intervals for example.

Interestingly, he then moves into a discussion of how statistics are used and interpreted by players and management. In particular, with regard to management he says:

"Also, to the professional baseball man, many implicit qualities go with the phrase '.300 hitter' or '.250 hitter' or '20-game winner,' qualities not taken into account by the fan who uses the same terms."

While I certainly agree that front office personnel take into account things that are not reflected in the statistics alone, I'll take issue with the very categories used here as examples. Humans categorize reality in order to better understand it, and the fact that the categories here include batting average and pitcher victories, two of the least effective ways to analyze the contributions of hitters and pitchers, betrays the fact that the "professional baseball man" often didn't (or doesn't) understand what really goes into winning baseball games over the long haul.

And ultimately this is the problem of perspective I talked about in a recent post. By being so close to the game, the sportswriter, the scout, the manager, or the general manager can miss the bigger picture. How else can you square the fact that prior to the last few years winning strategies such as controlling the strike zone and not giving up outs were undervalued and are still controversial? You can see this in Koppett himself when throughout the book he uses batting average almost exclusively when discussing the merits of various hitters.

Finally, Koppett lays it all on the line when summing up the chapter.

"My own view is that the SABR mania (and I speak as a member of the organization) has gone out of control. The Bill James approach, of cloaking totally subjective views (to which he is entirely entitled) in some sort of asserted 'statistical evidence,' is divorced from reality. The game simply isn't played that way, and his judgments are no more nor less accurate than anyone else's. The truly mathematical manipulations - regressive analysis and all that - do not, in my opinion, add to what is already known on the working level. They are perfectly valid and even brilliant as material for conversation...But they don't tell a professional - a player, manager, or scout - anything he doesn't know anyhow by observation. And when his observation conflicts with the printout, he - and I - will trust his observation every time..."

He then goes on to give two reasons why. First, he sees statistics as too blunt an instrument to tell you all of what happened, which in part reflects his belief in the importance of clutch hitting discussed above; and second, he believes the "reality of the game" is too complex to capture statistically. He notes that "Most professionals feel the same way."

Perhaps I've been immersed in sabermetrics too long, but I have a hard time getting my mind around that paragraph. Taking just part of the next-to-last sentence, "they don't tell a professional...anything he doesn't know anyhow by observation" reveals how flawed his reasoning is. I wonder how many players, managers, or scouts could have told you how many runs a good offensive player typically contributes in a season before the birth of sabermetrics? My guess is that the range would have been from 20 to 200. I wonder how many professionals, before the popularization of OPS, would have told you that a player with a .269/.389/.492 line was actually a better player than one with a .309/.321/.443 line? I wonder how many professionals would have told you that it is generally ok to bunt with a runner on first and nobody out (it isn't, ever)? These are all examples of information not obtainable at the working level.
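For the middle example, OPS (on-base plus slugging) makes the comparison immediate. A two-line computation, with the batting lines written as AVG/OBP/SLG triples:

```python
# Batting lines as (AVG, OBP, SLG); OPS = OBP + SLG
line_a = (0.309, 0.321, 0.443)
line_b = (0.269, 0.389, 0.492)

def ops(line):
    avg, obp, slg = line
    return round(obp + slg, 3)

print(ops(line_a))  # 0.764
print(ops(line_b))  # 0.881 - the lower-average hitter is far more productive
```

The batting average points toward the first hitter; the on-base and slugging components reveal the second is considerably better.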

And further to the point: baseball, being the game of the long season with thousands of discrete events (over 193,000 in 2003), does not lend itself to making decisions based only on observation. Human observation is simply not up to the task. This point was nicely illustrated by a story that Paul DePodesta, now GM of the Dodgers but formerly the assistant to Billy Beane in Oakland, told in a speech he gave and which Rob Neyer was nice enough to send me the transcript of:

"Our manager now, Ken Macha, loves our second baseman Mark Ellis. Mark Ellis is a good player, he plays hard, and he plays every day. But he didn't have a very good offensive year this year, yet Ken Macha kept putting him in the lineup every day. It even got to the point late in the year where he started hitting him leadoff. We finally went to Ken and said, 'We like Ellis too, but he probably doesn't need to be hitting leadoff, and getting all these at-bats.' And his comment to us was, 'Ellis is a clutch hitter.'

I thought, 'OK, clutch is one of those subjective terms I'm not wild about,' so I went back and I looked at the numbers, and at that time during the year Ellis was hitting about .163 with runners in scoring position and two outs, which I think is a clutch situation. But I didn't say anything, we kept it under wraps. When we were getting close to the playoffs, though, we began talking about the way the lineup should work against the Red Sox, and at one point Macha was talking about putting Ellis leadoff. Finally Billy Beane, our General Manager, just couldn't take it any more, and he said, 'Ellis is hitting .163 with runners in scoring position and two outs. He's not clutch.' And immediately, Macha said, 'But he hit that game-winning home run off of Jason Johnson.'

'OK, that's right, but if you want to play that game I'm going to come up with a lot more instances where he failed than instances you're going to come up in which he succeeded.'"

DePodesta's point was that observation isn't always the best way of understanding something, especially where large numbers and percentages are concerned.

So while I like this book overall (the writing is very entertaining and the anecdotes interesting), the perspective is one that many sabermetrically knowledgeable fans will sometimes bristle at.

On a slightly different subject, I found chapter 28, "The Ball's the Same, the Bat's the Same - Or Are They?" interesting as well. Here Koppett includes a nice history of changes in the styles of bats used and how the ball has changed over time. I was a bit surprised when, after discussing the offensive outburst of 1987 (which he suggests may have been accounted for by weather changes), he says "In 1993, however, it was undeniable that a truly livelier ball appeared..."

This surprised me because a livelier ball is not usually given as the culprit. I've written previously on the power surge and the tests that physicist Alan Nathan did on balls from 1974 and 2004, which show no differences in their "bounciness". In addition, Koppett makes no mention of the fact that 1993 corresponds with the first year of the Colorado Rockies, which certainly had an effect (I don't have the home/road breakdowns, but as an example of the effect, in 1993 1,040 runs were scored in Denver against only 685 in Rockies road games). Further, it's evident from looking at the general trend in homeruns that the number of homeruns per game had been increasing since the early 1980s, with a brief decline in the 1988-1989 period. Although the jump is more pronounced since 1993, the trend actually continued through 2001 before declining again.

This doesn't appear to me to be the product of a livelier ball but rather the accumulation of advantages for hitters, which might include increased strength, the full effect of aluminum bats, and the absence of intimidation, among others. Koppett also notes, but doesn't pursue, the possibility that the balls were not livelier starting in 1993 but instead had lower seams, which make breaking balls less effective (he assumes that seams could be lower only by winding the ball tighter, which I doubt). Another trend that's obvious from watching any game from the 1970s or '80s on ESPN Classic is the frequency with which balls are thrown out of play today versus 20 or 30 years ago. This is particularly noticeable on pitches in the dirt, but also of course when a new ball is introduced each half inning as the outfielder or infielder throws the previous one into the stands. Balls that stay in play longer are easier to grip and therefore easier to throw good pitches with.

Sunday, November 21, 2004

A Brief History of Run Estimation: Base Runs

In previous articles in this series on the history of run estimation I've looked at Runs Created (RC), Batting Runs (BR), and Estimated Runs Produced (ERP). Although all of these formulas are intuitive in that they are based on a model of how the game is played, one of the differences between RC and both BR and its variant ERP is that RC is a non-linear formula. In other words, RC essentially says that the whole is greater than the sum of the parts with respect to offensive events, while BR and ERP use a linear approach that assigns weights to the various offensive events. In this installment of the series I'll look at another intuitive but non-linear run estimation formula called Base Runs (BsR).

Base Runs (BsR) was developed by David Smyth in the early 1990s, and since then much discussion of it has occurred on the Strategy and Sabermetrics forum. Like RC and BR, this formula is based on team scoring, which is then applied to individuals. The basic formula for BsR is:

BsR = (BaseRunners * ScoreRate) + HR

In other words, runs can be estimated if you know how many baserunners there are, what their chance of scoring is, and how many homeruns the team hit. To many, this is the strength of BsR: it captures a more intuitive and accurate model of how runs are created. Like RC, the formula is then further broken down into components like so:

BsR = (A * (B/(B+C))) + D

Here A represents baserunners, defined as hits plus walks minus homeruns; B is the advancement of those runners; C is outs, defined as at-bats minus hits; and D is homeruns. So you can see that the ScoreRate is defined as the advancement value divided by the advancement plus the outs consumed.

ScoreRate = B/(B+C)

The complexity of the formula is centered in the calculation of B. I've seen several versions of the B factor including:

B = (.8*1B) + (2.1*2B) + (3.4*3B) + (1.8* HR)+(.1*BB)
B = .8*(1B+SB) + (2.2*2B) + (3.5*3B) + (2* HR)
B = (2.5*TB - H - 5*HR + 2*SB + .05*(BB+HBP))*X

As you can see in the first two formulas found on Tangotiger's site and posted by Smyth himself on a blog discussion of run estimators, the B factor is much like the Linear Weights values with the exception that they're larger and that triples outweigh homeruns. Although this seems intuitively wrong, this is because homeruns are a special case considered in the D factor. The third formula used by Brandon Heipp in his article on BsR for the August 2001 issue of the SABR journal By The Numbers takes a different approach and can be expanded to:

B = ((1.5*1B) + (4*2B) + (6.5*3B) + (4*HR) + (2*SB) + .05*(BB+HBP))*X

In this version the weights are roughly doubled, stolen bases and hit-by-pitches are included, and an "X" factor is introduced to adjust for the team or league being tested. This value is historically around .535. It is used in the same way as the varying out value in the Batting Runs formula: to calibrate the formula for a particular league.

The full formula using the first B factor above is then:

BsR = ((H + BB - HR) * ((.8*(1B) + (2.1*2B) + (3.4*3B) + (1.8* HR)+(.1*BB))/((.8*(1B) + (2.1*2B) + (3.4*3B) + (1.8* HR)+(.1*BB))+((AB-H))))) + HR
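As a sketch (my translation, not Smyth's own code), the full formula above maps directly into a function taking singles, doubles, triples, homeruns, walks, and at-bats:

```python
def base_runs(singles, doubles, triples, hr, bb, ab):
    """Base Runs (BsR) using the first B factor given above."""
    h = singles + doubles + triples + hr
    a = h + bb - hr                                   # A: baserunners
    b = (0.8 * singles + 2.1 * doubles
         + 3.4 * triples + 1.8 * hr + 0.1 * bb)       # B: advancement
    c = ab - h                                        # C: outs
    return a * (b / (b + c)) + hr                     # D: homeruns score themselves

# The all-homerun team from the thought experiment below:
# 97 homeruns in 100 at-bats (three outs). A = 0, so every
# run comes from the D factor.
print(base_runs(0, 0, 0, 97, 0, 100))  # 97.0
```

A typical full-season team line fed through the same function lands in the realistic 700-900 run range, which is the sense in which the formula "works" away from the extremes as well.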

To illustrate that this formula is more accurate at the extremes, consider a thought experiment with two imaginary teams. Team A hits 97 homeruns and makes three outs; Team B draws 100 walks and makes no outs. In the case of Team A, common sense says the team will score 97 runs. In the case of Team B, common sense says they'll score 97 runs and leave the bases loaded. BsR compares with RC and BR as follows:

      Team A   Team B
BsR       97       97
RC       376        0
BR       136       33

Obviously, at these extremes BsR does a better job, because it doesn't overvalue the homerun as RC does, and because, being a non-linear formula, it takes the offensive context into account as BR does not.

But of course major league baseball is not played at the extremes. This is why all three formulas can be relied upon to approximate the nearly straight line relationship of offensive events to runs in the frequency ranges for events in major league baseball.

That being said, Heipp did calculate the root mean square errors for each of the formulas and found that BsR had a smaller error than RC but an ever-so-slightly larger error than Furtado's Extrapolated Runs (XR). All three, however, were in the range of 23.7 to 25.8 runs.
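For reference, root mean square error here simply measures how far each formula's team-season estimates fall from actual runs scored. A minimal sketch with made-up team totals (the numbers are illustrative only, not Heipp's data):

```python
import math

actual    = [850, 790, 912, 688, 745]  # hypothetical actual team runs
estimated = [862, 801, 895, 700, 738]  # hypothetical run-estimator output

def rmse(xs, ys):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

print(round(rmse(actual, estimated), 1))  # 12.2
```

An RMSE of roughly 24 runs, as Heipp found, means a typical team's season estimate misses by about a quarter of a run per game.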

A challenge for BsR is that, like RC, it is in essence a run estimator for teams, and so when applied to the statistics of an individual the formula automatically forces interaction of the offensive events with themselves. However, a player never creates runs in the context of a team of clones, and so the result is that BsR overestimates runs for players with high slugging and on-base percentages. It doesn't do so as much as RC, though, since homeruns are isolated. To negate this effect the RC formula now contextualizes a player's individual statistics with eight mythical average players (see my article on RC for the details). The same approach can be used with BsR, as Heipp shows in this article.

A second challenge for BsR is to more accurately approximate the ScoreRate. The formula B/(B+C) may not in fact be the best way to do this and there has been much discussion on this topic recently. Of course, one way of coming up with empirical weights for the offensive events is to calculate the ScoreRate for entire leagues using the known number of baserunners, outs, and runs and then run a regression on the offensive events. That may in fact be what Smyth did although I don't know.

Changes in the Game

I picked up a copy of the late Leonard Koppett's The Thinking Fan's Guide to Baseball this weekend. In reading his chapter on changes to the game (expansion, a livelier ball, etc.) I wanted to take a quick look at the changes along the axes that he mentions: runs scored per game, batting average, and homeruns per game. Using the Lahman database I created the following quick graph. The trendlines for Runs/G and HR/G use a moving 10-year average.
I'll have more comments on the book and Koppett's view of these changes along with his take on sabermetrics later this week.

Friday, November 19, 2004

Leaving the Bat on Your Shoulder

Here's an interesting question:

What are the odds of walking if you don't swing at any pitches?

To answer that question I used the Lahman database and play-by-play files from 2003 to look at all plate appearances that were not intentional walks. Here's the answer:


                Count     Pct
BB              14573     7.8%
BB, no swings    6842     3.7%
K               30881    16.6%
K, no swings     1108     0.6%

So in about one out of every 26 plate appearances a batter walks without offering at a pitch. For this study "offering" included swinging, bunting, foul tips, missed bunt attempts, swinging at intentional balls, and foul balls. I also found that striking out without offering is rarer still, happening only once in 150 plate appearances.

Organizing Domain Logic: Domain Model

The third pattern for representing domain logic in an application is the Domain Model. In this model the application is viewed as a set of interrelated objects. The core feature of these objects (referred to as Business Entities in Microsoft nomenclature) is that, unlike the Table Module approach, each object maps to an entity (not necessarily a database table) in the database. In other words, each primary object represents a single record in the database rather than a set of records, and couples that data with its behavior (business logic and data validation). Additional objects may represent calculations or other behavior. Ultimately, this approach requires at least the following:

  • Object-relational Mapping. Since this pattern does not rely on the DataSet object or even Typed DataSet objects, a layer of mapping code is required. At present the Framework does not contain built-in support for this layer. Look for support in the Microsoft Business Framework (MBF) technology to be released after the Whidbey release of VS .NET. Techniques to perform this mapping are covered in my article Software Design: Using data source architectural patterns in your .NET applications.
  • ID Generation. Since each Domain Model object maps to an entity in the database, the objects need to store the database identity within the object (the CLR’s object system uniquely identifies each object based on its data). There are several techniques to do so documented by Fowler in his discussion of the Identity Field pattern.
  • Strongly-Typed Collection Classes. Because objects in the Domain Model map intrinsically to an individual entity, representing multiple objects is handled through collections. Framework developers can therefore create strongly-typed collection classes to handle representing multiple domain objects. For more information see my article Take advantage of strongly typed collection classes in .NET.

The end result of a domain model is that the application manages many individual and interrelated objects (using aggregation) during a user’s session. Although this may seem wasteful, the CLR’s efficiency at managing objects makes the Domain Model a valid approach. Such an approach would not have been efficient in the world of COM, however. The Domain Model can also take advantage of inheritance by implementing a Layer Supertype.

To build an object used in a Domain Model, a best practice is to follow the Layer Supertype pattern. Using this pattern an architect develops a base class from which all domain objects inherit, as shown here:

<Serializable()> _
Public MustInherit Class BusinessObjectBase : Implements IComparable

    Protected _id As Integer ' could also use a GUID or a key class
    Protected IsDirty As Boolean = False

    <Xml.Serialization.XmlAttribute()> _
    Public Property Id() As Integer
        Get
            Return _id
        End Get
        Set(ByVal Value As Integer)
            _id = Value
        End Set
    End Property

    Public MustOverride Function Save() As Boolean
    Public MustOverride Function Delete() As Boolean

    Public Shadows Function Equals(ByVal o As BusinessObjectBase) As Boolean
        Return Me.Id = o.Id
    End Function

    Public Shared Shadows Function Equals(ByVal o As BusinessObjectBase, _
            ByVal o1 As BusinessObjectBase) As Boolean
        Return o.Id = o1.Id
    End Function

    ' Used when the Sort method of a collection is called
    Public Function CompareTo(ByVal o As Object) As Integer _
            Implements IComparable.CompareTo
        Dim b As BusinessObjectBase = CType(o, BusinessObjectBase)
        Return Me.Id.CompareTo(b.Id)
    End Function
End Class

You’ll notice that this abstract class contains the implementation of the Id property, the abstract Save and Delete methods, and a protected IsDirty field used to determine whether the object has been altered and needs to be persisted. In addition, it shadows (implemented with the new keyword in C#) both signatures of the Equals method inherited from System.Object, used to determine whether two instances of the object are equal. Note that these methods check the Id property but could alternately be implemented to check some other criterion, or more likely be overridden in a derived class to compare other properties of the object. The class also uses the SerializableAttribute and XmlAttribute classes so that the object can be serialized to XML with its Id property represented as an attribute.

A derived Domain Model object, implemented here by an Order class, might then look as follows.

<Serializable()> _
Public Class Order : Inherits BusinessObjectBase

    Private _orderId As Long
    Private _cust As Customer
    Private _prod As Product
    Private _quant As Integer = 1 ' default
    Private _ship As ShipType = ShipType.Postal ' default

    Public Sub New()
    End Sub

    Public Sub New(ByVal cust As Customer, ByVal prod As Product)
        _InitClass(cust, prod, Nothing, Nothing)
    End Sub

    Public Sub New(ByVal cust As Customer, _
            ByVal prod As Product, ByVal quantity As Integer)
        _InitClass(cust, prod, quantity, Nothing)
    End Sub

    Public Sub New(ByVal cust As Customer, ByVal prod As Product, _
            ByVal quantity As Integer, ByVal ship As ShipType)
        _InitClass(cust, prod, quantity, ship)
    End Sub

    Private Sub _InitClass(ByVal cust As Customer, _
            ByVal prod As Product, ByVal quantity As Integer, _
            ByVal ship As ShipType)
        _cust = cust
        _prod = prod
        Me.Quantity = quantity
        Me.ShipVia = ship
        Me.IsDirty = True
        ' Generate a new or temporary order id: use a system-assigned key
        ' _orderId = key table or GUID
    End Sub

    Public ReadOnly Property Customer() As Customer
        Get
            Return _cust
        End Get
    End Property

    Public ReadOnly Property Product() As Product
        Get
            Return _prod
        End Get
    End Property

    Public Property Quantity() As Integer
        Get
            Return _quant
        End Get
        Set(ByVal Value As Integer)
            If Value < 0 Then
                Throw New ArgumentOutOfRangeException( _
                    "Quantity cannot be negative")
            End If
            _quant = Value
            Me.IsDirty = True
        End Set
    End Property

    Public Property ShipVia() As ShipType
        Get
            Return _ship
        End Get
        Set(ByVal Value As ShipType)
            _ship = Value
            Me.IsDirty = True
        End Set
    End Property

    Public Function CalcShippingCost() As Double
        ' calculate the shipping cost based on the Customer and Product
        ' objects and store the shipping cost for this order
    End Function

    Public Function CalcOrderTotal() As Double
        ' calculate the total cost of the order with tax
        ' and store the cost for this order
    End Function

    Public Function IsComplete() As Boolean
        ' determine whether this order has enough information to be saved
    End Function

    Public Overrides Function Save() As Boolean
        ' persist the order
        Me.IsDirty = False
    End Function

    Public Overrides Function Delete() As Boolean
        ' remove the order
    End Function
End Class

The key point to note about this class is that it combines data (the Quantity and ShipVia properties), behavior (the IsComplete, Save, CalcOrderTotal, and other methods), and relationships to other data (the Product and Customer properties that relate to the Product and Customer classes). The end result can be represented in a static structure diagram.

The second key point to note about this structure is that each object derived from BusinessObjectBase generates its own Id in its constructor; in the Order class the order id is stored in a Long (64-bit) integer field. The strategy shown here is only one among several documented by Fowler as the Identity Field pattern. The considerations for Id generation are as follows:

  • Use of system-assigned keys. Here the assumption is made that the key value will be assigned by the system rather than derived from a “natural key”. In practice this proves to be the more durable design, since the independence of the keys allows them to remain fixed once assigned. It also avoids compound keys (which, if used, should be implemented in a separate key class), which are difficult to manage and slow the performance of the database.
  • Use of data types. Here a 32-bit integer is used; however, unless uniqueness can be guaranteed when generating the ids, it is possible to duplicate keys, which would cause problems once the object is persisted to the database. An alternative is System.Guid, which is (for all intents and purposes) guaranteed to be unique. If you wish to protect yourself against data type changes in the key, you could also implement the key in a key class that abstracts the data within the key and performs any comparisons and generation.
  • Generation of the ids. If a data type such as an integer is used, one technique for generating unique ids is to generate them from a key table. The downside of this technique is that it incurs a round trip to the database server.
  • Uniqueness of the ids. In the code shown here it is assumed that each object generates its own key, probably through the use of a key table. This assumes that each object type (Order, Customer, Product) generates keys independently of the others. An alternate approach would be to create system-wide unique ids through a key table, in which case the generation of the id could be placed in the base class. If the classes form an inheritance relationship, for example if a BookProduct class inherits from the Product class, the generation of ids should take place in the Product class so that all objects in the hierarchy are uniquely identified and can be stored in the same database table.

To be able to manipulate a collection of Domain objects they can be represented in a collection class. While the Framework includes collection classes in the System.Collections namespace including ArrayList, SortedList, and Dictionary among others, there are advantages to creating a custom strongly-typed collection class including:

  • Strong-typing. As the name implies, a custom collection can be strongly-typed, meaning that only objects of a specific type are allowed to be added to the collection.
  • Custom Manipulation. A strongly-typed collection class also provides the opportunity to add custom behaviors to the class, for example, by providing custom sorting capabilities.

To create the strongly-typed collection you can use the power of implementation inheritance. By deriving a class from CollectionBase, ReadOnlyCollectionBase, or DictionaryBase and then overriding and shadowing various members, you can not only enforce the type of objects in the collection but also take advantage of functionality that Microsoft has already included in the .NET Framework. For example, Microsoft uses the CollectionBase class as the base class for over 30 of its own strongly-typed collection classes in the .NET Framework. Each of these three classes is marked as abstract (MustInherit in VB) and therefore can only be used in an inheritance relationship.
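As a minimal sketch of that approach (the Order class here is just a stand-in for any Domain class), a strongly-typed collection derived from CollectionBase might look like this:

```csharp
using System.Collections;

// Stand-in Domain class for the sake of the example
public class Order
{
    public int Id;
}

// Strongly-typed collection: the inherited inner List is only
// reachable through members that accept or return Order objects.
public class OrderCollection : CollectionBase
{
    public void Add(Order order)
    {
        List.Add(order);
    }

    public Order this[int index]
    {
        get { return (Order)List[index]; }
        set { List[index] = value; }
    }
}
```

A caller that tries to add, say, a Customer to this collection now fails to compile rather than failing at run time.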

Thursday, November 18, 2004

The 2004 Scouting Report by the Fans

Tangotiger is embarking on a project to collect fan perceptions in order to evaluate MLB players. The word is that more data on the Royals is needed. Please click here to go directly to the Royals and provide your input. Thanks!

Tuesday, November 16, 2004

Organizing Domain Logic: Table Module

The second pattern for representing domain logic is the Table Module. As the name implies this pattern calls for a Business Component to map to a table in the database. The component then contains all the domain logic methods required to manipulate the data. There are two key considerations here.

First, although the name refers to a table, Table Module can also be used to abstract frequently used sets of data created by joining multiple tables together in a view or query. This makes dealing with more complicated data much easier for callers of the Business Component.

Second, the core characteristic of each component built as a Table Module is that, unlike Domain Model, it has no notion of identity. In other words, each Table Module object represents a set of rows rather than a single row, and therefore in order to operate on a single row the methods of the Table Module must be passed identifiers.

When a method of a Table Module is called, it performs its logic against a set of rows passed into it. In .NET Framework applications this set of rows maps to a DataSet, which makes the pattern a natural way to represent domain logic. In fact, using Typed DataSets with Table Module is particularly effective since it promotes strong typing at design time, leading to fewer runtime errors and better control over the data flowing through the application.

A common approach to handling this is to create a Layer Supertype (a pattern also discussed by Fowler) or base class for all of the Table Modules that accepts a DataSet in its constructor. This also allows the Business Components to be tested without a live connection to a database.

To implement a Table Module approach in the retail application example you could create an abstract BusinessComponentBase class like the one shown here. This class is responsible for accepting the DataSet that the class will work on, exposing individual rows in the DataSet through the default (VB) or indexer (C#) property, and exposing the entire set of rows in a read-only property.

As you can see one of the advantages to using Table Module over Transaction Script is that it is easier to encapsulate responsibilities with the appropriate objects. Here the Customers class contains the SendEmail and CheckCredit methods that can be called independently whereas the Transaction Script shown previously would have included inline code to handle this.
The code to implement the BusinessComponentBase class can be seen below.

Public MustInherit Class BusinessComponentBase
    Private _ds As DataSet
    Private _pk As String

    Protected Sub New(ByVal data As DataSet)
        _ds = data
        _pk = _ds.Tables(0).PrimaryKey(0).ColumnName
    End Sub

    Default Protected ReadOnly Property Item(ByVal id As Integer) As DataRow
        Get
            Dim f As String = _pk & " = " & id
            Return _ds.Tables(0).Select(f)(0)
        End Get
    End Property

    Protected ReadOnly Property Items() As DataSet
        Get
            Return _ds
        End Get
    End Property
End Class

Note that the protected constructor finds the primary key column and stores its value in a private variable used by the Item property to select and return a particular DataRow.

This class can then be inherited by Orders and Customers. The Orders class in C# is shown here.

public class Orders : BusinessComponentBase
{
    // Shadow the base property to return the Typed DataSet
    public new OrdersDs Items
    {
        get { return (OrdersDs)base.Items; }
    }

    public Orders(OrdersDs orders) : base(orders)
    { }

    public ShipType GetShipType(long orderId)
    {
        // Return the shipping type for the order
    }

    public double GetShippingCost(long orderId)
    {
        // Calculate the shipping costs for the order
    }

    public void SaveOrder(long orderId)
    {
        // Save the specific order to the database
    }

    public void SaveOrder()
    {
        // Save all the orders to the database
    }

    public long Insert(long productId, int customerId,
                       long quantity, ShipType ship)
    {
        // Insert a new row in the DataSet
    }
}

The main point to notice here is that the constructor accepts a DataSet of type OrdersDs, which is a Typed DataSet. The class then shadows the Items property in order to return a strongly typed DataSet. This is also a good place to take advantage of overloading, as in the SaveOrder method, which can be used to save either one order or all of the orders in the DataSet.

A client can then instantiate the Orders class, passing it the OrdersDs DataSet:

Orders o = new Orders(dsOrders);
o.SaveOrder(123324);
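The Customers class mentioned earlier (with its SendEmail and CheckCredit methods) is not shown in the text; a minimal sketch following the same pattern, where CustomersDs is an assumed Typed DataSet and the method bodies are placeholders, might be:

```csharp
public class Customers : BusinessComponentBase
{
    public Customers(CustomersDs customers) : base(customers)
    { }

    // Shadow the base property to return the Typed DataSet
    public new CustomersDs Items
    {
        get { return (CustomersDs)base.Items; }
    }

    public void SendEmail(int customerId)
    {
        // Look up the customer row via the inherited indexer
        // and send the confirmation message
    }

    public bool CheckCredit(int customerId)
    {
        // Apply the credit rules against the customer's row
        return true; // placeholder
    }
}
```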

Monday, November 15, 2004

Value by Win Expectancy

I blogged a couple of weeks ago about WRAP as discussed in a NY Times piece by Alan Schwarz. There I noted that WRAP was based on the Mills brothers' Player Win Averages (PWA) methodology created in the late 1960s, as well as Bennett and Flueck's Player Game Percentage (PGP) documented in chapter 10 of Curve Ball. However, I did not know that fellow SABR member Ed Oswalt introduced a similar measure at the 2003 SABR convention called "The Baseball Player Value Analysis System", the results of which for 1972-2002 are documented on his site. In particular the site contains a nice Win Expectancy chart that runs from +5 to -5 for 2002 data.

I also see that Cyril Morong has published an analysis of statistics such as these. His conclusion is that they are very highly correlated with normalized OPS (NOPS) and so don't really tell us anything we didn't already know. By the way, the Rhoids web site listed in the links here uses the same basic methodology but injects salary into the discussion and also includes "inning state runs" using run expectancy tables.

Cyril also informs me that Palmer and Thorn in The Hidden Game of Baseball (pages 171-174) mention a study done by Dick Cramer back in 1977 on the Mills brothers' method, noting that Cramer's study shows a high correlation between BWA (Batter Win Average, created by Cramer to measure the runs a player contributes beyond the league average) and PWA, and between BWA and OPS. Further, Cramer found that the distribution of the differences was normal, indicating that the differences were random and that one year's PWA was not a good predictor of subsequent years.

In other words, while PWA, WRAP, Rhoids, and other similar statistics do tell you exactly who was the most valuable in terms of producing or preventing runs or increasing the probability of victory for their team, these stats don't capture any difference in ability and therefore can be biased by the situations a player finds himself in. Because they are essentially retrospective, I do think they can be profitably used for postseason awards.

Organizing Domain Logic: Transaction Script

This is the first of several articles on representing domain logic in .NET applications and services, again taken from a course I developed on .NET patterns and architecture. I should mention that, because one of the benefits of using patterns is a shared nomenclature, I'm using the pattern names as defined by Martin Fowler in his excellent book Patterns of Enterprise Application Architecture.

Since an application can usually be thought of as a series of transactions, one pattern for representing the transaction is called a Transaction Script. As the name implies each transaction, for example processing an order, is encapsulated in its own script housed in a method of a Business Component (a Business Component is a class or set of classes packaged in a Class Library assembly in .NET used to encapsulate business or domain logic). The benefit of this approach is that it is conceptually straightforward to design by looking at the actions that the application needs to perform.
The way the transaction script is packaged can vary in two basic ways:

Multiple Transactions per Component

This is the most common technique and involves factoring the transactions into higher-level groupings, creating a Business Component for each grouping, and adding a public shared method for each transaction. For example, in a retail application the process of ordering a product involves several steps and can be encapsulated in a PlaceOrder method in the OrderProcessing component like so:

Public Class OrderProcessing

    Public Shared Function PlaceOrder(ByVal order As OrderInfo) As Long
        ' Start a transaction
        ' Check Inventory
        ' Retrieve customer information, check credit status
        ' Calculate price and tax
        ' Calculate shipping and total order
        ' Save Order to the database and commit transaction
        ' Send an email confirming the order
        ' Return the new order ID
    End Function

End Class

In this case the order information is passed to the PlaceOrder method in an OrderInfo structure defined as follows:
Public Structure OrderInfo
    Public ProductID As Long
    Public Quantity As Integer
    Public CustomerId As Long
    Public ShipVia As ShipType
End Structure

Public Enum ShipType
    ' members elided
End Enum

In a similar fashion the customer processing can be encapsulated in a CustomerProcessing component (class) like so:
Public Class CustomerProcessing

    Public Shared Function SaveCustomer(ByVal customer As CustomerInfo) As Long
        ' Validate Address
        ' Start a transaction
        ' Look for duplicate customers based on email address
        ' Save customer to database and commit transaction
        ' Return the new customer ID
    End Function

End Class

Public Structure CustomerInfo
    Public Name As String
    Public Address As String
    Public City As String
    Public State As String
    Public PostalCode As String
    Public Email As String
End Structure

The user process, user interface, or service interface components would be responsible for calling the SaveCustomer method, persisting the customer ID, creating the OrderInfo structure, and then passing it to the PlaceOrder method. Note that the product information would already be known here, since the UI would have called a method in the OrderProcessing class (or another Business Component) to retrieve products.
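A sketch of that interaction from the calling layer (the field values and the ShipType.Ground member are assumptions for illustration) might look like this:

```csharp
// Hypothetical calling code in a UI or user process component,
// using the structures and Business Components defined above.
CustomerInfo cust = new CustomerInfo();
cust.Name = "Jane Doe";
cust.Email = "jane@example.com";

// Save the customer first and keep the new ID
long customerId = CustomerProcessing.SaveCustomer(cust);

// Then build the order and hand it to the transaction script
OrderInfo order = new OrderInfo();
order.ProductID = 1234;           // known from an earlier product lookup
order.Quantity = 2;
order.CustomerId = customerId;
order.ShipVia = ShipType.Ground;  // assumed enum member

long orderId = OrderProcessing.PlaceOrder(order);
```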

Of course, both the OrderProcessing and CustomerProcessing classes could inherit from a base class if there were any code that both could use (for example a base class constructor). In that case the methods would likely not be shared methods. Since these components have dependencies on each other they might well be packaged in the same assembly so they can be versioned and deployed as a unit.

Each Transaction is its own Component

Using this technique, each transaction script is implemented in its own class, using implementation or interface inheritance to provide polymorphism. For example, the PlaceOrder and SaveCustomer scripts could be implemented as classes that implement the IProcessing interface.

The code to implement the PlaceOrder component would then look as follows:
Public Interface IProcessing
    Function Execute() As Long
End Interface

Public Class PlaceOrder
    Implements IProcessing

    Private _order As OrderInfo

    Public Sub New(ByVal order As OrderInfo)
        _order = order
    End Sub

    Public Function Execute() As Long Implements IProcessing.Execute
        ' Start a transaction
        ' Check Inventory
        ' Retrieve customer information, check credit status
        ' Calculate price and tax
        ' Calculate shipping and total order
        ' Save Order to the database and commit transaction
        ' Email a confirmation
        ' Return the new order ID
    End Function
End Class
This design is based on the Command pattern documented by the GoF and allows the user interface, service interface, and user process components to treat the scripts polymorphically by calling the Execute method of the IProcessing interface. The UI components could rely on one of the Factory patterns to create the appropriate object.
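As a sketch of that last point (the ProcessingFactory class and its Create methods are assumptions, not from the text), a simple parameterized factory might look like this:

```csharp
// Hypothetical factory that hides which concrete script is created;
// callers work only against the IProcessing interface defined above.
public class ProcessingFactory
{
    public static IProcessing Create(OrderInfo order)
    {
        return new PlaceOrder(order);
    }

    public static IProcessing Create(CustomerInfo customer)
    {
        return new SaveCustomer(customer);
    }
}

// Usage from the UI layer:
// long id = ProcessingFactory.Create(orderInfo).Execute();
```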

The logic to perform the various steps shown in these code snippets can be implemented either with inline managed code or through calls to stored procedures. Using stored procedures has the benefit of introducing a further layer of abstraction and taking advantage of database server performance optimizations. In either case the Transaction Script typically uses only the barest of data access layers or none at all.

Baserunning 1992

Over the weekend I ran my baserunning framework for the 1992 season. In short, there wasn't a lot that was surprising. The leaders in IBP were:

Player           Opp Bases ExpBases  Diff  IBP Out  IBR
Mark Whiten       43    76    62.68 13.32 1.21   0 4.40
Greg Gagne        40    70    58.13 11.87 1.20   0 3.92
Kenny Lofton      50    88    74.40 13.60 1.18   0 4.49
Bob Zupcic        25    45    38.27  6.73 1.18   0 2.22
Thomas Howard     25    42    35.97  6.03 1.17   0 1.99
Chad Curtis       28    50    42.96  7.04 1.16   0 2.32
Reggie Sanders    25    43    37.06  5.94 1.16   0 1.96
Rafael Palmeiro   48    85    73.39 11.61 1.16   0 3.83
Steve Sax         34    60    52.04  7.96 1.15   0 2.63
Kelly Gruber      22    36    31.25  4.75 1.15   1 1.48

OK, so probably the most surprising things are that Rafael Palmeiro is in the top 10 and that Matt Williams comes in 21st out of 219 qualifiers. However, it must be remembered that Palmeiro was still just 27 years old in 1992 and that in 1993 he stole 22 bases and was caught only 3 times. His .85 IBP in 2003 puts him 323rd.

Other players in the top 20 are no surprise as well and include Lance Johnson, Henry Cotto, Dion James, Willie McGee, and Bip Roberts, while the bottom 20 includes Kal Daniels, Fred McGriff, Kirt Manwaring, Kevin Mitchell, Tino Martinez, Kevin McReynolds, Charlie Hayes, and Mo Vaughn. Cecil Fielder comes in 203rd. Gary Gaetti takes the top spot for the most times thrown out advancing, at 4. A top IBR of 4.49 for Kenny Lofton compares favorably with the 2004 data.

I also noticed that the sheer number of opportunities per player is higher in 2003, hopefully not the result of a miscalculation on my part. For players with more than 20 opportunities, the average was 46.2 in 2003 versus 34.9 in 1992. Part of this can be attributed to the increased offensive context of 2003, when 9.46 runs were scored per game versus 8.23 in 1992. In 1992 players did not get on base as often, and the players behind them did not get hits as often to move them along. Interestingly, this means that from an absolute perspective a good baserunner in 2003 was worth more than a good baserunner in 1992. However, since each run was worth more in 1992, a good baserunner could more easily impact any particular inning or game.

Of course, it's difficult to compare the same player across these two data sets to check for consistency, both because the eleven-year gap cuts down on the overlap and because you'd expect a player's IBP to decline over that time as he ages. That said, I did a quick comparison of 13 players who appear in both studies. Of the 13, nine had a higher IBP in 1992 than in 2003, as you might expect. Overall the group's IBP was 1.04 in 1992 and 1.03 in 2003. Here are the results:

     Year Opp Bases ExpBases  Diff  IBP Out   IBR
Ken Griffey Jr.
2003       26    45    39.77  5.23 1.13   0  1.72
1992       33    54    47.87  6.13 1.13   0  2.02
Reggie Sanders
2003 35 55 50.88 4.12 1.08 0 1.36
1992 25 43 37.11 5.89 1.16 0 1.95
Kenny Lofton
2003 66 110 100.71 9.29 1.09 2 2.89
1992 50 88 74.48 13.52 1.18 0 4.46
Barry Bonds
2003 80 115 120.85 -5.85 0.95 2 -2.11
1992 55 84 86.34 -2.34 0.97 1 -0.86
Benito Santiago
2003 43 67 67.31 -0.31 1.00 2 -0.28
1992 21 28 30.54 -2.54 0.92 1 -0.93
Frank Thomas
2003 73 100 107.97 -7.97 0.93 0 -2.63
1992 70 111 106.37 4.63 1.04 0 1.53
Jeff Kent
2003 57 88 87.08 0.92 1.01 1 0.21
1992 50 76 71.42 4.58 1.06 0 1.51
Barry Larkin
2003 32 59 52.40 6.60 1.13 0 2.18
1992 39 61 58.01 2.99 1.05 0 0.99
Larry Walker
2003 54 96 86.56 9.44 1.11 0 3.11
1992 42 74 71.12 2.88 1.04 1 0.86
Gary Sheffield
2003 84 130 130.11 -0.11 1.00 2 -0.21
1992 34 56 53.45 2.55 1.05 0 0.84
Jeff Bagwell
2003 73 113 115.14 -2.14 0.98 2 -0.89
1992 44 69 69.07 -0.07 1.00 0 -0.02
Craig Biggio
2003 79 122 115.76 6.24 1.05 1 1.97
1992 67 94 100.75 -6.75 0.93 2 -2.41
Ivan Rodriguez
2003 61 92 90.80 1.20 1.01 1 0.31
1992 22 34 31.87 2.13 1.07 0 0.70

2003 763 1192 1165.36 26.64 1.02 13 7.62
1992 552 872 838.40 33.60 1.04 5 10.64

And from a team perspective they break down as follows:

Team Opp Bases ExpBases  Diff  IBP Out   IBR
CLE  409  641   603.48  37.52 1.06   9 11.57
CAL 344 545 513.88 31.12 1.06 9 9.46
CHA 412 653 628.06 24.94 1.04 9 7.42
SFN 350 538 523.21 14.79 1.03 5 4.43
MIN 476 738 718.94 19.06 1.03 10 5.39
SLN 397 604 591.75 12.25 1.02 4 3.68
TEX 397 619 607.04 11.96 1.02 7 3.32
BAL 418 644 632.06 11.94 1.02 8 3.22
NYA 414 639 628.58 10.42 1.02 7 2.81
MIL 429 658 647.67 10.33 1.02 7 2.78
TOR 375 559 550.79 8.21 1.01 6 2.17
OAK 428 641 640.54 0.46 1.00 13 -1.02
NYN 347 536 535.92 0.08 1.00 11 -0.96
PIT 402 610 618.99 -8.99 0.99 11 -3.96
CHN 374 561 569.38 -8.38 0.99 10 -3.66
KCA 369 563 573.27 -10.27 0.98 14 -4.65
MON 399 588 600.18 -12.18 0.98 13 -5.19
LAN 370 539 552.93 -13.93 0.97 8 -5.32
PHI 403 588 604.49 -16.49 0.97 8 -6.16
DET 425 620 638.60 -18.60 0.97 9 -6.95
ATL 382 560 577.17 -17.17 0.97 12 -6.75
CIN 393 581 598.88 -17.88 0.97 10 -6.80
HOU 380 555 576.86 -21.86 0.96 9 -8.02
BOS 425 616 640.75 -24.75 0.96 13 -9.34
SEA 403 577 603.99 -26.99 0.96 8 -9.63
SDN 329 440 479.59 -39.59 0.92 12 -14.14

And as I found for 2003, the difference between teams is on the order of 75 bases, or around 25 runs (about 2.5 wins), per season.

Sunday, November 14, 2004

Organizing Domain Logic

There are many different ways of organizing domain logic in Business Components that encapsulate calculations, validations, and other logic that drives the central functionality of an application or service. One of the books I like as a software architect is Patterns of Enterprise Application Architecture by Martin Fowler. In that book Fowler defines three architectural patterns (here listed in increasing order of complexity) that designers use to organize domain logic. I’ll explore each of these patterns as they apply to .NET development in a series of upcoming articles.

  • Transaction Script. This pattern involves creating methods in one or a few Business Components (classes) that map directly to the functionality that the application requires. The body of each method then executes the logic, often starting a transaction at the beginning and committing it at the end (hence the name). This technique is often the most intuitive but is not as flexible as other techniques and does not lead to code reuse. This pattern tends to view the application as a series of transactions.
  • Table Module. This pattern involves creating a Business Component for each major entity used in the system or service and creating methods to perform logic on one or more records of data. This pattern takes advantage of the ADO.NET DataSet and is a good mid-way point between the other two. This pattern tends to view the application as sets of tabular data.
  • Domain Model. Like the Table Module, this pattern involves creating objects to represent each of the entities in a system, however, here each object represents one Business Entity rather than only having one object for all entities. Here also, the logic for any particular operation is split across the objects rather than being contained in a single method. This is a more fully object-oriented approach and relies on creating custom classes. This pattern tends to view the application as a set of interrelated objects.

In addition to these patterns I'll also look at how the Service Layer pattern interacts with them. Should be fun.

Friday, November 12, 2004

Inerrancy and Ephesians 4:8 and Psalm 68

Recently I read both The Discarded Image (1964) and Reflections on the Psalms (1958) by C.S. Lewis for the first time. Both are wonderful books written in Lewis' later years (he died the same day Kennedy was assassinated) and are addressed to quite different audiences.

The Discarded Image is an introduction to medieval and renaissance literature born out of the lectures Lewis gave after becoming the first Professor of Medieval and Renaissance Literature at Cambridge in 1953. In this short book Lewis explicates for his readers what he calls "The Model". The Model is no less than the lens through which people (probably more of the elite than the common man) in the medieval world viewed life. Lewis is often praised as one of the few people who really understood the medieval mind, and this book makes you believe it.

Lewis begins with a short reconnaissance through important authors of both the classical and seminal periods (directly before the medieval period), and here I found his description of the works of Boethius (480-524) the most fascinating. If you've read much of Lewis you'll immediately see how influential Boethius was in his writings. In particular, Boethius discusses the concepts of determinism and free will, and Lewis sides with Boethius in making the distinction that God is eternal but not perpetual. God is outside of time and so never foresees; he simply sees. And for this reason he does not remember your acts of yesterday, nor does he foresee your acts of tomorrow. He simply experiences them in an Eternal Now (as the demon Screwtape says in Lewis' book The Screwtape Letters).

"I am none the less free to act as I choose in the future because God, in that future (His present) watches me acting."

Lewis then moves on to discussing The Model proper by breaking it into sections that include the heavens and the Primum Mobile (the outer sphere that God causes to rotate and that in turn causes the other fixed spheres to rotate), angels in all their hierarchical glory, the Longaevi (long-lived creatures like fairies, nymphs, etc.), the earth and animals, the soul, the body, history and the human past, and finally how these were taught using the seven liberal arts.

Although a more learned person would get more out of this book than I did, I can say it definitely opened my eyes and gives you just a glimpse of how a person in that time thought. I say a glimpse because we're so set in our modern ways of thinking that it is difficult even for a second to take in the night sky (as Lewis recommends) and imagine the vitality and purpose they saw in it.

Reflections on the Psalms is a collection of essays on the Psalms that address issues and interpretations Lewis came in contact with over the years. Unlike The Discarded Image, this book is not a scholarly work but, as Lewis says in the first chapter, "I write for the unlearned about things in which I am unlearned myself".

For me, perhaps the most interesting part of this book is his discussion of "the cursings Psalms" such as Psalm 109 where the Psalmist rails against the "wicked and deceitful man" hoping all kinds of calamity for him and Psalm 137 where the poet says:

"O daughter of Babylon, you devastated one,
How blessed will be the one who repays you
With the recompense with which you have repaid us.
How blessed will be the one who seizes and dashes your little ones
Against the rock. "

Unlike the modern critic, Lewis says that it is "monstrously simple-minded to read the cursings in the Psalms with no feeling except one of horror at the uncharity of the poets." Instead, Lewis looks past the sentiments that he says "are indeed devilish" and uses them as a spur to reflect on his own thoughts of uncharity and to look at the consequences of our own evil behavior on others. He also uses this occasion to see a general moral rule that "the higher, the more in danger". And as is typical of Lewis this comes from an unexpected direction, with the idea that the higher or more developed moral code of the Jews made it more likely they would be tempted to a self-righteousness contrary to those with lower morals. This too should be a lesson Christians can take from these Psalms.

The title of this post, however, reflects what Lewis later says regarding both the inspiration of Scripture and "second meanings" in the Psalms. In the chapter titled "Scripture" Lewis says:

"I have been suspected of being what is called a Fundamentalist. That is because I never regard any narrative as unhistorical simply on the ground that it includes the miraculous. Some people find the miraculous so hard to believe that they cannot imagine any other reason for my acceptance of it other than a prior belief that every sentence of the Old Testament has historical or scientific truth. But this I do not hold, any more than St. Jerome did when he said that Moses described Creation 'after the manner of a popular poet' (as we should say, mythically) or than Calvin did when he doubted whether the story of Job were history or fiction."

In other words, Lewis did not hold to a belief in the inspiration of Scripture inclusive of the concept of inerrancy. To modern evangelicals a belief in inerrancy is indeed fundamental, and it is usually taken to mean that the "original autographs" of the Old and New Testaments were without error. Further, the concept of error encompasses not only moral or spiritual matters but also scientific, historical, and literary ones. And of course this is why some in the evangelical community view Lewis' writings as dangerous.

While mulling over Lewis' view I then came upon his chapter on "Second Meanings in the Psalms" and Paul's quote of Psalm 68 in Ephesians 4:7-8:

"But to each one of us grace was given according to the measure of Christ's gift. Therefore it says, 'WHEN HE ASCENDED ON HIGH, HE LED CAPTIVE A HOST OF CAPTIVES, AND HE GAVE GIFTS TO MEN.' " (NASB)

Paul is here "speaking of the gifts of the Spirit (4-7) and stressing the fact that they come after the Ascension". Unfortunately, neither the Greek nor the Hebrew Old Testament supports the reading "gave gifts to men"; both instead say "received gifts from men", which in context refers to Yahweh and the armies of Israel as his agents, taking prisoners and booty (gifts) from their enemies (men), as Lewis points out. It appears that Paul is here relying on the Aramaic Targum, a Jewish commentary on the OT. Although most evangelicals assume that Paul is simply expanding the meaning of "received" to include the concept of giving, that seems a stretch to me. Instead it seems more realistic to assume that Paul used an incorrect translation.

For Lewis these are not problems because he viewed all Scripture as "profitable" when read in the right light and with proper instruction (human and through the Holy Spirit) per the teaching of 2 Timothy 3:16. Although I was once firmly in the evangelical camp on this issue, it now seems to me that this is the most reasonable way to understand the Bible and helps us to keep from getting bogged down in issues about possible or probable contradictions.

Greinke and Fergie

Player          BFP    K   BB   GB   OF   IF   LD

Average               17%  10%  32%  22%   4%  13%

Anderson B. KC  745    9%   7%  28%  30%   4%  19%
Bautista D. KC  127   14%  10%  35%  24%   2%  14%
Greinke Z. KC   599   17%   6%  26%  29%   5%  16%
Gobble J. KC    638    8%   7%  31%  32%   6%  15%
May D. KC       832   14%   7%  27%  32%   4%  14%

I thought this was an interesting set of statistics I found in The Hardball Times 2004 Baseball Annual; it breaks down plate-appearance outcomes by pitcher. Notice that all of them can be considered fly ball pitchers, Gobble and May particularly so. What doesn't bode well for Greinke is his line drive percentage sitting above average. I would think a pitcher's ability to suppress line drives is directly related to his "stuff". One of the knocks against Greinke brought up by Bill James, as I discussed in a previous post, is that he doesn't have that one pitch that can fool hitters. Even so, he maintained an average strikeout rate because he locates so well.

In order to be successful a pitcher needs to adopt one of a few winning strategies. In the past I've speculated that these are:

1. Strike out a lot of batters in order to minimize the number of balls put into play and therefore the number that will become hits (Nolan Ryan)
2. Walk very few batters and give up very few homeruns to minimize the effect of the hits you do give up (Greg Maddux)
3. Walk fewer batters than average but strike out more than average to minimize base runners and balls hit into play (Fergie Jenkins)
4. Rely on deception to decrease the number of hard hit balls, thereby decreasing the percentage of balls put into play that turn into hits (Charlie Hough)
5. Walk very few batters but rely on keeping hitters off balance to minimize base runners and minimize the number of line drives (Jamie Moyer)

To me Greinke fits into the mold of number 3 while Anderson, May and Gobble don't fit into any of the above. It's too early to tell for Bautista obviously.

Thursday, November 11, 2004

Royals Sign Truby

True to his word, Allard Baird signed a "stop gap third baseman" today in soon-to-be 31-year-old Chris Truby. Truby has bounced around a bit, playing with Houston, Montreal, Detroit, and Tampa Bay since 2000. The discouraging thing is that in 884 career plate appearances he has walked a grand total of 38 times. His lack of patience makes Terrence Long look like a "plate discipline guy".

To his credit he did have a good season in AAA hitting .300 with 25 homeruns and 41 doubles in 130 games in Nashville and was the Sounds MVP. He walked 47 times and struck out 96 times in his 466 at bats. By all accounts he is a good defensive third baseman and even played some second and short for the Sounds in 2004.

To me this move is basically a lateral one; Jed Hansen looks to be the same player with a little more plate discipline. I'd still rather see some kind of radical experiment with Ken Harvey until Mark Teahen is ready. Oh well.

Tippett on Speed

Although I knew Tom Tippett had some good research on his site I didn't know until today that he did a bit of analysis on the baserunning question I wrote about the other day. In his article Measuring the Impact of Speed he says the following regarding Ichiro Suzuki's 2001 season:

"The most surprising thing about this type of analysis is the relatively small number of baserunning opportunities we end up with. Players only reach base so many times in a season, and after you subtract the times when (a) the inning ends without any more hits, (b) they can jog home on a double, triple or homerun, and (c) they are blocked by another runner, players rarely get more than 50 opportunities per season to take an extra base on a hit.

Last year, Ichiro had 45 such opportunities, and he took 6 more bases than the average runner. He wasn't once thrown out trying in those situations. Six extra bases may not seem like a lot, but it was enough to qualify for our top baserunning rating."

That matches up very well with what I found. Of course, in my analysis runners get more opportunities because I don't take into account the runners in front of them and do give them credit for bases that they "should" get. So in 2003 Ichiro had 84 opportunities and was expected to garner 121.74 bases. He actually advanced 125 bases for an IBP of 1.03 and an IR of 1.08.

Tom also goes on to talk about other kinds of advancements and all the ways in which runners might be credited with extra bases. In Ichiro's case he finds that he took 30 extra bases which might equate to 6-12 extra runs using measures that are not normally accounted for.

Royals and K's

Ron Hostetter alerted me to Joe Posnanski's article on the Royals' lack of strikeout pitchers. He also mentions that this should affect the long-term evaluation of Jimmy Gobble, as I articulated in a post last August. There I found that among young pitchers of comparable ability, the higher-strikeout pitchers had careers almost twice as long and won more than twice as many games, and that pitchers with long careers generally have above average strikeout rates when they're young.

In addition to Gobble, Miguel Asencio (4.45/9) and Kyle Snyder (4.11/9) are other suspects the Royals should look at when it comes to low strikeout rates. Joe didn't mention, however, that Tankersley's strikeout rate in AAA actually dropped by more than two strikeouts per game last season, while he still managed to strike out 29 in 35 innings for the Padres.

From the article....Royals Strikeout Leaders since 2000

2000: Mac Suzuki, 135 (ranked 50th in majors)
2001: Jeff Suppan, 120 (ranked 69th)
2002: Paul Byrd, 129 (ranked 53rd)
2003: Darrell May, 115 (ranked 72nd)
2004: Darrell May, 120 (ranked 67th)

Guidelines for Software Architecture

Recently, I've been thinking more about what a software architect does and trying to formulate some basic axioms or guidelines to keep in the front of my mind. To that end, I thought Steve Cohen's thoughts on what "Enterprise Ready" means were useful.

Anyway, here are a few guidelines (not original of course) that I came up with for a course I teach on patterns and architecture in .NET that apply to a typical layered architecture (presentation, business, data).

  • Design components of a particular type to be consistent. By this we mean that components within a particular layer should use common semantics and communicate in a consistent fashion. One example is that in the Data Services layer all the Data Access Logic components should use the same mechanism (DataSets, data readers, custom objects) to marshal data.
  • Always run in process if possible. The basic facts of physics dictate that code running in-process is orders of magnitude faster than code running out of process. The reasons to run out of process include a need for process isolation, platform interoperability, and occasionally CPU overload (although “scaling out” often addresses this issue).
  • Only have dependencies down the call stack. In a layered architecture it is important not to tightly couple the components lower in the stack to those higher up. This allows the application to be modified more easily. Note that this guideline does not imply a strict rule such as never allowing the presentation layer to call the data services layer directly.
  • Separate model from view. Where possible, try to abstract the data (the model) from how it is presented (the view). This can be done through the use of web user controls and the Bridge Pattern.
  • Use service accounts when possible. The use of service accounts to run server processes such as ASP.NET and Component Services allows for simplified security by not requiring delegation. It also allows connection pooling to occur.
  • Force common functionality up the inheritance hierarchy. One of the common patterns used is the Layer Supertype. This pattern promotes code reuse by implementing common code for an entire layer in an abstract base class whenever possible.
  • Keep policy code abstracted from application code. When possible, try to keep security, caching, logging and other “policy” code independent of the application or domain logic. This allows it to more easily be changed without affecting the functionality of the application.
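As a rough illustration of the Layer Supertype pattern mentioned above, here is a minimal sketch in Python rather than .NET; the class names and the logging "policy" are invented for this example, not taken from any real library.

```python
# Layer Supertype sketch: common plumbing for an entire layer lives in one
# abstract base class, so each concrete component supplies only its own logic.
from abc import ABC, abstractmethod

class DataAccessComponent(ABC):
    """Hypothetical supertype for every component in the data services layer."""

    def execute(self, query):
        self._log(f"executing: {query}")  # shared, layer-wide behavior
        return self._run(query)           # component-specific behavior

    def _log(self, message):
        print(f"[data layer] {message}")

    @abstractmethod
    def _run(self, query):
        ...

class OrderGateway(DataAccessComponent):
    """One concrete component; it inherits the common execute/log plumbing."""
    def _run(self, query):
        return []  # stand-in for real data access
```

The point is simply that `execute` and `_log` are written once for the whole layer, which is what the guideline means by forcing common functionality up the inheritance hierarchy.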

Bingle Redux

Recently, I mentioned that in reading F.C. Lane's Batting I ran into the term "bingle". After asking the eminent members of SABR about the origin of the term, I posted the opinion that it was a contraction of "bunt single" and meant a slap hit, that being a style much in vogue in the deadball era. The term then became synonymous with "single" before dying out sometime in the 1950s.

Since then several more SABR members have chimed in. One thought that it was perhaps a blend of "bang" or "bing" and "single". However, now from Skip McAfee we have the following citations that seem to indicate that bingle was originally synonymous with single and perhaps later was used to refer to slap hits.

"Bingle is synonymous with 'base hit'. A player who bingles [note use as a verb] swats the ball safely to some part of the field where a biped in white flannel knickerbockers is not roaming at the immediate time" - Sporting Life, Dec. 15, 1900

"The big fellow grabbed three bingles in the afternoon contest, one of which was a smash over the fence that netted him a home run" - San Francisco Bulletin, May 26, 1913

"Jack [Killilay] has dispensed eight bases on balls, hit one batter and permitted eight bingles, three of which were triples, in his stay on the mound" - San Francisco Bulletin, May 31, 1913

"In the third inning yesterday the first local batter up drew a bingle." - Youngstown Vindicator, July 29, 1898

And finally, to support the idea that bingle was only later used to refer to slap hits, Walter K. Putney, in the Spring 1952 issue of Baseball Stories, noted that the term was "formerly the name for any kind of a hit" but restricted the usage to a single of the "dinky" kind.

Also from 1913 Gerald Cohen's study of the San Francisco Bulletin noted the use of a) "bingle" as a verb (to hit or get a hit), b) "bingler" as a batter who gets a hit, and c) "bingling" as synonymous with "hitting".

Wednesday, November 10, 2004

Measuring Baserunning: A Framework

Who says there's an unemployment problem in this country? Just take the five percent unemployed and give them a baseball stat to follow.
--Outfielder Andy Van Slyke

In my previous two posts (here and here) I laid the groundwork for evaluating the baserunning of teams and players using play-by-play data from 2003. In the second post particularly, I showed the percentage of times players take the expected, +1, +2 number of bases in various situations and how often they get thrown out.

The Questions
Now I'm ready to make a first attempt at developing a baserunning framework in order to answer three related questions:

a) What player helped (or hurt) his team the most with his baserunning?
b) What team gained the most from smart or good baserunning in 2003?
c) What is the quantitative difference between good and bad baserunners?

Note that although this is my first attempt I'm putting this in public in order to get some feedback and certainly don't claim that this is the best method to use. I'm sure there are plenty of holes and problems, the two most pressing of which are that the sample sizes for a single season may not be large enough to differentiate ability from luck, and who hits after you has a large say in how many bases you advance. The former may be insurmountable with the limited data set I have although I'll try and correct for the latter as you'll see.

The Framework
The foundation for my baserunning framework is the table discussed in my previous post. You'll remember that it showed how often runners advance in various situations. For example, with a runner on first and nobody out when the batter singles to left field, the odds are:

Typ     +1    +2  OA

84.5% 14.1% 0.6% 0.7%

In other words, 84.5% of the time the runner stops at second, 14.1% of the time he advances to third, and .6% of the time he scores, while .7% of the time he is thrown out on the bases. Using this set of percentages one can calculate the average number of bases advanced in this situation by multiplying each percentage by the bases gained (with an out on the bases costing one base). In this case (.845*1)+(.141*2)+(.006*3)-(.007*1) = 1.14. So when this event occurs a typical runner will advance 1.14 bases. Since this is the average across both leagues (I assumed it wouldn't be necessary to separate the leagues since there is significant overlap with interleague play, but more on that later) I call this Expected Bases (EB). The same calculation can then be done for the other 26 scenarios in the table (I did not use the Runner on 2nd, Batter Doubles scenario in the calculations that follow since only one runner in all of 2003 was thrown out in that situation - the A's Mark Ellis). When this is done it turns out that the highest number of Expected Bases for any scenario is 1.86, which occurs with a runner on 2nd and 2 outs when the batter singles. The lowest is 1.14, for both the scenario given above and the same one with 1 out.

It should be noted that for the total calculations below I also included singles fielded by other positions, so the actual number of scenarios is greater than 27. I found that shortstops and second basemen, for example, field a significant number of singles, and to a lesser extent doubles, and that the typical number of bases advanced is similar to those fielded by outfielders. There were some plays where 0 was recorded as the fielder, and those were not considered.

As you probably anticipated, one can then match up the baserunning situations for individual teams and players in order to compare the actual bases gained with the Expected Bases in each scenario. For example, Carlos Beltran of the Royals was at first base 9 times in 2003 when a batter singled to left field with 2 outs. In those situations he advanced to third twice and to second the other seven times, gaining 11 bases. With those 9 opportunities he could have been expected to gain 10.39 (9 * 1.154) bases given the league average, so he's credited with a positive .61 bases for this scenario, which I'm calling Incremental Bases (IB). When this is calculated for all of Beltran's opportunities we get a matrix like the following, where R1BD = Runner on 1st, Batter Doubles; Opp is the number of opportunities in each scenario; EB is Expected Bases; and IB is the Incremental Bases gained.

Sit Outs Fielded Opp Bases EB IB
R1BD 0 9 2 4 4.51 -.51
R1BD 1 7 4 11 8.95 2.04
R1BD 1 8 1 2 2.50 -.50
R1BD 2 8 2 6 5.45 .54
R1BS 0 3 1 1 1.02 -.02
R1BS 0 6 1 1 1.07 -.07
R1BS 0 7 1 2 1.13 .86
R1BS 0 8 1 1 1.28 -.28
R1BS 0 9 1 2 1.36 .63
R1BS 1 7 7 9 7.96 1.03
R1BS 1 8 4 5 5.11 -.11
R1BS 1 9 5 5 6.84 -1.8
R1BS 2 7 9 11 10.3 .60
R1BS 2 9 5 9 7.47 1.52
R2BS 0 7 2 3 2.72 .27
R2BS 0 8 1 2 1.63 .36
R2BS 0 9 4 6 5.62 .37
R2BS 1 3 3 3 3.58 -.58
R2BS 1 7 4 8 5.61 2.38
R2BS 1 8 3 6 4.90 1.09
R2BS 1 9 3 4 4.47 -.47
R2BS 2 4 2 2 2.05 -.05
R2BS 2 7 1 2 1.69 .30
R2BS 2 8 4 8 7.47 .52
R2BS 2 9 1 2 1.74 .25

So when all of these are summed we find that Beltran, in 72 opportunities, gained 115 bases. He was expected to gain 106.7, which puts him 8.36 bases above expected. What I like about this method is that it takes into consideration three context dependencies for the runner.
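The bookkeeping behind the matrix can be sketched as follows; the three rows here are taken from Beltran's table above as a stand-in for his full 25-row matrix, so the totals are illustrative rather than his season figures.

```python
# Incremental Bases (IB): sum actual bases and Expected Bases across a
# player's scenarios; IB is the difference. Rows are
# (opportunities, actual bases gained, expected bases) triples.

def incremental_bases(rows):
    bases = sum(actual for _, actual, _ in rows)
    expected = sum(exp for _, _, exp in rows)
    return bases, expected, bases - expected

rows = [
    (9, 11, 10.30),  # R1BS, 2 outs, single fielded by LF
    (7, 9, 7.96),    # R1BS, 1 out, single fielded by LF
    (5, 9, 7.47),    # R1BS, 2 outs, single fielded by RF
]
bases, expected, ib = incremental_bases(rows)
print(bases, expected, round(ib, 2))
```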

First, the handedness of the batters behind the baserunner is accounted for by looking at the fielder who fielded the hit. So if Mike Sweeney, a right-handed hitter, bats behind Carlos Beltran, one would naturally expect Beltran to have fewer opportunities to go from first to third because Sweeney is right-handed. Beltran will not be punished in this system since we're comparing the number of bases he gained against the expected bases for the scenarios he was actually involved in. This system does not, however, control for how hard the batter hit the ball (which is possible given that there are codes in the data indicating line drive, fly ball, or grounder) or for park effects (Fenway Park might tend to decrease advancement to third on singles to left).

Second, this system takes into consideration the number of outs. This is important since we know from the table shown in the previous post that with two outs the probability of being able to advance extra bases often doubles. With this system Beltran does not get additional credit if he happens to be on base a lot with 2 outs.

And most importantly, because each player will get a different number of opportunities, both because of his own ability to get on base and because of the abilities of the batters following him, the sum of the bases gained can be divided by the Expected Bases to yield an Incremental Base Percentage, or IBP. For Beltran that number is 1.08, which ranks him 56th among the 331 players with more than 20 opportunities in 2003, largely vindicating his reputation as an above-average baserunner. In other words, Beltran gained 8% more bases than would have been expected given his opportunities.
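As a quick illustration of the IBP arithmetic using Beltran's season totals from above:

```python
# Incremental Base Percentage (IBP): actual bases gained divided by
# Expected Bases, so 1.00 is a league-average baserunner.

def ibp(bases, expected_bases):
    return bases / expected_bases

print(round(ibp(115, 106.7), 2))  # Beltran's 2003 totals -> 1.08
```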

The Results
This calculation can then be run for all players and teams. The leaders in IBP for 2003 (more than 20 opportunities) are (you can find the complete Excel spreadsheet here):

Player Opp Bases EB IB IBP OA IBR
Miguel Olivo 26 43 35.45 7.55 1.21 0 2.49
Shane Halter 20 34 28.41 5.59 1.20 0 1.85
Chone Figgins 33 54 45.37 8.63 1.19 0 2.85
G. Matthews Jr. 50 89 75.75 13.25 1.17 0 4.37
Brian Roberts 63 104 89.37 14.63 1.16 0 4.83
Randy Winn 63 109 93.76 15.24 1.16 0 5.03
Denny Hocking 24 40 34.57 5.43 1.16 0 1.79
B. Phillips 31 53 45.85 7.15 1.16 1 2.27
Omar Vizquel 31 54 46.74 7.26 1.16 0 2.39
Rey Sanchez 35 62 53.93 8.07 1.15 0 2.66

While the leaders in total Incremental Bases are:

Player Opp Bases EB IB IBP OA IBR
Raul Ibanez 76 129 113.24 15.76 1.14 0 5.20
Randy Winn 63 109 93.76 15.24 1.16 0 5.03
Brian Roberts 63 104 89.37 14.63 1.16 0 4.83
Marcus Giles 74 122 108.29 13.71 1.13 0 4.53
Orlando Cabrera 67 116 102.61 13.39 1.13 0 4.42
G. Matthews Jr. 50 89 75.75 13.25 1.17 0 4.37
Luis Castillo 92 148 135.55 12.45 1.09 0 4.11
Albert Pujols 68 117 105.54 11.46 1.11 1 3.69
Derek Jeter 65 112 100.61 11.39 1.11 1 3.67
Todd Helton 84 142 131.16 10.84 1.08 0 3.58
Melvin Mora 59 99 88.27 10.73 1.12 0 3.54

In perusing the leaders in IBP and IB (we'll get to IBR in a moment) you do get the impression that these measures make sense. The leaders in both lists tend to be those players we think of as fast and/or good baserunners. Even Larry Walker, not a particularly fast man but often mentioned as a good baserunner, comes in 33rd out of 331, while players perceived as bad baserunners, such as Moises Alou at 277th and Ken Harvey at 289th, or simply slow (John Olerud at 321st and Edgar Martinez at 312th) are near the bottom. Although the leaders in IB also reflect more opportunities, they seem to be pretty indicative of good baserunners, with Raul Ibanez and Randy Winn leading the list.

Of course, I say they "tend to be" because a pair of catchers, Miguel Olivo and Ben Petrick, are the IBP leaders. This can be explained, however, by the fact that they had 26 and 22 opportunities respectively - very near the cutoff - and in the case of Olivo, he scored twice from first base on singles with two outs. Petrick was simply consistent overall and scored from second all eight times he was there when a batter singled. Neither one was thrown out. This also suggests that perhaps 20 opportunities is too low a threshold.

Another interesting case is the Tigers Alex Sanchez, a speedy man who often bunts for hits and who stole 52 bases in 2003. His IBP is only .90 ranking him 286th. A quick look reveals that while he's fast, he also takes lots of chances and was thrown out eight times, the most in the league, in 84 opportunities.

So in answer to question (a) above we can say that Raul Ibanez helped his team the most from his baserunning although Chone Figgins, Randy Winn, Brian Roberts, and Marlon Anderson are all right up there.

On the other side of the coin, Mark Bellhorn is at .66 IBP, good for 331st place. Bellhorn's poor performance was highlighted during his time with the Cubs in 2003, when he was thrown out three times in twelve opportunities and garnered only 6 bases out of an expected 17. Some of this may be attributed to "Waving" Wendell Kim as I'll discuss below. For Chicago his IB was -10.92 and his IBP .35. His bad baserunning continued to some degree with the Rockies, where his IB was -1.11 and his IBP .94.

From a team perspective the leaders in IBP and IB were:

Team Opp Bases EB IB IBP OA IBR
COL 572 912 877.75 34.25 1.04 11 10.31
BAL 635 972 944.07 27.93 1.03 9 8.41
OAK 572 902 880.76 21.24 1.02 13 5.84
ANA 597 909 888.03 20.97 1.02 9 6.11
ATL 659 1005 982.46 22.54 1.02 14 6.18
CLE 551 850 831.83 18.17 1.02 15 4.65
MIN 626 959 944.88 14.12 1.01 15 3.31
SDN 594 900 887.49 12.51 1.01 6 3.59
NYN 521 778 769.58 8.42 1.01 13 1.61
KCA 681 1015 1004.02 10.98 1.01 12 2.54

As you can see, Colorado had the highest IB, followed by Baltimore. On the other end, the Cubs had an IBP of .95 and an IB of -42.56. Using this we can tentatively answer question (b): Colorado, Baltimore, Oakland, and Anaheim were the best baserunning teams. For question (c), the difference appears to be on the order of 75 or so bases per season between a great baserunning team and a bad one. It would be interesting to compare the 2004 numbers to see if there is any trend here and whether the Cubs were justified in firing Kim.

The issue that this immediately raises is how much of Mark Bellhorn's poor performance can be attributed to his third base coach and how much to himself? As I showed earlier, his performance definitely improved with the Rockies, as did that of Jose Hernandez, whose IBP was .96 with the Cubs and 1.05 and 1.08 with the Rockies and Pirates respectively, although he had only four opportunities with the Cubs, far too few to say anything. So whether there is a team bias at work, and how large it may be, is unknown.

From the team numbers it also appears there may be a league bias. Nine of the bottom ten teams are from the NL while seven of the top ten are from the AL. I'll have to rerun the numbers to see if the probabilities are significantly different between the AL and NL, but my assumption was that pitchers, while poor hitters, would not be significantly poorer in their baserunning ability. This may be incorrect, or it could be that NL third base coaches are much more cautious with pitchers on the bases, or that they take more chances when pitchers are coming up. Or a combination of all three.

Next Steps
So where to go next? It seems to me that the next logical step is to translate IB into a number of runs gained or lost by individuals and teams. Two possible ways to do this occur to me.

One way would be to assign weights to the outs and advancements and simply sum them. For example, in the linear weights formula an out is valued at approximately -.09 runs (see my post on Batting Runs for a discussion of why -.09 instead of -.25) while a base gained from an intentional walk is weighted at .33. Using these values one can calculate an IBR (Incremental Base Runs) and see that the Rockies gained 10.31 runs while the Cubs lost 15.84 runs.

Team Opp Bases EB IB IBP OA IBR
COL 572 912 877.75 34.25 1.04 11 10.31
BAL 635 972 944.07 27.93 1.03 9 8.41
ATL 659 1005 982.46 22.54 1.02 14 6.18
ANA 597 909 888.03 20.97 1.02 9 6.11
OAK 572 902 880.76 21.24 1.02 13 5.84
CLE 551 850 831.83 18.17 1.02 15 4.65
SDN 594 900 887.49 12.51 1.01 6 3.59
MIN 626 959 944.88 14.12 1.01 15 3.31
KCA 681 1015 1004.02 10.98 1.01 12 2.54
NYN 521 778 769.58 8.42 1.01 13 1.61
SLN 611 940 931.92 8.08 1.01 14 1.41
TEX 550 825 822.77 2.23 1.00 14 -0.52
CHA 562 865 868.97 -3.97 1.00 8 -2.03
NYA 638 960 960.81 -0.81 1.00 20 -2.07
DET 446 639 642.49 -3.49 0.99 11 -2.14
TOR 656 1004 1007.05 -3.05 1.00 13 -2.18
FLO 560 823 829.39 -6.39 0.99 15 -3.46
SEA 664 976 983.03 -7.03 0.99 13 -3.49
PIT 585 867 877.20 -10.20 0.99 13 -4.54
TBA 603 894 906.19 -12.19 0.99 18 -5.64
CIN 508 754 767.83 -13.83 0.98 14 -5.82
MON 570 853 868.26 -15.26 0.98 14 -6.29
BOS 667 1011 1027.14 -16.14 0.98 12 -6.41
HOU 610 922 937.93 -15.93 0.98 18 -6.88
SFN 579 846 864.84 -18.84 0.98 12 -7.30
LAN 514 757 778.87 -21.87 0.97 10 -8.12
ARI 580 856 880.05 -24.05 0.97 10 -8.84
PHI 626 912 947.43 -35.43 0.96 15 -13.04
MIL 524 750 786.43 -36.43 0.95 17 -13.55
CHN 537 766 808.56 -42.56 0.95 20 -15.84
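The IBR figures above can be reproduced from each team's Incremental Bases and outs on the bases using the linear weights values just quoted; this sketch simply applies those two weights.

```python
# Incremental Base Runs (IBR): +.33 runs per incremental base gained,
# -.09 runs per out on the bases.

def ibr(incremental_bases, outs_on_base):
    return 0.33 * incremental_bases - 0.09 * outs_on_base

print(round(ibr(34.25, 11), 2))   # Colorado: 10.31
print(round(ibr(-42.56, 20), 2))  # Cubs: -15.84
```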

From an individual perspective Raul Ibanez leads with 5.20 IBR while Geoff Jenkins is last with -6.48 (he had an IBP of .80 in 53 opportunities). Looking at the spread this analysis indicates that good baserunning teams pick up about a win per year (assuming a win is purchased at the cost of 10 or so runs) over average teams and somewhat less than three wins over poor baserunning teams while an individual may be responsible for somewhat less than an extra win with his baserunning.

A second technique would be to look at the run expectancy value for each situation before and after the play and calculate the difference. To me this makes a good deal of sense since it will tend to weight the outs more properly and give more credit for actually scoring a run than for simply advancing. A weakness of IBR is that an out at second base is treated the same as an out at the plate. I haven't yet run those numbers but may do so in the future. There is an additional problem with this approach, however: the presence of runners on the bases ahead of the runner we're analyzing will change the run expectancy even though the baserunner in no way controls what happens to those runners.
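The run expectancy approach could be sketched like this; the RE numbers passed in below are illustrative placeholders, not values computed from the 2003 data.

```python
# Value a baserunning play as the change in run expectancy across the play
# plus any runs that actually scored. A run-scoring advance is thus worth
# more than a mere advance, and an out is charged its full RE cost.

def baserunning_run_value(re_before, re_after, runs_scored):
    return (re_after - re_before) + runs_scored

# e.g. a runner going first-to-third instead of stopping at second
# (placeholder RE values for the two resulting base-out states):
value = baserunning_run_value(re_before=0.90, re_after=1.15, runs_scored=0)
print(round(value, 2))
```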

What's Missing?
I'm sure as you've read this you've thought of several things that might be included. Here is what I've identified.

1) This framework only includes three basic situations (runner on first, batter singles; runner on second, batter singles; and runner on first, batter doubles). The situations could be expanded by looking at advancement on groundballs (so-called "productive outs").

2) To get a complete view of the baserunning of an individual scoring on sacrifice flies, pickoffs, advancing on sacrifice hits, stolen bases, and even defensive indifference should be taken into account.

3) While some of the context is here accounted for, much else is not. For example, what if four of the eight times Alex Sanchez was thrown out on the base paths he was the tying run with two outs in the bottom of the ninth? Is it reasonable to punish him as severely as a guy who gets thrown out at third base with his team down 3-0 in the third? Obviously not.

4) The framework makes no allowance for the base ahead of the runner being occupied. This particularly affects hitters who are intentionally walked a lot, like Barry Bonds. For Bonds, second was occupied 30 of the 47 times (64%) a batter singled with him there, against the league average of 29.2%. In these circumstances the runner will find it more difficult to take an extra base, which will artificially hold down his IB and IBP values. The reason I didn't exclude these situations was that it would have further reduced the number of opportunities, but a good case can be made for doing so.

5) It's not clear to me how a team would use this information to make better decisions except at the extremes: telling Alex Sanchez to stop trying to take an extra base every time he's on, firing your third base coach if you're the Phillies or Cubs, and using Gary Matthews Jr. and Shane Halter as your first pinch runners. In other words, while all of this is interesting and provides some quantification of baserunning, it's not very actionable for most teams or players. I realize that wasn't one of my questions when I started, but really worthwhile research should lead to something actionable.

In summary, I want to reiterate that this is a first pass at analyzing and quantifying baserunning, one that I'm sure has raised more questions than answers for many of you (as it has for me). I'd appreciate your thoughts in any case.