
Thursday, December 30, 2004

Microsoft CRM and Factories

In their seminal book Design Patterns the authors, collectively known as the Gang of Four, write about one of the most often used patterns in object-oriented design known as the Factory Pattern. It turns out that when working with the Microsoft CRM API from Microsoft .NET the Factory pattern can come in handy.

In order to manipulate data in various CRM entities a developer must instantiate and configure individual objects such as CRMAccount, CRMContact, and CRMLead. It also is often the case that a particular application need only work with one or two of these objects. Further, each object requires that its Url and Credentials properties be set before any of its methods can be invoked.

To handle all of this I created a CRMConfigFactory that combines the Factory Pattern with a configuration section handler. The class reads a custom configuration section such as the one shown here.



<mscrmconfig>
<server>http://crmapp.crm.sbl.com/mscrmservices/</server>
<proxyassembly>Microsoft.Crm.Platform.Proxy,
Version=1.2.3297.0, Culture = neutral,
PublicKeyToken=31bf3856ad364e35</proxyassembly>
<entities>
<entity type="Microsoft.Crm.Platform.Proxy.BizUser"
name="BizUser" srf="BizUser.srf" />
<entity type="Microsoft.Crm.Platform.Proxy.BizPrivilege"
name="BizPrivilege" srf="BizPrivilege.srf" />
<entity type="Microsoft.Crm.Platform.Proxy.CRMActivity"
name="Activity" srf="CRMActivity.srf" />
<entity type="Microsoft.Crm.Platform.Proxy.CRMQuery"
name="Query" srf="CRMQuery.srf" />
<entity type="Microsoft.Crm.Platform.Proxy.CRMQueue"
name="Queue" srf="CRMQueue.srf" />
</entities>
</mscrmconfig>

As you can see, in this project only five of the CRM entities will be utilized as identified in the entities element, each of which contains an attribute that points to the .srf file that represents the endpoint on the CRM server used to process requests. The server element points to the Microsoft CRM server while the proxy assembly element points to the assembly from which the CRM entities will be loaded.

The CRMConfigFactory class then implements the IConfigurationSectionHandler interface and its Create method. The .NET configuration system calls Create and passes in the configuration section. As shown here, the method reads the entries from the section and populates a Hashtable of custom CRMEntity structures, keyed by entity name, with the information about the entities.


Public Function Create(ByVal parent As Object, _
    ByVal context As Object, ByVal section As XmlNode) _
    As Object Implements IConfigurationSectionHandler.Create

    ' Get the server URL (path is relative to the section node)
    Dim n As XmlNode
    n = section.SelectSingleNode("server")
    _server = n.FirstChild.Value

    ' Get the proxy assembly
    _proxyAssembly = section.SelectSingleNode("proxyassembly").FirstChild.Value

    ' Get the list of supported entities
    Dim entities As XmlNodeList = section.SelectNodes("entities/entity")
    Dim e As XmlNode
    For Each e In entities
        ' Add each entity to the hashtable, keyed by its name
        Dim ce As New CRMEntity
        ce.TypeName = e.Attributes("type").Value
        ce.Name = e.Attributes("name").Value
        ce.Srf = e.Attributes("srf").Value
        _entities.Add(ce.Name, ce)
    Next

    ' Return the handler so callers of GetConfig receive a non-null object
    Return Me
End Function

The CRMEntity structure simply contains Name, TypeName, and Srf fields to hold the values read from the configuration file.

To implement the Factory Pattern the class then exposes a set of overloaded shared methods called CreateEntity whose signatures are shown below:


Public Overloads Shared Function CreateEntity( _
ByVal entityName As String) As SoapHttpClientProtocol

Public Overloads Shared Function CreateEntity( _
ByVal entityType As CRMEntityTypes) As SoapHttpClientProtocol

Public Overloads Shared Function CreateEntity( _
ByVal entityName As String, _
ByVal cred As NetworkCredential) As SoapHttpClientProtocol

Public Overloads Shared Function CreateEntity( _
ByVal entityType As CRMEntityTypes, _
ByVal cred As NetworkCredential) As SoapHttpClientProtocol

This style of Factory method is known as a parameterized factory because the first argument specifies the type of object to create. Here the object is specified using either a string or a value from the custom enumerated type CRMEntityTypes.


Public Enum CRMEntityTypes
BizUser
BizPrivilege
Activity
Query
Queue
End Enum

Providing both a string and an enumerated type overload is a good practice so that a user of the class can simply add a new entry to the configuration file without modifying the CRMEntityTypes enumeration.
The second argument of the third and fourth signatures represents the set of credentials used when accessing the CRM server while in the first two signatures the System.Net.CredentialCache.DefaultCredentials object is passed representing the currently logged in user.

Finally, the actual work of creating the object is left to a private method called by each of the CreateEntity methods. Here the CRMEntity object is found by accessing the Hashtable (_entities) using the string value passed in to the method. This works because when a CRMEntityTypes value is passed into the public method, it is converted with ToString before being passed to _createEntity. Once the type to create has been found, the instantiation occurs using the CreateInstanceAndUnwrap method of the AppDomain object from within the current application domain (AppDomain.CurrentDomain).

Notice that the CRM assembly and type name are passed in. Because all the CRM classes are web service proxies derived from SoapHttpClientProtocol, the resulting object is cast to that type and its Url and Credentials properties are set before the created object is returned to the caller.


Private Shared Function _createEntity(ByVal typeKey As String, _
    ByVal cred As ICredentials) As SoapHttpClientProtocol

    ' Look up the configured entity by name
    Dim c As CRMEntity
    c = CType(_entities(typeKey), CRMEntity)

    If c.TypeName Is Nothing Then
        Throw New CRMException("CRMEntity '" & typeKey & _
            "' does not exist in the configuration file")
    End If

    ' Create the proxy from the configured assembly and type name
    Dim o As SoapHttpClientProtocol = CType( _
        AppDomain.CurrentDomain.CreateInstanceAndUnwrap( _
        _proxyAssembly, c.TypeName), SoapHttpClientProtocol)

    ' Build the endpoint URL with a single forward slash
    o.Url = _server.TrimEnd("/"c) & "/" & c.Srf
    o.Credentials = cred
    Return o
End Function

In this way a client can simply do any of the following:


Dim o As Object = CRMConfigFactory.CreateEntity(CRMEntityTypes.BizUser)

Dim a As CRMActivity = CType(CRMConfigFactory.CreateEntity("Activity"), _
    CRMActivity)

Dim q As Object = CRMConfigFactory.CreateEntity(CRMEntityTypes.Query, cred)


Wednesday, December 29, 2004

Forms of Tolerance

In the previous post the term "tolerance" was used in a positive way. In other words, to be tolerant of the religious expression of others is a positive thing. Indeed Krauthammer was referring to this form of tolerance when he quoted a 1790 letter George Washington wrote to the Newport synagogue: "It is now no more that toleration is spoken of, as if it was by the indulgence of one class of people, that another enjoyed the exercise of their inherent natural rights." In this sense toleration means an acceptance of another's point of view and a recognition of their right to hold to that position even if you think it mistaken. That is the view our pastor called social tolerance last weekend.

However, in our present culture there is another and more malevolent form of tolerance, which he referred to as epistemological tolerance. In practice this is actually the inverse of social tolerance and is construed to mean that unless you praise and actively support a view contrary to your own you are not being truly tolerant. In this form it is not enough to respect the rights of others to hold a view that is in logical contradiction to your own (e.g. as a belief in reincarnation is contradicted by a belief in bodily resurrection); rather, you must recognize that their belief is in some sense true for them. Hence the term epistemological - epistemology being the study of knowledge or truth. In the end epistemological tolerance is a denial that there is any absolute truth, which is why it is so dangerous.

Monday, December 27, 2004

Tolerance

I was alerted to this article by Charles Krauthammer over the weekend regarding the annual bout of Christmas intolerance. I especially liked this:

"Some Americans get angry at parents who want to ban carols because they tremble that their kids might feel 'different' and 'uncomfortable' should they, God forbid, hear Christian music sung at their school. I feel pity. What kind of fragile religious identity have they bequeathed their children that it should be threatened by exposure to carols?...It is the more deracinated members of religious minorities, brought up largely ignorant of their own traditions, whose religious identity is so tenuous that they feel the need to be constantly on guard against displays of other religions..."

Exactly. And typically those who are the loudest are those without any religious tradition at all and whose rather nominal atheism is threatened by the appearance of religious activity in the public square. As C.S. Lewis said in his autobiography Surprised By Joy:

"A young man who wishes to remain a sound Atheist cannot be too careful of his reading. There are traps everywhere--'Bibles laid open, millions of surprises,' as Herbert says, 'fine nets and stratagems.' God is, if I may say it, very unscrupulous."

Lima Time Again

In an interesting move the Royals once again acquired pitcher Jose Lima. Lima had pitched for the Royals in 2003 when Allard Baird signed him out of the independent leagues. He went 8-3 but pitched poorly down the stretch and was injured. The Royals offered him a contract laden with incentives for 2004 before he signed with the Dodgers. He obviously pitched well for the Dodgers in 2004 going 13-5 with a 4.07 ERA and threw a shutout in the division series.

Baird thinks that Lima will be the "innings-eater" he's looking for. To me that seems questionable with his injury history. 2004 was the first time he'd thrown as many as 170 innings since 2000. It must also be remembered that he was 9-1 with a 3.08 ERA and 1.09 WHIP at Dodger Stadium last year and 4-4/5.56/1.48 on the road. Ouch. He also gave up 17 homeruns in 68 innings on the road, or one every 4 innings. Dodger Stadium is always a pitcher's park and in 2004 was 25th in run scoring at 90.9%. Teams that play in extreme parks tend to be fooled by the statistics of their own players, often overvaluing mediocre pitchers when in a pitcher's park and overvaluing mediocre hitters when in a hitter's park. This appears not to have happened with the Dodgers in this case, given that they don't appear to be that deep in pitching right now, as evidenced by their desire to get Javier Vazquez or possibly Randy Johnson. It causes one to wonder whether Paul DePodesta (the Dodgers GM) knows something about Lima that Baird doesn't.

Unfortunately, since the terms of the deal were not disclosed it is difficult to really evaluate the signing. The good news for sure is that it is only a one-year deal.



Tuesday, December 21, 2004

No Mas Macias Please

As expected the Cubs re-signed utility infielder Jose Macias. I wasn't a fan of the original signing last year and in my estimation the Cubs spent another $825,000 (at least he didn't get a raise) on a very very mediocre player (and that's being kind). Given that the Cubs already signed Neifi Perez I have no idea why Jim Hendry wants another one of these guys.

What's most disappointing though is that Dusty Baker loves guys like Macias (old, no plate discipline, can run a little) and so you know that if you give him these kinds of players (Marvin Benard, Calvin Murray, and Tom Goodwin spring to mind), he'll actually use them. He started Macias almost 10% of the time (13 games) in the outfield (7 in right, 3 each in center and left) in addition to his 15 starts at second and third. So that's 28 starts for a guy who had a .292 OBP and all of 5 walks in 204 plate appearances. And this was not a fluke. His career OBP is .301 in over 1,500 plate appearances. He's 33 and not getting any better. There is nothing to like about this.

Monday, December 20, 2004

The "New" Runs Created

A while back I had written about Bill James' Runs Created formula. Until this weekend I had not seen the analysis done by Jim Furtado et al. when James first published the "new version" in 1999. Interesting stuff....

Asencio, Marrero, Santiago, and Sosa

Well, it was good to see that the Royals did not offer Miguel Asencio a big-league contract for 2005. The 24 year-old was recovering from Tommy John surgery and wasn't projected to be back until August or September. Regardless of whether he was recovering or not Asencio is a guy who doesn't project to have any kind of career. In 171 career innings he's struck out 85 and walked 85. Such a low strikeout rate (4.48 per 9 innings) and a high walk rate aren't a combination for success.

So right now it looks like Runelvys Hernandez, Zack Greinke, Denny Bautista, Jimmy Gobble, Kyle Snyder, Brian Anderson, Mike Wood, and Kevin Appier (yes they signed the 37 year-old to a minor league contract) will all be in the running for starting spots in 2005.


On another front the Royals acquired Eli Marrero from the Braves for pitcher Jorge Vasquez. No, this is not the answer to the corner outfield situation (he's 31 years old) but it is interesting because Marrero can hit left-handers - to the tune of .415/.462/.670 last year in Atlanta. This could make a good platoon with Matt Stairs or Terrence Long at the corners. Allard Baird admitted that was his thinking in the MLB.com story:

"But what this does allow is for us to buy some time if we need it. Looking at the free agent market and where the dollars go there, for us this makes more sense. If you combine a left-right, whether it be (Terrence) Long, whether it be (Matt) Stairs, you'd have pretty good production out there. It also improves the bats off the bench."


Finally, the Royals also unloaded Benito Santiago to the Pirates in exchange for pitcher Leo Nunez. The Royals will still have to pick up about $1M of his $2.15M salary in 2005. Of all the moves Baird made before the 2004 season this one was one that made the least sense to me. Paying a 39 year-old catcher over $4M for 2 years is strange to say the least. The deal also indicates that perhaps the Royals did not want to deal with the BALCO fallout and Santiago's likely implication in it. I'm looking forward to seeing what John Buck can accomplish in a full season behind the plate.

One more thing, Sammy Sosa will not be a Royal in 2005. His $18M contract is way too big for the Royals to even consider picking up half. Sosa also has a clause where he gets his 2006 option if traded and he won't waive it in order to play in a small market with the likely prospect of losing 90 or more games.

Sunday, December 12, 2004

Contextualizing OPS

A while back I had written about OPS and how it is a good back-of-the-envelope calculation for correlating a player's production with run creation. After doing so I went on to discuss normalizing OPS against the league average (called NOPS or OPS+) and then contextualizing OPS for the home park of the player using the Batter Park Factor (BPF).

A couple days ago I received an email from Brandon Heipp who has done some fine work on the BaseRuns estimator that I cited in my article on the subject. He pointed out to me that the technique of dividing NOPS by the BPF is not as accurate as using the square root of BPF. I hadn't heard of doing this and was simply following the method used by Pete Palmer in The Hidden Game of Baseball. In looking back at the book, however, I notice that Palmer, after showing how to divide NOPS by BPF, says the following on page 87.

"To apply Batter Park Factor to any other average - On Base, slugging, Isolated Power, batting average - use the square root of the BPF. This is done so that run scoring for teams, which is best mirrored by On Base Average times slugging percentage, can be represented clearly."

In other words:

1. Since BPF is a measure of the impact of a ballpark on runs scored
2. and Batter Run Average (BRA calculated as OBA*SLUG) very closely correlates with run scoring
3. and BRA/BPF = (OBA/SQRT(BPF)) * (SLUG/SQRT(BPF))

Then to more accurately contextualize OBA, SLUG or other components of run scoring you should divide them by the square root of BPF.

This can also be seen in the basic runs created formula RC = ((H+BB)*TB)/(AB+BB). This formula is the equivalent of OBA*SLUG*AB. If you divide this by the BPF it is equivalent to (OBA/SQRT(BPF)) * (SLUG/SQRT(BPF)) * AB. And so once again the more accurate way to apply BPF to the components of run scoring is by using the square root.
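The two identities above can be checked numerically. The following is a minimal Python sketch, not a definitive implementation: the season line and the BPF value are made-up illustrative numbers, the function names are hypothetical, and OBA is the simplified (H+BB)/(AB+BB) form implied by the basic runs created formula.

```python
import math

def park_adjust(value, bpf):
    """Park-adjust one component (OBA, SLUG, ...) by dividing by sqrt(BPF)."""
    return value / math.sqrt(bpf)

def batter_run_average(oba, slug):
    """BRA as described in the text: OBA * SLUG."""
    return oba * slug

def runs_created(h, bb, tb, ab):
    """Basic runs created: ((H + BB) * TB) / (AB + BB)."""
    return (h + bb) * tb / (ab + bb)

# Hypothetical season line and park factor, for illustration only
h, bb, tb, ab = 150, 70, 280, 500
bpf = 1.06

oba = (h + bb) / (ab + bb)   # simplified OBA used by basic RC
slug = tb / ab

# Dividing BRA by BPF is the same as adjusting each component by sqrt(BPF)
bra_adjusted_whole = batter_run_average(oba, slug) / bpf
bra_adjusted_parts = batter_run_average(park_adjust(oba, bpf),
                                        park_adjust(slug, bpf))

# Basic RC is the same as OBA * SLUG * AB
rc = runs_created(h, bb, tb, ab)
rc_from_rates = oba * slug * ab

print(math.isclose(bra_adjusted_whole, bra_adjusted_parts))
print(math.isclose(rc, rc_from_rates))
```

Both comparisons hold for any positive BPF, which is why the square root is the natural adjustment for a single component.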

OPS does not correlate as closely with run scoring as do either RC/G or BRA. Albert and Bennett in their book Curve Ball found that using OPS "the number of runs scored by a team per game can be predicted within about .15 Runs per Game for two-thirds of the teams." Although not as good as RC/G, BRA, Total Average (TA), or Batting Runs, it does correlate much more closely than the traditional stats such as OBA, SLUG, or AVG. As a result, it's probably not as important to use the square root with OPS as it is with pure component statistics such as OBP or SLUG.

That said, I did recalculate the top leaders in OPS using the square root of BPF. Here they are:


Year  Lg  Player        Team   OPS   NOPS  NOPS/SQRT(BPF)
2002  NL  Barry Bonds   SFN    1381  186   195
2001  NL  Barry Bonds   SFN    1379  182   191
1920  AL  Babe Ruth     NYA    1379  189   185
1921  AL  Babe Ruth     NYA    1359  179   177
1923  AL  Babe Ruth     NYA    1309  178   176
1941  AL  Ted Williams  BOS    1287  177   175
1957  AL  Ted Williams  BOS    1257  178   173
2003  NL  Barry Bonds   SFN    1278  171   172
1926  AL  Babe Ruth     NYA    1253  170   172
1927  AL  Babe Ruth     NYA    1258  168   171

These are very similar to the previous list of course, but Ted Williams' 1957 season now makes the grade. Williams benefited from playing in Fenway Park, and dividing by the square root of BPF discounts that benefit less than dividing by the full BPF did. Babe Ruth's 1931 season falls off. Of course Bonds' 2004 season would rank right up there but given recent events I didn't feel like calculating it.

Saturday, December 11, 2004

Aging Sluggers

Interested in the comment by Clay Davenport about sluggers whose finest power seasons came after the age of 35, I decided to take a closer look. By whittling my list of players to only those 36 with 400 or more career homeruns, which included all players through 2003, I found that their peak power season measured by homeruns per at bat came at age 26 and again at 30 (.065 or about 36 homeruns per 550 at bats). However, rather than seeing an increase in power, the curve is almost flat as they simply retain their power longer and don't really start to lose it until past age 37. In terms of NOPS their peak season comes at age 27 (133.1) although once again the decline is very small and they stay well above 100 even through age 41 (108.1).

I also looked at the power numbers for my original sample of 6,000 players and found that their peak power season came at age 27 (.027 or 14.9 homeruns per 550 at bats) and the curve looks almost identical to the NOPS curve I posted previously. I had expected that the peak power season might come a couple years after the peak in NOPS instead of just one year (their peak NOPS was 115.8 at age 26).
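The "homeruns per 550 at bats" figures are just the homerun rate scaled to a full season of at bats. A small Python sketch of that arithmetic, using the two rates quoted above (the function name is hypothetical):

```python
def hr_per_550(hr_per_ab, at_bats=550):
    """Scale a homeruns-per-at-bat rate to a 550 at-bat season."""
    return hr_per_ab * at_bats

# Rates quoted in the text
sluggers_peak = hr_per_550(0.065)   # the 400+ homerun group, about 36
sample_peak = hr_per_550(0.027)     # the larger sample, about 14.9

print(round(sluggers_peak, 1), round(sample_peak, 1))
```

Because the quoted rates are rounded to three decimals, the scaled values can differ slightly from the figures in the text.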

Finally, I compared these against the cumulative totals from the eight sluggers mentioned by Davenport (Aaron, Baines, Evans, Fisk, Galarraga, Martinez, Sauer, and Ripken). Here I found that indeed their peak power season came at age 37 (.062) with their peak NOPS at 122.6 at age 35. They also of course retained their power even longer and didn't show decline until age 41.

I conclude from this that while Davenport is correct that a few sluggers peak late, that is not the norm. And while Bonds may in fact belong in that select group, it appears unlikely since even for those sluggers their batting average decreased and their strikeout rate increased even as they were maximizing their power - something alien to the post-35-year-old Bonds.

Thursday, December 09, 2004

What is a Normal Career Trajectory?

Since I've been writing about normal and abnormal career trajectories the last few days I'm sure several readers have wondered just what a "normal" career trajectory is.

At first that seems like a simple question to answer. Just calculate the batting average, slugging percentage, or OPS of all players who have played major league baseball at various ages and graph them. While that seems like an obvious answer, things are not so simple. If you proceed along those lines you quickly find that there are a lot of players whose careers end at a young age due to poor performance and so the early ages will be biased towards lower values. This will tend to distort the picture and show more improvement with age than is really the case since only the better players will still be playing in their early thirties. This problem is even more pronounced as you get to advanced ages since only players that have been very productive make it past age 35 (Roy Hobbs excepted of course). The end result would be a curve that makes it appear that players remain at a low level into their early thirties and then suddenly improve.
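The distortion described above can be shown with a toy example. This Python sketch uses three made-up players whose skill never changes; the numbers and names are purely illustrative assumptions. Averaging everyone still active at each age produces a rising curve even though no individual improved.

```python
# Hypothetical players: (constant skill level, last age played)
players = [(80, 25), (100, 30), (120, 38)]

def naive_average_by_age(players, ages):
    """Average the skill of every player still active at each age."""
    averages = {}
    for age in ages:
        active = [skill for skill, last_age in players if age <= last_age]
        if active:
            averages[age] = sum(active) / len(active)
    return averages

curve = naive_average_by_age(players, range(22, 39))

# The average climbs as the weaker players drop out of the sample
for age in (22, 26, 31):
    print(age, curve[age])
```

Restricting the sample to players with long careers, as described next, keeps roughly the same population in the average at every age.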

What we need to do instead is perform these calculations on a subset of the player population. What I did was to select all the players whose careers started after 1900 and who appeared in games from 1901 to 1978 and who garnered over 6,000 plate appearances in their careers. By following this methodology I excluded all players who were active in 2003 (Rickey Henderson was a rookie in 1979) and selected only those who had relatively long careers equivalent to being a regular position player for around 10 years. An argument can also be made to only include players who played in 10 or 15 seasons or from ages 21 to 39 or some such span. Conveniently, this also excluded the current crop of players whose suspected chemical enhancement would skew the data – after all we’re after a baseline here. Of course, that's not to say that players in the past did not use other forms of illegal or banned substances. But the presumption is that even if so the problem was less frequent than in the lively player era. This gave me a set of 249 position players.

In order to measure the productivity of a player I chose to use Normalized OPS (NOPS or OPS+) for reasons discussed here. I calculated NOPS by taking the raw OPS and dividing it by the league OPS for each player and multiplying by 100. Values greater than 100 are therefore above the league average. These were then weighted by plate appearances (using only AB+BB since the data for sacrifice hits, hit by pitch, and sacrifice flies was not available across the entire time span). Finally, the weighted values were averaged for each age. I then graphed the NOPS by age, discarding the sample sizes under 10 at ages 17, 18, 43, 44 (Sam Rice), and 53 (Minnie Minoso), and came up with the following (the yellow line is a three-year moving average):



As you can see players with long careers tend to begin their careers just below the league average and quickly surpass it reaching a sustained performance level about 15% greater than the league average when they are 25 to 28 years old. Their peak performance comes at the age of 26 with an NOPS of 115.8. From the age of 28 on, there is a slow descent that begins to accelerate at the age of 33. That continues through their late thirties until they reach about league average again at 40 years old. Of course, as the ages increase past age 33 the sample sizes decrease. This accounts for the slight upsurge in NOPS at age 41 and the flatter nature of the curve at the right end. For those interested in sabermetrics you'll notice that the yellow curve is very similar to one drawn by Bill James way back in the 1987 Baseball Abstract and reflects one of the sabermetric principles I covered in my article Sabermetrics 101.

"Both hitters and pitchers peak at age 27 and decline more quickly than is commonly thought. This should impact how players are scouted, developed, and paid. For example, many players by the time they reach free agency are already past their peak performance and so can be expected to decline."
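The NOPS-by-age calculation behind the curve can be sketched in Python as follows. This is a rough outline under stated assumptions, not the code actually used: the record layout and function names are hypothetical, each record is assumed to be one player-season, and the three-year moving average is assumed to be centered on each age.

```python
from collections import defaultdict

def nops(ops, league_ops):
    """Normalized OPS: 100 is league average."""
    return 100.0 * ops / league_ops

def weighted_nops_by_age(seasons, min_players=10):
    """Average NOPS per age, weighted by AB + BB.

    seasons: iterable of dicts with keys age, ops, league_ops, ab, bb
    (an assumed layout). Ages with fewer than min_players records are dropped.
    """
    weighted_sum = defaultdict(float)
    weight = defaultdict(float)
    count = defaultdict(int)
    for s in seasons:
        pa = s["ab"] + s["bb"]          # stand-in for plate appearances
        weighted_sum[s["age"]] += nops(s["ops"], s["league_ops"]) * pa
        weight[s["age"]] += pa
        count[s["age"]] += 1
    return {age: weighted_sum[age] / weight[age]
            for age in sorted(weight)
            if count[age] >= min_players and weight[age] > 0}

def moving_average(by_age, window=3):
    """Centered moving average over consecutive ages (assumed form)."""
    ages = sorted(by_age)
    half = window // 2
    smoothed = {}
    for i in range(half, len(ages) - half):
        chunk = [by_age[a] for a in ages[i - half:i + half + 1]]
        smoothed[ages[i]] = sum(chunk) / window
    return smoothed

# Tiny made-up example; min_players lowered so the toy data survives
example = [
    {"age": 26, "ops": 0.900, "league_ops": 0.750, "ab": 500, "bb": 100},
    {"age": 26, "ops": 0.750, "league_ops": 0.750, "ab": 300, "bb": 0},
]
print(weighted_nops_by_age(example, min_players=1))
```

Weighting by AB+BB means a full-time season counts for more than a partial one at the same age.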

When you compare this curve with the curves I've posted for Barry Bonds the differences become clear while the curve for Sammy Sosa looks similar to this one. One objection to making that comparison would of course be that the players considered in my "normal" curve did not focus on nutrition and weight lifting as today's players do. I'm certainly supportive of that argument and assume that the curve would show sustained performance at more advanced ages. What it wouldn’t show, however, is increased performance at those advanced ages.

A second objection to making the comparison is that you can't compare one individual with a group average. Once again, I'm sympathetic to the argument and another way to do it would be to compare him directly to other players who have played in as many seasons. However, doing so would only serve to highlight how different Bonds' career trajectory is from those of Babe Ruth, Hank Aaron, Willie Mays, and others.

And finally, a third objection is that there have been examples of power hitters having their best homerun seasons bunched late in their careers. Clay Davenport recently noted on SABR-L that Hank Aaron's best homerun season came at age 39, Darrell Evans at 38, Carlton Fisk at 40, Hank Sauer at 37, Andres Galarraga at 35, Harold Baines at 36 and 40, Edgar Martinez at 37, and Cal Ripken at 38. And for many of these players their higher power seasons were bunched at the end of their careers. I don't disagree and certainly believe that power lags total performance. However, in Bonds' case it's not just his power that increased but his total performance including batting average, walks, slugging percentage, and even strikeout ratio. That said, it would be interesting to look at the careers of these individuals as a group when compared to Bonds. Work for another day.

Cubs Make Some Moves

Well, the Cubs made their second base decision by signing Todd Walker to a one-year deal for $2.5M with a 2006 option based on plate appearances. They also inked Nomar Garciaparra for one year at $8M.

I think Walker is the more important signing. It'll be great to have his left-handed bat in the lineup every day near the top of the order. I hope Nomar can return to form and stay healthy but I'm not holding my breath. I'm afraid we're going to see way too much of Neifi Perez next year at the Friendly Confines. They also offered arbitration to Matt Clement and Todd Hollandsworth. It sounds like Clement already has multi-year offers from other teams and so likely won't accept. However, the Cubs will get a draft pick as compensation. Hollandsworth is likely to return. One minor signing was backup catcher Henry Blanco, almost as poor a hitter as the departing Paul Bako but a good defender.

Overall though the Cubs haven't really improved anything yet over the team that won 89 games last season and you might say they've gone backwards by losing Clement in favor of Glendon Rusch, signing Ryan Dempster and Perez, and with Sosa getting a year older. If they can sway Carlos Beltran away from the Yankees that will all change however. The winter meetings are about to start...

Wednesday, December 08, 2004

Bonds and Sosa

As a Cubs fan I’ve been interested to take a look at the career of Sammy Sosa in the wake of the BALCO mess. This has an impact because it was reported that the Mets had backed away from Sosa after the Jason Giambi revelations last week.

At the risk of sounding like a Giants fan defending Barry Bonds, it should be remembered that Sosa, unlike Mark McGwire, admitted only to taking Creatine and not andro (Androstenedione, a substance that is virtually the same as a steroid differing only by a single hydrogen atom). Of course, the fact that Sosa hasn’t admitted it doesn’t mean he didn’t take it. After all, we know that he doesn’t object to cheating on principle given the corked bat incident of 2003 and his unbelievable denial from ignorance.

In his defense, however, one can raise two points. First, in the case of Sosa it should be noted that he definitely changed his offensive approach with the help of hitting coach Jeff Pentland. Watching tapes of Sosa from 1997 and before you immediately notice the higher hand position, a more closed stance, and his closeness to the plate.

Second, it should be remembered that Sosa’s peak power seasons came at the ages of 29-32 and not 37-39 as in the case of Bonds. This is much more in line with the historical peak in power for players coming slightly after their physical peak during ages 27-29. Much to the disappointment of Cubs fans his decline in the last three years is more typical of players in general as can be seen in the following graph.


Perhaps I’m an optimist, and I grant that being a Cubs fan colors what I’ve said here, but until I see other evidence I’m cautiously optimistic that despite his faults Sosa did not cross the line.

Deployment Patterns for the Compact Framework

Just found out that the article Jon Box and I wrote "Deployment Patterns for Microsoft .NET Compact Framework" has been published on MSDN. There is a sample application showing a Smart Application Updater component included as well. Happy reading and I'd love to know what you think.


Tuesday, December 07, 2004

KC .NET User's Group 12/16

The next KC .NET User Group meeting will be Thursday, December 16 - 6:00 PM at the Microsoft Kansas City office.

Directions:
http://www.microsoft.com/mscorp/info/usaoffices/midamerica/kansascity.asp

Topic: Software in a Service-Oriented World
If services are the next big thing—and they are—what impact will this change have on software development? David Chappell will give a perspective on this question, focusing on the impact this move will have on .NET. Drawing from his forthcoming book on Indigo, David will describe how this new technology fits and what it does, and he’ll explain why BizTalk Server is becoming a central technology for .NET developers to understand. The goal is to give a clear sense of what a developer’s life will be like in this brave new world.

Presented by: David Chappell
David is Principal of Chappell & Associates in San Francisco, California. Through his speaking, writing, and consulting, David helps information technology professionals around the world understand, use, market, and make better decisions about enterprise software technologies.
http://www.chappellassoc.com

Thanks Quilogy

For those who haven't heard this is my last week at Quilogy. After nine years of consulting, training, project management, courseware development, more training, and architecture at Solutech/Quilogy I felt I needed to take advantage of a different opportunity to further my career. Hopefully, there will be more to write in that regard in the future.

As for my time at Quilogy I have nothing but good things to say. The company - CEO Randy Schilling and CIO Alan Groh - has given me the opportunities to develop my technical skills and the freedom to integrate other interests such as writing, for which I'm very grateful. Rick Kight, now in St. Charles heading up the Quilogy Development Center (QDC), was my first manager here in the Kansas City office (in the heady days when Powerbuilder was king) and remains a good friend. Michael Smith, Jan Brandt, Eko Setiawan, Lindsay Kinnan, Bob Kimbrough, Chris Eckles, and Ron Hostetter among many others in KC have all been very good to me and have been a pleasure to work with. Of course Jon Box in Memphis worked with me on both the first Atomic course and the Compact Framework book and I'm sure we'll be collaborating in the future.

Throughout the rest of Quilogy Manish Chandak and John McCartan in St. Louis, Dean Furness in Des Moines, and John Talbott and Dave Koopmans in Omaha have all been long time Solutech/Quilogy employees that I highly respect for their skill as well as their work ethic.

I'm excited about the new opportunities but it's always hard to leave what you know and what you're comfortable with, especially when, like me, you're leaving a good place to work. Thanks Quilogy.

The Lively Player Era

For those who are unaware of the magnitude of the increased performance Barry Bonds has evidenced in the last few years I've updated a graph I created a few months back with Barry's 2004 stats as well as added HR/AB.
In looking at this graph it appears that by age 33 Bonds had begun the natural decline that is typical for ball players. By age 35, however, his performance levels were equal to his peak period from age 27 to 30. After that he continued to get better, with his peak homerun season of 73 coming at the age of 36. After dipping back down at age 37, his homeruns per at bat have climbed a little the last two seasons while his strikeouts have declined and his walks, of course, have gone through the roof.

I don't know that I have too much to add to what's been said on other baseball blogs this week. I guess like a lot of fans (who should shoulder some of the blame for not being more outspoken and clear sighted) I had hoped that Bonds and others had improved their performance in large part due to hard work and weight training along with a confluence of other factors I've written about before. Alas, that seems unlikely to be the case as the scandal spreads, and so the increased homerun output since 1993 will be - at least mentally - placed in a different category for baseball fans like me. That saddens me because I'm one who does cherish the statistics of the game and value them in large part for their continuity. Quite simply it makes a wreck of them since there is no way to separate the "livelier players" (in the words of George Will) from the unenhanced ones.

And the unfortunate thing is that the widespread use of these substances was so preventable. If baseball had gotten serious when other sports did (the NFL instituted its policy in 1987, and the hoopla surrounding Ben Johnson came in 1988), much of this would have been avoided. Baseball even had a second chance in 2002 when the Ken Caminiti and Jose Canseco stories came out but once again they fiddled while Rome burned. The strength of the players' union and the disunity of the owners both played a role.

I did want to address two issues that came up this week. The first is from the Peter Gammons story on ESPN a couple of days ago, in which he said,

"In 1999, Barry Bonds was already a Hall of Fame player. He had won three MVPs, and should have had a fourth. He hit .300 and averaged 36 homers a year in the '90s. He is such an intelligent hitter that teammates claim he knows every pitch that's coming, he's reduced the strike zone to the size of a nickel, and in the 21st Century has batted .306, .328, .370, .341 and .362, with on-base percentages of .440, .515. .582, .529 and .609. That's not chemical, that's simple greatness."

I'm not so sure that what Gammons described can be so easily passed off as skill and that there isn't a positive feedback loop related to Bonds' presumed chemical enhancement in operation that shrinks his strikezone and accounts for his "intelligence" and increased batting averages and on base percentages.

If one assumes that Bonds' increased strength has increased his bat speed it is reasonable to assume that this allows him to wait on pitches longer. This allows him to get better recognition and avoid swinging at some percentage of balls that he formerly would chase. This results in more line drives and thus more hits and a higher batting average. At the same time his strength allows him to hit balls harder and farther than before. Gammons noted that a homerun distance expert has claimed that "prior to 2000, Bonds hit three homers longer than 450 feet; in the last five years, he has hit 26". Pitchers aren't dumb and notice this and therefore rightly pitch carefully to him. Umpires also know this and aren't expecting as many strikes thrown to Bonds. The strikezone shrinks. His walks and on base percentage (not counting intentional walks) skyrocket. Bonds can therefore be more selective and swing only at very hittable pitches (mistakes, usually fastballs with poor location or hanging curves). He hits them very hard. Pitchers see this and are even more reluctant to throw strikes. Fewer strikes are expected, the strikezone shrinks. The end result is what you see.

The second is the statement that Bonds made when he said he didn't know what was in the "cream" that he got from his trainer and didn't ask. I'm not a professional athlete but my brother-in-law is an Olympic wrestler and in observing him I find it impossible to believe that any professional athlete, whose job is their body after all, wouldn't know what they were using on their body or wouldn't care or wouldn't see changes that were unnatural. Athletes at that level understand their bodies much more deeply than your average Joe. I'm not buying Barry's defense from ignorance.


Update: Just read this column by George Will. I was referring to his comments on This Week on Sunday morning, much of which he echoes in the column. Related to his points about fairness in sport, Gammons quoted the Royals' Mike Sweeney saying at this year's All-Star game, "I want strong testing because I don't think it's fair for someone to have an illegal advantage over me." Way to go Mike.

Sunday, December 05, 2004

Defensive Spectrum Again

For completeness in the discussion on the defensive spectrum here are the other positions graphed with all of the outfielders grouped together (this was necessary since the Lahman database I am using does not differentiate outfielders until 1996).

As you can see, early in the century - until the 1930s - outfielders held a slight advantage offensively and were then generally eclipsed by first basemen. Once again this may be attributed to the need for good fielding first basemen in an era where bunting was more frequent. It's interesting that catchers were at the bottom of the offensive ladder until the teens when they eclipsed shortstops, who have for the most part remained there ever since.

Since I have data broken out by outfield position from 1996 on, I thought I'd share that as well.

As you would expect, DHs rule the roost while centerfielders come in last. Right and left field have jockeyed for second. The averages over this time span are:

LF 866
CF 798
RF 876
DH 903

Based just on this data using the (flawed) assumption that defensive value is inversely proportional to offensive output I would create the following defensive spectrums by era.

1900-1910
[ OF - 2B - 1B - 3B - SS - C ]

1911-1930
[ OF - 1B - 2B - 3B - C - SS ]

1931-1972
[ 1B - OF - 3B - C - 2B - SS ]

1973-1995
[ DH - 1B - OF - 3B - C - 2B - SS ]

1996-2003
[ DH - 1B - LF - RF - 3B - CF - C - 2B - SS ]

Of course, centerfielders were always more valuable defensively and sought after than the other two outfield positions and so it's not really fair to lump all three together. As a result centerfield should be placed farther to the right by one or two places in each of the periods before 1996. It's interesting to note the shift of second base to the right and first base to the left over time.

Saturday, December 04, 2004

Defensive Spectrum Redux

I had written the other day about the defensive spectrum in regards to Ron Hostetter's post about Royals prospects.

One of the interesting things about the spectrum is that it has not remained static over time. For example, in the first half of the last century third base was usually considered a more demanding position than second base. As a result, great hitters such as Rogers Hornsby, Eddie Collins, Frankie Frisch, Charlie Gehringer, and Nap Lajoie manned second while slick fielders like Jimmy Collins, Pie Traynor, Stan Hack, and George Kell were placed at third. Today of course, second base is considered a key defensive position, which is why it rates behind only catcher and shortstop while third base has drifted left past center field to the middle of the spectrum and is populated with more productive hitters like Mike Schmidt, George Brett, Matt Williams, and Chipper Jones.

To illustrate the changing nature of the spectrum the following graph shows the OPS of second and third basemen over time. It includes only those players who played more than 100 games at the position for each year. Note that the change in the relative defensive value of the two positions can be traced to the early 1950s when the small-ball of earlier years, replete with bunting and chop hitting (which a good third baseman can impact), had all but disappeared. It's interesting to note that the gap between the two positions is once again closing. I chalk this up to the realization on the part of many teams that second base may not be as important a defensive position as many baseball people once thought, resulting in more offensive-minded players such as Jeff Kent, Alfonso Soriano, Todd Walker, and Mark Bellhorn being slotted there.

This trend can also be seen by breaking down the OPS numbers by quarter century:

            2B    3B    Pct
1900-1924   704   689   102%
1925-1950   745   764    98%
1951-1975   691   762    91%
1976-2003   730   783    93%
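
The Pct column is simply the second basemen's OPS divided by the third basemen's OPS, rounded to a whole percent. As a rough sketch, the column can be reproduced in Python from the figures in the table (the dictionary and function names are just for illustration):

```python
# OPS (x1000) by period, copied from the table: (2B, 3B)
ops_by_period = {
    "1900-1924": (704, 689),
    "1925-1950": (745, 764),
    "1951-1975": (691, 762),
    "1976-2003": (730, 783),
}

def pct_of_third_base(second_base_ops, third_base_ops):
    """Return 2B OPS as a whole-number percentage of 3B OPS."""
    return round(100 * second_base_ops / third_base_ops)

pcts = {period: pct_of_third_base(*ops) for period, ops in ops_by_period.items()}
for period, pct in pcts.items():
    print(f"{period}: {pct}%")
```

Anything over 100% means second basemen out-hit third basemen in that period, which only happens in the first quarter century.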


Thursday, December 02, 2004

Defensive Spectrum

Good post from Ron Hostetter on how the Royals' top prospects are all shifting to the left on the defensive spectrum. I wonder how many left fielders you can put on the field at one time?

For those who aren't familiar with the defensive spectrum, the idea is one that was popularized by Bill James and basically states that defensive positions can be arranged on a spectrum of least to most demanding, i.e. [ DH - 1B - LF - RF - 3B - CF - 2B - SS - C ]. Players generally move from right to left on this spectrum over the course of their careers. Shifts in the other direction are rare and seldom work.

Bonds and Giambi

Of course, just when I say there isn't much baseball news the story breaks about Jason Giambi's admitted steroid use to the grand jury investigating BALCO. Giambi's injuries in both 2003 (knee) and 2004 (tumor and parasite) led many people to suspect that he was on steroids although he continued to deny it. His brother Jeremy apparently also admitted to the grand jury that he took the same substances.

What I found most interesting about the ESPN story was the following quote.

Bonds brought Anderson [Greg, Bonds' personal trainer] on a barnstorming players' tour of Japan after the big-league season in 2002. Giambi said it was on that November trip he met the trainer.

In his testimony, Giambi said he asked Anderson about the things Bonds was doing to stay at an elite level.

"So I started to ask him: 'Hey, what are the things you're doing with Barry? He's an incredible player. I want to still be able to work out at that age and keep playing,' " Giambi testified. "And that's how the conversation first started."

Bonds continues to deny it but it seems the hoofbeats are getting closer. Gary Sheffield already admitted it but claims he didn't realize that what he was given contained steroids. Right. This also blows away any remaining shred of legitimacy for MLB's steroids policy. The union clearly doesn't care about leveling the playing field for the majority of its members but would rather protect the cheaters.

Wednesday, December 01, 2004

Liberals to Canada

Pretty quiet on the baseball front so I thought I'd pass this along from Joe Blundo, a Columbus Dispatch columnist, who published this in the morning edition on Nov. 16, 2004.

The flood of American liberals sneaking across the border into Canada has intensified in the past week; sparking calls for increased patrols to stop the illegal immigration. The re-election of President Bush is prompting the exodus among left-leaning citizens who fear they'll soon be required to hunt, pray, and agree with Bill O'Reilly.

Canadian border farmers say it's not uncommon to see dozens of sociology professors, animal-rights activists and Unitarians crossing their fields at night. "I went out to milk the cows the other day, and there was a Hollywood producer huddled in the barn," said Manitoba farmer Red Greenfield, whose acreage borders North Dakota. The producer was cold, exhausted and hungry. "He asked me if I could spare a latte and some free-range chicken. When I said I didn't have any, he left. Didn't even get a chance to show him my screenplay, eh?"

In an effort to stop the illegal aliens, Greenfield erected higher fences, but the liberals scaled them. So he tried installing speakers that blare Rush Limbaugh across the fields. "Not real effective," he said. "The liberals still got through, and Rush annoyed the cows so much they wouldn't give milk."

Officials are particularly concerned about smugglers who meet liberals near the Canadian border, pack them into Volvo stationwagons, drive them across the border and leave them to fend for themselves. "A lot of these people are not prepared for rugged conditions," an Ontario border patrolman said. "I found one carload without a drop of drinking water. They did have a nice little Napa Valley cabernet, though."

When liberals are caught, they're sent back across the border, often wailing loudly that they fear retribution from conservatives. Rumors have been circulating about the Bush administration establishing re-education camps in which liberals will be forced to drink domestic beer and watch NASCAR.

In the days since the election, liberals have turned to sometimes-ingenious ways of crossing the border. Some have taken to posing as senior citizens on bus trips to buy cheap Canadian prescription drugs. After catching a half-dozen young vegans disguised in powdered wigs, Canadian immigration authorities began stopping buses and quizzing the supposed senior-citizen passengers. "If they can't identify the accordion player on The Lawrence Welk Show, we get suspicious about their age," an official said.

Canadian citizens have complained that the illegal immigrants are creating an organic broccoli shortage and renting all the good Susan Sarandon movies. "I feel sorry for American liberals, but the Canadian economy just can't support them," an Ottawa resident said. "How many art-history majors does one country need?"

In an effort to ease tensions between the United States and Canada, Vice President Dick Cheney met with the Canadian ambassador and pledged that the administration would take steps to reassure liberals, a source close to Cheney said. "We're going to have some Peter, Paul & Mary concerts. And we might put some endangered species on postage stamps. The president is determined to reach out."

Sunday, November 28, 2004

Clutch Hitting: Fact or Fiction?

Back from the Thanksgiving holiday I was alerted to this article on clutch hitting by a fellow SABR member. Interestingly, the author finds support for the existence of clutch hitting using Retrosheet data. There are two interesting things about this study.

First, the size of the clutch hitting effect that was found is described by the author:

"Over the course of a season, an average hitter will get approximately 150 clutch plate appearances, in which he will get on base 49 times with a standard deviation due to randomness of 5.7. The difference between a 'one standard deviation good' and an average clutch hitter amounts to only 1.1 successful appearances, while the difference between a good and an average overall hitter amounts to 3.9 successful plate appearances."

In other words, the clutch hitting effect that was found was relatively small: about 28% (1.1 of 3.9 successful plate appearances) of the difference between a good and an average hitter, and smaller than most baseball people would assume, I think. This means that a hitter who normally hits .240 might hit .255 in clutch situations or a hitter that normally hits .270 might hit .285. Over the course of a season the random effects (note that the standard deviation is much larger than the difference between an average and a good clutch hitter) would thus swamp clutch ability, which is why clutch hitting has been difficult to measure.
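
The arithmetic behind that comparison is short enough to check directly. Here is a small Python sketch using only the three figures from the quote (the variable names are mine, not the study's):

```python
# Figures quoted from the study, in successful plate appearances
# per roughly 150 clutch plate appearances in a season
clutch_edge = 1.1    # "one standard deviation good" vs. average clutch hitter
overall_edge = 3.9   # good vs. average overall hitter
random_sd = 5.7      # standard deviation due to randomness alone

# How big is the clutch edge relative to the overall-hitting edge?
share_of_overall = clutch_edge / overall_edge

# How big is the clutch edge relative to one season's random noise?
signal_to_noise = clutch_edge / random_sd

print(f"clutch edge as share of overall edge: {share_of_overall:.0%}")
print(f"clutch edge in units of random SD:    {signal_to_noise:.2f}")
```

The second ratio is the one that matters for measurement: the clutch edge is only about a fifth of the season-to-season noise, so a single season tells you very little about who has it.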

While this study is important it doesn't seem to me to contradict the basic conclusion of previous sabermetric studies that clutch hitting is largely a myth. See the following for more information:

Clutch Hitting And Experience
Clutch Hitting and Statistical Tests
Hitting with Runners in Scoring Position
Does Clutch Hitting Exist?

A second important point in the study is that the author found that power hitters tended to be poor clutch performers while contact hitters tend to be good clutch hitters. On this point the author notes:

"There was a great deal of discussion [on the Baseball Primer site in February 2004] as to whether or not the tendencies of power hitters was evidence that what I am measuring is not "cluch", but rather tendencies of various types of hitters against the types of pitching one would find in clutch situations. It is unclear how to prove or disprove this theory without knowing what types and quality of pitches each batter saw during his at-bat, but the fact that this correlation is less with higher-pressure situations would seem to suggest that the important part of the variation is not a function of hitter profile."

I'm not so sure I buy this argument. To me it seems that the correlation may well be higher in pressure situations because closers and other good relief pitchers who are leveraged in such situations have skills designed to get power hitters out. In other words, maybe relief pitchers are selected at some level because of their ability to retire hitters who can turn the game around with one swing. The fact that they're used in these situations may then account for much of the clutch ability that was found in the study.


Tuesday, November 23, 2004

The Thinking Fan's Guide to Baseball

I mentioned in a previous post that I had picked up a copy of Leonard Koppett's The Thinking Fan's Guide to Baseball: Revised and Updated last week. Koppett originally wrote the book in 1966 and subsequently updated it in 1991 and 2001. After his death in 2003 the book was once again updated and reissued with the help of Pat Gillick, former GM of the Toronto Blue Jays, Baltimore Orioles, and Seattle Mariners.

Koppett is generally credited with being one of the most statistically savvy media members and routinely used statistics in his columns that stretched from 1948 through 1975 while writing for The New York Times and the Oakland Tribune among others. He was also a member of SABR and appreciated much of what SABR does (SABR paid him tribute at the convention in Denver in 2003). The book moves through the various aspects of the game beginning with the activity on the field (hitting, pitching, fielding, baserunning, managing, umpires etc.) and then moving on to the behind-the-scenes view that Koppett knew so well including the media, road trips, scouts, scoring, and the business aspects of the game. He concludes with a section titled "The Whole Ballgame" which is a series of essays on other aspects of the game including expansion, changes to the ball, spring training, and other more or less random topics.

What immediately grabbed my interest of course was chapter 15, "Statistics". Here Koppett's attitude towards statistics can be summed up in the following quote:

"Even with all these things in mind [adjustments such as contextualizing and understanding relationships between statistics], however, the fiend for statistics can be led to totally incorrect conclusions, because there is one more fundamental flaw in the way standard statistics are kept. They record how much, but not when, and in the winning and losing of ball games, the when is all-important."

Given his reputation as a bit of an innovator with statistics this surprised me somewhat. He then goes on to give several examples of particular game scenarios where the "when" is all important. Of course, what his analysis misses (although he acknowledges it a bit later in the chapter in a different context) is that statistics are by their very nature aggregate and because of this can only be interpreted when applied to the larger context in which they were created. In other words, he's absolutely correct that official baseball statistics merely record what happened and not when and are therefore an abstraction. But what they assume is that the "whens" even out across the hundreds or thousands of "whats" for a season or career. By doing so they become meaningful as a measure of a player's true ability. This assumption, and the fact that studies have shown that clutch hitting is largely a myth, mean that when properly analyzed statistics can be a very useful tool indeed. In other words, Koppett vastly overestimates the variability in the performance of players in different situations.

Because of this "fundamental flaw" Koppett goes on to argue that baseball statistics cannot be used to predict, they cannot be used to prove a point but only convince in an argument, and they cannot be used to compare. The reasons he gives for his third point are not controversial. He points out the different contexts in which statistics are counted including ballparks and the problem of using very small sample sizes to make decisions. He does not, however, talk about how those problems can be overcome using park factors and confidence intervals for example.

Interestingly, he then moves into a discussion of how statistics are used or interpreted by players and management. In particular, in regards to management he says:

"Also, to the professional baseball man, many implicit qualities go with the phrase '.300 hitter' or '.250 hitter' or '20-game winner,' qualities not taken into account by the fan who uses the same terms."

While I certainly agree that front office personnel take into account things that are not reflected in the statistics alone, I'll take issue with the very categories used here as examples. Humans categorize reality in order to better understand it and the fact that the categories here include batting average and pitcher victories, two of the least effective ways to analyze the contributions of hitters and pitchers, belies the truth that the "professional baseball man" often didn't (or doesn't) understand what really goes into winning baseball games over the long haul.

And ultimately this is the problem of perspective I talked about in a recent post. By being so close to the game, the sportswriter, the scout, the manager, or the general manager can miss the bigger picture. How else can you square the fact that prior to the last few years winning strategies such as controlling the strike zone and not giving up outs were undervalued and are still controversial? You can see this in Koppett himself when throughout the book he uses batting average almost exclusively when discussing the merits of various hitters.

Finally, Koppett lays it all on the line when summing up the chapter.

"My own view is that the SABR mania (and I speak as a member of the organization) has gone out of control. The Bill James approach, of cloaking totally subjective views (to which he is entirely entitled) in some sort of asserted 'statistical evidence,' is divorced from reality. The game simply isn't played that way, and his judgments are no more nore less accurate than anyone else's. The truly mathematical manipulations - regressive analysis and all that - do not, in my opinion, add to what is already known on the working level. They are perfectly valid and even brilliant as material for conversation...But they don't tell a professional - a player, manager, or scout - anything he doesn't know anyhow by observation. And when his observation conflicts with the printout, he - and I - will trust his observation every time..."

He then goes on to give two reasons why. First, he sees statistics as too blunt an instrument to tell you all of what happened, which in part reflects his belief in the importance of clutch hitting discussed above, and second, that the "reality of the game" is too complex to capture statistically. He notes that "Most professionals feel the same way."

Perhaps I've been immersed in sabermetrics too long but I have a hard time getting my mind around that paragraph. Taking just part of the next to last sentence, "they don't tell a professional...anything he doesn't know anyhow by observation" reveals how flawed his reasoning here is. I wonder how many players, managers, or scouts could have told you how many runs a good offensive player typically contributes in a season before the birth of sabermetrics? My guess is that the range would have been from 20 to 200. I wonder how many professionals before the popularization of OPS would have told you that a player with a .309/.321/.443 line was actually a better player than one with a .269/.389/.492? I wonder how many professionals would have told you that it is generally ok to bunt with a runner on first and nobody out (it's not - ever)? These are all examples of information not obtainable at the working level.
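
To make the second example concrete, OPS is just on-base percentage plus slugging percentage. A quick sketch in Python, treating the two lines quoted above as AVG/OBP/SLG (the function and variable names are only illustrative):

```python
def ops(obp, slg):
    """On-base plus slugging."""
    return obp + slg

# Lines are (AVG, OBP, SLG) as quoted in the text
player_a = (0.309, 0.321, 0.443)   # the higher batting average
player_b = (0.269, 0.389, 0.492)   # the lower batting average

ops_a = ops(player_a[1], player_a[2])
ops_b = ops(player_b[1], player_b[2])

print(f"Player A OPS: {ops_a:.3f}")
print(f"Player B OPS: {ops_b:.3f}")
print(f"Difference:   {ops_b - ops_a:.3f}")
```

The player who trails by 40 points of batting average leads by well over 100 points of OPS, which is exactly the kind of thing batting average alone hides.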

And further to the point, baseball, being the game of the long season with thousands of discrete events (over 193,000 in 2003) does not lend itself to making decisions based only on observation. Human observation is simply not up to the task. This point was nicely illustrated by a story that Paul DePodesta, now GM of the Dodgers but formerly the assistant to Billy Beane in Oakland, told in a speech he gave and which Rob Neyer was nice enough to send me the transcript of:

"Our manager now, Ken Macha, loves our second baseman Mark Ellis. Mark Ellis is a good player, he plays hard, and he plays every day. But he didn't have a very good offensive year this year, yet Ken Macha kept putting him in the lineup every day. It even got to the point late in the year where he started hitting him leadoff. We finally went to Ken and said, 'We like Ellis too, but he probably doesn't need to be hitting leadoff, and getting all these at-bats.' And his comment to us was, 'Ellis is a clutch hitter.'

I thought, 'OK, clutch is one of those subjective terms I'm not wild about,' so I went back and I looked at the numbers, and at that time during the year Ellis was hitting about .163 with runners in scoring position and two outs, which I think is a clutch situation. But I didn't say anything, we kept it under wraps. When we were getting close to the playoffs, though, we began talking about the way the lineup should work against the Red Sox, and at one point Macha was talking about putting Ellis leadoff. Finally Billy Beane, our General Manager, just couldn't take it any more, and he said, 'Ellis is hitting .163 with runners in scoring position and two outs. He's not clutch.' And immediately, Macha said, 'But he hit that game-winning home run off of Jason Johnson.'

'OK, that's right, but if you want to play that game I'm going to come up with a lot more instances where he failed than instances you're going to come up in which he succeeded.'"

DePodesta's point was that observation isn't always the best way of understanding something, especially where large numbers and percentages are concerned.

So while I like this book overall (the writing is of course very entertaining and the anecdotes very interesting), the perspective is one that many sabermetrically knowledgeable fans will sometimes bristle at.

On a slightly different subject I found chapter 28, "The Ball's the Same, the Bat's the Same - Or Are They?" interesting as well. Here Koppett includes a nice history of changes in the styles of bats used and how the ball has changed over time. I was a bit surprised when after discussing the offensive outburst of 1987 which he indicates may have been accounted for by weather changes, he then says "In 1993, however, it was undeniable that a truly livelier ball appeared..."

This surprised me because it doesn't appear to me that livelier balls are usually given as the culprit. I've written previously on the power surge and tests that physicist Alan Nathan did on balls from 1974 and 2004 that show no differences in their "bounciness". In addition, Koppett makes no mention of the fact that 1993 corresponds with the first year of the Colorado Rockies, which certainly had an effect (I don't have the home/road breakdowns but as an example of the effect, in 1993 1,040 runs were scored in Denver to only 685 in Rockies road games). Further, it's evident from looking at the general trend in homeruns that the number of homeruns per game had been increasing starting in the early 1980s with a brief decline in the 1988-1989 period. Although the jump is more pronounced since 1993 the trend actually continued through 2001 before declining again.

This doesn't appear to me as the product of a livelier ball but rather the accumulation of advantages for hitters that might include increased strength, the full effect of aluminum bats, and the absence of intimidation among others. Koppett also notes, but doesn't differentiate, the possibility that the balls were not livelier starting in 1993 but instead had lower seams, which make breaking balls less effective (he assumes that seams could be lower only by winding the ball tighter, which I doubt). Another trend that's obvious from watching any game from the 1970s or 80s on ESPN Classic is the frequency with which balls are thrown out of play today versus 20 or 30 years ago. This is particularly noticeable on pitches in the dirt but also of course when a new ball is introduced each half inning as the outfielder or infielder throws the previous one into the stands. Balls that stay in play longer are easier to grip and therefore easier to throw good pitches with.


Sunday, November 21, 2004

A Brief History of Run Estimation: Base Runs

In previous articles in this series on the history of run estimation I've looked at Runs Created (RC), Batting Runs (BR), and Estimated Runs Produced (ERP). Although all of these formulas are intuitive in that they are based on a model of how the game is played, one of the differences between RC and both BR and its variant ERP is that RC is a non-linear formula. In other words, RC essentially says that the whole is greater than the sum of the parts in reference to offensive events while BR and ERP use a linear approach that assigns weights to the various offensive events. In this installment of the series I'll look at another intuitive but non-linear run estimation formula called Base Runs (BsR).

History
Base Runs (BsR) was developed by David Smyth in the early 1990s and since then much discussion of it has occurred on the Strategy and Sabermetrics forum. Like RC and BR this formula is based on team scoring, which is then applied to individuals. The basic formula for BsR is:

BsR = (BaseRunners * ScoreRate) + HR

In other words, runs can be estimated if you know how many base runners there are, what their chance of scoring is, and how many homeruns the team hit. To many, this is the strength of BsR in that it captures a more intuitive and accurate model of how runs are created. Like RC the formula is then further broken down into components like so:

BsR = (A * (B/(B+C))) + D

Here A represents baserunners, defined as hits plus walks minus homeruns; B is the advancement of those runners; C is outs, defined as at bats minus hits; and D is homeruns. So you can see that the ScoreRate is defined as the advancement value divided by the advancement plus the outs consumed.

ScoreRate = B/(B+C)

The complexity of the formula is centered in the calculation of B. I've seen several versions of the B factor including:

B = (.8*1B) + (2.1*2B) + (3.4*3B) + (1.8*HR) + (.1*BB)
B = .8*(1B+SB) + (2.2*2B) + (3.5*3B) + (2*HR)
B = (2.5*TB - H - 5*HR + 2*SB + .05*(BB+HBP))*X

As you can see in the first two formulas found on Tangotiger's site and posted by Smyth himself on a blog discussion of run estimators, the B factor is much like the Linear Weights values with the exception that they're larger and that triples outweigh homeruns. Although this seems intuitively wrong, this is because homeruns are a special case considered in the D factor. The third formula used by Brandon Heipp in his article on BsR for the August 2001 issue of the SABR journal By The Numbers takes a different approach and can be expanded to:

B = ((1.5*1B) + (4*2B) + (6.5*3B) + (4*HR) + (2*SB) + .05*(BB+HBP))*X
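As a quick check on the algebra, the compact and expanded forms of Heipp's B factor can be compared numerically. This is a minimal Python sketch: the function names and the sample stat line are hypothetical and chosen only for illustration, and X defaults to .535.

```python
def b_compact(singles, doubles, triples, hr, sb, bb, hbp, x=0.535):
    """Heipp's B factor written in terms of total bases and hits."""
    hits = singles + doubles + triples + hr
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    return (2.5 * total_bases - hits - 5 * hr + 2 * sb + 0.05 * (bb + hbp)) * x


def b_expanded(singles, doubles, triples, hr, sb, bb, hbp, x=0.535):
    """The same B factor with TB and H expanded into per-event weights."""
    return (1.5 * singles + 4 * doubles + 6.5 * triples + 4 * hr
            + 2 * sb + 0.05 * (bb + hbp)) * x


# Hypothetical team stat line, used only to compare the two forms.
sample = dict(singles=980, doubles=280, triples=30, hr=160, sb=90, bb=520, hbp=50)
difference = abs(b_compact(**sample) - b_expanded(**sample))
print(difference < 1e-9)
```

Both functions should agree for any stat line, since the second is just the first with total bases and hits multiplied out.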

In this version the weights are basically doubled and include both stolen bases and hit by pitch and an "X" factor is introduced to adjust for the team or league that is being tested. This value is historically around .535. This is used in the same way as the varying value for outs in the Batting Runs formula to gauge it for a league.

The full formula using the first B factor above is then:

BsR = ((H + BB - HR) * (((.8*1B) + (2.1*2B) + (3.4*3B) + (1.8*HR) + (.1*BB)) / (((.8*1B) + (2.1*2B) + (3.4*3B) + (1.8*HR) + (.1*BB)) + (AB - H)))) + HR
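The full formula is easier to follow when the A, B, C, and D components are computed separately. Here is a minimal Python sketch using the first B factor above; the function name, parameter names, and the sample season line are hypothetical, and the formula is left undefined when B + C is zero.

```python
def base_runs(ab, h, doubles, triples, hr, bb):
    """Base Runs using the first B factor listed above."""
    singles = h - doubles - triples - hr
    a = h + bb - hr                      # baserunners
    b = (0.8 * singles + 2.1 * doubles + 3.4 * triples
         + 1.8 * hr + 0.1 * bb)          # advancement
    c = ab - h                           # outs
    d = hr                               # homeruns always score
    score_rate = b / (b + c)
    return a * score_rate + d


# Hypothetical team season line, for illustration only.
estimate = base_runs(ab=5500, h=1450, doubles=280, triples=30, hr=160, bb=520)
print(round(estimate))  # about 720 runs
```

Keeping the score rate as its own variable makes it simple to swap in a different B factor without touching the rest of the calculation.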

To illustrate that this formula is more accurate at the extremes, consider a thought experiment with two imaginary teams. Team A hits 97 homeruns and makes three outs while Team B draws 100 walks and makes no outs. In the case of Team A common sense says the team will score 97 runs. In the case of Team B common sense says they'll score 97 runs and leave the bases loaded. The three estimators compare as follows:

       Team A   Team B
BsR        97       97
RC        376        0
BR        136       33

Obviously, at these extremes BsR is doing a better job because it doesn't overvalue the homerun as RC does and because being a non-linear formula it takes into account the offensive context as BR does not.

But of course major league baseball is not played at the extremes. This is why all three formulas can be relied upon to approximate the nearly straight line relationship of offensive events to runs in the frequency ranges for events in major league baseball.

That being said, Heipp did calculate the Root Mean Square Errors for each of the formulas and found that BsR had a smaller error than RC but an ever-so-slightly larger error than Furtado's Extrapolated Runs (XR). All three, however, were in the range of 23.7 to 25.8 runs.

A challenge for BsR is that, like RC, it is in essence a run estimator for teams, and so when applied to the statistics of an individual the formula automatically forces interaction of the offensive events with themselves. However, a player never creates runs in the context of a team of clones, and so the result is that BsR overestimates runs for players with high slugging and on base percentages. It doesn't do so as much as RC, though, since homeruns are isolated. In order to negate this effect the RC formula now includes contextualizing a player's individual statistics with eight mythical average players (see my article on RC for the details). The same approach can be used with BsR as Heipp shows in this article.

A second challenge for BsR is to more accurately approximate the ScoreRate. The formula B/(B+C) may not in fact be the best way to do this and there has been much discussion on this topic recently. Of course, one way of coming up with empirical weights for the offensive events is to calculate the ScoreRate for entire leagues using the known number of baserunners, outs, and runs and then run a regression on the offensive events. That may in fact be what Smyth did although I don't know.

Changes in the Game

I picked up a copy of the late Leonard Koppett's The Thinking Fan's Guide to Baseball this weekend. In reading his chapter on changes to the game (expansion, a livelier ball, etc.) I wanted to take a quick look at the changes along the axes that he mentions: runs scored per game, batting average, and homeruns per game. Using the Lahman database I created the following quick graph. The trendlines for Runs/G and HR/G use a moving 10 year average.
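For readers who want to build similar trendlines, here is a minimal Python sketch of a moving average. It assumes a trailing window (the current season plus the seasons before it), which may not match how the graph was produced; the function name and sample values are hypothetical.

```python
def trailing_moving_average(values, window=10):
    """Average each value with up to window - 1 values that precede it."""
    averages = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Hypothetical per-season values, for illustration only.
smoothed = trailing_moving_average([1, 2, 3, 4], window=2)
print(smoothed)  # [1.0, 1.5, 2.5, 3.5]
```

Early seasons are averaged over however many values are available, so the line starts at the first season rather than ten seasons in.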
I'll have more comments on the book and Koppett's view of these changes along with his take on sabermetrics later this week.

Friday, November 19, 2004

Leaving the Bat on Your Shoulder

Here's an interesting question:

What are the odds of walking if you don't swing at any pitches?

To answer that question I used the Lahman database and play-by-play files from 2003 to look at all plate appearances that were not intentional walks. Here's the answer:


Event            Count     Pct
Non IBB PA       186121
BB-IBB           14573     7.8%
BB No Swings     6842      3.7%
K                30881     16.6%
K No Swings      1108      0.6%

So in about one out of 27 plate appearances a batter walks without offering at a pitch. For this study "offering" included swinging, bunting, foul tips, missed bunt attempts, swinging at intentional balls, and foul balls. I also found that striking out without offering is actually rarer, happening only about once in 168 plate appearances.
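The percentages and "one in N" figures follow directly from the counts. Here is a minimal Python sketch of the arithmetic; the variable and function names are hypothetical, and the counts are the ones reported above.

```python
NON_IBB_PA = 186121  # plate appearances that were not intentional walks

counts = {
    "BB-IBB": 14573,
    "BB No Swings": 6842,
    "K": 30881,
    "K No Swings": 1108,
}


def pct(count, total=NON_IBB_PA):
    """Share of plate appearances, as a percentage rounded to one decimal."""
    return round(100 * count / total, 1)


def one_in(count, total=NON_IBB_PA):
    """How many plate appearances it takes, on average, to see one event."""
    return round(total / count)


for event, count in counts.items():
    print(event, pct(count), "one in", one_in(count))
```

For the two no-swing rows this gives 3.7% (one in 27) for walks and 0.6% (one in 168) for strikeouts.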

Organizing Domain Logic: Domain Model

The third pattern for representing domain logic in an application is the Domain Model. In this model the application is viewed as a set of interrelated objects. The core feature of these objects (referred to as Business Entities in Microsoft nomenclature) is that, unlike the Table Module approach, each object maps to an entity (not necessarily a database table) in the database. In other words, each primary object represents a single record in the database rather than a set of records and the object couples the object’s data with its behavior (business logic and data validation). Additional objects may represent calculations or other behavior. Ultimately, this approach requires at least the following:

  • Object-relational Mapping. Since this pattern does not rely on the DataSet object or even Typed DataSet objects, a layer of mapping code is required. At present the Framework does not contain built-in support for this layer. Look for support in the Microsoft Business Framework (MBF) technology to be released after the Whidbey release of VS .NET. Techniques to perform this mapping are covered in my article Software Design: Using data source architectural patterns in your .NET applications.
  • ID Generation. Since each Domain Model object maps to an entity in the database, the objects need to store the database identity within the object (the CLR’s object system uniquely identifies each object based on its data). There are several techniques to do so documented by Fowler in his discussion of the Identity Field pattern.
  • Strongly-Typed Collection Classes. Because objects in the Domain Model map intrinsically to an individual entity, representing multiple objects is handled through collections. Framework developers can therefore create strongly-typed collection classes to handle representing multiple domain objects. For more information see my article Take advantage of strongly typed collection classes in .NET.

The end result of a domain model is that the application manages many individual and interrelated objects (using aggregation) during a user’s session. Although this may seem wasteful, the CLR’s efficiency at managing objects makes the Domain Model a valid approach. Such an approach would not have been efficient in the world of COM, however. The Domain Model can also take advantage of inheritance by implementing a Layer Supertype.

To build an object used in a Domain Model, a best practice is to follow the Layer Supertype pattern. Using this pattern an architect would develop a base class from which all domain objects would inherit as shown here:

<Serializable()> _
Public MustInherit Class BusinessObjectBase : Implements IComparable
    Protected _id As Integer ' could also use GUID or a key class

    <Xml.Serialization.XmlAttribute()> _
    Public Property Id() As Integer
        Get
            Return _id
        End Get
        Set(ByVal Value As Integer)
            _id = Value
        End Set
    End Property

    MustOverride Function Save() As Boolean
    MustOverride Function Delete() As Boolean

    ' Two business objects are considered equal when their ids match
    Public Shadows Function Equals(ByVal o As BusinessObjectBase) As Boolean
        Return Me.Id = o.Id
    End Function

    Public Shared Shadows Function Equals(ByVal o As BusinessObjectBase, _
      ByVal o1 As BusinessObjectBase) As Boolean
        Return o.Id = o1.Id
    End Function

    ' Set to True by derived classes when the object has unsaved changes
    Protected IsDirty As Boolean = False

    ' Used when the Sort method of the collection is called
    Public Function CompareTo(ByVal o As Object) As Integer _
      Implements IComparable.CompareTo
        Dim b As BusinessObjectBase = CType(o, BusinessObjectBase)
        Return Me.Id.CompareTo(b.Id)
    End Function
End Class

You’ll notice that this abstract class contains the implementation of the Id property as well as the abstract Save and Delete methods and a protected field to determine if the object has been altered and is in need of saving. In addition, it shadows (implemented with the new keyword in C#) both signatures of the Equals method inherited from System.Object that is used to determine whether two instances of the object are equal. Note that these methods check the Id property but could alternately have been implemented to check against some other criterion or, more likely, overridden in the derived class and checked against other properties of the object. This class also uses the SerializableAttribute and XmlAttribute classes so that the object can be serialized to XML and its Id property represented as an attribute.

A derived Domain Model object, implemented by the Order class, might then look as follows.

<Serializable()> _
Public Class Order : Inherits BusinessObjectBase
    Private _orderId As Long
    Private _cust As Customer
    Private _prod As Product
    Private _quant As Integer = 1 'default
    Private _ship As ShipType = ShipType.Postal 'default

    Public Sub New()
    End Sub

    ' Constructors that omit quantity or shipping pass the field defaults
    Public Sub New(ByVal cust As Customer, ByVal prod As Product)
        _InitClass(cust, prod, _quant, _ship)
    End Sub

    Public Sub New(ByVal cust As Customer, _
      ByVal prod As Product, ByVal quantity As Integer)
        _InitClass(cust, prod, quantity, _ship)
    End Sub

    Public Sub New(ByVal cust As Customer, ByVal prod As Product, _
      ByVal quantity As Integer, ByVal ship As ShipType)
        _InitClass(cust, prod, quantity, ship)
    End Sub

    Private Sub _InitClass(ByVal cust As Customer, _
      ByVal prod As Product, ByVal quantity As Integer, _
      ByVal ship As ShipType)
        _cust = cust
        _prod = prod
        Me.Quantity = quantity
        Me.ShipVia = ship
        Me.IsDirty = True
        ' Generate a new or temporary order id: use a system assigned key
        ' _orderId = key table GUID
    End Sub

    Public ReadOnly Property Customer() As Customer
        Get
            Return _cust
        End Get
    End Property

    Public ReadOnly Property Product() As Product
        Get
            Return _prod
        End Get
    End Property

    Public Property Quantity() As Integer
        Get
            Return _quant
        End Get
        Set(ByVal Value As Integer)
            ' Validate in the setter so every code path is checked
            If Value <= 0 Then
                Throw New ArgumentOutOfRangeException( _
                  "Value", "Quantity must be greater than 0")
            End If
            _quant = Value
            Me.IsDirty = True
        End Set
    End Property

    Public Property ShipVia() As ShipType
        Get
            Return _ship
        End Get
        Set(ByVal Value As ShipType)
            _ship = Value
            Me.IsDirty = True
        End Set
    End Property

    Public Function CalcShippingCost() As Double
        ' calculate the shipping cost based on the Customer and Product objects
        ' store the shipping cost for this order
    End Function

    Public Function CalcOrderTotal() As Double
        ' calculate the total cost of the order with tax
        ' store the cost for this order
    End Function

    Public Function IsComplete() As Boolean
        ' Determines whether this order has enough information to save
    End Function

    Public Overrides Function Save() As Boolean
        ' Persist the order
        Me.IsDirty = False
    End Function

    Public Overrides Function Delete() As Boolean
        ' Remove the order
    End Function
End Class

The key point to note about this class is that it combines data (the Quantity and ShipVia properties), behavior (the IsComplete, Save, CalcOrderTotal and other methods), and the relationships to other data (the Product and Customer properties that relate to the Product and Customer classes). The end result is a static structure diagram:

The second key point to note about this structure is that each object derived from BusinessObjectBase generates its own id value in the object’s constructor. In the code shown, the Order class reserves a Long (64-bit) _orderId field for this purpose, while the Id property inherited from the base class is a 32-bit Integer. The strategy shown here is only one among several documented by Fowler as the Identity Field pattern. The considerations for Id generation are as follows:

  • Use of system assigned keys. Here the assumption is made that the key value will be assigned by the system and not use a “natural key”. In practice this proves to be the more durable design since the independence of the keys allows them to remain fixed once assigned. This also disallows the use of compound keys (which if used should be implemented in a separate key class) which are difficult to manage and slow performance of the database.
  • Use of data types. Here a 32-bit integer is used; however, unless uniqueness can be guaranteed in the generation of the ids it might be possible to duplicate keys, which would cause problems once the object is persisted to the database. An alternative would include the use of System.Guid, which is (for all intents and purposes) guaranteed to be unique. If you wish to protect yourself against data type changes in the key you could also implement the key in a key class that abstracts the data within the key and performs any comparisons and generation.
  • Generation of the ids. If a data type such as an integer is used, one technique for generating unique ids is to generate them from a key table. The downside of this technique is that it incurs a round trip to the database server.
  • Uniqueness of the ids. In the code shown here it is assumed that each object generates its own key probably through the use of a key table. This then assumes that each object type (Order, Customer, Product) generates keys independently of the others. An alternate approach would be to create system wide unique ids through a key table, in which case the generation of the id could be placed in the base class. If the classes form an inheritance relationship, for example if the BookProduct class inherited from the Product class it would be assumed that the generation of the ids would take place in the Product class so that all of the objects generated by the inheritance hierarchy would be uniquely identified so that they could be stored in the same database table.
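The difference between per-type keys and system-wide keys can be sketched in a few lines. The following is a minimal Python illustration rather than VB, kept short on purpose; the KeyTable class is hypothetical and uses an in-memory dictionary as a stand-in for a real key table in the database.

```python
class KeyTable:
    """In-memory stand-in for a database key table (hypothetical)."""

    def __init__(self):
        self._next = {}  # scope name -> next id to hand out

    def next_id(self, scope="system"):
        """Return the next unused id for the given scope."""
        value = self._next.get(scope, 1)
        self._next[scope] = value + 1
        return value


keys = KeyTable()

# Per-type keys: each object type draws from its own sequence.
order_ids = [keys.next_id("Order"), keys.next_id("Order")]
customer_id = keys.next_id("Customer")

# System-wide keys: every object type shares one sequence.
shared_ids = [keys.next_id(), keys.next_id()]

print(order_ids, customer_id, shared_ids)  # [1, 2] 1 [1, 2]
```

With per-type scopes an Order and a Customer can share the same id value, which is fine as long as they are stored in separate tables. A shared scope avoids that overlap, which is the property an inheritance hierarchy stored in one table would need.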

To be able to manipulate a collection of Domain objects they can be represented in a collection class. While the Framework includes collection classes in the System.Collections namespace, including ArrayList, SortedList, and Hashtable among others, there are advantages to creating a custom strongly-typed collection class including:

  • Strong-typing. As the name implies, a custom collection can be strongly-typed, meaning that only objects of a specific type are allowed to be added to the collection.
  • Custom Manipulation. A strongly-typed collection class also provides the opportunity to add custom behaviors to the class, for example, by providing custom sorting capabilities.

To create the strongly-typed collection you can use the power of implementation inheritance. By deriving a class from CollectionBase, ReadOnlyCollectionBase, or DictionaryBase and then overriding and shadowing various members you not only can enforce the type of objects in the collection but take advantage of functionality that Microsoft has already included in the .NET Framework. For example, Microsoft uses the CollectionBase class as the base class for over 30 of its own strongly-typed collection classes in the .NET Framework. Each of these three classes is marked as abstract (MustInherit in VB) and therefore can only be used in an inheritance relationship.

Thursday, November 18, 2004

The 2004 Scouting Report by the Fans

Tangotiger is embarking on a project to collect fan perceptions in order to evaluate MLB players. The word is that more data on the Royals is needed. Please click here to go directly to the Royals and provide your input. Thanks.

Tuesday, November 16, 2004

Organizing Domain Logic: Table Module

The second pattern for representing domain logic is the Table Module. As the name implies this pattern calls for a Business Component to map to a table in the database. The component then contains all the domain logic methods required to manipulate the data. There are two key considerations here.

First, although the name refers to a table, Table Module can also be used to abstract frequently used sets of data created by joining multiple tables together in a view or query. This makes dealing with more complicated data much easier for the caller to the Business Components.

Second, the core characteristic of each component built as a Table Module is that, unlike Domain Model, it has no notion of identity. In other words, each Table Module object represents a set of rows rather than a single row and therefore in order to operate on a single row, the methods of the table module must be passed in identifiers.

When a method of a Table Module is called, it performs its logic against a set of rows passed into it. In Framework applications this maps to a DataSet and is therefore a natural way to represent domain logic. In fact, using Typed DataSets with Table Module is particularly effective since it promotes strong typing at design time leading to fewer runtime errors and better control over the data flowing through the application.

A common approach to handling this is to create a Layer Supertype (a pattern also discussed by Fowler) or base class for all of the Table Modules that accepts a DataSet in its constructor. This also allows the Business Components to be tested without a live connection to a database.

To implement a Table Module approach in the retail application example you could create an abstract BusinessComponentBase class like that shown here. This class is responsible for accepting the DataSet that the class will work on, exposing individual rows in the DataSet using the default (VB) or indexer (C#) property, and exposing the entire set of rows in a readonly property.

As you can see one of the advantages to using Table Module over Transaction Script is that it is easier to encapsulate responsibilities with the appropriate objects. Here the Customers class contains the SendEmail and CheckCredit methods that can be called independently whereas the Transaction Script shown previously would have included inline code to handle this.
The code to implement the BusinessComponentBase class can be seen below.


Imports System.Data

Public MustInherit Class BusinessComponentBase
    Private _ds As DataSet
    Private _pk As String ' name of the primary key column

    Protected Sub New(ByVal data As DataSet)
        _ds = data
        _pk = _ds.Tables(0).PrimaryKey(0).ColumnName
    End Sub

    ' Returns the single row whose primary key matches the id
    Default Protected ReadOnly Property Item(ByVal id As Integer) As DataRow
        Get
            Dim f As String = _pk & " = " & id
            Return _ds.Tables(0).Select(f)(0)
        End Get
    End Property

    ' Returns the entire set of rows
    Protected ReadOnly Property Items() As DataSet
        Get
            Return _ds
        End Get
    End Property
End Class

Note that the protected constructor finds the primary key column and stores its name in a private variable used by the Item property to select and return a particular DataRow.


This class can then be inherited by Orders and Customers. The Orders class in C# is shown here.

using System;

public class Orders : BusinessComponentBase
{
    // Shadows the base property to return the Typed DataSet
    public new OrdersDs Items
    {
        get { return (OrdersDs)base.Items; }
    }

    public Orders(OrdersDs orders) : base(orders)
    { }

    public ShipType GetShipType(long orderId)
    {
        // Return the shipping type for the order
        throw new NotImplementedException();
    }

    public double GetShippingCost(long orderId)
    {
        // Calculate the shipping costs for the order
        throw new NotImplementedException();
    }

    public void SaveOrder(long orderId)
    {
        // Save the specific order to the database
    }

    public void SaveOrder()
    {
        // Save all the orders to the database
    }

    public long Insert(long productId, int customerId,
        long quantity, ShipType ship)
    {
        // Insert a new row in the DataSet
        throw new NotImplementedException();
    }
}

The main point to notice here is that the constructor accepts a DataSet of type OrdersDs, which is a Typed DataSet. The class then shadows the Items property in order to return a strongly typed DataSet. Also, this is a good place to take advantage of overloading, as in the SaveOrder method, which can be used to save either one order or all of the orders in the DataSet.

A client can then instantiate the Orders class, passing into it the OrdersDs DataSet.
Orders o = new Orders(dsOrders);
o.SaveOrder(123324);