The Metrics System: WAR and WARP
Written by Bill   
Thursday, 18 February 2010 09:00
Time for session three in our series of rudimentary courses on advanced baseball statistics for non-math people by a non-math person. See number one, on OPS+, ERA+ and wRC+, here, and number two, on FIP and other defense-independent pitching statistics, here.

If you've read this blog, or any vaguely stats-oriented baseball blog, or a baseball message board of some type, any time in the last year, odds are good that you've come across the acronym WAR, for Wins Above Replacement. If you've been at it a little longer than that, there's a good chance you saw WARP, or Wins Above Replacement Player. They've become incredibly popular over the last few years, for good reason: WAR[P] gives you a single number with which you can compare any two players, no matter what types of players they are and what positions they play -- even, if you want to get really crazy, a position player and a pitcher -- and get an idea of which one was more valuable. And unlike, say, win shares, which are crazily complicated and give you a number up to (or above) 40 representing the number of arbitrarily-decided-upon thirds-of-a-win the player achieved for his team, WAR and WARP, at the highest level, are very simple: adding up the total value of everything the player does -- his hitting, fielding, pitching, sometimes baserunning -- how many wins did the player give his team over and above what they could have expected had they just plugged in any remotely serviceable player they could find in his place? So they're useful, and they're relatively easy to understand and come to terms with even if you don't get all the underlying math.

WARP was the only game in town for some time. Created, as so many great things are, by Baseball Prospectus (BP), WARP is really about as simple as what I just described above: add up a player's Batting Runs Above Replacement (BRAR), Pitching Runs Above Replacement (PRAR), and Fielding Runs Above Replacement (FRAR), and divide the result by the number of runs determined to be worth a "win" in that league and that season (typically around ten), and you end up with the number of wins a player earned above a replacement player in the same spot. The hard part, of course, is figuring out BRAR, PRAR and FRAR...which you can't do, unless you work for BP or know someone who does.
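For the code-minded, here's a minimal sketch of that top-level arithmetic in Python. The function name and the sample run totals are made up for illustration; how BP actually arrives at BRAR, PRAR and FRAR is proprietary, so this covers only the final add-and-divide step, and the default of ten runs per win is just the "typically around ten" figure from above.

```python
def warp(brar: float, prar: float, frar: float, runs_per_win: float = 10.0) -> float:
    """Wins above replacement from runs-above-replacement components.

    runs_per_win defaults to 10.0 (the rough "around ten" figure);
    the real value varies by league and season.
    """
    if runs_per_win <= 0:
        raise ValueError("runs_per_win must be positive")
    return (brar + prar + frar) / runs_per_win


# Hypothetical position player: 60 batting runs and 25 fielding runs
# above replacement, no pitching.
example = warp(brar=60.0, prar=0.0, frar=25.0)
print(example)  # 8.5
```

Change runs_per_win to see how much (or how little) the league and season context moves the final number.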

There are actually two versions of WAR, one designed by Sean Smith ("Rally") and available at BaseballProjection.com, and one available at FanGraphs. They can differ from each other, sometimes by a lot, though the major difference is that FanGraphs WAR, which uses UZR for its defensive component, is available only since 2002, while Sean Smith's defensive rating system changes according to the amount of data available for a given season, so his WAR goes all the way back to Al Spalding. But FanGraphs WAR and Rally WAR are a lot more similar to each other than either is to BP's WARP.

The components are slightly different among the three metrics. For one thing, BP calculates runs above replacement, while Rally and FanGraphs calculate runs above average, and then add in the replacement level -- the number of runs a replacement player is below average -- to arrive at runs above replacement. BP calculates different replacement levels for each position, while the other two use one replacement level per league, per year, and then add a position adjustment based on the difficulty of the positions the player played. Rally adds a whole bunch of smaller adjustments into it separately, while with the other two, those effects are measured (if at all) as part of the total batting, fielding or pitching runs. But while I'm sure some of these differences have real effects, it seems to me to be basically a distinction without a difference. They're all adding up the total runs the player creates and/or prevents above a replacement player, and dividing by approximately ten to come up with the wins he created.
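The runs-above-average route that Rally and FanGraphs take can be sketched roughly like this. Everything named here is an assumption for illustration: the Season fields, the flat 20-runs-per-600-PA replacement gap, the ten runs per win, and the sample shortstop's numbers are all hypothetical, and neither site's real calculation is this simple.

```python
from dataclasses import dataclass

# Illustrative constants, not official values from either site.
REPLACEMENT_RUNS_PER_600_PA = 20.0  # how far below average a replacement player is
RUNS_PER_WIN = 10.0


@dataclass
class Season:
    batting_raa: float   # batting runs above average
    fielding_raa: float  # fielding runs above average
    position_adj: float  # bonus or penalty for the difficulty of the position
    pa: int              # plate appearances


def replacement_runs(pa: int) -> float:
    """Runs an average player is worth over a replacement player in this much playing time."""
    return REPLACEMENT_RUNS_PER_600_PA * pa / 600.0


def war(season: Season) -> float:
    """Runs above average, plus the replacement gap, converted to wins."""
    runs_above_average = season.batting_raa + season.fielding_raa + season.position_adj
    runs_above_replacement = runs_above_average + replacement_runs(season.pa)
    return runs_above_replacement / RUNS_PER_WIN


# Hypothetical shortstop: decent bat, decent glove, a positional bonus.
shortstop = Season(batting_raa=15.0, fielding_raa=5.0, position_adj=7.5, pa=600)
print(war(shortstop))  # 4.75
```

Note that a perfectly average player with 600 PA and no positional adjustment comes out at 2.0 under these assumptions, purely from the replacement gap.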

So what is the difference? Why did all these smart people set out to create a statistic that measured the exact same thing that BP's established stat did (and with an almost identical name and acronym)?

Well, at least a few reasons (I'm sure there were many, but here are the ones I know):

  • Tom Tango, at least, was convinced that Baseball Prospectus' replacement level was far too low. BP doesn't divulge its replacement level (that, like most other things, is BP-proprietary), but their glossary does explain that "a team which is at replacement level in all three of batting, pitching, and fielding will be an extraordinarily bad team, on the order of 20-25 wins in a 162-game season," and has explained elsewhere that a "replacement player" is essentially an AAA player that might be called upon to fill in as an emergency injury replacement. Tango noted that replacement players are freely available who are not (as BP was assuming all replacement players were) far, far below average on both offense and defense, and so he adopted a much higher standard for a replacement player, one with a total contribution of about 20 runs below average. BP eventually announced that it was making upward adjustments to its own definitions of replacement level, which lowered its WARP numbers across the board...but still not to the level of the two WARs. Accordingly, WARP is virtually always considerably higher than either WAR number. Pujols' 2009, for example, gets 9.2 by Rally WAR and 8.5 by FanGraphs WAR, but a whopping 11.8 by WARP.

  • BP's fielding numbers are pretty opaque, and don't seem all that trustworthy. Nobody really knows how FRAR was calculated before. Now they use play-by-play data, but nobody knows how. UZR and Total Zone, on which the two WARs are based, are also based on play-by-play data (where available) and are much more transparent and verifiable, and just seem to make a bit more sense.

  • BP's numbers in general, as you might have picked up already, are opaque. As a subscription-based site that closely protects its secrets, it's hard to tell what goes into BRAR, FRAR and PRAR, and, for that matter, the replacement levels. Meanwhile, if you're willing to do enough digging/Googling and have the math chops, you can pretty much figure out every single thing that goes into the FanGraphs or Rally WAR. The creators of these numbers (particularly Tango -- see e.g. here and here) are generally committed to being open about the numbers they use and where they're coming from. I'm not one to dig into the math, as you know, but I do like to have some idea of what it is I'm looking at.

So you can probably tell where I come out. I love BP for many reasons, and I'll refer to all three systems now and then to check consistency and such, but if I have to pick one, I'm sticking with WAR (FanGraphs', when available). Another great thing about WAR is that, because it's so freely and widely available and has become so widely used and discussed, there are very easy reference points. A player with a full-season WAR of 2.0 is roughly an average regular (see Kosuke Fukudome, A.J. Pierzynski). 4.0 or so is a star, if not quite elite, player (Brian Roberts, and Phillies fans will kill me, but Ryan Howard). 6.0 is a superstar and possible MVP candidate (Mark Teixeira, Dustin Pedroia). 8.0 or better is a guy who had a huge year and is almost certainly an MVP candidate (Pujols, Joe Mauer). I'm sure you can create similar benchmarks for WARP, and BP probably has, but I'll never remember what they are, so that kind of misses the point.
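If it helps to see those reference points as a lookup, here's a rough sketch. The hard cutoffs are my simplification (the benchmarks above are "or so" figures, not bright lines), and the function name and label wording are made up for illustration.

```python
def war_tier(war: float) -> str:
    """Map a full-season WAR to the rough reference points above.

    The cutoffs are treated as hard thresholds only for convenience.
    """
    if war >= 8.0:
        return "huge year, almost certainly an MVP candidate"
    if war >= 6.0:
        return "superstar, possible MVP candidate"
    if war >= 4.0:
        return "star, if not quite elite"
    if war >= 2.0:
        return "roughly an average regular"
    return "below an average regular"


# Pujols' 2009 by the two WARs mentioned earlier (9.2 Rally, 8.5 FanGraphs)
# lands in the top tier either way.
for value in (9.2, 8.5, 4.0, 2.0):
    print(value, "->", war_tier(value))
```

Both of Pujols' 2009 figures fall in the same bucket, which is a decent reminder that a half-win gap between the two systems rarely changes the story.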

So that's Wins Above Replacement [Player]. Questions will be welcomed, then furiously Googled...


Comments (12)
...
written by LarryM, February 18, 2010
Nice summary; the comparisons were especially helpful, though much of it was familiar to me, and the rest in part just confirms some things I had already suspected.

As useful as these metrics are, I think that your analysis, intentionally or not, highlights flaws in the numbers, at least as used by many people. I think the people who developed these metrics are properly humble about their limitations, but many "users" of the numbers are not. Quite simply, the metrics give an illusion of objectivity that is somewhat lacking.

More specifically, the metrics are built on a series of assumptions regarding:

(1) The run value of offensive statistics,
(2) The run value of defensive metrics/statistics,
(3) The run value of pitching statistics,
(4) Positional adjustments,
(5) Replacement values,
(6) Runs per win conversion.

1 and 6, and to a certain extent 3, are pretty solid in the sense that the statistical underpinnings are precise, objective, and uncontroversial. 2, 4, and 5 much less so; they are more subjective and imprecise. Even offense measurement is more problematic than some people think, to the extent that baserunning isn't measured (or measured well) in many systems, and there are differences of opinion as to whether situational data should be considered.

Which isn't to say that this stuff isn't valuable, but it should be deployed with IMO quite a bit more humility than is often the case.

I do have a question: which of these systems (if any) consider (a) situational stats, and (b) baserunning, aside from steals and CS?
...
written by LarryM, February 18, 2010
I'd be interested - okay, maybe some of this is available with some Googling, I'm lazy, so sue me - in the empirical basis for assumptions about replacement value and positional adjustments. What is the basis, for example, for the Fangraphs people to assume that a replacement level player is 20 runs per 600 PA below average?

Also, I neglected to add a 7th assumption: the portion of runs allowed/saved allocated to defense versus pitching. I recall Bill James' assumptions on that issue, most recently when he came out with the Win Shares book. They seemed ... reasonable but arbitrary. Has the state of the science advanced much since then?
...
written by LarryM, February 18, 2010
I think that I partly answered my own question, for Fangraphs WAR anyway. Since it's based upon wOBA, no baserunning aside from steals/CS, no situational stats. The former most likely wouldn't make a huge difference for most players (but might mean we are undervaluing certain players; e.g., maybe Chase Utley is even better than his already sick WAR numbers). The latter would in some years be significant for some players, though I understand the argument against including it. My own position is the Bill James position: I'm agnostic on the question of whether there is such a thing as clutch "ability," but even if there isn't, there is (by definition) such a thing as clutch performance, whether based on luck or not, and it should be considered - after all, we are measuring value, and hence performance, not measuring (and really can't measure) "ability" in the abstract.
...
written by Bill@TDS, February 18, 2010
Larry, I think the key is to realize that nothing is that precise. We don't know that a guy with a .850 OPS or .310 BA was actually a better hitter than a guy with an .840 or .300, any more than (actually less than) we know whether a guy with 50 batting runs above average was a better hitter than a guy with 45. With that in mind, I don't think WAR is any less reliable than any other stat (and in almost every case, it's much more reliable). The various assumptions made are awfully strong and well-tested assumptions -- even the various defensive metrics are much better than anything else we have and tend to track pretty closely with each other, generally speaking -- and it's just better and more consistent and verifiable than anything else we have right now.

To answer your questions:
1. I think BP incorporates baserunning into EqA, which is the number that BRAR is based off of, but (like with most things BP) I can't get any hard information. Rally's WAR estimates baserunning runs based on Bill James' speed score, which uses stolen bases and triples and something else I'm forgetting to figure out how good a baserunner the guy is (much less than perfect, but the results are remarkably intuitive). FanGraphs WAR incorporates only SB/CS. The next step is for someone to measure the ability to go from first to third, etc. (and the value of that ability). At any rate, in only the very most extreme situations (like, Rickey Henderson or Bengie Molina) would this effect (baserunning other than steals, that is) be likely to go beyond 5 runs a year in either direction.

2. Here's a good article on replacement level by Sean Smith: http://www.hardballtimes.com/m...l-article/ At some level, it's always a guess, but it's a very educated guess based on the type of "AAAA" talent that's always readily available at very, very low cost. It also doesn't really matter for most purposes where you set the replacement level, since it's the same for everybody in the league anyway. Any reasonable guess will do, at least for what I'm going to be doing with it.

3. Yeah, that's totally different. UZR and TZ (and, I assume, BP's FRAR) have nothing at all to do with pitching. The systems measure how many balls were hit in the fielder's direction, where and how they were hit (where that data is available), and how many of those plays the fielder made vis-a-vis how many we'd expect an average player to make, and figure out how many runs those plays would save or cost his team in a neutral environment.

On your last point about situational stats, I couldn't disagree more. There's no place for situational stats in something like WAR, in which the whole point is to measure the player's value apart from the performance of his teammates or luck. Situational stats incorporate both -- luck, obviously, and because it's a cumulative stat, a player who plays on a good team who puts lots of guys on base would tend to benefit from the situational component being considered. If performance in "clutch situations" is important to you in, say, deciding the MVP, then you can factor that in after the fact if you want to -- if you've got two guys with similar WAR totals, compare their Win Probability Added, etc. and make the call. But it's got no place in, and entirely frustrates the purpose of, an objective measurement like WAR.
...
written by LarryM, February 18, 2010
Situational stats - I think you're partly missing what I'm saying regarding situational stats, and partly operating from certain assumptions which are incorrect (or, at least, not black and white).

First, a pet peeve - a lot of analytical people have a knee jerk reaction to this issue that IMO impedes analytical thought. We're so conditioned to countering the more extreme and/or anecdotal versions of the clutch "argument" that our minds freeze up. But more specifically:

The first part - obviously a guy that gets (say) 150 opportunities with RISP shouldn't get more credit than the guy with (say) 100 such opportunities. I’m talking rates here: should the guy who hits 10% better with RISP (by whatever metric), compared to his overall numbers, get credit for that, compared to a player who hits the same with RISP as he does overall? I say yes. That leaves the luck issue, which I'll deal with below, but removes the team context issue (and of course you are correct that analytical metrics can and should remove team performance from the equation).

Regarding luck: you say that advanced hitting metrics attempt to “measure the player's value apart from … luck.” That simply isn’t true; analytical stats don't do that, because they can't do that.* They can't do that fundamentally because luck permeates the game. This is best demonstrated by examples. Start with BABIP. Obviously, all batting events on balls in play are partly a matter of luck. We don't know how much - it's clear that some hitters do have greater ability to have successful batting events when the ball is put in play; it's also clear that it's partially luck. So we (quite properly) don't even try to remove the luck factor inherent in BABIP. That's the big example; there are others. Some players are lucky enough to play in a park that is particularly suited for their abilities; we don't account for that (and shouldn't); park adjustments are (rightly) not player specific.

In fact, I'd challenge you to name ANY respect in which we try to separate luck from skill in hitting metrics. You might say park adjustments - but that's at best incidentally about luck - primarily it's about adjusting for value.

The one (and only) argument that you have is that BABIP is an area where there is both luck and skill involved; since we can't know how much is skill and how much is luck, we err on the side of not making an adjustment, whereas (you might argue) we “know” that situational stats are “all” luck, and thus shouldn't include them. There are two problems with that argument: firstly, there are other areas where we could remove luck entirely, but don't (e.g., make park adjustments player specific). But secondly and more to the point, the "luck" factor in situational stats is not an entirely closed question. There is plenty of evidence that suggests that situational success is largely a matter of luck, but the evidence leaves plenty of room for the possibility that there is at least some skill component.**

So no, the issue is not as clear cut as you say. I’ll close with an argument from authority, yes, not the strongest form of argument but at least worth mentioning when the authority knows his stuff and is recognized by all parties, and remind you that Bill James is on my side on this one. :)

*There is also in my mind a separate philosophical question of whether we should remove luck even if we could - see my performance(value) versus ability distinction - which isn't answerable on purely logical grounds. But it's moot, as I think I've demonstrated.

**Just to throw a couple names out there – I only looked at a few players, so I have no idea if these are “extreme” numbers. Gary Sheffield, over his career – thousands of plate appearances, large sample size – has an OPS 16% higher with RISP than overall. Whereas Alex Rodriguez has an OPS 2% lower with RISP than his overall OPS. That strikes me as statistically significant. I mean, I hate to pick on Alex, who has gotten a huge amount of unjustified criticism regarding his clutch ability, but that’s a pretty significant difference.
...
written by LarryM, February 18, 2010
A bit of a side point here - I do think that some pitching and fielding metrics do attempt to remove luck (IMO perhaps improperly so in the case of pitching in particular, and not entirely successfully in both cases, but those are arguments for another day). They DON'T do so in the hitting context, and for the reasons stated IMO shouldn't.
...
written by LarryM, February 18, 2010
"it's just better and more consistent and verifiable than anything else we have right now."

True enough. My point, and maybe I didn't make myself clear, isn't that there is some other, better metric, but that, in evaluating players, no SINGLE metric is reliable enough to be the be all and end all. A lot of people seem to think that if one player has a WAR of (say) 5.0, and another player has a WAR of (say) 4.5, you have "proven" that the player with a WAR of 5.0 is "better." You haven't. Most likely he is, but there's room for arguing otherwise (though ideally by using evidence as opposed to abstractions, unevidenced conclusory statements and "gut" feelings).

And people DO make those sorts of arguments. But other people look at WAR and WAR alone, and argue that the case is closed.

Heck, I've heard it suggested on a very reputable analytically oriented baseball web site that we use WAR to establish minimum standards for HOF enshrinement!!! No wonder some people don't give analytical stats the respect they deserve, with people misusing them in that respect.
...
written by Bill@TDS, February 18, 2010
"I’m talking rates here: should the guy who hits 10% better with RISP (by whatever metric), compared to his overall numbers, get credit for that, compared to a player who hits the same with RISP as he does overall?"

I say no. Or more accurately, "hell no." It's been shown again and again that hitters do not have a special ability to hit with runners in scoring position. You're rewarding a player for having the distribution of his hits happen to fall more heavily than expected on the "runners on" side.

No, WAR and the like don't remove all luck, and I was being overly broad there. But what they do is remove things that we KNOW are luck (or team-dependent). Like who happened to be on base when the guy came through. And anyway, why is the guy who is so much better in clutch situations slacking off in non-clutch situations? Don't we want him to start rallies, too?

BTW, by OPS, the MLB as a whole hit about 3% better with RISP than with nobody on in 2009. It's easier to hit with runners on base, because the pitcher has to throw from the stretch and is more likely to be tired or struggling anyway. I'm willing to bet without looking that both Sheff and A-Rod had multiple years in which they were far on both sides of the ledger (OPS higher and lower with RISP). It's really been beaten to death--there's nothing there. It's like looking for meaning in tea leaves or cloud shapes.
...
written by Bill@TDS, February 18, 2010
Larry, I'm with you on the latest comment all the way until this:

"Heck, I've heard it suggested on a very reputable analytically oriented baseball web site that we use WAR to establish minimum standards for HOF enshrinement!!! No wonder some people don't give analytical stats the respect they deserve, with people misusing them in that respect."

That's very different from drawing distinctions based on 4.5 vs. 5.0. I'm comfortable enough with WAR to say that, over a more-than-ten-year career, if you fall below a certain standard (a pretty low one, say 40?), you probably don't deserve serious consideration.
...
written by LarryM, February 18, 2010
Aside from the philosophical issue (we're measuring performance, not ability), which I think is very important, and sadly ignored in these kinds of discussions, I think we just differ on what the evidence shows. This isn't exactly a currently hot area of research. The research that does exist doesn't prove what you think it proves. It proves that for most players there is no statistically significant evidence of clutch ability.

The "most players" is significant, as the studies have shown some players whose superior hitting in RISP situations is statistically significant. Of course, that doesn't prove that they have such ability (just as the converse doesn't prove that the ability is absent), but it is (at the very least) indicative that the level of certainty expressed by some people on this point is

What is true is that there is a long history of people stating (incorrectly IMO) that the evidence "proves" that hitting with RISP is "all" luck, and that has indeed become the received wisdom. Improperly so IMO.

As for the 3% figure, I knew that the overall numbers were better with RISP, I didn't know the 3% figure, and am happy to see it because I've been looking for it. It doesn't affect the overall argument one way or the other.

Question for you: given your feelings about luck, why shouldn't we (a) make park adjustments for hitters on an individual rather than an overall basis, to capture the fact that some players are lucky enough to be better suited for a particular park, and (b) make at least SOME adjustment to take BABIP into account, since we know that is partly luck? Maybe a conservative 20% regression to the mean? Mind you, I'm not advocating that, but given your priors, why aren't you advocating it?
...
written by LarryM, February 18, 2010
oops, left out the word "excessive" at the end of the second paragraph.

Just to give you an example of the kind of thing I'm talking about: I love the guys at BP, but their article on the issue cites only two studies (both interesting but, for reasons too time-consuming to get into here, not conclusive) and contains the following:

"You can see this yourself if you like, and you don't need to understand correlations to do it. Pick any five players at random, and check out their splits for the last few seasons (you can do this fairly easily at any of the major sports portals). You'll find that their statistics from year to year in the various clutch situations (RISP, late-inning pressure, September) can vary widely, with no rhyme or reason to the splits. But over a large enough sample, players will hit in given situations pretty much as they do overall."

The "vary widely" is true but not as significant as often stated. Other stats, which we don't attribute to luck (e.g., BA, OBP) also vary widely. The second part - "over a large enough sample, players will hit in given situations pretty much as they do overall" - simply isn't true. Of course there is some regression to the mean over large sample sizes, as with ANY statistic, but there are big variations. Look at baseball-reference.com. You'll see huge variations in "clutch" performance over the course of careers. Now, it may well be that these variations aren't statistically significant; I haven't done a study myself or run the relevant statistical tests. But this kind of sloppy argument doesn't serve anyone well.
...
written by Bill@TDS, February 18, 2010
As I've said repeatedly, I don't have the math acumen to hang in with these kinds of conversations. I can't prove to you myself that, given enough plate appearances, very nearly every hitter's performance will fall right into line with his overall performance (with an adjustment for the pitching-from-the-stretch thing). But I've read enough about it to be convinced that it's true. And I know that if you play around with a baseball simulation engine (just peruse the hitting stats on http://mlb.imaginesports.com), you'll see some really weird things in the stats even after 5000 or 10,000 PA (Sheffield had 5626 PA with RISP). It's when they get to 20,000 on up that they start to really stabilize.

On your other questions, I'd love to see park effects applied by handedness and such. It's a matter of someone actually taking the time to implement that, and for past years, of not having enough data to make it work. On BABIP, I doubt we'll ever really know how much of it is attributable to luck (unlike, IMO, with situational hitting), and it's easy enough to make your own adjustments if you see one that's exceptionally high or low (just as, if clutch performance is important to you, you can make your own little adjustments with WPA). I don't see a reason to start making WAGs to try to correct for it.
