Thursday, December 27, 2012

Wins above average: a statistical nightmare

by Clark Heins

Introduction: Davis Wylie (pen name of researcher Neil Paine), after much complicated math that “adjusted” each QB’s stats to 2006 levels (through a process known as “translation”, i.e., normalization without standard deviation), converted everything into a final stat, “Wins Above Replacement Player” (WAR) totals, which he used to rank QBs in his “The 100 Greatest QBs of the Modern Era” opus.

In football, there is no clearly established formula for determining WAR figures, but Football Outsiders originally estimated that a “Replacement Level” QB was some 13.7% less effective (valuable) than an “average” QB. This percentage was later changed to 13.3% and now rests at 12.5%. All these figures were arbitrary and consisted of some educated guesswork and “value judgments” about “players” who never existed! It would have been much easier if Wylie had simply used the stat “Wins Above Average” (WAA), which we can all understand, instead of an incomprehensible abstraction.

The problem for me was converting these WAR totals to WAA so as to compare with Doug Drinen’s figures in his own WAA opus.

Baseball has three recognized conversion formulas, but I could find none for football except a couple of vague variations of Bill James’ Pythagorean Theorem which, as statistician Zach Fein points out, has an error of between 1.2 and 1.4 games per team per season built into it---that’s a pretty hefty error when dealing with a 16-game season. I reasoned that, at best, all I could do was approximate, but by what process?
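For readers unfamiliar with the Pythagorean approach, it estimates winning percentage from points scored and points allowed. Here is a minimal sketch in Python. The exponent of 2.37 is an assumption (a value commonly quoted for the NFL), not something taken from Fein or from this post, and the function names and point totals are made up for illustration.

```python
def pythagorean_win_pct(points_for, points_against, exponent=2.37):
    """Estimated winning percentage from points scored and allowed.

    The exponent is an assumed value; different researchers use different ones.
    """
    if points_for < 0 or points_against < 0:
        raise ValueError("point totals must be non-negative")
    pf = points_for ** exponent
    pa = points_against ** exponent
    if pf + pa == 0:
        raise ValueError("at least one point total must be positive")
    return pf / (pf + pa)


def expected_wins(points_for, points_against, games=16, exponent=2.37):
    """Scale the estimated winning percentage to a season of `games` games."""
    return games * pythagorean_win_pct(points_for, points_against, exponent)


# Hypothetical team: 400 points scored, 300 allowed over 16 games.
print(round(expected_wins(400, 300), 2))
```

An error of 1.2 to 1.4 games on an estimate like this is close to a tenth of a 16-game schedule, which is why I did not want to lean on it.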

I’m not a mathematician, but after some experimentation, I came up with a “best fit” equation which I felt, at least, got me in the ballpark---WAR - (11.75% x WAR) = WAA, which is the same as WAA = 0.8825 x WAR. Although the WAA totals for each QB are going to be infested with some degree of error, the actual statistical rankings of the QBs by Wylie should not change, and that is what should really concern us. I’ve included each QB’s WAR totals, my 11.75% subtraction and the resultant approximate WAA totals, then broken down those totals per 16-game season. For those QBs who had tie games, I didn’t know what to do, so I’ve included figures with and without the tie games. Wylie’s top 41 QBs were reviewed.
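The conversion is simple enough to check by hand, but here is a short Python sketch of it for anyone who wants to rerun the numbers. The function names are mine, and the per-16-game helper needs each QB’s number of starts, which this post does not list, so it is shown only as a helper.

```python
def war_to_waa(war, discount=0.1175):
    """Approximate WAA using the post's rule: WAA = WAR - (discount x WAR)."""
    return war - discount * war


def per_16_games(total, games_started):
    """Scale a career total to a 16-game season (starts are not given in the post)."""
    if games_started <= 0:
        raise ValueError("games_started must be positive")
    return total * 16.0 / games_started


# Wylie's WAR totals as quoted in this post.
wylie_war = {
    "Manning": 15.93,
    "Van Brocklin": 15.13,
    "Brunell": 13.54,
    "McNabb": 10.56,
}

for name, war in wylie_war.items():
    print(f"{name}: WAR {war:.2f} -> approx. WAA {war_to_waa(war):.2f}")
```

Rounded to two decimals, these come out to 14.06, 13.35, 11.95 and 9.32, matching the WAA totals I list for Wylie.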

Wins above average (per 16 games)

                     Davis Wylie                     Doug Drinen
Player               WAR     WAA     Per 16 games    WAA     Per 16 games
Manning (2006)       15.93   14.06   1.43            35.2    2.95 (thru 2008)
Brady (2006)                                                 (thru 2008)
Van Brocklin         15.13   13.35   2.11            14.8    2.34
Brunell (2006)       13.54   11.95   1.2             -0.3    -.03 (thru 2008)
McNabb (2006)        10.56   9.32    1.36            6.2     .70 (thru 2008)

Conclusions: Wylie and Drinen are the only two researchers to create either “Wins Above Replacement” or “Wins Above Average” totals for all the major QBs, thus a review of their efforts I feel is worthwhile. Obviously, there is quite a disparity between the figures they come up with. Initially, I thought that Wylie’s totals heavily favored the old-time QBs like Unitas, Dawson, Starr and Griese while Drinen’s totals heavily favored the modern QBs like Montana, Marino, Elway and Manning, but the more QBs I examined, the less obvious this trend became. As an example, Wylie’s totals favored modern QBs like McNair, Moon, Simms and Esiason while Drinen’s totals favored old-timers like Staubach, Graham, Tittle and Stabler.

Then I went back and re-read the comments from readers when Drinen’s list originally came out (March 30, 2009). One blogger (“Brad O.”) seems to have hit the nail right on the head. He wrote, “Good QBs on good teams mostly rank very well. Good QBs on bad teams mostly don’t. Mediocre QBs on good teams do fine. And everyone without a lot of games played is pretty close to the middle.” Brad O.’s comments seem to ring most true with Sonny Jurgensen, Ken Anderson, and Tom Brady. Wylie gives both Jurgensen and Anderson very high totals while Drinen discounts them altogether---even to the extent of giving Anderson a negative total. Meanwhile, Wylie downgrades Brady while Drinen gives him a staggering total---far higher than any other WAA list I have ever seen. The exception to this observation is Mark Brunell, who was a good QB on good teams but is also given a negative total by Drinen.

Part of the problem with Wylie and Drinen is their chosen methodologies. Wylie used numerous math equations (including those of ESPN’s Sean Lahman) upon numerous stats to normalize all the QBs’ stats to 2006 levels. But, whenever you adjust stats, small errors are entered into the equations and those errors multiply like compound interest. Adjusted stats for the old-time QBs become unrealistic and tend to give the old-timers more credit than they deserve. At the same time, the modern QBs get less credit than they deserve as their “real” stats are pitted against undeserved “imaginary” stats. Despite the errors that have crept into Wylie’s research, I regard his overall statistical rankings as pretty decent and his totals certainly within the ballpark: 1. Van Brocklin 2. Graham 3. Staubach 4. Young 5. Jurgensen 6. Unitas 7. Tarkenton 8. Anderson 9. Montana 10. Gabriel 11. Fouts 12. Jones 13. Dawson 14. Manning (through 2006) 15. Morrall. Incredibly, Wylie disregarded his own statistical findings and, instead, based his final rankings on subjectivity! Unitas he ranked No. 1 because of his “historical importance”! Elway he ranked No. 5 despite the fact that, statistically, Elway was tied for 32nd out of the 41 QBs I examined!

But, whatever the errors that were introduced by Wylie, those introduced by Drinen are much worse. Drinen’s entire research is based upon one stat---“increments of points given up on defense”. He utterly ignores offensive points scored and, in doing so, gives us only half the loaf. Worse, if he had chosen to use offensive point totals, he would have found that their correlation to winning percentage (.73) is greater than the correlation of defensive points given up to winning percentage (.71). (P.S., Brian Burke’s correlation chart has these figures at .74 and .66.) There is also one other gigantic flaw in Drinen’s system. You see, he actually posted two WAA surveys---one for regular season starts (March 25, 2009) and another which included post-season starts (March 30, 2009). However, in his second study, he changed a few of the variables, such as points given up on INT returns, which he eliminated. To see how these changes in a few variables affected his WAA figures, we need only review those QBs who never started a post-season game. As examples, Sonny Jurgensen was given credit for 3.9 WAA in the first survey and 0.3 WAA in the second survey; Norm Snead was given credit for -9.6 WAA in the first survey and -14.3 WAA in the second; Joey Harrington had -10.2 WAA in the first survey and -12.4 WAA in the second. A few slight changes in variables affected the increments enough to make major changes in WAA! (P.S., I used Drinen’s results from his second survey for this posting so as to coincide with Wylie’s inclusion of post-season games in his study. For some strange reason, Wylie didn’t actually use the stats from post-season games; instead, he added bonus “Points Above Replacement” (PAR), which is a simple multiple of WAR, i.e., WAR = PAR/40. However, when you start to add “bonus points”, your equation becomes infused with personal bias.)

Burke has discounted any attempt to build a wins probability model based upon point spreads (January 21, 2007). He writes, “Models that use points scored or points allowed or variations of either, are no more analytical than Dan Fouts.” He refers to such attempts as the “Fouts Analysis”, i.e., “the team that scores the most points will win the game”. Burke concludes his article by writing: “We already know that the ability to score more points than another team leads to winning. The question is: What enables some teams to score more than others?”

Drinen has used a reverse “Fouts Analysis”, i.e., “the team that gives up the fewest points will win the game”. As Burke suggests, both methods are predictive, but are not very analytical. The degree of error that Drinen enters into his calculations must be astronomical. Good QBs who played on good teams are too heavily rewarded while good QBs who played on bad teams are not rewarded at all. These are Drinen’s top 15 QBs in WAA: 1. Brady (thru 2008 season) 2. Manning (thru 2008 season) 3. Van Brocklin 4. Montana 5. Staubach 6. Graham 7. Elway 8. Stabler 9. Young 10. Cunningham 11. Kelly 12. Marino 13. Tittle 14. Favre (thru 2008 season) 15. Fouts. Although Drinen and Wylie agree on eight of the top 15 QBs, Drinen’s figures are all over the place with respect to many of the QBs and, often, are utterly illogical. For example, he has Starr, Jurgensen and Anderson totaling a measly 4.5 WAA for their combined careers---they had 12 NFL passing championships between them! Meanwhile, he gives Elway, who is a poster boy for being a slightly-above-average NFL QB, a staggering 30.7 WAA!

We end this by pointing out one very salient comment that Sean Lahman made when Drinen’s list was first published. Lahman wrote, “I’m wondering if you think this raw data, even though it measures the QB’s team and not the QB himself, is a useful metric?” Drinen never replied.

All football stats, including WAA figures, are the product of the team, not the QB. This, above all, we should remember.

Footnote: Several years ago, Neil Paine (aka Davis Wylie) admitted that his “100 Greatest QBs” study was a “flawed monstrosity”. Indeed, he pulled it off the Internet. I’ve never agreed with that harsh assessment. Virtually all WAA surveys that I have examined have at least one very serious flaw attached to them---usually using stats like “Yards After Catch” or “Air Yards”, which are not very accurate, are not officially recognized by the NFL, and cannot be applied universally---or using “Sacks”, for which there are no official records before 1969 and which, therefore, can’t be universally applied to all the QBs throughout NFL history. After all, what is the use of any study that can’t include the likes of Graham, Unitas, Van Brocklin, Starr, Jurgensen and Tittle? To be of any use, WAA models must use stats that are official and universal. Furthermore, they should be stats that relate well to the QB (e.g., Drinen’s “increments of defensive points allowed” stat has virtually nothing to do with the QB; stats like “adjusted yards per attempt” or “adjusted net yards per attempt” are team-oriented efficiency stats which, likewise, have little to do with the QB). Additionally, you would prefer to have a stat that correlates well to winning percentage. As an example, completion percentage has only a moderate correlation (.43) to winning percentage. On the other hand, TD pass percentage does correlate very well with winning percentage (.55), but the problem is that, while the QB has a great deal of control over completion percentage, many other factors, including luck and the skills of the receivers (as Burke points out), have much more influence over TD pass percentage than the QB. The problem, as it has always been, is separating the merits of the QB from the merits of his team---and doing so in an equitable manner.

There are many other problems in creating a WAA model. As Burke illustrates with his August 2007 model which used “Air Yards”, running QBs, like Michael Vick, have a huge advantage if their running ability is factored into the equation (e.g., in Burke’s study of 2006 stats, Vick’s “Wins Added Per 16 Games”, based upon his passing stats alone, was -0.10, but when his rushing and fumble stats are included, his “Wins Added” total zooms to 1.72! A non-running QB, like Tom Brady, saw his “Wins Added” total precariously drop from .53 to .21. And a QB like Kurt Warner, who was a notorious fumbler, saw his “Wins Added” total drop off the cliff---from .80 to -.51!). Fortunately, there have been few running QBs in league history like Vick and few fumblers like Warner. Strength of schedule (SOS) is another factor that is difficult to account for. Probably the best way to estimate a QB’s career SOS is to review the number of winning teams that he has faced vs. career starts, but how to factor SOS into the equation is always problematic. For instance, in one of Doug Drinen’s SOS lists (March 30, 2009), his figures are seriously flawed because they are simple multiples of his WAA figures, which are tragically flawed.

Researcher Chase Stuart has his own version of SOS (the one used throughout), but his SOS figures are based entirely upon two stats---“Average Passing Yards Per Attempt” and “Adjusted Passing Yards Per Attempt” (which has more problems associated with it than passer rating, as it is also “weighted” with bizarre “bonuses” and “penalties”). Stuart refers to his system as “Morally Accurate”---whatever that means. As far as how to factor in the defenses that a QB faces over the course of his career, Football Outsiders’ David Lewin has noted that, “The quality of opponents’ defenses tends to even out over the course of a QB’s career.” Likewise, on a yearly basis, Football Outsiders’ two passer rating systems (Value Over Average and Defense-adjusted Value Over Average) exhibit very little change in overall QB rankings due to defensive factors. Also, football statistician Zach Fein has stated that offensive stats and their corresponding defensive stats have about the same correlation rates when it comes to winning percentage. But that isn’t quite true because, if you look at Fein’s correlation charts (or those of Burke), virtually all offensive stats correlate better with winning percentage than their corresponding defensive stats. Essentially, defensive stats can be largely ignored while we concentrate our efforts on offensive stats.

After all is considered and much discarded, I feel that the two key factors in any WAA model should be simple passing yards per attempt (without any penalties for sacks) and INT percentage, as these are the factors that the QB has fairly decent control over. Simple passing yards per attempt correlates very well with winning percentage (.58), although INT percentage has only a moderate correlation (.40). Both of these stats are beneficial as they require no “weights”, no “value judgments”, no “bonuses”, no “penalties” and few “adjustments” other than normalization and standard deviation, which must be used if you wish to compare QBs from different eras---which I think most of us do. Sacks and anything pertaining to sacks---such as “turnover ratio”---should be avoided, as the QB has very little control over sacks, data on sacks is sketchy at best, sack percentage has only a modest correlation (.38) to winning percentage and, as Elias warns us, many of the sacks that were counted as sacks in the old days (before 1981) would never be counted as sacks today. Additionally, there is no way of knowing what the QB was thinking when trapped for a loss behind the line of scrimmage---did he intend to pass or run? Sack totals and yards lost via sacks do not differentiate as to intent.

As for the inclusion of “yards per attempt” over “completion percentage”, arguments rage on both sides. Yards per attempt correlates much better with winning percentage than completion percentage, but the QB has much more control over completion percentage than he has over yards per attempt. Essentially, it is a case of “pick your poison” as they are simple multiples of each other and, thus, one or the other must be excluded. Since we are most interested in WAA, I lean toward yards per attempt. Famed sports writers Allen Barra and George Ignatin, in their book “Football By The Numbers”, agree, “Passing numbers best correlated with winning are simple yards per pass attempt and INT percentage.” Also, there is another factor here---yards per attempt heavily favors the old-time QBs while INT percentage heavily favors the modern QBs---so a delicate balance is struck and an aspect of “fairness” is achieved. If you choose to use both completion percentage and INT percentage (as Burke does), your equation would be heavily tilted toward the modern QBs.

Okay, according to the two ingredients of the formula for WAA that I am recommending, who should be the best QBs with respect to WAA? Well, here are the top ten QBs of all-time in “Simple Yards Per Attempt”: Luckman, Van Brocklin, Rodgers, Newton, Young, Warner, Romo, Roethlisberger, Ed Brown, and Starr. Here are the top ten QBs of all-time in “INT Percentage”: Rodgers, Brady, O’Donnell, McNabb, Bradford, Garcia, Flacco, Brunell, Garrard, and Matt Ryan. Clearly, Aaron Rodgers should have a substantial lead over all other QBs. Steve Young and Matt Schaub are the only other QBs who place among the top 20 in both categories.

But we immediately run into another problem. Just because Rodgers has the lowest INT percentage in NFL history, it doesn’t mean that he was the best of all-time, because his raw stats are not era adjusted. As an example, let’s take a look at Neil O’Donnell, who has the lowest INT percentage among retired QBs (2.11%). After era adjusting O’Donnell’s INT percentage relative to league average (3.29%) from 1991-2003, he still finishes in the No. 2 spot (relative to average). Researcher Kiran Rasaretnam points out that the No. 1 spot belongs to Roman Gabriel, who starts out in 68th place in career interception percentage (3.31%), but the league average during his career (1962-77) was a whopping 5.34%---thus jumping Gabriel 67 places to the top spot among retired QBs relative to league averages. What I’m trying to point out is that raw WAA figures are useless unless we are able to adjust them by era without bias. Ditto for “Yards Per Attempt”. However, even that is a problem because, as we have seen with Wylie, the process of “forward normalization” makes the old-time QBs look a lot better than they actually were.
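To make the O’Donnell vs. Gabriel comparison concrete, here is a small Python sketch that expresses each QB’s INT percentage as a ratio to the league average of his era. Using a simple ratio is my assumption for illustration; I am not claiming it is the exact calculation Rasaretnam performed. The percentages are the ones quoted above.

```python
def relative_int_pct(player_pct, league_pct):
    """QB's INT% divided by the league INT% of his era (lower is better)."""
    if league_pct <= 0:
        raise ValueError("league_pct must be positive")
    return player_pct / league_pct


# (career INT%, league-average INT% over the same years), as quoted in this post.
qbs = {
    "O'Donnell": (2.11, 3.29),  # 1991-2003
    "Gabriel": (3.31, 5.34),    # 1962-77
}

# Sort from best (lowest ratio) to worst.
ranked = sorted(qbs, key=lambda name: relative_int_pct(*qbs[name]))

for name in ranked:
    ratio = relative_int_pct(*qbs[name])
    print(f"{name}: {ratio:.3f} of league average")
```

On raw INT percentage O’Donnell is far ahead, but relative to his era Gabriel’s ratio (about 0.62) edges out O’Donnell’s (about 0.64), which is the reversal described above.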

As a recent example, researcher Rupert Patrick normalized the passer ratings of all the QBs in an article for the PFRA. His top three in “normalized passer rating” were Graham, Luckman and Baugh. Inexcusably, Patrick included Graham’s bloated AAFC stats---so Graham can be ignored, but do you really think that Luckman and Baugh were better than Manning and Brady? If we reversed the process and normalized in a backward direction, the old-time QBs would look inept in comparison to the modern QBs. Therefore, normalization is useless without going one step further---standard deviation. But even the innocent-looking process of “standard deviation” has a glaring flaw. It depends entirely upon there being tough competition between passers from the same era. The best example would be Sammy Baugh---over his long career, his only competition came from Sid Luckman. Thus, Baugh, in standard deviation studies, is always ranked among the top 5 passers of all-time based upon how extraordinarily many standard deviations he stood above the league passer rating mean. But it’s a false positive because his nearest and only competition---Luckman---dropped like a rock when confronted with standard deviation and, currently, is ranked outside the top 35 QBs of all-time in this same category!
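Here is a rough Python sketch of why the standard deviation approach depends on the competition. All the ratings below are hypothetical numbers invented for illustration; they are not the actual ratings of Baugh, Luckman or anyone else. The same top rating is scored against a thin era (one star, one decent rival, weak peers) and against a deep era (many good passers).

```python
from statistics import mean, pstdev


def z_score(value, league):
    """Standard deviations above (or below) the league mean."""
    spread = pstdev(league)
    if spread == 0:
        raise ValueError("league ratings have no spread")
    return (value - mean(league)) / spread


# Hypothetical passer ratings, invented for this example only.
thin_era = [85.0, 60.0, 45.0, 44.0, 43.0, 42.0]  # one star, one rival, weak peers
deep_era = [85.0, 80.0, 78.0, 75.0, 72.0, 70.0]  # several strong passers

star_thin = z_score(85.0, thin_era)
star_deep = z_score(85.0, deep_era)
rival_thin = z_score(60.0, thin_era)

print(f"star in thin era:  {star_thin:.2f}")
print(f"star in deep era:  {star_deep:.2f}")
print(f"rival in thin era: {rival_thin:.2f}")
```

With these made-up numbers, the identical 85.0 rating earns a higher score in the thin era than in the deep one, while the lone rival in the thin era ends up less than one standard deviation above the mean.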

There is one thing we should always be mindful of: whenever possible, “adjustments” to real stats should be avoided unless they are absolutely necessary. “Adjustments” have become the playthings of modern times, but that doesn’t mean that we should trust them. Usually, “adjustments” are the enemy as they often reflect the “value judgments” and, thus, the bias of the researcher. Whenever you hear the word “value” in anyone’s ranking system---run for the hills. Regressions are particularly unfaithful as their variables and exponents can be manipulated in any manner which is favorable to the researcher’s desire to embellish whatever argument he is trying to make. It is ratios and percentages of “real” stats that should be embraced in any statistical analysis as they, in contrast to artificially contrived stats like “Wins Above Replacement”, do not discriminate. Failure to realize this is what flawed Davis Wylie’s study and has flawed many similar studies.

Ref: Barra, Allen & George Ignatin, “Football By The Numbers”, 1986, Prentice Hall.
Burke, Brian, “The Fouts Analysis”, Jan. 21, 2006.
Ibid., “QB Wins Added II”, July 23, 2007.
Ibid., “QB Wins Added (With Rushing)”, August 2007.
Drinen, Doug, “Adjusting QB Win-Loss Records, Part I”, blog, March 25, 2009.
Ibid., “Ranking the QBs---Schedule & Weather Adjustments”, blog, Aug. 12, 2009.
Ibid., “Adjusting QB Win-Loss Records, Part II”, blog, March 30, 2009.
Fein, Zach, “Looking At Win Correlations, Part I”, The Sporting
Ibid., “How NFL Statistics Lead To Wins Part II: Quantifying Player Stats”, Bleacher Report, March 25, 2009.
Paine, Neil, “If Aikman Was Romo”, blog, Aug. 25, 2009.
Ibid., “The 100 Greatest QBs of The Modern Era”,
Patrick, Rupert, “Normalized Passer Rating”, in “Coffin Corner”, Vol. 33, No. 3, PFRA 2011.
Rasaretnam, Kiran, “The Importance of Interceptions (Or Lack Thereof)”, Feb. 6, 2009.
Stuart, Chase, “Rearview Adjusted Yards Per Attempt”, blog, Aug. 23, 2007.

1 comment:

Anonymous said...

After all, it’s the way Jim Glass and I see it:
No single stat* can tell you how good/bad a QB was/is. No EPA, WPA, QBR, Y/PP or whatever. It always was and always will be team stats.

So all* QB-Rankings are subjective.

Karl, Germany

(* Maybe the only way to extract QB performance from team stats is to compare QBs on the same team in the same year. But then you have to “adjust” too, for the fewer practice reps and less preparation the backup QBs get.)
