Advanced NFL Stats Community: Towards a Better Pythagorean: Should Football Outsiders "hold the update"?

by Jim Glass

The good people at Football Outsiders say they are changing the formula for Pythagorean Expectation that they use to gauge team strength, and intend to update all the Pythagorean data on their website via their new formula during the offseason.

The "traditional" Pythagorean formula they've used until now produces a win expectation with about a 91% correlation to actual team wins. They say the new, improved formula increases the correlation to .9134 from .9120.

That's something - but a much simpler change in how the standard Pythagorean formula is applied, one I've been using, increases the correlation to 95%. Also, while FO's new methodology applies a log function to year-to-date data in a way that will be opaque to the average fan, the method I've been using gives a clear game-by-game result that anybody can easily grasp.

So I submit, for their consideration and yours, the formula adjustment below as being more accurate, simpler to apply, and easier for fans to refer to, understand, and play with.

Take it for what it's worth.

Regular Pythagorean
For those who may be unfamiliar with it, "Pythagorean win expectation" or "projection" is an expected winning percentage derived for a team from its points for-against totals using the formula PF^X/(PF^X+PA^X). It was created by Bill James in the 1980s for baseball, and the exponent "X" he used was 2. Its form reminded him of the Pythagorean Theorem in geometry, hence the name.
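As a minimal sketch (Python, with a hypothetical function name `pythagorean`), the formula is a one-liner. The 2.37 default simply mirrors the traditional NFL exponent discussed next, the 800-700 totals are made up, and returning .500 for a 0-0 total is an assumption of this sketch rather than part of the original definition.

```python
def pythagorean(pf: float, pa: float, exponent: float = 2.37) -> float:
    """Pythagorean win expectation PF^X / (PF^X + PA^X)."""
    if pf == 0 and pa == 0:
        return 0.5  # assumption: no points either way is treated as an even matchup
    pf_x = pf ** exponent
    pa_x = pa ** exponent
    return pf_x / (pf_x + pa_x)


# Bill James's original baseball version used X = 2 (runs scored and allowed).
print(round(pythagorean(800, 700, exponent=2), 3))
```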

The idea quickly was applied to other sports, with modest changes in the exponent found to provide "best fit" for each. The exponent derived for the NFL in the 1980s was 2.37, and it has been the most commonly used ever since - for instance, in determining the "expected W-L" numbers on the team pages at PFR.com. I find that for the years 2001-10 the exponent producing the best fit is 2.67, but the difference this change makes is very minor and not the point here.

Pythagorean expectation is derived not from the difference between points "for" and "against" but from their ratio. NFL teams score an average of about 22 points per game. A team that outscores its opponents by an average of 10 points using offense, by 32-22, has a Pythagorean expectation of 72%. Another that outscores its opponents by the same average of 10 points using defense, by 22-12, has a win expectation of 83%. (This makes sense since if in ten games played to a decision a team outscored its opponents by 50-0 then it must have gone 10-0, while if it outscored them by 450-400 it's likely to have lost several of those shoot-outs, possibly most of them.)
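The ratio-versus-difference point can be checked directly. This Python sketch assumes the 2.67 exponent; with it the two teams come out near .73 and .83, and with 2.37 somewhat lower, so the exact percentages depend on which exponent is used.

```python
def pythagorean(pf: float, pa: float, exponent: float = 2.67) -> float:
    """PF^X / (PF^X + PA^X); 2.67 is the 2001-10 best-fit exponent cited above."""
    return pf ** exponent / (pf ** exponent + pa ** exponent)


# Two hypothetical teams with the same +10 average margin per game.
by_offense = pythagorean(32, 22)  # ratio 32/22, about 1.45
by_defense = pythagorean(22, 12)  # ratio 22/12, about 1.83
print(f"32-22: {by_offense:.3f}  22-12: {by_defense:.3f}")

# Only the ratio matters: doubling both scores leaves the expectation unchanged.
print(abs(pythagorean(64, 44) - by_offense) < 1e-12)
```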

Plenty of data from all the major sports show that when a team's W-L record is a lot better or worse than its Pythagorean expectation, its future WL% tends strongly to regress toward its Pythagorean. WL% diverging from Pythagorean expectation is commonly deemed the result of luck, good or bad, with the Pythagorean being a measure of the team's true strength. This is supported by the fact that, going forward, Pythagorean expectation is demonstrably a better predictor of future WL% than past WL% is.

"Unit Pythagorean"
Pythagorean expectation has always been figured using total points for and against for the entire season. A while back I wondered what the result would be if it was calculated on a game-by-game basis, computing the expectation for each individual game score and then averaging them. No more than that. Simple.
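A minimal Python sketch of the difference, assuming each game is recorded as a hypothetical `(points_for, points_against)` tuple; the function names are illustrative, not from the original, and the four-game stretch is invented.

```python
from typing import List, Tuple

Game = Tuple[int, int]  # (points for, points against) in one game


def pythagorean(pf: float, pa: float, exponent: float) -> float:
    if pf == 0 and pa == 0:
        return 0.5  # assumption: a 0-0 game counts as an even result
    return pf ** exponent / (pf ** exponent + pa ** exponent)


def regular_pythagorean(games: List[Game], exponent: float = 2.67) -> float:
    """One observation: season point totals run through the formula once."""
    pf = sum(g[0] for g in games)
    pa = sum(g[1] for g in games)
    return pythagorean(pf, pa, exponent)


def unit_pythagorean(games: List[Game], exponent: float = 2.67) -> float:
    """One observation per game: average the per-game expectations."""
    return sum(pythagorean(pf, pa, exponent) for pf, pa in games) / len(games)


# Hypothetical four-game stretch: one blowout win and three losses.
games = [(62, 7), (10, 20), (10, 20), (10, 20)]
print(round(regular_pythagorean(games), 3), round(unit_pythagorean(games), 3))
```

Here the blowout's surplus points lift the regular figure above .500, while the game-by-game average stays below it.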

Applying both regular Pythagorean and what I think of as "unit Pythagorean" - treating each game as an individual unit, rather than all games in the aggregate - to all 319 teams during the ten seasons 2001-10, using the 2.67 exponent, and comparing the difference between actual and expected winning percentage, produced these results:

* Correlation between expected and actual wins increased to .952 from .914.

* Standard deviation of the difference between actual and expected winning percentage fell to .130 from .180.
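The comparison statistics can be sketched in Python as follows. Pearson correlation is assumed for "correlation", population standard deviation for the spread, and the three-team input is purely hypothetical. The function also returns the spread of the expectations themselves, which is what the author's later comment says the .180 and .130 figures actually measured.

```python
import math
from statistics import pstdev
from typing import List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def compare(expected: List[float], actual: List[float]) -> Tuple[float, float, float]:
    """Return (correlation, SD of actual minus expected, SD of expected)."""
    diffs = [a - e for e, a in zip(expected, actual)]
    return pearson(expected, actual), pstdev(diffs), pstdev(expected)


# Hypothetical expected vs actual winning percentages for three team seasons.
expected = [0.62, 0.48, 0.30]
actual = [0.6875, 0.4375, 0.25]
corr, sd_diff, sd_expected = compare(expected, actual)
print(round(corr, 3), round(sd_diff, 3), round(sd_expected, 3))
```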

A significant improvement. The original 8.6 points of "non correlation" are reduced by 44% while the standard deviation of the difference between expected and actual wins is reduced by 28%.
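The percentages in that paragraph follow from the two bullet figures with simple arithmetic; a quick check in Python:

```python
regular_corr, unit_corr = 0.914, 0.952
regular_sd, unit_sd = 0.180, 0.130

# "Non-correlation" is 1 - r: 0.086 for regular, 0.048 for unit.
non_corr_cut = ((1 - regular_corr) - (1 - unit_corr)) / (1 - regular_corr)
sd_cut = (regular_sd - unit_sd) / regular_sd
print(f"{non_corr_cut:.0%} {sd_cut:.0%}")  # 44% 28%
```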

The reason behind this improvement, I believe, is that figuring the Pythagorean for each game as an individual unit increases the number of observations. Football (unlike baseball) is always cursed by small sample size in statistical analysis. "Per play" analysis systems (such as EPA and DVOA) work by counting each play as an observation, greatly increasing the number of observations about a team compared to the few outcomes observed in its won-lost record. This is especially so early in the season when W-L totals are so low as to prove nothing, but hundreds of plays per team can be evaluated to provide credible estimates of team strength.

Increasing the number of Pythagorean observations from one to as many as 16 or more should provide similar benefits - as one example, in containing the excess effects of extreme, outlier game scores.

For instance, in week 7 the Saints beat the Colts by 62 to 7. The Unit Pythagorean method gives the Saints a .997 score for that week, but only for that week. A Pythagorean expectation of over .969 rounds off to a perfect 16-0 season, so any percentage higher than that is meaningless in practical terms. The Saints exceeded that .969 when the score in that game reached 27-7. The extra 35 points they scored beyond that increased their Unit Pythagorean score for the game by only .028, which divided by 16 games is less than .002, about nothing, just what it should be in practical terms.
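The arithmetic of the blowout can be reproduced in a few lines of Python, assuming the 2.67 exponent; .969 is 15.5/16, the smallest expectation that rounds to 16 wins.

```python
def pythagorean(pf: float, pa: float, exponent: float = 2.67) -> float:
    return pf ** exponent / (pf ** exponent + pa ** exponent)


perfect_line = 15.5 / 16        # 0.96875: rounds to a 16-0 season
at_27_7 = pythagorean(27, 7)    # already past the line
final = pythagorean(62, 7)      # the 62-7 final score
excess = final - perfect_line   # rating gained beyond the line
print(round(at_27_7, 3), round(final, 3), round(excess / 16, 4))
```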

But those extra 35 points have a significant impact on regular Pythagorean expectation, as they average 2.2 points a game that one might think of as being "carried over" into the other games of the season. Over a 16-game season they raise an otherwise average .500 team's win expectation to over 56%, and the effect is of course larger earlier in the season. At mid-season, after eight games including the 62-7 score, a team with a Unit Pythagorean of .500 would have a regular Pythagorean of .620 -- and look like a contender.
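The carry-over effect can be sketched for a hypothetical otherwise-average team that scores and allows 22 points a game (the league average cited earlier), plus the 35 extra points from one blowout, again assuming the 2.67 exponent:

```python
def pythagorean(pf: float, pa: float, exponent: float = 2.67) -> float:
    return pf ** exponent / (pf ** exponent + pa ** exponent)


def regular_with_blowout(games: int, extra_points: float, ppg: float = 22.0) -> float:
    """Regular Pythagorean for a team averaging ppg for and against,
    with extra_points added to its season points-for total."""
    points_for = ppg * games + extra_points
    points_against = ppg * games
    return pythagorean(points_for, points_against)


print(round(regular_with_blowout(16, 35), 3))  # full season
print(round(regular_with_blowout(8, 35), 3))   # mid-season
```

Halving the games played doubles the per-game carry-over (4.4 points instead of 2.2), which is why the distortion is largest early in the season.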

Another advantage of Unit Pythagorean is that it is really easy on the eye and on the understanding. One can produce a simple table of each team's game-by-game Pythagorean to show just where its Unit Pythagorean came from. The season-to-date Unit Pythagorean is just the average. If one wants, one can add a column adjusting each game's "rating" by the strength of the opponent, to give a "quality" measure for each game. Then one can rank individual games in order of impressiveness, draw a trend line to see whether performance is getting better or worse, and do lots of other things, all with no math fancier than the Pythagorean formula itself given above.
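A minimal sketch of such a table in Python, with hypothetical scores. The opponent-strength column is left out because the post does not specify how that adjustment is made; the trend line here is an ordinary least-squares slope, one reasonable choice among several.

```python
from typing import List, Tuple


def pythagorean(pf: float, pa: float, exponent: float = 2.67) -> float:
    if pf == 0 and pa == 0:
        return 0.5  # assumption: a 0-0 game counts as an even result
    return pf ** exponent / (pf ** exponent + pa ** exponent)


def trend_slope(values: List[float]) -> float:
    """Least-squares slope of values against game number 1..n (needs n >= 2)."""
    n = len(values)
    mean_x = (n + 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values, start=1))
    den = sum((x - mean_x) ** 2 for x in range(1, n + 1))
    return num / den


# Hypothetical (points for, points against) scores, not any real team's season.
scores: List[Tuple[int, int]] = [(13, 24), (17, 23), (20, 21), (24, 17), (30, 10)]
ratings = [pythagorean(pf, pa) for pf, pa in scores]

total = 0.0
for game, ((pf, pa), rating) in enumerate(zip(scores, ratings), start=1):
    total += rating
    print(f"Game {game}: {pf}-{pa}  rating {rating:.3f}  season to date {total / game:.3f}")
print(f"Trend per game: {trend_slope(ratings):+.3f}")

# Rank games by rating; report the most impressive one (1-based game number).
best = max(range(len(ratings)), key=ratings.__getitem__) + 1
print(f"Best game: {best}")
```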

Pythagenport
Football Outsiders can best explain its proposed change itself. Quoting:

We've been writing about the Pythagorean projection since we launched in 2003. We've always used 2.37 as the exponent in the equation ... However, that exponent is based on the offensive environment of the league. We all know the offensive environment is a bit different now. Teams are scoring more points and allowing more points. So the exponent has changed, and 2.37 is not the most accurate way to estimate Pythagorean wins anymore.

Actually, if we want to be as accurate as possible, each team plays in a different offensive environment. Saints games feature lots of points. Jaguars games feature fewer points. The exponent should be different for each team. Baseball Prospectus discovered this a few years ago and started replacing Pythagorean wins with something it called "Pythagenport" (after writer Clay Davenport).

I've figured out a similar method to get better results for the NFL. Pythagenport finds a different exponent for each team based on their offensive environment. The equation that works best in the NFL is 1.5 * log ((PF+PA)/G). The improvement is slight. The correlation between Pythagorean wins and actual wins for 1990-2010 is .9120. The correlation between Pythagenport and actual wins for 1990-2010 is .9134.

However, the improvement from Pythagenport is bigger in recent seasons because scoring has been higher in recent seasons. (In particular, it helps with the Colts, who have continuously outperformed the standard Pythagorean projection all decade.)
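The per-team exponent in the quoted formula can be sketched in Python. The quote does not state the log base; base 10 is assumed here because it gives exponents close to the familiar 2.37 at typical NFL scoring levels (about 2.47 at 44 combined points per game), whereas a natural log would give exponents above 5. The team totals below are hypothetical.

```python
import math


def pythagenport_exponent(pf: float, pa: float, games: int) -> float:
    """1.5 * log((PF + PA) / G), assuming a base-10 log."""
    return 1.5 * math.log10((pf + pa) / games)


def pythagenport(pf: float, pa: float, games: int) -> float:
    """Pythagorean expectation using the team-specific exponent."""
    x = pythagenport_exponent(pf, pa, games)
    return pf ** x / (pf ** x + pa ** x)


# Two hypothetical teams with the same +80 season point differential over 16 games.
for pf, pa in [(480, 400), (300, 220)]:
    x = pythagenport_exponent(pf, pa, 16)
    print(pf, pa, round(x, 2), round(pythagenport(pf, pa, 16), 3))
```

The higher-scoring environment gets the larger exponent, which is the adjustment the quoted passage describes.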

Note that I am not criticizing Pythagenport in any way. If it has something interesting to say about the Colts' outlier performance relative to regular Pythagorean in recent years, I'm eager to hear it.

All I'm doing is reporting the results I find. Comparing expected-to-actual wins for the 319 team seasons during the years 2001-10, computed using regular Pythagorean, the Pythagenport formula given above, and Unit Pythagorean, I find...

Method	Correlation	Standard deviation
Unit Pythagorean	0.952	0.130
FO Pythagenport	0.916	0.167
Reg. Pythagorean	0.914	0.180

Into the public domain
Unit Pythagorean is like BigWin% in that it is so simple I have to believe others have used it before, though I have no knowledge of anyone who has. I make no claim to having "invented" it, only to having independently come upon it.

The two are closely related: both exploit the fact that one-sided victories (and defeats) are much better indicators of team strength than close game outcomes. But while BigWin% treats close games as ties, Unit Pythagorean gives a "strength" rating on a graduated scale to every outcome: winning a game by 23-22 gives the winner a 53% strength rating for the game (before adjusting for home-field advantage and strength of schedule). So one can easily produce a ranking of all teams by strength using Unit Pythagorean and a strength-of-schedule adjustment.
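The graduated scale can be seen by rating a few hypothetical winning scores with the per-game formula (2.67 exponent assumed, no home-field or schedule adjustment):

```python
def pythagorean(pf: float, pa: float, exponent: float = 2.67) -> float:
    return pf ** exponent / (pf ** exponent + pa ** exponent)


# A one-point win barely moves the rating off .500; lopsided wins approach 1.0.
for pf, pa in [(23, 22), (24, 17), (31, 10), (62, 7)]:
    print(f"{pf}-{pa}: {pythagorean(pf, pa):.3f}")
```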

The idea of applying Pythagorean expectation game by game came to me when considering a peculiar fact: when BigWin% was applied backward to the past 15 years of playoff games, it outperformed not only regular WL% at "predicting" winners (as expected) but also Pythagorean. That was surprising, because while BigWin% considered only won-lost-tie data, Pythagorean considered all game scores in a theoretically sound manner. It seemed peculiar that the cruder method, using less data, would be more successful.

Then it occurred to me that since regular Pythagorean used only one observation (total points for and against for the season) while BigWin% used 16 (all game outcomes), regular Pythagorean might actually be the one using less data.

Be that as it may, here is what I tried and what I found. Take the results for what they are worth and, if you wish, use them as you will.

Just for fun, below is the Dolphins' game-by-game Pythagorean performance adjusted by strength of opponent. Worst performance: Game 5 against the Jets. Best: Game 10 against the Cowboys.

Trend: Strongly up.

Maybe things are looking up for the next coach.

6 comments:

Boston Chris said...: Nice work, Jim. I really like it.; December 24, 2011 at 1:28 AM
Tom said...: Very interesting, nice work. One suggestion regarding the averaging... May I suggest you average the odds and not the probabilities? There are sound mathematical reasons to do this.; December 25, 2011 at 7:04 PM
Michael Beuoy said...: I like the concept, but I wonder if the improved correlation isn't just a case of overfitting.

For example, using your unit approach, if you ratcheted up the exponent from 2.67 to 500, I would expect the correlation with win percentage to rise to nearly 1.0.; December 25, 2011 at 11:09 PM
Jim Glass said...: Boston Chris, thanks.

Tom, Pythagorean has always been expressed in terms of win probability, and I stuck with that for the sake of conformity (however mindless). I don't doubt there are better ways to do these things, but they are for others. This was just an exercise in curiosity for me. I'd never have brought up the subject here but for the FOers' commentary.

Michael, the 2.67 exponent is there to maximize the fit of regular Pythagorean. Its effect compared to the traditional 2.37 is so tiny I probably shouldn't have mentioned it at all; it's on the same order as that of the FOers' exponent formula. The increase to 95% correlation is virtually entirely the result of applying Pythagorean game by game -- the same result comes from doing the same thing using 2.37.

I did, however, make a mistake writing this piece up at 2am, as I realized when considering the 500 exponent you mentioned. The standard deviation numbers quoted aren't for the difference between expected and actual wins; they are for the Pythagorean numbers themselves. That is, regular Pythagorean win expectation for all teams over the ten years had a 91% correlation with actual winning percentage and a .180 standard deviation, and unit Pythagorean a 95% correlation and a .130 standard deviation (one-SD winning percentages from .370 to .630), a tighter distribution around the mean. I don't know why my brain spun out the other line, except maybe that I was sleep deprived. My bad on that one.; December 26, 2011 at 4:42 PM
Behan01 said...: Nice work Jim.

"So one can easily produce a ranking of all teams by strength using Unit Pythagorean and a strength-of-schedule adjustment."

Just curious, have you done this for 2011 and could you post the results?; December 27, 2011 at 7:35 PM
Andrew Foland said...: Michael Beuoy is absolutely correct. Take it as a homework exercise, Jim, to show that as the exponent goes to infinity, the correlation goes to 1. The Wikipedia article on p-norms may be useful, particularly the bit on L-infinity.; December 30, 2011 at 8:25 AM
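The limiting argument in the last two comments can be checked numerically. As the exponent grows, each game's rating tends to 1 for a win, 0 for a loss and .5 for a tie, so the game-by-game average converges on the actual winning percentage (ties counted as half). A Python sketch on hypothetical scores, written in log-odds form so very large exponents do not overflow:

```python
import math
from typing import List, Tuple


def pythagorean(pf: float, pa: float, exponent: float) -> float:
    """PF^X / (PF^X + PA^X), computed as a logistic of X * ln(PF / PA)."""
    if pf == pa:
        return 0.5
    if pa == 0:
        return 1.0
    if pf == 0:
        return 0.0
    z = exponent * (math.log(pf) - math.log(pa))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # tiny (or 0.0) for very negative z; never overflows
    return ez / (1.0 + ez)


def unit_pythagorean(games: List[Tuple[int, int]], exponent: float) -> float:
    return sum(pythagorean(pf, pa, exponent) for pf, pa in games) / len(games)


# Hypothetical scores: two wins, two losses, one tie.
games = [(24, 21), (10, 27), (62, 7), (17, 20), (20, 20)]
actual = sum(1.0 if pf > pa else 0.5 if pf == pa else 0.0 for pf, pa in games) / len(games)
for x in (2.37, 2.67, 10, 500):
    print(x, round(unit_pythagorean(games, x), 3), actual)
```

At very large exponents the method simply reproduces the won-lost record, so a near-perfect correlation there would say nothing about team strength, which is the overfitting concern raised above.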