## Saturday, December 27, 2008

### Straight Up Winners

by Bob Burns

It is my contention that the box score data and stats from NFL games are the result of the down, distance, field position, score and time remaining in the game. As they are the result, they cannot be the cause of winning or losing, and so they cannot be the independent variables in a regression that forecasts either scores or wins and losses. I am convinced that the data and stats are not useful for forecasting because I have tried to use them in a regression without success. But Brian Burke has a model that forecasts the probability of wins and losses, and that model is based on box score data and stats. This presents a problem: how to explain Brian’s success.

In my opinion, Brian has had success because the data he is using are proxies for wins and losses. If I am correct, I should be able to construct a model based on won/lost records that outperforms Brian’s model, because the actual data should outperform the proxy data.

I used data from week 5 to week 17 of the 1996 to 2006 seasons to build the model. The model was set up for home teams only, and used the difference in win percentage between the home team and the visiting team. The fitted formula is that the home team’s probability of a win is .60 plus .43 times the win percentage difference.

For example, in week 16 of 2008 the Giants were home against the Panthers. Both teams had the same win percentage, so the Giants had an estimated win probability of .60. Another example is the Redskins at home in week 16 vs the Eagles. The Skins’ record was .50 and the Eagles’ was .57. The Skins’ probability is calculated as .60 plus (.43 times -0.07), which works out to .57. Both the Giants and the Skins won. [ It’s my write-up, so I can pick winning examples LOL ]
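
For anyone who wants to reproduce the arithmetic, here is a minimal Python sketch of the formula and the two week 16 examples. The function name and the clamping to the 0-to-1 range are my own assumptions for illustration; the write-up only gives the .60 and .43 figures.

```python
def home_win_probability(home_win_pct, away_win_pct,
                         intercept=0.60, slope=0.43):
    """Home team's estimated win probability from the win percentage difference.

    intercept and slope are the .60 and .43 values quoted in the post.
    Clamping to [0, 1] is an added assumption, not part of the post.
    """
    p = intercept + slope * (home_win_pct - away_win_pct)
    return min(1.0, max(0.0, p))


# Giants vs Panthers: equal records, so only the intercept remains.
giants = home_win_probability(0.75, 0.75)   # any equal pair gives 0.60

# Redskins (.50) at home vs Eagles (.57).
skins = home_win_probability(0.50, 0.57)

print(round(giants, 2), round(skins, 2))
```

The equal-record case shows that the .60 term acts as the home field edge when the two teams look identical on paper.
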
That method was then applied to week 5 to 17 in 2007 and 2008 and compared to the same weeks as Brian’s model. Now the question is how to compare results.

First, let’s look at who had more correct. Brian’s model had 236 correct (his pick with the higher probability won) and 121 wrong for a 66% win rate. Mine showed 233 winners and 124 losers for a 65% win rate. But that doesn’t reflect the differences in win probability. To check that, I took all the games rated .90 and above, and computed the average forecasted probability and the actual win percentage for that group. I did the same with the .80’s, .70’s, etc. I then ran a correlation between the average forecasted probability and the actual win rate across the groups. Brian’s model had a 0.922 correlation and mine had a 0.979 correlation.
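
The grouping-and-correlation check described above can be sketched roughly as follows. This is only an illustration: the function names are hypothetical, the correlation is assumed to be an ordinary Pearson correlation, and the sample forecasts and outcomes are made up, not the actual 2007 and 2008 games.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def decile_calibration(forecasts, outcomes):
    """Group games into the .90's, .80's, .70's, etc. by forecast.

    Returns one (average forecast, actual win rate, game count) row
    per non-empty group, lowest group first.
    """
    bins = defaultdict(list)
    for p, won in zip(forecasts, outcomes):
        key = min(int(p * 10), 9)          # a forecast of 1.0 joins the .90 group
        bins[key].append((p, won))
    rows = []
    for key in sorted(bins):
        games = bins[key]
        avg_p = sum(p for p, _ in games) / len(games)
        win_rate = sum(w for _, w in games) / len(games)
        rows.append((avg_p, win_rate, len(games)))
    return rows


# Made-up forecasts and results (1 = forecasted team won), for illustration only.
forecasts = [0.55, 0.58, 0.65, 0.62, 0.68, 0.75, 0.78]
outcomes = [1, 0, 1, 0, 1, 1, 1]

rows = decile_calibration(forecasts, outcomes)
r = pearson([row[0] for row in rows], [row[1] for row in rows])
print(rows, round(r, 3))
```

Note that a correlation computed this way treats every group equally, whether it holds 5 games or 100, which is one reason to also try equal-sized groups.
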

There were 358 games and 357 decisions (1 tie). I used the same procedure as above, except that I grouped the games in groups of about 50, starting with the highest forecasted probability. Brian’s model had a 0.948 correlation, and my model had a 0.989 correlation.
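
Here is a rough sketch of that second grouping step, sorting by forecast and cutting into groups of about 50. How the leftover games were handled is not stated, so spreading them to keep the groups nearly equal (357 decisions become 7 groups of 51) is an assumption, as are the function and parameter names.

```python
def grouped_calibration(forecasts, outcomes, target_size=50):
    """Sort games by forecast, highest first, and cut into near-equal groups.

    Returns one (average forecast, actual win rate, game count) row per group.
    The number of groups is chosen so each holds about target_size games.
    """
    games = sorted(zip(forecasts, outcomes), key=lambda g: g[0], reverse=True)
    if not games:
        return []
    n_groups = max(1, round(len(games) / target_size))
    base, extra = divmod(len(games), n_groups)
    rows, start = [], 0
    for i in range(n_groups):
        size = base + (1 if i < extra else 0)   # first groups absorb the remainder
        group = games[start:start + size]
        start += size
        avg_p = sum(p for p, _ in group) / size
        win_rate = sum(w for _, w in group) / size
        rows.append((avg_p, win_rate, size))
    return rows


# Tiny made-up example with groups of 2 instead of 50.
print(grouped_calibration([0.9, 0.3, 0.8, 0.2], [1, 0, 1, 0], target_size=2))
```

The average forecast and win rate columns from these rows can then be correlated the same way as with the .90's, .80's, .70's groups.
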

So the models are very close in actual results. Are the results close enough to be essentially the same model?

Brian Burke said...

Very interesting. I assume you used to-date records, rather than latest season records as predictors?

There's an even simpler way, too. Just go with "team with better record wins, tie goes to the home team." It can get almost as accurate.

Unknown said...

I used the records that existed prior to the game being played. The "team with better record wins, tie goes to the home team" rule is simple, but is only correct about 62% of the time. Betting on the Vegas point spread favorite is another simple method, and it wins straight up at about a 65% rate. If we use that method to rate our "skill", our models are not much better and we don't have much skill. LOL But neither of those methods gives a probability of winning for each game, which is the outstanding feature of Brian's model.

The real questions are: are Brian's independent variables really a proxy for the won/lost record, and how does one prove it one way or the other statistically?

Anonymous said...

where can I find the average combined score by week for NFL?

In other words, is there a pattern during the season?