Tuesday, May 3, 2011

Fixing the ELO ratings system

by Tom Baldwin

Back in 2008, Brian covered a well-known ratings system first conceived by the man whose name it now bears. That article offers great insight into both the good and bad points of Elo ratings, and is certainly worth a read in preparation for this one.

Recently I got thinking about Elo ratings again, and realized that one of their limitations, that they do not consider the score of games, can be overcome. When the original Elo system considers a game's outcome, it does so on the basis that the game ends with a binary result, a win or a loss, but that is not all of the information available. As we have seen, a team like last year's Falcons can appear far stronger than they truly are when only their win-loss record is considered, because they were on the right end of luck. Close wins are much more about luck than skill; winning by ten points is a far better indication of one team's supremacy than winning by one. But how can we quantify this? We need to answer quite a simple question: when two completely equal teams play each other, what is the chance that, by luck alone, team A beats team B by X or more points? In this scenario the margin is always down to luck alone, since the two teams are equal by construction and their levels of 'skill' completely cancel each other out.

That question is simple to ask, but not so simple to answer. Luckily, I had already done the legwork for it months ago. Back in January, before the playoffs, I decided to build a simulation model of an NFL game. Within that model I created two identical (league-average) teams, characterised by their chances of scoring on a drive, and I allowed for a variable that added X points to the score of whichever team I chose after the game had been simulated. By simulating thousands of games with different values of X, I was able to build a distribution describing how likely it is that a win by X points reflects a higher level of skill rather than luck. For example, if I set X to 5, always added it to the score of team A, and observed in simulations that team B still won 40% of the time, that would mean team A could be expected to beat a completely equal opponent by more than 5 points 40% of the time by luck alone. So, in the new Elo model, if team A wins by 6 points, rather than treating that as an absolute win and awarding them 1 minus their expected win probability, I would instead award them 0.6 minus their expected win probability.
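A minimal sketch of this kind of simulation in Python (the per-drive scoring chances, drive count, and trial count below are illustrative stand-ins, not the values from the actual model):

```python
import random

def simulate_game(p_td=0.20, p_fg=0.13, drives=11):
    """Score for one league-average team: each drive independently
    ends in a touchdown, a field goal, or no score."""
    score = 0
    for _ in range(drives):
        r = random.random()
        if r < p_td:
            score += 7
        elif r < p_td + p_fg:
            score += 3
    return score

def luck_win_prob(x, trials=100_000):
    """Estimate the chance that team B still comes out ahead even
    after identical team A is handed an x-point head start.
    Ties count as half a win for each side."""
    b_better = 0.0
    for _ in range(trials):
        a = simulate_game() + x
        b = simulate_game()
        if b > a:
            b_better += 1
        elif b == a:
            b_better += 0.5
    return b_better / trials
```

Sweeping x over a range of margins traces out the luck distribution described above; the exact percentages depend entirely on the chosen parameters.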

That change can have dramatic implications for how a game's outcome is treated. Where before a team with an expected winning probability of 0.9 going into the game, who goes on to win by 6 points, would see their rating increased in proportion to 0.1, now they would see it reduced in proportion to 0.3, with that amount awarded to their opposition. What this method does, which the original method did not, is draw a distinction between a team winning and a team performing to expectations.
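The resulting update is a one-line change to the standard Elo rule: replace the 1-or-0 game result with the luck-discounted credit. A sketch (the K-factor of 32 and the example ratings are arbitrary choices):

```python
def elo_update(rating_a, rating_b, margin_credit, k=32):
    """Margin-aware Elo update as described above: margin_credit is
    team A's share of the win after discounting luck (e.g. 0.6 for a
    6-point win), used in place of 1 for a win / 0 for a loss."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (margin_credit - expected_a)
    return rating_a + delta, rating_b - delta

# Ratings chosen so team A's expectation is roughly 0.9: a 6-point
# win, worth a credit of only 0.6, now lowers A's rating.
new_a, new_b = elo_update(1900, 1518, margin_credit=0.6)
```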

In a follow-up to this article I will further improve the Elo system by separating home performance from away performance, and post some results comparing the predictive power of the new system with the old one.


Adam said...

Cool idea. How is this different from the common rankings that only take margin of victory into account, like Sagarin's Predictive rankings? Rather, I know it's different, but what are some differences in the results?

James said...


I am the creator of the Pointshare rating system, which I use to rank college football and NFL teams as well as other sports (see www.Pointshare.webs.com, particularly the FAQ, and my older, now discontinued blog at dontsaveapitcher.blogspot.com, where I first outlined the system; see the Google Docs on the top left-hand side of the blog).

When I developed my system last year I had to wrestle with the same type of issue you describe here. Like you, I have a “game outcome function” to translate a football score into a value between 0 and 1, as do others such as Ken Massey (by the way, his comparison site of more than 100 college football ranking systems is an essential read if you want to see what other approaches people are using).

But I think there is a flaw in your logic. If your football model says that when teams A and B are equally good, team A will win by 5 or more points 40% of the time, it is incorrect to then credit them with 60% skill. By definition you have said the two teams are equal, so how can you credit team A with anything? This is akin to saying in statistics that because your null hypothesis has a p-value of 0.05, you are 95% sure that your alternative hypothesis is correct, which is not what a p-value of 0.05 means.

The purpose of Elo (or any other rating system, I guess) is to discover the underlying skill differences between teams. Knowing how likely a result is if the teams are equally skilled tells you nothing about how much better or worse team A is relative to team B.

I approached the problem from a conditional probability perspective and my game outcome function attempts to say that, given the score in the game that was played, what is the probability a team would win a rematch at the same venue.

Or, in other words: if team A defeated team B by 5 points,

How likely is this result if team A would beat team B 1% of the time over an infinite number of games?

How likely is this result if team A would win 2% of the time?

How likely is this result if team A would win 3% of the time?
...and so on, up to 99%.

This gives me a distribution of what a 5-point win tells me about the relative merits of the two teams. It looks like an inverted U. I then simply look at the area under the curve above 50%, and compare it to the area under the curve below 50%, to determine the probability that A would win a rematch.

I don’t know if your drive model allows you to vary the skill of each team to determine how likely a 5-point win is if team A is much better than team B; my really simple model of how football works allows me to do this. If it doesn’t, you could use Las Vegas to help you and set your Elo coefficients from pointspread-to-moneyline conversion charts. That is not the same thing as what I have discussed above, but since 3-point favourites win straight up 58% of the time, giving a 3-point win 58% of the credit is one way of tackling this issue.

If you like I can send you my excel spreadsheet where I run my system although it is quite big and I will need some time to tidy it up to be reader friendly.

When you finalise your system I encourage you to adapt it to college football (as there may be no more NFL!!) and submit your rankings to the Massey site as it is a wonderful resource.




Tom said...

Hi James, good to hear from you,

Perhaps I failed to explain myself well enough. What I did was attempt to strip out the part of the result that could be due to luck. By computing that two even teams will beat each other by five or more points 40% of the time, I am stating that there is a 40% chance that the result of this game was due merely to luck, a sort of NFL coin toss, and so the 60% that is left over can be assessed as being due to 'skill'. Does that explain it better?
What I have is, essentially, a spread-to-line converter, as you say.

I will be sure to have a read of your website, and I may well drop one or more sets of rankings into the Massey site. This system requires no real adaptation to be used on CFB, which is nice. However, I am working on a system I far prefer, which predicts a team's ratio of scoring and the match total, and that model is showing very nice promise - it makes no great assumptions, unlike the prior distribution that an Elo-like system applies.


James said...

Hi Tom

I still think you are partitioning skill and luck incorrectly.

If I toss an (unbiased) coin 10 times, there is a 38% chance of 6 or more heads (analogous to your 5-point-or-more win).

Would you credit me with 62% skill if I threw 6 heads out of 10?

Would this tell you anything about my coin-tossing abilities if you had to bet on the outcome of the next 10 tosses? (Or, in your example, on who would win a rematch between teams A and B under the same conditions as the game A won by 5 points?)

You are assuming the teams are equal and then working out the likelihood of a 5-point win.

You should instead look at a 5-point win and then determine how likely that result is at different values of team A's strength relative to B's, since you are trying to determine whether team A is better or worse than team B based on the scores of games.

For example, in my coin-tossing example, if you knew that I was able to flip the coin in a way which alters the percentage of heads from 50:50, you could look at how likely getting exactly 6 heads out of 10 is if I could flip heads 1% of the time, 2% of the time, and so on up to 99%, and then work out the probability that I am better than 50%.
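That coin calculation is small enough to write out directly. A sketch in Python, using a flat prior over the candidate biases as described (the 1% grid spacing is just a convenient choice):

```python
from math import comb

def prob_bias_over_half(heads, tosses=10):
    """Likelihood of the observed heads count at each candidate bias
    (1%..99%), weighted by a flat prior, then the posterior
    probability that the true bias exceeds 50%."""
    ps = [i / 100 for i in range(1, 100)]
    like = [comb(tosses, heads) * p ** heads * (1 - p) ** (tosses - heads)
            for p in ps]
    over = sum(l for p, l in zip(ps, like) if p > 0.5)
    return over / sum(like)
```

For 6 heads out of 10, this posterior probability comes out near 0.72, which differs from the 62% obtained by the tail-probability shortcut.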

This is the approach I take in my system: given the score, I work out each team's scoring rate per drive over a range of possible scoring rates, and then compare the two teams to work out which is likely to have the higher rate.

I don't think your assumption will drastically ruin your system but it is better to have a clear definition of what each step of your system means.


Tom said...

I am not assuming the two teams are equal. I assume they have a performance that is distributed logistically, as with Elo; however, I do not assume, as Elo does, that because one team wins, it is automatically better. Instead, I imagine the situation where the outcome of the match is due entirely to luck (as in, both teams are equal), and then I say: ok, x% of the time a result such as this is due entirely to luck, and so I take that x% away from the 100% win the team achieves. This is key, as it extends the definition of what a team's expected performance is. In regular Elo, a team's performance is simply whether they won or lost, but sport is not like that. Just because team A beats team B by five points doesn't mean that team A is better than team B, because such a result would happen in a game between two even opponents 40% of the time, which suggests there is merely a 60% chance that team A is better than team B.

Make more sense?


Tom said...

Just to add, as it may help: if the match were the best of ten coin tosses, then the expected performance of each team is known to be equal, but let's say it isn't known. At the start of the match I assume they are equal, and hence that the expected performance for both is five 'points'.
If I then observe that one team wins, as you say, by getting six heads, Elo says: 'OK, that team gets their rating adjusted by a number proportional to 1 minus their expected performance, so they get 1 minus 0.5, or 0.5, times some constant.' I say: 'Wait a minute, there was a 38% chance they would do that, or better, by luck alone, so I'm going to apportion them the win of value 1, but I'm taking away 0.38 because it could have been luck, and I'm taking away the 0.5 that was their expected performance prior to the game, so they get awarded 1 minus 0.5, minus 0.38, or 0.12, times some constant.'
Now, we both know that the coins are equal, but Elo only knows what you've told it: what approximation you have for their ratings. It isn't trying to keep that rating fixed; it's trying to fit the ratings to the results. So, using a naive win=1, loss=0 metric, one of the coins gets a big boost and the other gets really knocked down, even though they are in actual fact equal. My method says that whilst one did win, it wasn't convincing, it might have been luck, and as such it gets only a smaller award: whilst there is evidence it might be better, there isn't much, only 0.12...

In fact, you have given me an idea. I shall create a data set consisting entirely of 50:50 match results for, say, 300 games, and then 'rate' the teams. That way I can show how Elo over-rewards tight wins, and how my method is fairer in comparison.
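That experiment is easy to sketch. Below, ten equal 'teams' play 300 best-of-ten coin-toss matches, rated once with the standard win/loss rule and once with the luck-discounted credit; the team count, game count, and K-factor are all arbitrary choices:

```python
import random
from math import comb

def tail_prob(heads, tosses=10):
    """Chance of doing this well or better by luck alone with a fair coin."""
    return sum(comb(tosses, k) for k in range(heads, tosses + 1)) / 2 ** tosses

def rating_spreads(n_teams=10, n_games=300, k=32, seed=1):
    """Return (standard Elo spread, margin-aware Elo spread) after
    rating identical teams on the same sequence of coin-toss games."""
    random.seed(seed)
    std = [1500.0] * n_teams
    adj = [1500.0] * n_teams
    for _ in range(n_games):
        a, b = random.sample(range(n_teams), 2)
        heads = 5
        while heads == 5:  # replay drawn matches
            heads = sum(random.random() < 0.5 for _ in range(10))
        winner, loser = (a, b) if heads > 5 else (b, a)
        win_credit = 1.0 - tail_prob(max(heads, 10 - heads))
        for table, credit in ((std, 1.0), (adj, win_credit)):
            exp_w = 1 / (1 + 10 ** ((table[loser] - table[winner]) / 400))
            delta = k * (credit - exp_w)
            table[winner] += delta
            table[loser] -= delta
    return max(std) - min(std), max(adj) - min(adj)
```

Because the discounted credits stay much closer to the 0.5 expectation, the adjusted ratings should remain far more tightly bunched around 1500 than the standard ones, which is exactly the point being made.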


Tom said...

Ok, ok, last post, just to finally clarify that luck is being apportioned correctly.

Take the example of the match between team A and team B, where A wins by five points. We know that a five-point advantage corresponds to a 60% chance of winning the match. Prior to the match, the Elo ratings were used to give the expected winning chances of the two teams, and it turns out Elo says there is a 60% chance that team A will win, which converts to an expectation that they will win by five points. As we said, they achieved exactly this, yet despite the team performing exactly to expectations, suggesting that their rating and that of their opposition are fair, the Elo system still hands over 0.4 times some constant as a reward for winning. My method says: well done, you performed as expected; you get 0.6, your actual performance, minus 0.6, your predicted performance, which means you get nothing, because your rating looks accurate.


James said...


I agree with the second half of your system, i.e. once you have apportioned credit to the two teams after the game, your Elo method seems perfectly appropriate. In my system, if a 200-rated team plays a 100-rated team, they must get 66% (200/(200+100)) of the game outcome function to keep their rating at the same level, as that is the "expected" result.

But I have to confess your skill-vs-luck reasoning still seems flawed to me. You are explaining it clearly, but I think the reasoning is wrong.

In statistical terms you seem to be confusing two things:
1) the p-value of a particular outcome under a null hypothesis (a 40% chance of a 5+ point win if the teams are equally skilled);

2) the probability that the null hypothesis is correct (a 40% chance that the two teams are actually equally skilled).

In "normal" (frequentist, as opposed to Bayesian) statistics you only ever deal with scenario 1. Even if you reject the null hypothesis because p < 0.05, or whatever threshold you set for rejection, you never actually calculate the probability that the null hypothesis is correct.

One thing that has occurred to me: is the 40% the chance that specifically team A would win by 5+ points (in which case team B also has a 40% chance of winning by 5+)? If so, there is an 80% chance that one team or the other would win by 5+ points, so perhaps your luck:skill breakdown should be 80:20 rather than 40:60.

I think you should work out the probability of team A winning by exactly 5 points (not 5+) if team A is 1 point better than team B on average, if team A is 2 points better on average, and so on, and do the same for team B being better despite losing.

Add all of these probabilities up, sum the probabilities where A is better, and divide the two numbers; that gives you an indication of the percentage chance that A is better than B given their 5-point win.
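Under a toy model of scoring, this recipe can be carried out directly. The sketch below assumes game margins are normally distributed around the true strength gap; the 13.5-point standard deviation, the ±30-point prior range, and the half-point grid are all assumed values, not measured ones:

```python
from math import exp, pi, sqrt

def prob_a_better(margin, sd=13.5):
    """Probability that team A is genuinely better than team B, given
    that A won by `margin` points: likelihood of the margin at each
    candidate strength gap, flat prior, posterior mass on gap > 0."""
    def density(x, mu):
        return exp(-((x - mu) ** 2) / (2 * sd ** 2)) / (sd * sqrt(2 * pi))
    gaps = [g / 2 for g in range(-60, 61)]  # -30.0 .. +30.0 points
    like = [density(margin, gap) for gap in gaps]
    better = sum(l for gap, l in zip(gaps, like) if gap > 0)
    return better / sum(like)
```

With these assumptions, a 5-point win implies only about a 63% chance that A is the better team, which illustrates how little a single close game proves.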


Tom said...

I will be honest, I cannot see how I could possibly be apportioning the points wrongly. All the measure does is say whether they played as expected, better, or worse, simply to a finer degree than the win-or-lose method. I calculate their prior chance given their prior rating, and then their performance using the spread from the actual game. That way, a team that wins by more than expected is always rewarded, according to how many points better than expected they were, converted to a probability; a team that performs as expected stays the same; and a team that underperforms loses out.
All I am doing is putting a metric on how much a team 'won', and the way I am doing it is in fact equivalent to what you suggested, but much quicker. It is only necessary to consider two league-average teams in order to obtain a spread distribution for the whole league, since I can retrospectively add however many points I like to team A or team B to see how important a certain spread is.


Tom said...

I should point out that this method is a form of Bayesian inference, rather than being of the 'normal' style. In essence it adjusts ratings according to the size of the discrepancy between the results and the ratings: ten extra points are worth a lot more to a team expected to be five points better than to one expected to be ten points better, even though both performed ten points better than expected.

James said...

I think your system is fine in setting a target that the better team needs to achieve, rather than just using wins and losses. We will have to agree to disagree on the first step of your system, i.e. how the game outcome function is generated. Best of luck developing the system, and hopefully we will do battle on Ken Massey's site during the college football season.
