Friday, February 24, 2012

Spygate: The Effectiveness of Cheating

by Paul Benjamin

Enough time has passed to evaluate the effect of Bill Belichick's cheating. The cheating took place from 2000 through 2006 and was ended early in the 2007 season, giving five years of data since.

The known cheating, as revealed by Eric Mangini, consisted of two components. First, the Patriots would tape opponents' defensive hand signals. This permitted coaches to correlate the signals with the defensive alignments and figure out what each signal meant. Second, the Patriots used unregistered radio frequencies, so that the second time they played a team the offensive coordinator could watch the defensive signals and choose the perfect play to tell the quarterback. Normally, the referee cuts the registered radio frequency 15 seconds before the snap, so the offensive coordinator cannot communicate with the quarterback after the defense makes its substitutions; but the Patriots were the only team in the league whose radio equipment could broadcast on multiple frequencies simultaneously. After the referee would cut the registered frequency, the quarterback could still hear the coordinator on the other frequency, so he could be told the defensive alignment he was facing and what play to call.

So the plays and blocking schemes were always perfect ones to exploit each defensive alignment.

To measure the effect of this scheme, we need to isolate pairs of games in which the Patriots would be obtaining tape then using it. Some teams are played only every three years, so that data is not likely to show much. Teams in the same division are played twice every year and tape from the previous year can be used, so the effect of the cheating scheme may be harder to detect. I started by analyzing games played in the same season against out-of-division opponents. The second game against such an opponent is always a postseason game, so this has the advantage of measuring postseason success. After looking at these games, I saw that the Patriots have twice faced a division opponent (Jets) in the postseason, so I added those games, too. So the analysis below is of all opponents the Patriots played in the regular season and postseason of the same year. There are 30 such games.

The overall result: during the spying era, the Patriots were 4-5 during the regular season but went 6-2 against the same teams in the postseason, so they did much better the second time around. Since the cheating was ended, the Patriots have gone 5-2 in the regular season but dropped to 2-4 in the postseason, with their scoring falling by about 10 points per game in the postseason; they are doing much worse the second time, and primarily because their offense is tanking. This is despite the Patriots scoring more points per game during 2007-2011 than during 2000-2006, which suggests they had more talent.

The numbers indicate that the cheating scheme was worth more than a touchdown a game.

Results of Patriots' games, showing only opponents they played both in regular season and postseason:

After cheating was ended:
Year  Opponent             Regular Season            Postseason
2011  Denver Broncos       W 41-23 (Away, week 15)   W 45-10
      New York Giants      L 20-24 (Home, week 9)    L 17-21
2010  New York Jets        L 14-28 (Away, week 2)    L 21-28
      New York Jets        W 45-3 (Home, week 13)    --
2009  Baltimore Ravens     W 27-21 (Home, week 4)    L 14-33
2007  San Diego Chargers   W 38-14 (Home, week 2)    W 21-12
      New York Giants      W 38-35 (Away, week 17)   L 14-17 (Super Bowl)

Summary                   Regular Season   Postseason
Points Scored per Game    31.9             22.0
Points Allowed per Game   21.1             20.2

While cheating:
Year  Opponent             Regular Season            Postseason
2006  New York Jets        W 24-17 (Away, week 2)    W 37-16
      New York Jets        L 14-17 (Home, week 10)   --
      Indianapolis Colts   L 20-27 (Home, week 9)    L 34-38
2005  Denver Broncos       L 20-28 (Away, week 6)    L 13-27
2004  Indianapolis Colts   W 27-24 (Home, week 1)    W 20-3
      Pittsburgh Steelers  L 20-34 (Away, week 8)    W 41-27
2003  Tennessee Titans     W 38-30 (Home, week 5)    W 17-14
      Indianapolis Colts   W 38-34 (Away, week 13)   W 24-17
2001  St. Louis Rams       L 17-24 (Home, week 10)   W 20-17 (Super Bowl)

Summary                   Regular Season   Postseason
Points Scored per Game    24.2             25.8
Points Allowed per Game   26.1             19.9


Andrew said...

Do you have a link for the radio frequency allegation? I'm pretty sure that's an unsubstantiated rumor, yet you are writing as if it is a given fact. You're also drawing conclusions from tiny samples here. The numbers don't really indicate anything because you don't have enough data to draw any definitive conclusions.

In each of your pre- and post-cheating buckets, the Patriots had 3 games decided by a touchdown or less. In one bucket they went 3-0; in the other they have gone 0-3. That is at least as likely (probably more likely) to be due to simple luck as to anything else. Ascribing it to cheating is disingenuous.
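(The coin-flip intuition behind this comment can be checked with one line of arithmetic. A sketch, assuming games decided by a touchdown or less are roughly 50/50 propositions:)

```python
# If close games are roughly coin flips, how surprising is a 3-0
# (or an 0-3) record across three such games? Not very:
p_sweep = 0.5 ** 3  # probability of any one specific 3-game record
print(p_sweep)      # 0.125, i.e. one chance in eight
```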

Boston Chris said...

Agree with Andrew above. This entire article smacks of a hack job: small sample size, rumor-mongering, etc. This post, and you, have now lost a lot of credibility.

Boston Chris said...

Oh, and on top of all of that, your conclusion doesn't even make sense, even with your limited sample size. According to your analysis, points scored per game stayed roughly equal "while cheating"; it is points allowed per game that went down by almost a touchdown. Both of your accusations of cheating should have helped the offense, not the defense. So, on top of the above issues, your conclusions don't even follow from your flawed analysis.

Andrew said...

Correction on my earlier comment: the Patriots went 3-1 in games decided by a TD or less in the "cheating" bucket, not 3-0. My original point, that the Patriots had better luck in close postseason games during the 01-06 period than in the 07-11 period, still stands. That, more than anything, is why there is a difference in their postseason record pre-and-post 2007.

Paul Benjamin said...

OK, you two. So you're Pats fans. I feel for you. Here are a couple of links about the radio frequency issues:
1 -
2 -

You think Belichick spent all that effort taping defensive signals for nothing? What good were those tapes if they could not use the info? You've lost all credibility with me if you don't realize that Mangini told us what Belichick was doing. Check the news stories and see both of his charges. He was a Pats coach, and he *knew* what they were doing. He revealed it, and it's obvious.

As far as the sample size issue: you're clueless. It's been 5 years. How long do you want to wait, 20 years? Tell me what an acceptable sample size would be. Is it 1000 games? He cheated for 7 years, and it's been 5 years since. It's clear that he's fallen off the table. He was great when he cheated, and he has no championships since. He's a good coach but not a great one. Think Schottenheimer.

How can you think "luck" is the factor when the NFL already established that he cheated? The point of science is to test a hypothesis. We already know he cheated. The only remaining question is how much it helped. It's more than a touchdown. And if you don't understand that improving an offense's ability to possess the football helps the defense, then you are really ignorant of football.

Michael Beuoy said...

Neither of your links substantiates, or even touches upon, the following:

"After the referee would cut the registered frequency, the quarterback could still hear the coordinator on the other frequency, so he could be told the defensive alignment he was facing and what play to call."

And before you start poisoning the well, I'm a Colts fan. My hatred for the Pats is hard-coded into my DNA.

Personally, I think the fault lies with Wes Welker. The Pats were great without him, and now they have no championships since his arrival.

Paul Benjamin said...

OK. Here are a few links:
1 -
2 -
3 -

This issue has been floating around for a while, such as:
4 -
5 -

Now, none of this is legal proof. For that, we'd have to have in-game observation of the frequencies. But this is not a courtroom; it's the court of public opinion. They recorded the signals, decoded them, and got the info to their QBs somehow. It really doesn't matter how, but it appears to have been via extra radio frequencies. I do recall reading (but can't find the link) that the other 31 teams' equipment was examined and could broadcast on only one frequency at a time.

I'm not really a Pats hater. I just don't like Belichick because he cheated. I used to drink the Kool-Aid: that he had figured out something amazing between his times in Cleveland and New England and become a supercoach. But now I realize he figured out how to bring modern industrial-espionage techniques into the NFL.

Look, I'm sure there are other coaches who cheat. Some pipe crowd noise onto the field, etc. It's not like Belichick is the only guy who's looked for that kind of edge. The reason I did this analysis in the first place was to see how much it helped. People were saying it didn't really make any difference, so I looked at the games where it would be clearest, and it's more than a touchdown. They've gone from being nearly unbeatable in the postseason to being very beatable.

Anonymous said...

The statement "nearly unbeatable in the postseason" runs counter to the ideals of the ANS website. Why? Because they won their 3 Super Bowls by only 3 points each, which tells us any of those games could easily have gone the other way.

I believe Belichick cheated. But your article in no way identifies how much his cheating helped. It doesn't matter how loudly you yell; your argument is very weak.

Andrew Foland said...

There is a standard statistical technique for establishing that something mattered. Namely, you ask, "Can I statistically reject the null hypothesis that these two W-L records are drawn from the same underlying distribution, at level of p=.05?" (The choice of .05 is arbitrary but conventional.)

Then carry out the same null-hypothesis rejection analysis on the non-repeated games in the two time periods to rule out the "Wes Welker" effect, or any other differences.

So, the statistical program is: demonstrate significant (p=.05) rejection of the null hypothesis in the specific samples that taping should have affected, followed by demonstrable non-rejection in extraneous samples covering the same two time periods.

Carry out that statistical program (i.e. drive it to p-level rejection of null hypotheses, rather than simply quoting W-L records) and there will be much less to quibble about. (The null rejection tests, by the way, will by themselves account for the relative sizes of the samples.)
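(To make the program above concrete: the comment names no specific test, but one standard choice for comparing two small W-L records is Fisher's exact test. A sketch, using the postseason records quoted in the article, 6-2 while taping and 2-4 after, and assuming scipy is available:)

```python
# Can we reject the null that the two postseason W-L records come
# from the same underlying win probability? Fisher's exact test on
# the 2x2 table of wins and losses:
from scipy.stats import fisher_exact

table = [[6, 2],   # taping era postseason: wins, losses
         [2, 4]]   # post-taping era postseason: wins, losses
stat, p = fisher_exact(table, alternative="two-sided")
print(p)  # roughly 0.28 -- on W-L records alone, the null survives
```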

Paul Benjamin said...

OK, Andrew, I will do that computation. But I do not agree with it. Testing a null hypothesis is appropriate when you want to see if an effect exists. But if you already know it exists, then the significance level is arbitrary. Suppose I have a rigged die that rolls a 6 20% of the time. With only a handful of rolls, such a test will not detect it. But if I tell you that it's rigged, then you're just measuring the effect. We already know Belichick cheated. We're not trying to see if he did. These are two distinct populations. The Pats went from gaining a touchdown in the second game to losing a touchdown. It's just a measurement of a known effect. The accuracy of the measurement is certainly subject to scrutiny, but that's a different question.

I have to work for the next few days, and will get to it afterwards. Questions so that this will satisfy you: do you want your null hypothesis to include point differential? Do you want the population to be all NFL postseason games in history?

Andrew Foland said...

We don't know an effect exists, we know a difference exists. The question is whether the known difference actually has any impact on the underlying distribution of either W-L or PF. (Impact on PA is second-order in defensive hand signaling.)

I don't want anything, it turns out; I find it hard to care about this question, except in the most abstract sense that it's interesting to know how much of game outcome is at all attributable to play-calling, which this question can illuminate a little bit. Mostly I just want standard tests applied so nobody has to argue :)

I will point out that if you believe cheating explains what's going on, the most natural pattern is to see that in the non-cheating era, the PF is the same in the postseason as in the regular season, while in the cheating era, PF is higher. This is not at all the pattern. Instead, in the non-cheating era, there is an average 10-point loss in scoring; while in the cheating era they are very nearly equal. So you somehow have to posit that "naturally, NE loses 10 points the second time they see an opponent, but cheating used to help them gain it back." If so, this suggests that the much more interesting question is why NE would lose 10 points naturally.

Anonymous said...

Change the distinction from regular-season/postseason to 1st-game/rematches. Basically, I think the numbers should look like this:

After cheating ends
First game: 4-2; rematches: 3-4

During cheating
First game: 4-4; rematches: 6-3

Paul Benjamin said...

The first thing I examined was what happens to PF between the regular season and the postseason. Overall, between 2001-2011, a team scored 1.68 points less per playoff game. I broke that down by each team and year, which shifted the mean down to -3.58 because teams with bigger negative numbers tend to get eliminated sooner and play fewer games. The data is here:

For each team, I calculated the season PF, the postseason PF, and the difference. This gave 132 datapoints. It turns out to be especially easy to analyze. The mean difference is -3.58. All five New England teams in the taping era score above that, and all four New England teams in the post-taping era score below that. If our null hypothesis is that taping did not affect New England scoring in the postseason (or more precisely that New England would be as likely to score above or below the postseason mean independently of taping in the regular season), then this has probability .5^9=.002, so we can reject this hypothesis and conclude that the taping did affect their postseason scoring. Analyzing it this way avoided the issue of whether they played division opponents in the postseason.
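(For anyone checking the arithmetic, the .5^9 calculation is a one-sided sign test, and the same number falls out of an exact binomial test. A sketch, assuming scipy:)

```python
# Under the null, each of the nine New England playoff seasons is
# equally likely to land above or below the league mean drop of
# -3.58; all nine landed on the side the taping hypothesis predicts.
p_sign = 0.5 ** 9
print(p_sign)  # 0.001953125, i.e. about .002

# The same p-value via an exact binomial test:
from scipy.stats import binomtest
result = binomtest(k=9, n=9, p=0.5, alternative="greater")
print(result.pvalue)  # 0.001953125
```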

Then I briefly examined the general question of how much we could expect New England's PF to change from the regular season to the postseason. Taking the postseason team each year that had the highest regular-season PF, we have 11 teams with an average change of -5.68 points. This represents about 18% of their regular-season points. In the entire data set, the drop of 3.58 is about 14.5% of regular-season points. I did not do a complete regression analysis here, but it appears that very high-scoring teams lose a bit more of their PF. From the taping era to the non-taping era, New England's regular-season scoring increased from 24 points/game to 32 points/game, so we could expect about 1.25 more of a drop in the postseason. In addition, to estimate the advantage gained from taping, we assume that without it they would have performed at the mean of -3.58; they actually performed at +1.29 during the taping era, giving an advantage of +4.87. Taking this away in the post-taping era, plus adding their additional offensive strength, gives a total expected drop of 6.02. Now, this isn't 10 points, but it is well within one standard deviation (8.16).

Andrew said...

Even if there is a noticeable, measurable difference (and I'm still not convinced that there is; you're correlating about a half season's worth of games with another 6-9 games, which are ridiculously tiny samples that will be prone to large amounts of variance and heavy influence from outliers), that does not mean that you can say, "It's because of the cheating."

You: "[S]o we can reject this hypothesis, and conclude that the taping did affect their postseason scoring."

Um, no. You can't. Even if you reject your null, you cannot ascribe the difference to cheating. I know that you really, really want to, but correlation does not equal causation.

As Andrew Foland said: "We don't know an effect exists, we know a difference exists."

That is a crucial distinction which, in your zeal to demonize Bill Belichick, you are completely ignoring. Again, even if there is a significant difference here, you cannot simply ascribe it to cheating and say that that was the cause of the difference.

Paul Benjamin said...

Hmmm, you said "I find it hard to care about this question", but you do seem to care. And you "just want standard tests applied, so nobody has to argue". I applied standard tests, and you're arguing. I think you care deeply since you're from Boston, and you will argue no matter what I do. I asked you what population you wanted this compared to, and you did not answer. I applied it to all postseason games in 2001-2011 to get games from the same era, and you criticize it for being only half a season's worth of games. So what population will satisfy you?

Of course correlation doesn't equal causation! This is statistics, which cannot actually prove anything but merely reveals likelihoods. The probability of chance was .002. I carried out the procedure you described in your post of Feb 26, and now you reject it. You're a fraud.

You reveal yourself as just another Pats fan living in denial. I am not demonizing Belichick. He has a very sophisticated intelligence, and infused an old cheating scheme with new technology. I am just measuring the effect of it. It's more than a touchdown a game.

And please explain how I'm correlating half a season's games with 6-9 games. I'm correlating all postseason games from the past 11 years with 22 New England postseason games in the same period.

Andrew said...

I am not Andrew Foland. We are two different people. That this escaped you, even though I made it explicitly clear in my last comment, does not surprise me.

Admittedly, I don't quite understand what the hell you did in your last comment. Apparently you threw a bunch of stuff into a cocktail shaker and found some significance of some kind. Regardless, anyone who understands how scientific research works could tell that your conclusion ("It's all because of the cheating") is not endorsed by your data. Whatever your data says, you don't know what is causing that difference. That you are presuming to know what is causing that difference is the very reason that your analysis is flawed.

In your initial post, you took a sample of 6-9 regular-season games and compared those data with 6-9 postseason games that occurred during the same time frame. You then did the same comparison for a different time frame. And then you claimed that you could reduce all variance within those data to one specific variable. That is ridiculous. The Patriots' rosters have changed. Opponents' rosters have changed. The scoring environment within those seasons has changed. Turnovers have a large effect on game outcomes, even though they are largely the product of random variance (over small samples, which is what you have). You did not make any attempt to control for those, nor for untold other variables that could affect the outcomes of those matchups.

I don't understand how you, or anyone, could confidently ascribe all of the differences you supposedly found to one specific variable when there are a myriad of other variables that could affect the differences here.

This is a great example of how not to conduct a research experiment. All you have proven is that you are capable of bad science.

Paul Benjamin said...

Sorry I misread your identity. I was surprised and admittedly upset, because I thought Andrew Foland was objective and straightforward. It turns out he was, and I apologize to him. You are not. And you seem confused. Have you actually looked at the spreadsheet I pointed to? It doesn't seem so. Please take a look at it. If that's what you're referring to with your "cocktail shaker" comment, then yes, you don't understand what I did, and it would be better if you did not criticize it until you do. I don't really respect those "anyone who knows this ..." remarks. If you are one of those who knows this, then make quantitative criticisms of the data or conclusions on the spreadsheet.

Andrew said...

You continue to ignore any of my criticisms. He asked you to perform a test of significance on your original data. You did not do that. We all know why you didn't (because you would not have found any significance). Instead, you concocted a different set of data, in some sort of opaque, obtuse fashion, because it gave you significance. This is putting the cart before the horse. That is bad science, that is backwards science and it is disingenuous.

You are not interested in discourse or science. You merely have an axe to grind with Bill Belichick.

I think we're done here.

Paul Benjamin said...

Andrew, he (Andrew Foland) asked me the following: 'There is a standard statistical technique for establishing that something mattered. Namely, you ask, "Can I statistically reject the null hypothesis that these two W-L records are drawn from the same underlying distribution, at level of p=.05?"' In this way, I would perform a test of significance on my data.

I asked him if he had any specific underlying distribution in mind, and he said no. I chose all playoff teams from 2001-2011, and investigated the null hypothesis that these two W-L records could have been randomly drawn from that distribution at p=.05. I found that the measured p was .002, which is less than .05, so the null hypothesis was rejected. Do you really understand what that means, and specifically how that approach accounts for the "myriad of other variables" you talk about?

Andrew said...

Wait... so all you did was multiply .5 to the 9th power? That's how you got your .002 number? So you didn't really do a statistical significance test at all, meaning you don't really have a p value. Well, yes, I do "really understand what that means." It means that you don't know what you're doing, even more so than I had previously suspected. Since it's becoming increasingly clear that you're out of your element here, I'll just do the tests for you.

Your original claim was that the difference between the Patriots' performance in the postseason when facing teams that they've previously played in the same season is proof of cheating. So I ran a t-test on your original data. I calculated the Patriots' differentials between points scored versus an opponent in the regular season and that same team in the postseason. (I used averages for the two Jets regular season games in each bucket.) That gave me 8 pre-Spygate differentials and 6 post-Spygate differentials. Punching those into a t-test, I got p=.15. So you cannot reject your null.
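(That t-test is easy to reproduce. A sketch, with the fourteen differentials transcribed from the article's tables, the two Jets regular-season games in each era averaged as described, and scipy assumed:)

```python
# Postseason points scored minus regular-season points scored
# against the same opponent, read off the article's tables.
from scipy.stats import ttest_ind

taping = [18, 14, -7, -7, 21, -21, -14, 3]    # 2001-2006 rematches
post_taping = [4, -3, -8.5, -13, -17, -24]    # 2007-2011 rematches

t_stat, p = ttest_ind(taping, post_taping)    # two-sample t-test
print(round(p, 2))  # ~0.15, matching the comment: the null survives
```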

Just for shits and giggles, I ran t-tests within each bucket too. I ended up with p=0.74 for pre-Spygate and p=0.16 for post-Spygate. So there was no statistically significant difference between regular season and postseason scoring within each time frame either.

In sum: using the data points that you selected, no statistically significant difference was found within each pre- or post-Spygate bucket or between each bucket when looking at differences between regular season and postseason scoring.

Maybe now you will finally stop this nonsense?

Paul Benjamin said...

Oh brother. You're still ignoring the spreadsheet I posted and using the original data. Others had already posted that they found that sample too small and not convincing, so I did the work to put together a larger population. My null was on that population, but you are trying to be clever and ignore that, and apply the null to the small original sample. I guess you realize the null is correctly rejected on the spreadsheet data, and just don't want to face it.

Since you are such an expert, try running your t-test on my spreadsheet data, please, and post the details to educate those of us who are so inferior in knowledge.

Andrew said...

I guess the answer to my question is "no."

I find it greatly amusing that the person who wrote this:

"As far as the sample size issue: you're clueless. It's been 5 years. How long do you want to wait, 20 years? Tell me what an acceptable sample size would be. Is it 1000 games?"

is now complaining that the sample size I used (which you forcefully defended in the above quote) is inadequate. Hilarious. If the data that I used is so flawed, then why did you post it in the first place? Why did you defend it so vigorously when it was challenged?

Of course the sample is too small, which is part of why no significance was found. But that is what I, and others, have been saying from the beginning, a complaint you dismissed right up until it was shown that there was no statistical significance in your data. Funny how that works. That you are suddenly in agreement about the sample size being inadequate just further shows your disingenuousness (which has already been shown in spades).

There is no need to do anything with that spreadsheet of yours. That data was collected in response to Andrew Foland, who articulated how you could attempt to support your hypothesis through statistical significance tests. Which is what you then attempted to do. Oh, wait, you didn't. You essentially made up a p value instead of performing actual statistical tests. That notwithstanding, these were his words:

"So, the statistical program is: demonstrate significant (p=.05) rejection of the null hypothesis in the specific samples that taping should have affected, followed by demonstrable non-rejection in extraneous samples covering the same two time periods."

As I just showed, we cannot reject the null in the specific samples that taping should have affected. Mr. Foland's definition of success required a rejection of that null, so there is no reason to examine the extraneous samples. If you want to take the time to perform a t-test on your spreadsheet data, then be my guest.

Of course, your challenge to me to perform this test simply raises the question of why you did not do so yourself in the first place. He asked you to perform tests of statistical significance, which you did not do. Assuming that you know how to perform significance tests (perhaps a flawed assumption), then why did you feel that you didn't need to use any? You spent all of that time collecting data, yet you didn't feel the need to plug the numbers into a significance test? And now you expect me to do so? Sorry, I already did that for you once. I'm not doing it again. Especially when I've already discredited your claims.

It is at this point that I should remind you that the burden of proof here is on you, not me. You brought this up. You made the decision to post this here. You decided what data to use. You made the claim that your data "indicate that the cheating scheme was worth more than a touchdown a game." Don't cry foul when someone shows that your own data indicate no such thing.

If you feel like you can find something which backs up your original hypothesis, which up to this point you have failed to do, then please do so. Until such an occasion (which I will eagerly await with bated breath), you will not hear (err... read) any more from me.

Best of luck.

Paul Benjamin said...

1) Clearly I saw the need to produce more and better data to back up my claim. That's why I put in the time to track down and type in all that data.

2) I didn't feel the need to perform a t-test because the conclusion was intuitively obvious. If in the room you're in now, all the nitrogen molecules were to move to one side of the room and all the oxygen to the other side of the room, would you need to perform a t-test on the positions of all the molecules to see if it's not random? That's what happened here: all the taping numbers were above the mean and all the non-taping were below. Any difference test must show this.

The reason I asked you to do it is that I did it while typing my previous response, and I knew the answer. It took 2 seconds. The result is p=.004, and is on the spreadsheet now, which is here:

This was trivial. You could have done it quickly if you are as knowledgeable as you claim.

I posted the spreadsheet a while ago, and the whole purpose of doing that is to allow others to work with the data and see if there are flaws, and if necessary to point out a better way to do things. You could have run the t-test in far less time than it took to type in another aggressive and insulting post. The only thing I can conclude is that your true purpose is to write as many such posts as you can.
