Sunday, September 19, 2010

An Analysis of Placekicker Salary Distribution in the NFL

This contribution comes from the desk of Dan Schlauch.

Abstract:

The importance of an effective kicking game in the NFL is undeniable. Kickers are regularly the highest-scoring players on a team and are repeatedly asked to perform in high-stakes situations. The difference between the best-performing kicker (Janikowski) and the worst-performing kicker (Brown) in 2009 totaled over 35 points, which would likely make the difference in several games. Additionally, the mean salary of kickers was the lowest in the NFL at just over $1.5 million.

However, despite these considerations, this analysis shows that kickers receive a disproportionately high percentage of team salary and that money spent on kickers has a startlingly low return on investment.


Introduction:

The database used in this study was collected from various play-by-play sources using a Java parsing tool. The data covers detailed information on every NFL play in the modern era (since 2000). This totals 2,654 games and 429,000 plays.

The purpose of this study is to evaluate the production of kickers relative to their salary. We should be calculating the marginal benefit of each marginal dollar spent on kicking. Kickers receiving more than the average salary should produce more than average, and vice versa. Our goal is to evaluate the degree to which this is true. To evaluate a kicker’s performance relative to a standard benchmark (in this case, the expected value of average performance), we need to perform the non-trivial task of determining an expected kicker performance. Common kicker metrics, such as percentage and points, are effectively useless for this study, because a kicker’s raw success rate is heavily influenced by the difficulty of the kicks he happens to attempt. A complete study needs to evaluate every kick performed by every kicker over the past decade and compare it to the expectation of success derived from the collection of all NFL kicks, based on a reasonable number of measurable variables.

Methods:


First, we must build the expectation database. From there we can plot field goal success against the most dominant variable: distance. At each yard line, success rates over the past 10 years were evaluated, and a local regression was applied to the resulting figures.
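The smoothing step can be illustrated with a small sketch. The article does not say which local regression was used, so the version below is only one possibility: a fixed-bandwidth, tricube-weighted local linear fit. The function names, the bandwidth and the success rates are all hypothetical and are not taken from the study's data.

```python
def tricube(u: float) -> float:
    """Tricube kernel weight for a scaled distance u >= 0."""
    return (1 - u**3) ** 3 if u < 1 else 0.0


def local_linear_estimate(distances, rates, target, bandwidth):
    """Weighted straight-line fit around `target`, evaluated at `target`."""
    weights = [tricube(abs(d - target) / bandwidth) for d in distances]
    total = sum(weights)
    if total == 0:
        raise ValueError("no observations within the bandwidth")
    mean_d = sum(w * d for w, d in zip(weights, distances)) / total
    mean_r = sum(w * r for w, r in zip(weights, rates)) / total
    spread = sum(w * (d - mean_d) ** 2 for w, d in zip(weights, distances))
    if spread == 0:
        return mean_r  # only one distinct distance carries weight
    slope = sum(
        w * (d - mean_d) * (r - mean_r)
        for w, d, r in zip(weights, distances, rates)
    ) / spread
    return mean_r + slope * (target - mean_d)


# Hypothetical raw success rates (makes / attempts) by kick distance in yards.
distances = [40, 42, 44, 46, 48, 50, 52]
raw_rates = [0.78, 0.80, 0.70, 0.69, 0.60, 0.63, 0.52]
smoothed = [
    local_linear_estimate(distances, raw_rates, d, bandwidth=7) for d in distances
]
print([round(s, 3) for s in smoothed])
```

The smoothed value at each distance would then serve as the distance-only ES for kicks from that spot.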

For each distance, we can calculate an expectation for each kick in the NFL based on the distance to the end zone. Each kick has an Expected Success (ES) between 0 and 1 and a binary result, 0 (miss) or 1 (good). For example, a kick from the 30 (a 47-yard FG) has an ES of .632. If the kick is good, the kicker will earn +.368 against his ES, or +.368 Expected Success Added (ESA). Conversely, he will earn -.632 ESA for a miss. The obvious benefit of this system is the low-bias, situation-specific evaluation of kicker production that normalizes for kick difficulty. Above-average kickers will have positive ESAs and below-average kickers will have negative ESAs. By comparing the ES with the result of every kick in a particular kicker’s season, we are able to compare every kicker’s season to the global average to determine what benefit, if any, that player had to his team.
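As a minimal sketch of the ESA bookkeeping described above (the function names are illustrative, not the author's), the per-kick and per-season calculations might look like this:

```python
def expected_success_added(expected_success: float, made: bool) -> float:
    """ESA for one kick: the result (1 for good, 0 for a miss) minus the ES."""
    if not 0.0 <= expected_success <= 1.0:
        raise ValueError("expected_success must be between 0 and 1")
    return (1.0 if made else 0.0) - expected_success


def season_esa(kicks) -> float:
    """Sum ESA over an iterable of (expected_success, made) pairs."""
    return sum(expected_success_added(es, made) for es, made in kicks)


# The 47-yard example from the text, where ES = .632.
print(round(expected_success_added(0.632, True), 3))   # a make
print(round(expected_success_added(0.632, False), 3))  # a miss
```

Because every kick is scored against its own ES, a kicker who attempts mostly long kicks is not penalized for the lower raw percentage that results.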

Our next step was to identify systemic differences (besides skill) that favored certain kickers over others in order to neutralize them. A kicker from a team that plays home games in a dome has the benefit of kicking under ideal environmental circumstances for half of his games. Ignoring uneven situational factors would likely result in a systemic bias in favor of kickers on certain teams regardless of their skill.

Distance is, by far, the most important variable in this equation. But for the purposes of completeness, several others, including dome/outside, temperature, wind and humidity need also be evaluated. Each of these factors was found to impact the analysis in some way, though not uniformly according to distance and with varying degrees of impact. Whether or not the kick occurred in a dome was the most impactful variable and also the easiest to evaluate because of the binary nature of the data. Unfortunately, the effect is not uniform, as we can see from the graph below. The benefit of kicking in a dome increases as distance to the goal increases, but we can calculate the benefit for each distance and adjust the model accordingly. The likelihood of a 52 yard field goal in a dome was approximately equal to the likelihood of a 47 yard field goal outside whereas the effect is negligible inside of 5 yards.

By including a variable for stadium design into the equation, we can better approximate the ES for a specific kick situation.


The effects of wind, temperature and humidity also played a role, though diminished. From the comparison graph below we can see that of those three factors, temperature had the greatest effect on success rate. Intuitively, low wind, high temperature and low humidity helped kickers, while the opposite hurt them. Humidity produced almost negligible effect, though the highest levels of humidity very slightly reduced kicking ES. These three factors were included in the data for outside kicks.



It is interesting to note that while cold temperature affects kicking linearly with distance, there is a distinct inflection point around the 20 yard line for the effect of wind. This is intuitive, as kicking in high winds has an exponential effect on difficulty with distance. Though the difference is slight, if given a choice, it appears that warm conditions are preferable for shorter (<38 yards) FGs, but calmer conditions are preferable for longer (>38 yards) FGs.

By combining the variables of distance, dome, temperature, wind and humidity, we can achieve an ES that describes the difficulty of any kick with a high degree of precision. Kickers who repeatedly outperform their ES are undeniably valuable and should be considered a premium talent in the league.

The next question is the following: Who, if anyone, is consistently outperforming ES? By simply observing kicker ESAs, we can know who has produced more or less than should have been expected. But we are also interested in knowing the confidence we can have in the repeatability of performance. Kicking, by nature, is a highly variable task. A kicker who was +3.0 ESA had an excellent year, producing 9 more points for his team than should have been expected of him given the circumstances of each of his kicks. However, in a sample size of only one season, it becomes difficult to know whether the success should be attributed to luck or skill, an extremely important distinction. Skill repeats itself while luck does not. To address this problem we need to assign a p-value to each score signifying its statistical significance.

We do this with an iterative approach, simulating each sample size (1 season, for example) and finding the percentage of average kickers who would be expected to outperform the real kicker given an identical set of kicks. We might have more confidence in the abilities of a kicker who is +6.0 ESA over 5 seasons with p-value = .01 (1% of average kickers would expect the same level of success over the same set of kicks) than a kicker who has +3.0 ESA with p-value = .15 over a single season, despite the higher rate of success.
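A rough sketch of this simulation follows. It assumes each simulated average kicker makes a kick with probability equal to that kick's ES, and it counts simulated kickers who do at least as well as the real one; whether the author counted ties this way is not stated. The function name, trial count and seed are illustrative.

```python
import random


def simulate_p_value(kicks, trials=20_000, seed=0):
    """Share of simulated average kickers who match or beat the real kicker.

    kicks: list of (expected_success, made) pairs for the real kicker.
    """
    rng = random.Random(seed)
    actual_makes = sum(1 for _, made in kicks if made)
    at_least_as_good = 0
    for _ in range(trials):
        # An average kicker makes each kick with probability equal to its ES.
        simulated_makes = sum(1 for es, _ in kicks if rng.random() < es)
        if simulated_makes >= actual_makes:
            at_least_as_good += 1
    return at_least_as_good / trials


# Three kicks with the probabilities from the author's example in the
# comments (.85, .45, .95), here assumed to have all been made.
example = [(0.85, True), (0.45, True), (0.95, True)]
print(simulate_p_value(example))
```

Since every simulated kicker faces the identical set of kicks, the total ES is the same for all of them, so comparing makes is equivalent to comparing ESA.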

We can look over entire careers (year 2000 on) to find any statistically significant performances. Using 79 kickers who attempted more than 5 field goals since 2000, we would expect the distribution of p-values to be uniform only if all kickers were essentially average. What we see is not far from it. Only 7 out of 79 kickers ranked in the top 5% of a comparable set of average kickers, and only one outperformed 99% of that set. If all kickers are essentially average, purely by chance we would expect about 4 kickers to be considered top 5% and about 1 kicker to be top 1%. What this indicates is that the distribution of skill among kickers in the NFL very strongly resembles the distribution of a set of completely average kickers with no discernible skill differences. The most productive kicker of the decade, Matt Stover, added only 4.2 points per season to his teams’ offensive production. Even this meager amount is likely due mostly to variance and is subject to regression to the mean in the coming years.
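The chance expectations quoted here are simple arithmetic, and the same setup gives a rough sense of how unusual 7 of 79 would be. The sketch below assumes each kicker's p-value is independent and uniform under the "all average" hypothesis; the binomial tail is an added illustration, not a calculation from the article.

```python
from math import comb


def binomial_tail(n: int, p: float, k: int) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_kickers = 79
expected_top_5 = n_kickers * 0.05  # kickers expected in the top 5% by chance
expected_top_1 = n_kickers * 0.01  # kickers expected in the top 1% by chance
print(round(expected_top_5, 2), round(expected_top_1, 2))

# Probability that 7 or more of 79 exactly-average kickers land in the top 5%.
print(binomial_tail(n_kickers, 0.05, 7))
```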

To put it all together we need to evaluate the ESA of each kicker against his perceived worth. We use salary cap space entering the 2009 season for all starting kickers and measure their production based on ESA. There is a very small Pearson correlation of .13 between ESA and salary cap value, indicating a very weak association between those who are paid more and those who produce more. When plotting ESA against kicker salary and applying a linear regression, we see almost no increase in ESA with additional salary expenditure. The exact calculations indicate that each additional point per season above the mean costs approximately $1.25 million when purchasing value through kickers.
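For readers who want to reproduce this kind of calculation on their own salary table, here is a minimal sketch of the Pearson correlation and regression slope. The salary and production figures are invented for illustration only, and the conversion assumes the regression's y-axis is in points per season and the x-axis in millions of dollars, which the article does not state explicitly.

```python
from math import sqrt


def _centered_sums(xs, ys):
    """Return (sum of cross-products, sum of squares of x, sum of squares of y)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy, sxx, syy


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    sxy, sxx, syy = _centered_sums(xs, ys)
    return sxy / sqrt(sxx * syy)


def ols_slope(xs, ys) -> float:
    """Least-squares slope of y regressed on x."""
    sxy, sxx, _ = _centered_sums(xs, ys)
    return sxy / sxx


def cost_per_point(points_per_million: float) -> float:
    """Millions of dollars needed to buy one extra point per season."""
    return 1.0 / points_per_million


# Hypothetical salaries ($M) and points added per season, NOT the study's data.
salaries = [0.5, 0.8, 1.2, 1.9, 2.5, 3.0]
points_added = [1.0, -4.0, 3.5, -2.0, 0.5, 2.0]
print(pearson(salaries, points_added), ols_slope(salaries, points_added))

# A slope of 0.8 points per $1M corresponds to the article's $1.25M per point.
print(cost_per_point(0.8))
```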

Discussion:

A common argument is the value of “clutchness” for kickers. The belief is that certain kickers perform better under pressure than others and that they earn their salaries through excellence in a handful of key situations in their career. The very nature of clutch kicking requires the situations to be sparse, and consequently to be subject to small-sample bias. It would be very difficult to determine whether a kicker was truly excellent in the clutch or whether this perception developed from simple variance within small samples. We can approach this problem by observing the league as a whole. In clutch situations, defined as a deficit of 0-3 points with less than 3 minutes remaining or overtime, the league as a whole deviates less than half of a percent from normal game situation averages. This indicates that the sum of kickers is essentially clutch-neutral, but leaves open the possibility that there are clutch kickers and “chokers” in equal proportions.
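The clutch definition can be written as a simple filter. The sketch below reads it as "trailing by 0 to 3 points, and either under 3 minutes left in regulation or in overtime"; that reading, the field names and the raw-rate comparison are assumptions, and the article may well have compared against ES rather than raw make rates.

```python
def is_clutch_kick(deficit: int, seconds_left_in_regulation: int, overtime: bool) -> bool:
    """deficit: points the kicking team trails by (0 means tied)."""
    close_game = 0 <= deficit <= 3
    late = overtime or seconds_left_in_regulation < 180
    return close_game and late


def clutch_gap(kicks) -> float:
    """Clutch make rate minus non-clutch make rate.

    kicks: list of (clutch: bool, made: bool) pairs containing at least one
    clutch kick and at least one non-clutch kick.
    """
    clutch = [made for is_clutch, made in kicks if is_clutch]
    normal = [made for is_clutch, made in kicks if not is_clutch]
    return sum(clutch) / len(clutch) - sum(normal) / len(normal)


print(is_clutch_kick(deficit=3, seconds_left_in_regulation=120, overtime=False))
print(is_clutch_kick(deficit=7, seconds_left_in_regulation=120, overtime=False))
```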

We can also borrow from other sports, such as baseball, where numerous studies indicate that clutchness is not a quality inherent to an individual in professional sports. The perception of clutch and non-clutch players is a result of small-sample variance. (http://www.baseballprospectus.com/article.php?articleid=2656)

Another consideration is the effect of kicker specialization. A kicker who excels at long-range FGs but under-performs in short-range situations would appear virtually average if he is used across a typical distribution of kick distances. However, if it is possible to specifically take advantage of a particular kicker’s skill set, then his value will become more apparent through this model.

In any case, any justification for kicker expenses will require a value to be attributed to higher salaried players that is above and beyond their ability to exceed average kicker production. Are those kickers clubhouse leaders? Or do those kickers make their money by excelling at kickoffs or serving as backup QBs?

Conclusions:

This study is an initial statistical analysis. While there remain areas for improvement and refinement, the conclusion is unlikely to change significantly: the cost of kicker salaries relative to their expected returns is enormous. Though they play a high-pressure, vital part of the game, the evidence shows that there is little difference between the perceived best and worst kickers in the league. Past successes and struggles have been shown to be extremely poor predictors of future performance. It is this parity that should drive the market price for this position down. Teams should not pay $3 million for a “proven” veteran kicker when that kicker is scarcely more likely to succeed next season than the unknown who makes the league minimum. The value-over-replacement for kickers approaches 0. A more in-depth future analysis will calculate marginal costs and benefits of other positions for use as a direct comparison and evaluation of team salary distribution. A team should reduce the share of salary it devotes to its kicker only if it believes that the return on another investment is greater than 1 point per $1.25 million.

14 comments:

Vince said...

Kickers also do kickoffs, so it's hard to evaluate salaries without including value from kickoffs. A kicker who is paid $1.25M over average may only add 1 point per season in FG value, but I'd bet that he also adds some kickoff value.

Focusing only on FGs, I wonder how much kicker salaries are driven by the goal of avoiding bad FG kickers. There might not be much difference between an average kicker and a very good kicker, but there could be a big difference between the bad kickers and the average ones. Since there are few established kickers (who have proven themselves to be at least averageish) available in free agency in a given year, teams will pay a premium to get one and avoid going with an inexperienced kicker who may turn out to be bad.

James said...

I think Football Outsiders also found that FG accuracy varies from year to year but kickoffs are much more consistent. The fact that teams will sometimes have a KO specialist on a roster implies that those teams see it as valuable.

Off the top of my head I think a touchback is worth about 0.5 points as the average start after a KO is the 27 yard line and 14yds=1point.

In 2009 the average TB% was 15%, so if your kicker does better than that he is valuable. Using this approach David Buehler of Dallas comes out on top, as he had 29 TBs in 77 KOs, which is 38%. An average kicker would only have had 11.5, so those extra 17.5 TBs are possibly worth about 9 points if you assume a TB is half a point. That is comparable with the best FG performance in your study, and if KO performance is more consistent than FG performance year to year this may justify having a KO specialist. And I am ignoring any effect that good KO guys have on KOs that aren't TBs, which could also add to their value, because I can't find any data on where KOs are fielded, just TB rate and average return.

Is there a place where your data tables and graphs are posted? I would love to have a look.

Well done on being the first in the revamped Advanced NFL stats community

James

James said...

I would also like to see graphs or charts along with your conclusions.

To respond to James' question about KO guys - David Buehler is an active member of the kickoff coverage unit and other special teams. Unlike most kickers who just stand around after kickoffs, Buehler sprints downfield with the rest of the unit and tries to make the tackle. He also played on the kickoff return team and punt team last year (I'm unsure if he still does this year now that he's kicking field goals). That must provide some, albeit very very little, marginal value, right?

Oh, and this year the coaches are encouraging him to sacrifice touchbacks in favor of placing his kickoffs around the one yardline so that the other team is forced to return the kickoffs. The idea is, combined with increasing hang time sufficiently, they can tackle the returner inside the 20. Personally I think I'd just take the deep kickoffs and high touchback percentage and avoid mistakes leading to long returns, but it's an idea worth trying.

Dan Schlauch said...

I think you both might be right that I've been underestimating the value of kickoff distances here. I had assumed it wouldn't have a huge impact, but it looks like it might given how consistent kickoffs are.
Next, I will look into kickoff distances and see how well they correlate with salaries. My pre-analysis guess is that the best and worst kickers will get about 5 yards above and below average which will yield about 2-3 yards of field position. Not insignificant when it occurs 5 times per game.

Regarding the charts, I sent them in with the word document. I don't know if Brian plans on having the community support images.

Laura said...

Great Analysis.
How did you decide what percentage of (average) kickers would outperform the selected kicker from the identical situations? How does this account for sample size (number of kicks by the selected kicker in a season)?
I'm just trying to understand how you got the statistical significance part; that's a great touch in recognizing that much of kicking is random and trying to find a way to measure that.

Dan Schlauch said...

Beans-

For each kick, I randomly chose a number between 0 and 1 to simulate an average kicker's success.

For example, if a kicker took 3 field goals, of probability p1=.85, p2=.45, p3=.95 we could simulate this by picking random values (<1,>0) for each kick and see how the simulated average kicker did. The random values .76, .58, .25 would mean the simulated kicker hit the 1st and 3rd, but missed the 2nd.

Extend this over a whole season (or career) and then run it tens of thousands of times and you can see how the real kicker does relative to the simulated kickers.

Ed A said...

When I first saw Dan's article the graphs and charts were not visible. I've checked the document and they now appear. I will get them posted this morning. Apologies to Dan.

Michael L said...

Enjoyed the article Dan. I have a couple suggestions on ESA, and a few concerns about the soundness of the perf/salary analysis:

- First, you factor in wind only by presence, but surely you'd see significantly different results between headwind vs tailwind vs crosswinds? Admittedly this isn't 100% discernable as wind direction changes, and wind data is only recorded at kickoff -- but finer resolution here seems warranted.

- Secondly, I view the salary/perf graph with a grain of salt. Two small clusters appear to skew the regression line -- 3 points in Q4 (~ +$1000, -10) and 4 points in Q2 (~ -$1200, +5). I'm curious if those can be explained via special circumstances as outliers.

- I think salary data must be treated carefully due to distorting factors -- namely rookie deals, and specialists (KO or long FG). One could argue of course that there's a ready pool of new "talent" signable on the cheap, but that could simply be due to 1) inadequate personnel evaluation methods/metrics, or 2) lack of bargaining power... both of which may correct as metrics analysis improves.

- The graph is for 2009 only, right (or is it career)? Was there a reason you limited the data to a single season? I see only ~29 data points, which is an awfully small sample from which to derive a significant relationship.

- I would consider mapping all single season kicker results using amortized cap charge (to even out contract years), but mapped against ESA/game (to remove any negative bias for games missed to injury for alleged "star" kickers).

- Also I think it's a mistake to ignore PAT performance. Kicks made won't contribute much, but misses will.

- One other bias that is hard to remove, but certainly teams are willing to pay for is: coach confidence to give opportunity to the kicker to win/lose the game. Since we can't easily count when coaches decline to let the kicker make a try, we're stuck here. However some poorer kickers will take a +0 ESA because the coach (wisely?) chooses to not give him the opportunity to miss it. Likewise a great kicker (with an average/weak leg) may be given opportunities to fail kicks most other kickers wouldn't get.

- Finally, I note your conclusion states there's no correlation between seasons for individual kickers for ESA, yet you never showed that in your analysis. Was this done separately?

- Oh yeah, one other thing: it's possible to read too much into the 1 kicker in top 1% and 7 in top 5% "looking like" a random distribution. Skill is frequently normally distributed too. If you really think they are all essentially average, then you should see the same distribution performance no matter where you set the minimum qualifier (even at 1 attempt).

Dan Schlauch said...

Michael-

A well written critique with excellent points. I'll try to address each of your points in order.

1.) Agreed. Though factoring in wind direction would be an enormous task and I don't plan on doing that. I assumed that a particular kicker would not consistently benefit over average from wind direction alone a whole season.

2.) I do see those clusters. They don't seem too unusual and I don't think we can consider them outliers any more than the couple of points in quadrants 1 and 3 that skew the slope positive.

3.) True, there's a lot that goes into salary allocation. My overall idea was that whatever metrics are currently being used do not seem to show a strong correlation between salary and my metric (which I try to show is sound).

4.) Yes, it is only for 2009. Ideally I would expand it to more years. The salary tables I used only included that year, unfortunately.

5.) Agreed. That would improve this study.

6.) I still think it's fair to ignore PATs. They are 99% effective, and I think it's reasonable to assume that the range of kicker success rate is far smaller than a single percentage point. Additionally, the kicker himself is only partially responsible for those types of misses. I think the expected PAT ESA from the best and worst kickers for a whole season is well under a single point.

7.) True, ES for certain kicks, particularly very long ones, might be biased higher because of a slight tendency for this to be taken by stronger legged kickers. Also, weaker legged kickers tend to take fewer of these kicks. This is tough to remove, as you said, and mostly we can just assume the effect to be small, which I believe it is. I will get around to comparing salary to kickoff distance, which should shed some light on this issue.

8.) This low correlation came from salaries and kicking success, with the assumption that salaries were an indication of past performance and of expected future performance. Probably should have worded it differently.

9.) Not quite. The null hypothesis is that each kicker is exactly average. Given enough kicks, in theory any deviation from the average can be found with any p-value. If kickers actually followed a normal skill distribution, that should be observable with a disproportionately high number of high and low p-values.
This part of the study used all starting kickers and all kicks from the past 10 years. The fact that it still strongly resembles the null hypothesis after all this data makes me believe that not only are skill differences extremely small, but also that systemic biases in my study are not large either.

Michael L said...

Well, I'm not suggesting one call an outlier any inconvenient data =). I was mostly wondering if they were the special cases I had mentioned. Rookie kickers rarely can sign for more than the minimum, and as they have no body of NFL work there is no reason to assume they will all be worse on average than veterans. In other words, if kicking is a skill (above and beyond making it onto an NFL team and surviving training camp) then those salaries should be mostly uncorrelated to performance and some _should_ have good seasons.

Also some highly paid kickers may well be specialists that aren't expected to be as accurate overall, but can handle kickoffs or long FGs. Not really sure about that one.

However, another very distorting factor is this: veterans make minimums that scale linearly by year and cover a broad range of your graph scale. Thus even average or below average veteran kickers _can't_ be paid at the bottom of the scale. The fact that teams don't discard these players when their salaries surpass their performance on your scale indicates either they don't know what they're doing, the players provide more value than your analysis shows, or teams are more interested in limiting the floor of performance with a known quantity versus taking on an unknown one.

I don't doubt most personnel eval could be a lot better, but I suspect one or both of the latter 2 are more likely.

Also, in a league where skill matters (or at least breakdown avoidance ability does) and talent is not readily replaceable, you would expect to see a lot of tenured, old veterans who are employed until their skills have demonstrably deteriorated to replacement level (a process that can take teams years to determine... understandably so given the variance and sample rate). This means that looking at any single year any number of these players could be starting or in a downward slide, and would confound any regression analysis.

In other words, in a fluid market without tenure-scaled minimums or undifferentiated rookies, perhaps you could see pay commensurate with performance. But these, plus external performance factors (kickoffs, gadget skills like onsides, directional/hang ability, and I suspect avoiding major mistakes), will confound any underlying relationship...

Sorry for the long-winded response.

Michael L said...

When you say the "p-values are uniform" are you saying the distribution is indistinguishable from normal? Or did you check only the top 1% and top 5%? A test for normal conformity should be more than checking two (coarsely-grained) cdf values.

Just as one example, a bi-modal population evenly split could very easily have similar tail areas outside of 2 and 3 std devs... especially if you consider 75% (7/4) error close.

What would a league with 22 average kickers, 4 kickers each with +3%/-3% chance at all FGs, and 1 kicker each with +6%/-6% look like? Would your p-values look not far from that distribution too?

Still, if you're right that no kicker has done better than average more than +1.4 ESA over multiple seasons... then perhaps kickers really are all essentially equivalent in accuracy (once their career survives 5 kicks). Personally, that seems to defy logic and evidence, but perhaps it's true.

Michael L said...

Ah, now I understand how you administered the test. You tested each individual against the random distribution formed by the set of kicks he performed in his career (the part that fell in 2000-09), and counted how many of them achieved the 95th and 99th percentile.

Ok, then I have a few new concerns. First off 5 kicks is way too small of a sample to measure p < .01 or even .05. Every kick will likely be ES > .6, which means even 5 successes in the worst conditions won't distinguish a p < .05 performance (as .6^5 = .078). The cutoff needs to be much higher -- at least 20 (for .01).

Even once you get enough kicks to produce the minimum granularity, the distribution will initially suffer badly from quantization effects by minor differences in each of the FG ES probabilities. Only raising the number of required kicks even further will smooth the distribution at the boundary of interest. For example, 21 kicks at ES=.8 means you can miss any 1 FG to rate at p (=.0092) < .01... but if 10 of the kicks have slightly different conditions making them ES=.81 then now you can only miss one of the .8 kicks to meet the .01 cutoff. This "pick n' choose" effect will plague small n binomial-type distributions with similar but varying ES values.


Finally, the more I look at it I think this method is simply flawed -- at least at this order of magnitude of kick attempts. Your distributions (and thus your p-values) are a function of random outcome only and not skill, yet you are using it as a discriminator for skill -- but skill will not always (nor even frequently) overpower randomness.

Let's illustrate this in a setting where no one disputes skill is involved: hitting in MLB. Let's say a league average hitter against a league average pitcher hits at .250, and that the std dev in skill amongst hitters is .025 (so 95% of the league hits .200-.300). Now if a player faced 80 career ABs, all against average pitchers, then using your p-test at .01 an average hitter would randomly get 30 hits only 1% of the time. Yet a gifted hitter at .300 would only achieve 30-of-80 hitting 9.1% of the time. Your discriminator would fail to identify 10 out of 11 gifted hitters as being skilled. Even an extraordinarily gifted hitter at 3 std devs (only 1 in 700 batters) batting .325 would achieve 30-of-80 only 20% of the time. Your discriminator would find 4 out of 5 of these men "not far" from average.

The problem is that this test is just too coarse, the population is too small, the per kicker sample set is too small, and you haven't attempted to hypothesize what the distribution of skill would look like if it existed and constructed the test to discriminate appropriately. The null hypothesis isn't validated when a test rejects 90%+ of the skilled scenarios as random.

Dan Schlauch said...

The cutoff of 5 career kicks was meant to simply remove the various punters, WRs or 3rd string QBs that had taken a kick or two. The vast majority of the kickers used in the study have made hundreds of field goal attempts. It would probably make sense to increase the cutoff, but I didn't want to introduce too much survivor bias.

I understand the problem with high false negatives. However, rather than look at a single kicker we are looking at an entire dataset.

To use your baseball example, if 1000 hitters were .250 hitters, we would see an average of 2.3 hitters go 30 for 80 or better. If 50 out of 1000 players (5%) were 2 std devs above average, we would expect to see about 5.2 players at 30 for 80. This also assumes that the 95% are exactly average. Since your example describes a normal distribution of skill, we would expect substantially more than 5.2 "significant players".

Differentiating false positives from true positives is another story, but you can still learn something from the dataset as a whole.

Also, the average kicker in my dataset took 120 kicks in the last 10 years which will increase the power substantially over the 80 used in the baseball example.

I agree that this technique is pretty weak as I applied it to individual kickers, but it was intended to support the idea that the league as a whole is indistinguishable from the average.

The main point is that players that get paid more don't produce significantly more as measured by my ESA. Why? I'm not sure. Kickoff distances seem like a good explanation, but I haven't had a chance to investigate.
