The Audibles are a series of short-form research articles about thorny issues in college football analytics.
Assigning OL Gaps to Run Plays
Box score data only lists the game totals for ball carriers and their rushing plays: attempts, yards, first downs, touchdowns, and fumbles.
Play-by-play (PBP) data offers more granularity but is still restricted to the basic statistics above. You might know that RB J. Doe carried the football for five yards on 3rd and 4 and produced a first down, but what does that mean for the offensive line? Nothing is known about how they individually performed on the play. Ideally, on each play, we would know which players were materially blocking for the run (OL and other positions, like FB and TE) as well as the rush gap that the RB attacked on the attempt.
For example, was it an off-center iso play behind a lead fullback? Was it an off-tackle (or off-tight end) toss sweep? While those statistics are never available in box scores, the studious observers at PFF have been collecting that information on a play-by-play basis for nearly ten years.
PFF does not make its PBP data available to the public, but marginal totals are available to subscribers. For instance, total rush attempts, first downs, and touchdowns by gap (from left end to right end, including breakouts such as jet sweeps, QB scrambles, kneels, and trips) can be matched against publicly available PBP data scraped or otherwise obtained online. (PFF includes some other stats, but mcillecesports.com research identified attempts, first downs, and touchdowns as the most useful optimization statistics.)
This offers the analytics specialist two useful datasets: a vector {X} of rush-gap totals and a matrix {Y} of team PBP rush data.
Assigning a particular gap to each of the rush plays in {Y} then becomes a fairly simple mathematical optimization problem. The analyst only needs to define an objective function to minimize, and in this case a Euclidean distance built from the squared differences of the gap totals for yards, first downs, and touchdowns is a natural choice. (As a reminder, we know the marginal totals in {X}, and we can sum the PBP totals over {Y}, given the gap assignments. That is, for gap G you take the squared difference between the gap's YDS total in {X} and the YDS summed over the plays in {Y} assigned to gap G, then add the same term for every other gap and statistical category, and minimize the overall sum.)
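The objective described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (gap_totals, objective), the list-of-lists layout, and the toy numbers are invented for the example and are not taken from the original research. It uses the sum of squared differences, which has the same minimizer as the Euclidean distance itself.

```python
def gap_totals(assignments, plays):
    """Sum each play's stats into the gaps, weighted by its gap assignment.

    assignments[p][g] is the weight of play p on gap g.
    plays[p][s] is play p's value for stat category s.
    """
    n_gaps = len(assignments[0])
    n_stats = len(plays[0])
    totals = [[0.0] * n_stats for _ in range(n_gaps)]
    for weights, stats in zip(assignments, plays):
        for g, w in enumerate(weights):
            for s, value in enumerate(stats):
                totals[g][s] += w * value
    return totals


def objective(assignments, plays, targets):
    """Sum of squared differences between assigned and known gap totals."""
    totals = gap_totals(assignments, plays)
    return sum(
        (totals[g][s] - targets[g][s]) ** 2
        for g in range(len(targets))
        for s in range(len(targets[0]))
    )


# Toy data (invented): three plays, two gaps,
# stat categories = (yards, first downs, touchdowns).
plays = [[5, 1, 0], [2, 0, 0], [12, 1, 1]]
targets = [[7, 1, 0], [12, 1, 1]]       # known marginal totals per gap, as in {X}
good = [[1, 0], [1, 0], [0, 1]]         # plays 1 and 2 to gap 0, play 3 to gap 1
bad = [[0, 1], [1, 0], [1, 0]]          # a mismatched assignment

print(objective(good, plays, targets))  # 0.0, the totals match exactly
print(objective(bad, plays, targets))   # 100.0
```

An assignment that reproduces every known gap total scores zero, and the score grows as the assigned totals drift away from the marginals.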
When structuring the optimization, a probabilistic solution drastically reduces the computational burden. Instead of forcing a binary (0 or 1) assignment of each rush play to each gap, which is a combinatorial search over every possible play-to-gap pairing, the system converges far more rapidly if fractional assignments are allowed, subject to the constraint that each play's assignments across all G gaps (from left end to right end) are nonnegative and sum to one. In this sense, the results become a probabilistic assignment.
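One way to solve the relaxed problem is projected gradient descent: take a gradient step on the squared-difference objective, then project each play's weights back onto the set of nonnegative values that sum to one. The sketch below is a swapped-in illustration and not necessarily the solver used in the original research. All names (project_to_simplex, residuals, fit_assignments), the iteration count, and the toy data are assumptions made for the example.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) == 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        candidate = (cumulative - 1.0) / i
        if ui - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def residuals(assignments, plays, targets):
    """Assigned gap totals minus known gap totals, per gap and stat."""
    diff = [[-float(t) for t in row] for row in targets]
    for weights, stats in zip(assignments, plays):
        for g, w in enumerate(weights):
            for s, value in enumerate(stats):
                diff[g][s] += w * value
    return diff


def objective(assignments, plays, targets):
    """Sum of squared residuals across all gaps and stat categories."""
    return sum(d * d for row in residuals(assignments, plays, targets) for d in row)


def fit_assignments(plays, targets, iterations=2000):
    """Fractional play-to-gap weights via projected gradient descent."""
    n_gaps = len(targets)
    # Start every play spread evenly across the gaps.
    assignments = [[1.0 / n_gaps] * n_gaps for _ in plays]
    # Conservative step: inverse of an upper bound on the gradient's Lipschitz constant.
    step = 1.0 / (2.0 * sum(v * v for stats in plays for v in stats))
    for _ in range(iterations):
        diff = residuals(assignments, plays, targets)
        updated = []
        for weights, stats in zip(assignments, plays):
            # d(objective)/d(weight on gap g) = 2 * sum_s diff[g][s] * stats[s]
            grad = [2.0 * sum(d * v for d, v in zip(diff[g], stats))
                    for g in range(n_gaps)]
            moved = [w - step * gr for w, gr in zip(weights, grad)]
            updated.append(project_to_simplex(moved))
        assignments = updated
    return assignments


# Toy data (invented): three plays, two gaps,
# stat categories = (yards, first downs, touchdowns).
plays = [[5, 1, 0], [2, 0, 0], [12, 1, 1]]
targets = [[7, 1, 0], [12, 1, 1]]
uniform = [[0.5, 0.5] for _ in plays]
fitted = fit_assignments(plays, targets)

print(objective(uniform, plays, targets), objective(fitted, plays, targets))
for row in fitted:
    print([round(w, 3) for w in row])
```

Each row of the result is one play's probability distribution over the gaps. On real data the solution is often not unique, so the fitted weights should be read as one plausible assignment consistent with the marginals rather than a recovered ground truth.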
Et voilà! Once these probabilistic assignments are made to each gap on each run play, inferences can be made about the players manning those gaps, which PFF also provides. In other words, you can start to evaluate the offensive linemen (and other lead blockers) themselves based on advanced statistics, like expected points or win probability effects, for the runs that hit their gaps. For instance, if runs off the left-center and right-center gaps are stronger than other attempts, that indicates that the center is outperforming his fellow linemen as a run blocker.
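As a rough illustration of that last step, the probabilistic assignments can serve as weights when averaging a per-play statistic by gap. The sketch assumes a hypothetical per-play expected-points value is already attached to each run. The function name gap_averages and all numbers are invented for the example.

```python
def gap_averages(assignments, values):
    """Assignment-weighted average of a per-play statistic for each gap.

    assignments[p][g] is the probability that play p hit gap g.
    values[p] is the play's statistic (for example, expected points added).
    """
    n_gaps = len(assignments[0])
    weighted_sum = [0.0] * n_gaps
    total_weight = [0.0] * n_gaps
    for weights, value in zip(assignments, values):
        for g, w in enumerate(weights):
            weighted_sum[g] += w * value
            total_weight[g] += w
    # A gap with no assigned weight has no meaningful average.
    return [ws / tw if tw > 0 else None
            for ws, tw in zip(weighted_sum, total_weight)]


# Toy data (invented): three plays, two gaps.
assignments = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
epa = [0.4, -0.2, 1.0]

averages = gap_averages(assignments, epa)
print([round(a, 3) for a in averages])  # [0.2, 0.6]
```

In this toy case the second gap averages better than the first, which is the kind of comparison that would then be traced back to the blockers manning each gap.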
How to distribute blocking weight across all blocking players, given a particular gap for a rush play, is another tricky problem and will be the subject of another Audible.