Looking back and analyzing a season is something players, managers, and fans do alike. What worked well? What went wrong, and why?

Player-level, season-end data provide a salient perspective for assessing players across the league. These data afford us the luxury of breaking down player performance without the game-to-game noise that must otherwise be accounted for when trying to understand and quantify a player’s output.

Furthermore, analyzing season-end data allows us to assess similar players (how should a team replace departing players or build squad depth strategically?), to identify and evaluate outliers (are there any players who stand out from the crowd?), and to consider player scarcity (are there positions or player profiles that are rare? Does a team need to jump on making an offer for certain players?). Finally, beyond these observations, we can investigate how players change from season to season.

Currently, the predominant approach for analyzing player performance with season-end data is to cluster players based on their statistical output—a technique that helps us group like players, identify outliers, and address the aforementioned considerations. Moreover, we can compare player clusters league-to-league and draw some conclusions about styles of players or play. Or, at the team level, we can draw conclusions about an individual team’s tactics by identifying which clusters its players fall into.

When analytics teams talk about analyzing like players and performing statistical clustering on season-end performance, they are often referring to widely used techniques like Principal Components Analysis and K-means clustering. In this article, I will introduce these concepts, address some pitfalls of these current approaches for analyzing season-end data, then introduce the more analytically robust techniques of **Diffusion Mapping** and **Spectral Clustering**, which improve on current dimensionality reduction and clustering—leading to better footy analysis! I will introduce these topics in a basic manner, but I’ll also indulge in some stat-nerdiness for readers who love a good statistical model.

### Challenges with Football Data

Modern technology has enabled much of what happens on the pitch to be recorded and quantified. Looking at an entire season means we have a data set of over 400 players and 78 variables (from the Bundesliga’s 2013-2014 data).

400 players is, by modern standards, a very small data set, but 78 variables are not easy to grasp—for example: how the variables relate to one another, how players compare on them, which variables are important, which combinations of variables distinguish players, who stands out, and who has a ‘like-for-like’ replacement at their club or at a different club. Thus, two primary challenges arise if your goal is analyzing player performance with season-end data:

*Challenge #1*: Dimensionality.

*Challenge #2*: Clustering.

In the next two sections, I’ll address each of these problems.

### Challenge #1: Dimensionality – how has it been addressed in Football analytics?

This challenge can be summarized in the form of the following questions: What can we do to understand the 78 variables in a systematic and objective way? How do we “narrow down” our variables to a manageable set for analyzing players—who, after all, are our main topic of interest?

The answer to this question is dimension reduction, which helps us understand the relations between so many variables. On the surface, the solution seems easy: if you’re a football expert and know which variables you are most interested in, you simply select those variables and go on your merry way. However, this approach leaves out data that could be useful in analyzing players. In other words, if you hand-pick your variables, you may be losing out on other useful variables that could help you distinguish or cluster players.

There are many mathematical techniques that can help you do this. In general, these techniques use the variables’ correlations to discover (or compute) a reduced set of variables that are combinations of correlated variables. The most common such technique is **Principal Components Analysis**. Simply put, it reduces the number of variables in a data set by computing a *new* data set, where the variables are linear combinations of the original variables—think of it as a mathematically derived index, like a stock market index.

[**Warning #1**: What follows is an amalgamation of stat/math nerdiness that may not have been seen in the *Bundesliga Fanatic* before, but if you like numbers, read on! If you want to skip and follow the main concepts, feel free.]

Dimension reduction is most commonly associated with a technique called Principal Component Analysis (PCA), though many other linear methods exist. PCA iteratively identifies the largest eigenvalue of the variance-covariance matrix of your variables (mean-centered and standardized, because PCA is scale-sensitive), removes the identified component (eigenvector) from the matrix, and eigen-decomposes the deflated matrix again.

We subjectively select the number of components, which must be less than or equal to the original number of variables—in general by keeping only components whose eigenvalues are greater than one and whose loadings are interpretable (many analysts also plot the eigenvalues on a scree plot to see if there is an elbow, limiting component selection to those at or before the elbow).

Here it is important to see that PCA orthogonalizes the correlated variables into a smaller set of uncorrelated components. We then “interpret” the components *a posteriori* (yes, I know, I’m conflating factor analysis and PCA, but for the sake of brevity let’s continue, especially since 90% of PCA is interpreted this way—formal nomological structures are rarely explicitly well-defined) by analyzing the loadings of the variables, essentially mapping the individual variables to the components. We also multiply each individual observation—in our context, a player’s vector of statistical output—by the matrix of component loadings, giving a score for each player on each component.
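To make these steps concrete, here is a minimal PCA sketch in Python (the analysis in this article was run in R, so treat this as an illustration rather than the article’s actual code), assuming `X` is a players-by-variables matrix:

```python
import numpy as np

def pca(X, eigenvalue_cutoff=1.0):
    """Reduce X (players x variables) to principal-component scores."""
    # Mean-center and standardize each variable: PCA is scale-sensitive.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Eigen-decompose the variance-covariance matrix of the standardized data.
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(eigvals)[::-1]          # sort components, largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > eigenvalue_cutoff         # the "eigenvalue > 1" rule
    loadings = eigvecs[:, keep]                # map variables -> components
    scores = Z @ loadings                      # one score per player per component
    return eigvals[keep], loadings, scores
```

The `eigenvalue_cutoff=1.0` default implements the “eigenvalues greater than one” rule described above; swapping in a scree-plot elbow instead is a one-line change.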

Ok, great, right? Well, kinda…sorta. PCA is a linear dimension reduction technique that fails to capture more complex structure in the data—often, non-linear components (curves, surfaces) would do better, and hey! Look! That’s the type of analysis we do further below!

### Challenge #2: Clustering Players – good first steps in Football analytics

PCA is a very common way to reduce dimensions (again, think “create index”), but what about clustering players? How can we compare like-players? Which metrics should we use? How do you define ‘similarity’ in players? These are questions that clustering algorithms help answer.

The method by which you choose to cluster your players has a direct impact on the clusters that result. Whichever way you go about it, the goal of clustering players is to use mathematical techniques to group players so that the similarity between players *within* a cluster is maximized while the similarity *between* clusters is minimized.

Once you do this, you can interpret the results for which players are in which clusters, what the clusters are representing, and ask essentially, what types of players exist in the Bundesliga? How many of them are there? Do a certain team’s players fall into specific clusters?

[**Warning #2**: stat/nerdiness follows next in 3, 2, 1 …]

Frequently used methods like linkage clustering (hierarchical agglomerative or divisive), standard K-means clustering (centroid-based), and distribution-based clustering (e.g., a mixture of Gaussians) have glaring holes in their approaches: sensitivity to noise (linkage clustering, and even density-based clustering), local optima (K-means), or parametric, distribution-based assumptions.

For example, let’s take a closer look at K-means clustering, since it is probably the most widely used algorithm in this regard. K-means tries to create K clusters whose points (in our example, players) are closer to their own cluster’s center than to any other cluster’s center in a D-dimensional space (*D* here being the number of dimensions/variables we are looking at – this may be setting off eureka bells about how useful PCA is right now).

In layman’s terms, we’re trying to define groups of players who are similar to each other and dissimilar to other groups of players, which is probably the natural way to think about clustering. Unfortunately, K-means has a number of drawbacks, like being unable to handle non-centroid cluster structures and getting stuck in local optima.
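To ground the description, here is a bare-bones sketch of K-means (Lloyd’s algorithm) in Python; again, the article’s analysis was run in R, so this is purely illustrative:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain Lloyd's algorithm: assign points to the nearest center,
    recompute centers, repeat until the centers stop moving."""
    rng = np.random.default_rng(seed)
    # Initialize centers at k randomly chosen data points. This random
    # start is exactly the local-optimum drawback discussed above, so
    # analysts usually restart several times and keep the best solution.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Distance of every point (player) to every center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```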

### A current example of clustering players

If this discussion seems a bit abstract, take a look at Will Gurpinar-Morgan’s post. Will gives a *great* introduction to applying clustering in football, while staying far away from the technical side of the analysis (I highly recommend listening to the short YouTube video linked to in his post).

He shows a simple application of K-furthest clustering (a linkage clustering algorithm) that separates players according to their positions. He also uses the K-furthest algorithm to cluster players across many top European leagues, and plots these position-level clusters on their first two principal components.

### Taking Football analytics further

Clustering players is one way to answer many of the questions above—if we can cluster players in such a way that we capture the “true but unobservable” clusters of the players, we can fulfill the analytical goals achievable with season-end statistics. The challenge is capturing the “true structure” of the data—in other words, creating clusters that are real representations of the players.

Many currently implemented approaches cluster players using techniques whose weaknesses, if unaddressed, can lead to, well, questionable clustering; for example, grouping players together who are not *in reality* similar, because the technique simply cannot properly account for the data’s underlying structure (see above on the drawbacks of PCA and K-means).

Instead of PCA and K-means, in my analysis further below, I use a non-linear dimension reduction technique (note: *non-linear* as opposed to the linear PCA) called Diffusion Mapping and a clustering technique called Spectral Clustering. These techniques address noisy data and assume no parametric form. In other words, they are better suited to clustering players.

### Diffusion Mapping

First, I’ll start with a very, very brief overview of diffusion, then dive into details – though still in a very casual way – in another stat-nerdiness section.

Basically, diffusion mapping calculates a “distance” measure between observations (i.e., players) based not on their Euclidean distance from each other, but on their connectivity. So instead of clusters being defined by a centroid shape, they are defined by how connected their observations are to one another, no matter the shape – contrast this approach with the above overview of K-means.

The graphic below depicts a simulated example of how K-means alone, versus diffusion mapping *then* K-means, would assign clusters. The far-left image shows a regular 2D plot of the data; here you can easily see that there are two clusters: one dense cluster inside a ring-shaped cluster.

(Source: http://eprints.maths.ox.ac.uk/740/1/bah.pdf)

The center image shows the clusters assigned by K-means, where K-means “mis-classifies” the clustering because it is a very simple distance-based algorithm: when assigning two clusters, it picks the “best” cluster centers and assigns each data point to the closest center.

However, as you can see in the right image, K-means run on the diffusion-mapped data is accurate. The diffusion map feeds “connectivity” into K-means instead of raw distance. So even if the observation at the highest point in the graph is very far from the lowest point, the two are strongly “connected” by the ring shape of points.

[**Warning #3**: stat-nerd talk in 3, 2, 1 …]

For a very brief (and very simplistic) technical overview of diffusion mapping: first, define each observation as a node, and the probability of transitioning (via random walk) from one node to another as the connectivity between the nodes. This connectivity is proportional to a kernel that measures the similarity between nodes in a local neighborhood. Next, build your diffusion matrix P by row-normalizing the (symmetric) kernel matrix into a first-order Markov matrix, where the entry in row i and column j is the probability of transitioning from Node i to Node j in one step. If you take P^{t} and increase t (running your first-order Markov chain forward), you reveal the geometry of the data with respect to the connectivity of the nodes. For a given t, this P^{t} defines your diffusion distance. From here, you can embed the data in a lower dimension using the dominant eigenvalues and eigenvectors (the “mapping” part of diffusion mapping). In the end, you treat the diffusion distance the way you would a simple Euclidean distance.
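For the curious, the random-walk recipe above can be sketched in a few lines of Python (again illustrative, not the article’s R code); `eps`, the kernel bandwidth, and `t`, the diffusion time, are tuning choices I’ve picked for the sketch:

```python
import numpy as np

def diffusion_map(X, eps=4.0, t=2, n_components=2):
    """Minimal diffusion-map sketch following the recipe above."""
    # 1. Gaussian kernel: similarity between nearby observations
    #    (the "connectivity" between nodes).
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-sq_dists / eps)
    # 2. Row-normalize the symmetric kernel into a first-order Markov
    #    transition matrix: P[i, j] = prob. of stepping from node i to node j.
    P = K / K.sum(axis=1, keepdims=True)
    # 3. Eigen-decompose P (its eigenvalues are real because P is similar
    #    to a symmetric matrix) and sort by eigenvalue magnitude.
    eigvals, eigvecs = np.linalg.eig(P)
    order = np.argsort(-np.abs(eigvals))
    eigvals, eigvecs = eigvals[order].real, eigvecs[:, order].real
    # 4. Skip the trivial constant eigenvector (eigenvalue 1); scaling the
    #    next ones by eigenvalue**t corresponds to running the chain to P^t.
    return eigvecs[:, 1:n_components + 1] * eigvals[1:n_components + 1] ** t
```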

### Spectral Clustering

Like I did with diffusion mapping, I’ll attempt a very, very brief (i.e. “non-technical”) overview of spectral clustering, then dig in slightly deeper with a short technical overview.

**Spectral clustering**, in simplistic terms, uses a matrix called a similarity matrix to define similar data points (players whose output is similar), then uses dimension reduction techniques before the data are clustered. In other words, spectral clustering tries to “summarize” the data before clustering them, instead of clustering the data as-is (like K-means).

[**Warning #4**: nerd-stat talk begins in 3, 2, 1 …]

The overarching idea is that data is embedded on a low-dimensional manifold in a high-dimensional space, and if we can discover the low-dimensional manifold, then we can better cluster the intra-manifold points.

Spectral clustering defines a similarity graph and its adjacency matrix (a sparse, symmetric matrix of nodal connectivity), which informs the transition matrix – all of which inform a graph Laplacian matrix (your degree matrix minus your adjacency matrix), which in our case is normalized to give a normalized Laplacian matrix.

This normalized Laplacian matrix encodes the graph structure of the data in a matrix representation. We then eigen-decompose the matrix, embed the data in the major eigenvectors and eigenvalues, then run a typical K-means.
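Here is a compact Python sketch of that pipeline (similarity graph, normalized Laplacian, eigen-embedding, then K-means); as elsewhere, this is an illustration rather than the article’s R code, and the bandwidth `eps` is my own choice:

```python
import numpy as np

def spectral_clustering(X, k=2, eps=0.5):
    """Sketch: similarity graph -> normalized Laplacian -> eigen-embedding
    -> K-means in the embedded space."""
    # Adjacency matrix: Gaussian similarity between nodes, no self-loops.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dists / (2 * eps ** 2))
    np.fill_diagonal(W, 0.0)
    # Normalized Laplacian: L = I - D^(-1/2) W D^(-1/2).
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # Embed each node in the k eigenvectors with the smallest eigenvalues,
    # then normalize the rows of the embedding.
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    # K-means in the embedded space, with deterministic farthest-point
    # initialization to dodge a degenerate random start.
    centers = [U[0]]
    for _ in range(k - 1):
        gaps = np.min([np.linalg.norm(U - c, axis=1) for c in centers], axis=0)
        centers.append(U[gaps.argmax()])
    centers = np.array(centers)
    for _ in range(100):
        labels = np.linalg.norm(U[:, None] - centers[None], axis=2).argmin(axis=1)
        new = np.array([U[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels
```

On data like the graphic’s ring with an inner blob, this assigns the ring and the blob to different clusters – exactly the case where plain K-means on the raw coordinates fails.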

The **diffusion mapping** element comes into play when we take our transition matrix to some power *t* for dimension reduction—in other words, using diffusion we can “clean” our transition matrix, which “shrinks” the eigenvalues from the eigen-decomposition.

### Enough of this stat-nerd rubbish, where’s the bloody football?!

First, a quick summary: my analysis differs from current football clustering efforts (that I am aware of!) by using Diffusion Mapping and Spectral Clustering (**D/SC** hereafter) instead of PCA and K-means.

Next, I need to explain my methodology. My data and analysis differ from Will Gurpinar-Morgan’s, for example: I only have Bundesliga data from the 2013-2014 season, I don’t have “dribbling” data, and since I use the D/SC technique I can use *all* the variables available. Thus, I’m not subsetting the variables to just a handful; instead, I utilize *every* variable I can find!

However, like Will, I’m normalizing any “count” variables (like passes, shots, etc.) per minute played, but am keeping the rates (like the accuracy of through balls, % of aerial battles won, etc.) as they are. I’m also filtering for players who have played at least *900 minutes*, removing outliers whose stats might not reflect their true production possibility (though I’m open to relaxing this filter). For comparison’s sake, I’m also running a standard K-means clustering using all the variables, because without it, we would have no baseline against which to evaluate the D/SC approach.
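As a small illustration of this preprocessing (with hypothetical column names, since the real data set’s variables aren’t shown here, and in Python rather than the R used for the actual analysis):

```python
import pandas as pd

# Hypothetical season-end rows; the real data set's column names are
# not shown in the article, so these are placeholders.
df = pd.DataFrame({
    "player":  ["A", "B", "C"],
    "minutes": [2700, 950, 400],
    "passes":  [1800, 500, 220],            # "count" variable
    "shots":   [45, 30, 12],                # "count" variable
    "aerial_win_pct": [0.61, 0.48, 0.55],   # rate variable: left as-is
})

# Keep only players with at least 900 minutes played.
df = df[df["minutes"] >= 900].copy()

# Normalize count variables per minute played; rates stay untouched.
for col in ["passes", "shots"]:
    df[col] = df[col] / df["minutes"]
```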

### The Results? Clusters Created!

Here are the results. 19 total clusters were selected, which I grouped under three main categories (**Forwards**, **Midfielders**, and **Defenders**). I analyze the clusters below, then plot them on some of the extracted principal components. Though PCA is not the method by which we reduced dimensions or clustered, it is still a good way to quickly visualize the clusters extracted by different methods.

I plot each group of clusters on two relevant principal components. In some plots the clusters *look* like they are not well defined (overlapping), but remember that each plot is just one representation of the clusters embedded in this PCA space — this is not what they “really look like.” This lack of perfect representation in the component space actually points to the need for the D/SC approach.

In the 2-D graphs, each observation and cluster is plotted on two relevant components from the PCA. The further up a player sits on the graph, the higher their “score” on the component labeled on the vertical axis; the further right, the higher their score on the horizontal component. For instance, in the first graph below, players higher on the chart have a higher “score” on the “Goal-bound” component (index), and players further to the right have higher “scores” on the “Creating” component (index).

#### 3 Forwards Clusters

- *Center Forwards* (with aerial ability and attempted headers); e.g. Lewandowski, Kiessling, Ramos, Mandzukic.
- *Center Forwards* (who use their head less and are slightly less efficient finishers); e.g. Petersen, Aubameyang, Meier, Olic.
- *Flexible Forwards* (dynamic creators who assist, score, but not often with their heads); a small cluster with just Reus, Ribéry, and Robben.

#### 9 Midfield Clusters (plotted on two separate charts)

- *Very Attacking Midfielders* (assist providers, shots outside the box, similar to Flexible Forwards but with fewer goals/attempts on goal); e.g. De Bruyne, Farfan, van der Vaart, Mueller, Draxler, Firmino, Goetze, etc.
- *Attack-Minded Midfielders* (who often lead fast breaks, like the above cluster but with fewer assists/goals); e.g. Perisic, Caligiuri, Boateng, etc.
- *Pass Masters* (accurate passes, long balls, assists, few goals, lots of final third entries and some attempts from outside the box—this includes some outside backs); e.g. Schweinsteiger, Geis, Sahin, Lahm, Kroos, Thiago, etc.
- *Center Midfielders* and *Wide-Players* (who can get into the box and shoot); e.g. Elia, Prib, Stindl, etc.
- *Box-to-Box Midfielders* (most balanced cluster of all – offense/defense); e.g. Can, Lanig, Junuzovic, etc.
- *Possession-Winners* (who release the team forward with through-balls); e.g. Moravek, Baier, Arslan, etc.
- *Possession-Winners* (who clear more, deep-lying); e.g. Kehl, Dias, Badelj, etc.
- *Deepest-Lying Midfielders* (more clearance-driven defensive midfielders); e.g. Lars Bender, Kramer, F. Kroos, etc.
- *Deep-Lying Midfielders* (who pass/through-ball at a great rate – just two players in this cluster); Xhaka and Reinartz.

**Graph 1 (for midfield clusters 1-4)**:

**Graph 2 (for midfield clusters 5-9)**:

#### 4 Defensive Clusters

- *Balanced Full-backs* (who can create and get their share of shooting opportunities, given they are full backs); e.g. Piszczek, F. Johnson, Kessel, etc.
- *Crossing Full-backs* (self-explanatory); e.g. Oczipka, Grosskreutz, Chandler, etc.
- *Defensive Full-backs* (full backs who don’t venture forward often, or at least don’t touch the ball when they do!); e.g. Wendt, Durm, S. Garcia, etc.
- *Center Backs* (clearing and tackling is what they do!); e.g. Stranzl, van Buyten, Spahic, etc.

*Note*: Goalies and a few other outlier clusters that indicate some position flexibility aren’t worth plotting.

### Comparing the analytical methods

In comparing the K-means clustering (not shown above) and the D/SC clustering, K-means removes Kießling and Huntelaar from the Forwards cluster and places them in a group with Müller and the likes of Drmic and Meier.

On the other hand, D/SC puts Kiessling and Huntelaar into an “aerial dominant” center-forward cluster that also includes Ramos, Mandzukic, and Lewandowski, and places Müller into a cluster with the likes of Draxler, Götze, De Bruyne, etc., which seems like a better fit: passing rate and assists separate this group from the Forwards, even though Müller uses his head to great effect.

Next, it looks reasonable to cluster some Bayern midfielders/outside backs (Schweinsteiger, Thiago, Rafael, Lahm, and Kroos) together, since Pep Guardiola’s system necessarily predisposes players toward certain statistics, like completed passes. This is fine, and it is actually a point of validation for the clustering, which identifies a team’s tactics here.

However, K-means added no one else to this group, while D/SC added Johannes Geis and Nuri Sahin, who both play similar roles. This inclusion makes sense to me. K-means, on the other hand, dropped Sahin into a more box-to-box cluster, a role he does fit in general, though his above-average through balls and final-third entries made him look like an outlier in that group. K-means likewise clustered Geis in a more advanced role with Rafael van der Vaart and Jefferson Farfan, where his accurate long-ball numbers made him stand out.

The D/SC approach also helps with detecting outliers, letting you see what separates them from their cluster. For example, within the cluster of holding/possession/passing-based players, Thiago wins an exceedingly large amount of possession in the attacking third and a ton of contested balls. This type of within-cluster “outlier” detection is great at identifying players who play like a group of other players but offer specific idiosyncrasies – another benefit of season-end analysis.

### Putting it all together

Football, like the rest of the world, is pumping out data left and right. Homing in on these data at the end of a season, at the player level, allows us to take a deep look back and analyze players, teams, tactics—everything!

One way to approach evaluating players (that lends itself to analyzing teams and tactics) is through dimension reduction and clustering. In this article, we saw how more recent techniques (**Diffusion Mapping** and **Spectral Clustering**) can be used to better analyze players at the end of a season.

I’ll follow up with deep dives into each group of clusters in future articles–including 3-dimensional representations of the clusters!

*Note*: If anyone has some good data (*like the 2014-15 Bundesliga data! – Editor*) that is begging to be analyzed, please send it our way! I’d even be happy to trade the R code that ran the analyses above.

#### Josh MacCarty

#### Latest posts by Josh MacCarty

- What Can Season-End Bundesliga Data Teach Us? - July 10, 2015