Beating soccer odds using Machine Learning — Project Walkthrough

Arthur Caldas
Analytics Vidhya


It isn't news to anyone that predicting soccer matches is a tough task. After all, how can you predict something as unpredictable as soccer? Taking on this challenge, I decided to use my Data Science knowledge to build a project that would not only produce satisfying predictions, but also beat the bookmakers and turn a profit.

Web Scraping

First, I had to obtain as much soccer match data as I could find. Fortunately, Oddsportal had information available for almost 16 years' worth of Premier League seasons.

Source: Oddsportal

I got the data from the website using two Python web scraping libraries, BeautifulSoup and Selenium. I built the whole parsing structure with BeautifulSoup and used Selenium to click through to the next page, since it can perform actions on web pages just like a human would.
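
To give an idea of the structure, here's a minimal sketch of that scraping loop. The URL, the table selector and the "next page" locator are placeholders for illustration, not Oddsportal's actual markup:

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.oddsportal.com/soccer/england/premier-league/results/")

rows = []
while True:
    # BeautifulSoup handles the parsing of the rendered page
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for tr in soup.select("table tr"):  # placeholder selector
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    # Selenium clicks through to the next page, like a human would
    try:
        driver.find_element(By.LINK_TEXT, "»").click()
    except Exception:
        break  # no more pages
driver.quit()
```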

After the scraping, I had data on 6077 Premier League matches, from the 2005/2006 season to 2020/2021, with 7 columns of information:

Source: My Github code
  • season: Match’s season.
  • date: Date of the match.
  • match_name: “Home team vs Away team” string.
  • result: Match result.
  • h_odd: Home team's winning odd.
  • d_odd: Draw odd.
  • a_odd: Away team's winning odd.

Data Cleaning

I already had a fairly clean dataset, so I only had to prepare the data for the much more robust feature engineering in the next step. In this step (sketched in code below) I:

  • Transformed date into datetime.
  • Changed odd columns to float.
  • Got beginning year of a season.
  • Split match name into home and away teams.
  • Split scores into home and away score.
  • Created feature with the match result.
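
Here's a minimal pandas sketch of these steps, assuming raw formats like season "2005/2006", match_name "Arsenal vs Chelsea" and result "2:1":

```python
import pandas as pd

df["date"] = pd.to_datetime(df["date"])                       # date -> datetime
df[["h_odd", "d_odd", "a_odd"]] = df[["h_odd", "d_odd", "a_odd"]].astype(float)
df["season"] = df["season"].str[:4].astype(int)               # beginning year
df[["home_team", "away_team"]] = df["match_name"].str.split(" vs ", expand=True)
df[["home_score", "away_score"]] = df["result"].str.split(":", expand=True).astype(int)

# Match outcome derived from the two scores
df["winner"] = "DRAW"
df.loc[df["home_score"] > df["away_score"], "winner"] = "HOME_TEAM"
df.loc[df["home_score"] < df["away_score"], "winner"] = "AWAY_TEAM"
```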

After these transformations, this is what the dataset looked like:

Source: My Github code

Feature Engineering

This is one of the most impactful parts of the project. One can create the best and most complex Machine Learning algorithms possible, stacking and tuning models as much as one wants, but at the end of the day a simple model with features that explain the relationship between the dependent and independent variables well enough will be much more effective.

Since match statistics only become available after a match, the features had to be built from information obtainable before kickoff.

First, I created regular statistics such as wins, losses, goals scored and suffered, and the team's position in the table up to the respective match date in the respective season. I also created features such as average goals and points in the last 3 games and the team's current win, loss and draw streaks. The feature I was most curious to try out was the Weighted Exponential Average, which, according to this article, might be a good fit for soccer predictions.
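
Since the linked article doesn't pin down the exact weights, here's one plausible pandas sketch of such a feature: an exponentially weighted mean of a team's previous games, shifted by one row so that only pre-match information is used (the column names are illustrative):

```python
import pandas as pd

def weighted_exp_avg(games: pd.DataFrame, col: str, span: int = 3) -> pd.Series:
    # shift(1) excludes the current match from its own feature
    return games[col].shift(1).ewm(span=span, min_periods=1).mean()

# e.g., per-team weighted average of goals scored in previous games:
# df["ht_l_wavg_goals"] = (
#     df.sort_values("date")
#       .groupby("home_team", group_keys=False)
#       .apply(lambda g: weighted_exp_avg(g, "home_score"))
# )
```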

Here's the dataset's column info after the feature engineering process:

Note that some columns contain null values. I’ll get to them later in the model building phase.

All the feature descriptions are in a file called "features.txt" in the project's Github repo.

EDA

The goal here was to perform a brief data exploration in order to better understand the relationships between the dependent and independent variables.

The dataset contained 3 possible classes in the target variable: home team winning, away team winning and draw. The classes had the following proportions:

Source: My Github repo

It's possible to see that the dominant class is the home team winning, which makes sense since it's common knowledge that in soccer the home team usually has an advantage.

One can assume from this that it would be more difficult to predict a draw than a win by either team, since in 75.45% of the dataset's matches one of the two teams won.

Source: My Github repo

In the picture above, it's possible to see the 5 features most correlated with each class, with each class in a separate dataframe. We can interpret these positive correlations by assuming that if a variable goes up, so does the chance of the positively correlated class happening.

We can see that in the winner_h and winner_a dataframes (home win and away win), what seems to influence the class is how badly the other team is doing. For example, the other team's odds and rank are the features with the highest correlation.

Meanwhile, in the winner_d dataframe (draw), which had very low correlations compared to the other two, the features most correlated with a draw outcome are ones that say how badly the home team is doing.

In the modeling phase I used a feature selection tool, which I’ll talk about later, and performed a brief analysis of the distributions and scatter plots of 13 features selected by the tool.

Source: My Github repo

In the histograms shown above we can see that the features aren't normally distributed and are on different scales. That's why I applied the MinMaxScaler transformation to them in the modeling phase.

It was also possible to see the relationship between the features through a pairplot:

Source: My Github repo

Some linearities we could observe from this plot:

  • The w.e.a. points have a positive correlation with the w.e.a. goals metric and a negative correlation with the w.e.a. goals suffered metric, which makes perfect sense: if a team earns more points, it's going to score more goals and suffer fewer.
  • The home team's odd goes up as the away team's w.e.a. points and w.e.a. goals go up, meaning that the more the away team scores, the lower the chance of the home team winning.
  • The draw odd is also somewhat correlated with both teams' odds, probably because every game has a team with a higher odd, and the draw odd tends to sit near that higher odd.
  • We can also see some obvious linearities, such as the home team's odd going down as the away team's rank goes up, meaning that the lower the away team's position in the table, the higher the chance of the home team winning.

Model Building

First, I removed all features that wouldn’t be available before a match and were only used in order to create other features.

I filled the columns containing null values with a "-33" sentinel to help the models interpret them. This way, the model can understand that there's a reason for these values to be null. For example, in the "_rank" columns, if a team doesn't have a last rank it's either because the match belongs to the first season present in the dataset or because that team wasn't in the Premier League in the previous season.
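
As a quick sketch, continuing with the dataframe from the earlier snippets:

```python
# Sentinel-fill nulls so the model can treat "no history available"
# as a signal of its own rather than as missing data
null_cols = df.columns[df.isna().any()]
df[null_cols] = df[null_cols].fillna(-33)
```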

Models used:

  • Logistic Regression — Linear model expected to work due to the linear relationships between the features and the target.
  • Random Forest Classifier — Tree-based ensemble model expected to work well due to the sparsity of the data.
  • Gradient Boosting Classifier — Just like Random Forest, a tree-based ensemble model expected to work well due to the sparsity of the data.
  • K-nearest neighbors Classifier — Supervised learning algorithm in which an object is classified by a plurality vote of its K nearest neighbors.

Before training the models, I split the data into training and test sets with a test size of 20%, created dummy columns from the "ls_winner" column and scaled all features using sklearn's MinMaxScaler.
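
Sketched in code, reusing the dataframe and column names assumed earlier (fitting the scaler on the training set only avoids leaking test-set information):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X = pd.get_dummies(df.drop(columns=["winner"]), columns=["ls_winner"])
y = df["winner"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

scaler = MinMaxScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
```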

To evaluate these models I chose the accuracy metric, since the model’s job is to predict the exact match result, no matter what class it belongs to.

Accuracy is the total number of correct predictions divided by the total number of predictions, i.e. the proportion of predictions the model gets right.

Here are the models trained and their training results:

Source: My Github repo

Even though the accuracy levels weren't very high, it's important to notice that the best model performed better than a random prediction or a "home team always wins" prediction, meaning the model was able to explain some of the data.

Source: My Github repo
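
Those two baselines can be checked with sklearn's DummyClassifier; a quick sketch, reusing the split from before:

```python
from sklearn.dummy import DummyClassifier

# "uniform" guesses randomly; "most_frequent" always predicts the
# majority class, i.e. "home team wins" in this dataset
for strategy in ("uniform", "most_frequent"):
    baseline = DummyClassifier(strategy=strategy, random_state=42)
    baseline.fit(X_train, y_train)
    print(strategy, round(baseline.score(X_test, y_test), 3))
```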

Given that Logistic Regression had the best training results, it became the model I focused on throughout the rest of the modeling phase.

Since there were more than 40 features in total, I performed a feature selection using the Recursive Feature Elimination (RFE) method.

Feature selection is important to reduce overfitting, improve interpretability and shorten training time. RFE selects features by recursively considering smaller and smaller sets of features. In the graph below, it's possible to see the relationship between the number of features used and the accuracy of Logistic Regression:

Source: My Github repo
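
A sketch of how such a curve can be produced, reusing the preprocessed split from before:

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Test accuracy of Logistic Regression for every number of kept features
scores = {}
for n in range(1, X_train.shape[1] + 1):
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=n)
    rfe.fit(X_train, y_train)
    scores[n] = rfe.score(X_test, y_test)

# The names of the features kept by a given fit:
# X_train.columns[rfe.support_]
```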

Not only was it possible to keep accuracy at a satisfactory level, it even increased at certain numbers of features. I selected 13 as the ideal number of features: it gives more explainability, and the difference in accuracy compared to using all features was minimal.

These were the features selected:

  • h_odd
  • d_odd
  • a_odd
  • ht_rank
  • at_rank
  • ht_l_points
  • at_l_points
  • at_l_wavg_points
  • at_l_wavg_goals
  • at_l_wavg_goals_sf
  • at_win_streak
  • ls_winner_-33
  • ls_winner_HOME_TEAM

We can observe that all the odds features were selected, which highlights their importance. That's understandable, since bookmakers do a great job of setting odds and predicting matches. We can also notice that the ranking features are present, alongside the average points in the last 3 matches.

The Weighted Exponential Average features also worked their way into the top 13 features but only on the away team side.

Note that "ls_winner_-33" means there was no previous match between the 2 teams, which is apparently also informative for the match outcome.

Simulating Investment

Here, I used the test data to simulate an investment in the betting markets, using the model's predictions and the matches' odds.

The simulation covered the 1216 matches of the test set. I set a fictional stake of $100 per match, making a total investment of $121,600.
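
Here's a sketch of the payout logic under these assumptions, with illustrative variable names: a correct pick pays the stake times (odd - 1), and a wrong pick loses the stake.

```python
import numpy as np

STAKE = 100
odd_col = {"HOME_TEAM": "h_odd", "DRAW": "d_odd", "AWAY_TEAM": "a_odd"}

# `preds` holds the model's picks, `y_test` the true outcomes, and
# `test_odds` the h_odd/d_odd/a_odd columns for the same matches
picked_odds = np.array(
    [test_odds.iloc[i][odd_col[p]] for i, p in enumerate(preds)]
)
returns = np.where(
    np.array(preds) == np.array(y_test), STAKE * (picked_odds - 1), -STAKE
)
print(f"profit: ${returns.sum():,.2f}  ROI: {returns.sum() / (STAKE * len(preds)):.2%}")
```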

Here's how the models performed:

Source: My Github repo

What is noticeable is that the KNN algorithm also turned a profit, even though it had the worst accuracy of the 4 algorithms in the training phase.

To understand why the KNN model turned a profit and the other 2 models didn't, we need to clarify what precision, recall and F1-score are:

  • Precision: Of all the predicted positives, how many were actually positive?
  • Recall: Of all actual positives, how many did the model identify as positive?
  • F1-score: The harmonic mean of precision and recall.

Note that these metrics are calculated for each class, so the metric considers the referred class 1 (positive class) and the other two classes 0 (negative class).
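
In sklearn all three metrics come from a single call; a quick sketch reusing the test predictions:

```python
from sklearn.metrics import classification_report

# One row per class: precision, recall and F1-score, plus support
print(classification_report(y_test, preds))
```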

Here’s the classification report of each model:

Source: My Github repo

Class 0 is a draw, class 1 is an away team win and class 2 is a home team win.

It's possible to see that even though the KNN model had the lowest accuracy, it had by far the best F1-score in the draw class. Besides that, it had almost 50% accuracy and F1-scores in the winning classes not far from the other models'. That's why it still made a profit on its predictions.

Logistic Regression didn't correctly predict any occurrence of the draw class, probably because there weren't many linear relationships pointing to a draw outcome. But it still turned a profit, because it did a good job predicting the winning classes.

Productionization

Here, I built a Flask API endpoint and deployed it as a Heroku application.
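
Here's a minimal sketch of what such an endpoint can look like; the pickle file name and the payload format are assumptions, not the repo's exact code:

```python
import pickle

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
model = pickle.load(open("model.pkl", "rb"))  # assumed artifact name

@app.route("/predict", methods=["POST"])
def predict():
    # expects a JSON list with one record per match
    features = pd.DataFrame(request.get_json())
    features["prediction"] = model.predict(features)
    return jsonify(features.to_dict(orient="records"))
```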

The application is available and hosted at https://pl-matches-predictor.herokuapp.com/predict. The input should be as follows:

  • h_odd — Float number that represents the home team's winning odd.
  • a_odd — Float number that represents the away team's winning odd.
  • d_odd — Float number that represents the draw odd.
  • ht_rank — Integer that represents the home team’s current rank.
  • at_rank — Integer that represents the away team’s current rank.
  • ht_l_points — Average home team points in the last 3 games.
  • at_l_points — Average away team points in the last 3 games.
  • at_l_wavg_points — Weighted Exponential Average away team's points in the last 3 games.
  • at_l_wavg_goals — Weighted Exponential Average away team's goals in the last 3 games.
  • at_l_wavg_goals_sf — Weighted Exponential Average away team's goals suffered in the last 3 games.
  • at_win_streak — Integer that represents the away team’s current win streak.
  • ls_winner — Last match winner between the two teams ("HOME_TEAM", "AWAY_TEAM", "DRAW"). If there was no previous match between them, leave this input empty.

The output of the request is a DataFrame containing a "prediction" column with the predicted match outcome.

Here’s a Python example of the request:
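
(A sketch using the requests library; the field values are illustrative and the list-of-records payload shape is an assumption based on the input description above.)

```python
import requests

match = {
    "h_odd": 1.75, "d_odd": 3.60, "a_odd": 4.50,
    "ht_rank": 4, "at_rank": 12,
    "ht_l_points": 2.33, "at_l_points": 1.00,
    "at_l_wavg_points": 1.20, "at_l_wavg_goals": 0.90,
    "at_l_wavg_goals_sf": 1.70, "at_win_streak": 0,
    "ls_winner": "HOME_TEAM",
}

response = requests.post(
    "https://pl-matches-predictor.herokuapp.com/predict", json=[match]
)
print(response.json())
```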

Improvements

In the future, if I were to improve the profit and accuracy levels, I probably wouldn't use Logistic Regression. Even though the other models in this particular project didn't match the profit L.R. made, they showed more potential to predict all 3 class outcomes.

Also, creating new features and adding external data, such as attendance figures and in-game stats like shots on target and fouls committed, may help improve the accuracy and profit levels.

Conclusion

Even though predicting soccer matches is not an easy task, I found the results of this project very satisfying: not only was it possible to achieve an accuracy score better than a random prediction or a "home team always wins" prediction, but it was also possible to beat the odds, with a profit of approximately 4.81% on the investment simulation using the model's predictions.

If you want to see more details about the project or contribute to it, check out its Github repo!
