My Automated Machine Learning Tipster Bot, on a Raspberry Pi

I programmed my £30 Raspberry Pi to automate picking football tips using machine learning techniques, which you can follow on twitter @PiBot_predicts

I say, anyone for tennis?

A while back I blogged about the potential of using predictive modelling for betting on Tennis.

I commented that bookmakers would likely tune their odds to maximise their profits based on the bets they had taken and those they were likely to take. The bookmaker’s implied probability would therefore sometimes differ significantly from the “actual” probability, and machine learning could be used to identify the most under-valued picks. I hypothesised that, as tennis is largely enjoyed for the sport itself, there would be plenty of data published on each match which could be exploited for profit.

A flaw in the plan

The ATP makes available a good selection of stats, first serve in % and so on, but for the WTA all we easily get is the set-by-set box score. This is a shame, as it forces you either to halve the number of observations or to drastically reduce the number of variables you can use. Sadly I conclude there isn’t enough data on enough matches.

That thing they show at the end of the news

With tennis ruled out, on to football. Tennis and horse racing data is locked away, either by broadcasters or pay-wall APIs. Football by contrast is just always in your face. I have no interest in football but even I have a rough awareness of what’s going on. With four divisions widely reported on in the UK there is a lot of consistent data available, and a lot of potential match-ups to predict against.

Raspberry Pi

I have been wanting to do something significant with my Raspberry Pi for a while now, and automating picking betting tips seemed like a fun project.

I wrote a number of scripts to scrape a few sites for past results and upcoming fixture odds, and re-used the scripts that generated the initial training data on this new data. Predictions are then made for the following day’s games, and top picks are selected as those theoretically returning a profit when the bookmaker’s implied probability is combined with my predicted probability.
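The pick-selection step can be sketched roughly like this. This is a minimal Python illustration, not the bot’s actual code (which is in R); the function names and the example fixtures are hypothetical:

```python
def implied_probability(decimal_odds):
    """The bookmaker's implied probability of an outcome, ignoring their margin."""
    return 1.0 / decimal_odds

def expected_return(model_probability, decimal_odds):
    """Theoretical return per unit staked, if the model's probability is right."""
    return model_probability * decimal_odds

# Hypothetical predictions for tomorrow's fixtures:
# (outcome, model probability, bookmaker's decimal odds)
predictions = [
    ("Home win, fixture A", 0.40, 3.00),   # 0.40 * 3.00 = 1.20 -> worth a look
    ("Away win, fixture B", 0.25, 3.50),   # 0.25 * 3.50 = 0.875 -> skip
]

# A pick is "top" when the model thinks the bet returns more than its stake.
top_picks = [(name, p, odds) for name, p, odds in predictions
             if expected_return(p, odds) > 1.0]
print(top_picks)
```

The filter keeps fixture A only: the model’s probability (0.40) exceeds the bookmaker’s implied probability (1/3.00 ≈ 0.33), which is exactly the under-valued-pick situation described above.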

The model was initially trained on my gaming laptop (because why wouldn’t you want to use a GPU) so porting the code to run on an ARM device was less simple than I had hoped.

I chose to implement much of the logic in R as I’m much more familiar with it than Python. Unfortunately the R version available in Raspbian’s apt repository was 3.1.1, which is too old to install the current version of the dplyr library, and also too old to install devtools to fetch a previous version… so compiling from source it was. It took about two hours with the Pi humming at 99% CPU utilisation, but helpfully this left me with the same R version on the Pi as on my laptop.

Now it’s time to sit back and watch the money roll in.


Slight lie, I don’t actually bet any of my own money on this because I’m not entirely confident it works. I cross-validated against the 2016/2017 season, and then again including the year to date, as depicted in the plot below.


The chart shows large fluctuations in success, and a lot of losing bets are made. In this simulation the bot comes out ahead, but barely, and it may have ended on a peak. Making £25 net profit from 326x £1 bets over a year isn’t a stunning return, so instead of staking real money I programmed the Raspberry Pi to post predictions to a blog here:, and also tweet top tips here:
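The simulated bankroll behind that chart boils down to very simple bookkeeping. A miniature Python sketch (the real simulation is in R, and these example bets are invented):

```python
def bankroll_curve(results, stake=1.0):
    """Running net profit over a sequence of (decimal_odds, won) bets.
    Each bet risks `stake`; a winning bet returns stake * odds."""
    total, curve = 0.0, []
    for odds, won in results:
        total += stake * (odds - 1) if won else -stake
        curve.append(round(total, 2))
    return curve

# Hypothetical mini-season: most bets lose, and occasional long-odds wins
# drag the curve back up, producing the fluctuations described above.
print(bankroll_curve([(3.5, False), (4.0, True), (2.8, False), (5.0, False), (6.0, True)]))
```

Because each losing £1 bet costs the full stake while a win at decimal odds d only nets d − 1, long losing streaks between wins are exactly what makes the curve so jagged.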

Share/follow/re-tweet if you’re so inclined but please don’t put any money on what a £30 Raspberry Pi says. If you’ve not seen it yet, watch Moneyball. Also read the book.



Hacking Tennis for luls and profit

As with many tech nerds, although employed in a specific area of IT I like to dabble in others in my free time. My most recent dabbling has been in data science. Although I say “science” I’m afraid my intentions are less noble than the word implies. I’m more interested in exploiting data for profit.

Odds of that?

Were I a bookmaker setting odds I could simply guesstimate the probability of an outcome, knock a bit off for my “fee”, and offer those odds to my punters. But where’s the profit if no one backs the loser?

The bookies have an awful lot of information at their disposal that they can use to balance a book. For example they know which teams / sports stars are popular with punters and will have a reasonable idea of how many bets they can expect when they offer any given odds. Were I setting odds I would be more interested in predicting how many people will take my odds and for what stakes than the messier business of predicting the outcome of a sporting event.
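The “knock a bit off for my fee” step can be made concrete. A hedged Python sketch, assuming a purely illustrative 5% margin (no real bookmaker’s figures are used here):

```python
def price_up(true_probabilities, overround=1.05):
    """Scale guesstimated probabilities so they sum to the overround,
    then quote decimal odds. The 5% margin is an illustrative 'fee'."""
    total = sum(true_probabilities)
    implied = [p / total * overround for p in true_probabilities]
    return [round(1.0 / p, 2) for p in implied]

# A hypothetical home/draw/away market priced from guesstimated probabilities.
odds = price_up([0.50, 0.25, 0.25])
print(odds)
```

The implied probabilities behind the quoted odds now sum to about 1.05 rather than 1.0, so a perfectly balanced book pays out less than it takes in regardless of the result. Shading individual prices further, based on which outcomes punters are expected to back, is what pushes odds away from the true probabilities.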

My goal as a bookmaker would be to make as much money as possible, as reliably as possible; I would not be at all interested in “gambling”. I suspect larger bookmakers already do this, which would leave an interesting inefficiency in the market, ripe for exploiting: odds would represent the punters’ expectation of the outcome, not the probability of the outcome.

Why Tennis?

I like tennis. Well I don’t watch tennis, but if I were to I think I’d like it. Tennis is an ideal candidate sport for odds profiteering for a number of reasons:

  1. Singles tennis is a simple contest between two players, with no group dynamics or summing of component parts to account for
  2. It’s enjoyed by many for the sport itself, so a wide range of data is publicly available for fans’ enjoyment, unlike horse racing where useful data sits behind a pay-wall
  3. Underdogs win fairly regularly. In 2016, nearly 28% of matches were won by the underdog[1]

I see predicting which underdogs win as a good area to make money. I theorise there are unsupported, relatively unknown players that few punters want to back. Bookies will incentivise bets on these players with higher-paying odds to balance their book and remove the gambling element.

I have been exploring this area with machine learning algorithms with promising results.

First Pass

As a proof of concept I used datasets from and simulated predicting the 2016 season. I used an out-of-time validation technique: for a given day, only data from previous days were used to train the model, and the model was then used to predict that day. In my implementation training the model was the bottleneck, so to shorten the runtime I tested three days at once, meaning the second and third tested days would be scored by an “outdated” model. I was careful to avoid leakage, and deemed this an acceptable compromise as it could only make results worse[2]
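The validation scheme described above, including the three-day batching compromise, might look something like this. A Python sketch with hypothetical stand-in train/predict helpers (the original implementation is in R):

```python
def walk_forward(matches_by_day, train_model, predict, batch_days=3):
    """For each batch of days, train only on strictly earlier days, then
    score the whole batch with that (slightly stale) model."""
    days = sorted(matches_by_day)
    results = {}
    for i in range(0, len(days), batch_days):
        batch = days[i:i + batch_days]
        history = [m for d in days[:i] for m in matches_by_day[d]]
        if not history:
            continue  # nothing earlier to train on yet
        model = train_model(history)  # trained once per batch, not per day
        for day in batch:
            results[day] = [predict(model, m) for m in matches_by_day[day]]
    return results

# Toy demonstration with one match per day and trivial stand-ins:
# the "model" is just the size of the training history.
demo = walk_forward(
    {1: ["a"], 2: ["b"], 3: ["c"], 4: ["d"]},
    train_model=len,
    predict=lambda model, match: model,
)
print(demo)
```

Day 4 is scored by a model trained on days 1–3 only, so no future data leaks into any prediction; the cost of batching is just that later days in each batch use a model missing the most recent results.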

I implemented some very simple features based on the easily available data, mostly game win percentage per set and comparisons with the opponent, and used these to train a predictive model in R to estimate a rough probability of the underdog winning, using only data that would have been available before each match.

This probability is combined with the betting odds to calculate a theoretical “average” return[3] for backing the underdog based on my assigned probability.

The Results

My results were very promising indeed. If you back every underdog you lose: some come in, but not enough to recoup the other lost stakes. But if you back only the underdogs my model estimates to have a theoretical return greater than 1.0, you make a profit.

The plot below illustrates the profit made and the number of bets made based on setting the threshold in different places.


The trick to maximising profit is deciding where to set the threshold for which underdogs to back. This is a conundrum, as it is very dangerous to tune a predictive model’s threshold on data seen after the fact.
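Sweeping the threshold over historical picks, as in the plot, can be sketched like this. The numbers are invented for illustration; `bets` pairs each match’s model-estimated return with the actual profit from a £1 stake:

```python
def sweep(bets, thresholds):
    """For each threshold, back only matches whose estimated return exceeds it;
    report (threshold, number of bets placed, total profit)."""
    rows = []
    for t in thresholds:
        placed = [profit for est_return, profit in bets if est_return > t]
        rows.append((t, len(placed), round(sum(placed), 2)))
    return rows

# Toy history: (model's estimated return, actual profit on a £1 stake;
# a win at decimal odds d yields d - 1, a loss yields -1).
bets = [(1.2, -1.0), (1.6, 3.5), (1.1, -1.0), (1.8, -1.0), (2.0, 4.0)]
for row in sweep(bets, [1.0, 1.5, 1.9]):
    print(row)
```

The danger the text describes is visible even here: picking whichever threshold maximises profit on this history is itself a form of fitting to data seen after the fact.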

My biggest criticism of the results is how few bets worth making were found. Setting the threshold at 1.5 identifies only 200 matches across the whole year as worth betting on, and only 36 of those come in. The odds were high enough to recoup the losses, but such small quantities feel too much like “gambling” and vulnerable to fluctuation. With only one reality to test outcomes in, it is unfortunately impossible to know whether this is the good or bad end of the range of possible outcomes.

What next?

I am pleased with the direction of my results but do not believe them conclusive enough to put this into production. I only used a small number of “features” to train my model and believe there to be more valuable mining that can be done here.

The major bottleneck in my experiments was the time it took my computer to train the model in R. The winter holidays have been a good time for this: not only have I had time off work to write my code, but also time with family away from my computer, allowing it to work whilst I don’t.

To make real progress I need more throughput. I have experience in C++ but limited access to good machine learning algorithm implementations in it. Learning Spark seems like a good way forward: the benchmarks I’ve seen place it well ahead of R, and its scale-out parallel design would let me add more cheap hardware if I keep seeing good results.



[1] by Bet365’s odds, 734 of 2626 recorded matches (three were excluded for not having odds available).

[2] I’d argue “could” should be read as “should” if this were written by someone else.

[3] Warning, don’t discuss philosophy with a computer guy: A theoretical average where the same match is played a number of times simultaneously in which different results are possible. Assumes “fate” isn’t a thing but also that instances are finite.