My Automated Machine Learning Tipster Bot, on a Raspberry Pi

I programmed my £30 Raspberry Pi to automate picking football tips using machine learning techniques, which you can follow on Twitter at @PiBot_predicts

I say, anyone for tennis?

A while back I blogged about the potential of using predictive modelling for betting on Tennis.

I commented that bookmakers would likely tune their odds to maximise their profits based on the bets they had taken and the bets they were likely to take. The bookmaker’s implied probability would therefore sometimes be significantly different from the “actual” probability, and machine learning could be used to identify the most under-valued picks. I hypothesised that, as tennis is largely enjoyed for the sport itself, there would be a lot of data published on each match which could be exploited for profit.

Flaw in the plan

The ATP makes available a good selection of stats (first serve in %, etc.), but for the WTA all we easily get is the set-by-set box score. Which is a shame, as it forces you to either halve the number of observations or drastically reduce the number of variables you can use. Sadly I conclude there’s not enough data on enough matches.

That thing they show at the end of the news

With tennis ruled out, on to football. Tennis and horse racing data is locked away, either by broadcasters or behind pay-walled APIs. Football, by contrast, is just always in your face. I have no interest in football but even I have a rough awareness of what’s going on. With four divisions widely reported on in the UK there is a lot of consistent data available, and a lot of potential match-ups to predict against.

Raspberry Ry

I have been wanting to do something significant with my Raspberry Pi for a while now, and automating picking betting tips seemed like a fun project.

I wrote a number of scripts to scrape a few sites for past results and upcoming fixture odds, and the scripts I had used to generate the initial training data were re-used on this new data. Predictions are then made for the following day’s games, and the top picks are those that would theoretically return a profit when the bookmaker’s implied probability is combined with my predicted probability.
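To make that selection rule concrete, here is a minimal sketch in Python (the bot itself is written in R, and the fixtures, names and numbers below are made up for illustration): a pick is kept when my predicted probability beats the bookmaker’s implied probability, i.e. when the expected profit of a flat £1 stake is positive.

```python
# A hedged sketch of the pick-selection rule (illustrative only; the bot is
# written in R and these fixtures, names and numbers are invented).
def expected_profit(predicted_prob, decimal_odds, stake=1.0):
    """Expected net profit of a flat-stake bet under the model's probability."""
    return predicted_prob * (decimal_odds - 1) * stake - (1 - predicted_prob) * stake

fixtures = [
    {"match": "Team A v Team B", "pick": "home win", "prob": 0.45, "odds": 2.60},
    {"match": "Team C v Team D", "pick": "away win", "prob": 0.30, "odds": 3.00},
]

# Keep only picks where the predicted probability beats the bookmaker's
# implied probability (1 / decimal odds), i.e. expected profit is positive.
top_picks = [f for f in fixtures if expected_profit(f["prob"], f["odds"]) > 0]
for f in sorted(top_picks, key=lambda f: expected_profit(f["prob"], f["odds"]), reverse=True):
    print(f["match"], "-", f["pick"], round(expected_profit(f["prob"], f["odds"]), 2))
```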

The model was initially trained on my gaming laptop (because why wouldn’t you want to use a GPU) so porting the code to run on an ARM device was less simple than I had hoped.

I chose to implement much of the logic in R as I’m much more familiar with it than Python. Unfortunately the R version available in Raspbian’s apt repository was 3.1.1, which is too old to install the current version of the R library dplyr, and also too old to install devtools in order to install a previous version… compile from source it was then. It took about two hours with the Pi humming at 99% CPU utilisation, but helpfully this gives me the same R version on the Pi as I am running on my laptop.

Now it’s time to sit back and watch the money roll in.


Slight lie: I don’t actually bet any of my own money on this because I’m not entirely confident it works. I cross-validated against the 2016/2017 season, and then again including the year to date, as depicted in the plot below.


The chart shows large fluctuations in success, and a lot of losing bets are made. In this simulation the bot comes out ahead, but only barely, and it may have ended on a peak. Making £25 net profit from 326 £1 bets over a year isn’t a stunning return, so instead of staking real money I programmed the Raspberry Pi to post its predictions to a blog and to tweet the top tips from @PiBot_predicts.
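For anyone curious how such a back-test settles, here is a rough Python sketch of the idea (the bet list is invented; the real simulation runs off the scraped results): each £1 flat-stake bet either returns the odds minus one or loses the stake, and the running balance traces the curve in the chart.

```python
# A rough sketch of settling the back-test: each £1 flat-stake bet either
# returns (odds - 1) or loses the stake, and the running balance gives the
# cumulative-profit curve. The bet list here is invented for illustration.
def cumulative_profit(bets, stake=1.0):
    balance, history = 0.0, []
    for decimal_odds, won in bets:
        balance += (decimal_odds - 1) * stake if won else -stake
        history.append(round(balance, 2))
    return history

bets = [(2.4, True), (3.1, False), (1.9, False), (2.8, True)]
print(cumulative_profit(bets))  # [1.4, 0.4, -0.6, 1.2]
```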

Share/follow/re-tweet if you’re so inclined but please don’t put any money on what a £30 Raspberry Pi says. If you’ve not seen it yet, watch Moneyball. Also read the book.



The Sport of Programming


I’ve been envious of cryptographers for the many public challenges that bodies such as GCHQ have been setting over the years. I have always wanted to jump in, but cryptography is not an area I have an interest in, and the barrier to entry for me has just been too high. Which is why I was delighted to see a competition in an area I do have some knowledge in: data analytics.

The Data Science Challenge, fronted by the UK Government’s Defence Science and Technology Laboratory, promised to give ordinary members of the public the chance to play with “representative” defence data. Two competitions were set: a text classification competition and a vehicle detection competition. Both took the format of providing a training data set to create a model, with scores then based on predictions made for an unlabelled test data set.


The text classification competition was detecting the topics of Guardian articles from their content, whilst the vehicle detection competition was detecting and classifying vehicles appearing in satellite images. I saw this as an excellent opportunity to practise two technologies I had not used much before, Spark and TensorFlow.

How’d I do?

Good. Tragically, the user area of the website had already been taken down by the time of writing this retrospective, so I can’t check my final standings; however, I entered both competitions and from memory finished just outside the top 20 in each (of 500-800-ish entrants per competition).

Which I’m pretty happy with. I noted that the top 10 in each competition did not enter both competitions, so I’m happy that my skill-set is general enough to pick up new (to me) technologies quickly and perform reasonably well, even if not quite matching those who specialise in a particular area.

How I did it – Text Classification

I had been starting to learn Apache Spark in the run-up to this competition, as R was proving too difficult to parallelise efficiently for large data sets, and it seemed a natural fit here. I found the map-reduce aspect of Spark easy to pick up; it’s very similar to the functional programming and lambda calculus I studied at university many years ago, which further goes to show nothing’s really new in IT. Even neural networks aren’t too far evolved from the hyper-heuristics of 10+ years past.

My solution was based on a comment from Dr Hannah Fry in the BBC4 documentary The Joy of Data, which I had watched a few weeks earlier, where she summarised that the less frequently a word is used, the more information it carries. For each topic I conducted a word count and compared the frequency with which a word was used inside the topic against the frequency with which it was used outside the topic. The words which saw the most significant increase in usage frequency were then used to classify topics.

I found setting thresholds on the number of distinct articles a word appeared in to be key, as this prevented words used many times in a small number of articles from becoming over-fitted keywords. Once the keywords for each topic were identified, it was easy to count them in all articles, which reduced the problem to a simple classification based on numerical data.
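As a rough illustration of that keyword-selection step, the sketch below uses PySpark to compare a word’s relative frequency inside a topic with its relative frequency outside it. The data and topic names are hypothetical stand-ins, and the distinct-article thresholds mentioned above are omitted for brevity.

```python
# An illustrative PySpark sketch of the relative-frequency idea: compare how
# often each word appears inside a topic versus outside it. `articles` and the
# topic name are hypothetical; distinct-article thresholds are omitted.
from pyspark import SparkContext

sc = SparkContext(appName="topic-keywords")
articles = [("politics", "the minister said the vote was close"),
            ("sport", "the team won the match late on")]      # stand-in data
rdd = sc.parallelize(articles)                                # [(topic, text)]

def tokenise(text):
    return [w for w in text.lower().split() if w.isalpha()]

def word_counts(pairs):
    return (pairs.flatMap(lambda kv: tokenise(kv[1]))
                 .map(lambda w: (w, 1))
                 .reduceByKey(lambda a, b: a + b))

topic = "politics"
in_topic = word_counts(rdd.filter(lambda kv: kv[0] == topic))
out_topic = word_counts(rdd.filter(lambda kv: kv[0] != topic))
n_in, n_out = in_topic.values().sum(), out_topic.values().sum()

# Words whose relative frequency rises most inside the topic become keywords.
# (The join drops words unseen outside the topic; a real version would smooth.)
uplift = (in_topic.join(out_topic)
                  .mapValues(lambda c: (float(c[0]) / n_in) / (float(c[1]) / n_out))
                  .takeOrdered(20, key=lambda kv: -kv[1]))
print(uplift)
```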

I experimented with a range of models including random forests and multiple-variable linear regression; extreme gradient boosting showed the best accuracy.
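Something along these lines is how that comparison can be run once each article has been reduced to keyword counts. This is an illustration rather than the competition code: synthetic data stands in for the real feature matrix, and logistic regression stands in as the classification analogue of the linear model.

```python
# An illustration (not the competition code) of comparing classifiers on
# keyword-count features; the data here is synthetic.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 30))   # stand-in keyword-count matrix
y = rng.integers(0, 4, size=200)         # stand-in topic labels

models = {
    "random forest": RandomForestClassifier(n_estimators=200),
    "logistic regression": LogisticRegression(max_iter=1000),
    "extreme gradient boosting": XGBClassifier(n_estimators=200, max_depth=4),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean(), 3))
```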

At this point I was still quite far off the pace set by the leaders, so I extended my solution to also use bigrams (sequential pairs of words). This took a little more effort, particularly as punctuation now had to be accounted for whereas previously it could all be stripped, but a fun coding session later I was up and running.
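A small sketch of that bigram step, assuming a simple punctuation-aware tokeniser (not the actual competition code): splitting on punctuation first keeps word pairs from straddling sentence boundaries.

```python
# Emit adjacent word pairs, splitting on punctuation first so that no bigram
# crosses a sentence or clause boundary. Tokeniser rules here are assumptions.
import re

def bigrams(text):
    pairs = []
    for fragment in re.split(r"[.!?;:,]", text.lower()):
        words = re.findall(r"[a-z']+", fragment)
        pairs.extend(zip(words, words[1:]))
    return pairs

print(bigrams("Prices rose sharply. The Bank of England held rates."))
# [('prices', 'rose'), ('rose', 'sharply'), ('the', 'bank'), ('bank', 'of'), ...]
```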

There are obviously far more pairs of words than there are words, and this is where I met the computational limitations of my machine. Memory was manageable but I needed more compute to do further analysis on bigrams, and beyond that trigrams. The majority of my code was Spark using pyspark, so moving on to AWS would have been fairly simple, but two driving forces made me stop there:

  1. There’s another competition and I really want to do both
  2. I’m a cheapskate and don’t want to pay AWS

How I did it – Vehicle Detection

Basically, I hacked something together with TensorFlow and did surprisingly well.

This was far from anything I had done before, but I consider myself a well-rounded programmer and was keen to take up the challenge. I wrote my dissertation many years ago on computer vision and feature detection and so had some understanding of image processing, but had not yet touched neural networks.

With time now being of the essence, since I had spent too long on the first competition, I dived into some tutorials and worked backwards. In my eyes the problem became: find out what I can do, then hammer that into a format that answers the question.

I’d previously dabbled in a Kaggle digit-recognition competition and used this as the starting point, however it was RStudio’s TensorFlow tutorial that really got me up and running. With a little code modification to account for three colour channels, I was able to pass in image “chips” labelled with what they contained (if anything; random untagged chips were also used) and use those to train a softmax model, and then a multilayer conv-net, both using a range of different chip sizes and chip spacings to find a good balance.

An example source image, and two chips containing vehicles (not to scale)
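The chipping itself can be sketched as a sliding window over the three-channel image; the chip size and spacing below are placeholders rather than the values I actually used.

```python
# A rough illustration of the chipping step: slide a fixed-size window over
# the RGB image and keep each three-channel chip for labelling and training.
# The chip size and spacing are placeholder values.
import numpy as np

def make_chips(image, chip_size=64, spacing=32):
    """image: H x W x 3 array; returns a list of (row, col, chip) tuples."""
    chips = []
    height, width, _ = image.shape
    for r in range(0, height - chip_size + 1, spacing):
        for c in range(0, width - chip_size + 1, spacing):
            chips.append((r, c, image[r:r + chip_size, c:c + chip_size, :]))
    return chips

image = np.zeros((512, 512, 3), dtype=np.uint8)  # stand-in satellite tile
print(len(make_chips(image)))  # 225 overlapping 64x64x3 chips
```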

As a beginner I started with the CPU-only version of TensorFlow but quickly moved to the GPU-accelerated version using NVIDIA’s cuDNN library. Wow, the improvement was staggering: the training stage was just over 7 times faster using my modest GTX 960M (4GB version) than using just my i7-6700HQ.
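If you try the same switch, it’s worth confirming that the GPU build can actually see the card before training; the check below uses the TensorFlow 2.x API (the 1.x API differs).

```python
# Sanity check that the GPU build of TensorFlow can see the card before
# training (TensorFlow 2.x API shown; the 1.x API is different).
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print("GPUs visible:", gpus if gpus else "none - training will fall back to the CPU")
```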

Closing Thoughts

I enjoyed the challenge but there were a couple of points which let it down. Firstly, the promise of playing with representative defence data was totally exaggerated: the data was articles from the Guardian and Google satellite images of a UK city. It was nice to get the data in an easily machine-processable format, but this data is already publicly accessible via HTML and APIs.

Secondly, although building a community was a stated goal, the competition was not set up to facilitate that. The leaderboard was limited to viewing the top 10, and the community forums already seem to have been taken down. Hopefully they can learn from Kaggle and its thriving community here.

But I am very satisfied with how close I came to the winners in each competition and look forward to the next round. Time to see what else I can do with my growing Spark and neural-network knowledge.