The Sport of Programming


For many years now I’ve been envious of cryptographers for the public challenges set by bodies such as GCHQ. I’ve always wanted to jump in, but cryptography is not an area I have an interest in, and the barrier to entry has just been too high for me. Which is why I was delighted to see a competition in an area I do have some knowledge of: data analytics.

The Data Science Challenge, fronted by the UK Government’s Defence Science and Technology Laboratory, promised to give ordinary members of the public the chance to play with “representative” defence data. Two competitions were set: text classification and vehicle detection. Both took the format of providing a training data set to create a model, with scores then based on predictions made for an unlabelled test data set.


The text classification competition was about detecting the topics of Guardian articles from their content, whilst the vehicle detection competition was about detecting and classifying vehicles appearing in satellite images. I saw this as an excellent opportunity to practise two technologies I had not used much before: Spark and TensorFlow.

How’d I do?

Good. Tragically, the user area of the website had already been taken down by the time of writing this retrospective, so I can’t check my final standings. However, I entered both competitions and from memory finished just outside the top 20 in each (of 500-800ish entrants per competition).

Which I’m pretty happy with. I noted that the top 10 in each competition did not enter both, so I’m pleased that my skill set is general enough to pick up technologies that are new to me quickly and perform reasonably well, even if not quite matching those specialised in a particular area.

How I did it – Text Classification

I had been starting to learn Apache Spark in the run-up to this competition, as R was proving too difficult to parallelise efficiently for large data sets, and thought it a natural fit here. I found the map-reduce aspect of Spark easy to pick up; it’s very similar to the functional programming and lambda calculus I studied at university many years ago, which further goes to show nothing’s really new in IT. Even neural networks aren’t too far evolved from the hyper-heuristics of 10+ years past.

My solution was based on a comment from Dr Hannah Fry in the BBC4 documentary The Joy of Data, which I had watched a few weeks earlier, where she summarised that the less frequently a word is used, the more information it carries. For each topic I conducted a word count and compared the frequency with which a word was used inside the topic against the frequency with which it was used outside the topic. The words which saw the most significant increase in usage frequency were then used to classify topics.

I found setting thresholds for the number of distinct articles a word appeared in to be key, as this prevented words used many times in a small number of articles from becoming over-fitted keywords. Once the keywords for each topic were identified, it was easy to count them across all articles, which reduced the problem to simple classification on numerical data.
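To make that concrete, here’s a minimal sketch of the keyword selection in plain Python (the real pipeline ran on Spark; the threshold value, smoothing, and scoring details here are illustrative rather than exactly what I used):

```python
from collections import Counter

MIN_ARTICLES = 20  # a word must appear in at least this many distinct articles

def topic_keywords(articles, topic, top_n=25):
    """articles: iterable of (topic_label, text) pairs.
    Returns the words whose per-article usage frequency rises most
    inside `topic` compared with the rest of the corpus."""
    in_topic, out_topic = Counter(), Counter()
    n_in = n_out = 0
    for label, text in articles:
        words = set(text.lower().split())  # distinct words per article
        if label == topic:
            in_topic.update(words)
            n_in += 1
        else:
            out_topic.update(words)
            n_out += 1
    scores = {}
    for word, count in in_topic.items():
        if count + out_topic[word] < MIN_ARTICLES:
            continue  # used in too few distinct articles: risks over-fitting
        freq_in = count / n_in
        freq_out = (out_topic[word] + 1) / (n_out + 1)  # smoothed
        scores[word] = freq_in / freq_out
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```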

I experimented with a range of models including random forests and multiple-variable linear regression; extreme gradient boosting showed the best accuracy.
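Once every article is reduced to keyword counts, the modelling step is standard tabular classification. A hypothetical sketch, using random stand-in data rather than the competition set:

```python
# Hypothetical sketch: extreme gradient boosting over keyword counts.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(1000, 30))  # per-topic keyword counts per article
y = rng.integers(0, 10, size=1000)        # topic labels

model = xgb.XGBClassifier(n_estimators=200, max_depth=4)
model.fit(X, y)
print(model.predict(X[:5]))               # predicted topics
```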

At this point I was still quite far off the pace set by the leaders, so I extended my solution to also use bigrams (sequential pairs of words). This took a little more effort, particularly as punctuation now had to be accounted for whereas previously it could all be stripped, but a fun coding session later I was up and running.
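The bigram extraction amounted to something like the following sketch (simplified; the real punctuation handling was fiddlier, and the real run used Spark):

```python
import re

def bigrams(text):
    """Extract sequential word pairs, splitting on sentence-ending
    punctuation so no pair spans a sentence boundary."""
    pairs = []
    for sentence in re.split(r"[.!?;:]", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        pairs.extend(zip(words, words[1:]))
    return pairs

print(bigrams("The data was big. Really big data!"))
# [('the', 'data'), ('data', 'was'), ('was', 'big'), ('really', 'big'), ('big', 'data')]
```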

There are obviously far more pairs of words than there are words, and this is where I met the computational limitations of my machine. Memory was manageable, but I needed more compute to do further analysis on bigrams, and beyond that trigrams. The majority of my code was Spark via pyspark, so moving on to AWS would have been fairly simple, but two driving forces made me stop there:

  1. There’s another competition and I really want to do both
  2. I’m a cheapskate and don’t want to pay AWS

How I did it – Vehicle Detection

Basically, I hacked something together with TensorFlow and did surprisingly well.

This was far from anything I had done before, but I consider myself a well-rounded programmer and was keen to take up the challenge. I wrote my dissertation many years ago on computer vision and feature detection, so I had some understanding of image processing, but had not yet touched neural networks.

With time now of the essence, since I had spent too long on the first competition, I dived into some tutorials and worked backwards. In my eyes the problem became: find out what I can do, then hammer that into a format that answers the question.

I’d previously dabbled in a Kaggle digit recognition competition and used this as my starting point, however it was RStudio’s TensorFlow tutorial that really got me up and running. With a little code modification to account for three colour channels, I was able to pass in image “chips” labelled with what they contained (if anything; random untagged chips were also used) and use those to train first a softmax model and then a multilayer ConvNet, both with a range of different chip sizes and chip spacings to find a good balance.

An example source image, and two chips containing vehicles (not to scale)
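For a flavour of the starting point, the tutorial’s softmax model widened to RGB input looks roughly like this (written against the TensorFlow 1.x API of the time; the chip size and class count are illustrative):

```python
import tensorflow as tf  # 1.x-era API

CHIP = 32                    # chip edge length in pixels (illustrative)
N_CLASSES = 3                # e.g. car, truck, nothing (illustrative)
N_PIXELS = CHIP * CHIP * 3   # three colour channels rather than MNIST's one

x = tf.placeholder(tf.float32, [None, N_PIXELS])    # flattened RGB chips
y_ = tf.placeholder(tf.float32, [None, N_CLASSES])  # one-hot labels

W = tf.Variable(tf.zeros([N_PIXELS, N_CLASSES]))
b = tf.Variable(tf.zeros([N_CLASSES]))
y = tf.matmul(x, W) + b                             # logits

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
```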

As a beginner I started with the CPU-only version of TensorFlow but quickly moved to the GPU accelerated version using NVIDIA’s cuDNN library. Wow, the improvement was staggering. The training stage was just over 7 times faster using my modest GTX 960M (4GB version) than using just my i7-6700HQ.

Closing Thoughts

I enjoyed the challenge, but there were a couple of points which let it down. Firstly, the promise of playing with representative defence data was totally exaggerated: the data was articles from theguardian.com and Google satellite images of a UK city. It was nice to get the data in an easily machine-processable format, but this data is already publicly accessible via HTML and APIs.

Secondly, although building a community was a stated goal, the competition was not set up to facilitate that. The leaderboard was limited to viewing the top 10, and the community forums already seem to have been taken down. Hopefully they can learn from Kaggle and its thriving community here.

But I am very satisfied with how close I came to the winners in each competition and look forward to the next round. Time to see what else I can do with my growing Spark and neural network knowledge.


Hacking Tennis for luls and profit

As with many tech nerds, although employed in a specific area of IT I like to dabble in others in my free time. My most recent dabbling has been in data science. Although I say “science” I’m afraid my intentions are less noble than the word implies. I’m more interested in exploiting data for profit.

Odds of that?

Were I a bookmaker setting odds, I could simply guesstimate the probability of an outcome, knock a bit off for my “fee”, and offer those odds to my punters. But where’s the profit if no one backs the loser?

The bookies have an awful lot of information at their disposal that they can use to balance a book. For example, they know which teams and sports stars are popular with punters and will have a reasonable idea of how many bets to expect when they offer any given odds. Were I setting odds, I would be more interested in predicting how many people will take my odds, and for what stakes, than in the messier business of predicting the outcome of a sporting event.

My goal as a bookmaker would be to make as much money as possible, as reliably as possible; I would not be at all interested in “gambling”. I suspect the larger bookmakers already do this, which would leave an interesting inefficiency in the market, ripe for exploiting: the odds would represent the punters’ expectation of the outcome rather than the probability of the outcome.

Why Tennis?

I like tennis. Well, I don’t watch tennis, but if I were to, I think I’d like it. Tennis is an ideal candidate sport for odds profiteering for a number of reasons:

  1. Singles tennis is a simple competition between two players, with no group dynamics or summing of component parts to account for
  2. It’s enjoyed by many for the sport itself, meaning a wide range of data is publicly available for fans’ enjoyment, unlike horse racing where the useful data sits behind a pay-wall
  3. Underdogs win fairly regularly: in 2016, nearly 28% of matches were won by the underdog[1]

I see predicting which underdogs will win as a good area to make money. I theorise that there are unsupported, relatively unknown players whom few punters want to back, so bookies will incentivise backing these players with higher-paying odds to balance their book and remove the gambling element.

I have been exploring this area with machine learning algorithms with promising results.

First Pass

As a proof of concept I used datasets from tennis-data.co.uk and simulated predicting the 2016 season. I used an out-of-band validation technique where, for a given day, only data from previous days was used to train the model, which was then used to predict that day’s matches. In my implementation training the model was the bottleneck, so to shorten the runtime I tested three days at once, meaning the second and third days would be predicted with a slightly “outdated” model. I was careful to avoid leakage and deemed this an acceptable compromise, as it could only make results worse[2].
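In outline, the loop looked something like this Python sketch (the original was in R; train_model and the column names stand in for the actual feature building and model fitting):

```python
import pandas as pd

def walk_forward(matches, train_model, step_days=3):
    """matches: DataFrame sorted by 'date'. Each block of `step_days`
    distinct dates is predicted by a model trained only on strictly
    earlier matches, so the 2nd and 3rd days use a slightly stale model."""
    out = []
    dates = sorted(matches["date"].unique())
    for i in range(0, len(dates), step_days):
        block = dates[i:i + step_days]
        train = matches[matches["date"] < block[0]]  # no leakage
        test = matches[matches["date"].isin(block)]
        if train.empty or test.empty:
            continue
        model = train_model(train)                   # the slow bit
        out.append(test.assign(p_underdog=model.predict(test)))
    return pd.concat(out)
```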

I implemented some very simple features based on the easily available data, mostly game-win percentage per set and comparisons with the opponent, and used these to train a predictive model in R that calculated a rough probability of the underdog winning, using only data that would have been available before each match.

This probability is then combined with the betting odds to calculate a theoretical “average” return[3] for backing the underdog at my assigned probability.
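The arithmetic behind that theoretical return is just the probability multiplied by the decimal odds:

```python
def expected_return(p_win, decimal_odds):
    """Theoretical average return per unit staked."""
    return p_win * decimal_odds

# e.g. a 45% chance at decimal odds of 2.6 returns 1.17 per unit
# on average, so it clears a break-even threshold of 1.0
assert abs(expected_return(0.45, 2.6) - 1.17) < 1e-9
```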

The Results

My results were very promising indeed. If you back every underdog you lose: some come in, but not enough to recoup the other lost stakes. But if you were to back only the underdogs my model estimated to have a theoretical return greater than 1.0, you would make a profit.

The plot below illustrates the profit made, and the number of bets placed, for different settings of the threshold.

[Plot: profit and number of bets against the underdog-return threshold]
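The sweep behind the plot is straightforward to express; a sketch, with each bet reduced to an illustrative (expected_return, won, decimal_odds) tuple:

```python
def sweep(bets, thresholds):
    """bets: list of (expected_return, won, decimal_odds) tuples.
    For each threshold, back every underdog whose estimated return
    clears it at 1 unit per bet, and tally bets made and profit."""
    results = []
    for t in thresholds:
        taken = [b for b in bets if b[0] > t]
        profit = sum((odds - 1.0) if won else -1.0
                     for _, won, odds in taken)
        results.append((t, len(taken), profit))
    return results

# e.g. sweep(bets, [1.0, 1.25, 1.5, 1.75, 2.0])
```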

The trick to maximising profit is deciding where to set the threshold for which underdogs to back. This is a conundrum, as it is very dangerous to set a predictive model’s threshold using data from after the fact.

My biggest criticism of the results is how few bets worth making were found. Setting the threshold at 1.5 identifies only 200 matches across the whole year as worth betting on, and only 36 of these come in. The odds were high enough to recoup the losses, but such small quantities feel too much like “gambling” and vulnerable to fluctuation. With the limitation of only one reality to test outcomes in, it is unfortunately impossible to know whether this is the good or the bad end of the possible outcomes.

What next?

I am pleased with the direction of my results but do not believe them conclusive enough to put this into production. I only used a small number of “features” to train my model and believe there is more valuable mining to be done here.

The major bottleneck in my experiments was the time it took my computer to train the model in R. The winter holidays have been a good time for this: not only have I had time off work to write my code, but also time with family away from my computer, allowing it to work whilst I don’t.

To make real progress I need more throughput. I do have experience in C++, but limited access to good machine learning algorithm implementations in it. Learning Spark seems like a good way forward: benchmarks I’ve seen place it well ahead of R, and its scale-out parallel design would allow me to add more cheap hardware if I keep seeing good results.

Plus I may be looking for a new job in Data Science / Big Data in the near future and Spark is the feather to have in your cap right now.


Footnotes:

[1] by Bet365’s odds, 734 of 2626 recorded matches (three were excluded for not having odds available).

[2] I’d argue “could” should be read as “should” if this were written by someone else.

[3] Warning, don’t discuss philosophy with a computer guy: A theoretical average where the same match is played a number of times simultaneously in which different results are possible. Assumes “fate” isn’t a thing but also that instances are finite.

Time to take Java seriously again?

Like many Computer Science graduates, I’d say Java was the first language I really learnt. Sure, I’d dabbled in C and VB, but Java is where I first wrote meaningful code beyond examples from the textbook. And, again like many Computer Science graduates, I turned my back on Java pretty soon after that.


My experience in video game programming, as well as my current day job around research computing (although not in a programming capacity), both feature squeezing every drop out of the hardware, which sadly leaves little space for Java. In both, code written in fast, low-level languages is optimised to exploit the hardware it will run on.


The ongoing data analytics and machine learning revolution, surely the most exciting area in IT at the moment, is bringing with it a data-centric approach of which we should all take note. The need is not to get the most out of your hardware but to get the most out of your data, as quickly and continuously as possible to retain your advantage.

Spark, for example, is written in Scala, which compiles into Java byte code to run on the Java Virtual Machine, which itself finally runs on the hardware. Furthermore, many Spark apps are themselves written in a different language such as R or Python, which must first interface with Spark. That is a lot of layers of abstraction, each adding overheads which would be shunned by performance-orientated programmers.


Yet when I look at these stacks I instead see wonderful things being done and begin to see past my preconceptions.

I’m also seeing containers grow in prominence, and they are a natural fit for Java development. With S2I (source-to-image) builds, developers can seamlessly inject their code from their git repository into a Docker image and deploy that straight onto a managed system.

Whilst C++ will remain the norm for mature performance-orientated applications, hypothesis testing and prototyping to yield quick results is giving an extra life to Java.

If I could predict the future I’d sell it to the highest bidder


Technology should give us superpowers. I’m still waiting for flight, and not holding my breath for invisibility, but predicting the future is getting better all the time.

Big Data isn’t about “big” data. It’s orders of magnitude smaller than what the Large Hadron Collider spits out in a day. But the combination of the right nuggets of information can yield some big predictions.

What I find most interesting is the huge breadth of variables tracked and used to make predictions. The more obvious ones include the weather impacting public mood and buying decisions, but even things like the results of football matches can have an impact.

If Liverpool have a big win against Manchester United, there’ll be lots of gamblers in Liverpool with some extra money. Target your advertising that way? Likewise, if United win there’ll be windfalls in London. And what if Everton win, Microsoft’s stock drops, it rains in Kent and there’s a wedding on EastEnders?

Monitoring and weighing up all these small impacts, and working out which ones are relevant to making an accurate prediction about the future, is currently beyond me. But you may have heard of the US retailer Target being able to predict that a teenager was pregnant before she had time to tell her parents.

This is an exciting but terrifying area and one I’ll be following closely.