Data Quality – Here Come The Outliers

[Reading time: 5mn]

We’re at a point where playing with our first outlier detection functions becomes interesting: it starts generating ideas. This is mainly due to the fact that we have used Clojure to do these experiments.

As a functional programming language, Clojure puts functions first, allows for abstraction through composition and offers a dynamic and fun approach to developing using a REPL.
I know, it’s Lispy – and this is precisely the point. Code as data (or is it the reverse?) and laziness (meaning lazy evaluation) provide tremendous opportunities when it comes to data analytics and data quality investigations.

I know, it’s JVM-based – but this is incredibly useful. It means that Java integration is baked in. You can use your usual Java libraries in Clojure code, and you can also create Clojure libraries to use in Java. You can even dynamically invoke Clojure code from Java – something we’ll talk about in a future post.

In this post, I’ll briefly present our experiments at building an interactive outlier detection function. Further posts will expand on this to cover visualization of the outliers using the incredible Incanter R-like statistical library. The source code is available on GitHub – but beware, it is definitely not production ready.

 

So okay, let’s say you have a bunch of related data: for example historical data like time series of financial entities. Stock prices, FX spots, market indices, or even electronic trades – you name it – over a period of time.

People close to these matters know that the simple act of recording high volumes of data influences it, much like looking at Schrödinger’s cat might actually kill it. The multi-threaded data recording and distribution systems are prone to concurrency errors (which are known to be among the hardest to debug), and these can start producing strange values at times. And you also have human errors: have you never seen an operator mix up yen and dollars when entering a trade?

Enter outliers. Bad data points which spike and break the nice roundness of charts. Wrong data which block settlement processing. Strange data that lower the signal-to-noise ratio and add to the burden on analysts.

And now, in order to prevent all these issues, you need to clean up the data. This is precisely what we have recently started experimenting with in Clojure, and we thought it was interesting enough to share with our readers.

But first we need some data. Let’s go to Yahoo Finance and pick a handful of SP500 prices, which you can download to a CSV file on your machine. For the purpose of this article, we’ll limit our extraction to 2010, which is about 250 values. Later on, you can experiment with the whole set available (Yahoo Finance has them since 1950).

Then of course you need Clojure. You can download version 1.2.1, and the accompanying clojure-contrib library from the Clojure web site. We don’t use any IDE, just the REPL in a terminal window which we start from a Leiningen project. This has the supplementary advantage of maintaining the classpath for you – because yes, you will need additional libraries.

The easiest way is to clone our project from GitHub: git clone git@github.com:artekcs/outliers.git. Once done, you can ask Leiningen to download the dependencies for you: lein deps – be warned, this might take some time because the dependencies include Incanter (which in turn has many dependencies). We won’t use Incanter in this session though.

When the dependencies are resolved, you can start a Clojure REPL: lein repl. Then, start by creating your own namespace which will reference all the libraries you need:

user=> (ns testing (:use [outlier utils core]))
nil
testing=>

Okay, now we can start. Let’s load the SP500 history data you’ve just downloaded. This is a CSV file, so we used the CsvReader Java library to handle it (source code for our function is in our project’s csv.clj file). And here’s how you load your file:
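For readers who do not want to open csv.clj right away, here is a minimal sketch of what such a loader might look like. It is not the project's implementation: it uses plain string splitting instead of CsvReader, the name parse-datevalue-csv is hypothetical, and it assumes a header line, no quoted fields, and numeric close values. The sample rows are illustrative, not real quotes.

```clojure
(ns sketch.csv
  (:require [clojure.string :as str]))

(defn parse-datevalue-csv
  "Hypothetical helper (not the project's loader): parses CSV text with a
  header line and returns a sequence of maps holding only the two
  requested columns. Assumes no quoted fields."
  [text date-key value-key]
  (let [[header & rows] (str/split-lines text)
        columns         (map keyword (str/split header #","))]
    (for [row  rows
          :let [record (zipmap columns (str/split row #","))]]
      {date-key  (get record date-key)
       value-key (Double/parseDouble (get record value-key))})))

;; Illustrative rows, newest first, as in the downloaded file.
(def sample
  "Date,Open,Close\n2010-01-05,101.0,102.5\n2010-01-04,100.0,101.0")

(parse-datevalue-csv sample :Date :Close)
```

Keeping only the date and the close value at load time means every later step works on small two-key maps.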

testing=> (def csv (load-datevalue-from-csv "/Users/artekcs/Development/data/sp500/SP500-2010.csv" :Date :Close))

You will have noticed that we only load 2 columns: the date and the close value. But we also need to get the values in order. Because we will be assessing each of them in a time sequence, we want them sorted by date ascending, whereas the csv as loaded is in descending order, starting at Dec. 31st… Not an issue, let’s use Clojure’s sort-by:

testing=> (def sorted-csv (sort-by :Date csv))
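As a small illustration of what sort-by does here, the sketch below sorts three hand-written rows. It assumes the dates are held as ISO-formatted strings (yyyy-MM-dd), which sort chronologically when compared as text; the rows and their values are made up.

```clojure
;; Illustrative rows in descending date order, like the loaded file.
(def rows [{:Date "2010-12-31" :Close 3.0}
           {:Date "2010-12-30" :Close 2.0}
           {:Date "2010-12-29" :Close 1.0}])

;; The keyword :Date acts as the key function: rows are compared by date.
(def sorted-rows (sort-by :Date rows))

(map :Date sorted-rows)
;; => ("2010-12-29" "2010-12-30" "2010-12-31")
```

If the dates were in another text format, such as day-first, a parsing key function would be needed instead of the bare keyword.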

Now, we’ll extract just the close value:

testing=> (def values (into [] (map #(:Close %) sorted-csv)))

What happens here is that we use the key “:Close” as a function on each row map of “sorted-csv” so that it just returns the column we want – yep, this is reminiscent of relational algebra’s projections. As we want to do this for every row, we use the map function and pour the results into a brand new vector “[]” which is bound to the var “values”. The “map” function itself returns a lazy sequence, which is great when you handle millions of rows, but note that “into” consumes it entirely, so our “values” vector is fully realized. If you prefer to stay lazy on a large data set, keep the result of “map” as is, and just remember to **not** evaluate it at the REPL – for example don’t do a (count ...) on it…
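The difference between the lazy sequence that map returns and the vector that into [] builds can be seen in a short sketch. The rows are made up; the only point is which of the two names is lazy.

```clojure
(def rows [{:Date "2010-01-04" :Close 1.0}
           {:Date "2010-01-05" :Close 2.0}])

;; map alone returns a lazy sequence: nothing is computed until consumed.
;; Note that the bare keyword :Close is equivalent to #(:Close %).
(def lazy-closes (map :Close rows))

;; into [] walks the whole sequence and builds a fully realized vector.
(def closes (into [] (map :Close rows)))

(class lazy-closes) ;; => clojure.lang.LazySeq
(vector? closes)    ;; => true
(count closes)      ;; => 2, a constant-time lookup on a vector
```

A vector also gives fast access by index, which is convenient when each point has to be compared with its neighbors.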

Okay. Now look for outliers in these values:

testing=> (def outs (outliers-median values 5 1.5))

What we do here is call our outliers-median function (we also have an outliers-mean function) over the SP500 2010 values, evaluating each data point in a sample set of 5 points (meaning: surrounded by 2 neighbors on the left and 2 on the right), and checking whether its distance from the sample median is equal to or greater than 1.5 standard deviations (actually, the standard deviation calculation we use is median-based when we call the outliers-median function, and mean-based when we call the outliers-mean function). The result: 23 outliers.
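To make the idea concrete, here is a minimal sketch of a windowed, median-based check. It is not the project's outliers-median: the names median, deviation-around and window-median-outliers are hypothetical, the "median-based standard deviation" is assumed to be the square root of the mean squared distance from the window median (population form), points without a full window on both sides are skipped, and the difference is kept signed. The keyword keys of the result maps are also an assumption.

```clojure
(defn median
  "Median of a non-empty collection of numbers."
  [xs]
  (let [sorted (vec (sort xs))
        n      (count sorted)
        mid    (quot n 2)]
    (if (odd? n)
      (sorted mid)
      (/ (+ (sorted (dec mid)) (sorted mid)) 2.0))))

(defn deviation-around
  "Square root of the mean squared distance of xs from center
  (assumed definition of a median-based standard deviation)."
  [center xs]
  (Math/sqrt (double (/ (reduce + (map #(let [d (- % center)] (* d d)) xs))
                        (count xs)))))

(defn window-median-outliers
  "Returns one map per point whose distance from the median of its
  surrounding window is at least threshold deviations."
  [values window threshold]
  (let [values (vec values)
        half   (quot window 2)]
    (for [idx   (range half (- (count values) half))
          :let  [sample (subvec values (- idx half) (+ idx half 1))
                 value  (values idx)
                 center (median sample)
                 stddev (deviation-around center sample)
                 diff   (- value center)]
          ;; Skip flat windows, where every point equals the median.
          :when (and (pos? stddev)
                     (>= (Math/abs (double diff)) (* threshold stddev)))]
      {:diff diff :stddev stddev :comp center :value value :idx idx})))

;; Illustrative series with one obvious spike at index 4.
(window-median-outliers [10.0 11.0 10.0 12.0 50.0 11.0 10.0 12.0 11.0] 5 1.5)
```

On this made-up series only the spike at index 4 is reported, with the window median 11.0 as its comparison value. Swapping median for a mean function would give the mean-based variant.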

The outliers functions actually return a collection of maps defined like so:

  • diff: the difference between the outlier value and the sample set median / mean.
  • stddev: the standard deviation for the sample set surrounding the outlier.
  • comp: the sample set median / mean value.
  • value: the outlier value.
  • idx: the index of the outlier in the values collection. This is pretty useful when you want to get back to the original array of values that you are analyzing (for example, if you want to specifically highlight that value on a chart, as we’ll do in a future post).
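As a sketch of how the idx entry can be used, the snippet below attaches the original date to each outlier. It assumes the result maps use keyword keys (:idx, :value and so on), which the list above does not confirm; the helper name outlier-dates and all the numbers are illustrative.

```clojure
;; Illustrative sorted rows and a hand-written outlier map.
(def sorted-rows [{:Date "2010-01-04" :Close 10.0}
                  {:Date "2010-01-05" :Close 50.0}
                  {:Date "2010-01-06" :Close 11.0}])

(def outs [{:diff 39.0 :stddev 17.5 :comp 11.0 :value 50.0 :idx 1}])

(defn outlier-dates
  "Hypothetical helper: adds the :Date of the originating row to each
  outlier map, looking the row up by :idx."
  [rows outs]
  (let [rows (vec rows)] ; sort-by returns a seq; a vector gives fast lookup
    (map #(assoc % :Date (:Date (rows (:idx %)))) outs)))

(outlier-dates sorted-rows outs)
```

This only works if the values were extracted from the same sorted collection, so that index positions line up.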

As I said above, these are just experiments, and as such are totally subject to the experimenter’s mood and ideas. For example, the outlier functions we have designed are not purely academic – one reason is that there are many different ways of doing it more academically, another reason is that the best approach is dependent on your use case and thus needs a mix of academic and empirical design patterns. And with Clojure, we can actually interactively change, adapt, and compare both the functions’ design and their implementation.

Until next time.
