[Reading time: 5mn]
In our previous post, we briefly explained how we used Clojure to detect outliers in data with descriptive statistics. Since then, we have enriched our prototype library with further detection methods: MAD (Median Absolute Deviation) and IQR (Interquartile Range). The source code is available on GitHub if you want to play around with it.
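For readers unfamiliar with the two methods, here is a minimal sketch of both in plain Clojure. It is not the library's code: the function names (`mad-outliers`, `iqr-outliers`, `quantile`), the `:value` key, and the choice to compute over the whole series at once are assumptions made for illustration, and the MAD is left unscaled. Only the `:idx` key mirrors what the library's result maps are described as containing later in this post.

```clojure
(defn median
  "Median of a non-empty collection of numbers."
  [xs]
  (let [sorted (vec (sort xs))
        n      (count sorted)
        mid    (quot n 2)]
    (if (odd? n)
      (nth sorted mid)
      (/ (+ (nth sorted (dec mid)) (nth sorted mid)) 2))))

(defn mad
  "Median absolute deviation: the median of |x - median(xs)|."
  [xs]
  (let [m (median xs)]
    (median (map #(Math/abs (double (- % m))) xs))))

(defn mad-outliers
  "Hypothetical helper: flags points farther than threshold * MAD from the median."
  [xs threshold]
  (let [m (median xs)
        d (mad xs)]
    (keep-indexed
     (fn [i x]
       (when (> (Math/abs (double (- x m))) (* threshold d))
         {:idx i :value x}))
     xs)))

(defn quantile
  "Quantile q (between 0 and 1) using linear interpolation between sorted values.
  Other quartile conventions exist; this is one common choice."
  [xs q]
  (let [sorted (vec (sort xs))
        pos    (double (* q (dec (count sorted))))
        lo     (int (Math/floor pos))
        hi     (int (Math/ceil pos))
        frac   (- pos lo)]
    (+ (nth sorted lo) (* frac (- (nth sorted hi) (nth sorted lo))))))

(defn iqr-outliers
  "Hypothetical helper: flags points outside [Q1 - k*IQR, Q3 + k*IQR]."
  [xs k]
  (let [q1    (quantile xs 0.25)
        q3    (quantile xs 0.75)
        iqr   (- q3 q1)
        lower (- q1 (* k iqr))
        upper (+ q3 (* k iqr))]
    (keep-indexed
     (fn [i x]
       (when (or (< x lower) (> x upper))
         {:idx i :value x}))
     xs)))
```

Both functions return a lazy sequence of maps, one per flagged point, so they can be inspected at the REPL in the same way as the results discussed below.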
Now, how good are these outlier detection methods? Since the functions return a collection of offending points with calculation details, it is rather difficult to judge from the raw output whether the results are pertinent. For this, you want to see the time series on a chart with the outliers highlighted (well, let’s say that we want to see this).
Enter Incanter. Incanter is an R-like Clojure library which you can use interactively to play with data. You can use it standalone or include it in your project. We did the latter with our outlier experiments. Here’s the relevant extract of our project.clj file:
:dependencies [[org.clojure/clojure "1.2.1"]
               [org.clojure/clojure-contrib "1.2.0"]
               [net.sourceforge.javacsv/javacsv "2.0"]
               [incanter "1.2.3-SNAPSHOT"]
               [clj-time "0.3.0"]]
Incanter has many features, such as persistent datasets and charts; we’ll use the charts to see our outliers.
Let’s get back to the example we used in our last post, when we tried to detect outliers in the S&P500 daily close price history for 2010. This time, we’ll use Incanter’s charting functions to draw the time series and the outliers.
First, let’s load all necessary libraries in a dedicated namespace:
user=> (ns testing
         (:use [outlier utils core])
         (:use [incanter core charts io pdf])
         (:use [incanter.stats :exclude (mean median)]))
We are excluding the Incanter mean and median functions because we have implemented our own. There’s little doubt Incanter’s versions are better though, and we’ll leave it to the astute reader to switch to them instead.
Now, let’s load the data like we did previously and detect outliers.
testing=>(def csv (load-epochdatevalue-from-csv "/Users/artekcs/Development/data/sp500/SP500-2010.csv" :Date :Close))
testing=>(def sorted-csv (sort-by :Date csv))
testing=>(def outs (outliers-median (into [] (map #(:Close %) sorted-csv)) 5 1.5))
Actually, you can display a table of the outliers using Incanter. First, change the sequence into an Incanter dataset and then call view on it:
testing=>(view (to-dataset outs))
You’ll see something like the following:
Now, let’s create a chart of the original S&P500 time series, and display it.
testing=>(def chart (time-series-plot :Date :Close :data (to-dataset sorted-csv)))
testing=>(view chart)
Okay, now on to highlighting the outliers. To do that, we’ll use Incanter’s ability to draw pointers on a chart: an arrow with a label. The add-pointer function takes the point’s coordinates (x and y) and a text label. Our outlier functions return a collection of maps (one for each outlier found). Each map contains the index of the outlier in the original data set. Using this index, we can easily retrieve the x and y coordinates of the outlier, which we then pass to the add-pointer function.
Lisp is an expressive language, and the Clojure dialect demonstrates it. The lengthy explanation above is translated into the following one-liner:
testing=>(map #(doto chart (add-pointer (:Date (nth sorted-csv (:idx %))) (:Close (nth sorted-csv (:idx %))) :text "Outlier")) outs)
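One caveat about the one-liner: map is lazy, so the pointers are only added because the REPL realizes the sequence when it prints the result. Outside the REPL, an eager form such as doseq is the safer choice for side effects. The sketch below shows one way to separate the coordinate lookup from the drawing; `outlier-points` is a hypothetical helper of ours, not part of the library or of Incanter, and it assumes the rows carry `:Date` and `:Close` keys and the outlier maps carry `:idx`, as in this post.

```clojure
(defn outlier-points
  "Hypothetical helper: returns an [x y] pair for each outlier,
  looked up by its :idx in the original rows."
  [rows outs]
  (let [rows (vec rows)]            ; vector gives fast indexed access
    (for [{:keys [idx]} outs
          :let [row (nth rows idx)]]
      [(:Date row) (:Close row)])))

;; Usage at the REPL, assuming `chart`, `sorted-csv` and `outs`
;; are defined as in the steps above:
(comment
  (doseq [[x y] (outlier-points sorted-csv outs)]
    (add-pointer chart x y :text "Outlier")))
```

Keeping the lookup pure also makes it easy to check the coordinates at the REPL before touching the chart.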
Then, you can display the chart again with (view chart).
And here’s what you’ll have.
Note that you can also save the chart. This is pretty useful as a basic means of comparing several different outlier detection algorithms. If you want to save it as a PDF in the project’s directory, here’s what you’ll do:
testing=>(save-pdf chart "./outliers-msd.pdf")
We’ve just scratched the surface here. Clojure and Incanter form a terrific combination for any data analysis work, and thanks to them, our outlier detection experiments have been so much easier, more fun, and more productive.
Until next time.
- 8-Sept-15: Upgraded source code on Github to Leiningen 2.5.2, Clojure 1.7.0 and Incanter 1.5.6