A few days ago, I came across a drag’n’drop, wire-it-together visualisation and data analysis tool called Orange.
Here’s a quick run through of some of the basics (at least, a run through of the first few things I tried to do with the tool…)
First off, we need some data. Orange likes TSV (tab separated values) rather than CSV, so I grabbed some TSV from one of the Guardian Datastore spreadsheets on Google Docs (use “Save as Text” to get the tab separated value format…)
Orange is a canvas-based visual programming environment, in which functional blocks are added to the canvas and certain parameters are set within each block. Here’s how we get some data into Orange from a TSV file:
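If your data only comes as CSV, converting it to TSV is easy enough to do outside Orange. Here's a minimal sketch using nothing but Python's standard `csv` module; the `csv_to_tsv` helper and the sample rows are made up for illustration and have nothing to do with Orange itself or the Guardian data:

```python
import csv
import io


def csv_to_tsv(csv_text):
    """Convert CSV text to tab separated text (hypothetical helper, stdlib only)."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    # Same rows, but written with a tab as the field delimiter
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Made-up sample data, purely for illustration
sample_csv = 'name,comments\n"Smith, J",12\nJones,3\n'
print(csv_to_tsv(sample_csv))
```

Going through the `csv` module rather than just swapping commas for tabs means quoted fields that contain commas survive the conversion intact.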
The File icon is giving me a warning (no dependent variable) but I’m not sure why…? I’m sure Orange has managed to detect labels and quantities correctly from other files I’ve tried?
Anyway… we can inspect the data by looking at it in a data table widget – just wire one in:
The table is sortable by column, and the Report button can be used to save a version of the table. Looking at the data table, we see it has identified columns with missing entries. We can clean these from our data set using the Preprocessing widget:
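To make it clear what that sort of cleaning amounts to, here's a rough plain Python sketch that drops any row with a missing entry from a TSV file. This is not how Orange's widget is implemented; the `drop_incomplete_rows` helper, the set of "missing" markers and the sample data are all assumptions of mine:

```python
import csv
import io

# Cell values treated as "missing"; an assumption for this sketch
MISSING = {"", "?", "NA"}


def drop_incomplete_rows(tsv_text):
    """Return (header, rows), keeping only rows with no missing cells."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return [], []
    header, body = rows[0], rows[1:]
    complete = [
        row for row in body
        # A short or over-long row counts as incomplete too
        if len(row) == len(header)
        and all(cell.strip() not in MISSING for cell in row)
    ]
    return header, complete


# Made-up sample data: rows B and C each have a missing entry
sample_tsv = "country\tpopulation\tgdp\nA\t10\t5\nB\t\t7\nC\t30\t?\nD\t40\t9\n"
header, rows = drop_incomplete_rows(sample_tsv)
print(header)
print(rows)
```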
If we now wire the output of the Preprocessing widget into the Scatterplot widget, we can generate a variety of scatterplots:
If you want to save a copy of the chart, it’s easy enough to do so. (I can’t get colour palettes to work on my Mac, so I’m stuck with greyscale displays. Also, the blob sizing doesn’t seem very responsive…)
The Report tool allows us to create a report from various bits of the dataflow, including adding information from several widgets to either separate report pages or the same report page.
Saving a Report saves all the report pages to a navigable set of HTML pages that resemble the Orange Report viewer.
Here are a couple of other things we can do with the data, this time using a data set that isn’t throwing the “dependent variable missing” warning: the distribution of comments in a small Friendfeed network…
So for example, here’s how the number of comments made by members of the network is distributed:
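For comparison, the tallying behind a distribution chart like this can be sketched in a few lines of plain Python. The comment counts below are invented for illustration (they are not the Friendfeed data), and the `distribution` helper is my own name, not anything from Orange:

```python
from collections import Counter

# Hypothetical comment counts, one entry per network member
comments_per_member = [0, 0, 1, 1, 1, 2, 3, 3, 5, 8, 13]


def distribution(values):
    """Return (value, number of members) pairs, sorted by value."""
    return sorted(Counter(values).items())


# Crude text histogram: one '#' per member with that many comments
for value, members in distribution(comments_per_member):
    print(f"{value:>3} {'#' * members}")
```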
Alternatively, we may look at the distribution in a more “statistical” way:
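Assuming the “statistical” view is something along the lines of a box plot (that's my reading, at least), the numbers behind such a view are a five-number summary. Here's a hedged sketch using Python's standard `statistics` module (Python 3.8 or later for `quantiles`); the data and the `five_number_summary` helper are invented for illustration:

```python
import statistics

# Hypothetical comment counts, one entry per network member
comments_per_member = [0, 0, 1, 1, 1, 2, 3, 3, 5, 8, 13]


def five_number_summary(values):
    """Return min, quartiles and max, the ingredients of a box plot."""
    # quantiles(n=4) returns the three cut points: Q1, median, Q3
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


for name, value in five_number_summary(comments_per_member).items():
    print(f"{name:>6}: {value}")
```

Note that different tools use slightly different quartile conventions, so the quartiles Orange reports for the same data may not match these exactly.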
(Remember, we can generate these reports interactively, and then add them to a growing report.)
The survey plot gives us a macroscopic bird’s-eye view over the whole of the data set:
Okay, that’s enough for starters – hopefully you get the idea: wire stuff together and generate visual reports… So why not go and download Orange now?!;-)
There’s a whole range of clustering tools, too, which look like they could be interesting…
And I think the platform is extensible, which means there’s a way of adding your own widgets (written in Python, maybe..?)