Interview with Abraham Thomas, co-founder and head of data at Quandl

What is Quandl and why is it so useful for data journalists?

Quandl, at its core, is a search engine for numerical time series data. The data we have is heavily influenced by what our users want, so we tend to have datasets on important or trending topics. For example, we just created a number of datasets encompassing all the inequality data included in Thomas Piketty’s new book “Capital in the Twenty-First Century”. We also have a huge number of datasets on standard reference topics: economics, financial markets, society, demography and so on. All these datasets are easily accessible in applications for analysis or for export to graphs. Best of all, it’s all free.

Is it easy to use?

Quandl’s mantra is to make data easy to find and easy to use. We try to do this in a number of ways.
The first step is helping users find the data they need. Having millions of datasets is no good unless you can find what you’re searching for. Most current search engines don’t do a very good job at pure numerical data searches, so we built our own custom search algorithm that is optimized for numerical data. You can filter by data source, filter by data frequency, perform advanced searches using Boolean queries and so on. Of course, there’s still a long way to go, and we’re constantly improving our backend algorithm to give you the data you’re looking for.
Another way we help users find the data they need is by letting them browse our data collections. Collections are hand-selected, curated groups of high-quality datasets on specific topics. So instead of searching for specific datasets, users can explore in a more free-form manner.
The next step is actually working with the data you’ve found. We offer various options for downloading and graphing the data through the website. Perhaps our real strength, though, is our API; lots of users have written their own apps and programs that use Quandl data delivered via this API. We’ve also written (with generous contributions from our users) a number of libraries that help you get Quandl data directly into the analysis tool of your choice (R, Python, Excel, Matlab, you name it) without visiting the website or calling the API yourself.

Does the site provide data in a form that is easy to manipulate?

The important thing about making data easy to manipulate is understanding that different users have different needs, and we need to be able to accommodate that. That’s why we offer all our data in multiple formats (JSON, CSV, XML), irrespective of the format the data was originally published in. That’s also why we’ve built our API and all the tools and libraries that interface with it. We want to make the process of getting our data into whatever tool you choose as frictionless as possible.
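
As a rough illustration of why the choice of format need not matter downstream, the following Python sketch parses the same tiny series from a CSV payload and from a JSON payload into one common shape. It is only a sketch under stated assumptions: the sample values are made up, the `Date`/`Value` columns and the `column_names`/`data` JSON layout are hypothetical, and nothing here calls Quandl’s actual API.

```python
import csv
import io
import json

# Made-up sample values for illustration only; not real Quandl data.
# The column names and the JSON layout are assumptions, not a documented schema.
CSV_TEXT = "Date,Value\n2012-12-31,1.5\n2013-12-31,2.5\n"
JSON_TEXT = (
    '{"column_names": ["Date", "Value"], '
    '"data": [["2012-12-31", 1.5], ["2013-12-31", 2.5]]}'
)


def series_from_csv(text):
    """Parse a two-column CSV payload into a list of (date, value) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["Date"], float(row["Value"])) for row in reader]


def series_from_json(text):
    """Parse the hypothetical JSON layout above into the same list shape."""
    payload = json.loads(text)
    date_idx = payload["column_names"].index("Date")
    value_idx = payload["column_names"].index("Value")
    return [(row[date_idx], float(row[value_idx])) for row in payload["data"]]


csv_series = series_from_csv(CSV_TEXT)
json_series = series_from_json(JSON_TEXT)
print(csv_series == json_series)  # prints True
```

Once both formats reduce to the same list of (date, value) pairs, the rest of an analysis can ignore which format the data arrived in.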

How did you first come up with the idea? Did you spot a gap in the industry that needed to be filled?

The idea came from our founder Tammer Kamel. Tammer was having a difficult time finding the data he required for his personal consulting business without paying thousands of dollars to firms like Bloomberg or Reuters. And it turns out that there are many people in similar situations. As it currently stands (without Quandl), if you are not working for a large company with a large data budget, it is surprisingly difficult to get even simple public statistics, like the GDP of China over time, into your workflow.

Last year you were described by journalism.co.uk as being the “YouTube of data” – do you think this is a fitting description?

It very much describes our aspiration. We would like to get to a point where some users are contributing massive amounts of data that other users are consuming. We’re currently building the tools to enable this in a frictionless, functional manner. (See the answer on user uploads below.)

How do you source the data you host, and how do you ensure that it is always up to date?

We source data from all over the internet, and sometimes from physical media as well. We have multiple scheduling and freshness checks in place to make sure everything is updating properly.

Last year you mentioned that you are hoping to allow users to upload their own data – what are the latest developments here? What is the thinking behind this? And does this not make it difficult to ensure that all data is accurate?

Right now we are still testing this project internally. We’ve also slowly started inviting a few alpha-testers to try it. We feel we have created a fairly frictionless experience for getting data from Quandl, and we want to provide that same frictionless experience for putting data on Quandl as well.
There are two reasons for moving in this direction. First, as a team there is only so much data we can add ourselves. Second, we cannot pretend to be experts at everything. Here at Quandl we have a very talented group of people with varying skills and domains of expertise. However, the wealth of data out there, and knowledge of it, is so vast we could never dream of understanding it all. Luckily, our users as a whole do have this knowledge. Right now, every dataset that is being added to Quandl has been specifically asked for by a user, and it has been this way for months. We are very confident that, with the right tools, our users will be able to create high-quality, usable datasets. These datasets will be associated with their creator, and other users can choose to trust or distrust these creators, just as they have with Quandl as a whole.

Is there anything else similar in the field at the moment?

Yes, and each has its strengths depending on what a journalist might need. Zanran.com has crawled a huge number of PDF documents on the internet for tabular data; they have some really esoteric stuff. Datamarket.com has great visualization tools. Datahub.io also looks interesting to us as an open-data platform. Exversion.com offers access control and version control for datasets, which are both interesting features. WolframAlpha.com doesn’t offer much raw data, but their natural-language query system is very impressive. So there’s lots of activity in this field right now.
