Alberto Cairo: To succeed in data journalism, steal from the best

Alberto Cairo thinks you can solve almost all your problems with data journalism in two steps.

“Journalists,” he tells me, “in my opinion, tend to oversimplify matters quite a lot.

“Let’s say that you are exploring the average height of people in an area. If you only report the average, that may be wrong or right depending on how spread out the data is. If all people are more or less around that height then reporting that average is correct.

“But if you have a huge range of heights, the average is still the same but you may not be reporting how wide the spread is. Or if your distribution is bimodal, the average will still be the same, but you have a cluster of short people on one end and a cluster of tall people on the other end, that’s a feature of the data that will go unnoticed if you only report the averages.”
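Cairo's point is easy to check for yourself. The sketch below uses three made-up sets of heights (hypothetical numbers, not real data) that share exactly the same average but differ in spread and shape. It relies only on Python's standard statistics module.

```python
from statistics import mean, pstdev

# Hypothetical heights in centimetres; all three groups have a mean of 170.
tight = [168, 169, 170, 171, 172]          # everyone close to the average
wide = [150, 160, 170, 180, 190]           # same average, much wider spread
bimodal = [150, 151, 152, 188, 189, 190]   # a short cluster and a tall cluster


def summarise(heights):
    """Return the mean, the population standard deviation and the range."""
    return {
        "mean": mean(heights),
        "stdev": round(pstdev(heights), 1),
        "range": max(heights) - min(heights),
    }


for name, heights in [("tight", tight), ("wide", wide), ("bimodal", bimodal)]:
    print(name, summarise(heights))
```

Reporting only the mean would make the three groups look identical. The standard deviation and range separate the first group from the other two, but only a chart of the distribution (a histogram, say) would reveal the two clusters in the third.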

His other complaints about journalists attempting to write about data will be familiar to despairing members of data desks in newsrooms around the world. Speculative extrapolation, inferring causation from correlation, a lack of understanding of probability and uncertainty, and many other journalistic foibles all fall under Cairo’s fierce scrutiny.

“The thing is,” he insists, “all the things I’m mentioning are easily solved, so it’s not that you need to take a course on advanced correlation and regression analysis, it’s just a matter of learning Stats 101 and then read two, three books on quantitative thinking and quantitative reasoning.

“That’s how you avoid 80 per cent or 90 per cent of the problems, and the other 10, 20 per cent will be avoided if you consult with experts every time you do a story based on data, which is something we need to systematically do.”

Even most students of data journalism probably can’t say that they’ve read two or three books on quantitative thinking and quantitative reasoning, but if we’re serious about our pursuit of the truth, perhaps next time a disgraced politician publishes their memoirs we should Google ‘best books on statistics’ instead.

A decent place to start

Though Cairo is of course prolific in data visualisation in his own right, he is clearly a teacher at heart. After two decades working in infographics and data visualisation, he could be forgiven for losing an ounce of enthusiasm, but the Knight Chair in Visual Journalism and author of two books on data visualisation still has a twinkle in his eye as we discuss his recent work.

He’s just finished teaching an online course on using data visualisation as more than just a method of communication. Instead, he is focusing on using it “to find stories in data”.

“There’s a whole branch of statistics,” he explains, “which was defined around the 60s, 70s and the 80s by a statistician called John Tukey.

“He wrote a book titled Exploratory Data Analysis. The whole field of data visualisation in computer science and statistics focuses mostly not on communication, it focuses on exploration, how to explore data, how to discover features of the data that may remain unnoticed if you don’t visualize them.”

Alberto Cairo is a member of the jury for this year’s Data Journalism Awards. Unfortunately he won’t give me a direct road map to victory, but for students hoping to enter this year (the first year in which a student category has been included), his advice is surely invaluable.

“Steal from the best.

“This is advice I give my students every semester: we all learn by copying other people. By copying I don’t mean plagiarising, but getting inspiration. Look at work from ProPublica or the Washington Post or The New York Times, and copy their style, copy their structure, copy the way they present information.

“Don’t try to think of graphics as if they were embellishments to your story, but as analytical tools and communication tools within your story. They should never be afterthoughts when you’re developing your story. They’re an integral part of your story and an integral part of its communicative power.”

Even with his experience, Cairo says, he still does this himself. Though many journalists seem addicted to credit, and are unlikely ever to admit to anything short of completely original works of genius, in data journalism, collaboration is endemic.

“Nobody works inside of a cocoon,” notes Cairo. “The community of data reporters and investigative reporters is very open. I just came back from the NICAR conference, and some of my students who attended were amazed that they could approach – I don’t know – Scott Klein from ProPublica and ask him questions directly. They believe there’s some sort of hierarchy, but there’s not.”

This lack of hierarchy should lend confidence to aspiring data journalists. To slightly amend Alberto Cairo’s steps to success: all one need do is get a decent grounding in statistics, consult with experts, and join the data journalism ecosystem.

How to make the perfect podcast

Starting a podcast can be a daunting prospect, but if ‘Data Day’ can teach you anything, it’s that the barrier to entry is incredibly low.

On this final episode of ‘Data Day’, Bridie Pearson-Jones joins Luke Barratt as they discuss what makes some of their favourite podcasts great, why podcasting is such a compelling format for modern journalists, and the difference between podcasts and radio programmes.

Plus, special guest and longtime fan of the show Faye White joins the team to discuss some funny podcasts, because apparently our podcasting experts are really boring.

This is the end of season 1 of Data Day, and Luke’s last episode. It remains to be seen whether future Interhacktives take up the mantle.

Email newsletters for journalists: a guide

Despite numerous technological innovations — from 360-degree video to social media live-streaming to robot journalists — the trusty old email seems to be increasing in importance in the newsroom.

Quartz, for instance, has several journalists across the world working on its daily newsletter, which goes out at about 6am local time in each of three regions (Asia, Europe/Africa and the Americas) every day.

This shouldn’t come as a surprise. The email inbox is, for many people, the first app or webpage they open in the morning and one that they return to multiple times over the day. Amid the noise and barrage of social media posts, the email newsletter may be the easiest and quickest way to reach a reader directly.

In the podcast, we compare our favourite email briefs, such as the Times Red Box and Politico Morning Media tipsheet, and explain why we like them. We also discuss the ethics of email newslettering.

If getting access to a reader’s email inbox is like gaining a private audience with him or her, how much should publications use it for their own marketing or political purposes? If publications, political parties and companies alike use people’s information for political or marketing purposes, does this count as an abuse of big data and readers’ trust?

Listen to the podcast to find out. Don’t forget to sign up to Interhacktives’ newsletter here, too.

Using data in sport journalism

In this week’s Data Day, Luke Barratt is joined by Matteo Moschella to discuss the use of data in sport journalism.

Data is omnipresent in the reporting of sport, particularly recently. The closing of the Barclays Premier League January transfer window has prompted a glut of visualisations on the month’s top stories.

Check out some of the code used by the Guardian on its GitHub.

Athletes and sports teams are using more and more data nowadays to optimise their performance. But crucially for journalists, the vast audiences drawn by sports demand extensive data.

Opta provides detailed data feeds on a number of different sports.

While providing this data to users in raw format is common, there is also great scope for journalists to use data to analyse issues in sport.

Here, Rob Minto uses data to defend a potential increase of the number of teams taking part in the FIFA World Cup.

One crucial area where this kind of journalism has flourished is in predictions. Nate Silver, now renowned as a polling expert, made his name using data to predict the results of baseball games.

Visit his site, FiveThirtyEight, which still applies its methods to sport.

Similarly, the Financial Times has built a complicated statistical model to predict the outcome of the 2016/17 Premier League.

Daniel Finkelstein has a weekly column in The Times using similar methods to analyse football. Here, he uses sport to teach his readers a lesson about probability through a parable about the likelihood of giant-killing in the FA Cup.

We’ve also seen data used for in-depth investigations into sporting issues. Buzzfeed used data from betting markets to uncover indications that certain players had been guilty of match-fixing.

The Sunday Times, meanwhile, in a more traditional piece of data journalism, made use of data from a whistleblower to find evidence of doping throughout the world of athletics.

Have your say on government open data

On this week’s ‘Data Day’, Ayushman Basu and Luke Barratt discuss the opening of a survey for journalists by the Government Statistical Service. The Government is looking for feedback on how to improve its provision of open data.

You can respond to the survey here.

The main focus of the survey is on the possibility of creating a single outlet for releasing data from the government, and on this podcast, we discuss some of the inconvenience of the current system. Datasets have to be sourced from various different portals and subsequently combined, which creates significant delays for journalists.

The survey is not especially focused on data quality, but we discuss the importance of this issue, which is made more serious by the worrying fact that the government has no centralised policy on data quality.

Finally, since Ayushman Basu has specific experience in this area, we discuss how some of these issues present themselves in India. The government there has a central data portal, but the quality of releases is very poor, with PDFs often used instead of Excel spreadsheets. India’s large population also makes data collection very difficult.

India’s government data portal can be viewed here.

Have a very data Christmas

This is Interhacktives’ latest attempt to persuade you that data journalism can be relatable and human, and this time we’ve teamed up with a powerful ally: Christmas.

Christmas is a time of year for turkey, mince pies, stuffing, stockings, trees, treats, presents, and… data? On this week’s Data Day, James Somper and Luke Barratt look through the news to find some of their favourite examples of data-driven Christmas journalism.

Luke made his mince pie joke again, but this time you don’t have to wait until the end to hear it.

Follow James on Twitter @jsoomper, Luke @lukewbarratt, and Interhacktives @Interhacktives.

You can read Kate Hughes’ article on the true cost of Christmas here. The Money Editor of the Independent counts up our rising festive spending, and comes up with some eye-popping numbers.

The financial services company PNC has done what it’s been doing every year for the past 30 years, and calculated how much the presents in the ‘Twelve Days of Christmas’ would actually cost you. Partridges are cheaper this year, but what about pear trees?

Finally, Anjana Ahuja has a rather more serious story for the Financial Times about the mounting evidence that the vast quantities of alcohol consumed every Christmas are having a very serious effect on our physical health.

Thanks to Podington Bear for our theme song, ‘Am-Trans’.

How to win a Data Journalism Award

Correction: In the podcast, we refer to the organisation running the Data Journalism Awards (GEN) as the General Editors’ Network. It is in fact the Global Editors’ Network.

Entries are now open for the Data Journalism Awards 2017, as of 28 November. Interhacktives are the media partners of this year’s awards, and on this episode of Data Day, Luke Barratt and Ryan Watts give them an introduction.

Past winners have included the Panama Papers, but this year for the first time, there is a category for students and young data journalists! With that in mind, we discuss some of the things that impressed us about last year’s winners, and what strategies might help you to win one this time around.

Simon Rogers, Data Editor at Google News Lab, is the director of the DJA, and the president is Paul Steiger, Executive Chairman of ProPublica’s board of directors.

This is the second year that Interhacktives have been involved with the Data Journalism Awards as media partners, and this podcast is the first in a series of content we will be running as we approach the deadline. Interhacktives will be your guide to the different categories, and a vital source of information on creating a winning entry. We will also be renewing last year’s series focusing on past winners.

The deadline for submission to the Data Journalism Awards 2017 is 7 April 2017. Winners will be announced on 22 June at the DJA 2017 Ceremony & Gala Dinner in Vienna.

More details can be found on the Data Journalism Awards website.

Süddeutsche Zeitung’s award-winning Panama Papers investigation.

Al-Jazeera America’s successful entry into the Breaking News category, using data to chart the process of an Amtrak train’s derailment.

Enter the Data Journalism Awards

Data day: The rise of fake news on Facebook

Did Pope Francis endorse Donald Trump? Did Hillary Clinton sell weapons to Isis? If you don’t know the answers to these questions, you may have been the victim of fake news. In the first episode of a new podcast from Interhacktives – Data Day – Ella Wilks-Harper and Luke Barratt discuss the rise of fake news, question whether the crisis has been overstated, and examine some possible solutions to the problem.

Fake news on Facebook has been the subject of a frenzied debate recently, especially around a US election that has seen a country divided bitterly. As Americans – and Brits – retreat into online echo chambers of their own making, filling their Facebook feeds with people who agree with them, is it any wonder that ideology might start to trump fact? Some consider fake news the logical conclusion of the filter bubble. Will it be a wake-up call for Facebook to recognise editorial responsibility and abandon the utopian dream of its impersonal, all-ruling algorithm?

Mark Zuckerberg’s initial response to the fake news scandal:

Buzzfeed’s story about Macedonian teenagers using fake news to garner ad revenue:

A letter from the editor of Aftenposten attacking Zuckerberg over the censoring of a picture from the Vietnam War:

Buzzfeed’s analysis of engagement with fake news on Facebook in the last few months before the US election:

6 tips from Google on making compelling visualisations

Journalists aren’t used to taking advice from tech companies. Indeed, the row between Facebook and the journalism industry has intensified over the last two months. A Vox article published earlier this month attacked Mark Zuckerberg, who has been called “the world’s most powerful editor”, for abandoning his editorial responsibilities.

Recently, Facebook fired the human editors on its ‘Trending’ team, prompting a flurry of clearly fake news stories from its automated algorithm.

Meanwhile, Google seems to have been moving in the opposite direction. Its News Lab has now been around for over a year, and has the explicit intention of collaborating with and empowering journalists. The Lab, which is run by former Boston Globe reporter Steve Grove, frequently works alongside journalists.

News Lab hosts a monthly Data Visualisation Round-Up in the form of a live YouTube discussion between Simon Rogers, the Google Trends Data Editor and former Guardian journalist, and Alberto Cairo, the Knight Chair in Visual Journalism at the School of Communication of the University of Miami.

From their 31 October discussion, here are some key points:

1. Graphics need a human side

Data journalism sometimes gets a reputation for being cold and calculating, as a place where statistics matter more than humanity. But data journalists are more than just automated counting machines: they often bring their emotions and convictions to bear on their work, and it is vital for data journalism to reflect that.

In the video, Simon Rogers recommends the September 2016 book Dear Data, by Giorgia Lupi and Stefanie Posavec. These two information designers, physically separated by the Atlantic, spent a year befriending each other by sending weekly hand-drawn data visualizations on postcards back and forth.

The cards contain many examples of innovative ways of displaying data, but the project was about more than that. Rogers calls it “a reminder that graphics should feel human and warm”.

2. Imprecision is fine

The era of Big Data™ has encouraged the growth of imprecise data analysis. In days gone by, sampling was the only game in town, and it was necessary for data to be incredibly precise, since datasets were relatively small. Now that data analysis and data journalism are starting to use big data, the sheer size of today’s datasets eliminates any problems that might arise from occasionally imprecise points.

This image, for example, illustrates the relative interest around the world in Hillary Clinton and Brexit.

Google News Lab teamed up with Accurat, a data research firm, to create World Potus, a project that uses Google Trends to look at how people in countries around the world were discussing the US election, by analysing their Google Searches.

Naturally, when using data from every single Google Search, some data points will be unhelpful. Someone might misspell ‘Clinton’ in an unpredictable way, or search while on holiday, making their geographical data misleading.

But since Google Trends uses big data, this doesn’t matter. There are so many points in this dataset that imprecision pales into irrelevance.

3. Data journalism should be collaborative

While more traditional journalists jealously guard their scoops, and are full of stories about the ruthless methods they’ve had to employ to get to the scene of a story first, data journalists can often be seen asking for (and receiving) help on Twitter from their colleagues. What’s more, articles often come complete with a link to the original data, so that other data journalists can dig for their own stories.

This is why all the code used in projects like World Potus is available on the Google Trends Github page.

4. We have to think more about our audience

Data visualisation is no longer the insurgent force it once was in the journalism industry. These days, infographics are pretty much par for the course, so much so that Giorgia Lupi has described our current period as “post-peak infographic”.

Sure enough, the New York Times has announced that it will now be producing fewer huge visuals. Does this mean that we’ve got over our initial enthusiasm for data visualization?

Rogers has a more nuanced view: “People are fussier about what they’ll love.” In other words, because of the recent glut of infographics, it is more important than ever to ensure that the visualization serves the story and serves the audience.

5. Print can be more powerful than online

It is often assumed that data visualization is native to the Internet. While it is true that the online medium brings with it huge potential for interactive features, print can still play a vital role in visualization.

Alberto Cairo explains that he still buys print newspapers, and enthuses about the New York Times’ double page spread listing people who have been insulted by Donald Trump. The online version is impressive, and gives the reader the ability to click through to specific insults, but the size and physical presence of a double page spread in the New York Times really brings home the extent of Trump’s vituperative qualities.

Cairo also cites the National Geographic magazine as a perfect example, specifically highlighting sketches by the artist Fernando Baptista, made for a large pictorial illustrated infographic about the Sagrada Familia cathedral in Barcelona.

“It’s gonna be like people listening to music on vinyl.” This remark from Simon Rogers perhaps betrays nostalgia stemming from his journalistic background, but probably chimes with the views of many modern journalists.

6. Data journalists must think about posterity

Excitingly, Rogers and Cairo seem to be planning some kind of grand archive for data journalism. One pitfall for visualization is the expiration of online programmes. For this reason, when Google starts a new initiative, it always has a plan for making sure that projects made using that programme will survive even if Google discontinues it.

As with much online journalism, data visualizations can be ephemeral, fading away after their first publication. Data journalists need to think about preserving their work, much of which will remain relevant for long periods of time.

Watch the full discussion here:

#Flashhacks: an evening on company data and corporate networks

It was over donuts and sushi that Interhacktives found out more about corporate networks and how to access company data.

Last Wednesday, a pack of 13 of us attended the meetup “Flash Hacks: Map the Banks”, an initiative by the London-based organisation Open Corporates that aims to create a more accurate picture of the financial sector. The initiative has a point: the corporate world has a huge impact on the wider world. With the financial crisis costing society over 10 trillion dollars, they argue that businesses should be held to account in the same way as public bodies and their data should be available to view freely.

The participants were divided into two groups: those who knew how to code went to write the necessary scrapers to help in the task, and those who didn’t were taken on a tour of Open Corporates’ tools and database.

Here are some of the tools, which can be useful to journalists for sourcing and verifying company data:

Open Corporates’ database

The organisation has information available for more than 84 million companies from more than 100 jurisdictions. It is possible to search for companies and directors, and to filter by jurisdiction.

Corporate network and Octopus

Since 2012, Open Corporates has been working on making company networks public: what the organisation calls the “Holy Grail of business information”. The tool is great for understanding the complexity of multinationals and what ramifications they have. They create visualisations from data, for example from the Federal Reserve about banking companies in the US. In this one below, it is possible to find out that Goldman Sachs consists of more than 4,000 separate corporate entities all over the world. Open Corporates also has a tool called Octopus that allows anyone interested to contribute to creating new networks.

Source: Open Corporates

Who Controls It

A recently launched proof-of-concept tool: an open source beneficial ownership register that would make it possible for anyone to check who or what controls a company. Who Controls It is still a prototype, but it sounds like a promising tool for checking possible ramifications of a business’ activities and for investigating fraud and money-laundering.

Open Corporates APIs

For those who know a bit of programming, these tools might be useful. The organisation has two APIs: the REST API and the OpenRefine Reconciliation API. They allow access to the organisation’s full database on a more granular level and make it possible to match company names to external data. The REST API is for retrieving information from the Open Corporates database. The OpenRefine Reconciliation API allows users of OpenRefine (formerly Google Refine) to match company names to legal corporate entities. It is especially useful when you have an existing spreadsheet with many different companies and you need to reconcile yours with other datasets.
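To give a feel for what matching company names involves, here is a minimal local sketch. It is not the Open Corporates API and does not reproduce how their reconciliation service works: it is a stand-in that uses Python's standard difflib module, with an illustrative (and incomplete) list of legal suffixes and an arbitrary similarity threshold. The function names and the sample register are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Legal suffixes to ignore when comparing names (illustrative, not exhaustive).
SUFFIXES = {"ltd", "limited", "plc", "inc", "llc", "corp", "corporation"}


def normalise(name):
    """Lower-case a company name, strip punctuation and drop legal suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in SUFFIXES)


def best_match(name, candidates, threshold=0.85):
    """Return the candidate most similar to name, or None if none is close enough."""
    target = normalise(name)
    scored = [
        (SequenceMatcher(None, target, normalise(c)).ratio(), c)
        for c in candidates
    ]
    score, candidate = max(scored)
    return candidate if score >= threshold else None


# A hypothetical register of official names to match a messy spreadsheet against.
register = ["Goldman Sachs Group, Inc.", "Barclays PLC", "HSBC Holdings plc"]

print(best_match("barclays plc.", register))
print(best_match("Goldman Sachs Group", register))
print(best_match("Acme Widgets Ltd", register))
```

A real reconciliation service also has to deal with jurisdictions, former names and company numbers, which is exactly why matching against a proper register beats string comparison alone.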


The Map the Banks initiative is ongoing: if you code and want to help, the organisation has a list of missions to be completed. At the moment, 10% of the missions are done.

What’s on Reddit’s front page?

Reddit is an online super-community with hundreds of millions of users, and has become in recent years an arbiter of what’s cool and what’s not on the web. If something makes it to the front page of reddit, where it is most visible, it will inevitably receive millions of views.

Reddit stats

The way the site works is users post content – pictures, article links, conversation starters etc – and the success of that content is determined by whether the reddit community likes it (upvotes) or talks about it (comments) or just clicks on it.

Submissions are made to the relevant subreddit – a subject specific community – and should they prove popular, can rise to the front page. This is the reddit mainstream. And I scraped it.

Digg vs Reddit via Quantcast

Three times a day, for two weeks in March and April of this year, I scraped the front page of r/all to see what is popular on reddit, and what that means.

Reddit is growing. It’s the 58th most visited site on the net (up 6 places from last quarter), and the 21st most popular in the US. Since it defeated Digg at the turn of the decade, reddit has established itself as really the only aggregate site in town – and with that comes power.

If reddit helps shape the internet conversation, what does the data say about reddit?

This is the top 25 subreddits over that fortnight of scrapeage – the front page of subs, if you will.

Perhaps predictably, r/funny is at the top. It appeared the most on the front page, received the most upvotes, and the second most comments because, naturally, it has the most subscribers (over 6 million).

Other predictably popular subs include memes (#2), cute animal pics (#4) and video games (#5).

Interestingly, a few of the more stereotypically reddit subs barely made the front page, or didn’t make it at all. The site is known for its militant atheism, and yet that subreddit only made it to #25, while the sub reflecting the site’s marijuana predilection could only reach #26, outside the top 25 altogether.

Only two of the top-25 are substantially NSFW (Not Safe For Work). The sub r/WTF – wherein people post strange and disturbing things – is about a third NSFW whereas r/gonewild, the site’s most popular porn sub, is exclusively not for the workplace (unless you work from home).

The rankings largely stay the same when using comments instead of upvotes as the key parameter, except there is a notable rise of interaction-led subs like r/askreddit and r/IAmA. Askreddit, in particular, skyrockets to the top of the front page despite only appearing 9 times over the two weeks to r/funny’s 226.

As for the average scores and comments for front-page posts, r/pics and r/askreddit are respectively the top dogs. While r/funny rules in front page appearances and accumulated points, it doesn’t even reach the top 10 in either category. That suggests that reddit’s biggest sub is more quantity than quality.
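The gap between totals and averages is easy to reproduce. The sketch below is not the scraper used for this piece. It assumes each scraped post has been reduced to a dictionary with subreddit, score and num_comments fields, and the handful of posts in it are invented purely to show the arithmetic.

```python
from collections import defaultdict

# Hypothetical front-page posts; every number here is made up for illustration.
posts = [
    {"subreddit": "funny", "score": 3000, "num_comments": 200},
    {"subreddit": "funny", "score": 2500, "num_comments": 150},
    {"subreddit": "funny", "score": 2800, "num_comments": 250},
    {"subreddit": "pics", "score": 4000, "num_comments": 300},
    {"subreddit": "AskReddit", "score": 3500, "num_comments": 9000},
]


def summarise_by_subreddit(posts):
    """Count appearances and compute total and average score/comments per sub."""
    stats = defaultdict(
        lambda: {"appearances": 0, "total_score": 0, "total_comments": 0}
    )
    for post in posts:
        entry = stats[post["subreddit"]]
        entry["appearances"] += 1
        entry["total_score"] += post["score"]
        entry["total_comments"] += post["num_comments"]
    for entry in stats.values():
        entry["avg_score"] = entry["total_score"] / entry["appearances"]
        entry["avg_comments"] = entry["total_comments"] / entry["appearances"]
    return dict(stats)


summary = summarise_by_subreddit(posts)
most_appearances = max(summary, key=lambda s: summary[s]["appearances"])
highest_avg_score = max(summary, key=lambda s: summary[s]["avg_score"])
highest_avg_comments = max(summary, key=lambda s: summary[s]["avg_comments"])
print(most_appearances, highest_avg_score, highest_avg_comments)
```

In this toy data the sub with the most appearances and the biggest accumulated score is not the one with the best average post, which is the same pattern the real front page showed.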

There is an obvious outlier amongst these broad and mainstream subs and that is r/leagueoflegends.

It’s a community dedicated to an exceedingly popular 2009 PC game. With almost 500,000 subscribers, it is the 41st largest subreddit but its community activity exceeds even that.

Stats for r/leagueoflegends

One of the moderators of r/leagueoflegends, arya, said: “This subreddit is the largest unofficial community for LoL. We get between 500-1000 new subscribers per day I’d estimate. Big events do show an influx of new users and higher activities. I remember during Worlds when the stream shut down due to technical errors, the thread about it reached the top of r/all within minutes.”

KingKrapp, another mod, said: “From what we’ve experienced, a lot of our users only come here and don’t really interact with the rest of reddit. We’re a very specific community compared to other big subs.”

It’s the success of niche-y subs like r/leagueoflegends that prompted reddit to introduce trending subreddits at the top of the front page in April.

Umbrae, mod for trendingsubreddits, said: “The thinking behind trending was essentially that there’s a lot of diversity to reddit, but that many of the visitors to the homepage don’t see or understand that. This gives a good hint to the breadth of reddit, while at the same time giving deeply engaged folks a new source of interesting communities.”

The initiative has so far been a success, with Umbrae reporting: “A lot of smaller subs have definitely gotten exposure.”




Only 20% of the top subreddits are not, and have never been, defaults for new subscribers. Default subreddits have more subscribers (naturally) and more interaction, but they consequently have less community.

At the beginning of May, r/mildlyinteresting became a default sub. Its popularity, according to mod RedSquaree, is because “all the content is original, and chances are that nobody has seen anything posted here before. It also doesn’t aim to be amazing content, so expectations are low and people are happy.”

Stats from r/mildlyinteresting

Of its new status, RedSquaree said: “Our growth was very steady until the recent increase as a result of being a default. [It has led to] more removals and a deteriorating comments section.”

It seems that a sizeable sub comes at the expense of a close community. Karmanaut, mod of r/IAmA, said: “Unfortunately, there isn’t a very strong r/IAmA community. I think one of the main reasons behind this is that there is no core of submitters, because there are very few people with multiple submissions. Unlike most other subreddits, all of r/IAmA is original content and has to be done by the original person. And each person has a limited involvement. In its infancy, there was a smaller group of individuals who were very involved in the subreddit but since growing to its larger size, those individuals are no longer necessary to recruit AMA subjects.”

So those are the communities, but what do the actual posts say?


These are the most frequently used words in that two-week period. You can see where the interests of the site lie – there’s an inordinate number of mentions of Oculus, the VR company Facebook bought, compared to the MH370 drama.

Here’s the most popular post of that entire period. It may have only ended up at 4,003 karma but this post received more than 56,000 upvotes.



Perhaps it is what it always was, or what it was always going to be, but reddit is largely a chill place. People go on the front page for a joke, a pretty picture, to learn a weird fact, or take part in an amusing straw poll. It’s a nice place to hang out, it isn’t challenging. Its major contribution to the internet conversation is jokes, memes and silly things that will crop up on Buzzfeed a few hours later.

With trendingsubreddits, the site is attempting to change that in a way. Not so much the pleasant interactions, but the homogenized output. Perhaps by promoting the nichier subs, the front page will change.

Because, just as Katy Perry is not an accurate reflection of modern music, neither is r/funny representative of reddit and its many weird and wonderful subs.

How To… Visualise the ‘Healthy Life Expectancy’ Disparity

What is a healthy life expectancy?

The measure of a healthy life expectancy (HALE) is different from that of average life expectancy. The latter refers to the average (mean) number of years a human can expect to live; the former refers to how many they can expect to live in good health.

The disparity between them is often depressing: According to Eurostat, in 2011 the average French citizen could expect to live in good health for 63 years, while their total life expectancy was 81 years. That’s almost two decades of living in poor health. Since ‘life expectancy’ is frequently used as a measure of a country’s success, that statistic often masks the reality of its population’s senescence and poor health.
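The disparity is simple arithmetic, and it can be expressed either as a gap in years or as the share of life spent in good health. A minimal sketch using the French figures quoted above (the function name is just an illustrative choice):

```python
def hale_gap(life_expectancy, healthy_life_expectancy):
    """Return the years lived in poor health and the share of life in good health."""
    gap = life_expectancy - healthy_life_expectancy
    healthy_share = healthy_life_expectancy / life_expectancy
    return gap, healthy_share


# France, 2011, using the Eurostat figures cited in the text.
gap, share = hale_gap(81, 63)
print(f"{gap} years in poor health; {share:.0%} of life in good health")
# prints "18 years in poor health; 78% of life in good health"
```

Looking at the share as well as the gap matters when comparing countries, since an 18-year gap means something different for a population that lives to 81 than for one that lives to 70.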

So to visualise exactly how great the disparity between life expectancy and HALE really is, I created an interactive visual on Tableau Public. Using the data from 2008 to 2012 (the last five years available), I mapped out each country’s life expectancy and HALE. Navigate between the years using the tabs at the top, hover over a country to see the difference in stark figures, then click on the country to see that graphed.

While it may seem depressing on first viewing, it’s important to note that overall the healthy life expectancy across Europe has, in fact, gone up slightly. Over such a short period, this is hardly statistically significant, but since it falls in line with figures from the OECD that suggest HALE will become an increasingly large percentage of a population’s overall life expectancy, it is certainly a cause for optimism.

Simon Rogers interview: ‘Who cares if I’m still a journalist?’

A veritable giant of data journalism, Simon Rogers launched the Guardian’s Datablog in 2011 before moving over to Twitter where he now manages the site’s vast quantities of data. We asked him about the perils of data journalism’s popularity and where it’s all headed.

Twitter has an unbelievable amount of data – what do you do with it all?

It’s a lot of data — around 500 m Tweets a day. What we try to do is tell stories with it, much of which entails making it smaller and more manageable, to filter out the noise that we don’t need. People Tweet how they think and how they behave — the data can show you amazing patterns in the way we respond as humans to events as they happen. When a story breaks somewhere, or a goal is scored or a song is performed, you can discern these ripples across Twitter. It’s getting those ripples out of the data that is the challenge.

What’s the day-to-day like as data editor at Twitter?

It is such a mix and each day brings its own surprises and challenges. At one end of the spectrum I use free tools such as Datawrapper or CartoDB to make maps and charts that respond to breaking news stories or events, such as this one on the spread of Beyonce’s new album or the discussion around events in the Ukraine or the conversation around #Sochi2014. At the other end of the spectrum, I get to work with the data scientists on Twitter’s visual insights team to produce things like this interactive guide to the State of the Union speech or this photogrid of the Oscars, which is essentially a treemap with pictures. Right now we’re thinking ahead to things like the World Cup and the US Midterm Elections to answer the question: how can we use Twitter data to help tell the stories that matter?


Are you still a journalist?

I’ve wanted to be a journalist since the age of eight and it’s completely in my DNA. Over that time the idea of what was or wasn’t a journalist has completely changed. When I started the Datablog at the Guardian, people asked if data journalism was really journalism at all to which my response was: who cares? My feeling is that you just get on with it and let someone else worry about the definitions. My job is to tell stories and make information more accessible to people. I take Adrian Holovaty’s approach to this:

1. Who cares?

2. I hope my competitors waste their time arguing about this as long as possible.

What do you think about the Guardian’s Datablog since you left?

The Datablog was my baby and always will be special to me but I have to let it go and not interfere, so that’s what I’m going to do.


What drove you to found Datablog?

We had a lot of data that we’d collected to help the graphics team and we also saw there was a growing group of open data enthusiasts out there who were hungry for the raw information. So that’s how it started: as a way to get the data out there to the world and make it accessible.

Have you found any difference in the attitudes towards or ideas about data journalism in the US and UK?

The differences in data journalism mirror the differences in reporting, I would say. It’s a huge generalisation but I would say US data journalism tends to be about long investigations while a lot of the British reporting is aimed at shorter pieces answering questions. But there are exceptions on both sides. They come from different places: US data journalism is based in the investigative reporting of giants such as Philip Meyer; modern British data journalism was born out of the open data movement and owed at least as much to a desire to free up public information as to big investigations.

Is data journalism ‘having a moment’ or are we in the midst of a very real paradigm shift?

It’s becoming mainstream and, just as in other areas of reporting, it is developing different strands and approaches. Partly because there are just so many stories in data now — and to get those stories journalists need skills and approaches they didn’t use before.


Some have said that data journalism is intellectually elitist, perhaps even already out of touch. How would you respond?

I think we are really at an interesting stage. The last few months have seen a lot of reporting resources put into data journalism, certainly in the US. I think what’s happening is that it is developing different strains — in the same way as you have features and news reporting in traditional journalism. You have the ‘curious questions’ type of data journalism which focuses on asking about oddities; then there is the open data type of data journalism which is all about freeing up information. I’m not convinced that we have as a group got the balance correct between showing off how clever we are and making the data accessible and open. That last part is what I’m interested in. I don’t need to see anyone showing off.

Journalists are no longer just writers, they are designers. How important are pictures, diagrams and infographics?

I speak as someone who has just worked on this range of infographic books for children. We have visual minds and telling a story effectively with images will always have a greater impact than words on a page. Some of the most detailed journalistic work I have ever done has resulted in images and graphics as opposed to long articles.

Have you seen any recent data journalism that has particularly caught your eye? And what is it that you look for in a good article/webpage?

I love the work of the WNYC data journalism team, and La Nacion’s commitment to spreading data journalism and openness in South America is amazing and really powerful.

I love maps but there are just so many of them these days. Is data journalism becoming over-saturated?

There are a lot of maps around but it’s just one visual tool. Maybe we don’t ask enough questions about which type of visualisation is most powerful and important to complement a story or feature and a map is often easiest. But also that reflects the lack of decent tools for us to use. If I want to visualise a Twitter conversation off the shelf, that often means a map or a line chart because that is what I can do easily and quickly on my own. Part of my job is to think about new ways for us to do this in future.

Do you think data journalism runs the risk of looking at the big picture at the expense of the small one?

Not being able to see the wood for the trees? The best data journalism complements the big data picture with the individual stories and story telling that brings those numbers to life. I’ve been fortunate enough to work with amazing reporters who tell very human tales and the numbers just gain so much power from joining those two elements together.

Do you have any favourite data tools – scraping, cleaning, visualising?

My visual tools of the moment are: CartoDB, Datawrapper, Illustrator and newly I love Raw (just discovered it).

Do you have any core principles when deciding how to express data?

I normally start off with some idea of what I’m trying to ask, otherwise the data is just too big to be manageable. Love that moment when you do the grunt work to clean up the data and it starts to tell you something meaningful.

Do you have any tips for aspiring data journalists?

The days when you could get a job in a newsroom just by knowing Excel have probably gone or are going. Increasingly the data journalists who succeed will also be able to tell a story. The other piece of advice? Find something that needs doing in the newsroom, something that no-one else wants to do, and be the very best in the world at doing it.


Carl Bialik interview: ‘Any data set has eureka potential’

Carl Bialik is a writer for Nate Silver’s new website FiveThirtyEight, having recently moved from the Wall Street Journal where he started The Numbers Guy column. I ask him about the ups, downs and difficulties of being a data journalist, as well as what he thinks are the most important traits for being successful in the field.

You recently moved to FiveThirtyEight from the WSJ: do you think the two publications differ in their approach to data analysis?

With The Numbers Guy at the WSJ, my role was more about looking at other people’s data analyses, taking them apart and finding the weaknesses in them. I’m going to be doing some of that at FiveThirtyEight but will be more focussed on doing original data analysis.

When you first started at WSJ, were you a data journalist? Or was this more of an organic development?

When I started at the WSJ I don’t think I had even heard the term “data journalism”, and I wasn’t a data journalist for most of my first years there. The more specialised role came later when I started writing The Numbers Guy column. Then, when the WSJ expanded its sport coverage, I started to write much more about sports from a data point of view.

Which is your favourite sport to write about?

My favourite sport to follow is tennis, which is in some ways both my favourite and least favourite sport to write about. It’s my favourite because it’s largely untapped territory in terms of data analysis, but it’s also one of my least favourites because of the way that the data has been archived, making it one of the most difficult to get accurate data for. It’s a pretty fertile area, though, and although it’s not big in the USA, there’s always going to be a focus around major events.

What steps do you take to make sure that the data you are analysing is accurate?

There are some built-in error checks with analysis, which can help determine the reliability of the data. These include checking whether the data you are running the analysis on makes sense, and looking whether different analyses produce similar results. Another important question to ask yourself is whether there is some important factor that you are not controlling for.

At FiveThirtyEight we also have a quantitative editor who reviews your work and points things out for you, such as confounding variables and sources of error. Readers are really vital for this, too: the feedback we have already received from readers who tell us when they think we have made mistakes has been extremely useful.
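As a rough illustration of the two checks described above (does the data make sense, and do different analyses agree?), here is a minimal Python sketch. It is not FiveThirtyEight’s process: the records, the plausible range and the tolerance are all hypothetical values chosen for the example.

```python
from statistics import mean, median

# Hypothetical records: (player, first-serve percentage). Made up for illustration.
records = [("A", 61.0), ("B", 58.5), ("C", 640.0), ("D", 63.2)]


def implausible(rows, low=0.0, high=100.0):
    """Return rows whose value falls outside the plausible range."""
    return [row for row in rows if not (low <= row[1] <= high)]


def estimates_agree(values, tolerance=0.1):
    """Check whether the mean and median differ by at most `tolerance` (relative)."""
    m, md = mean(values), median(values)
    return abs(m - md) <= tolerance * abs(md)


bad = implausible(records)
raw_values = [value for _, value in records]
clean_values = [value for _, value in records if (_, value) not in bad]

print(bad)                            # rows that fail the range check
print(estimates_agree(raw_values))    # mean and median disagree on the raw data
print(estimates_agree(clean_values))  # and agree once the bad row is dropped
```

A percentage above 100 is an obvious entry error, and a large gap between the mean and the median is often the first hint that one is hiding in the data.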

What do you think are the most important traits for being a good data journalist?

The first is having a good statistical foundation, which includes being comfortable with coding and using various types of software. The others are the same as for all types of journalist: being a collaborator, fair, open-minded, ethical, and responsive to both readers and sources.

Which data journalists do you particularly admire?

I’ve admired the work of many data journalists, including my current colleagues, and my former colleagues at the Wall Street Journal. Certainly Nate Silver at FiveThirtyEight: he is a large part of the reason that I wanted to work with FiveThirtyEight in the first place. Also my colleague Mona Chalabi because she has a great eye for finding stories with interesting data behind them.

What’s the best part of being a data journalist?

Compared to most journalism, I think there is more potential to have an “aha” [eureka] moment for any given story, since it can sometimes be a slog if you’re trying to get that just from interviews or other sources. Any data set has the potential to give you a couple of these moments if you’re spending just a few hours looking at it.

And the most difficult part?

I think number one is when you can’t get hold of the data for something: occasionally a topic can be very hard to measure, and you would love to write about it but just don’t have a way in. This is often the case with sport in particular, where there can be measurement problems, issues with the quality of the data, or even a complete scarcity of it. So issues with data quality and access are the most difficult parts.


The 2014 budget highlighted data journalism’s mobile device woes

Data Coverage of the Budget 2014 - George Osborne
Image: 38 Degrees

Data journalism is in vogue these days, so what better time to draw up a graph than at budget time, when communicating lots of numbers efficiently is the top priority? The 2014 Budget saw some great data coverage across the board, but it also showed that one of data journalism’s biggest challenges is finding a format that works well on mobile devices. In this post I’ll take you through some of the stuff that worked really well on mobile and other stuff that didn’t translate from desktop.

Why is it important that data journalism works on mobile?

At the Digital Media Strategies 2014 conference earlier this month, Douglas McCabe of research firm Enders Analysis said that the time people spend on the internet on mobile devices will overtake the time they spend online on a desktop by next year.

If you have a blog, you only need to take a look at your analytics to see how much of your traffic comes from mobile devices. If you haven’t already done so, you will probably find that it is a lot. It is, therefore, pretty important that your content works well on mobile and that carefully crafted visualisations, designed to make visitors invest some time on your site, don’t leave your readers putting down their phones in frustration.

Try viewing this on mobile if you want to experience what I mean.

2014 budget coverage – the Telegraph

I’m kicking off with the Telegraph’s coverage because it was probably one of the best for working on mobile devices. (All the screenshots in this article were taken from my iPhone 5, so you would expect that it would be able to handle most things.)

The Telegraph’s data coverage of the 2014 budget with their chart-builder


Rather than attempt to embed their charts in the body of their article, the Telegraph programmed this chart viewer using their in-house chart building system and then linked to it from the body of their article. As you can see, it works really well. You can easily have the chart and the accompanying text side by side whilst being able to comfortably read both. It is also interactive and gives you the option of clicking onto the next chart.

This is all very well, but what if you don’t have the time, resources or inclination to build your own in-house chart system? 

The Guardian used Datawrapper to mixed effect on mobile

The Guardian’s data blog is a hotbed of interesting visualisations but for budget day they decided to keep it simple. They used what looks like customised versions of Datawrapper charts to display Osborne’s budget. Datawrapper is really responsive and should theoretically work really well on mobile. So on a day when a lot more people than normal are likely to be reading the data blog it makes sense to keep things simple rather than going for a more detailed graphic.

Budget coverage on the Guardian’s data blog

In reality, though, there was a slight problem. This is what one of the line charts looked like:

The line of the graph itself showed up fine but the axes didn’t show up when I held my phone in portrait mode because the chart was too wide to fit on the screen. Looking at it from this view, the chart isn’t very informative.

This problem was solved when turning the phone to a landscape view and this may seem like a pedantic point to highlight. However, the Guardian were relying on people realising that they needed to tilt their phones when reading the article and could well have confused those who didn’t realise this was needed. Why alienate a part of your audience, however small, when it could be accessible to them all?

When the Guardian’s charts worked well, however, they were probably the most interesting in terms of the story that they were telling. This bar chart showing that since 2010, Osborne’s budgets haven’t been particularly harsh or eventful was something that hadn’t been visualised anywhere else.

Bar chart from the Guardian’s data blog

The Daily Mail tried hard with a 3D pie chart

The Daily Mail obviously tried to take all this into account by playing it pretty safe with their data coverage. Although not extensive, it did extend to this non-interactive gem of a pie chart:

The Mail commit a cardinal sin with a 3D pie chart

For the purposes of this article, the Mail’s chart succeeded because it could be read well on mobile. However, in terms of being an effective visualisation it fails miserably, committing a cardinal sin of data journalism. 3D pie charts may look flashy but the very nature of that third dimension skews how big the segments look to the naked eye. In this case the national insurance segment is actually smaller than the ‘other’ segment but it would be difficult to tell this by looking at the graph.

Ampp3d’s 2014 budget coverage was designed for mobile

Ampp3d’s data coverage works really well on mobile

Ampp3d is a relatively new website set up by Trinity Mirror with the remit to create socially shareable data journalism. They run their site on Tumblr and as such it is really responsive to different formats. Ampp3d was basically set up to compare favourably in a piece such as the one I am writing. And, it does.

They, like the Guardian, used Datawrapper to communicate different aspects of the budget. However, because Tumblr is more responsive than the Guardian’s site, the charts’ axes were still visible when the phone was held in portrait mode. This meant that whichever way you looked at it, it was easy for a reader to read the bar chart and subsequently understand the story.

Visualisations will adapt to mobile but we have to adapt as well

None of the visualisations discussed in this post were terrible. There were no attempts at the type of elaborate map that is impossible to read on mobile. Some were really good and most had only minor flaws. But when trying to persuade somebody to spend time on your site, those minor flaws can be the difference between them staying or bouncing.

Visualisation software will no doubt improve in the future and render many of these problems irrelevant. Until that happens, however, data journalists have to take the limitations of mobile into account, even if it means sacrificing an impressive Tableau for a simple table.

How to extract data from a PDF

We live in a world where PDF is king. Perhaps we could even go as far as to call it the tyranny of the PDF.

Developed in the early 90s as a way to share documents among computers running incompatible software, the Portable Document Format (PDF) offers a consistent appearance on all devices, ensuring content control and making it difficult for others to copy the information contained within.
