Interview with Eliot Higgins of Bellingcat

Verifying missile launchers, tracking down ISIS supporters and holding governments around the world to account is just a day’s work for 36-year-old Eliot Higgins.

Last time I met Higgins, an independent intelligence analyst, he was giving a talk about his work with Bellingcat, the investigative news network he founded in 2014. It was this network that trawled the internet’s vast and polluted reservoir of publicly accessible material to track down the Russian-owned missile launcher that took down Malaysia Airlines Flight 17 over Ukraine in 2014.

This time, it’s me doing the tracking, struggling to find Higgins on the hectic roundabout at Old Street station. I eventually spot him standing next to a telephone booth squinting through his glasses, in a Matrix-style black coat.

We only have fifteen minutes but there’s an excitement in Higgins’ eyes as he talks about his work, while easing into his chair. Ordering a Coke, he laughs at how he’s been trying to avoid caffeine.

Bellingcat formed after Higgins’ personal blog, Brown Moses, attracted huge attention as he was able to uncover war atrocities, such as the use of cluster bombs, from the comfort of his home in Leicester. It was the readily available nature of open source tools that prompted Higgins to start Bellingcat and form a network where others were able to learn how to use such tools in their own investigations.

I ask him what his proudest moment is with Bellingcat since its formation. He screws up his face a little. “Hiring people is a lot of fun”, he says. “If it wasn’t for what we did, we would have had this whole narrative of [the] Russian government [claiming to intervene in Syria only to fight ISIS, and not prop up President Bashar al-Assad’s regime] that wouldn’t have been challenged,” he explains. “And, you know, there are families involved who are being lied to by the Russian government and without us, there would have been no push back.”


Though the website states it is by and for “citizen investigative journalists”, and many news outlets, including the Financial Times, call its founder a “citizen journalist”, Higgins himself is uneasy about the label.

Shuffling in his seat, he explains: “It’s not citizen journalism. It’s not just about conflict or journalism. It’s about all kinds of different areas. From my perspective, the work we do is not about journalism: it’s about the research and getting it [the findings and tools] to people that can actually do something with it.”

“For me, a lot of what we do is about accountability and justice and working with international organisations on that.”

While Higgins wants to distance Bellingcat from being purely journalistic, the network’s handful of contributors definitely shares a hack’s mind-set, utilising publicly available tools, such as Google Earth and social media, to investigate atrocities abroad.

Credit: SKUP 2015, Marius Nyheim Kristoffersen

Three years after the network shot to fame by solving the MH17 mystery, it now covers all corners of the Earth and is fast becoming a force to be reckoned with. This was made clear last November, when Higgins quashed the Russian government’s denials over the bombing of a hospital in Syria. By comparing satellite and on-ground photographs from 2014 to 2016, he was able to show specific areas that were in fact damaged by bombing.

Bellingcat also drew huge media attention after using social media to track down ISIS supporters. Most recently, investigators used an archived Facebook profile and geo-located social media photos to hunt for the Berlin Christmas market attack suspect.


When I ask him about how Bellingcat uses social media in their investigations, he blushes, admitting that they recently caused a “minor panic” in Holland, after the network asked its Twitter followers to geo-locate a photograph found on an online community consisting of ISIS supporters. He laughs, shaking his head as he notices my eyes widen: “It’s nothing urgent or scary. We had one photograph [and] we just wanted to know where it was because it looked like it was in Europe. So we put it out on Twitter, asking if people could help geo-locate it.

“We thought it would be impossible. Within an hour we had the exact location: in a holiday park in Holland. The police showed up at the holiday park and the poor manager had to come out in the middle of the night.”

This brings our conversation to online privacy, as I note that he recently asked his 49,000-strong Twitter following about Donald Trump. He says, with a cheeky glint in his eye: “My Twitter page looks like I do a lot online. But if I’m away, I won’t share when I’m actually away. If I post a picture of my time abroad it’s often a week after I’ve actually been there.”

He adds, laughing: “It amazes me that people keep their Instagram profiles public. Who needs likes that much?”

I keep my own settings to myself as he stands up to leave, shaking my hand and plonking the Coke can on the table. At that point, I sadly decide it’s time to change my Instagram settings to private.

Data day: The rise of fake news on Facebook

Did Pope Francis endorse Donald Trump? Did Hillary Clinton sell weapons to Isis? If you don’t know the answers to these questions, you may have been the victim of fake news. In the first episode of a new podcast from Interhacktives – Data Day – Ella Wilks-Harper and Luke Barratt discuss the rise of fake news, question whether the crisis has been overstated, and examine some possible solutions to the problem.

Fake news on Facebook has been the subject of a frenzied debate recently, especially around a US election that has bitterly divided the country. As Americans – and Brits – retreat into online echo chambers of their own making, filling their Facebook feeds with people who agree with them, is it any wonder that ideology might start to trump fact? Some consider fake news the logical conclusion of the filter bubble. Will it be a wake-up call for Facebook to recognise editorial responsibility and abandon the utopian dream of its impersonal, all-ruling algorithm?

Mark Zuckerberg’s initial response to the fake news scandal:

bit.ly/2fZ533d

Buzzfeed’s story about Macedonian teenagers using fake news to garner ad revenue:

bzfd.it/2fYYxcZ

A letter from the editor of Aftenposten attacking Zuckerberg over the censoring of a picture from the Vietnam War:

bit.ly/2fZ4QNJ

Buzzfeed’s analysis of engagement with fake news on Facebook in the last few months before the US election:

bzfd.it/2fZ5JWt

Are analytics changing newsrooms? Interview with Federica Cherubini



Analytics are some of the most effective tools publishers have for distributing stories. Yet implementing analytics and tailoring them to an organization’s specific needs has proved challenging for many newsrooms.

We spoke to Federica Cherubini, a media consultant and editorial researcher who worked for the World Association of Newspapers and News Publishers (WAN-IFRA) in Paris.

 

Together with Rasmus Kleis Nielsen, she authored the Reuters Institute for the Study of Journalism’s report Editorial analytics: How news media are developing and using audience data and metrics.

Cherubini and her co-authors conducted 30 interviews across eight countries to uncover how newsrooms are working with analytics.

How can work in a newsroom be affected by the use of metrics and analytics?

Nowadays, tools and ways to get data are common in a newsroom. It is very typical to see a big screen with real-time traffic in a newsroom. What publishers are now trying to develop are best practices for using analytics, not just in the distribution process after an article has been pushed out on different platforms, but also to perform and produce better journalism.

Which examples of best practices did you find out during your research?

We defined best practice as the use of editorial analytics rather than a generic, rudimentary use of analytics: that means analytics are tailored to each news organization, and the newsrooms decide what they want to look at.

Currently some of the most popular metrics are time spent on a page and the number of shares, retweets and comments, which show how users interact with the content.

If you produce a piece of content, where do you want to publish it? Is it a piece suitable for Facebook or for other platforms? The problem that many newsrooms have with analytics is that they look at data as just numbers, which don’t mean very much. The newsrooms that use best practices are those that give the numbers a context.

Will analytics change editorial decisions?

No, because if a journalist or editor decides to write on a topic that is important to write about, they will. The main thing is that data will never replace your own judgment; data only helps you be more effective.

Chris Moran, audience editor at the Guardian, always says that it is important to decide when to publish your article. If your publishing schedule still reflects a print mind-set, you can use the data to inform that decision and be more effective.

 

Are there any weaknesses in what newsrooms are doing with analytics?

Many newsrooms are a bit generic and basic. They gather the data, they share the data, maybe the journalist gets an email every day with their performance from the day before, but that’s it. So one weakness is not really trying to turn the data into actionable insight.

Another weakness is not specific to the newsroom: it is very difficult to track data across devices or across platforms. Is a share on Facebook the same as a tweet? Does it have the same impact or value? The challenge is understanding how data from different mediums translates across platforms.

Where do you think we are going now in terms of data and analytics, all this stuff that is new for old school style journalists?

I think newsrooms are getting more sophisticated. But they need to understand that a single approach doesn’t exist. There is no one set of tricks you just learn and you’re done.

I really think it should be focused on and tailored to each news organization. Otherwise it’s just tricks to improve the headlines and get more reach. Pure reach, irresponsible reach, doesn’t get you anywhere; it doesn’t mean that the reader is going to come back.

Reach, or being big, isn’t enough anymore. The next question is about how you turn your audience into a loyal audience.

And metrics tap into that, helping you have a bit more information and to test hypotheses in the newsroom. You can experiment, go back, look at the data and see if it worked. If it didn’t, you can change your approach the next day.

 

Federica Cherubini currently works with WAN-IFRA on engagement strategies and editorial conference planning.

 

Data Journalism Awards past winner focus: The Migrant Files


The Data Journalism Awards, organised by the Global Editors Network (GEN), showcase some of the best data journalism every year. Here we take a look at past winners in anticipation of this year’s awards.

In August 2013, Nicolas Kayser-Bril, a French data journalist and CEO of Journalism++, started The Migrant Files project along with 15 other European journalists in order to document the rising migrant death toll at the gates of Europe. The project was a response to the lack of official monitoring of migrant deaths on their journey west to safety.

“We started building our database based on information from NGOs that had done a terrific amount of work on the topic already,” said Kayser-Bril.

So the team extracted and aggregated data from open sources to build the database that would allow them to track each of the migrants dying every day around Europe and off the coast of Africa.

The data is visualised on a bubble map that indicates the number of dead migrants in Europe and Africa. The user gets information on the number of refugees and migrants that died between 2000 and 2015 by clicking on a specific spot in the map.

A detailed explanation of the project can be found on the same website under the article “counting the dead.” The team still updates the information and has since written another article on the amount of money the European Union spends to keep migrants out.

Kayser-Bril said that the map is still being updated to this day and that he and his team will not stop until international organisations like the UNHCR start doing the work themselves.

The jury described the project as an “excellent example of journalists intervening to put a largely neglected issue on the political agenda […] this is data journalism at its best. We need more projects like these.”

Kayser-Bril said it was a nice feeling to have the project recognised by peers.

And as for the Data Journalism Awards? “They’re a great opportunity to review what has been done in a given year.”

Currently, Kayser-Bril is working on several cross-border investigations where “we follow the same goal of measuring the unmeasured.” One of them is The Football Tax, which measures the flows of public money spent on professional football. The other project is Rentwatch, which measures the prices of rent everywhere in Europe.

If you are a data journalist who wants to submit a project, the submission deadline is 10 April 2016. This year’s ceremony will take place at Vienna City Hall on 16 June.

Interhacktives is proud media partner of the DJAs.


City hosts BBC News Labs’ University Challenges


The first City University London Hackathon will take place this weekend as part of the BBC News Labs’ University Challenges.

Organised by BBC News Labs & Connected Studio in collaboration with City, the event brings together 23 participants from the university’s Journalism and Computer Science programmes.

The goal, according to BBC News Labs data scientist Sylvia Tippmann, is to build a tool that will help journalists dive deeper into topics and do meta-analysis on news articles.

She said the task for the student journalists and computer scientists will be to come up with different front-ends for the Juicer, the BBC’s experimental news aggregation tool.

“The challenge will be to find novel and interesting ways to present the data that grows by 15,000 articles a day at the moment,” Tippmann explained.

Working in groups of four to five, Tippmann said, the participants will be required to “build something”: a prototype that works.

“If your project is convincing, we would love to invite you to work with us in News Labs for a while to make it happen and move your prototype to a beautiful and fully functional tool for journalists.”

Director of City’s Newspaper and Interactive Journalism MA Jonathan Hewett described the event as a “happy convergence,” adding that City was excited to play host.

Journalism and technology, Hewett said, are increasingly converging and the hackathon would help journalism students learn how to collaborate with coders and programmers in news projects.

“A project can progress much more quickly when both the journalist and computer scientist know what is possible and what is needed rather than having a dialogue where the journalist is a few steps behind,” he said.

The BBC News Labs’ University Challenges seeks to engage the talents of student innovators and help universities use their collaborative potential to build innovative news tools.

You can follow the event @Interhacktives and #BBCityHack.

Cath Levett: The Guardian Head of Graphics

Credit: The Guardian

Visualising your story is a key aspect of a lot of data journalism. While there are stories that don’t require a visualisation, simple or complex, many more do, and getting it right is key for readers and journalists.

I caught up with Cath Levett, Head of Graphics and Interactives at the Guardian Media Group, to find out about their approach to data visualisations, the future, and what they think is important.

Cath Levett Credit: Cath Levett via Twitter

The Guardian has a reputation for some of the most visually impressive graphics and data representations out there. With this in mind, I asked Cath:

What do you do on a daily basis, and what are the processes behind how the Guardian goes about building its graphics?

“My job has moved very quickly from being a person who does design work physically to being somebody who has ideas and focuses on the best way to tell a story. In that way it’s quite nebulous; it’s all about how we approach things.

“While retaining similar elements, the process we go through to create a visualisation will differ every time. We always sit down in a project group; there will be at least a developer, a designer, another senior editor and a journalist in attendance.

“What we try and work out is what the story is and what we are trying to tell. Then we sketch out a plan and start designing.”

Cath summed up the process as “being about playing around and aggressive collaboration. If one person tries to do it the result is chaos and nothing gets done.”

What tools do you use to build the Guardian’s visualisation?

“We start with pens and paper and whiteboards. We sketch, sketch over each other’s work and come up with an idea. Then we’ll use Adobe Illustrator or Photoshop to render it.

“Only at the point we have a good plan do we start building the graphic. We use a variety of tools to do that; D3 is one of the most common tools we use, but we also use CSS, jQuery and JavaScript libraries, for example. We use pretty much anything with JavaScript.

“We also use simpler tools depending on the project, but a lot of our stuff comes from D3 now.”

What makes good visual journalism?

“Anything can be good visual journalism, be it a rich interactive with photos, an immersive snowfall or just a simple bar chart that tells the story in one second. It all depends on the story and what you are trying to tell.”

“You should not get confused by saying that a great visualisation is a brilliant technical project; it really is about the right approach for each story. It is all about the reader, better informing them and enhancing journalism.

“Things go wrong when people don’t collaborate or aren’t sure what’s going on. Often this is for the right reasons, as they’ve come to a project late. Setting aside egos is the key to visualisations, really. Recently one of our data team did a huge investigation which took weeks, and we decided that the best way to use it was to write it into the story rather than make a big visualisation. That attitude is really key.”

How do you communicate accuracy of data and statistical variance while still making sure graphics look good?

“Well, it is very difficult to communicate these things; it doesn’t really come across my desk, if I’m honest.

“But I think it is about who you are aiming to inform. You would have the hurricane paths if you were writing for an economic paper, but it’s different in journalism. It certainly isn’t about dumbing down; instead it is about showing what is key and answering our readers’ questions.

“We take a lot of pride in our data at the Guardian. We always use the correct data; if it is in any way flawed we won’t touch it, and we will only go ahead with visuals when we have the correct data. This is really important to us: visualisation is as much about good data as great presentation. Otherwise you aren’t telling readers the truth.

“An example of this is how we scale data: we don’t. We always just show it how it is; that’s the data. You can’t exaggerate to make it exciting because that is just a lie.”

How do you strike a balance between print and online?

“It is difficult, it is fun, but it’s definitely a challenge. A good example is that we’ve been building some fantastic interactives for the election, and now the print edition has caught up with the need to do election pull-outs and specials, but if we were going to do them separately it’d be very labour-intensive.

“Thankfully, now we’re working more and more in D3, we can crowbar the visualisations off and put them into print relatively easily, with only minor changes. We do have to condense things much more, as we are obviously digital first and online there is much more space, but this is just distilling down and editing out.”

Speaking of the election, how did you set out to approach and visualise it?

“Our priority was the clearest possible narrative for readers in all our visualisations. We asked ourselves what we were going to need to show them. This could be making sure they have access to the day-by-day polling data or to what the policies of the different parties are.

“To this end we had focus groups in and came up with seven or eight key interactives. Probably the best is the Guardian Poll Projection.

“Although it’s a model, it is our model, and we are very clear about the hierarchy, so that solved a lot of problems for us. We built it in D3 again, like a lot of the election interactives, which makes it very editable, and the polling can be run from a Google spreadsheet. It is a really good example of how great visualisations are about bringing different people’s expertise together.”

Credit: The Guardian

Where are things going, and what are the main challenges ahead for data visualisation?

“It is all about making sure you put the reader into the story and asking where they fit into it. Journalism is all about telling the story, and that is what we need to keep in mind.

“For example, with the World Health Organisation Obesity Index data you could visualise it simply as a map, or you could make it a data set where the reader is involved: they input their data and see where they sit in the world. It could be: are you fatter than a Samoan or skinnier than an Ethiopian?

“Or take the Tour de France, where you could measure your cycling speed against Chris Froome. This sort of stuff is difficult, but it’s about addressing our key challenge: making sure the readers get the best experience from our visualisations.”

“Our other main challenge is staying abreast of technological change, which is still moving very quickly indeed.”

Is the future mobile then?

“Yes, it is where 60 per cent of our traffic comes from – more at the weekend.

“We have a mobile-first approach: if the visualisations don’t work there, then they don’t work. Everything is designed for each mobile breakpoint, portrait and landscape on mobile, small and big tablets. We try and cover all the angles.

“Even now we are designing for the Apple Watch and other wearables which will be a new breakpoint, but to make sure we stay on top we have to keep ahead of the curve.”

That revelation seems the perfect point to stop on, and Cath has to go back up to cover the election. I leave with the importance of mobile, and of keeping readers’ needs as your key goal, firmly stuck in my head, better equipped to visualise my data in the future.

How to write a data story with bad data

The General Election is the most important media event of the year. It’s a chance to earn your stripes as a journalist and get some page leads in your portfolio.

This was the thought I had last week when I was on work experience with the Times’ Redbox supplement. If you’ve heard of Redbox, you’re probably also aware of its strong emphasis on data-driven journalism. Redbox receives exclusive polling data from YouGov to keep it ahead of the curve.

I wanted to prove that I could write strong data stories. I had already been working on a feature about young candidates in the election, and thought it could work as a data story. The only problem is, the data on parliamentary candidates themselves is inconsistent as hell.

I used a website called Your Next MP, which had a spreadsheet of every candidate running in the election this year.

The data was pretty bad. There were huge chunks of information missing, no guarantee on the accuracy of the data and another journalist mentioned “crowdsourcing” when the website came up in conversation.

What should you do in this position? Do you give up after days of research and interviewing? Do you try and find a different angle that doesn’t need data?

It might not look exactly how you thought it would from the beginning, but a bad or incomplete dataset doesn’t have to mean the story is dead. There are lots of ways to tell a strong, accurate data story that don’t involve perfect data.

 

Clean it

My battle with the data

First things first: you can’t do anything with a messy dataset. You should clean what information you do have so you at least know the extent of the problem. To see how much data you have to talk about, right-click on the column you’re looking at (in my case this was "birth date"), select "filter" and choose the values you want to keep. Although this is not a definitive list, it gave me a large cross-section of young candidates to research and talk about further in my story.

You might find that the data you’re left with is incomplete but still paints an interesting picture and backs up what you already know. Equally, you could find that you simply do not have enough information to make your story data-centric. Either way, you need to know. Make a note of the change at each stage of the cleaning process so you know how inconsistent your data actually is. Then you can make an informed decision on how important a role the stats will be able to play. You can find out more on how to clean data from this handy guide, or in this video.
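If you’re comfortable doing the same step outside a spreadsheet, here is a minimal sketch of that filtering in Python with pandas. The file name and column names ("candidates.csv", "birth_date", "name", "party") are hypothetical stand-ins rather than the actual Your Next MP export, and the cut-off date is just an illustration.

```python
import pandas as pd

# Hypothetical export of the candidates spreadsheet.
candidates = pd.read_csv("candidates.csv", parse_dates=["birth_date"])

# First, see how patchy the column actually is.
missing = candidates["birth_date"].isna().sum()
print(f"{missing} of {len(candidates)} rows have no birth date")

# Keep only rows with a birth date, then filter to candidates born after a
# cut-off: the same idea as Excel's column filter.
young = candidates.dropna(subset=["birth_date"])
young = young[young["birth_date"] > pd.Timestamp("1995-05-07")]

print(young[["name", "party", "birth_date"]].sort_values("birth_date"))
```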

 

Cite your sources

Crowd-sourcing site Your Next MP

You should be doing this anyway, but it’s even more important if you’re worried about the accuracy of your data. Take yournextmp.com as an example. I was fully prepared to analyse information on incumbents on this site for a data story on same-sex marriage. That is until I spoke to Roger Smith at the Press Association:

“The information is gathered through crowdsourcing, which makes it really rather unreliable. There may well be quite a few last-minute withdrawals and so the data’s accuracy can’t be guaranteed.”

This looks like a death sentence for your feature, but it’s actually something you can work with. As long as you know and acknowledge that the dataset you have doesn’t tell the whole story, you still have the basis for an insightful piece that is enhanced by data.

 

 

Don’t analyse too deeply


Do not be tempted to over-alter the dataset to find an angle. The chances are that anything you do to the set beyond cleaning will create further inaccuracies. Abandon any grand ideas you had of merging it with polling data or finding average ages. It’s not going to work. Here’s an example of a dataset I had to work with which had the same issues.

Remember that the data will not tell the whole story, but you can look at it and analyse it to get some interesting statistics to illustrate your bigger point.

 

 

 

Avoid misleading visualisations

Here’s one I made earlier…

For the same reason that you shouldn’t be analysing the data too deeply, you shouldn’t be putting the information you do have into a graph. Graphs and maps assume that the data is gospel. If you can’t guarantee that then any visualisation is misleading and uninformative.

 

Focus on people, places and personalities

Using candidate data to make contacts

Your data is not going to be the hook of a ground-breaking discovery, but it’s actually very rare for data to make front-page news. Instead, you should be using your data as a starting point to explore different areas, people and trends. Say your story is about candidates under 20 running in the election, and you can only find 8 people who fit the bill, even though you know there are more. Use the candidates you have found as a contact list rather than the story, and before you know it you’ll have some interesting insights into the political careers of teenagers.

 

So there you have it. Use this guide any time you have a dataset you feel very uncomfortable using as the basis of a story, or even if you’re new to data journalism and don’t know what to analyse. You don’t have to be a statistician to create great data stories.

 

Interview: Hera Hussain on the need for open data

Hera Hussain is Communities and Partnerships Manager at OpenCorporates, the world’s largest open companies database. When she isn’t busy organising hackathons and liberating corporate data from across the internet, she works with the social entrepreneurship movement MakeSense and empowers women to achieve independence through Chayn. She spoke to Interhacktives about her experiences with open data, its importance, and the role that journalists should play in making it more accessible.

A basic right

“I initially misunderstood it,” Hera says, of her first encounter with open data as an organiser of WikiMania, an annual event focused on wikis and open content. “Like many other people, I could only see some applications of open data. For example, I thought it would be really useful if the government posted statistics on crime. What I didn’t realise is that the aggregated statistics aren’t important. Anybody can come up with those numbers; the important thing is the underlying data. It’s not just about how many knife crimes have happened, it’s more about when they happened, where they happened – the little details.”

“Data should be a basic right,” she goes on. “And that wasn’t very clear to me until I started working for OpenCorporates.”

What does being Communities and Partnerships Manager for OpenCorporates entail? “It’s my job to make sure that the data held by OpenCorporates is used for social good – by journalists, by NGOs, by citizens, by other open data organisations. My job is to make sure that happens and also make it easy for people to contribute open data.”

One of the ways people can get involved in contributing open data is through taking part in #FlashHacks, monthly hackathons where anyone can come along to liberate and map corporate data or write bots that will convert the data into accessible formats.

Hera Hussain at a FlashHacks hackathon

 

The importance of open data

Believe it or not, the UK is one of the world leaders in open data, alongside New Zealand. “Especially company information,” says Hera. “Our Companies House is really open to suggestions from the NGO community and the open data community, and they’ve done great work in opening up the database. The government has a really pro-open data stance which makes it possible for this all to happen.”

What is the most important thing about having open data? “I think it’s the fact that it exists,” Hera says. “People always say open data is very elitist. Only people who can work with data can use it. But because I think it’s a right, the fact that it exists is really important, because there will be somebody who can use it. We can leverage their knowledge to make things better.

“There’s always somebody out there who can apply it, and while there’s a big gap in terms of understanding data, I think eventually that will be filled. You can say the same thing about engineering, you know – engineering’s really elitist, because not everybody can understand how machines work or how buildings or materials work. But those who know how to make it work make it work for everybody.”

Making an impact

Ideally, she says, more people should be educating themselves about data and what it can do; but it might take a different approach from the data community to generate more interest in open data. “The problem is that things that make an impact on people are stories. I think we need more of that and I think the whole open data community is realising that, is trying to create a storyline of how it can be applied and how it is being applied.”

Is this a role that journalists should be playing? “I think it’s a responsibility. I think you become a journalist because you want to report on something that’s true, or you want to investigate something that you don’t know about. In both cases I think preferring open data over proprietary data is really important.”

Of course, the right data isn’t always available for journalists to tell the stories they want to, but Hera is optimistic that this will improve as the open data movement and data-driven journalism gain momentum. “So many times I’m contacted by journalists who want to work with open data and have a very strong hypothesis that they want the data to prove or disprove, but the data’s not available, so there’s no way to do it,” she says. “I think that can be quite frustrating. But I think the new data-driven tide in journalism is interesting, and I think these things are going to be much easier to do in the future. As we liberate more data, there’s more pressure on governments to release data, more pressure on companies to release data in the right formats, so I think the future is promising. It’s just that there’s a long way to go before it becomes easier for journalists.”

Change is coming

What does she think is currently the biggest obstacle to making data more open? “Two things, from OpenCorporates’ perspective: one is that we need so many more bodies of volunteers to actually scrape the data sets … And we need to actually find them as well. Finding data in itself is a big, big problem. Some people say that it’s almost like a self-fulfilling prophecy, because as governments and companies are realising that people are making use of this data to do things that they might not like, they start closing them down. So many corporate registers have closed down in the last year.

“There’s not enough incentive for them to release information, so we need to ramp up the pressure on them. But at the same time, there’s something there which they don’t want to get out, which is why it’s not happening.”

“I am glad that there is a conversation happening, and journalists are a big part of it. They put pressure on governments and companies to be more transparent.”

But things in data aren’t all doom and gloom. As previously mentioned, the UK as a whole has a positive approach to making data open, and next year this will improve even further with the launch of a central register of beneficial ownership for UK companies. It will mean that companies have to disclose information on anyone who controls more than 25% of the company’s shares and voting rights, starting in April 2016.

“I think we will definitely see a difference [in the amount of open data] starting from next year,” says Hera, “because the beneficial ownership information will be open in the UK. Other countries have said they will open it as well. For the next two or three years, I think we’re definitely going to see some change.”

Billy Ehrenberg on data journalism’s future and the skills you need

Billy Ehrenberg, ex-Interhacktive and data journalist, has spent the last year working on new data-based projects with City A.M.’s expanding online team.

I caught up with him to ask what his role involves, and what he sees as the future of data journalism.

On an average day, he admitted, he doesn’t do as much data work as he’d like.

“There is a common misconception that graphs in stories mean that it’s data – but I try to get at least one data piece done a day.

“Some of what I do is trying to find a story in the numbers, but often the story is quite obvious or easy to tease out, and I need to use visuals or explanations to make it accessible and interesting. To do this I use a few different tools.”

“Excel, Google Sheets, QGIS, CartoDB, HighCharts, Quartz Chartbuilder, Outwit Hub, Illustrator – each one has their advantages”

Billy has several different favourite data tools depending on the job at hand. For example, he says he usually prefers Excel for cleaning datasets.

“I’ve used Open Refine a bit, and that’s certainly worth getting into. Excel and Google Sheets have a bunch of functions that let you pull data apart and whip it into shape – so how useful Excel is depends mostly on if you’re boring enough to have fiddled with functions for days on end.”


“Fake data”

On what he sees as the future of data journalism, Billy reckons that “it will naturally divide between real data and fake data. You see some people who do things like not adjusting historic financial data (even film revenues) for inflation because they are in a rush or just don’t realise they should. That’s a dangerous thing: people can see a graph or chart and think that what it shows is fact, when it’s as easily manipulated or screwed up as words are.”


“I think you’ll get two sets of people: those who do not do a lot else, with big skillsets like coding, stats, cartography and programming, and those who have to rush out faux data for hits.”

The next ‘hot topic’

Billy told me he’s not sure what the next hot topic is, but he thinks it’ll be related to coding – “maybe it’s a cop out, as it’s nothing new.

“People wonder if it’s worth coding if you’re a journalist, and even if you are a journalist if you code. I’m obviously pro-learning.”

Data principles

“It’s really important to try not to mislead people. Graphics are easy to use to manipulate people. The more complex they are, the more likely you are to mess up and the less likely it is anyone will notice, even if it changes something.”

“Visualising ethically is important too: even the colours on a map or the extents of an axis can make a change look hugely dramatic”

“I try to let the data tell the story as much as I can and if I don’t like what it’s saying I won’t change the message.”

When asked what data-related skill he wishes he could master, Billy said: “It’s got to be D3. It’s so difficult that I get a real buzz out of solving something in it, even if it’s taken hours.

“Probably learning JavaScript is the best way to crack that nut. It’s a work in progress.”

Data Journalism workshops by Interhacktives 2015: learn to clean and visualise election data


Love data? Ever wondered what a data journalist actually does? Just want free booze?

The Interhacktives are pleased to announce TWO upcoming events this April that will satisfy either or both of these needs. Brace yourselves.

The sessions will be focused on giving you some data skills you can use for election coverage.

The 2015 General Election is set to be one of the closest in living memory, and data will be essential in figuring out exactly what happened and why.

Cleaning the data

We will start with a Google Hangout webinar on Monday 13 April, 6pm-8pm. Everyone who has registered on Eventbrite will receive a link to the webinar.

Our Excel guru Jonathan Frayman will be using 2010 General Election results data to show how to clean a dataset and arrange it into a format that makes it easy to visualise and to find stories by analysing the data.

More info on Google Plus event page.

Visualising the data

The second event will be held at City University London (home to the Interhacktives) on Thursday 16 April, 6pm-8pm.

Our mapping star Emily Shackleton will use the clean dataset from the previous session and teach you how to map this data with web mapping tool CartoDB, and how to source potential news stories through visual analytics.

There will also be a networking session afterwards with refreshments provided.

Our team will be available during both sessions to supervise and troubleshoot any problems you may get with the new scary technology!

Both events are free, but please register your attendance by booking a ticket on Eventbrite:

  • Register for Cleaning the Data webinar here.
  • Register for Visualising the Data workshop here.

It would be useful (but not compulsory) to attend both sessions to get the most out of them, but people who can only attend one of the two sessions are of course still very welcome. Both sessions are aimed at total beginners, so no prior knowledge is needed.

For the super-keen data geeks who want a quick look at the dataset we’ll be working with, here is a link.

See you there!


How to use statistical functions in Excel

Lies, damned lies, and statistics. At least that’s how the saying goes and how the wider public feels; for some reason people tend to distrust something backed by numbers more than something that isn’t. Well, that’s just completely wrong, but the problem is that numbers can tell two different stories from the same data.

Statistics might drive people insane, scare them, or not seem relevant.

But in this post I’ll try and explain how to use Excel for some basic statistical analysis and what it can tell us.

Disclaimer: There will be outcomes I don’t explain as they are more advanced, but they may be covered in a later post. I will also use very simple data sets for ease of explanation.

Why might a data journalist want to use statistical tools?

Journalists have a, let’s be honest, earned reputation for being scared of numbers and frankly being awful at them. But data journalists and those interested in data are a different breed.

The reason we should be interested in statistics is what they tell us about our data: they are a tool to spot patterns, check reliability and ask whether all is as it seems. For a basic story this is probably going a bit far, but when handling complex data sets, especially financial ones, statistics tell us a lot.

If you perform a regression analysis, for example, and the results seem odd, there is a lead to explore which you would never have found except by chance or by really understanding the subject area. In essence, statistics are a tool that allows you to tell more of the story and find exciting new leads.

So let’s get started.

First, you need to make sure you have the right tools. For Mac this means:

  1. Download StatPlus:mac LE for free from AnalystSoft, and then use StatPlus:mac LE with Excel 2011.
  2. You can use StatPlus:mac LE to perform many of the functions that were previously available in the Analysis ToolPak, such as regressions, histograms, analysis of variance (ANOVA), and t-tests.
  3. Visit the AnalystSoft website, and then follow the instructions on the download page.
  4. After you have downloaded and installed StatPlus:mac LE, open the workbook that contains the data that you want to analyze.
  5. Open StatPlus:mac LE. The functions are located on the StatPlus:mac LE menus.

Statistical Functions

To start with, here is a list of the majority of the statistical functions within Excel. We won’t be covering anywhere near all of these, but an explanation is provided with each.

[Screenshots: Excel’s statistical functions, with a short description of each]

Learn about your data

One nice thing about the Data Analysis tool is that it can do several things at once. If you want a quick overview of your data, it will give you a list of descriptives that explain your data. That information can be helpful for other types of analyses.

We shall use the data below. It shall also be used for other topics in the post.

[Screenshot: the sample dataset used throughout this post]

If we want a quick overview of the variables, we can use the descriptive statistics tool. Go to the basic statistics tab in StatPlus and click on descriptive statistics, then highlight the column containing the data; if you have checked column 1 as labels, make sure to include it. The output looks like a lot, but some of these variables can be helpful. This is useful for journalists because it helps us test the validity of our data and means we don’t go too far in trying to find a story before realising it isn’t worth our while.

[Screenshot: descriptive statistics output]

Looking at Variable #1 (quantity sold): if you plan to do a regression, you want the mean (average) and median (middle value) to be relatively close together, and you should see a standard deviation that is less than the mean. In the table above, our mean and median are close together. The standard deviation is about 1296 – which means that about 70 per cent of the quantity sold was approximately between 5900 and 7100. Not too scary so far, right?
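If you would rather get the same overview outside Excel, here is a minimal sketch using pandas; the file and column names ("sales.csv", "quantity_sold") are hypothetical stand-ins for the sample data shown above.

```python
import pandas as pd

# Hypothetical stand-in for the sample dataset used in this post.
sales = pd.read_csv("sales.csv")

# describe() returns count, mean, std, min, quartiles and max in one go,
# roughly the same overview as the descriptive statistics tool.
print(sales["quantity_sold"].describe())

# Check the mean and median are reasonably close before attempting a regression.
print(sales["quantity_sold"].mean(), sales["quantity_sold"].median())
```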

Correlation

Another good overview of your data is what is called a correlation matrix, which shows you which variables tend to go up and down together and in what way they are moving. For example, say you were looking at data which showed how something changed over time: you could use the matrix to see if the relationships are what you’d expect. This might find you a great story.

It is useful as a first look at what your data is telling you before potentially delving into regression, and also for working out whether the data behind a story is reliable. Correlation is measured by a statistic called Pearson’s r, which ranges from -1 (a perfect inverse relationship) to 1 (a perfect positive relationship).

Go to the data table and the data analysis tool and choose correlations. Choose the range of all the columns (minus headers) that you want to compare. You then get a table that matches each variable against every other variable. Below you see that the correlation between Column 2 and Column 3 is 0.02156; it is, of course, the same between Column 3 and Column 2.

Correlation provides a general indicator of what is called the linear relationship between two variables, but crucially you cannot make predictions from it. To do that, you need what is called linear regression – this will be covered later.

[Screenshot: correlation matrix output]
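For reference, the same correlation matrix can be produced outside Excel in a couple of lines of Python; again, the file and its columns are hypothetical stand-ins for the sample data.

```python
import pandas as pd

# Hypothetical stand-in for the sample dataset (same numeric columns as above).
sales = pd.read_csv("sales.csv")

# Pearson's r for every pair of numeric columns. Values near 1 or -1 mean the
# variables move together (or in opposite directions); values near 0 mean
# little linear relationship.
print(sales.corr(method="pearson"))
```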

What we can use it for, however, is checking that the outcomes are logical and within a margin of error; if not, ask why. If the data set you’re working on suddenly changes, ask questions and see if there isn’t a story. It is a tool that allows you to go beyond the obvious and find interesting stories within your data.

Some characteristics help predict others. For example, people growing up in a lower-income family are more likely to score lower on standardized tests than those from higher-income families.

Regression helps us see that connection and even say roughly how much one characteristic affects another.

Trend Analysis

Trend analysis is a mathematical technique that uses historical results to predict future outcomes. This is achieved by tracking variances in cost and schedule performance.

There are three ways to do trend analysis: the regression equation, the forecast function, or the trend function. I will go through these three methods using the simple data set. One important term to understand here is R-squared, as it gives an indication of how well your trend fits the data. But what is R-squared?

R-squared is a statistical measure of how close the data are to the fitted regression line.

The definition of R-squared is here. Or:

R-squared = explained variation / total variation

R-squared is always between 0% and 100%:

  • 0% indicates that the model explains none of the variability of the response data around its mean.
  • 100% indicates that the model explains all the variability of the response data around its mean.

It is important to remember that R-squared cannot determine whether the coefficient estimates and predictions are biased, which is why it is critical to assess the residual plots as well. Nor does R-squared tell you whether a regression model is adequate: it can be low and right, or high and wrong.
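As a quick illustration of the formula above, here is a minimal sketch that computes R-squared by hand for a set of predictions; the numbers are made up purely for illustration.

```python
import numpy as np

# Made-up observed values and the predictions from some fitted model.
y = np.array([8500.0, 6300.0, 7100.0, 5200.0, 6800.0])
y_pred = np.array([8400.0, 6400.0, 7000.0, 5400.0, 6700.0])

# Total variation of the data around its mean, and the variation left unexplained.
total_variation = np.sum((y - y.mean()) ** 2)
unexplained_variation = np.sum((y - y_pred) ** 2)

# For a least-squares fit this equals explained variation / total variation.
r_squared = 1 - unexplained_variation / total_variation
print(round(r_squared, 3))
```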

The equation

[Screenshot: fitting a trend with the regression equation]

Forecast function

[Screenshot: fitting a trend with the forecast function]

Trend function

[Screenshot: fitting a trend with the trend function]
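The screenshots above show Excel’s own routes; as a rough equivalent outside Excel, here is a minimal sketch that fits a straight-line trend and extends it one period forward with numpy. The series is made up purely for illustration.

```python
import numpy as np

# A made-up historical series, e.g. a figure recorded over eight periods.
periods = np.arange(1, 9)
values = np.array([120, 132, 128, 141, 150, 148, 160, 167])

# Fit a straight line (degree-1 polynomial), the same idea as a linear trend line.
slope, intercept = np.polyfit(periods, values, 1)

# Forecast the next period by extending the fitted line.
next_period = 9
forecast = slope * next_period + intercept
print(round(forecast, 1))
```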

Why might this be useful?

Well, if you are getting large outliers, or your R-squared value is off, it can be an indicator of an unreliable data set. For a jobs data story this could suggest that the government’s claims of a smooth system are not true.

Or if you were doing a story on incidents of piracy, it could lead you to explore avenues around reporting and hotspots, or to identify key periods for further investigation. Paradoxically, by going deeper into the numbers you can go further beyond them and ask the really tough questions.

Statistics keeno klaxon

Here are other types of standard trends, which may be touched on in a future article:

  • Polynomial – approximating a polynomial function of a given power
  • Power – approximating a power function
  • Logarithmic – approximating a logarithmic line
  • Exponential – approximating an exponential line

[Chart: examples of polynomial, power, logarithmic and exponential trend lines]

Regression Analysis

Regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modelling and analysing variables, where the focus is on the relationship between a dependent variable and one or more independent variables. Not obviously related to journalism? Think about a story on stress levels for a certain group and other factors such as wages, housing or frankly anything else. If you can identify such a relationship you can then track those changes over time and tell a much deeper data story; it gives weight to sometimes seemingly obvious answers.

So let’s get started:

  1. In StatPlus click on the statistics tab.
  2. Select linear regression and click OK.
  3. Select the Y Range (A1:A8). This is the response variable (also called the dependent variable).
  4. Select the X Range(B1:C8). These are the explanatory variables (also called independent variables).
  5. These columns must be adjacent to each other.
  6. Check Labels.
  7. Select an Output Range.
  8. Check Residuals.
  9. Click OK.

Excel produces the following Summary Output:

[Screenshot: regression summary output]
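For comparison, the same kind of multiple linear regression can be fitted outside Excel. Here is a minimal sketch using numpy’s least-squares solver; the numbers are made-up stand-ins for the sample data, so they will not reproduce the exact output shown above.

```python
import numpy as np

# Made-up stand-ins for Quantity Sold (y), Price and Advertising.
quantity_sold = np.array([8500.0, 4760.0, 5800.0, 7400.0, 6200.0, 7300.0, 5500.0])
price = np.array([2.0, 3.5, 3.0, 2.5, 3.0, 2.5, 3.3])
advertising = np.array([2800.0, 1200.0, 2000.0, 2500.0, 1800.0, 2300.0, 1500.0])

# Design matrix with a column of ones so the model includes an intercept.
X = np.column_stack([np.ones_like(price), price, advertising])

# Ordinary least squares: returns the intercept and the two coefficients.
coefficients, _, _, _ = np.linalg.lstsq(X, quantity_sold, rcond=None)
intercept, price_coef, advertising_coef = coefficients
print(intercept, price_coef, advertising_coef)
```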

R Square

R Square tells you how much of the variation in your dependent variable can be explained by your independent variables. Here R Square equals 0.962, which is a very good fit: 96% of the variation in Quantity Sold is explained by the independent variables Price and Advertising. The closer to 1, the better the regression line (read on) fits the data.

Significance F and P-values

To check if your results are reliable (statistically significant), look at Significance F (0.001). If this value is less than 0.05, your data looks good. If Significance F is greater than 0.05, it’s probably better to stop using this set of independent variables: delete a variable and rerun the regression until Significance F drops below 0.05. Of course, this is no guarantee of success.
Most or all P-values should be below 0.05.

Coefficients

The regression line is: y = Quantity Sold = 8536.214 - 835.722 * Price + 0.592 * Advertising. In other words, for each unit increase in price, Quantity Sold decreases by 835.722 units; for each unit increase in Advertising, Quantity Sold increases by 0.592 units. This is valuable information.
You can also use these coefficients to do a forecast. For example, if Price equals £4 and Advertising equals £3000, you might be able to achieve a Quantity Sold of 8536.214 - 835.722 * 4 + 0.592 * 3000 = 6970.
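To make the arithmetic concrete, here is a small sketch that plugs the coefficients from the summary output into the regression equation; the helper function is just an illustration of the formula, not part of any library.

```python
# Coefficients taken from the regression summary output above.
INTERCEPT = 8536.214
PRICE_COEF = -835.722
ADVERTISING_COEF = 0.592

def predict_quantity_sold(price, advertising):
    """Predicted Quantity Sold from the fitted regression equation."""
    return INTERCEPT + PRICE_COEF * price + ADVERTISING_COEF * advertising

# Forecast for a price of £4 and £3,000 of advertising, as in the example above.
print(round(predict_quantity_sold(4, 3000)))    # roughly 6970

# Residual for the first data point (actual 8500, price £2, advertising £2,800).
print(8500 - predict_quantity_sold(2, 2800))    # roughly -23, see the residuals section
```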

Residuals

The residuals show you how far away the actual data points are from the predicted data points (using the equation). For example, the first data point equals 8500. Using the equation, the predicted data point equals 8536.214 - 835.722 * 2 + 0.592 * 2800 = 8523.009, giving a residual of 8500 - 8523.009 = -23.009.

Why might this be useful?

See my explanation of regression analysis above! This is probably the most advanced statistics covered in this post, but I would say it is potentially the most useful, as it can be applied to so many types of data, including data sets you have created yourself.

Conclusion:

I hope this has been a good introductory overview of statistics in Excel. I’ll do another post soon and update this one when I get round to it, but I hope this has proved useful.

Interview: Ben Kreimer on drone journalism

DJI Phantom with camera

When most people think of drones, they’ll probably think of the flying machines that rain indiscriminate death down across the Middle East. But not Ben Kreimer. For him they are a way of seeing the world in a new and exciting light, and occasionally for doing a bit of accidental journalism along the way.

Even at a young age, Ben was doing things differently. He made his own toys from wood and metal, so when he was told he’d have to build his own brackets to hold cameras under drones, he leapt at the chance.

“When the drone thing came around I was getting a degree in journalism. Not because I wanted to become a reporter but because I’m curious, I like exploring the world and seeing things and experiencing things.”

And he has indeed explored the world with his drones, from filming urban crocodiles in India to chasing endangered species in Tanzania and mapping landfills in Kenya. But of them all, his favourite is the Drone Safari.

“I’d never been on a safari before so being able to see the animals, to be able to film them, was really exciting.

“Doing it was fun and then people’s reactions to it afterwards. Most people have seen pictures and video of these animals, but when people see it from this perspective, flying around a giraffe’s head? They get a kick out of it. It’s the same story, but from a new perspective. That’s what I like about it.”

Drone Safari with giraffes, credit: Ben Kreimer

The challenges of drone journalism

Drone journalism is not without its own unique challenges. In 2014 the FAA (Federal Aviation Administration) told the University of Nebraska it didn’t have permission to fly drones and would have to apply for it, and that was just the beginning for Ben.

“In the past year I’ve spent more time in India and Kenya than the US, and now both countries have explicitly said that civilians can’t use drones without permission from the defence branch. So that makes it hard, as I don’t want to break those regulations as a foreigner. I think the issue is foreigners coming in and flying around for fun.”

The landfill

Whilst in Kenya he used a drone to map a landfill, but that could be just the beginning. The air around the landfill is full of pollutants, enough to cause respiratory problems if you’re around it long enough.

“I was thinking of building an air pollution sensor. You could fly that around the dump, and around the area around the dump and see what we’re breathing. How’s the pollution travelling out, and can we visualise that data and show a three dimensional plume of bad air that emanates from the dump? And can you do that elsewhere?”

Ben remains adamant that the laws that are currently causing him so many problems won’t be around for very long, and in the meantime he already has ambitious plans to work with UNESCO (the United Nations Educational, Scientific and Cultural Organisation) and make 3D models of historical sites around the world. But, as always, his reasoning is refreshingly plain.

“It’s 2015 and why can’t we look at a three dimensional model of the Angkor Wat? I think it’s time for that. I get interested in things when I realise they’re possible. I like travelling just to go to a new place and see how things are there.

“Now’s a good time to get into the journalism part too. As far as I know there are only two universities in the US that are looking into that. But you have to go do something with it.”

The best places to get your data



For many beginners getting into data, the first stumbling block is actually accessing the information you want. Before you’ve started to tell the story, and before you can even get your teeth into the visualisations, you fall at the first hurdle.

Fear not. We’ve collected some of our favourite sources of reliable and informative data, giving you some starters if you’re struggling to figure out where to find your story.

Office for National Statistics

Government releases are always a good source of up-to-date information, and the ONS is one of the best places to get data on population changes, demography or unemployment.

The ONS is also good for getting files on counties, constituencies and wards, giving you information on the shape and size of the areas – handy when it comes to mapping data.

The USA has a similar model with Data.gov, giving people access to its data. One warning, however: governments may not release information that makes them look bad. If you want to make sure you’re getting the full story about an institution, never just consult one source on it.

Data.police.uk

Data.police.uk is a hub of data on crime and policing in England, Wales and Northern Ireland. You can access CSVs on street-level information and explore the site’s API for data about individual police forces and neighbourhood teams.

This is a very handy site for seeing how the police are performing on a local basis. You can compare crimes by location and time, making it easier to spot correlations or patterns.
The Metropolitan Police also publishes its data on each crime in London on police.uk.
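
If you’re comfortable with a little code, the API can be queried directly. Here is a minimal sketch in Python using the requests library – the coordinates (central London) and the month are only illustrative, so check the site’s API documentation for the exact fields returned:

    import requests
    from collections import Counter

    # Street-level crimes near a point. The coordinates and month are
    # illustrative values only.
    url = "https://data.police.uk/api/crimes-street/all-crime"
    params = {"lat": 51.5074, "lng": -0.1278, "date": "2014-10"}

    response = requests.get(url, params=params)
    response.raise_for_status()
    crimes = response.json()

    # Count crimes by category to see what dominates the area.
    by_category = Counter(crime["category"] for crime in crimes)
    for category, count in by_category.most_common(10):
        print(category, count)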

Nomis

Nomis is a good source for official labour market statistics – you can get detailed data based on local areas, and can search summary statistics by local authority, ward or constituency.

MyNHS

Want to see the data that the NHS and local councils use to monitor performance and shape the services you use? MyNHS gives you that chance, and it is one of the best sources of data on the UK’s health service.

Eurostat

If you’re looking to compare the UK against other countries, or are covering a more international story, Eurostat publishes a wide range of statistics on EU member states – on economic output, labour markets and demographics, to name just a few.

World Bank and World Health Organisation

For a more global story, the World Bank and the World Health Organisation release data on global finances and on public health and safety. Both have a multitude of datasets ready for you to trawl through in search of global trends and the effects of particular events.
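
Much of the World Bank’s data is also exposed through a public API alongside the bulk downloads. Here is a minimal sketch in Python – the indicator (total population) and country codes are illustrative, and the layout assumes the v2 JSON response format:

    import requests

    # World Bank indicator API: total population (SP.POP.TOTL) for two
    # illustrative countries. format=json returns [metadata, observations].
    url = "https://api.worldbank.org/v2/country/GB;FR/indicator/SP.POP.TOTL"
    params = {"format": "json", "date": "2000:2014", "per_page": 200}

    metadata, observations = requests.get(url, params=params).json()
    for obs in observations:
        print(obs["country"]["value"], obs["date"], obs["value"])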

Freedom of Information

The old faithful. If you can’t find the data anywhere, attempt to access it yourself. Utilise the Freedom of Information Act, which gives you the right to access recorded information held by public sector organisations. Ensure the information’s not already out there, and then send your request off to the relevant institution.

Scrape it yourself

When all else fails, you can always find a site that serves the data you want and scrape it. If you don’t know how to use Python, JavaScript or other programming languages, here’s a short guide we’ve done to help you scrape data without code.
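
If you do fancy the code route, a basic scraper can be surprisingly short. Here is a minimal sketch in Python using requests and BeautifulSoup; the URL is a placeholder, and the table selector will need adjusting for whichever page you’re targeting.

    import csv
    import requests
    from bs4 import BeautifulSoup

    # Placeholder URL standing in for whatever page holds the table you want.
    URL = "https://example.com/statistics"

    html = requests.get(URL).text
    soup = BeautifulSoup(html, "html.parser")

    # Pull every table row on the page into a list of lists.
    rows = []
    for tr in soup.select("table tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)

    # Save as CSV, ready for Excel, Google Sheets or pandas.
    with open("scraped_table.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)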

We’re always after new tips for places to find data – can you think of anything else? Tweet us @Interhacktives with your ideas.

6 sites that show why data is beta

New to data journalism and keen to learn, but unsure about the kind of stories you could uncover with numbers? Worry not: the Interhacktives have collected examples of experts in action so you don’t have to.

Here’s a roundup in no particular order of the best news sites that use data journalism and data visualisation in the UK.

 

Guardian Data Blog

Data journalism is by no means a new trend. The Guardian is cited as the first major publication to bring data journalism into the digital era, with Simon Rogers launching the Datablog in 2009.

The blog covers everything from topics currently on the news agenda to general-interest pieces.

This week saw a report on the record levels of opium harvested in Afghanistan and a visualisation of the lives and reigns of the Targaryen kings from Game of Thrones.

The Guardian’s Datablog is good for beginners as there tends to be a link to the source of their data on each article, enabling you to access the data and to use it for your own stories.

Ampp3d graph: “Predicted world chocolate deficit” – we’re eating more chocolate than there is in the world

Ampp3d

This arm of The Mirror is what its creator Martin Belam calls “socially shareable data journalism”, the successor to his Buzzfeed-esque site UsVsTh3m. Launched last Christmas after only eight weeks of building, Ampp3d offers a tabloid perspective on data journalism.

Stories this week included what makes the perfect Downton Abbey episode and the British city where people are most likely to have affairs.

Perhaps most importantly, it’s a site specifically designed for viewing and sharing on a mobile device. As Belam writes on his blog, more than 80 per cent of traffic at peak commuting times comes from mobile – attention the project aims to capitalise on.

Screenshot: i100’s “The list”

i100

i100 is The Independent’s venture into shareable data journalism. It takes stories from The Independent and transforms them into visual, interactive and often data-driven pieces. It also incorporates an upvote system to put the reader in charge of the site’s top stories.

The articles are easily shareable since social media integration is a core part of the reader’s experience.

To upvote an article, you have to log in with one of your social networks (currently Facebook, Twitter, Google+, LinkedIn, Instagram or Yahoo).

Bureau of Investigative Journalism homepage

Bureau of Investigative Journalism

Championing journalism of a philanthropic kind, the data journalism of the Bureau of Investigative Journalism differs from most of the other publications on this list.

Based at City University London, its focus is not on the visual presentation of data but on producing “indepth journalism” and investigations that aim to “educate the public about the abuses of power and the undermining of democratic processes as a result of failures by those in power”. As a result, there is little visualisation and mostly straight reporting.

For data journalists, though, its ‘Get the Data’ pieces are indispensable resources, as they let you download the relevant Google spreadsheets, which you can then turn into your own data visualisations.

FT Datawatch: the world's stateless people screenshot

The FT

The Financial Times’ data blog is one of the leading international sources of data journalism and one of the UK’s leading innovators in data visualisation. It creates interactive, data-driven journalism on issues and stories around the world, covering everything from an interactive map of Isis’ advances in Iraq to UK armed forces’ deaths since World War II.

It describes itself as a “collaborative effort” from journalists from inside the FT, occasionally accepting guest blogs.

Bloomberg screenshot of homepage

Bloomberg

Bloomberg has perhaps the most impressive-looking data visualisations of all the news sources mentioned here. The emphasis on aesthetics is immediately apparent: on the homepage, a zoomed-in version of each visualisation draws the reader in, in place of a traditional headline-and-photo set-up.

Interactivity is the most defining feature of Bloomberg’s data journalism. Many of its pieces rely on the reader to actively click on parts of the visualisation in order to reveal specific data. For example, its World Cup Predictions and Results article requires the reader to select a game in order to see statistics and information about it.

What’s on Reddit’s front page?

Reddit is an online super-community with hundreds of millions of users, and has become in recent years an arbiter of what’s cool and what’s not on the web. If something makes it to the front page of reddit, where it is most visible, it will inevitably receive millions of views.

Reddit stats

The way the site works is users post content – pictures, article links, conversation starters etc – and the success of that content is determined by whether the reddit community likes it (upvotes) or talks about it (comments) or just clicks on it.

Submissions are made to the relevant subreddit – a subject specific community – and should they prove popular, can rise to the front page. This is the reddit mainstream. And I scraped it.

Digg vs Reddit via Quantcast

Three times a day, for two weeks, in March and April of this year, I scraped the data from the front page of r/all to see what is popular on reddit, and what that means.
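
The mechanics of a scrape like this are simple, because reddit serves a JSON version of every listing page. The sketch below, in Python, is an illustration rather than the script used for this piece; the User-Agent string is just an example.

    import requests

    # Reddit serves a JSON version of any listing page; r/all is the
    # site-wide front page. A descriptive User-Agent avoids being throttled.
    url = "https://www.reddit.com/r/all.json"
    headers = {"User-Agent": "front-page-research-script"}

    listing = requests.get(url, headers=headers, params={"limit": 25}).json()

    for post in listing["data"]["children"]:
        data = post["data"]
        print(data["subreddit"], data["score"], data["num_comments"], data["title"][:60])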

Reddit is growing. It’s the 58th most visited site on the net (up six places from last quarter), and the 21st most popular in the US. Since it defeated Digg at the turn of the decade, reddit has established itself as really the only aggregator in town – and with that comes power.

If reddit helps shape the internet conversation, what does the data say about reddit?

This is the top 25 subreddits over that fortnight of scrapeage – the front page of subs, if you will.

Perhaps predictably, r/funny is at the top. It appeared the most on the front page, received the most upvotes, and the second most comments because, naturally, it has the most subscribers (over 6 million).

Other predictably popular subs include memes (#2), cute animal pics (#4) and video games (#5).

Interestingly, a few of the more stereotypically reddit subs barely made the front page, or didn’t make it at all. The site is known for its militant atheism, and yet that subreddit only made it to #25, while the site’s marijuana sub could only reach #26 – no place among the best of the best.

Only two of the top-25 are substantially NSFW (Not Safe For Work). The sub r/WTF – wherein people post strange and disturbing things – is about a third NSFW whereas r/gonewild, the site’s most popular porn sub, is exclusively not for the workplace (unless you work from home).

The rankings largely stay the same when using comments instead of upvotes as the key parameter, except there is a notable rise of interaction-led subs like r/askreddit and r/IAmA. Askreddit, in particular, skyrockets to the top of the front page despite only appearing 9 times over the two weeks to r/funny’s 226.

As for the average scores and comments for front-page posts, r/pics and r/askreddit are respectively the top dogs. While r/funny rules in front-page appearances and accumulated points, it doesn’t even reach the top 10 in either average category. That suggests that reddit’s biggest sub is more quantity than quality.
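
If you want to reproduce this kind of ranking, the aggregation step is straightforward once the scraped posts are in a table. Here is a sketch in pandas, assuming a hypothetical CSV with subreddit, score and comment columns (not the actual file behind these charts):

    import pandas as pd

    # Hypothetical CSV built up over the scraping period: one row per
    # front-page post, with columns subreddit, score and num_comments.
    posts = pd.read_csv("front_page_posts.csv")

    stats = posts.groupby("subreddit").agg(
        appearances=("subreddit", "size"),
        total_score=("score", "sum"),
        mean_score=("score", "mean"),
        mean_comments=("num_comments", "mean"),
    )

    # Rank subs by how often they appeared, then by average score per post.
    print(stats.sort_values("appearances", ascending=False).head(25))
    print(stats.sort_values("mean_score", ascending=False).head(10))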

There is an obvious outlier amongst these broad and mainstream subs and that is r/leagueoflegends.

It’s a community dedicated to an exceedingly popular PC game. With almost 500,000 subscribers, it is the 41st largest subreddit, but its community activity exceeds even that.

Stats for r/leagueoflegends

One of the moderators of r/leagueoflegends, arya, said: “This subreddit is the largest unofficial community for LoL. We get between 500-1,000 new subscribers per day, I’d estimate. Big events do show an influx of new users and higher activity. I remember during Worlds, when the stream shut down due to technical errors, the thread about it reached the top of r/all within minutes.”

KingKrapp, another mod, said: “From what we’ve experienced, a lot of our users only come here and don’t really interact with the rest of reddit. We’re a very specific community compared to other big subs.”

It’s the success of niche-y subs like r/leagueoflegends that prompted reddit to introduce trending subreddits at the top of the front page in April.

Umbrae, mod for trendingsubreddits, said: “The thinking behind trending was essentially that there’s a lot of diversity to reddit, but that many of the visitors to the homepage don’t see or understand that. This gives a good hint to the breadth of reddit, while at the same time giving deeply engaged folks a new source of interesting communities.”

The initiative has so far been a success, with Umbrae reporting: “A lot of smaller subs have definitely gotten exposure.”

 

Only 20% of the top subreddits are not, and have never been, defaults for new subscribers. Default subreddits have more subscribers (naturally) and more interaction, but they consequently have less community.

At the beginning of May, r/mildlyinteresting became a default sub. Its popularity, according to mod RedSquaree, is because “all the content is original, and chances are that nobody has seen anything posted here before. It also doesn’t aim to be amazing content, so expectations are low and people are happy.”

Stats from r/mildlyinteresting

Of its new status, RedSquaree said: “Our growth was very steady until the recent increase as a result of being a default. [It has led to] more removals and a deteriorating comments section.”

It seems that a sizeable sub comes at the expense of a close community. Karmanaut, mod of r/IAmA, said: “Unfortunately, there isn’t a very strong r/IAma community. I think one of the main reasons behind this is that there is no core of submitters, because there are very few people with multiple submissions. Unlike most other subreddits, all of r/IAmA is original content and has to be done by the original person. And each person has a limited involvement. In its infancy, there was a smaller group of individuals who were very involved in the subreddit but since growing to its larger size, those individuals are no longer necessary to recruit AMA subjects.”

So those are the communities, but what do the actual posts say?

Wordcloud

These are the most frequently used words in that two-week period. You can see where the interests of the site lie – there’s an inordinate number of mentions of Oculus, the VR company Facebook bought, compared to the MH370 drama.
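
A word cloud like this comes from a simple frequency count. Here is a minimal sketch in Python, again assuming a hypothetical CSV of scraped post titles and a rough stop-word list:

    import re
    from collections import Counter

    import pandas as pd

    # The same hypothetical CSV of scraped posts, with a "title" column.
    titles = pd.read_csv("front_page_posts.csv")["title"].dropna()

    stopwords = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "this", "my", "on"}
    words = Counter()
    for title in titles:
        words.update(w for w in re.findall(r"[a-z']+", title.lower()) if w not in stopwords)

    # The most common words are the raw material for a word cloud.
    print(words.most_common(30))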

Here’s the most popular post of that entire period. It may have only ended up at 4,003 karma but this post received more than 56,000 upvotes.

Screenshot: the most popular front-page post of the period

Conclusion

Perhaps it is what it always was, or what it was always going to be, but reddit is largely a chill place. People go to the front page for a joke, a pretty picture, to learn a weird fact, or to take part in an amusing straw poll. It’s a nice place to hang out; it isn’t challenging. Its major contribution to the internet conversation is jokes, memes and silly things that will crop up on Buzzfeed a few hours later.

With trendingsubreddits, the site is attempting to change that in a way. Not so much the pleasant interactions, but the homogenized output. Perhaps by promoting the nichier subs, the front page will change.

Because, just as Katy Perry is not an accurate reflection of modern music, neither is r/funny representative of reddit and its many weird and wonderful subs.

Interview with Abraham Thomas, co-founder and head of data at Quandl

Abraham Thomas

What is Quandl and why is it so useful for data journalists?

Quandl, at its core, is a search engine for numerical time-series data. The data we have is heavily influenced by what our users want, and as such we tend to have datasets on important or trending topics. For example, right now we have just created a number of datasets encompassing all the inequality data included in Thomas Piketty’s new book “Capital in the 21st Century”. We also have a huge number of datasets on standard reference topics: economics, financial markets, society, demography and so on. All these datasets are easily accessible in applications for analysis or for export to graphs. Best of all, it’s all free.

Is it easy to use?

Quandl’s mantra is to make data easy to find and easy to use. We try to do this in a number of ways.
The first step is helping users find the data they need.  Having millions of datasets is no good unless you can find what you’re searching for. Most current search engines don’t do a very good job at pure numerical data searches. So we built our own custom search algorithm that is optimized for numerical data.  You can filter by data source, filter by data frequency, perform advanced search using Boolean queries and so on.  Of course there’s still a long way to go; and we’re constantly improving our backend algorithm to give you the data you were looking for. 
Another mechanism we use to help users find the data they need is by “browsing” our data collections.  Collections are hand-selected, curated groups of high-quality datasets on specific topics.  So instead of searching for specific datasets, users can explore in a more free-form manner via this method.
The next step is actually working with the data you’ve found. We offer various options for downloading and graphing the data through the website. Perhaps, though, our real strength is our API; lots of users have written their own apps and programs that use Quandl data delivered via this API. We’ve also written (with generous contributions from our users) a number of libraries that help you get Quandl data directly into the analysis tool of your choice – R, Python, Excel, Matlab, you name it – without visiting the website or invoking the API.
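
For Python users, the library route looks roughly like the sketch below; the dataset code is illustrative, and the package name and call reflect Quandl’s Python client at the time of writing.

    import quandl  # Quandl's Python client: pip install quandl

    # Optional: an API key raises the free-tier rate limits.
    # quandl.ApiConfig.api_key = "YOUR_API_KEY"

    # Fetch a time series (illustrative dataset code) straight into a pandas DataFrame.
    gdp = quandl.get("FRED/GDP", start_date="2000-01-01")
    print(gdp.tail())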

Does the site provide data in a form that is easy to manipulate?

The important thing about making data easy to manipulate is understanding that different users have different needs, and we need to be able to facilitate that.  That’s why we offer all our data in multiple formats (JSON, CSV, XML), irrespective of what format the data was originally published in.   That’s also why we’ve built our API and all the tools and libraries that interface with it.  We want to make the process of taking our data and getting it into whatever tool you choose to use as frictionless as possible. 

How did you first come up with the idea? Did you spot a gap in the industry that needed to be filled?

The idea came from our founder Tammer Kamel.  Tammer was having a difficult time finding the data he required for his personal consulting business, without paying thousands of dollars to firms like Bloomberg or Reuters.  And it turns out that there are many people in similar situations.  As it currently stands (without Quandl) if you are not working for a large company with a large data budget, it is surprisingly difficult to get even simple public statistics, like the GDP of China over time, into your workflow. 

Last year you were described by journalism.co.uk as being the “YouTube of data” – do you think this is a fitting description?

It very much describes our aspiration. We would like to get to a point where some users are contributing massive amounts of data that other users are consuming. We’re currently building the tools to enable this in a frictionless, functional manner (see the answer below on letting users upload their own data).

How do you source the data you host, and how do you ensure that it is always up to date?

We source data from all over the internet and sometimes physical media as well. We have multiple scheduling and freshness checks in place to make sure everything is updating properly. 

Last year you mentioned that you are hoping to allow users to upload their own data – what are the latest developments here? What is the thinking behind this? And does this not make it difficult to ensure that all data is accurate?

Right now we are still in the testing phase of this project internally.  We’ve also slowly started inviting a few alpha-testers to try it.  We feel we have created a fairly frictionless experience getting data from Quandl, and we want to provide that same frictionless experience putting data on Quandl as well.  
There are two reasons for moving in this direction.  First, as a team there is only so much data we can add ourselves.  Secondly we cannot pretend to be experts at everything.  Here at Quandl we have a very talented group of people with varying skills and domains of expertise.  However the wealth of data out there — and knowledge of it — is so vast we could never dream of understanding it all.  Luckily our users as a whole do have this knowledge.  Right now, every dataset that is being added to Quandl has been specifically asked for by a user, and it has been this way now for months.  We are very confident that with the right tools, our users will be able to create high quality, usable datasets.  These datasets will be associated with their creator, and other users can choose to trust or distrust these creators just like they’ve chosen with Quandl as a whole.

Is there anything else similar in the field at the moment?

Yes and each has its strengths depending on what a journalist might need.  Zanran.com has crawled a huge number of PDF documents on the internet for tabular data; they have some really esoteric stuff.  Datamarket.com has great visualization tools.  Datahub.io also looks interesting to us as an open-data platform.  Exversion.com offers access control and version control for datasets which are both interesting features.  WolframAlpha.com doesn’t offer much raw data but their natural-language query system is very impressive.  So there’s lots of activity in this field right now.

Interview with data visualiser Ri Liu

Ri Liu, data visualiser at Pitch Interactive. Photo credit: Ri Liu

Good design is key when trying to tell stories in an interactive or visual way.

I spoke with Ri Liu from Pitch Interactive, an interactive and data visualization studio based in California. The studio is best known for its interactive detailing the victims of every known drone attack in Pakistan.

In her spare time, Ri recently created We Can Do Better, which is a visualisation of gender disparity in engineering teams in the tech industry. I was interested in how a reasonably simple data set could be made much more engaging through the visualisation.

Ri’s We Can Do Better visualisation. Click the image for the full interactive version.

What was the inspiration for We Can Do Better?

It’s an ongoing issue in the tech industry and as a female in the industry I just asked myself ‘what can I do?’. It’s frustrating when you see this inequality and imbalance.

This data has actually been around for a little while now but in the form of a spreadsheet. It’s great and a lot of people have added to it, but it’s quite technical and has to be updated by submitting a pull request on GitHub.

So I thought, since I have the design and coding background and I’m in tech, maybe I could bring it to a wider audience.

I want to let people touch this information and engage with it, instead of seeing rows and rows on a spreadsheet.

It’s definitely a lot easier on the eyes.

Yeah. I’m glad it’s been shared a lot, and maybe different people and journalists can now engage with this data more easily than before.

The data in its much less engaging spreadsheet format. Click the image to see the full spreadsheet.

Which tools do you use and how long did you spend on it?

I spent a few weekends on it and the visualisation itself is built using D3.

This project is actually on GitHub, I’ve put a creative commons license on it so anyone can look at the code.

Was it worth putting the time into?

Definitely. Personally, I just wanted to see this data visualised. I’d seen these numbers but I wasn’t really connecting with them in a meaningful way.

I didn’t expect for it to be tweeted around as much, but that’s been really awesome.

How easy would you say it is for someone to learn to use D3?

It’s definitely not the easiest tool to get started with, but once you do get a grasp of it it’s incredibly powerful. When you want to do something you’re not limited by the code at all, so you’re able to say ‘I want to explore the data this way’ and have the tools to do that.

I hardly ever geek-out over technology, but this is the one exception where I rave about it. Compare it to the other end of the spectrum, like the rudimentary graphs in Excel. They just leave you feeling trapped.

Have you noticed increasing interest in interactivity and visualisation from journalists?

We work a lot with publications and I think they’re realising that we need to present these figures visually and in a more compelling way for them to reach people.

That’s definitely been a shift, and I think we’ll see more places engaging with data viz companies and studios, as well as more doing it in-house.

I’m also interested in how interactivity is being used to tell non-data stories, the most obvious example being Snowfall.

I’m a very avid web user but the problem is that I don’t read a lot of longform content because I just have so much to read that I don’t absorb a lot of it. A lot of sites are just competing for that attention and working out how to make this digestible for people.

I think it’s great to have more visual imagery and better design and it’s great that a piece like Snowfall got such wide attention. It’s like ‘oh, let’s actually pay attention to the design of these articles instead of just dumping text in front of people’.

I’d like to see what the reader stats were for it.

People spent roughly 12 minutes looking through it.

That’s really good.

Because there’s a lot more time gone in to presenting the content like that, I’d also be interested in what that means for the timeliness of certain articles. That was a good piece because it wasn’t about something current, it was just a story.

But it’s a great way of presenting stories which isn’t just dumping traditional print content onto a screen.

Are the tools getting better for making interactive things more quickly? Could we see more timely articles being made interactive?

I wonder whether it’s even possible to produce a piece like that without putting the effort in and finding the best visuals and other content.

Obviously there are technical aspects like the parallax and scrolling effects they put in, which could just be bundled into tools. But I think that the real beauty of it is in the thoughtfulness, and I’m not sure you could match it without effort and time.

Should we expect more personal projects from you?

I’m always playing around with new technologies. I’ve been meaning to do something with semantic analysis and playing around with words to see biases and other insights.

I’m interested in making people aware of what they’re subconsciously doing and the assumptions they’re making. We’ve got a lot of traces of that on the internet these days, on Twitter, blogs and all these social networks, so it would be cool to do something with it.

That’s just in the back of my mind though. I’m playing around with it but nothing concrete so far.

Carl Bialik interview: ‘Any data set has eureka potential’

Carl Bialik is a writer for Nate Silver‘s new website FiveThirtyEight, having recently moved from the Wall Street Journal where he started The Numbers Guy column. I ask him about the ups, downs and difficulties of being a data journalist, as well as what he thinks are the most important traits for being successful in the field.

You recently moved to FiveThirtyEight from the WSJ: do you think the two publications differ in their approach to data analysis?

With The Numbers Guy at the WSJ, my role was more about looking at other people’s data analyses, taking them apart and finding the weaknesses in them. I’m going to be doing some of that at FiveThirtyEight but will be more focussed on doing original data analysis.

When you first started at WSJ, were you a data journalist? Or was this more of an organic development?

When I started at the WSJ I don’t think I had even heard the term “data journalism”, and I wasn’t a data journalist for most of my first years there. The more specialised role came later when I started writing The Numbers Guy column. Then, when the WSJ expanded its sport coverage, I started to write much more about sports from a data point of view.

Which is your favourite sport to write about?

My favourite sport to follow is tennis, which is in some ways both my favourite and least favourite sport to write about. It’s my favourite because it’s largely untapped territory in terms of data analysis, but it’s also one of my least favourites because of the way that the data has been archived, making it one of the most difficult to get accurate data for. It’s a pretty fertile area, though, and although it’s not big in the USA, there’s always going to be a focus around major events.

What steps do you take to make sure that the data you are analysing is accurate?

There are some built-in error checks with analysis, which can help determine the reliability of the data. These include checking whether the data you are running the analysis on makes sense, and looking at whether different analyses produce similar results. Another important question to ask yourself is whether there is some important factor that you are not controlling for.

At FiveThirtyEight we also have a quantitative editor who reviews your work and points things out for you, such as confounding variables and sources of error. Readers are really vital for this, too: the feedback we have already received from readers who tell us when they think we have made mistakes has been extremely useful.
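
Checks like the ones Bialik describes are easy to automate before an analysis starts. Here is a minimal sketch in pandas, with an entirely hypothetical dataset and column names:

    import pandas as pd

    # Entirely hypothetical dataset of tennis match statistics.
    matches = pd.read_csv("matches.csv")

    # Does the data make sense? Missing values, impossible ranges, duplicates.
    print(matches.isna().sum())
    assert (matches["aces"] >= 0).all(), "negative ace counts"
    assert matches["first_serve_pct"].between(0, 100).all(), "serve percentage out of range"
    print(matches.duplicated(subset="match_id").sum(), "duplicate rows")

    # Do different cuts of the data point in the same direction?
    print(matches.groupby("surface")["aces"].mean())
    print(matches.groupby("year")["aces"].mean())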

What do you think are the most important traits for being a good data journalist?

The first is having a good statistical foundation, which includes being comfortable with coding and using various types of software. The others are the same as for all types of journalist: being collaborative, fair, open-minded, ethical, and responsive to both readers and sources.

Which data journalists do you particularly admire?

I’ve admired the work of many data journalists, including my current colleagues, and my former colleagues at the Wall Street Journal. Certainly Nate Silver at FiveThirtyEight: he is a large part of the reason that I wanted to work with FiveThirtyEight in the first place. Also my colleague Mona Chalabi because she has a great eye for finding stories with interesting data behind them.

What’s the best part of being a data journalist?

Compared to most journalism, I think there is more potential to have an “aha” [eureka] moment for any given story, since it can sometimes be a slog if you’re trying to get that just from interviews or other sources. Any data set has the potential to give you a couple of these moments if you’re spending just a few hours looking at it.

And the most difficult part?

I think number one is when you can’t get hold of the data for something: occasionally a topic can be very hard to measure, and you would love to write about it but just don’t have a way in. This is often the case with sport in particular, where there can be measurement problems, issues with the quality of the data, or even a complete scarcity of it. So issues with data quality and access are the most difficult parts.