The Internet is abuzz with talk of data these days, and everyone wants in. However, the complexities of data and how to work with it are regularly misunderstood.
When harnessed correctly, data can be a very powerful resource. On the other hand, when it’s not used correctly, it can become a liability.
It is very easy to make mistakes when using data. In this article, we explore five common mistakes people make with data, and how to avoid them.
Not allocating your time correctly
Producing a well-organized, comprehensive data article takes a lot of time, and it’s important to note that a lot of work has to be done before the “fun” visualization part can happen. Collecting, formatting, transforming, cleaning and filtering data often takes up the biggest portion of an overall project.
According to this Forbes article, “data preparation accounts for up to 80 percent of the work of data scientists”. In 2013 Josh Wills, director of data engineering at Slack, told Technology Review: “I’m a data janitor. That’s the sexiest job of the 21st century. It’s very flattering, but it’s also a little baffling.”
Allow yourself a reasonable amount of time to work through your data, especially when you don’t know what type of data you will be dealing with. The preparation process may not be the most fun, but it is essential to ensuring that the data you analyze is clean and accurate.
Trying to use all of a dataset
Today, data is generally stored in a raw format. Because organizations are creating and collecting so much information, data gets piled on top of more data, and you end up with a “data lake”: a vast repository of raw data that is not formatted or cleaned.
When you obtain a dataset, it will likely have been pulled from a data lake and contain a lot of information that is not usable. The industry calls this “data noise”, which is essentially all of the elements of a dataset that are meaningless.
Before doing any kind of analysis, clean and format your data to verify that the information you want is there, and that you aren’t missing any important metrics or dimensions.
To filter out the noise, first select the data you want, then label it, and finally filter it. This will leave you with data that is easier to query and visualize.
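The select-label-filter steps above can be sketched in a few lines of Python. The dataset and field names here are hypothetical, purely for illustration; real raw records will have their own shape.

```python
# Hypothetical raw records pulled from a "data lake": more fields than we
# need, plus some noise (an empty city and an unparseable score).
raw_records = [
    {"city": "Helsinki", "happiness": "7.8", "internal_id": "x91", "debug": ""},
    {"city": "Oslo", "happiness": "7.5", "internal_id": "x92", "debug": ""},
    {"city": "", "happiness": "n/a", "internal_id": "x93", "debug": "test row"},
]

# Step 1: select only the fields we actually want.
selected = [{"city": r["city"], "happiness": r["happiness"]} for r in raw_records]

# Step 2: label the values with usable types (scores become floats).
def label(record):
    try:
        return {"city": record["city"], "happiness": float(record["happiness"])}
    except ValueError:
        return None  # unparseable score: mark as noise

labeled = [label(r) for r in selected]

# Step 3: filter out the noise: unparseable scores and empty city names.
clean = [r for r in labeled if r is not None and r["city"]]

print(clean)
```

The exact rules (what counts as a wanted field, which types to convert to, what to drop) will depend on your dataset; the point is that each step is explicit and repeatable.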
Trusting your data
Because data is complex to capture and manage, its accuracy cannot be guaranteed, and manual data entry adds the likelihood of human error.
Don’t assume that because something is “data”, it is flawless. Always eye your data with a level of scrutiny, because the smallest incorrect detail could throw off your whole analysis.
You also should question your data at a high level. For example, if you are writing a data-based article about the happiest cities in the world, what metrics were used to determine happiness? Could more have been measured? How was it collected? Is some of the data perhaps not useful in determining happiness?
Data isn’t perfect, and we should treat it as such. The more you approach your data with caution, the more you will be able to draw valuable conclusions from it.
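A few automated sanity checks can catch the kind of small errors described above before they skew an analysis. This is a minimal sketch using a hypothetical happiness dataset; the checks you need (valid ranges, duplicates, missing values) depend on your own data.

```python
# Hypothetical rows; the third contains a likely decimal-point typo.
rows = [
    {"city": "Helsinki", "happiness": 7.8},
    {"city": "Oslo", "happiness": 7.5},
    {"city": "Oslo", "happiness": 75.0},
]

problems = []
seen = set()
for row in rows:
    # Assumed scale: happiness scores should fall between 0 and 10.
    if not 0 <= row["happiness"] <= 10:
        problems.append(f"out-of-range score for {row['city']}")
    # Duplicate entries for the same city would skew any averages.
    if row["city"] in seen:
        problems.append(f"duplicate city: {row['city']}")
    seen.add(row["city"])

print(problems)
```

Here both checks flag the bad row; in practice you would investigate each flagged record rather than silently drop it.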
Being biased in your visualization
As humans, we are prone to bias, so it’s important to think carefully before interpreting data. How you choose to visualize a dataset matters, because certain colors or even number ranges can easily introduce bias in readers.
When visualizing, think about who your audience is and consider how to display the data in a way that they can understand, but that also tells the truth.
Keep visualizations clean, and make sure the numbers you are displaying add up correctly. Be discerning about the type of chart you use to display the data, as it can make a difference in how readers understand it.
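The “make sure the numbers add up” advice can itself be automated. Here is a minimal sketch that verifies pie-chart segments sum to 100 percent before publication; the segment names and values are made up for illustration.

```python
import math

# Hypothetical poll results destined for a pie chart.
segments = {"agree": 46.5, "disagree": 38.2, "undecided": 15.3}
total = sum(segments.values())

# Allow for tiny floating-point/rounding error, but flag anything clearly off.
assert math.isclose(total, 100.0, abs_tol=0.1), f"segments sum to {total}, not 100"
print(f"OK: segments sum to {total}")
```

Running a check like this on every chart is cheap, and it catches the embarrassing case where rounded percentages quietly sum to 99 or 101.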
Using outdated data
Most data is “historical”, or recorded in the past. Unless you are analyzing live data in real time (streaming data), everything you work with is historical.
The problem with historical data arises when you use it to report on the present or future. For example, if a poll about a political campaign is taken in the morning and one of the candidates does terribly in a live debate that evening, the data recorded in the morning may no longer be significant.
Consider how time-relevant your data is. If you are writing about a multi-year study, historical data is exactly what you need. However, if you are reporting on the latest information from an election, your topic and time range could greatly affect the accuracy of your data.
In any dataset you analyze, look at when the data was published, when it was last updated and, if possible, whether you can access a live database.
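A freshness check like this can be scripted. The sketch below assumes a hypothetical `last_updated` field in the dataset's metadata and a fixed "today" so the example is reproducible; a real check would use the current date and whatever metadata your source provides.

```python
from datetime import date

# Hypothetical dataset metadata (field names are illustrative).
dataset_meta = {"published": date(2017, 3, 1), "last_updated": date(2017, 6, 15)}
today = date(2017, 9, 1)  # fixed "today" to keep the example reproducible

age_in_days = (today - dataset_meta["last_updated"]).days
if age_in_days > 30:  # assumed staleness threshold; tune per topic
    print(f"Warning: data is {age_in_days} days old; look for a fresher source.")
```

For a slow-moving multi-year study a threshold of months may be fine; for election coverage, even a day-old snapshot may be stale.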
Before you do the work of writing and publishing, measure the data you have against current hashtags and Google Trends to see whether the story your data supports has changed.
Data is a big beast to tackle, but as long as you are meticulous and scrutinize every step of the data journey, you will likely produce some compelling data-driven articles. Just remember to allow plenty of time upfront for a project, always question your data, know your audience, and check what you have against current trends.