Lies, damn lies, and statistics. At least that’s how the saying goes and how the wider public feel, for some reason people tend to distrust something with numbers backing it more than if it doesn’t. Well that’s just completely wrong, but the problem is that numbers can tell two different stories from the same data.
Statistics might drive people insane, scare them, or not seen relevant.
But in this post I’ll try and explain how to use Excel for some basic statistical analysis and what it can tell us.
Disclaimer: There will be outcomes I don’t explain as they are more advanced, but they may be covered in a later. I will also use very simple data sets for ease of explanation.
Why might a data journalist want to use statistical tools?
Journalists have a lets be honest earned reputation for being scared of numbers and frankly being awful at them. But data journalists and those interested in it are a different breed.
Why we should be interested in statistics is what it tells us about our data, it is a tool to spot patterns, check reliability and ask if all as it seems to be. For a basic story this is probably going a bit far, but when handling complex data sets especially if they are financial it tells us a lot.
If you perform a regression analysis for example and the results seem odd there is a lead to explore, which you would never have found unless by chance or really understanding the subject area. In essence it is a tool that allows you to tell more about stories and find exciting new leads.
So lets get started.
Firstly you need to make sure you have the right tools. For Mac this is:
- Download StatPlus:mac LE for free from AnalystSoft, and then use StatPlus:mac LE with Excel 2011.
- You can use StatPlus:mac LE to perform many of the functions that were previously available in the Analysis ToolPak, such as regressions, histograms, analysis of variance (ANOVA), and t-tests.
- Visit the AnalystSoft website, and then follow the instructions on the download page.
- After you have downloaded and installed StatPlus:mac LE, open the workbook that contains the dat that you want to analyze.
- Open StatPlus:mac LE. The functions are located on the StatPlus:mac LE menus.
To start with, here is a list of the majority of the statistical functions within Excel. We won’t be covering anywhere near all of these but an explanation is provided by each.
Learn about your data
One nice thing about the Data Analysis tool is that it can do several things at once. If you want a quick overview of your data, it will give you a list of descriptives that explain your data. That information can be helpful for other types of analyses.
We shall use the data below. It shall also be used for other topics in the post.
If we wanted to get a quick overview of the variables, we can use the descriptive statistics tool. Go to the basic statistics tab in StatPlus and click on descriptive statistics. Then highlight the column containing the data, if you have checked column 1 as labels make sure to includes it. This looks like a lot. But some of these variables can be helpful. This is useful for journalists because it helps us test the validity of our data and mean we don’t go too far in trying to find a story before realising it isn’t worth our while.
To look at Variable #1 (Quantity sold), if you do a regression, you want the Mean (average) and Median (middle value) to be relatively close together. If your results are good you should be seeing standard deviation to be less than the mean. So in the above table, our Mean and Median are close together. The standard deviation is about 1296 – which means that about 70 percent of the quantity sold was approximately between 5900 and 7100. Not too scary so far right?
Another good overview of your data is what is called a correlation Matrix, which gives you an overview of what variables tend to go up and down together and in what way they are moving. For example say you were looking at data which showed how something changed over time, you could use the matrix to see if the progressions are what you’d expect. This might find you a great story.
It is useful as a first look at what your data is telling you before potentially delving into regression, but to work out whether the data for a story is reliable this would be useful. The correlation is measured by a variable called Pearson’s R, which ranges between -1 (indirect relationship) and 1 (perfect relationship).
Go to the data table and the data analysis tool and choose correlations. Choose the range of all the columns (less headers) that you want to compare. Then you get a table that matches each variable to all other variables. Below you see that the correlation between Column 2 and Column 3 is 0.02156. It would be the same between Column 2 and Column 3.
Correlation provides a general indicator of the what is called the linear relationship between two variables, but it crucially you cannot make progressions. To do that, you need to do what is called linear regression – this will be covered later.
What we can use it for however is checking the outcomes are logical and within a margin of error, if not ask why? If the data set your working on suddenly changes ask questions, see if there isn’t a story. It is a tool to allow you to go beyond the obvious and find interesting stories within your data.
Some characteristics help predict others. For example, people growing up in a lower-income family are more likely to score lower on standardized tests than those from higher-income families.
Regression helps us see that connection and even say about how much characteristics affect another.
Trend analysis is a mathematical technique that uses historical results to predict future outcome. This is achieved by tracking variances in cost and schedule performance.
For trend analysis there are three ways it can be done: the equation, forecast, or trend. I will go through these three methods using the simple set. One important term to understand here is R-squared, as it gives a indication of the reliability of your data. But what is R-squared?
R-squared is a statistical measure of how close the data are to the fitted regression line.
The definition of R-squared is here. Or:
R-squared = Explained variation / Total variation
R-squared is always between 0 and 100%:
0% indicates that the model explains none of the variability of the response data around its mean.
100% indicates that the model explains all the variability of the response data around its mean.
It is important to remember R-squared cannot determine whether the coefficient estimates and predictions are biased, which is why it is critical that you must assess the residual plots. R-squared also does not indicate whether a regression model is accurate beyond doubt, it can be low and right or high and wrong.
Why might this be useful?
Well if you are getting large outliers or your R-squared value is out for example it can be an indicator of an unreliable data set. For a jobs data story this could suggest that the government’s claims to a smooth system is not true.
Or if you were doing a story on incidents of piracy it could lead you to exploring avenues about reporting, hotspots, or identify key periods for further investigation. Paradoxically by going deeper into the numbers it can allow you to can further beyond them and ask the really tough questions.
Statistics keeno klaxon
Here are other types of standard trends, which may be touched on in a future article:
- Polynomial – Approximating a Polynomial function to a power
- Power – Approximating a power function
- Logarithmic – Approximating a Logarithmic line
- Exponential – Approximating an Exponential line
Regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modelling and analysing variables, where the focus is on the relationship between a dependent variable and one or more independent variables. Not obviously related to journalism? Think about a story about stress levels for a certain group and other factors such as wages, housing or frankly anything. If you can identify this you can then track those changes over time and tell a lot deeper data story, it gives weight to sometimes seemingly obvious answers.
So lets get started:
- In StatPlus click on the statistics tab.
- Select linear regression and click OK.
- Select the Y Range (A1:A8). This is the predictor variable (also called dependent variable).
- Select the X Range(B1:C8). These are the explanatory variables (also called independent variables).
- These columns must be adjacent to each other.
- Check Labels.
- Select an Output Range.
- Check Residuals.
- Click OK.
Excel produces the following Summary Output:
R Square tells you how much of the change in your dependent variable can be explained by your independent variable. R Square equals 0.962, which is a very good fit. 96% of the variation in Quantity Sold is explained by the independent variables Price and Advertising. The closer to 1, the better the regression line (read on) fits the data.
Significance F and P-values
To check if your results are reliable (statistically significant), look at Significance F (0.001). If this value is less than 0.05, you’re data looks good. If Significance F is greater than 0.05, it’s probably better to stop using this set of independent variables. Delete that rerun the regression until Significance F drops below 0.05. Of course this is not guarantee of success.
Most or all P-values should be below 0.05.
The regression line is: y = Quantity Sold = 8536.214 -835.722 * Price + 0.592 * Advertising. In other words, for each unit increase in price, Quantity Sold decreases with 835.722 units. For each unit increase in Advertising, Quantity Sold increases with 0.592 units. This is valuable information.
You can also use these coefficients to do a forecast. For example, if price equals £4 and Advertising equals £3000, you might be able to achieve a Quantity Sold of 8536.214 -835.722 * 4 + 0.592 * 3000 = 6970.
The residuals show you how far away the actual data points are from the predicted data points (using the equation). For example, the first data point equals 8500. Using the equation, the predicted data point equals 8536.214 -835.722 * 2 + 0.592 * 2800 = 8523.009, giving a residual of 8500 – 8523.009 = -23.009.
Why might this be useful?
See my explanation of regression analysis above! This is probably the most advanced stats covered in this post, but I would say is potentially the most useful as it can be applied to so many types of data and data sets you have created.
I hope this has been a good introductory overview to statistics in Excel, I’ll do another post soon and update this one when I get round to it but I hope this has proved useful.