In this post I’ll show you how to use the free version of OutWit Hub to scrape data that can be used to form the basis of a story. Please comment on this post below if you have any problems and I’ll do my best to help you out.
As data journalism becomes increasingly popular, it risks appearing as something of a dark art practised at desks away from the main news desk. However, although data journalism can be extremely specialised, it brings with it some really powerful new techniques that any journalist can harness to find or enrich stories.
One such new skill is data scraping. To somebody who’s never come into contact with scraping before it can sound scary and even slightly sinister. However, there are a number of tools out there that make it really easy to scrape your own data without having to know a single line of code.
Once you’ve finished reading, you’ll be able to do this:
Getting OutWit Hub
OutWit Hub is a specialised web browser which is free to download. The free version limits you to scraping 100 rows of data but if you are a casual user this will probably be more than sufficient to fulfil your scraping needs. Alternatively you can buy the pro version for about £60.
In this post you’ll only need the free or lite version.
What does OutWit Hub do? – The simple theory behind scraping
When you boil it down, OutWit Hub actually does quite a simple thing. It retrieves text from a web page between two specified markers.
For instance, let's assume we have a page whose source code looks like this: A B C
I want OutWit Hub to extract the B. To make it do this, I would set the 'before marker' to A and the 'after marker' to C. With these instructions in place, OutWit would extract the B.
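The idea can be sketched in a few lines of Python. This is a hypothetical illustration of the concept only, not how OutWit Hub is actually implemented:

```python
def between(text, before, after):
    """Return the first substring of `text` found between `before` and `after`."""
    start = text.index(before) + len(before)
    end = text.index(after, start)
    return text[start:end]

source = "A B C"
print(between(source, "A ", " C"))  # prints "B"
```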
However, it is important to make sure that your before and after markers are as specific as they can be, otherwise you could end up with OutWit extracting data that you don’t want.
For instance, in a situation where the source code of the page looks like this: A B C D A E C
We still want to extract the B from this code, but if we use the same markers as before then OutWit will also scrape the E, because the E also has an A before it and a C after it.
In order to solve this we would make the markers more specific. So instead of having after marker = C, we would change it to after marker = C D. In this way, the markers are specific only to the B and OutWit will only extract the data we want.
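The same point can be shown with a short Python sketch using a regular expression in place of OutWit's markers (again, just an illustration of the concept):

```python
import re

source = "A B C D A E C"

# Loose markers: "A " before and " C" after match in two places.
loose = re.findall(r"A (.*?) C", source)

# Tightening the after marker to " C D" pins the match down to the B only.
tight = re.findall(r"A (.*?) C D", source)

print(loose)  # prints ['B', 'E']
print(tight)  # prints ['B']
```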
How to scrape in OutWit Hub
So now we know the theory we can apply that to some actual scraping. You can scrape all sorts of things with OutWit Hub but lists are probably one of the easiest things to scrape. This is due to the fact that they are generally set out in a way that means the page structure is very regular, making it easy to write our before and after markers.
We'll be working from Waterstones' Top 100 Bestsellers list, which can be found here.
OutWit is a good tool to use here because the list spans multiple pages. It would take a long time to copy and paste the entries from each page; scraping the list lets you do the job much more quickly and cleanly.
You should now be seeing something that looks like this (the book will be different depending on when you access the list):
The page does give you the option to view up to 50 entries at a time but we’ll stick with 10 because it will give you a better idea of OutWit’s capabilities.
Open up OutWit Hub, then paste in the URL of the Waterstones page and go to that page.
Then select the ‘Scrapers’ tab from the left sidebar and create a new scraper called something like Waterstones Top 100.
When you’ve done that you should have a page that looks roughly like this:
I've entered in the Description column the names of the four pieces of data we want to record for each book: Rank (in the top 100), Name of the Book, Author and Price.
Pages of code can often be very long, but the 'Find in Page' box makes it easy for non-coders like me to navigate straight to the bits I want. Since we know the first book in the list is called Life After Life, we can search for it in the 'Find in Page' box and it will take us to the top of the list. From there we can work out what our markers should be.
The first instance of Life After Life takes us to this line of yellow code:
<img src="http://www.waterstones.com/wat/images/nbd/s/978055/277/9780552776639.jpg" alt="Life After Life"/>
This is not useful to us because OutWit will only extract text that is coloured black, i.e. the text that will actually appear on the web page.
The next instance of Life After Life, however, is in black and will take you to a screen that has code looking like this:
We can see immediately from this that three of the four information categories that we’re looking to extract are present: Rank, Name and Author.
Now we can set our markers to scrape data
In order to extract the rank of the book, we would set the 'marker before' as '<span class="position">' and the 'marker after' as '</span>'.
Just as we saw in the theory section above, this will isolate the bit of data that we want so it can be extracted by OutWit.
Similarly, we can extract the name of the book by setting the 'before marker' as 'href="http://www.waterstones.com/waterstonesweb/products/' and the 'after marker' as '</a> <h2>'. As you will see from the image below, there are a lot of spaces between these two tags (represented as triangles in the OutWit scraper section), and the best way of getting the exact number right is simply to copy and paste the marker in from the code.
The before marker, in this instance, does not take all of the code immediately preceding the text, because that would make the marker too specific. The Kate+Atkinson part onwards is only true of this first book, so the scraper would only work for this one entry. Making sure your markers are just specific enough is very important, and you can get them right through trial and error or, again, by using the 'Find in Page' box.
The markers for the author name are more straightforward. The 'marker before' is '<p class="byAuthor">by' and the 'marker after' is '</a>'.
By scrolling down the page, we can then find the bit of the code that tells us the price and set the 'marker before' as 'priceRed2">£' and the 'marker after' as '</span>'.
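Under the hood, the whole scraper amounts to "text between two strings" applied four times over. Here's a hedged Python sketch of the same idea run against a simplified, invented fragment in the style of the Waterstones page; the marker strings come from the steps above, but the fragment itself is hypothetical and the real page's structure may differ:

```python
import re

# Invented HTML fragment mimicking the structure described above.
html = '''
<span class="position">1</span>
<a href="http://www.waterstones.com/waterstonesweb/products/kate+atkinson/life+after+life/9780552776639/">Life After Life</a> <h2>
<p class="byAuthor">by <a href="#">Kate Atkinson</a>
<span class="priceRed2">£3.85</span>
'''

# (before, after) marker pairs, one per field, as set in OutWit Hub.
scraper = {
    "Rank":   (r'<span class="position">', r'</span>'),
    "Name":   (r'href="http://www\.waterstones\.com/waterstonesweb/products/.*?">', r'</a> <h2>'),
    "Author": (r'<p class="byAuthor">by <a href="#">', r'</a>'),
    "Price":  (r'<span class="priceRed2">£', r'</span>'),
}

# For each field, capture whatever sits between its two markers.
row = {field: re.search(before + r'(.*?)' + after, html).group(1)
       for field, (before, after) in scraper.items()}

print(row)
```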
You should now have a set of markers that look like this:
Make sure you save the scraper, then click the ‘Execute’ button.
This should then take you to your scraped data tab, looking something like this:
Well done, you’ve written your first scraper!
Applying the scraper to multiple pages to scrape the full list
Scraping one page is all well and good but using the scraper that we’ve written to scrape multiple pages is even better.
In this case we have 10 pages, each with 10 entries. We want to scrape all 100.
To do this you'll need to open up an Excel document so that we can generate the links we want to scrape.
If you go back to viewing the Waterstones page in a normal browser and click the button to go to the next set of list entries, you'll see the URL changes from this:
The part of the URL that has changed is the bit beginning pageNumber=. The first 10 entries are on pageNumber=0, whereas the second 10 are on pageNumber=1. From this we can deduce that this is the only part of the URL that will change throughout the list.
In your Excel document, type all 10 links, from pageNumber=0 to pageNumber=9, without a column header. Then save your document as a webpage.
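If you'd rather not type the links out by hand, the same list can be generated in a couple of lines of Python. The base URL below is a placeholder; copy the real one, everything up to and including pageNumber=, from your browser's address bar:

```python
# Placeholder base URL: substitute the real address from your browser,
# everything up to and including "pageNumber=".
base = "http://www.waterstones.com/bestsellers?pageNumber="

# Build the ten page URLs, pageNumber=0 through pageNumber=9.
urls = [base + str(n) for n in range(10)]

for url in urls:
    print(url)
```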
Return to OutWit Hub, click on the Tables tab in the left hand menu. Go to File, then Open the Webpage you just saved.
You should be presented with your spreadsheet, with the URLs you just generated in column 4. Click on one of these URLs and press Ctrl+A to select them all. Then right-click on one of them and select Auto-Explore Pages, then Fast Scrape, then the scraper you just made, e.g. Waterstones Top 100.
Click on the scraper and OutWit will scrape all 100 entries in the list for you.
Now all you need to do is click on the vertical ‘Export’ button to the right of your scraped data and save your data in your chosen format.
So, did your page look like this? Let me know in the comments section!