OutWit Hub Tutorial

In this post I’ll show you how to use the free version of OutWit Hub to scrape data that can be used to form the basis of a story. Please comment on this post below if you have any problems and I’ll do my best to help you out.

As data journalism becomes increasingly popular it has the potential to appear as something of a dark art practiced on desks away from the news desk. However, although data journalism can be extremely specialised, it brings with it some really powerful new techniques that can be harnessed by any journalist in order to get or enrich news.

One such new skill is data scraping. To somebody who’s never come into contact with scraping before it can sound scary and even slightly sinister. However, there are a number of tools out there that make it really easy to scrape your own data without having to know a single line of code.

Once you’ve finished reading, you’ll be able to do this:

Getting OutWit Hub

OutWit Hub is a specialised web browser which is free to download. The free version limits you to scraping 100 rows of data but if you are a casual user this will probably be more than sufficient to fulfil your scraping needs. Alternatively you can buy the pro version for about £60.

In this post you’ll only need the free or lite version.

What does OutWit Hub do? – The simple theory behind scraping

When you boil it down, OutWit Hub actually does quite a simple thing. It retrieves text from a web page between two specified markers.

For instance, let’s assume we have a page with a source code that looks like this: A B C

I want OutWit Hub to extract the B. In order to make it do this I would set the ‘before marker’ equal to A and the ‘after marker’ equal to C. With these instructions in place OutWit would extract the B.

However, it is important to make sure that your before and after markers are as specific as they can be, otherwise you could end up with OutWit extracting data that you don’t want.

For instance, in a situation where the source code of the page looks like this: A B C D A E C

I still want to extract the B from this code but if we use the same markers as before then OutWit will also scrape the E because the E also has an A before it and a C after it.

In order to solve this we would make the markers more specific. So instead of having after marker = C, we would change it to after marker = C D. In this way, the markers are specific only to the B and OutWit will only extract the data we want.

OutWit scrapes data using markers to book-end what you want to extract.
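If you're curious what the marker idea looks like in code, here is a minimal Python sketch. It is not how OutWit Hub works internally; the function name `extract_between` is made up for illustration, and the "page" is just the A B C D A E C example from above.

```python
def extract_between(source: str, before: str, after: str) -> list[str]:
    """Return every piece of text found between a before and an after marker."""
    results = []
    position = 0
    while True:
        # Look for the next before marker, starting where we left off.
        start = source.find(before, position)
        if start == -1:
            break
        start += len(before)
        # Look for the first after marker that follows it.
        end = source.find(after, start)
        if end == -1:
            break
        results.append(source[start:end].strip())
        position = end + len(after)
    return results


page = "A B C D A E C"

# Loose markers pick up the E as well as the B.
loose = extract_between(page, "A", "C")
# Making the after marker more specific leaves only the B.
specific = extract_between(page, "A", "C D")

print(loose)
print(specific)
```

The only difference between the two calls is the after marker, which is exactly the change described above.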

How to scrape in OutWit Hub

So now we know the theory we can apply that to some actual scraping. You can scrape all sorts of things with OutWit Hub but lists are probably one of the easiest things to scrape. This is due to the fact that they are generally set out in a way that means the page structure is very regular, making it easy to write our before and after markers.

We’ll be working from the Waterstones top 100 Bestsellers list, which can be found here.

OutWit is a good tool to use for this list because the list spans multiple pages. It would take a long time to copy and paste the list entries from each page. Scraping the list allows you to perform the task much quicker and in a way that’s much cleaner.

You should now be seeing something that looks like this (the book will be different depending on when you access the list):

Waterstones Bestseller List

The page does give you the option to view up to 50 entries at a time but we’ll stick with 10 because it will give you a better idea of OutWit’s capabilities.

Open up OutWit Hub then paste in the url of the Waterstones page and go to that page.

Then select the ‘Scrapers’ tab from the left sidebar and create a new scraper called something like Waterstones Top 100.

When you’ve done that you should have a page that looks roughly like this:

OutWit Hub dashboard

I’ve entered the Description names of the four pieces of data we want to record for each book which in this case is Rank (in the top 100), Name of the Book, Author and Price.

Pages of code can often be very long, but the ‘Find in Page’ box makes it easy for non-coders like me to navigate the page and find the bits that I want. Seeing as we know that the first book in the list is called Life After Life, we can search for it in the ‘Find in Page’ box and it will take us to the top of the list. From here we can work out what our markers should be.

The first instance of Life After Life takes us to this line of yellow code:

<img src="http://www.waterstones.com/wat/images/nbd/s/978055/277/9780552776639.jpg" alt="Life After Life"/>

This is not useful to us because OutWit will only extract text that is coloured black i.e. the text that will actually appear on the web page.

The next instance of Life After Life, however, is in black and will take you to a screen that has code looking like this:

Waterstones’ website source code

We can see immediately from this that three of the four information categories that we’re looking to extract are present: Rank, Name and Author.

Now we can set our markers to scrape data

In order to extract the rank of the book we would set the ‘marker before’ as ‘<span class="position">’ and the ‘marker after’ as ‘</span>’.

Just as we saw in the theory section above, this will isolate the bit of data that we want so it can be extracted by OutWit.

Similarly, we can extract the name of the book by setting the ‘marker before’ as ‘href="http://www.waterstones.com/waterstonesweb/products/’ and the ‘marker after’ as ‘</a>    <h2>’. As you will see from the image below, there are a lot of spaces in between these two tags (represented as triangles in the OutWit scraper section), and the best way of getting the exact number is simply to copy and paste the marker in from the code.

The before marker, in this instance, does not take in all the code immediately preceding the text because that would make the marker too specific. The Kate+Atkinson part onwards is only true of this first book, so a marker that included it would only return a result for this one book. Making sure your markers are just specific enough is very important, and you can get them right through trial and error or by again using the ‘Find in Page’ box.

The markers for the author name are more straightforward. The marker before is ‘<p class="byAuthor">by’ and the marker after is ‘</a>’.

By scrolling down the page we can then find the bit of the code that tells us the price and set the marker before as ‘priceRed2">&pound;’ and the marker after as ‘</span>’.

You should now have a set of markers that look like this:

How to scrape using outwit hub markers
Markers before and after
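To see how a set of markers turns into rows of data, here is a rough Python sketch of the same idea. It is only an illustration and not OutWit's own method: the `sample_page` fragment and its prices are invented, the `scrape` function is hypothetical, and I've only used the Rank and Price markers from above to keep it short.

```python
import re

# Hypothetical fragment, loosely modelled on the markers quoted above.
# The real Waterstones page will contain far more code than this.
sample_page = """
<span class="position">1</span>
<span class="priceRed2">&pound;3.85</span>
<span class="position">2</span>
<span class="priceRed2">&pound;5.00</span>
"""

# One (marker before, marker after) pair per column, as in the scraper editor.
markers = {
    "Rank": ('<span class="position">', "</span>"),
    "Price": ('priceRed2">&pound;', "</span>"),
}


def scrape(source, markers):
    """Collect the text between each pair of markers and line the columns up as rows."""
    columns = {}
    for name, (before, after) in markers.items():
        # Escape the markers so they are matched literally, then capture
        # the shortest possible run of text between them.
        pattern = re.escape(before) + r"(.*?)" + re.escape(after)
        found = re.findall(pattern, source, flags=re.DOTALL)
        columns[name] = [value.strip() for value in found]
    names = list(columns)
    # Pair up the first rank with the first price, and so on.
    return [dict(zip(names, values)) for values in zip(*columns.values())]


rows = scrape(sample_page, markers)
for row in rows:
    print(row)
```

Note that this sketch simply pairs values up by position, so if one marker misses an entry the columns drift out of step. That is one more reason to check your markers carefully before running them over a whole list.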

Make sure you save the scraper, then click the ‘Execute’ button.

This should then take you to your scraped data tab, looking something like this:

OutWit Waterstones Bestseller List First Scrape

Well done, you’ve written your first scraper!

Applying the scraper to multiple pages to scrape the full list

Scraping one page is all well and good but using the scraper that we’ve written to scrape multiple pages is even better.

In this case we have 10 pages, each with 10 entries. We want to scrape all 100.

To do this you’ll need to open up an Excel document so that we can generate the links we want to scrape.

If you go back to viewing the Waterstones page in a normal browser and click the button to go onto the next set of list entries you’ll see the url changes from this:

http://www.waterstones.com/waterstonesweb/bestSellersCategory.do?searchType=7&ctx=0&pageNumber=0&sort=ProductSalesRankList|REQUEST_SORT_DIRECTION_DESC&resultsPerPage=10

To this:

http://www.waterstones.com/waterstonesweb/bestSellersCategory.do?searchType=7&ctx=0&pageNumber=1&sort=ProductSalesRankList|REQUEST_SORT_DIRECTION_DESC&resultsPerPage=10

The part of the URL that has changed is the bit which begins pageNumber=. The first 10 entries are on pageNumber=0 whereas the second 10 are on pageNumber=1. From this we can deduce that this is the only part of the URL that will change throughout the list.

In your Excel document type all 10 links from pageNumber=0 to pageNumber=9 without a column header. Then save your document as a webpage.

Generate your urls in Excel
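If you'd rather not type the links out by hand, a few lines of Python can generate them for pasting into your spreadsheet. This is just a sketch: the names `BASE_URL` and `build_page_urls` are my own, and it assumes pageNumber really is the only part of the address that changes.

```python
# The list URL from above, with the page number left as a placeholder.
BASE_URL = (
    "http://www.waterstones.com/waterstonesweb/bestSellersCategory.do"
    "?searchType=7&ctx=0&pageNumber={page}"
    "&sort=ProductSalesRankList|REQUEST_SORT_DIRECTION_DESC&resultsPerPage=10"
)


def build_page_urls(page_count=10):
    """Return one URL per page, counting from pageNumber=0."""
    return [BASE_URL.format(page=number) for number in range(page_count)]


urls = build_page_urls()
for url in urls:
    print(url)
```

Copy the printed lines into the first column of your spreadsheet and carry on with the steps below.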

Return to OutWit Hub, click on the Tables tab in the left hand menu. Go to File, then Open the Webpage you just saved.

You should be presented with your spreadsheet with the urls you just generated in column 4. Click on one of these urls and press ctrl+a to select them all. Then right click on one of them and select Auto Explore Pages, then Fast Scrape, and then the scraper you just made, e.g. Waterstones Top 100.

Auto exploring pages in Outwit Hub

Click on the scraper and OutWit will scrape all 100 entries in the list for you.

Now all you need to do is click on the vertical ‘Export’ button to the right of your scraped data and save your data in your chosen format.

So, did your page look like this? Let me know in the comments section!

22 COMMENTS

  1. Great tutorial on how to scrape. However, some of your sample before and after markers are inconsistent with your actual samples and the screenshots. When I followed the tutorial to the letter, I get results that start with 11, and repeat after 20 back to 11. After 20 it continues on correctly with the ranking. Also I have gaps in the data where the name of book is See more. I think my anomalies are due to your different examples between the written and screenshots. Or more likely due to my newbiness of working with Outwit Hub. =P Great tutorial otherwise, as I will be able to apply to my real world job…

  2. Hi. Glad you like the post.

    You are quite right about the markers. I’m not sure what happened there. I’ve now changed them in the article and have also taken a better screenshot of my markers so you can see all of them.

    I hope this helps!

  3. Hi Patrick
    Great Tutorial!

    I am currently struggling with a scraper and several pages.

    Will you be willing to take a look at my scraper settings.

    Best

    Cesar

  4. Why don’t you just use the links-autobrowse function to obtain the links and then save as .txt and then open that file in OutWit, highlight, right-click, autoscrape? alternatively create a single use scraper to scrape links for the books and hit the auto-browse function, save as txt and then open that file and autoscrape… much quicker than manually doing it in excel

  5. Hi Patrick,

    Thanks very much for the tutorial which helped me a lot. All the things run good on the single page, but when I tried to scrape all 10 pages, the results are like a mess(e.g., it shows from 1 to 30 then 61 to 70 three times repeated, then 71 to 90 finally 81 to 90 again), so I tried to do all steps from beginning to the end again then the results are like another messy order…

    Is that a bug of this software? or something wrong what I did?

    Looking forward to your reply.
    Elvo

  6. I’m coming again… I created a scraper named “test”which worked very well at the beginning, both on the single webpage and the url list page, but after I exported it, then tried to run it again, there was always a pop-up said”the automator test is disabled. Use it anyway?” If I clicked Yes, then the scraper cannot be used on the webpage I saved before from excel(it can present column 4, but if I selected auto-explore pages, fast scrape, then there was only one option”automatically select scraper”. If I clicked that one, the page will show “No active applicable scrapers were found for the explored URL(s). Check Active and URL fields in the ‘scrapers’ View”). I can not figure out what was wrong and how to fix it…

    If you can help me, I will really appreciate it…

    Elvo

    • Hi Elvo. Sorry for the late reply. It sounds as though something is slightly awry with the url that the scraper is destined to. Even a really tiny change will confuse OutWit. So it might be https:// rather than http:// or something that small. Try just setting it to ‘www.’

  7. Hello Patrick, Is it possible to read only a part of the page source?
    Lets say that OutWit Hub stops reading when it finds the

  8. Great tutorial. I have a different problem.I want to scrap data from a JSP page which opens after submitting 4 variables into a form. That means, I have to input data into a web page and submitting this will open a JSP page. I want to scrape data from second page. I want to do this repeatedly 100 times, by changing one of the 4 variables. Is this possible with lite version of the program. If so please guide me. Thank you

  9. Great tutorial. I want to scrape data from JSP page which opens after submitting data to previous page. I have to input 4 variables. I want to do it 100 times by changing 1 of the 4 variables. I want to automate this. Is this possible with lite version of program. Thank you

  10. Great tutorial, I did however initially have some problems with it. The urls saved in Excel have to be formatted as hyperlinks; I used a second column with the command =HYPERLINK(A1) to do this.
