How to extract data from a PDF

We live in a world where PDF is king. Perhaps we could even go as far as to call it the tyranny of the PDF.

Developed in the early 90s as a way to share documents among computers running incompatible software, the Portable Document Format (PDF) offers a consistent appearance on all devices, ensuring content control and making it difficult for others to copy the information contained within.

However, for a data journalist whose job depends on being able to extract bulk data for analysis and visualisation, the PDF as the file type of choice does not tend to go down well.

In a field of journalism where the spreadsheet rules the roost, we explore a few ways of turning data locked inside PDFs into spreadsheets (Excel XLS or CSV) primed for analysis.

What’s always important to remember when trying to get data out of PDF files is that there is no single catch-all method that works on every occasion. Sometimes it’s just a matter of trying each one until you find the one that works. Here are some of the methods you could try:

1) SCRAPERWIKI

ScraperWiki is a powerful web-based platform for building ‘scrapers’: programmes that allow you to extract, clean and analyse data from websites. To make full use of its features, knowing how to code is essential, but the new table extract feature is a useful way of getting data trapped in PDFs into spreadsheets.

Here’s how:

ScraperWiki has a free community version allowing up to three datasets and you can get up to 20 if you are a journalist.

Here is a great example of scraping the PDF with ScraperWiki by writing simple code. For a more advanced guide to getting to grips with ScraperWiki for those who don’t code, this from Nicola Hughes is a great starting point.

2) TABULA

Tabula, developed by former Knight-Mozilla Open News fellow Manuel Aristarán in association with ProPublica, is an open source tool specifically designed for extracting data within tables in a PDF to CSV format.

You will have to download Tabula and run it from your own device, but don’t worry as it is very straightforward and there’s even a video showing you how to do it.

Essentially, Tabula works like this: you upload a PDF file, draw a box around the area of the table you would like to copy and then choose either to copy the data or to download it as a file. You can also edit the text before copying or downloading to your spreadsheet software of choice. Also, try not to include table headers in your selection as they may be problematic – you can just add them in after you have got your data into a spreadsheet.

One of the current drawbacks of Tabula is that you are not able to select tables over multiple pages, which you can do with ScraperWiki. Other than that, when your PDF data is in a tabular format, Tabula is a great tool to have in the battle against PDFs. If you are still not convinced, here’s how some major news organisations used it to produce data-rich news stories.
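If a table does run over several pages, one workaround is to export each page separately without its header row and stitch the pieces together afterwards. The following is a minimal Python sketch of that idea, not a Tabula feature: the function name `combine_pages`, the column names and the sample rows are all made up for illustration, and it assumes every page has the same columns in the same order.

```python
import csv
import io


def combine_pages(page_csvs, headers):
    """Join header-less CSV text from several pages under one header row.

    page_csvs: list of strings, each the CSV text exported for one page.
    headers:   list of column names to write once at the top.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(headers)
    for page in page_csvs:
        for row in csv.reader(io.StringIO(page)):
            if row:  # skip blank lines between or after rows
                writer.writerow(row)
    return out.getvalue()


# Hypothetical per-page exports (sample data invented for this sketch).
page_1 = "Camden,120\nHackney,95\n"
page_2 = "Islington,88\n"

combined = combine_pages([page_1, page_2], ["Borough", "Count"])
print(combined)
```

In practice you would read each exported file from disk instead of using inline strings, and check the row counts against the PDF before relying on the result.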

3) COME TO DOCS

Come to Docs is an online document management system that allows you to turn a PDF into an Excel (XLS) file (or a number of other formats including .txt files) in just a few simple steps. It’s as straightforward as uploading your file (you can do this either by registering for an account or simply by dragging and dropping the PDF) and choosing the format you want. You need to include your email and in a matter of minutes the converted file will arrive in your inbox.

Come to Docs insists that its system is completely secure and that privacy is guaranteed. But as you are uploading documents to its servers and receiving files via email (as with a few of these systems), this is something you may want to bear in mind, particularly for sensitive PDFs.

4) ZAMZAR

Zamzar is another one worth highlighting, and it works much like Come to Docs. In my experience, however, Come to Docs is the more reliable of the two.

5) NITRO

Nitro also works in a similar way to Come to Docs. Simply upload a PDF and wait for the converted file to reach your inbox. You are allowed five free conversions without having to sign up for an account, and a free 14-day trial as an account holder. If you do end up needing to use it regularly, you will have to pay.

If these free conversion tools do fail, and it’s not inconceivable that this would happen, then there are a number of ways to manually convert a PDF to a CSV file.

This tutorial from data journalist and former Interhacktive Henry Kirby offers a useful alternative.
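Another manual route, which is not necessarily the one the tutorial describes, is to copy the table’s text out of the PDF and split each line into columns yourself. Below is a rough Python sketch under one big assumption: that columns in the pasted text are separated by two or more spaces, while single spaces only appear inside a cell. The function name `text_table_to_csv` and the sample figures are invented for illustration.

```python
import csv
import io
import re


def text_table_to_csv(text):
    """Convert whitespace-aligned table text to CSV text.

    Assumes columns are separated by runs of two or more whitespace
    characters; a single space is treated as part of the cell.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore empty lines in the pasted text
        writer.writerow(re.split(r"\s{2,}", line))
    return out.getvalue()


# Hypothetical text copied from a PDF table (values invented).
pasted = """
Tower Hamlets    1,204    3.1
Camden           987      2.4
"""

csv_text = text_table_to_csv(pasted)
print(csv_text)
```

The `csv` module quotes cells such as `1,204` so the comma inside the number is not mistaken for a column break. The approach falls apart when cells are empty or when the copied text only has single spaces between columns, so always spot-check the output against the original document.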

Mostly, these examples can handle PDFs that have text or tables embedded within. In the cases where you are dealing with images of text, for example in a scanned document, things become much more difficult.

Below, you can find some additional resources and examples to use in such instances:

Data Driven

http://datadrivenjournalism.net/resources/getting_text_out_of_an_image_only_pdf2

Scanned image to Excel

http://www.verypdf.com/app/scan-to-excel-ocr/scanned-image-to-excel-converter.html

PDF to text

http://www.newocr.com/

ProPublica guide:

http://www.propublica.org/nerds/item/turning-pdfs-to-text-doc-dollars-guide

ftp.foolabs.com/pub/xpdf

Explainer:

https://docs.google.com/file/d/1-y5aHy5KSZhAtd4Q7ZC_NVGylLNN5mHZmSeiZi15r8LHAhZXMjm6LctqsybM/edit
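The xpdf tools listed above include a command-line program called pdftotext, and its `-layout` option tries to keep the physical layout of the page, which helps keep table columns lined up. Here is a small Python sketch of calling it; the wrapper function names and the file names `report.pdf` and `report.txt` are placeholders, and it assumes pdftotext is already installed and on your PATH. Note that pdftotext only reads text that is embedded in the PDF; it does not perform OCR on scanned images.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path):
    """Build the pdftotext command line, keeping the page layout."""
    return ["pdftotext", "-layout", pdf_path, txt_path]


def pdf_to_text(pdf_path, txt_path):
    """Run pdftotext, failing clearly if the tool is not installed."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install the xpdf tools first")
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)


# Placeholder file names; nothing is executed until pdf_to_text() is called.
command = pdftotext_command("report.pdf", "report.txt")
print(" ".join(command))
```

Calling `pdf_to_text("report.pdf", "report.txt")` would then write a plain-text file that you can inspect or split into columns.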

15 Replies to “How to extract data from a PDF”

  1. I notice you didn’t mention PDF2XL by ConiView. You can convert thousands of PDF pages in minutes, and PDF2XL also uses a top quality OCR engine to convert scanned documents. I think it deserves a mention!

  2. I have spent quite a few years of my life in data processing and I didn’t find anything better than Intelliget for PDF parsing. When I started using it, I was very skeptical of its abilities. However, soon I grew very fond and was able to convince my company to replace the existing super costly enterprise extraction tool with this cheap and easy one – Scott

  3. If you are going to be doing batch processing/extraction from PDFs then https://docparser.com is really good at location-based text recognition/extraction. Then you can send the data to Zapier, or download the data in several file types. This allows you to process files with the same physical layout really efficiently.