In the early 1990s, Adobe invented the PDF so that documents would look exactly the same no matter what device is used to view or print them. PDFs are not designed to be edited. That's great when you're sending someone your resume, but not so great for open data.

Unfortunately, most Indian government data sets, and many others, can only be found in PDF documents. These documents can be transcribed by hand, but it's expensive. A fast transcriber can manage about 2,000 cells per hour, roughly one cell every two seconds. Suppose a PDF document has 100 pages with 400 cells per page, or 40,000 cells in all. That will take 20 hours to transcribe, costing you $200 at $10 per hour.
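
Just to make the pain concrete, here's that back-of-the-envelope calculation in a few lines of Python (the rates are the same assumptions as above):

# Back-of-the-envelope cost of manual transcription (rates assumed above)
pages = 100
cells_per_page = 400
cells_per_hour = 2000  # a fast transcriber
rate_per_hour = 10     # dollars

total_cells = pages * cells_per_page           # 40,000 cells
hours = total_cells / float(cells_per_hour)    # 20 hours
print("Cost: $%.0f" % (hours * rate_per_hour)) # $200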

I can imagine the evil people who present data in PDFs cackling with delight at the plight of data scientists who have to figure out how to extract tabular data from these PDFs.


Some nice people decided to build a tool called Tabula, which makes it much easier to extract a few tables from a PDF. However, extracting and cleaning data for hundreds of tables this way quickly becomes tedious. You'd rather automate the entire process.


Some other nice people at ScraperWiki made a really cool API called PDFTables, which converts data from PDFs into HTML. HTML isn’t evil, since it can be parsed!

Let’s extract tabular data from one page to get the hang of the workflow needed to automate the extraction process in Python. (I’ll assume intermediate experience with Python.)

Step 1: Get the data

For this tutorial, let’s use the “Distribution of accidental deaths by causes attributable to nature and sex — 2013” dataset provided by the National Crime Records Bureau (NCRB) of India.
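
This step is just a download. If you want to script it as well, something like the sketch below works. The URL here is a placeholder (grab the real link from the NCRB website), and filename is simply the base name we'll reuse for the PDF and the intermediate HTML file in the next steps:

import requests

url = 'https://example.org/accidental-deaths-2013.pdf'  # placeholder: replace with the actual NCRB link
filename = 'accidental_deaths_2013'  # base name used for the .pdf and .html files below

r = requests.get(url)
r.raise_for_status()
with open(filename + '.pdf', 'wb') as f:
    f.write(r.content)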

Step 2: Convert the PDF to HTML using the PDFTables API

Import essential libraries and write the API response into an intermediate HTML file.

import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

filename = 'accidental_deaths_2013'  # base name of the local PDF (without the .pdf extension)

htmlfile = open(filename+'.html', 'wb')  # binary mode, since the API response arrives as bytes
files = {'f': (filename+'.pdf', open(filename+'.pdf', 'rb'))}
response = requests.post('https://pdftables.com/api?key=r4i5cvh74tvn', files=files)  # use your own API key
response.raise_for_status()  # ensure we notice bad responses
for chunk in response.iter_content(chunk_size=1024):
    if chunk:
        htmlfile.write(chunk)
    htmlfile.flush()
htmlfile.close()
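
Since we'll eventually want to run this over many PDFs, it can help to wrap the call above in a small helper. This is just a sketch of the same request; pdf_to_html and its api_key argument are names I've made up for convenience:

def pdf_to_html(filename, api_key):
    """Send <filename>.pdf to the PDFTables API and write the result to <filename>.html."""
    with open(filename+'.pdf', 'rb') as pdf:
        response = requests.post('https://pdftables.com/api?key=' + api_key,
                                 files={'f': (filename+'.pdf', pdf)})
    response.raise_for_status()
    with open(filename+'.html', 'wb') as htmlfile:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                htmlfile.write(chunk)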

Step 3: Parse the HTML and hunt for patterns

This is the trickiest and most creative part of the entire process, and the technique varies from dataset to dataset. For our sample dataset, we’ll extract the rows that sit between the row containing the string “STATES:” and the row containing “TOTAL (ALL INDIA)”.

with open(filename+'.html', 'r') as htmlfile:
    htmldata = htmlfile.read()
soup = BeautifulSoup(htmldata, 'html.parser')

NOTE: We pass html.parser explicitly rather than letting BeautifulSoup pick lxml (its default when lxml is installed), because the built-in parser is sturdier here.

tables = soup.find_all('table')  # collect every table in the converted HTML
rows = tables[0].find_all('tr')  # extract all rows from the first table
start = 0
stop = 0
# Find the starting point: the first data row after the 'STATES:' marker
for row in rows:
    if 'STATES' in str(row):
        break
    start += 1
start += 1  # skip the marker row itself
stop = start
# Find the stopping point: the row containing 'TOTAL (ALL INDIA)'
for row in rows[start:]:
    if 'TOTAL (ALL INDIA)' in str(row):
        break
    stop += 1

Step 4: Convert the data to CSV

This is the easy bit. Loop through the rows and comma-separate the values in the columns. Beware of potholes in the data, though! Notice that a ‘UNION TERRITORY:’ label appears out of nowhere partway through!

csvstring = ''
for i in range(start, stop+1):
    cols = rows[i].find_all('td')
    for col in cols:
        try:
            csvstring = csvstring + col.contents[0] + ','
        except IndexError:
            # skip cells with no contents
            pass
    csvstring = csvstring + '\n'
print(csvstring)
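
One pothole worth flagging: if any cell itself contains a comma, simple string joining will shift the columns. A slightly safer sketch (same row range as above) hands the quoting problem to Python's built-in csv module:

import csv

with open(filename+'.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for row in rows[start:stop+1]:
        # get_text() flattens each cell, even when its contents are nested tags
        writer.writerow([col.get_text(strip=True) for col in row.find_all('td')])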

At the end of this comparatively short pipeline of Python scripts, we get output similar to this:

1,ANDHRA PRADESH,0,0,0,0.0,2,4,6,0.5,
2,ARUNACHAL PRADESH,0,0,0,0.0,0,0,0,0.0,
3,ASSAM,0,0,0,0.0,0,1,1,0.4,
4,BIHAR,0,0,0,0.0,66,15,81,9.4,
5,CHHATTISGARH,0,0,0,0.0,2,1,3,0.4,
6,GOA,0,0,0,0.0,0,0,0,0.0,
7,GUJARAT,0,0,0,0.0,12,2,14,0.5,

A couple of formatting and conversion steps later, this data is analyzable! We can even loop through all the tables in the PDF and save them all as CSVs.
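
Here's a rough sketch of what that loop could look like. It leans on pandas.read_html (so the pandas import from earlier finally earns its keep) and simply numbers the output files; pandas needs lxml or html5lib installed to parse HTML.

from io import StringIO

# Save every table in the converted HTML as its own CSV (rough sketch)
for i, table in enumerate(soup.find_all('table')):
    df = pd.read_html(StringIO(str(table)))[0]  # parse one HTML table into a DataFrame
    df.to_csv('%s_table_%02d.csv' % (filename, i), index=False)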


In conclusion, Tabula is helpful for extracting a couple of tables quickly. For hundreds of tables from a couple of PDFs, writing Python scripts and using the PDFTables API is a better choice. However, extracting data from hundreds of tables from hundreds of PDFs can be slow even with PDFTables because of data transfer speeds over the internet. I really hope the nice people and friends over at ScraperWiki decide to open source the library behind the PDFTables API for super fast tabular data extraction from ultra evil PDFs.

If you know a faster way to extract data from PDFs, let us know in the comments.

Happy hacking!


Interested in learning more about what we do? Check out our capabilities deck to see how our platform works, or drop us a message!
