Table Extraction from the PDF

Introduction

The objective of this project is to extract tables and their cells from a PDF using the Python library Camelot.

Note: Camelot works better when the boundaries of each cell are clearly defined, that is, when any two adjacent cells are separated by a solid line.

Detection of Outer Boundary

Table extraction from a PDF can be done with Camelot's Lattice method. It takes the following steps to identify the table region:

Convert the PDF into an image using Ghostscript
Apply image processing to obtain the horizontal and vertical lines
Detect the line segments
Compute the table boundaries by overlapping the detected line segments and “or”ing their pixel intensities
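The last three steps can be illustrated on a toy binary image. This is a minimal pure-Python sketch, not Camelot's internal code: the helper names (`runs_mask`, `transpose`, `table_bbox`), the `min_len` threshold, and the pixel values are all made up for illustration. It treats long runs of dark pixels as line segments, "or"s the horizontal and vertical masks, and takes the bounding box of the result as the table boundary.

```python
def runs_mask(rows, min_len):
    """Mark pixels that belong to a horizontal run of at least min_len dark pixels."""
    mask = [[0] * len(row) for row in rows]
    for y, row in enumerate(rows):
        start = None
        for x, value in enumerate(row + [0]):  # trailing 0 closes a run at the edge
            if value and start is None:
                start = x
            elif not value and start is not None:
                if x - start >= min_len:
                    for k in range(start, x):
                        mask[y][k] = 1
                start = None
    return mask


def transpose(grid):
    return [list(col) for col in zip(*grid)]


def table_bbox(image, min_len=3):
    """Return the OR-ed line mask and its bounding box (x_min, y_min, x_max, y_max)."""
    horizontal = runs_mask(image, min_len)
    # Vertical runs are horizontal runs of the transposed image.
    vertical = transpose(runs_mask(transpose(image), min_len))
    # "or" the pixel values of the two masks.
    combined = [[h | v for h, v in zip(h_row, v_row)]
                for h_row, v_row in zip(horizontal, vertical)]
    xs = [x for row in combined for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(combined) if any(row)]
    return combined, (min(xs), min(ys), max(xs), max(ys))


# Toy image: 1 = dark pixel. The lone pixel in the last row is noise.
image = [
    [1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0],
]

combined, bbox = table_bbox(image)
print(bbox)  # (0, 0, 4, 2)
```

The noise pixel is too short to count as a line segment, so it does not enlarge the detected boundary.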

The image below shows the detected outer lines of a table:

Detection of Cell Boundaries

Intersection points of the horizontal and vertical lines are identified with image-processing techniques, and these points give the corner coordinates of each cell in the table. However, all these coordinates are in Camelot space, because the library reduces the size of the PDF before processing it. Hence, the coordinates need to be mapped from Camelot space back to the original PDF space.
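As a rough illustration of how intersection points yield cell coordinates, the sketch below "and"s a horizontal-line mask with a vertical-line mask and pairs up neighbouring intersections into cells. It is a simplified pure-Python example, not Camelot's implementation: the function names `joints` and `cells_from_joints` and the mask values are hypothetical, and merged cells are not handled.

```python
def joints(horizontal, vertical):
    """Pixels where a horizontal and a vertical line overlap ("and" of the masks)."""
    return [(x, y)
            for y, (h_row, v_row) in enumerate(zip(horizontal, vertical))
            for x, (h, v) in enumerate(zip(h_row, v_row))
            if h & v]


def cells_from_joints(points):
    """Build (x1, y1, x2, y2) boxes from neighbouring intersection points."""
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})
    known = set(points)
    cells = []
    for y1, y2 in zip(ys, ys[1:]):
        for x1, x2 in zip(xs, xs[1:]):
            # Keep the box only if all four corners are real intersections.
            if {(x1, y1), (x2, y1), (x1, y2), (x2, y2)} <= known:
                cells.append((x1, y1, x2, y2))
    return cells


# Toy masks (1 = line pixel): two horizontal lines and three vertical lines.
horizontal = [
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]
vertical = [
    [1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1],
]

cells = cells_from_joints(joints(horizontal, vertical))
print(cells)  # [(0, 0, 2, 2), (2, 0, 4, 2)]
```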

This transformation can be done by shifting and rescaling the axes of the Cartesian coordinate system in Camelot space. If the top-left corner of the table is taken as the origin in both spaces, the following approach can be used:

Shift the top-left coordinate of table_c (the table in Camelot space) to the top-left coordinate of table_p (the table in PDF space)
Calculate the rescaling factors for width and height. These are the ratios of the widths and of the heights of the two tables, PDF space over Camelot space (ratio > 1)
For each cell in Camelot space, multiply its width, its height, and its offset from the table's top-left corner by the respective scaling factors
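The three steps above can be sketched as a single function. This is a minimal illustration under stated assumptions: boxes are given as `(x, y, width, height)` with `(x, y)` the top-left corner, both spaces use the same axis orientation, and the name `to_pdf_space` and the sample numbers are made up here rather than taken from the notebook.

```python
def to_pdf_space(cell, table_c, table_p):
    """Map a cell box from Camelot space to PDF space.

    Every box is (x, y, width, height), where (x, y) is the top-left corner.
    table_c is the table in Camelot space, table_p the same table in PDF space.
    """
    cx, cy, cw, ch = cell
    tcx, tcy, tcw, tch = table_c
    tpx, tpy, tpw, tph = table_p

    # Rescaling factors: ratio of the table sizes in the two spaces.
    scale_x = tpw / tcw
    scale_y = tph / tch

    # Offset from the Camelot table origin, rescaled, then shifted to the PDF origin.
    x = tpx + (cx - tcx) * scale_x
    y = tpy + (cy - tcy) * scale_y
    return (x, y, cw * scale_x, ch * scale_y)


table_c = (10, 20, 100, 50)   # table in Camelot space (hypothetical values)
table_p = (30, 60, 200, 150)  # same table in PDF space (hypothetical values)
cell = (60, 45, 50, 25)       # bottom-right cell of table_c

print(to_pdf_space(cell, table_c, table_p))  # (130.0, 135.0, 100.0, 75.0)
```

In this example the cell's bottom-right corner maps to (230, 210), which is also the bottom-right corner of table_p, so the transformed cell stays aligned with the table boundary.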

For example:

red : Table in PDF space
purple : Table in camelot space.

Transformation equations for x and y coordinates:
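One way to write these equations, consistent with the shift-and-rescale steps listed earlier. The symbols are assumptions introduced here: $(x_c^0, y_c^0)$ and $(x_p^0, y_p^0)$ are the top-left corners of table_c and table_p, $W$ and $H$ are the table widths and heights, and $(x_c, y_c)$ is any point in Camelot space.

```latex
s_x = \frac{W_p}{W_c}, \qquad s_y = \frac{H_p}{H_c}

x_p = x_p^0 + s_x \,(x_c - x_c^0), \qquad
y_p = y_p^0 + s_y \,(y_c - y_c^0)
```

A cell of width $w_c$ and height $h_c$ then has width $s_x w_c$ and height $s_y h_c$ in PDF space.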

The image below shows the table transformed from Camelot space to PDF space.

Usage

Install requirements

pip install -r requirements.txt
Install Ghostscript from here
The implementation is in a Jupyter notebook, which can be found here
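For a quick check outside the notebook, a minimal call to Camelot might look like the sketch below. The wrapper name `extract_cells` and the file name in the usage comment are hypothetical, and the notebook may organise this differently. It relies on `camelot.read_pdf` with `flavor="lattice"` and on each table's `df` and `cells` attributes.

```python
def extract_cells(pdf_path, pages="1"):
    """Return, for each detected table, its DataFrame and its cell boxes."""
    # Imported here so the function can be defined without Camelot installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages, flavor="lattice")
    results = []
    for table in tables:
        # table.cells is a list of rows; each cell carries x1, y1, x2, y2 and text.
        boxes = [(c.x1, c.y1, c.x2, c.y2, c.text)
                 for row in table.cells for c in row]
        results.append({"data": table.df, "cells": boxes})
    return results


# Example usage (assumes a local file named sample.pdf):
# for item in extract_cells("sample.pdf"):
#     print(item["data"])
```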