The objective of this project is to extract tables and their cells from a PDF using the Python library Camelot.
Note: Camelot works better when the boundaries of each cell are properly defined, that is, when any two adjacent cells are separated by a solid line.
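As a rough sketch of what the Camelot call looks like for a ruled table, the snippet below reads one page with the `lattice` flavor (the process described next) and returns each detected table as a pandas DataFrame. The function name `read_lattice_tables` and the file name in the usage comment are placeholders, not part of the project.

```python
def read_lattice_tables(pdf_path, pages="1"):
    """Return the tables Camelot finds on the given pages as DataFrames.

    Assumes the third-party package camelot-py (and its dependencies)
    is installed; it is imported here so the module loads without it.
    """
    import camelot

    # flavor="lattice" relies on the ruling lines between cells
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor="lattice")
    return [tables[i].df for i in range(len(tables))]


# Hypothetical usage (the file name is a placeholder):
# frames = read_lattice_tables("report.pdf", pages="1")
# print(frames[0].head())
```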
Table extraction from a PDF can be done with a process called Lattice. Below are the steps it takes to identify the table region.
The image below shows the detected outer lines of a table:
The intersection points of the horizontal and vertical lines are identified with image processing techniques, and these points give the coordinates of each cell in the table. However, these coordinates are in camelot space, because the library reduces the size of the PDF before processing it. It is therefore necessary to map the coordinates from camelot space back to the original PDF space.
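To illustrate how intersection points turn into cell coordinates, here is a simplified sketch. It assumes the ruling lines have already been detected and are given as axis-aligned segments (the image processing step itself is not shown), and it ignores merged cells. The helper names and the sample coordinates are made up for illustration.

```python
def find_intersections(h_lines, v_lines):
    """Return the sorted points where horizontal and vertical segments cross.

    h_lines: iterable of (y, x_start, x_end)
    v_lines: iterable of (x, y_start, y_end)
    """
    points = set()
    for y, x_start, x_end in h_lines:
        for x, y_start, y_end in v_lines:
            if x_start <= x <= x_end and y_start <= y <= y_end:
                points.add((x, y))
    return sorted(points)


def cells_from_points(points):
    """Build (left, top, right, bottom) boxes from a regular grid of points.

    A box is kept only if all four of its corners are intersection points.
    The origin is taken at the top-left, so y grows downward.
    """
    point_set = set(points)
    xs = sorted({x for x, _ in point_set})
    ys = sorted({y for _, y in point_set})
    cells = []
    for top, bottom in zip(ys, ys[1:]):
        for left, right in zip(xs, xs[1:]):
            corners = {(left, top), (right, top), (left, bottom), (right, bottom)}
            if corners <= point_set:
                cells.append((left, top, right, bottom))
    return cells


# A 2 x 2 table: three horizontal and three vertical ruling lines
h_lines = [(0, 0, 100), (20, 0, 100), (40, 0, 100)]
v_lines = [(0, 0, 40), (50, 0, 40), (100, 0, 40)]
points = find_intersections(h_lines, v_lines)
cells = cells_from_points(points)
print(len(points), "intersections,", len(cells), "cells")
```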
This transformation can be done by shifting and rescaling the axes of the Cartesian coordinate system in camelot space. If the top-left corner of the table is taken as the origin in both spaces, the following approach can be used:
For example:
Red: table in PDF space. Purple: table in camelot space.
Transformation equations for the x and y coordinates:
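One way to write the shift-and-rescale transformation is sketched below. It assumes the table's bounding box is known in both spaces as `(x, y, width, height)` with `(x, y)` the top-left corner, and that both spaces use the same axis directions; if the PDF space measures y upward, the y term would need to be flipped. The function name and the sample numbers are illustrative only.

```python
def camelot_to_pdf(point, table_camelot, table_pdf):
    """Map a point from camelot space to PDF space.

    point:         (x, y) in camelot space
    table_camelot: (x, y, width, height) of the table in camelot space
    table_pdf:     (x, y, width, height) of the same table in PDF space

    x_pdf = x0_pdf + (x - x0_camelot) * (width_pdf / width_camelot)
    y_pdf = y0_pdf + (y - y0_camelot) * (height_pdf / height_camelot)
    """
    px, py = point
    cx, cy, cw, ch = table_camelot
    tx, ty, tw, th = table_pdf
    scale_x = tw / cw  # rescaling of the x axis
    scale_y = th / ch  # rescaling of the y axis
    # shift to the table origin, rescale, then shift to the PDF origin
    return (tx + (px - cx) * scale_x, ty + (py - cy) * scale_y)


# Illustrative numbers: the table is three times larger in PDF space
table_camelot = (10, 20, 100, 50)
table_pdf = (30, 60, 300, 150)
print(camelot_to_pdf((60, 45), table_camelot, table_pdf))
```

Applying the same function to every cell corner moves the whole grid of cells into PDF space.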
The image below shows the table transformed from camelot space to PDF space.