Intro

Extracts tables from PDFs. The following kinds of PDFs and tables are supported:

Readable PDFs -- PDFs which are not scanned, i.e. on which the text can be selected
Non-readable PDFs -- PDFs which are scanned, i.e. on which the text cannot be selected
Tables with lines
Tables without lines

How it works

First, the function inference_pdf.get_tables from ./inference_pdf.py is called. It takes the path to the file along with two other arguments: method, which selects the method to use for OCR, and debug, which indicates whether to save the intermediate files for later. It extracts an image of each page and the HTML of the whole PDF (the HTML of a PDF is a raw representation of the PDF; think of it as a file containing all the words along with their positions).
It builds a list of all the pages in the PDF and calls get_page_data.get_data from ./multi_level_table/get_page_data.py for each page.
In get_page_data.get_data, the page is first scanned for tables using detect_table.detect_table from ./table_detect/detect_table.py. Each table is then checked to see whether it is scanned or written (i.e. its text can be selected, copied, etc.). A scanned table is read using OCR; a written table is extracted from the HTML. The collected data is returned in JSON format.
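The scanned-versus-written check could be sketched as follows. The rule used here (a table counts as written when at least one word from the PDF's HTML falls inside its bounding box) is an assumption for illustration, and the names center_inside and choose_reader are hypothetical; the check in get_page_data.py may work differently.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def center_inside(inner: Box, outer: Box) -> bool:
    """Return True if the center of `inner` lies within `outer`."""
    x, y, w, h = inner
    ox, oy, ow, oh = outer
    cx, cy = x + w / 2, y + h / 2
    return ox <= cx <= ox + ow and oy <= cy <= oy + oh


def choose_reader(table_box: Box, html_word_boxes: List[Box], min_words: int = 1) -> str:
    """Pick "html" for a written table and "ocr" for a scanned one.

    Assumed rule: a scanned table has no selectable words, so no HTML
    word boxes fall inside its bounding box.
    """
    hits = sum(center_inside(word, table_box) for word in html_word_boxes)
    return "html" if hits >= min_words else "ocr"


table = (50, 50, 200, 100)
print(choose_reader(table, [(60, 60, 30, 10)]))    # word inside the table
print(choose_reader(table, [(400, 400, 30, 10)]))  # word elsewhere on the page
```

Raising min_words would make the check more tolerant of stray text, such as a page number overlapping a scanned table.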
Two methods are used to detect tables: AI (using detectron2) and OpenCV.
The model, trained using transfer learning on a detectron2 model, is stored in ./model_table_detection.
Table detection using OpenCV is a little more complicated and is done by table_detect/detect_table.py. In that file, the function detect_table takes the path to an image file. It calls functions.get_contours from ./multi_level_table/functions.py, which returns an image containing only the vertical and horizontal lines. It achieves this using erosion and dilation. Contours are then detected on this image using the OpenCV function findContours and filtered to keep the tables (basically by selecting the contours with no parent).
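The parent-based filtering can be sketched in plain Python, without OpenCV installed. findContours (with a tree retrieval mode such as RETR_TREE) returns a hierarchy in which each contour has four entries: next, previous, first child, parent, with -1 meaning "none". The function names below are hypothetical and the hierarchy is hand-written sample data, not output from this repo.

```python
from typing import List, Sequence


def top_level_contours(hierarchy: Sequence[Sequence[int]]) -> List[int]:
    """Indices of contours with no parent (candidate tables)."""
    return [i for i, (_nxt, _prev, _child, parent) in enumerate(hierarchy) if parent == -1]


def leaf_contours(hierarchy: Sequence[Sequence[int]]) -> List[int]:
    """Indices of contours with no child (candidate cells)."""
    return [i for i, (_nxt, _prev, child, _parent) in enumerate(hierarchy) if child == -1]


# Sample hierarchy rows: [next, previous, first_child, parent].
# Contour 0 is a table holding cells 1 and 2; contour 3 is a second, empty outline.
# With OpenCV, this list would correspond to hierarchy[0] from findContours.
sample = [
    [3, -1, 1, -1],
    [2, -1, -1, 0],
    [-1, 1, -1, 0],
    [-1, 0, -1, -1],
]

print(top_level_contours(sample))
print(leaf_contours(sample))
```

Note that a contour with neither parent nor child (contour 3 here) passes both filters, so a real pipeline would likely also filter by size or position.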
For each scanned table, read_table_ocr.image_to_df from ./multi_level_table/read_table_ocr.py is called. To detect the cells in the table, the lines of the table are first redrawn (scanned PDFs can have blurred, curved or non-straight lines) using hough_lines.hough_lines in ./multi_level_table/hough_lines.py. Contours are then detected again on this table and filtered to keep the cells (those contours which don't have any child). The content of each cell is read using OCR (Google Vision or Tesseract) and mapped to its proper row and column.
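Mapping cells to rows and columns can be illustrated with a minimal sketch. It assumes each cell is an (x, y, width, height) bounding box and that cells in the same row have top edges within a few pixels of each other; the function name and the tolerance value are hypothetical, and merged cells are not handled. The logic in read_table_ocr.py may differ.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def assign_rows_and_columns(cells: List[Box], tolerance: int = 10) -> List[List[Box]]:
    """Group cell boxes into rows by top edge, then order each row left to right."""
    rows: List[List[Box]] = []
    for box in sorted(cells, key=lambda b: (b[1], b[0])):
        # Same row if the top edge is close to that of the row's first cell.
        if rows and abs(box[1] - rows[-1][0][1]) <= tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [sorted(row, key=lambda b: b[0]) for row in rows]


# Four cells of a 2x2 table, given out of order and slightly misaligned.
cells = [(110, 52, 100, 40), (10, 50, 100, 40), (10, 95, 100, 40), (110, 93, 100, 40)]
grid = assign_rows_and_columns(cells)
for r, row in enumerate(grid):
    for c, box in enumerate(row):
        print(r, c, box)
```

The row index and column index of each cell are then its positions in the outer and inner lists.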
If the table is written, the content is read from the HTML of the PDF and the cells are detected using the same method as above. The content is then mapped to its proper row and column.
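For written tables, placing HTML words into detected cells could look like the sketch below. It assumes each word comes with an (x, y, width, height) box in the same coordinate system as the cells and that words arrive in reading order; the function name fill_cells and the data are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def fill_cells(words: List[Tuple[str, Box]], cells: List[Box]) -> List[str]:
    """Return the text of each cell, built from words whose center lies inside it."""
    texts: List[List[str]] = [[] for _ in cells]
    for text, (x, y, w, h) in words:
        cx, cy = x + w / 2, y + h / 2
        for i, (bx, by, bw, bh) in enumerate(cells):
            if bx <= cx <= bx + bw and by <= cy <= by + bh:
                texts[i].append(text)
                break  # a word belongs to at most one cell
    return [" ".join(parts) for parts in texts]


cells = [(0, 0, 100, 40), (100, 0, 100, 40)]
words = [
    ("Name", (10, 10, 40, 12)),
    ("Total", (110, 10, 40, 12)),
    ("Amount", (155, 10, 40, 12)),
    ("Footer", (300, 300, 10, 10)),  # outside every cell, so it is ignored
]
print(fill_cells(words, cells))
```

Using the word center rather than full containment keeps words that slightly overlap a cell border in the right cell.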
Finally, the result is stored as ans.json. You can use ./json_to_excel.py to get ans.xlsx.

Fig 1 - Image of the page from which the table needs to be extracted

Fig 2 - Table detection

Fig 3 - Smudging of the image using erosion and dilation to extract table structure

Fig 4 - Using Hough Lines to create cleaner table structure

Fig 5 - Cells detection

Fig 6 - Final Result

Run

Install poppler on windows/linux/mac
Install tesseract on linux/mac/windows before using pytesseract
pip install -r requirements.txt
export GOOGLE_APPLICATION_CREDENTIALS="/home/xyz/Documents/pdf-text-extract/abc21.json" -- IMPORTANT: this file contains the Google OCR credentials
If using detectron2 to detect tables, download the model from here and store it in ./model_table_detection/ as model_final.pth
Run in dev mode with: python app.py

Usage

parameters

readable : 0 for a non-readable PDF or 1 for a readable PDF
lines : 0 for tables without lines or 1 for tables with lines
pdf_links : list of PDF URLs

Ex: POST request through Python

import requests
import json

url = "http://35.200.217.56:3002/get_pdf_tables"
payload = '{"pdf_links":["https://abc.com/xyz/abc.pdf"],"lines":1}'
headers = {"content-type": "application/json"}

response = requests.request("POST", url, data=payload, headers=headers)
# json.loads returns the parsed response body, so it is printed directly
result = json.loads(response.text)

print(result)

TODO

Better table detection
Better data extraction from tables without lines
Improve tesseract
Optimize speed of extraction from ocr

Note for the author

Table detection using detectron has been removed. To re-enable it, make the necessary changes in multi_level_table/get_page_data.py by comparing it to master (not origin/master) in the original repo.

