Extracts tables from PDFs with the following properties:
- Readable PDFs -- not scanned, i.e. the text can be selected
- Non-readable PDFs -- scanned, i.e. the text cannot be selected
- Tables with lines
- Tables without lines
- First of all, the function `inference_pdf.get_tables` from `./inference_pdf.py` is called. It takes the `path` to the file along with two other arguments: `method`, indicating the method to use for OCR, and `debug`, indicating whether to save the intermediate files for later. It extracts the image of each page and the HTML of the whole PDF (the HTML of a PDF is essentially a raw representation of it -- think of it as a file containing all the words along with their positions).
- It builds a list of all the pages in the PDF and calls `get_page_data.get_data` from `./multi_level_table/get_page_data.py` for each page.
- In `get_page_data.get_data`, the page is first scanned for all tables using `detect_table.detect_table` from `./table_detect/detect_table.py`. Each table is then checked for whether it is scanned or written (i.e. whether its text can be selected, copied, etc.). If it is scanned, it is read using OCR; if it is written, it is extracted from the HTML. The collected data is returned in JSON format.
- For detection of tables, two methods are used: AI (using detectron2) and OpenCV.
- The model trained via transfer learning on a detectron2 base model is stored in `./model_table_detection`.
- Table detection using OpenCV is a little more complicated and is done by `table_detect/detect_table.py`. In this file, the function `detect_table` takes the path to an image file. It calls `functions/get_contours` from `./multi_level_table/functions.py`, which returns an image containing only the vertical and horizontal lines; it achieves this using erosion and dilation. Contours are then detected on this image using the OpenCV function `findContours` and filtered to keep only the tables (essentially by selecting the contours with no parent).
- For each table, if it is scanned, `read_table_ocr.image_to_df` is called from `./multi_level_table/read_table_ocr.py`. To detect the cells in the table, the table's lines are first redrawn using `hough_lines.hough_lines` in `./multi_level_table/hough_lines.py` (scanned PDFs can have blurred, curved, or otherwise non-straight lines). Contours are detected again within the table and the cells are filtered out (those contours which have no child). The content of each cell is read using OCR (Google Vision or tesseract) and mapped to its proper row and column.
- If the table is written, the content is read from the HTML of the PDF and the cells are detected using the same method as above. The content is then mapped to its proper row and column.
- Finally, the result is stored as `ans.json`. You can use `./json_to_excel.py` to get `ans.xlsx`.
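The line-extraction step described above relies on morphological opening (erosion followed by dilation) with long, thin kernels: ink runs shorter than the kernel, such as text, are erased, while ruled table lines survive and are restored to full length. The repo does this with OpenCV's `erode`/`dilate`; the following is a minimal pure-Python sketch of the horizontal case, with illustrative helper names that are not the repo's actual API:

```python
def erode_h(img, k):
    """Horizontal erosion: a pixel survives only if the whole k-wide
    window centred on it (within bounds) is filled with ones."""
    h, w = len(img), len(img[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if all(0 <= x + d < w and img[y][x + d] for d in range(-r, k - r)):
                out[y][x] = 1
    return out

def dilate_h(img, k):
    """Horizontal dilation: a pixel turns on if any pixel in the
    k-wide window centred on it is on."""
    h, w = len(img), len(img[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if any(0 <= x + d < w and img[y][x + d] for d in range(-r, k - r)):
                out[y][x] = 1
    return out

# A toy row: a long ruled line (10 px) followed by a short text blob (2 px).
row = [1] * 10 + [0] * 3 + [1] * 2 + [0] * 3
opened = dilate_h(erode_h([row], 5), 5)[0]
print(opened)  # the 10-px line survives, the 2-px blob is erased
```

In the real pipeline the same opening is repeated with a tall vertical kernel, and the two results are combined to produce the lines-only image fed to `findContours`.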
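The final "mapped to its proper row and column" step can be pictured as clustering the cells' top-left coordinates: cells whose top edges roughly align share a row, cells whose left edges roughly align share a column. A sketch under that assumption (the function name and pixel tolerance are illustrative, not the repo's actual code):

```python
def group_cells(boxes, tol=10):
    """Map each cell box (x, y, w, h) to (row, col, box).

    Boxes whose top edges lie within `tol` pixels of each other share a
    row; likewise for left edges and columns. `tol` is an example value.
    """
    def cluster(vals):
        # Greedily pick cluster centres left-to-right, then assign each
        # value to its nearest centre.
        centers = []
        for v in sorted(set(vals)):
            if not centers or v - centers[-1] > tol:
                centers.append(v)
        return {v: min(range(len(centers)), key=lambda i: abs(centers[i] - v))
                for v in set(vals)}

    rows = cluster([b[1] for b in boxes])
    cols = cluster([b[0] for b in boxes])
    return [(rows[b[1]], cols[b[0]], b) for b in boxes]

# Four cells of a 2x2 table, slightly misaligned as OCR contours often are.
boxes = [(0, 0, 50, 20), (60, 2, 50, 20), (0, 30, 50, 20), (61, 31, 50, 20)]
print([(r, c) for r, c, _ in group_cells(boxes)])
```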
- Install poppler (Windows/Linux/macOS)
- Install tesseract (Windows/Linux/macOS) before using pytesseract
- `pip install -r requirements.txt`
- `export GOOGLE_APPLICATION_CREDENTIALS="/home/xyz/Documents/pdf-text-extract/abc21.json"` -- important: this file contains the Google OCR credentials
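If you prefer not to export the variable in your shell, the same credential path can be set from Python before the Google Vision client is created (the path below is an example -- substitute your own service-account key file):

```python
import os

# Point the Google Cloud client libraries at the service-account key file.
# setdefault keeps any value already exported in the shell.
os.environ.setdefault("GOOGLE_APPLICATION_CREDENTIALS",
                      "/home/xyz/Documents/pdf-text-extract/abc21.json")
```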
- If using detectron2 to detect tables, download the model from here and store it in `./model_table_detection/` as `model_final.pth`
- Run in dev mode with: `python app.py`
- `readable`: either 0 for a non-readable PDF or 1 for a readable PDF
- `lines`: either 0 for tables without lines or 1 for tables with lines
- `pdf_links`: list of PDF URLs
Ex: POST request through Python

```python
import requests
import json

url = "http://35.200.217.56:3002/get_pdf_tables"
payload = '{"pdf_links": ["https://abc.com/xyz/abc.pdf"], "lines": 1}'
headers = {'content-type': "application/json"}

response = requests.request("POST", url, data=payload, headers=headers)
result = json.loads(response.text)
print(result)
```
- Better table detection
- Better data extraction from tables without lines
- Improve tesseract results
- Optimize the speed of OCR extraction
Removed table detection using detectron. To re-enable it, make the necessary changes in the `multi_level_table/get_page_data.py` file by comparing it to `master` (not `origin/master`) in the original repo.





