NLPatVCU/PDFtoTextExtractor

install.sh is for Debian Linux. Prerequisites: apt and Python above 3.8.

The project includes 3 main parts:
PDF Text Extractor - extracts text from a PDF
Image Extractor from PDF - extracts images and saves them to a folder
Text Visualizer - visualizes the text to show what the computer recognizes

If you are on Debian Linux, run:

sudo bash install.sh

Steps:

  1. Install tesseract-ocr and libtesseract-dev using your OS package manager
  2. Create a virtual environment: python3 -m venv venv
  3. Activate it: source venv/bin/activate
  4. Install all required libraries: pip install -r requirments.txt
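Before running the tools, it can help to confirm that the prerequisites from the steps above are in place. The following is a minimal sketch, not part of the repository: the function name `check_prerequisites` is hypothetical, it treats 3.8 as the lowest acceptable Python version, and it assumes the tesseract-ocr package puts an executable named `tesseract` on the PATH.

```python
import shutil
import sys

# Assumption: 3.8 is treated as the lowest acceptable Python version.
MIN_PYTHON = (3, 8)


def check_prerequisites(version_info=sys.version_info, which=shutil.which):
    """Return a list of problems found; an empty list means nothing was flagged.

    This is a hypothetical helper for illustration, not part of the project.
    """
    problems = []

    # Compare only (major, minor) against the minimum version.
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append("Python version is below 3.8")

    # Assumption: the tesseract-ocr package provides a `tesseract` executable.
    if which("tesseract") is None:
        problems.append("tesseract executable not found on PATH")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Problem:", problem)
```

Running the file directly prints one line per missing prerequisite and prints nothing when both checks pass. It does not check for libtesseract-dev or for the Python libraries in the requirements file.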

Depending on your workflow, use main.py if you want a graphical interface, or mainCLI.py if you want to use command-line arguments.

For mainCLI.py you can use either syntax:
python3 mainCLI.py PDFfile
or
python3 mainCLI.py PDFfile -o outputFileName
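The two command forms above amount to one required positional argument (the PDF file) and an optional `-o` argument for the output file name. As an illustration only, this is how such an interface could be declared with Python's standard `argparse` module. The names `build_parser`, `pdf_file`, and `output` are hypothetical and are not taken from the project's code, which may parse its arguments differently.

```python
import argparse


def build_parser():
    """Build a parser accepting `PDFfile` and an optional `-o outputFileName`.

    Hypothetical sketch; the real script may be implemented differently.
    """
    parser = argparse.ArgumentParser(
        description="Extract text from a PDF (illustrative sketch)."
    )
    # Required positional argument: the PDF to process.
    parser.add_argument("pdf_file", metavar="PDFfile", help="path to the input PDF")
    # Optional output file name; None means no name was given.
    parser.add_argument(
        "-o",
        dest="output",
        metavar="outputFileName",
        default=None,
        help="name of the file to write the extracted text to",
    )
    return parser


# Parse an explicit argument list instead of sys.argv to show both forms.
without_output = build_parser().parse_args(["paper.pdf"])
with_output = build_parser().parse_args(["paper.pdf", "-o", "paper.txt"])
print(without_output.pdf_file, without_output.output)
print(with_output.pdf_file, with_output.output)
```

What happens when `-o` is omitted (for example, printing to the terminal or picking a default file name) is not stated here, so the sketch leaves `output` as `None` in that case.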

For visualizer.py the syntax is:
python3 visualizer.py PDFfile