Having trouble copying text from a PDF? No problem: the Word2TXT app will help you extract text from images.
- Word to Images: An important feature. Start with this one
- JPEG to PNG: Not an essential feature, but useful if you want to convert JPEG to PNG
- OCR Images: The central feature. It exports images to text for you to copy and paste into Word. It supports 2 modes:
  - Fast: Exports quickly by using your CPU. For desktops, I recommend 4 CPUs. For laptops, I recommend 2 CPUs or fewer
  - Slow: A slower but more accurate mode. I recommend this mode for laptops, and it works for desktops too
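The two modes can be pictured as the same per-image OCR step, run either several at a time or one at a time. The sketch below is not Word2TXT's actual code. It assumes the `tesseract` command is on your PATH, and the names `ocr_image`, `ocr_folder`, and `workers` are hypothetical, chosen only for illustration. It models only the scheduling, not any accuracy difference between the modes.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Assumption: these are the image types found in the input folder.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def ocr_image(image: Path, lang: str = "eng") -> str:
    """Run the Tesseract CLI on one image and return the recognized text."""
    # Passing "stdout" as the output base makes tesseract print the text
    # instead of writing it to a file.
    result = subprocess.run(
        ["tesseract", str(image), "stdout", "-l", lang],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_folder(in_dir, out_dir, workers=1, ocr=ocr_image):
    """OCR every image in in_dir and write one .txt per image into out_dir.

    workers=1 processes images one at a time; workers>1 runs several at once.
    """
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    images = sorted(p for p in in_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES)

    if workers <= 1:
        texts = [ocr(p) for p in images]
    else:
        # Threads are enough here: the heavy work runs in tesseract subprocesses.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            texts = list(pool.map(ocr, images))

    written = []
    for image, text in zip(images, texts):
        target = out_dir / (image.stem + ".txt")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

Under these assumptions, `ocr_folder("images", "txt", workers=4)` would mirror the 4-CPU recommendation for desktops, and `workers=1` would process the folder sequentially.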
- Windows
- MacOS:
  brew install tesseract
- Linux:
  sudo apt install tesseract-ocr

- Windows: Go to Tessdata and download the language you want, or all languages
- MacOS:
  brew install tesseract-lang
- Linux:
  sudo apt install tesseract-ocr-all

You can also install other types of language data from:
- Best: https://github.com/tesseract-ocr/tessdata_best
- Fast: https://github.com/tesseract-ocr/tessdata_fast
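To confirm that a language was picked up after installation, you can ask Tesseract itself. The sketch below is an illustration, not part of Word2TXT: it assumes the `tesseract` command is on your PATH and relies on its `--list-langs` option. The helper names `parse_langs` and `installed_langs` are hypothetical.

```python
import subprocess


def parse_langs(output: str) -> list[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of available languages"):
            continue
        langs.append(line)
    return sorted(langs)


def installed_langs() -> list[str]:
    """Ask the Tesseract CLI which languages it can currently load."""
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=True
    )
    # Assumption: reading both streams is safe; some releases reportedly
    # print the list on stderr rather than stdout.
    return parse_langs(result.stdout + "\n" + result.stderr)
```

Calling `installed_langs()` returns the language codes Tesseract reports. If a language you installed is not in the list, its `traineddata` file is probably not in the folder Tesseract reads from.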
There are 2 ways to use the application:
- Download from Releases
- Clone the repository:
  git clone https://github.com/WMZS-Modding/Word2TXT.git
  And then run:
  python main.py

- First, convert your PDF to DOCX. I recommend using Google Drive and Google Docs
- Second, open the application you've downloaded and extracted
- Then, choose the Word2PNG section. Choose your input DOCX and output folder. The JPEG2PNG section is optional, but useful if you prefer PNG instead of JPEG
- Next, choose the OCR Images section. I recommend choosing Slow mode to get a better result. Choose your language in the Language part (languages show up automatically after you paste the traineddata files (Windows) or install a language (MacOS and Linux)). Choose your input image folder and output TXT folder. In Fast mode, I recommend choosing 4 CPUs
- Finally, check your output TXT folder to see the result
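For large folders, checking the output by eye gets tedious. The sketch below shows one way to spot images that produced no text. It is not part of Word2TXT, and it assumes each output file reuses the image's base name with a `.txt` extension, which may not match the app's real naming. The function name `missing_outputs` is hypothetical.

```python
from pathlib import Path

# Assumption: these are the image types found in the input folder.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def missing_outputs(image_dir, txt_dir):
    """Return names of images whose .txt counterpart is missing or empty."""
    image_dir, txt_dir = Path(image_dir), Path(txt_dir)
    missing = []
    for image in sorted(image_dir.iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # Assumption: the output file reuses the image's base name.
        txt = txt_dir / (image.stem + ".txt")
        if not txt.is_file() or txt.stat().st_size == 0:
            missing.append(image.name)
    return missing
```

An empty result means every image has a non-empty TXT file. A blank page can legitimately produce an empty file, so treat the returned names as pages to look at, not as certain failures.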
- Fast mode advantages: Runs on CPU. Can process large image folders. Gives fast results
- Fast mode disadvantages: Sometimes this mode is too "rushed", so it skims the image and gives results that are missing some words
- Slow mode advantages: Processes image folders very carefully. Gives more accurate results
- Slow mode disadvantages: Slow, and can take a long time to process large image folders. With 128 images or more, processing takes noticeably longer
- Fork this repository
- Make your own changes
- Send me a pull request