Word2TXT

Having trouble copying text from a PDF? No problem: the Word2TXT app helps you extract text from images.

Features

Word to Images: An important feature, and the one you should start with
JPEG to PNG: An optional feature, useful if you want to convert JPEG images to PNG
OCR Images: The central feature. It extracts the text from your images so you can copy and paste it into Word. It supports 2 modes:

Fast: Exports quickly by using several CPUs. For desktops, I recommend 4 CPUs. For laptops, I recommend 2 CPUs or fewer
Slow: A slower but more accurate mode. I recommend this mode for laptops, and it works well on desktops too
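The difference between the two modes can be pictured as one image at a time versus several images at once. The sketch below is not the app's actual code. It assumes that Fast mode runs several OCR jobs concurrently and that Slow mode runs them sequentially. It does not model the accuracy difference between the modes. The function names `ocr_with_tesseract` and `ocr_images` are hypothetical, and the `tesseract` command must be installed for the default OCR function to work.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable, List


def ocr_with_tesseract(image: Path, lang: str = "eng") -> str:
    """Run the tesseract CLI on one image and return the recognized text."""
    # Passing "stdout" as the output base makes tesseract print the text
    # instead of writing it to a file.
    result = subprocess.run(
        ["tesseract", str(image), "stdout", "-l", lang],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def ocr_images(
    images: Iterable[Path],
    ocr_one: Callable[[Path], str] = ocr_with_tesseract,
    workers: int = 1,
) -> List[str]:
    """OCR images in order. workers=1 is sequential, workers>1 runs several at once."""
    images = list(images)
    if workers <= 1:
        # Assumed "Slow" behaviour: one image after another.
        return [ocr_one(image) for image in images]
    # Assumed "Fast" behaviour: each call mostly waits on an external
    # tesseract process, so threads are enough to keep several CPUs busy.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_one, images))  # map() keeps the input order
```

Under these assumptions, `ocr_images(paths, workers=4)` would correspond to Fast mode with 4 CPUs, and `ocr_images(paths)` to Slow mode.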

Installation

1. Download `Tesseract OCR`

Windows
macOS:

brew install tesseract

Linux:

sudo apt install tesseract-ocr

2. Install languages

Windows: Go to Tessdata and download the traineddata file for each language you want, or for all languages
macOS:

brew install tesseract-lang

Linux:

sudo apt install tesseract-ocr-all
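To confirm that Tesseract and its languages are installed, you can ask the `tesseract` command itself. This is a minimal sketch, separate from the app: the helper names are hypothetical, and the header line of `tesseract --list-langs` is assumed to start with "List of", as it does in recent versions.

```python
import shutil
import subprocess
from typing import List, Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def parse_list_langs(output: str) -> List[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # The first line is a header, for example:
    # List of available languages in "/usr/share/tessdata/" (3):
    if lines and lines[0].lower().startswith("list of"):
        lines = lines[1:]
    return lines


def installed_languages() -> List[str]:
    """Ask the local tesseract install which languages it can use."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some tesseract versions print the list on stderr, so fall back to it.
    return parse_list_langs(result.stdout or result.stderr)
```

If `find_tesseract()` returns `None`, Tesseract is not on your PATH yet. Otherwise `installed_languages()` should return codes such as `eng` for every language you installed.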

Also, you can install language types from:

3. Use the application

You can get the application in 2 ways:

Download it from Releases
Clone the repository:

git clone https://github.com/WMZS-Modding/Word2TXT.git

And then run:

python main.py

4. Usage

First, convert your PDF to DOCX. I recommend using Google Drive and Google Docs
Second, open the application you downloaded and extracted
Then, choose the Word2PNG section. Choose your input DOCX and your output folder. The JPEG2PNG section is optional, but it is useful if you prefer PNG over JPEG
Next, choose the OCR Images section. I recommend Slow mode for better results. Choose your language in the Language part (languages show up automatically after you paste the traineddata files on Windows, or install the languages on macOS and Linux). Choose your input image folder and your output TXT folder. If you use Fast mode, I recommend choosing 4 CPUs
Finally, check your output TXT folder for the results
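The last two steps map an input image folder to an output TXT folder. The sketch below shows one way to do that mapping. It is an illustration, not the app's code: the function names, the accepted image suffixes, and the rule of one TXT file per image with the same base name are all assumptions.

```python
from pathlib import Path
from typing import Callable, List, Tuple

# Assumed set of accepted image types; the app may accept others.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def plan_outputs(input_dir: Path, output_dir: Path) -> List[Tuple[Path, Path]]:
    """Pair every image in input_dir with a .txt path of the same base name in output_dir."""
    images = sorted(
        p for p in input_dir.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    return [(image, output_dir / (image.stem + ".txt")) for image in images]


def export_folder(
    input_dir: Path,
    output_dir: Path,
    ocr_one: Callable[[Path], str],
) -> List[Path]:
    """Write one TXT file per image and return the written paths."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image, txt_path in plan_outputs(input_dir, output_dir):
        # ocr_one turns an image path into text, e.g. a wrapper around tesseract.
        txt_path.write_text(ocr_one(image), encoding="utf-8")
        written.append(txt_path)
    return written
```

Here `ocr_one` is any function that takes an image path and returns its text, so the folder logic can be tried with a stand-in function before Tesseract is set up.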

Advantages and Disadvantages

Fast

Advantages: Runs on several CPUs. Can process large image folders. Gives fast results
Disadvantages: This mode can be too "rushed": it sometimes skims an image and gives results that are missing some words

Slow

Advantages: Processes image folders very carefully. Gives more accurate results
Disadvantages: Slow, and can take a long time to process large image folders. With 128 images or more, the processing time increases further

Contributing

Fork this repository
Make your changes
Send me a pull request
