Rtesseract package

This is an R interface to the tesseract OCR (Optical Character Recognition) system.

tesseract is available at https://code.google.com/p/tesseract-ocr/.

More recent versions are available on GitHub at https://github.com/tesseract-ocr/tesseract

Installing tesseract requires first installing leptonica (http://www.leptonica.com/).

This is currently a basic interface to the essential functionality, with some added R functionality to visualize the results.

The package provides functionality to get the recognized text, and it allows us to do this at several levels, e.g. word, character, or line.
We can create a searchable and selectable PDF version of the image(s).
We can output the results of the OCR to a tab-separated-value file, an HTML (hOCR) file, or a BoxText, UNLV, or OSD file.
We can use different page segmentation modes to detect/recognize lines on the image, which is useful for processing tables where the lines separate rows or columns.
We can get the confidence for each recognized text element to understand whether it is a good match or not.
We can get the location and dimensions of each text element. Again, this is necessary for processing tables and other structured content.
We can display the matched text and the associated confidences to see spatial patterns. We can also overlay these on the original image.
We can restrict the recognition to a sub-rectangle of the image.
The package provides lower-level access to the C++ API, allowing more fine-grained, efficient, and flexible programmatic use.
We can set and query many variables controlling tesseract's behavior.
We can query details about the image.
We can manipulate an image as an array of pixels.
We interface to numerous leptonica routines to process images, e.g., convert to grayscale or binary images, and rotate and transpose images.
We can read images and their metadata to determine their formats.
We can read multi-page TIFF documents.
We can query metadata about the version of tesseract, the supported image formats, etc.

We can machine-generate the interface to the other methods and classes in the tesseract API/library.

Converting Documents to Images

Often we will start with a scanned document already as a single image. Assuming leptonica was installed with support for that image format, we can read the image directly.

Multipage PDF

In many of our use cases, we start with a PDF document that consists of multiple scanned pages, each of which is a scanned image. Tesseract and leptonica do not read PDF directly, so we need to convert the PDF document into a different format. We use ImageMagick, and specifically its very general and powerful convert command, to convert between image formats.

Separate Image File for each Page

If we want to create a separate image for each page in the original PDF, we can use the script pdf2png in this package (inst/scripts/pdf2png). This hides some of the details of convert. (This can convert to JPEG and other formats, in spite of what the name suggests.)

pdf2png SMITHBURN_1952.pdf

This generates PNG files named SMITHBURN_1952_0000.png, ... The filename format can be changed.

We can also specify the density (the resolution in dots per inch), the quality/level of compression, and any other command-line arguments convert supports.
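For readers who want to see what such a conversion looks like without the wrapper, here is a minimal sketch that calls convert directly. It is not the contents of pdf2png: the helper name pdf_to_pngs, its -n dry-run flag, and the default values of 300 and 90 are assumptions made for illustration. The -density and -quality options and the %04d output filename pattern are standard ImageMagick features.

```shell
#!/bin/sh
# Sketch: split a multipage PDF into one PNG per page with ImageMagick.
# pdf_to_pngs is a hypothetical helper, not part of Rtesseract or pdf2png.
# Usage: pdf_to_pngs [-n] file.pdf [density] [quality]
pdf_to_pngs() {
    dry_run=0
    if [ "$1" = "-n" ]; then dry_run=1; shift; fi
    pdf=$1
    density=${2:-300}   # rasterization resolution in dots per inch (assumed default)
    quality=${3:-90}    # compression setting; its meaning depends on the output format
    base=${pdf%.pdf}    # drop the .pdf extension to form the output prefix
    # -density must come before the input file so it applies when the PDF is rasterized.
    # convert expands %04d to the zero-padded page index (0000, 0001, ...).
    set -- convert -density "$density" "$pdf" -quality "$quality" "${base}_%04d.png"
    if [ "$dry_run" -eq 1 ]; then
        printf '%s\n' "$*"   # only show the command
    else
        "$@"                 # run convert
    fi
}

# Print the command that would be run for the example document.
pdf_to_pngs -n SMITHBURN_1952.pdf
```

Dropping the -n flag runs the conversion. ImageMagick's default density is 72 dots per inch, which is usually too coarse for OCR, so the sketch sets a higher value explicitly.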

Multipage Image Format

Alternatively, we can convert the PDF document to a multi-page TIFF file, i.e. a single TIFF file that contains multiple images. We then read this into R using the readMultipageTiff() function and access each page from the resulting list.

To convert a multipage PDF document to a multipage TIFF file, use, e.g.,

convert SMITHBURN_1952.pdf SMITHBURN_1952.tiff
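The command above rasterizes the PDF at ImageMagick's default resolution. The following sketch shows one way to set the resolution and compress the result. The helper name pdf_to_tiff, its -n dry-run flag, the 300 dpi default, and the choice of LZW compression are assumptions for illustration, not settings prescribed by this package. -density and -compress are standard convert options.

```shell
#!/bin/sh
# Sketch: convert a multipage PDF to a single multipage TIFF.
# pdf_to_tiff is a hypothetical helper, not part of Rtesseract.
# Usage: pdf_to_tiff [-n] file.pdf [density]
pdf_to_tiff() {
    dry_run=0
    if [ "$1" = "-n" ]; then dry_run=1; shift; fi
    pdf=$1
    density=${2:-300}       # dots per inch used to rasterize each page (assumed default)
    tiff=${pdf%.pdf}.tiff   # same base name with a .tiff extension
    # LZW is a lossless TIFF compression scheme; all pages go into one output file.
    set -- convert -density "$density" "$pdf" -compress lzw "$tiff"
    if [ "$dry_run" -eq 1 ]; then
        printf '%s\n' "$*"   # only show the command
    else
        "$@"                 # run convert
    fi
}

# Print the command that would be run for the example document.
pdf_to_tiff -n SMITHBURN_1952.pdf
```

The resulting TIFF file is the kind of input that readMultipageTiff() is described as reading.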

History

We - Matt Espe & Duncan Temple Lang - started developing this package in April 2015.
