Extract text from pdf area

For text extraction, pdfboxing currently uses [org.apache.pdfbox.text.PDFTextStripper](https://pdfbox.apache.org/docs/2.0.13/javadocs/org/apache/pdfbox/text/PDFTextStripper.html), which works on the entire document. However, any document structure is lost during text extraction, so the more data the PDF contains, the harder it becomes to sort out the result.

As an alternative, there's also [org.apache.pdfbox.text.PDFTextStripperByArea](https://pdfbox.apache.org/docs/2.0.13/javadocs/org/apache/pdfbox/text/PDFTextStripperByArea.html), which lets you specify a rectangle to extract text from. It gives pretty good results on PDF files with (visually) structured content.

I have prepared a rough prototype that seems to work:

```clojure
(ns my-ns
  (:require [pdfboxing.common :as common])
  (:import (org.apache.pdfbox.text PDFTextStripperByArea)
           (java.awt Rectangle)))

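(defn extract-by-area
  "Get the text inside a rectangular area of one page of a PDF document.
  x, y, w and h are in points; page is a zero-based page index."
  [pdfdoc x y w h page]
  (with-open [doc (common/obtain-document pdfdoc)]
    (let [rectangle       (Rectangle. x y w h)
          ;; PDDocument.getPage takes a zero-based index, so pass page through as-is
          pdpage          (.getPage doc page)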
          pdftextstripper (doto (PDFTextStripperByArea.)
                            (.addRegion "region" rectangle)
                            (.extractRegions pdpage))]
      (.getTextForRegion pdftextstripper "region"))))
```
@dotemacs would you (or anyone else around, for that matter) be interested in this functionality?

If so let me know and I'll put some time into making a proper PR.

Note: the rectangle coordinates are given in points (1 pt = 1/72 in, i.e. ~0.0353 cm or ~0.0139 in).
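
Since most people measure a page in centimetres or inches rather than points, a small conversion helper might be handy alongside the prototype. This is only a sketch: `in->pt` and `cm->pt` are names I made up for illustration and are not part of pdfboxing or PDFBox. They round to whole numbers because `java.awt.Rectangle` takes integer coordinates.

```clojure
(defn in->pt
  "Convert inches to points (72 pt per inch), rounded to the nearest whole point."
  [inches]
  (Math/round (double (* 72 inches))))

(defn cm->pt
  "Convert centimetres to points (1 in = 2.54 cm), rounded to the nearest whole point."
  [cm]
  (Math/round (double (/ (* 72 cm) 2.54))))

(comment
  ;; Hypothetical usage with the prototype above; "invoice.pdf" is a made-up file.
  ;; Arguments are x, y, width, height, followed by the page argument.
  (extract-by-area "invoice.pdf" (cm->pt 2) (cm->pt 3) (cm->pt 10) (cm->pt 4) 0))
```

Rounding loses at most half a point (under 0.02 cm), which should not matter for picking out a region by eye.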
