Commit df57705

🏊
1 parent f281491 commit df57705

2 files changed: +35 -15 lines changed

readme.md

Lines changed: 4 additions & 4 deletions
@@ -34,7 +34,7 @@ _You must own the ebook on Kindle for this project to work._

 ### How does it work?

-It works by logging into your [Kindle web reader](https://read.amazon.com) account using [Playwright](https://playwright.dev), exporting each page of a book as a PNG image, and then using a vLLM (`gpt-4o` or `gpt-4o-mini`) to transcribe the text from each page to text. Once we have the raw book contents and metadata, then it's easy to convert it to PDF, EPUB, etc. 🔥
+It works by logging into your [Kindle web reader](https://read.amazon.com) account using [Playwright](https://playwright.dev), exporting each page of a book as a PNG image, and then using a vLLM (defaulting to `gpt-4.1-mini`) to transcribe the text from each page to text. Once we have the raw book contents and metadata, then it's easy to convert it to PDF, EPUB, etc. 🔥

 This [example](./examples/B0819W19WD) uses the first page of the scifi book [Revelation Space](https://www.amazon.com/gp/product/B0819W19WD?ref_=dbs_m_mng_rwt_calw_tkin_0&storeType=ebooks) by [Alastair Reynolds](https://www.goodreads.com/author/show/51204.Alastair_Reynolds):

@@ -66,7 +66,7 @@ This [example](./examples/B0819W19WD) uses the first page of the scifi book [Rev
 </tr>
 <tr>
 <td>
-We then convert each page's screenshot into text using one of OpenAI's vLLMs (<strong>gpt-4o</strong> or <strong>gpt-4o-mini</strong>).
+We then convert each page's screenshot into text using one of OpenAI's vLLMs (<strong>gpt-4.1-mini</strong>.
 </td>
 <td>
 <p>Mantell Sector, North Nekhebet, Resurgam, Delta Pavonis system, 2551</p>
@@ -202,7 +202,7 @@ npx tsx src/transcribe-book-content.ts
 ```

 - _(This takes a few minutes to run)_
-- This takes each of the page screenshots and runs them through a vLLM (`gpt-4o` or `gpt-4o-mini`) to extract the raw text content from each page of the book.
+- This takes each of the page screenshots and runs them through a vLLM (defaulting to `gpt-4.1-mini`) to extract the raw text content from each page of the book.
 - It then stitches these text chunks together, taking into account chapter boundaries.
 - The result is stored as JSON to `out/${asin}/content.json`.
 - Example: [examples/B0819W19WD/content.json](./examples/B0819W19WD/content.json)
@@ -284,7 +284,7 @@ Compared with these approaches, the approach used by this project is much easier

 The main downside is that it's possible for some transcription errors to occur during the `image ⇒ text` step - which uses a multimodal LLM and is not 100% deterministic. In my testing, I've been remarkably surprised with how accurate the results are, but there are occasional issues mostly with differentiating whitespace between paragraphs versus soft section breaks. Note that both Calibre and Epubor also use heuristics to deal with things like spacing and dashes used by wordwrap, so the fidelity of the conversions will not be 100% one-to-one with the original Kindle version in any case.

-The other downside is that the **LLM costs add up to a few dollars per book using `gpt-4o`** or **around 30 cents per book using `gpt-4o-mini`**. With LLM costs constantly decreasing and local vLLMs, this cost per book should be free or almost free soon. The screenshots are also really good quality with no extra content, so you could swap any other OCR solution for the vLLM-based `image ⇒ text` quite easily.
+The other downside is that the **LLM costs add up to a dollars per book using `gpt-4.1-mini`**. With LLM costs constantly decreasing and local vLLMs, this cost per book should be free or almost free soon. The screenshots are also really good quality with no extra content, so you could swap any other OCR solution for the vLLM-based `image ⇒ text` quite easily.

 ### How is the accuracy?

src/extract-kindle-book.ts

Lines changed: 31 additions & 11 deletions
@@ -300,22 +300,36 @@ async function main() {
   }

   async function updateSettings() {
-    await page.locator('ion-button[aria-label="Reader settings"]').click()
+    console.log('Looking for Reader settings button')
+    const settingsButton = page
+      .locator(
+        'ion-button[aria-label="Reader settings"], ' +
+          'button[aria-label="Reader settings"]'
+      )
+      .first()
+    await settingsButton.waitFor({ timeout: 30_000 })
+    console.log('Clicking Reader settings')
+    await settingsButton.click()
     await delay(500)

     // Change font to Amazon Ember
     // My hypothesis is that this font will be easier for OCR to transcribe...
     // TODO: evaluate different fonts & settings
+    console.log('Changing font to Amazon Ember')
     await page.locator('#AmazonEmber').click()
+    await delay(200)

     // Change layout to single column
+    console.log('Changing to single column layout')
     await page
       .locator('[role="radiogroup"][aria-label$=" columns"]', {
         hasText: 'Single Column'
       })
       .click()
+    await delay(200)

-    await page.locator('ion-button[aria-label="Reader settings"]').click()
+    console.log('Closing settings')
+    await settingsButton.click()
     await delay(500)
   }
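
The new `settingsButton` locator in this hunk matches the same `aria-label` on either an `ion-button` or a plain `button`. As a minimal sketch of that fallback idea, the selector list could be generated rather than concatenated by hand. The helper name `ariaLabelSelector` is hypothetical and is not part of this commit:

```typescript
// Hypothetical helper (not in the commit): build a CSS selector list that
// matches any of the given tag names carrying the same aria-label value.
function ariaLabelSelector(tags: string[], label: string): string {
  // Escape backslashes and double quotes so the label stays a valid
  // double-quoted CSS attribute value.
  const escaped = label.replace(/\\/g, '\\\\').replace(/"/g, '\\"')
  return tags.map((tag) => `${tag}[aria-label="${escaped}"]`).join(', ')
}

const selector = ariaLabelSelector(['ion-button', 'button'], 'Reader settings')
console.log(selector)
// prints: ion-button[aria-label="Reader settings"], button[aria-label="Reader settings"]
```

The resulting string could be passed to `page.locator(selector).first()` as the diff does. Note that a comma-separated selector list matches elements in document order, so `.first()` returns whichever matching element appears first on the page, not necessarily the first tag named in the list.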

@@ -410,6 +424,15 @@ async function main() {
   await ensureFixedHeaderUI()
   await updateSettings()

+  console.log('Waiting for book reader to load...')
+  await page
+    .waitForSelector(krRendererMainImageSelector, { timeout: 60_000 })
+    .catch(() => {
+      console.warn(
+        'Main reader content may not have loaded, continuing anyway...'
+      )
+    })
+
   // Record the initial page navigation so we can reset back to it later
   const initialPageNav = await getPageNav()

@@ -449,10 +472,15 @@ async function main() {
     .length
   await writeResultMetadata()

+  // 56 sections
+  // page => startPosition
+  // "startPositionId": 234954
+  // "endPositionId": 236216
+  // "wordsInPage": 269
+
   // Navigate to the first content page of the book
   await goToPage(result.nav.startContentPage)

-  // let maxPageSeen = -1
   let done = false
   console.warn(
     `\nreading ${result.nav.totalNumContentPages} content pages out of ${result.nav.totalNumPages} total pages...\n`
@@ -470,13 +498,6 @@ async function main() {
       break
     }

-    // TODO: this doesn't technically work since page ordering is not guaranteed
-    // to monotonically increase w.r.t. position ordering.
-    // if (pageNav.page < maxPageSeen) {
-    //   break
-    // }
-    // maxPageSeen = Math.max(maxPageSeen, pageNav.page)
-
     const index = result.pages.length

     const src = (await page
@@ -610,7 +631,6 @@ async function main() {
     await goToPage(initialPageNav.page)
   }

-  // await page.close()
   await context.close()
   await context.browser()?.close()
 }
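
The `waitForSelector(...).catch(...)` call added before `getPageNav()` is a "soft wait": it waits up to a deadline, logs a warning on failure, and lets the script continue. A minimal, Playwright-independent sketch of that pattern follows. The name `softWait` and its parameters are assumptions for illustration and do not appear in the commit:

```typescript
// Hypothetical helper (not in the commit): await `task` for at most
// `timeoutMs`. On timeout or on any rejection of `task`, call `onGiveUp`
// and resolve to `undefined` instead of throwing, so the caller continues.
async function softWait<T>(
  task: Promise<T>,
  timeoutMs: number,
  onGiveUp: (reason: unknown) => void
): Promise<T | undefined> {
  let timer: ReturnType<typeof setTimeout> | undefined
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`timed out after ${timeoutMs}ms`)),
      timeoutMs
    )
  })
  try {
    // Whichever settles first wins; the loser is ignored.
    return await Promise.race([task, deadline])
  } catch (reason) {
    onGiveUp(reason)
    return undefined
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    clearTimeout(timer)
  }
}

// Usage: a task that never settles is abandoned after 20ms.
softWait(new Promise<string>(() => {}), 20, () =>
  console.warn('content may not have loaded, continuing anyway...')
).then((value) => console.log('result:', value))
```

Like the `.catch(() => ...)` in the diff, this swallows every failure and not only timeouts, which keeps a long extraction run going at the cost of hiding real errors. Playwright's own `timeout` option already bounds the wait in the commit, so a wrapper like this would mainly matter for promises that have no built-in deadline.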
