🏊

transitive-bullshit · transitive-bullshit · commit df577050a1b9 · 2025-10-23T17:19:44.000+07:00
diff --git a/readme.md b/readme.md
@@ -34,7 +34,7 @@ _You must own the ebook on Kindle for this project to work._
 
 ### How does it work?
 
-It works by logging into your [Kindle web reader](https://read.amazon.com) account using [Playwright](https://playwright.dev), exporting each page of a book as a PNG image, and then using a vLLM (`gpt-4o` or `gpt-4o-mini`) to transcribe the text from each page to text. Once we have the raw book contents and metadata, then it's easy to convert it to PDF, EPUB, etc. 🔥
+It works by logging into your [Kindle web reader](https://read.amazon.com) account using [Playwright](https://playwright.dev), exporting each page of a book as a PNG image, and then using a vLLM (defaulting to `gpt-4.1-mini`) to transcribe the text from each page. Once we have the raw book contents and metadata, then it's easy to convert it to PDF, EPUB, etc. 🔥
 
 This [example](./examples/B0819W19WD) uses the first page of the scifi book [Revelation Space](https://www.amazon.com/gp/product/B0819W19WD?ref_=dbs_m_mng_rwt_calw_tkin_0&storeType=ebooks) by [Alastair Reynolds](https://www.goodreads.com/author/show/51204.Alastair_Reynolds):
 
@@ -66,7 +66,7 @@ This [example](./examples/B0819W19WD) uses the first page of the scifi book [Rev
     </tr>
     <tr>
       <td>
-        We then convert each page's screenshot into text using one of OpenAI's vLLMs (<strong>gpt-4o</strong> or <strong>gpt-4o-mini</strong>).
+        We then convert each page's screenshot into text using one of OpenAI's vLLMs (<strong>gpt-4.1-mini</strong> by default).
       </td>
       <td>
         <p>Mantell Sector, North Nekhebet, Resurgam, Delta Pavonis system, 2551</p>
@@ -202,7 +202,7 @@ npx tsx src/transcribe-book-content.ts
 ```
 
 - _(This takes a few minutes to run)_
-- This takes each of the page screenshots and runs them through a vLLM (`gpt-4o` or `gpt-4o-mini`) to extract the raw text content from each page of the book.
+- This takes each of the page screenshots and runs them through a vLLM (defaulting to `gpt-4.1-mini`) to extract the raw text content from each page of the book.
 - It then stitches these text chunks together, taking into account chapter boundaries.
 - The result is stored as JSON to `out/${asin}/content.json`.
 - Example: [examples/B0819W19WD/content.json](./examples/B0819W19WD/content.json)
@@ -284,7 +284,7 @@ Compared with these approaches, the approach used by this project is much easier
 
 The main downside is that it's possible for some transcription errors to occur during the `image ⇒ text` step - which uses a multimodal LLM and is not 100% deterministic. In my testing, I've been remarkably surprised with how accurate the results are, but there are occasional issues mostly with differentiating whitespace between paragraphs versus soft section breaks. Note that both Calibre and Epubor also use heuristics to deal with things like spacing and dashes used by wordwrap, so the fidelity of the conversions will not be 100% one-to-one with the original Kindle version in any case.
 
-The other downside is that the **LLM costs add up to a few dollars per book using `gpt-4o`** or **around 30 cents per book using `gpt-4o-mini`**. With LLM costs constantly decreasing and local vLLMs, this cost per book should be free or almost free soon. The screenshots are also really good quality with no extra content, so you could swap any other OCR solution for the vLLM-based `image ⇒ text` quite easily.
+The other downside is that the **LLM costs add up to a dollars per book using `gpt-4.1-mini`**. With LLM costs constantly decreasing and local vLLMs, this cost per book should be free or almost free soon. The screenshots are also really good quality with no extra content, so you could swap any other OCR solution for the vLLM-based `image ⇒ text` quite easily.
 
 ### How is the accuracy?
 
diff --git a/src/extract-kindle-book.ts b/src/extract-kindle-book.ts
@@ -300,22 +300,36 @@ async function main() {
   }
 
   async function updateSettings() {
-    await page.locator('ion-button[aria-label="Reader settings"]').click()
+    console.log('Looking for Reader settings button')
+    const settingsButton = page
+      .locator(
+        'ion-button[aria-label="Reader settings"], ' +
+          'button[aria-label="Reader settings"]'
+      )
+      .first()
+    await settingsButton.waitFor({ timeout: 30_000 })
+    console.log('Clicking Reader settings')
+    await settingsButton.click()
     await delay(500)
 
     // Change font to Amazon Ember
     // My hypothesis is that this font will be easier for OCR to transcribe...
     // TODO: evaluate different fonts & settings
+    console.log('Changing font to Amazon Ember')
     await page.locator('#AmazonEmber').click()
+    await delay(200)
 
     // Change layout to single column
+    console.log('Changing to single column layout')
     await page
       .locator('[role="radiogroup"][aria-label$=" columns"]', {
         hasText: 'Single Column'
       })
       .click()
+    await delay(200)
 
-    await page.locator('ion-button[aria-label="Reader settings"]').click()
+    console.log('Closing settings')
+    await settingsButton.click()
     await delay(500)
   }
 
@@ -410,6 +424,15 @@ async function main() {
   await ensureFixedHeaderUI()
   await updateSettings()
 
+  console.log('Waiting for book reader to load...')
+  await page
+    .waitForSelector(krRendererMainImageSelector, { timeout: 60_000 })
+    .catch(() => {
+      console.warn(
+        'Main reader content may not have loaded, continuing anyway...'
+      )
+    })
+
   // Record the initial page navigation so we can reset back to it later
   const initialPageNav = await getPageNav()
 
@@ -449,10 +472,15 @@ async function main() {
     .length
   await writeResultMetadata()
 
+  // 56 sections
+  // page => startPosition
+  // "startPositionId": 234954
+  // "endPositionId": 236216
+  // "wordsInPage": 269
+
   // Navigate to the first content page of the book
   await goToPage(result.nav.startContentPage)
 
-  // let maxPageSeen = -1
   let done = false
   console.warn(
     `\nreading ${result.nav.totalNumContentPages} content pages out of ${result.nav.totalNumPages} total pages...\n`
@@ -470,13 +498,6 @@ async function main() {
       break
     }
 
-    // TODO: this doesn't technically work since page ordering is not guaranteed
-    // to monotonically increase w.r.t. position ordering.
-    // if (pageNav.page < maxPageSeen) {
-    //   break
-    // }
-    // maxPageSeen = Math.max(maxPageSeen, pageNav.page)
-
     const index = result.pages.length
 
     const src = (await page
@@ -610,7 +631,6 @@ async function main() {
     await goToPage(initialPageNav.page)
   }
 
-  // await page.close()
   await context.close()
   await context.browser()?.close()
 }
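
Two patterns recur in the `src/extract-kindle-book.ts` changes above: a selector list that matches the same `aria-label` on either `ion-button` or `button`, and a readiness wait that warns instead of aborting when it times out. A minimal sketch of both as standalone helpers follows. The names `attrSelectorForTags` and `waitOrWarn` are hypothetical and are not part of this commit or of Playwright; the sketch only illustrates the idea and may not match how the author would factor it.

```typescript
// Hypothetical helper (not in the commit): build one CSS selector list that
// matches the same attribute/value pair on several candidate tag names.
function attrSelectorForTags(
  tags: string[],
  attr: string,
  value: string
): string {
  if (tags.length === 0) {
    throw new Error('at least one tag name is required')
  }

  // Escape backslashes and double quotes so the value stays inside the
  // quoted attribute string.
  const escaped = value.replace(/\\/g, '\\\\').replace(/"/g, '\\"')

  return tags.map((tag) => `${tag}[${attr}="${escaped}"]`).join(', ')
}

// Hypothetical helper (not in the commit): await a readiness promise, but
// downgrade a rejection (for example a timeout) to a warning.
// Resolves to true if `ready` resolved and false if it rejected.
async function waitOrWarn(
  ready: Promise<unknown>,
  message: string
): Promise<boolean> {
  try {
    await ready
    return true
  } catch {
    console.warn(message)
    return false
  }
}
```

With these, the settings lookup could presumably be written as `page.locator(attrSelectorForTags(['ion-button', 'button'], 'aria-label', 'Reader settings')).first()`, and the reader wait as `await waitOrWarn(page.waitForSelector(krRendererMainImageSelector, { timeout: 60_000 }), 'Main reader content may not have loaded, continuing anyway...')`. The boolean result lets a caller decide later whether to retry, which the inline `.catch()` in the diff does not expose.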