
Commit 3648fdc

Merge pull request #22 from transitive-bullshit/feature/improve-extraction
2 parents f2c8f0b + 4407c16

14 files changed: +621 −796 lines

eslint.config.js

Lines changed: 2 additions & 6 deletions

```diff
@@ -1,8 +1,4 @@
 import { config } from '@fisch0920/config/eslint'
+import { globalIgnores } from 'eslint/config'
 
-export default [
-  ...config,
-  {
-    ignores: ['**/out/**', '**/dist/**']
-  }
-]
+export default [...config, globalIgnores(['out', 'dist', 'examples'])]
```

package.json

Lines changed: 2 additions & 1 deletion

```diff
@@ -37,13 +37,14 @@
     "playwright": "^1.56.1",
     "playwright-core": "^1.56.1",
     "sharp": "^0.34.4",
+    "sort-keys": "^6.0.0",
     "tar": "^7.5.1",
     "tempy": "^3.1.0",
     "type-fest": "^5.1.0",
     "unrealspeech-api": "^1.0.2"
   },
   "devDependencies": {
-    "@fisch0920/config": "^1.3.3",
+    "@fisch0920/config": "^1.3.4",
     "@types/fluent-ffmpeg": "^2.1.26",
     "@types/hh-mm-ss": "^1.2.3",
     "@types/node": "^24.9.1",
```

pnpm-lock.yaml

Lines changed: 199 additions & 580 deletions
(Generated file; diff not rendered.)

pnpm-workspace.yaml

Lines changed: 3 additions & 0 deletions

```diff
@@ -2,6 +2,9 @@ enablePrePostScripts: true
 
 minimumReleaseAge: 1440
 
+minimumReleaseAgeExclude:
+  - '@fisch0920/config'
+
 onlyBuiltDependencies:
   - esbuild
   - sharp
```

readme.md

Lines changed: 4 additions & 4 deletions

````diff
@@ -34,7 +34,7 @@ _You must own the ebook on Kindle for this project to work._
 
 ### How does it work?
 
-It works by logging into your [Kindle web reader](https://read.amazon.com) account using [Playwright](https://playwright.dev), exporting each page of a book as a PNG image, and then using a vLLM (`gpt-4o` or `gpt-4o-mini`) to transcribe the text from each page to text. Once we have the raw book contents and metadata, then it's easy to convert it to PDF, EPUB, etc. 🔥
+It works by logging into your [Kindle web reader](https://read.amazon.com) account using [Playwright](https://playwright.dev), exporting each page of a book as a PNG image, and then using a vLLM (defaulting to `gpt-4.1-mini`) to transcribe the text from each page. Once we have the raw book contents and metadata, then it's easy to convert it to PDF, EPUB, etc. 🔥
 
 This [example](./examples/B0819W19WD) uses the first page of the scifi book [Revelation Space](https://www.amazon.com/gp/product/B0819W19WD?ref_=dbs_m_mng_rwt_calw_tkin_0&storeType=ebooks) by [Alastair Reynolds](https://www.goodreads.com/author/show/51204.Alastair_Reynolds):
 
@@ -66,7 +66,7 @@ This [example](./examples/B0819W19WD) uses the first page of the scifi book [Rev
 </tr>
 <tr>
 <td>
-We then convert each page's screenshot into text using one of OpenAI's vLLMs (<strong>gpt-4o</strong> or <strong>gpt-4o-mini</strong>).
+We then convert each page's screenshot into text using an OpenAI vLLM (defaulting to <strong>gpt-4.1-mini</strong>).
 </td>
 <td>
 <p>Mantell Sector, North Nekhebet, Resurgam, Delta Pavonis system, 2551</p>
@@ -202,7 +202,7 @@ npx tsx src/transcribe-book-content.ts
 ```
 
 - _(This takes a few minutes to run)_
-- This takes each of the page screenshots and runs them through a vLLM (`gpt-4o` or `gpt-4o-mini`) to extract the raw text content from each page of the book.
+- This takes each of the page screenshots and runs them through a vLLM (defaulting to `gpt-4.1-mini`) to extract the raw text content from each page of the book.
 - It then stitches these text chunks together, taking into account chapter boundaries.
 - The result is stored as JSON to `out/${asin}/content.json`.
 - Example: [examples/B0819W19WD/content.json](./examples/B0819W19WD/content.json)
@@ -284,7 +284,7 @@ Compared with these approaches, the approach used by this project is much easier
 
 The main downside is that it's possible for some transcription errors to occur during the `image ⇒ text` step - which uses a multimodal LLM and is not 100% deterministic. In my testing, I've been remarkably surprised with how accurate the results are, but there are occasional issues mostly with differentiating whitespace between paragraphs versus soft section breaks. Note that both Calibre and Epubor also use heuristics to deal with things like spacing and dashes used by wordwrap, so the fidelity of the conversions will not be 100% one-to-one with the original Kindle version in any case.
 
-The other downside is that the **LLM costs add up to a few dollars per book using `gpt-4o`** or **around 30 cents per book using `gpt-4o-mini`**. With LLM costs constantly decreasing and local vLLMs, this cost per book should be free or almost free soon. The screenshots are also really good quality with no extra content, so you could swap any other OCR solution for the vLLM-based `image ⇒ text` quite easily.
+The other downside is that the **LLM costs add up to around a dollar per book using `gpt-4.1-mini`**. With LLM costs constantly decreasing and local vLLMs, this cost per book should be free or almost free soon. The screenshots are also really good quality with no extra content, so you could swap any other OCR solution for the vLLM-based `image ⇒ text` quite easily.
 
 ### How is the accuracy?
 
````

src/export-book-audio.ts

Lines changed: 9 additions & 9 deletions

```diff
@@ -16,7 +16,8 @@ import {
   ffmpegOnProgress,
   fileExists,
   getEnv,
-  hashObject
+  hashObject,
+  readJsonFile
 } from './utils'
 
 type TTSEngine = 'openai' | 'unrealspeech'
@@ -35,11 +36,10 @@ async function main() {
   const audioOutDir = path.join(outDir, isPreview ? 'audio-previews' : 'audio')
   await fs.mkdir(audioOutDir, { recursive: true })
 
-  const content = (
-    JSON.parse(
-      await fs.readFile(path.join(outDir, 'content.json'), 'utf8')
-    ) as ContentChunk[]
+  const rawContent = await readJsonFile<ContentChunk[]>(
+    path.join(outDir, 'content.json')
   )
+  const content = rawContent
     .filter((c) => !isPreview || c.page === 1)
     .concat(
       isPreview
@@ -53,11 +53,11 @@ async function main() {
         ]
       : []
   )
-
-  const metadata = JSON.parse(
-    await fs.readFile(path.join(outDir, 'metadata.json'), 'utf8')
-  ) as BookMetadata
   assert(content.length, 'no book content found')
+
+  const metadata = await readJsonFile<BookMetadata>(
+    path.join(outDir, 'metadata.json')
+  )
   assert(metadata.meta, 'invalid book metadata: missing meta')
   assert(metadata.toc?.length, 'invalid book metadata: missing toc')
 
```

src/export-book-markdown.ts

Lines changed: 7 additions & 7 deletions

```diff
@@ -4,20 +4,20 @@ import fs from 'node:fs/promises'
 import path from 'node:path'
 
 import type { BookMetadata, ContentChunk } from './types'
-import { assert, getEnv } from './utils'
+import { assert, getEnv, readJsonFile } from './utils'
 
 async function main() {
   const asin = getEnv('ASIN')
   assert(asin, 'ASIN is required')
 
   const outDir = path.join('out', asin)
 
-  const content = JSON.parse(
-    await fs.readFile(path.join(outDir, 'content.json'), 'utf8')
-  ) as ContentChunk[]
-  const metadata = JSON.parse(
-    await fs.readFile(path.join(outDir, 'metadata.json'), 'utf8')
-  ) as BookMetadata
+  const content = await readJsonFile<ContentChunk[]>(
+    path.join(outDir, 'content.json')
+  )
+  const metadata = await readJsonFile<BookMetadata>(
+    path.join(outDir, 'metadata.json')
+  )
   assert(content.length, 'no book content found')
   assert(metadata.meta, 'invalid book metadata: missing meta')
   assert(metadata.toc?.length, 'invalid book metadata: missing toc')
```
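
The `readJsonFile` helper that both scripts now import from `./utils` is not part of this diff. A minimal sketch of what it likely looks like, inferred from its call sites above (the exact name and signature in `src/utils` are an assumption):

```typescript
import fs from 'node:fs/promises'

// Hypothetical sketch of the readJsonFile helper imported from './utils'.
// Inferred from the call sites in this commit: read a file as UTF-8 and
// parse it as JSON, cast to the caller-supplied type parameter.
export async function readJsonFile<T>(filePath: string): Promise<T> {
  const raw = await fs.readFile(filePath, 'utf8')
  return JSON.parse(raw) as T
}
```

This collapses the repeated `JSON.parse(await fs.readFile(…, 'utf8')) as T` pattern into a single typed call, which is the point of the refactor in these two files.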
