Automatically crawl documentation sites, clean up extraneous markup, and write the result to markdown for use with LLMs.
- 🔍 Crawl documentation sites with configurable depth
- 🧹 Optional AI cleanup of extraneous markup and formatting
- 📚 Automatic categorization of content for splitting into token-friendly chunks
- 📝 Export raw or AI-cleaned markdown files
- 🔄 Interactive document pruning and category splitting
- 📊 Token estimation for LLM context windows
Install the CLI globally with npm:

```bash
npm install -g llmdump
```
You'll need two API keys:
- Firecrawl API key for web crawling
- OpenAI API key for content processing
Set them as environment variables or provide them via CLI:
```bash
export FIRECRAWL_API_KEY="your-firecrawl-key"
export OPENAI_API_KEY="your-openai-key"
```
Or use CLI flags:
```bash
llmdump --firecrawl-key "your-key" --openai-key "your-key"
```
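If you'd rather not re-export the keys in every new terminal, you can persist them in your shell profile. A minimal sketch, assuming a bash setup with `~/.bashrc` (use `~/.zshrc` or similar for other shells); the key values are placeholders:

```bash
# Append the exports to the shell profile so new sessions pick them up.
# ~/.bashrc is an assumption; adjust for your shell.
echo 'export FIRECRAWL_API_KEY="your-firecrawl-key"' >> ~/.bashrc
echo 'export OPENAI_API_KEY="your-openai-key"' >> ~/.bashrc
source ~/.bashrc  # reload in the current session
```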
When you run `llmdump`, you'll see the main menu with these options:
```
┌─────────────────┐
│     LLMDump     │
│  v[current-ver] │
└─────────────────┘

? What would you like to do?
❯ Start new crawl
  Open existing crawl
  Delete crawl
  Manage configuration
  Exit
```
To start a new crawl:
- Enter the URL to crawl
- Set the maximum number of pages to crawl (default: 50; the sketch after this list shows one way to gauge a site's size)
- Wait for crawling and AI processing to complete
- View the processing menu
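If you're unsure what page limit to pick, one rough way to gauge a site's size beforehand is to count the URLs in its sitemap. A sketch, not part of llmdump itself, assuming the site publishes a standard sitemap.xml (`docs.example.com` is a placeholder):

```bash
# Count <loc> entries in the sitemap as a rough estimate of page count.
curl -s https://docs.example.com/sitemap.xml | grep -c "<loc>"
```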
After crawling, you'll see a summary of documents and categories, then:
```
? What would you like to do?
❯ View/prune documents (In case we crawled some junk)
  Export & clean documents
  Export raw documents (No AI cleanup, faster)
  Back to Main Menu
```
When viewing documents, you can:
- Navigate between categories
- Prune documents (remove unwanted content)
- Split categories using AI
- View token estimates for each category
When exporting, choose between:
- Single file (all content in one markdown file)
- Multiple files (one file per category)
The tool will show token estimates for each option to help you choose.
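If you want to sanity-check an estimate yourself, a common rough heuristic for OpenAI-style tokenizers is about four characters of English text per token. A minimal sketch (this is not llmdump's actual estimator, and the `export/` directory name is hypothetical):

```bash
# Approximate token counts: characters / 4 is a rough rule of thumb
# for English text with OpenAI-style tokenizers.
for f in export/*.md; do
  chars=$(wc -c < "$f")
  echo "$f: ~$((chars / 4)) tokens"
done
```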
From the Manage configuration menu you can:
- Update Firecrawl API key
- Update OpenAI API key
- Open config directory (a direct-inspection sketch follows this list)
- Return to main menu
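You can also inspect the stored configuration directly from a terminal. A sketch assuming a typical per-user location such as `~/.config/llmdump`; this path is an assumption, so use the Open config directory option above to find the real one:

```bash
# List the config directory (path is an assumption, not confirmed by
# llmdump's docs; verify via "Open config directory").
ls -la ~/.config/llmdump
```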
Contributions are welcome! Please feel free to submit a Pull Request.
Licensed under the MIT License.