ragBIS is a standalone Python package that scrapes openBIS documentation, processes the content, and generates embeddings for use in RAG (Retrieval Augmented Generation) applications.
- Web Scraping: Automatically scrapes openBIS documentation from ReadtheDocs
- Content Processing: Intelligently chunks content while preserving document structure
- Embedding Generation: Creates embeddings using Ollama's `nomic-embed-text` model
- Data Export: Saves processed data in JSON and CSV formats for easy consumption
- Python 3.8 or higher
- Ollama installed and running
- The `nomic-embed-text` model installed in Ollama
- Install Ollama from https://ollama.ai/
- Pull the required embedding model:
```
ollama pull nomic-embed-text
```
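With the model pulled and `ollama serve` running, you can sanity-check embedding generation from Python. A minimal sketch against Ollama's REST API (this assumes the default endpoint `http://localhost:11434/api/embeddings`; it is not part of ragBIS itself):

```python
import json
import urllib.request

def ollama_embed(text, model="nomic-embed-text", base="http://localhost:11434"):
    """Request an embedding vector from a locally running Ollama server."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    req = urllib.request.Request(
        f"{base}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

# Uncomment with a running server:
# vec = ollama_embed("openBIS is a data management platform")
# print(len(vec))  # vector dimensionality
```

If this call fails, see the troubleshooting section below.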
- Clone or download this project
- Navigate to the ragBIS_project directory
- Create a virtual environment (recommended):
```
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
- Install dependencies:
```
pip install -r requirements.txt
```
Or install the package directly:
```
pip install ragbis
```
Run ragBIS with default settings to scrape and process openBIS documentation:
```
python -m ragbis
```
This will:
- Scrape the openBIS documentation from the default URL
- Save raw content to `./data/raw/`
- Process and generate embeddings
- Save processed data to `./data/processed/`
```
python -m ragbis --help
```
Available options:
- `--url URL`: Base URL to scrape (default: https://openbis.readthedocs.io/en/latest/)
- `--output-dir DIR`: Output directory for data (default: ./data)
- `--max-pages N`: Maximum number of pages to scrape (default: 100)
- `--delay SECONDS`: Delay between requests (default: 0.5)
- `--force-rebuild`: Force rebuild even if processed data exists
- `--min-chunk-size N`: Minimum chunk size in characters (default: 100)
- `--max-chunk-size N`: Maximum chunk size in characters (default: 1000)
- `--chunk-overlap N`: Chunk overlap in characters (default: 50)
- `--verbose`: Enable verbose logging
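The three chunking flags interact as a sliding window over the page text. The following is only an illustrative sketch of that idea, not ragBIS's actual chunker (which also preserves document structure):

```python
def chunk_text(text, min_size=100, max_size=1000, overlap=50):
    """Illustrative sliding-window chunker: fixed-size windows with overlap."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_size, len(text))
        piece = text[start:end]
        # Drop trailing fragments smaller than min_size (but keep at least one chunk).
        if len(piece) >= min_size or not chunks:
            chunks.append(piece)
        if end == len(text):
            break
        start = end - overlap  # the next window re-reads `overlap` characters
    return chunks

# A 250-character input with max_size=100 and overlap=10 yields 3 overlapping chunks.
```

Larger overlap reduces the chance that a sentence relevant to a query is split across a chunk boundary, at the cost of more chunks to embed.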
Scrape with custom settings:
```
python -m ragbis --max-pages 200 --output-dir ./my_data --verbose
```
Force rebuild existing data:
```
python -m ragbis --force-rebuild
```
Custom chunking parameters:
```
python -m ragbis --min-chunk-size 200 --max-chunk-size 1500 --chunk-overlap 100
```
ragBIS creates the following directory structure:
```
data/
├── raw/                 # Raw scraped content
│   ├── index.txt
│   ├── installation.txt
│   └── ...
└── processed/           # Processed data for RAG
    ├── chunks.json      # Main data file with embeddings
    └── chunks.csv       # Metadata without embeddings
```
- `chunks.json`: Contains all processed chunks with embeddings, titles, URLs, and content
- `chunks.csv`: Contains chunk metadata without embeddings for easy inspection
The processed data from ragBIS is designed to be used with chatBIS, the conversational interface. The chatBIS repo is accessible here. After running ragBIS, you can:
- Copy the `data` directory to your chatBIS project
- Or point chatBIS to the ragBIS output directory
- `OLLAMA_HOST`: Ollama server host (default: localhost)
- `OLLAMA_PORT`: Ollama server port (default: 11434)
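How a client script might resolve these variables into a server URL (the variable names come from this README; the resolution logic shown is an illustrative assumption, not ragBIS's actual code):

```python
import os

# Fall back to the documented defaults when the variables are unset.
host = os.environ.get("OLLAMA_HOST", "localhost")
port = int(os.environ.get("OLLAMA_PORT", "11434"))
base_url = f"http://{host}:{port}"
print(base_url)
```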
You can modify the scraping behavior by editing the scraper configuration in the source code:
- Target different documentation versions
- Adjust content selectors for different site layouts
- Modify delay and retry settings
- Ollama Connection Error
  - Ensure Ollama is running: `ollama serve`
  - Check if the model is installed: `ollama list`
  - Install the model if missing: `ollama pull nomic-embed-text`
- Memory Issues
  - Reduce `--max-pages` for large documentation sites
  - Increase `--min-chunk-size` to create fewer chunks
  - Process in smaller batches
- Network Issues
  - Increase `--delay` between requests
  - Check your internet connection
  - Verify the documentation URL is accessible
Enable verbose logging to debug issues:
```
python -m ragbis --verbose
```
Run the test suite:
```
pytest
```
Format and type-check the source:
```
black src/
mypy src/
```
This project is licensed under the MIT License - see the LICENSE file for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
For issues and questions:
- Create an issue on GitHub
- Check the troubleshooting section above
- Ensure Ollama is properly configured