DeSi (DataStore Helper) is an intelligent RAG-focused chatbot that provides expert assistance for openBIS and BAM Data Store documentation. It combines information from multiple knowledge sources using advanced retrieval-augmented generation techniques, powered by Ollama and ChromaDB.
- DeSi: DataStore Helper
- Intelligent RAG Pipeline: Advanced retrieval-augmented generation using Ollama and ChromaDB
- Multi-Source Knowledge: Combines openBIS ReadTheDocs and BAM Data Store Wiki.js documentation
- Conversation Memory: SQLite-based memory system maintains context across interactions
- Smart Workflow: Automatically handles scraping, processing, and database creation
- Highly Configurable: Environment-based configuration with sensible defaults
- Source Prioritization: Intelligently prioritizes relevant documentation sources
- Interactive CLI: Rich command-line interface with conversation history
- Modular Architecture: Clean separation of scraping, processing, and querying components
DeSi follows a sophisticated multi-stage pipeline:
- Data Acquisition: Web scrapers extract content from documentation sources
- Content Processing: Documents are chunked, normalized, and embedded using Ollama
- Vector Storage: ChromaDB stores embeddings for efficient similarity search
- Query Processing: RAG engine retrieves relevant context and generates responses
- Conversation Management: LangGraph-based engine maintains chat history and context
```mermaid
graph TD
    A[User Query] --> B{ChromaDB Exists?}
    B -->|No| C{Scraped Data Exists?}
    B -->|Yes| H[Query Interface]
    C -->|No| D[Run Scrapers]
    C -->|Yes| F[Run Processors]
    D --> E[OpenBIS Scraping]
    E --> F[Content Processing]
    F --> G[Generate Embeddings & Store in ChromaDB]
    G --> H[Query Interface]
    H --> I[RAG Query Engine]
    I --> J[Retrieve Relevant Context]
    J --> K[Generate Response with Ollama]
    K --> L[Update Conversation Memory]
    L --> M[Return Response to User]

    style A fill:#e1f5fe
    style H fill:#f3e5f5
    style I fill:#e8f5e8
    style K fill:#fff3e0
```
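The gating logic shown in the diagram can be sketched roughly as follows. This is an illustrative simplification, not DeSi's actual API; the default paths match the configuration defaults described below, but the function name is an assumption:

```python
from pathlib import Path

def plan_workflow(db_path="desi_vectordb", raw_dir="data/raw"):
    """Decide which pipeline stages need to run, mirroring the flowchart."""
    steps = []
    if not Path(db_path).exists():
        if not Path(raw_dir).exists():
            steps.append("scrape")   # No raw data yet: run the scrapers first
        steps.append("process")      # Chunk, embed, and store in ChromaDB
    steps.append("query")            # Always end at the query interface
    return steps

print(plan_workflow())  # e.g. ['scrape', 'process', 'query'] on a fresh checkout
```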
- Python 3.8+ (3.10+ recommended)
- Ollama with required models installed
- Git for cloning the repository
DeSi requires two Ollama models to function:
```shell
# Install the embedding model (required for vector search)
ollama pull nomic-embed-text

# Install the chat model (default, can be configured)
ollama pull qwen3
```

Note: You can use different models by configuring them in your `.env` file or via command-line parameters.
```shell
git clone https://github.com/carlosmada22/DeSi.git
cd DeSi
```

It's highly recommended to use a virtual environment:
```shell
# Create virtual environment
python -m venv .venv

# Activate it
# On Windows:
.venv\Scripts\activate
# On Unix/macOS:
source .venv/bin/activate
```

Install DeSi with all dependencies:

```shell
# Install main package with development tools
pip install -e ".[dev]"
```

Run the setup script to verify prerequisites and create necessary directories:
```shell
python init.py
```

This script will:

- ✅ Check Python version compatibility
- ✅ Verify Ollama installation and models
- ✅ Create required data directories
- ✅ Run basic functionality tests
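A minimal sketch of the kind of checks such a setup script performs. The function names and directory list here are assumptions for illustration, not `init.py`'s actual contents:

```python
import sys
from pathlib import Path

def check_python(min_version=(3, 8)):
    """Verify the interpreter meets the minimum supported version."""
    return sys.version_info >= min_version

def ensure_dirs(base="."):
    """Create the data directories the pipeline expects (names assumed)."""
    created = []
    for sub in ("data/raw", "data/processed"):
        path = Path(base) / sub
        path.mkdir(parents=True, exist_ok=True)
        created.append(str(path))
    return created

print(check_python())  # True on Python 3.8+
```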
The simplest way to get started is to run the main script, which handles everything automatically:
```shell
python main.py
```

This intelligent workflow will:
- Check if ChromaDB database exists
- Scrape documentation if no data is found
- Process content and create embeddings
- Launch the interactive CLI interface
Once the interface loads, you can ask questions like:
- "How do I create a new experiment in openBIS?"
- "What are the steps to upload data to BAM Data Store?"
- "How do I configure user permissions?"
```shell
# Force re-scraping even if data exists
python main.py --force-scraping

# Force re-processing even if database exists
python main.py --force-processing

# Skip scraping and use existing data
python main.py --skip-scraping

# Use custom configuration file
python main.py --config /path/to/custom.env
```

DeSi uses a flexible configuration system based on environment variables with sensible defaults.
Create a `.env` file in the project root to customize settings:
```shell
# Database Configuration
DESI_DB_PATH=desi_vectordb
DESI_COLLECTION_NAME=desi_docs
DESI_MEMORY_DB_PATH=data/conversation_memory.db

# Model Configuration
DESI_MODEL_NAME=qwen3
DESI_EMBEDDING_MODEL_NAME=nomic-embed-text

# Data Sources
DESI_OPENBIS_URL=https://openbis.readthedocs.io/en/20.10.0-11/index.html
DESI_WIKIJS_URL=https://datastore.bam.de/en/home
DESI_MAX_PAGES_PER_SCRAPER=100

# Processing Settings
DESI_MIN_CHUNK_SIZE=100
DESI_MAX_CHUNK_SIZE=1000
DESI_CHUNK_OVERLAP=50
DESI_RETRIEVAL_TOP_K=5
DESI_HISTORY_LIMIT=20

# Logging
DESI_LOG_LEVEL=INFO
```

You can also specify a custom configuration file:

```shell
python main.py --config /path/to/your/config.env
```

The main interface is the intelligent CLI that handles the complete workflow:
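One plausible way to implement environment-based configuration with defaults, in the spirit of what the project's `DesiConfig` might do. This class is a simplified sketch with only a few of the variables, not DeSi's actual implementation:

```python
import os
from dataclasses import dataclass, field

@dataclass
class Config:
    """Read DESI_* environment variables, falling back to sensible defaults."""
    db_path: str = field(
        default_factory=lambda: os.environ.get("DESI_DB_PATH", "desi_vectordb"))
    model_name: str = field(
        default_factory=lambda: os.environ.get("DESI_MODEL_NAME", "qwen3"))
    retrieval_top_k: int = field(
        default_factory=lambda: int(os.environ.get("DESI_RETRIEVAL_TOP_K", "5")))

config = Config()
print(config.db_path)  # "desi_vectordb" unless DESI_DB_PATH is set
```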
```shell
# Standard usage - handles everything automatically
python main.py

# Advanced options
python main.py --help
```

| Parameter | Description | Example |
|---|---|---|
| `--web` | Start web interface (currently uses CLI) | `python main.py --web` |
| `--skip-scraping` | Skip scraping even if no data exists | `python main.py --skip-scraping` |
| `--skip-processing` | Skip processing and go directly to query | `python main.py --skip-processing` |
| `--force-scraping` | Force scraping even if data exists | `python main.py --force-scraping` |
| `--force-processing` | Force processing even if database exists | `python main.py --force-processing` |
| `--config` | Path to custom configuration file | `python main.py --config custom.env` |
You can also run individual components separately:
```shell
# Run only the scraper
python -m desi.scraper.cli --url https://example.com --output data/raw/example

# Run only the processor
python -m desi.processor.cli --input data/raw --output data/processed

# Run only the query interface (requires existing database)
python -m desi.query.cli --db-path desi_vectordb
```

```
DeSi/
├── src/desi/                      # Main source code
│   ├── scraper/                   # Web scrapers for documentation
│   │   ├── openbis_scraper.py     # OpenBIS ReadTheDocs scraper
│   │   └── cli.py                 # Scraper CLI interface
│   ├── processor/                 # Content processing pipeline
│   │   ├── ds_processor.py        # DataStore Wiki.js processor
│   │   ├── openbis_processor.py   # OpenBIS content processor
│   │   └── cli.py                 # Processor CLI interface
│   ├── query/                     # RAG query engine
│   │   ├── query.py               # Core RAG implementation
│   │   ├── conversation_engine.py # LangGraph-based chat engine
│   │   └── cli.py                 # Query CLI interface
│   ├── utils/                     # Utilities and configuration
│   │   ├── config.py              # Configuration management
│   │   └── logging.py             # Logging setup
│   └── web/                       # Web interface (Flask)
│       ├── app.py                 # Flask application
│       ├── cli.py                 # Web CLI interface
│       ├── templates/             # HTML templates
│       └── static/                # CSS, JS, images
├── tests/                         # Unit and integration tests
├── data/                          # Data storage
│   ├── raw/                       # Scraped raw content
│   │   ├── openbis/               # OpenBIS documentation
│   │   └── wikijs/                # Wiki.js content
│   ├── processed/                 # Processed and chunked content
│   └── conversation_memory.db     # SQLite conversation history
├── desi_vectordb/                 # ChromaDB vector database
├── prompts/                       # LLM prompt templates
│   └── desi_query_prompt.md       # Main query prompt
├── main.py                        # Main entry point
├── init.py                        # Environment setup script
├── pyproject.toml                 # Project configuration
└── README.md                      # This file
```
**OpenBIS Scraper** (`src/desi/scraper/openbis_scraper.py`)
- Crawls ReadTheDocs documentation sites
- Converts HTML content to clean Markdown
- Handles navigation and link discovery
- Configurable page limits and filtering
**DataStore Processor** (`src/desi/processor/ds_processor.py`)
- Processes Wiki.js content from BAM Data Store
- Intelligent content chunking and normalization
- Metadata extraction and enhancement
**OpenBIS Processor** (`src/desi/processor/openbis_processor.py`)
- Processes ReadTheDocs content
- Specialized chunking for technical documentation
- Source attribution and categorization
**RAG Query Engine** (`src/desi/query/query.py`)
- ChromaDB-based vector similarity search
- Ollama integration for embeddings and generation
- Source prioritization logic
- Context-aware response generation
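One plausible shape for the source-prioritization step, shown as a simplified sketch. The weighting scheme, field names, and source labels here are assumptions for illustration, not DeSi's actual logic:

```python
def prioritize(results, preferred_source, boost=0.15):
    """Re-rank retrieved chunks, boosting the score of a preferred source."""
    def adjusted(result):
        score = result["score"]
        if result["source"] == preferred_source:
            score += boost  # Favor the documentation set most relevant to the query
        return score
    return sorted(results, key=adjusted, reverse=True)

hits = [
    {"source": "openbis", "score": 0.80},
    {"source": "wikijs", "score": 0.72},
]
print(prioritize(hits, "wikijs")[0]["source"])  # wikijs (0.72 + 0.15 beats 0.80)
```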
**LangGraph-based Engine** (`src/desi/query/conversation_engine.py`)
- SQLite-based conversation memory
- Context maintenance across sessions
- Query rewriting and clarification
- Token counting and history management
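A minimal sketch of SQLite-backed conversation memory with a history limit, in the spirit of the engine described above. The schema and class name are assumptions, not the project's actual implementation:

```python
import sqlite3

class ConversationMemory:
    """Store chat turns in SQLite and return only the most recent ones."""

    def __init__(self, db_path=":memory:", history_limit=20):
        self.conn = sqlite3.connect(db_path)
        self.history_limit = history_limit
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, role TEXT, content TEXT)"
        )

    def add(self, role, content):
        self.conn.execute(
            "INSERT INTO messages (role, content) VALUES (?, ?)", (role, content))
        self.conn.commit()

    def history(self):
        # Fetch the newest N turns, then restore chronological order
        rows = self.conn.execute(
            "SELECT role, content FROM messages ORDER BY id DESC LIMIT ?",
            (self.history_limit,),
        ).fetchall()
        return list(reversed(rows))

memory = ConversationMemory(history_limit=2)
memory.add("user", "How do I create an experiment?")
memory.add("assistant", "Use the openBIS ELN interface...")
memory.add("user", "And permissions?")
print(memory.history())  # only the 2 most recent turns
```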
Try asking DeSi questions like:
- "How do I create a new experiment in openBIS?"
- "What are the steps to upload data to BAM Data Store?"
- "How do I configure user permissions in openBIS?"
- "What is the difference between spaces and projects?"
- "How do I register a new collection in the data store?"
- "Can you explain the openBIS data model?"
DeSi includes comprehensive tests for all components:
```shell
# Run all tests
pytest

# Run with coverage
pytest --cov=src/desi

# Run specific test categories
pytest tests/test_scraper.py
pytest tests/test_processor.py
pytest tests/test_conversation_memory.py

# Run integration tests
python scripts/integration_test.py
```

To add a custom scraper:

- Create a scraper class in `src/desi/scraper/`:

  ```python
  class MyCustomScraper:
      def __init__(self, base_url, output_dir):
          ...  # Initialize scraper

      def scrape(self):
          ...  # Implement scraping logic
  ```

- Add CLI support in `src/desi/scraper/cli.py`
- Update the main pipeline in `main.py` to include the new scraper
- Add tests in `tests/test_my_custom_scraper.py`
Extend processors in `src/desi/processor/`:
- Chunking strategies: Modify chunk size and overlap parameters
- Metadata extraction: Add custom metadata fields
- Content normalization: Implement domain-specific cleaning
- Embedding models: Configure different Ollama models
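The customization points above can be illustrated with a sliding-window chunker, showing how the `DESI_MIN_CHUNK_SIZE`/`DESI_MAX_CHUNK_SIZE`/`DESI_CHUNK_OVERLAP` settings interact. This is a character-based sketch for clarity; the real processors may split on other boundaries:

```python
def chunk_text(text, max_size=1000, overlap=50, min_size=100):
    """Split text into overlapping chunks; drop trailing fragments below min_size."""
    chunks = []
    step = max_size - overlap  # Each window starts `overlap` chars before the last one ended
    for start in range(0, len(text), step):
        piece = text[start:start + max_size]
        if len(piece) >= min_size or not chunks:
            chunks.append(piece)
    return chunks

doc = "x" * 2500
parts = chunk_text(doc)
print([len(p) for p in parts])  # [1000, 1000, 600]
```

Larger overlap preserves more context across chunk boundaries at the cost of storing and embedding more redundant text.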
| Issue | Solution |
|---|---|
| Ollama not available | Ensure Ollama is running: `ollama serve` |
| Missing models | Install required models: `ollama pull nomic-embed-text` |
| Empty database | Force re-processing: `python main.py --force-processing` |
| Scraping fails | Check internet connection and URL accessibility |
| Memory issues | Reduce chunk size in configuration |
| Slow responses | Check Ollama model performance; consider lighter models |
Enable detailed logging:
```shell
# Set log level in environment
export DESI_LOG_LEVEL=DEBUG
python main.py

# Or use a configuration file
echo "DESI_LOG_LEVEL=DEBUG" > debug.env
python main.py --config debug.env
```

```shell
# Check database status
python -c "
from src.desi.utils.config import DesiConfig
from pathlib import Path
config = DesiConfig()
db_path = Path(config.db_path)
print(f'Database exists: {db_path.exists()}')
if db_path.exists():
    print(f'Database size: {sum(f.stat().st_size for f in db_path.rglob(\"*\") if f.is_file())} bytes')
"

# Reset database (removes all data)
rm -rf desi_vectordb/
python main.py --force-processing
```

We welcome contributions! Here's how to get started:
- Fork and clone the repository
- Create a virtual environment and install dependencies:

  ```shell
  python -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  pip install -e ".[dev]"
  ```

- Run tests to ensure everything works:

  ```shell
  pytest
  python scripts/integration_test.py
  ```
We use several tools to maintain code quality:
```shell
# Format code
ruff format src tests

# Lint code
ruff check src tests

# Type checking
mypy src

# Run all checks
pytest && ruff check src tests && ruff format --check src tests
```

- Create a feature branch: `git checkout -b feature/your-feature-name`
- Make your changes and add tests
- Run the full test suite: `pytest`
- Submit a pull request with a clear description
- Bug reports: Include steps to reproduce, expected vs actual behavior
- Feature requests: Describe the use case and proposed solution
- Questions: Use GitHub Discussions for general questions
This project is licensed under the MIT License. See the LICENSE file for details.
DeSi - Making openBIS and BAM Data Store documentation accessible through intelligent conversation.