A Vietnamese-language Retrieval-Augmented Generation (RAG) system with specialized text processing and embeddings for Vietnamese.
- Text normalization using `underthesea`
- Sentence segmentation
- Word segmentation with domain-specific fixed words (optional)
- Smart chunking strategy with configurable chunk size and overlap (default: 110 tokens with 20 token overlap)
- Embedding generation using `bkai-foundation-models/vietnamese-bi-encoder`
- API for processing documents and querying similar chunks
- Caching for embeddings (optional, enabled by default)
- Input validation to enforce chunk size and overlap constraints
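
The chunking and validation features above can be illustrated with a minimal sketch. It assumes a plain sliding window over an already segmented token list; the project's actual strategy may differ (for example by respecting sentence boundaries). The function names `validate_chunk_params` and `chunk_tokens` are hypothetical, and the exact constraints checked here (overlap smaller than chunk size, chunk size within the 128-token limit) are assumptions based on the documented defaults.

```python
from typing import List

MAX_TOKEN_LIMIT = 128  # documented default for the maximum tokens per chunk


def validate_chunk_params(chunk_size: int, chunk_overlap: int,
                          max_tokens: int = MAX_TOKEN_LIMIT) -> None:
    """Raise ValueError if the chunking parameters are inconsistent (assumed rules)."""
    if not 0 < chunk_size <= max_tokens:
        raise ValueError(f"chunk_size must be between 1 and {max_tokens}")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")


def chunk_tokens(tokens: List[str], chunk_size: int = 110,
                 chunk_overlap: int = 20) -> List[List[str]]:
    """Split tokens into windows; consecutive windows share chunk_overlap tokens."""
    validate_chunk_params(chunk_size, chunk_overlap)
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks
```

With the default settings each window advances by 90 tokens, so a 250-token document yields three chunks, and the last 20 tokens of one chunk reappear at the start of the next.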
- Clone the repository
- Install dependencies: `pip install -r requirements.txt`
- Run the application: `uvicorn app.main:app --reload --host 0.0.0.0 --port 8000`
  (Note: `--host` and `--port` are optional and default to `0.0.0.0` and `8000` respectively, as defined in `app/config.py`.)
- `POST /api/process`: Process text documents into chunks and embeddings. Takes a `ProcessingRequest` as input, allowing specification of `chunk_size` and `chunk_overlap`. Returns a list of `EmbeddingResponse`.
- `POST /api/query`: Find similar chunks for a given query text. Takes a `QueryRequest` and returns a `QueryResponse`.
- `GET /api/status`: Get server status.
- `GET /health`: Health check endpoint.
- `GET /`: Root endpoint with basic application information.
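
As a rough illustration of calling these endpoints, the sketch below builds JSON bodies and posts them with the Python standard library. Only `chunk_size` and `chunk_overlap` are documented request fields; the field names `texts`, `query`, and `top_k` are assumptions, as are the helper names, so check the `ProcessingRequest` and `QueryRequest` models for the real schema.

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000"  # assumes the server runs on the default port


def build_process_payload(texts, chunk_size=110, chunk_overlap=20):
    """Body for POST /api/process ("texts" is an assumed field name)."""
    return {"texts": list(texts), "chunk_size": chunk_size,
            "chunk_overlap": chunk_overlap}


def build_query_payload(query, top_k=5):
    """Body for POST /api/query ("query" and "top_k" are assumed field names)."""
    return {"query": query, "top_k": top_k}


def post_json(path, payload, base_url=BASE_URL, timeout=30):
    """Send a JSON POST request and return the decoded JSON response."""
    # ensure_ascii=False keeps Vietnamese diacritics readable in the request body
    data = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    req = request.Request(base_url + path, data=data,
                          headers={"Content-Type": "application/json"},
                          method="POST")
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example calls (require a running server):
# post_json("/api/process", build_process_payload(["Hà Nội là thủ đô của Việt Nam."]))
# post_json("/api/query", build_query_payload("thủ đô của Việt Nam"))
```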
Configuration options are managed in `app/config.py` and can be overridden using environment variables:
- `DEBUG`: Enable debug mode (default: `False`)
- `EMBEDDING_MODEL`: The SentenceTransformer model to use (default: `bkai-foundation-models/vietnamese-bi-encoder`)
- `MAX_TOKEN_LIMIT`: Maximum number of tokens per chunk (default: 128)
- `DEFAULT_CHUNK_SIZE`: Default chunk size in tokens (default: 110)
- `DEFAULT_CHUNK_OVERLAP`: Default chunk overlap in tokens (default: 20)
- `DEFAULT_TOP_K`: Default number of top matches to return for a query (default: 5)
- `ENABLE_CACHE`: Enable embedding caching (default: `True`)
- `CACHE_SIZE`: Maximum size of the embedding cache (default: 1000)
- `HOST`: Host address (default: `0.0.0.0`)
- `PORT`: Port number (default: 8000)
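
The override behaviour can be sketched with the standard library alone. This is not the contents of `app/config.py`, which may well use a settings library instead; the `Settings` class, `load_settings` function, and the accepted boolean spellings are assumptions. Only the variable names and defaults come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    # Assumed convention: these spellings count as true, anything else as false
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    debug: bool = False
    embedding_model: str = "bkai-foundation-models/vietnamese-bi-encoder"
    max_token_limit: int = 128
    default_chunk_size: int = 110
    default_chunk_overlap: int = 20
    default_top_k: int = 5
    enable_cache: bool = True
    cache_size: int = 1000
    host: str = "0.0.0.0"
    port: int = 8000


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings, letting upper-cased environment variables override defaults."""
    env = os.environ if env is None else env
    overrides = {}
    for name, default in vars(Settings()).items():
        raw = env.get(name.upper())
        if raw is None:
            continue  # variable not set: keep the documented default
        if isinstance(default, bool):  # check bool first, since bool is a subclass of int
            overrides[name] = _as_bool(raw)
        elif isinstance(default, int):
            overrides[name] = int(raw)
        else:
            overrides[name] = raw
    return Settings(**overrides)
```

For example, starting the process with `PORT=9000` and `ENABLE_CACHE=false` set in the environment would, under this sketch, change only those two fields and leave the rest at their defaults.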