Machine Learning Models
The Discourse Analysis Tool Suite relies heavily on various machine learning models, especially during document pre-processing. This page lists all models currently in use.
Docling runs on GPU, served by Docling-serve
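As a minimal illustration of the document conversion step (the suite itself talks to a docling-serve instance over HTTP, which is not shown here), the Docling Python API can convert a source document like this; the file path is a placeholder:

```python
# Minimal sketch: converting a PDF with the Docling Python API.
# The suite talks to a docling-serve instance instead; that HTTP setup is not shown.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("example.pdf")  # placeholder path

# Export the parsed document, e.g. as Markdown for downstream processing.
print(result.document.export_to_markdown())
```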
GlotLID runs on CPU, served by Ray
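GlotLID is a fastText-based language-identification model. A minimal sketch of wrapping it in a Ray Serve deployment is shown below; the Hugging Face repo id, filename, and request/response shape are assumptions for illustration, not the suite's actual API:

```python
# Minimal sketch: serving GlotLID (a fastText language-ID model) with Ray Serve.
# Repo id, filename, and the request/response shape are assumptions.
import fasttext
from huggingface_hub import hf_hub_download
from ray import serve
from starlette.requests import Request


@serve.deployment(num_replicas=1, ray_actor_options={"num_cpus": 1})
class GlotLID:
    def __init__(self) -> None:
        model_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin")
        self.model = fasttext.load_model(model_path)

    async def __call__(self, request: Request) -> dict:
        text = (await request.json())["text"].replace("\n", " ")
        labels, probs = self.model.predict(text, k=1)
        return {"language": labels[0].replace("__label__", ""), "confidence": float(probs[0])}


app = GlotLID.bind()
# serve.run(app)  # exposes the deployment over HTTP on the Ray cluster
```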
- spaCy NER (EN): en_core_web_trf
- spaCy NER (DE): de_core_news_lg
- spaCy NER (IT): it_core_news_lg
spaCy runs on CPU, served by Ray
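A minimal sketch of the entity-extraction step with one of the listed pipelines (inside the suite, the pipelines run behind Ray deployments):

```python
# Minimal sketch: named entity recognition with one of the listed spaCy pipelines.
import spacy

nlp = spacy.load("de_core_news_lg")  # or en_core_web_trf / it_core_news_lg
doc = nlp("Angela Merkel besuchte Hamburg im Jahr 2013.")

for ent in doc.ents:
    # character offsets and label for each detected entity
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
```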
DETR runs on GPU, served by Ray
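A minimal sketch of running DETR object detection via Hugging Face Transformers; the checkpoint facebook/detr-resnet-50 and the confidence threshold are assumptions:

```python
# Minimal sketch: object detection with DETR via Hugging Face Transformers.
# The exact checkpoint used by the suite is not specified here; this one is an assumption.
import torch
from PIL import Image
from transformers import DetrForObjectDetection, DetrImageProcessor

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50").to("cuda").eval()

image = Image.open("example.jpg").convert("RGB")  # placeholder path
inputs = processor(images=image, return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits/boxes into (label, score, box) tuples in image coordinates.
target_sizes = torch.tensor([image.size[::-1]])
detections = processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=target_sizes
)[0]
for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```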
- Whisper: Whisper Timestamped
Whisper runs on GPU, served by Ray
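A minimal sketch using the whisper-timestamped package; model size, device, and audio path are placeholders:

```python
# Minimal sketch: timestamped transcription with whisper-timestamped.
# Model size, device, and audio path are placeholders.
import whisper_timestamped as whisper

audio = whisper.load_audio("example.wav")
model = whisper.load_model("large-v3", device="cuda")

result = whisper.transcribe(model, audio)
for segment in result["segments"]:
    print(f'[{segment["start"]:.2f}s - {segment["end"]:.2f}s] {segment["text"]}')
```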
Gemma 3 27B is used for:
- Image Captioning during document preprocessing
- LLM Assistant (metadata extraction, document tagging, sentence annotation, span annotation)
- Automatic memo generation
- RAG chat in Perspectives extension
- Perspectives Document Rewriting
Gemma 3 runs on GPU, served by vLLM
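vLLM exposes an OpenAI-compatible API, so the features above boil down to chat-completion calls like the following sketch; the base URL and model id are assumptions:

```python
# Minimal sketch: calling Gemma 3 through vLLM's OpenAI-compatible API.
# Base URL and model id are assumptions; the suite's prompts for captioning,
# tagging, memo generation, etc. are built on top of calls like this one.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="google/gemma-3-27b-it",
    messages=[
        {"role": "user", "content": "Summarize the following document in one memo:\n..."},
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)
```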
- Sentence Encoder: CLIP-based sentence-transformer
- Image Encoder: CLIP
CLIP runs on GPU, served by Ray
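A minimal sketch of producing text and image embeddings in the shared CLIP space with sentence-transformers; the checkpoint clip-ViT-B-32 is an assumption and may differ from the CLIP variant actually deployed:

```python
# Minimal sketch: joint text/image embeddings with a CLIP sentence-transformer.
# The checkpoint is an assumption; the listed "CLIP-based sentence-transformer" may differ.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32", device="cuda")

text_emb = model.encode(["a photo of a protest march"], convert_to_tensor=True)
image_emb = model.encode([Image.open("example.jpg")], convert_to_tensor=True)  # placeholder image

# Cosine similarity between sentence and image embeddings (shared CLIP space).
print(util.cos_sim(text_emb, image_emb))
```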
- Document Encoder (Text): Snowflake Arctic Embed v2
Arctic Embed runs on GPU, served by vLLM
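When vLLM serves an embedding model, embeddings can be requested through its OpenAI-compatible /v1/embeddings route, as in this sketch; the base URL and model id are assumptions:

```python
# Minimal sketch: requesting document embeddings from a vLLM embedding server.
# Base URL and model id (Snowflake/snowflake-arctic-embed-l-v2.0) are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.embeddings.create(
    model="Snowflake/snowflake-arctic-embed-l-v2.0",
    input=["First document text ...", "Second document text ..."],
)
vectors = [item.embedding for item in response.data]
print(len(vectors), len(vectors[0]))
```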
- Document Encoder (Text): Multilingual-E5-large-instruct
- Image Encoder: Qwen2
Context size: ??
Instruction-tuned embedding models run on GPU, served & trained by GPU workers on demand
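Instruction-tuned embedding models such as Multilingual-E5-large-instruct expect an instruction prefix on queries, while documents are encoded as-is. A minimal sketch (the task description is illustrative):

```python
# Minimal sketch: query/document embeddings with multilingual-e5-large-instruct.
# The task description is illustrative; queries get an "Instruct: ... Query: ..." prefix,
# documents are encoded as-is (per the model card's recommended usage).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large-instruct", device="cuda")

task = "Given a search query, retrieve relevant document passages"
queries = [f"Instruct: {task}\nQuery: {q}" for q in ["climate protest coverage"]]
documents = ["Thousands marched through the city center ...", "The court ruled on the appeal ..."]

query_emb = model.encode(queries, normalize_embeddings=True)
doc_emb = model.encode(documents, normalize_embeddings=True)

# Cosine similarity (embeddings are normalized, so the dot product suffices).
print(query_emb @ doc_emb.T)
```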
- Sentence Embeddings (Text): paraphrase-multilingual-mpnet-base-v2
Context size: The default context size of the model is used. Since only single sentences are processed, input texts should not need to be truncated.
The COTA embedding model runs on GPU, served & trained by GPU workers on demand
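A minimal sketch of the sentence-embedding step with the listed model; since inputs are single sentences, the default context size is sufficient:

```python
# Minimal sketch: sentence embeddings with paraphrase-multilingual-mpnet-base-v2
# (the model used for COTA).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
print(model.max_seq_length)  # default context size (in tokens), used as-is

embeddings = model.encode(
    ["Die Demonstration verlief friedlich.", "The protest remained peaceful."],
    normalize_embeddings=True,
)
print(embeddings.shape)
```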
Classification models can be selected and fine-tuned by the user for specific tasks. Currently, we only support the text modality. We offer a selection of the following models:
Context size: The default context size of the chosen model is used. Input texts that are too large are chunked during training dataset creation.
Transformer models are fine-tuned on demand to create document tagging and span classification models; a training sketch follows the list below.
- ModernBERT-base (EN): answerdotai/ModernBERT-base
- ModernBERT-large (EN): answerdotai/ModernBERT-large
- ModernGBERT_134M (DE): LSX-UniWue/ModernGBERT_134M
- mdeberta-v3-base (MULTI): microsoft/mdeberta-v3-base
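A minimal training sketch for a document-tagging model with the Hugging Face Trainer; dataset, labels, and hyperparameters are placeholders, and the suite's own chunking and training pipeline may differ:

```python
# Minimal sketch: fine-tuning a listed backbone for document tagging with the
# Hugging Face Trainer. Dataset, labels, and hyperparameters are placeholders.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

model_name = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

train_ds = Dataset.from_dict(
    {"text": ["first training document ...", "second training document ..."], "label": [0, 1]}
)
train_ds = train_ds.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="doc-tagger", num_train_epochs=3, per_device_train_batch_size=8),
    train_dataset=train_ds,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```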
Embedding models are fine-tuned on demand to create sentence classification models; a simplified classification sketch follows the list below.
- gte-modernbert-base (EN): Alibaba-NLP/gte-modernbert-base
- multilingual-e5-small (MULTI): intfloat/multilingual-e5-small
- multilingual-e5-large (MULTI): intfloat/multilingual-e5-large
- paraphrase-multilingual-mpnet-base-v2 (MULTI): sentence-transformers/paraphrase-multilingual-mpnet-base-v2
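The suite's actual fine-tuning procedure is not shown here; as a simplified stand-in, the sketch below keeps one of the listed encoders frozen and fits a lightweight classifier head on its sentence embeddings (the real workflow fine-tunes the embedding model itself):

```python
# Simplified stand-in: sentence classification on top of one of the listed embedding
# models. The encoder stays frozen and a logistic-regression head is fitted;
# the suite's actual workflow fine-tunes the embedding model on demand.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

sentences = ["This sentence supports the claim.", "Diese Aussage widerspricht der These."]
labels = [1, 0]  # placeholder annotations

embeddings = encoder.encode(sentences, normalize_embeddings=True)
classifier = LogisticRegression(max_iter=1000).fit(embeddings, labels)

new_emb = encoder.encode(["Another sentence to classify."], normalize_embeddings=True)
print(classifier.predict(new_emb))
```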
All classification models run on GPU, served & trained by GPU workers on demand
Analysis models can be run on demand to enrich documents further.
All analysis models run on GPU, served by GPU workers on demand.