
Machine Learning Models

Tim Fischer edited this page Dec 8, 2025 · 15 revisions

The Discourse Analysis Tool Suite relies heavily on various machine learning models, especially during document preprocessing. Here, we list all models that are currently in use.

Document Preprocessing

OCR & PDF Processing

Docling runs on GPU, served by Docling-serve

Language Detection

GlotLID runs on CPU, served by Ray
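
GlotLID is a fastText-based classifier whose predictions are labels of the form `__label__eng_Latn`, i.e. an ISO 639-3 language code plus an ISO 15924 script code. As a hedged illustration (the helper below is not part of the tool suite's API), turning such a label into its two components might look like:

```python
# Sketch: parsing a GlotLID prediction label into language and script.
# GlotLID (a fastText classifier) returns labels such as "__label__eng_Latn",
# i.e. an ISO 639-3 language code plus an ISO 15924 script code.
# This helper is illustrative, not part of the tool suite's API.

def parse_glotlid_label(label: str) -> tuple[str, str]:
    """Split a fastText-style label into (language, script)."""
    code = label.removeprefix("__label__")
    language, _, script = code.partition("_")
    return language, script

print(parse_glotlid_label("__label__deu_Latn"))  # -> ('deu', 'Latn')
```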

Named Entity Recognition

spaCy runs on CPU, served by Ray

Object Detection

DETR runs on GPU, served by Ray

Automatic Video & Audio Transcriptions

Whisper runs on GPU, served by Ray

Large Language Models

Gemma 3 27B is used for:

  • Image Captioning during document preprocessing
  • LLM Assistant (metadata extraction, document tagging, sentence annotation, span annotation)
  • Automatic memo generation
  • RAG chat in Perspectives extension
  • Perspectives Document Rewriting

Gemma 3 runs on GPU, served by vLLM
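
vLLM exposes an OpenAI-compatible chat-completions API, so tasks like document tagging reduce to assembling a chat request. The sketch below builds such a payload; the model identifier, prompt wording, and parameters are illustrative assumptions, not the tool suite's actual configuration.

```python
# Sketch: building a chat-completions payload for a vLLM server.
# vLLM exposes an OpenAI-compatible API; the model name, prompt wording,
# and parameters below are illustrative assumptions, not the tool
# suite's actual configuration.
import json

def build_tagging_request(document: str, tags: list[str]) -> dict:
    """Assemble the JSON body for a document-tagging chat request."""
    return {
        "model": "google/gemma-3-27b-it",  # assumed model identifier
        "messages": [
            {"role": "system",
             "content": "You assign tags to documents. "
                        f"Choose only from: {', '.join(tags)}."},
            {"role": "user", "content": document},
        ],
        "temperature": 0.0,  # deterministic output for tagging
    }

body = build_tagging_request("The parliament debated the new climate bill.",
                             ["politics", "sports", "economy"])
print(json.dumps(body, indent=2))
```

The resulting body would be POSTed to the server's `/v1/chat/completions` endpoint.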

Embedding Models

Similarity Search

CLIP runs on GPU, served by Ray
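
CLIP maps images and texts into a shared vector space, so similarity search reduces to ranking stored vectors by cosine similarity to a query vector. A minimal sketch with placeholder vectors (not real CLIP outputs, and not the suite's retrieval code):

```python
# Sketch: similarity search over embedding vectors via cosine similarity.
# CLIP embeds images and texts into a shared vector space; retrieval then
# ranks stored vectors by cosine similarity to the query. The vectors
# below are placeholders, not real CLIP outputs.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

corpus = {"doc1": [0.9, 0.1, 0.0], "doc2": [0.1, 0.9, 0.1], "doc3": [0.7, 0.3, 0.1]}
query = [1.0, 0.0, 0.0]

ranked = sorted(corpus, key=lambda k: cosine(query, corpus[k]), reverse=True)
print(ranked)  # -> ['doc1', 'doc3', 'doc2']
```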

Document Embeddings

Arctic Embed runs on GPU, served by vLLM

Instruction-tuned Embedding Models

Context size: ??

Instruction-tuned embedding models run on GPU, served & trained by GPU workers on demand
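
Instruction-tuned embedding models (e.g. the E5-instruct family) typically prepend a task description to the query before embedding. Since this wiki does not name the exact models, whether the tool suite uses this particular template is an assumption; the common convention looks like:

```python
# Sketch: the prompt format commonly used by instruction-tuned embedding
# models (e.g. the E5-instruct family), where a task description is
# prepended to the query before embedding. Whether the tool suite uses
# exactly this template is an assumption.

def format_query(task: str, query: str) -> str:
    return f"Instruct: {task}\nQuery: {query}"

print(format_query("Retrieve passages about climate policy",
                   "emission trading schemes"))
```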

Concept over Time Analysis

Context size: The default context size of the model is used. Since only sentences are processed, input texts should not be truncated.

The COTA embedding model runs on GPU, served & trained by GPU workers on demand
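
Because COTA embeds individual sentences, documents must first be segmented into sentences. The naive regex splitter below is only illustrative; a production pipeline would use a proper sentence segmenter (e.g. spaCy), and the suite's actual segmentation strategy is not specified here.

```python
# Sketch: naive sentence splitting before sentence-level embedding.
# COTA processes individual sentences, so documents are segmented first.
# A real system would use a proper segmenter (e.g. spaCy); this regex
# split is only illustrative.
import re

def split_sentences(text: str) -> list[str]:
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sents = split_sentences("Prices rose. Analysts disagreed! What comes next?")
print(sents)  # each sentence stays well below the model's context size
```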

Classification Models

Classification models can be selected and fine-tuned by the user for specific tasks. Currently, we only support the text modality. We offer a selection of the following models:

Context size: The default context size of the chosen model is used. Input texts that are too large are chunked during training dataset creation.
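
Chunking over-long inputs during training-dataset creation can be sketched as a sliding word window with overlap. Window size, overlap, and the word-level unit below are illustrative assumptions; real pipelines usually chunk by tokens of the chosen model's tokenizer.

```python
# Sketch: chunking over-long texts during training-dataset creation.
# Texts exceeding the model's context size are split into overlapping
# word-window chunks. Window/overlap sizes and the word-level unit are
# illustrative assumptions; real systems usually chunk by tokens.

def chunk_words(text: str, max_words: int = 128, overlap: int = 16) -> list[str]:
    words = text.split()
    step = max_words - overlap  # advance by window minus overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # last window already covers the tail
    return chunks

long_text = " ".join(f"w{i}" for i in range(300))
chunks = chunk_words(long_text)
print(len(chunks), [len(c.split()) for c in chunks])  # -> 3 [128, 128, 76]
```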

Base Transformer models

Transformer models are fine-tuned on demand to create document tagging and span classification models.

Embedding models

Embedding models are fine-tuned on demand to create sentence classification models.

All classification models run on GPU, served & trained by GPU workers on demand

Analysis Models

Analysis models can be run on demand to enrich documents further.

Coreference Resolution (DE)

Quotation Detection (DE)

All analysis models run on GPU, served by GPU workers on demand.
