MeTa4 is a corpus analysis tool for identifying and exploring metaphor usage following the MIPVU procedure. It supports the VU Amsterdam Metaphor Corpus (VUAMC) and user-provided CSV/TSV/XML/ZIP datasets. It reports counts and densities of metaphor-related words (MRWs), exports KWIC concordances, runs regex/CQL searches, and produces tidy CSV outputs.
For installation, launching, platform notes, and troubleshooting, see LAUNCHER.md (authoritative startup guide).
- Tokens with punctuation POS tags are excluded from the lexical unit (LU) denominator in all MRW rate calculations.
- B1G subset: only whitelisted sentence numbers are loaded (singles such as `1012, 1299, 1401`; ranges such as `738–765`, `1485–1584`). Non‑listed B1G sentences are ignored.
- Tokens outside `<s>` (non‑B1G): wrapped into synthetic sentence IDs like `nosent####` to avoid data loss.
- Rows marked as DFMA/DFMA_PUNCT are removed from analysis.
- Excludes `type ∈ {lex, morph, phrase}` and rows that are pure `mflag` annotations.
- Segments under `seg@function="trunc"` are skipped.
- MRW‑labelled `of` in News is de‑counted to mitigate a known annotation artefact.
- TEI anchor/part pairs (`xml:id` ↔ `corresp`) are merged into a single token. `Original_Word` and `Lemma` are concatenated; POS joins as `POS+POS`. `MRW`/`mflag` propagate to the anchor; `type`/`subtype` are unioned using `|` with duplicates removed.
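The anchor/part merge rule can be sketched as follows. This is a minimal illustration under stated assumptions, not MeTa4's actual parser: the function names, the dict-based token layout, the `Metaphor` key for MRW status, the space used to join word forms and lemmas, and the assumption that parts follow their anchor in token order are all hypothetical.

```python
def _union(*values):
    """Union '|'-separated labels, dropping blanks and duplicates, keeping order."""
    seen = []
    for value in values:
        for label in (value or "").split("|"):
            if label and label not in seen:
                seen.append(label)
    return "|".join(seen)


def merge_anchor_parts(tokens, joiner=" "):
    """Fold each token whose `corresp` points at an anchor's `xml:id` into that anchor.

    `joiner` is an assumption; the source only says the fields are concatenated.
    """
    # Work on copies so the caller's token dicts are left untouched.
    anchors = {t["xml:id"]: dict(t) for t in tokens if t.get("xml:id")}
    merged = []
    for tok in tokens:
        target = (tok.get("corresp") or "").lstrip("#")  # accept "#id" or "id"
        anchor = anchors.get(target)
        if anchor is None:
            # Ordinary token, or the anchor itself: pass it through.
            merged.append(anchors.get(tok.get("xml:id")) or dict(tok))
            continue
        anchor["Original_Word"] += joiner + tok["Original_Word"]
        anchor["Lemma"] += joiner + tok["Lemma"]
        anchor["POS"] += "+" + tok["POS"]
        # MRW status and mflag propagate from the part to the anchor.
        anchor["Metaphor"] = anchor.get("Metaphor") or tok.get("Metaphor")
        anchor["MFlag"] = anchor.get("MFlag") or tok.get("MFlag")
        anchor["Type"] = _union(anchor.get("Type"), tok.get("Type"))
        anchor["Subtype"] = _union(anchor.get("Subtype"), tok.get("Subtype"))
    return merged
```

Because the anchor copy is appended to the output once and then mutated in place, later parts update the same merged token rather than creating extra rows.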
- Summary metrics (raw/LU/MRW; density per 1,000 LU; share of corpus) plus two KWIC CSVs (MRW and non‑MRW).
- KWIC inserts `[SENT_START]`/`[SENT_END]` sentinels; MRWs are suffixed `_Ω`, capped by `OMEGA_LIMIT` (default 3).
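As a rough illustration of the summary metrics, the sketch below computes raw, LU, and MRW counts and the density per 1,000 LU with punctuation left out of the denominator. The function name, the `(pos, is_mrw)` input shape, and the rule that punctuation tags start with `PU` (as in the BNC C5 tags `PUL`, `PUN`, `PUQ`, `PUR`) are assumptions, not MeTa4's confirmed behaviour.

```python
def summarise(tokens, punct_prefix="PU"):
    """Summarise an iterable of (pos, is_mrw) pairs.

    `punct_prefix` is a hypothetical stand-in for the tool's punctuation filter.
    """
    tokens = list(tokens)
    raw = len(tokens)
    # Lexical units: everything except punctuation.
    lexical = [(pos, mrw) for pos, mrw in tokens if not pos.startswith(punct_prefix)]
    lu = len(lexical)
    mrw = sum(1 for _, is_mrw in lexical if is_mrw)
    density = 1000 * mrw / lu if lu else 0.0
    return {"raw": raw, "lu": lu, "mrw": mrw, "density_per_1000_lu": density}
```

For example, five tokens of which one is punctuation and one is an MRW give 4 LU and a density of 250 per 1,000 LU.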
- Executes the single‑lemma pipeline for a list provided inline (comma‑separated) or via text file (one lemma per line).
- Counts left/right collocates by distance (default `COLLOCATION_WINDOW = 5`), aggregates duplicates; exports `lemma, collocate, pos, side, distance, count`.
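The windowed collocate count can be sketched roughly as below. It is a simplified illustration, not the tool's code: the function name, the `(lemma, pos)` sentence layout, and the `L`/`R` side labels are assumptions, and the window is assumed to stop at sentence boundaries.

```python
from collections import Counter

COLLOCATION_WINDOW = 5  # default window named in the text above


def collocates(sentence, node_lemma, window=COLLOCATION_WINDOW):
    """Count collocates of `node_lemma` within one sentence.

    `sentence` is a list of (lemma, pos) pairs. Returns rows shaped like the
    export: (lemma, collocate, pos, side, distance, count).
    """
    counts = Counter()
    for i, (lemma, _) in enumerate(sentence):
        if lemma != node_lemma:
            continue
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            coll, pos = sentence[j]
            side = "L" if j < i else "R"
            # Identical (collocate, pos, side, distance) hits are aggregated.
            counts[(node_lemma, coll, pos, side, abs(j - i))] += 1
    return [key + (n,) for key, n in sorted(counts.items())]
```

Keying the counter on the full tuple is what aggregates duplicates: two occurrences of the same collocate at the same side and distance become one row with `count = 2`.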
- Word‑form default (enforce with `w:`); lemma with `l:`; raw regex with `re:`.
- CQL subset supported, e.g., `[lemma="gehen"] []{0,1} [pos="NN.*"]`. Saves KWIC CSVs for matches.
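One way to picture the query prefixes is a small dispatcher that maps each prefix to a token field and a compiled pattern. The sketch below is hypothetical: the function names, case-insensitive whole-token matching for `w:`/`l:`, and applying `re:` patterns to the word form are assumptions rather than documented MeTa4 behaviour, and CQL queries are not covered.

```python
import re


def compile_query(query):
    """Return (field, compiled_pattern) for a prefixed single-token query."""
    if query.startswith("re:"):
        # Raw regex, used as written (assumed to target the word form).
        return "word", re.compile(query[3:])
    if query.startswith("l:"):
        return "lemma", re.compile(re.escape(query[2:]), re.IGNORECASE)
    if query.startswith("w:"):
        query = query[2:]
    # No prefix: word form is the default.
    return "word", re.compile(re.escape(query), re.IGNORECASE)


def matches(token, query):
    """Check one token dict (keys 'word' and 'lemma') against a query."""
    field, pattern = compile_query(query)
    return pattern.fullmatch(token[field]) is not None
```

Escaping the literal queries keeps characters such as `.` from being read as regex operators unless the user opts in with `re:`.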
- Full flat CSV for current scope; MRW list; MRW‑by‑POS (enter e.g., `NN`).
- Auto‑discovers `VUAMC.xml` or a ZIP containing it.
- Parses once per (path, mtime) and caches a DataFrame to avoid re‑parsing.
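The parse-once policy can be illustrated with a cache keyed on the file path and its modification time. This is a minimal sketch: `load_cached`, the module-level `_CACHE` dict, and the `parse` callback are hypothetical names, and the real tool caches a DataFrame rather than an arbitrary object.

```python
import os

_CACHE = {}  # maps (absolute path, mtime) to the parsed result


def load_cached(path, parse):
    """Call `parse(path)` at most once per (absolute path, mtime) pair."""
    key = (os.path.abspath(path), os.path.getmtime(path))
    if key not in _CACHE:
        _CACHE[key] = parse(path)
    return _CACHE[key]
```

Including the mtime in the key means an edited or replaced corpus file is re-parsed automatically, while repeated loads of an unchanged file return the cached object.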
- CSV/TSV: delimiter auto‑detect; robust encoding attempt (`utf‑8`, `latin1`, `iso‑8859‑1`). Column names are normalized (e.g., `File_ID → file_id`).
  - Minimal required columns: `lemma, word, metaphor_function, pos`.
  - Recommended: `file_id, sentence_id, type, subtype, mflag, genre`.
- XML/ZIP: TEI parsing applies the same MWE merge, masks, and sentence policies as VUAMC.
- KWIC/collocations benefit from a valid `sentence_id` for precise boundaries.
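The CSV/TSV loading steps (encoding fallback, delimiter detection, column normalization, required-column check) can be sketched with the standard library alone. The function names, the candidate delimiters, and lower-casing as the normalization rule are assumptions; MeTa4 itself works with DataFrames and may differ in detail.

```python
import csv
import io

REQUIRED = {"lemma", "word", "metaphor_function", "pos"}
# In Python, "iso-8859-1" is an alias of "latin1", and latin1 can decode any
# byte string, so it serves as the final fallback.
ENCODINGS = ("utf-8-sig", "latin1")


def decode(raw):
    """Decode bytes with the first encoding that succeeds."""
    for enc in ENCODINGS:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode input")


def read_table(raw):
    """Parse CSV/TSV bytes into a list of dicts with normalized column names."""
    text = decode(raw)
    try:
        delimiter = csv.Sniffer().sniff(text[:4096], delimiters=",\t;").delimiter
    except csv.Error:
        # Fallback guess when sniffing is inconclusive.
        delimiter = "\t" if "\t" in text.split("\n", 1)[0] else ","
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = [name.strip().lower() for name in next(reader)]
    missing = REQUIRED - set(header)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    return [dict(zip(header, row)) for row in reader]
```

Trying `utf-8-sig` first also strips a leading byte order mark, which keeps the first column name from picking up an invisible prefix.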
All exports are written to a subfolder of `results/` inside the script folder. CSVs are UTF‑8 with BOM for Excel compatibility.
- Single/Batch KWIC: dual CSVs (`*_MRW.csv`, `*_NONMRW.csv`) including left/right contexts, node token, lemma/genre, `sentence_id`, and `metaphor_function`.
- Pattern KWIC: per‑query CSV with sentence‑level context and the original query.
- Collocations: `lemma, collocate, pos, side, distance, count`.
- Print / Full flat CSV: `File_ID, Sentence_ID, Original_Word, Lemma, POS, Metaphor, Type, Subtype, MFlag, Genre, xml:id, corresp`.
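Writing a CSV as UTF‑8 with a BOM is a one-line encoding choice in Python. The helper below is a sketch: `write_csv`, its parameters, and the example file name are hypothetical, and the real exporters may name and organise files differently.

```python
import csv
from pathlib import Path


def write_csv(rows, fieldnames, out_dir, name):
    """Write dict rows to `out_dir/name` as UTF-8 with a byte order mark."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / name
    # "utf-8-sig" prepends the BOM that Excel uses to detect UTF-8.
    with open(path, "w", encoding="utf-8-sig", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path
```

A call such as `write_csv(rows, ["lemma", "collocate", "count"], "results/run1", "collocations.csv")` would create the folder if needed and return the written path.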
- Environment variables:
For platform setup and launch, see LAUNCHER.md.
- `meta4.cli`: entry menu and user interaction
- `meta4.io`: discovery & loading (CSV/TSV/XML/ZIP), VUAMC auto‑detect & caching
- `meta4.parser`: TEI parsing, sentence handling, MWE merge
- `meta4.analysis`: MRW masks, counts, densities, exports
- `meta4.cql`: CQL/regex machinery and query execution
- `meta4.mipvu`: MIPVU‑specific rules
- `meta4.utils` / `meta4.config`: utilities and configuration
- MIPVU: A Method for Linguistic Metaphor Identification (Vrije Universiteit).
- VU Amsterdam Metaphor Corpus (VUAMC).
- Project license: see repository license declaration.
Developed and maintained as part of ongoing metaphor research at the University of Erfurt.
[Daban Q. Jaff] (2025). MeTa4 Metaphor Analysis Tool. Available at: https://github.com/dabjaff/MeTa4-Metaphor-Analysis-Tool