Add topic-modeling pipelines (v1-v3) and evaluator #3
Introduces the topic_modeling/ module – three pipelines plus an evaluator – without touching any other part of the repo.

Pipelines
src/consultation_topic_modeling_v1.py – baseline BERTopic workflow.
src/consultation_topic_modeling_v2.py – improved version (automatic stop-words, SBERT/GTE embeddings, UMAP + HDBSCAN, JSON titling).
src/consultation_topic_modeling_v3.py – hybrid key-phrase approach inspired by QualIT (Gemma key-phrase extraction, hallucination filter, titles).

Evaluator
src/evaluate_topics.py – asks an LLM (Gemini API or local Gemma) to grade how well each auto-generated topic title fits its representative comments.
Supports --dump_prompts to write prompts to disk without burning API tokens.

Documentation
House-keeping
.gitignore updated to keep secrets & artefacts out of the repo (*.env, /topic_modeling/.env, __pycache__/).
.env with the Gemini API key is not committed.

Tested
Ran all three pipelines locally on consultation 320 and generated outputs under outputs/v1|v2|v3/320/.
Evaluator tested with both --gemini (free-tier Flash) and --local Gemma.

Security / secrets
No credentials or personal data included.
API key remains in topic_modeling/.env, which is ignored by Git.

Impact on existing code
Only .gitignore modified; all other existing directories remain unchanged.
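
For reviewers who want a feel for the evaluator's --dump_prompts mode described above, here is a minimal sketch of how such a mode might work. It is not the code in src/evaluate_topics.py: the function names (build_prompt, dump_prompts, parse_args), the prompt wording, the 1-5 scale, the file naming and the treatment of --dump_prompts as a boolean flag are all assumptions made for illustration. Only the flag name itself comes from this PR.

```python
import argparse
from pathlib import Path


def build_prompt(title: str, comments: list[str]) -> str:
    """Build a grading prompt for one topic (wording and scale are illustrative)."""
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(comments, start=1))
    return (
        "Rate from 1 (poor) to 5 (excellent) how well the topic title "
        "fits the representative comments.\n"
        f"Title: {title}\n"
        f"Comments:\n{numbered}\n"
        "Answer with a single integer."
    )


def dump_prompts(topics: dict[str, list[str]], out_dir: Path) -> list[Path]:
    """Write one prompt file per topic instead of sending it to an LLM."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for idx, (title, comments) in enumerate(topics.items()):
        # Hypothetical naming scheme: prompt_000.txt, prompt_001.txt, ...
        path = out_dir / f"prompt_{idx:03d}.txt"
        path.write_text(build_prompt(title, comments), encoding="utf-8")
        paths.append(path)
    return paths


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the one flag sketched here; assumed to be a plain on/off switch."""
    parser = argparse.ArgumentParser(description="Sketch of a topic-title evaluator")
    parser.add_argument(
        "--dump_prompts",
        action="store_true",
        help="write prompts to disk and skip the LLM call",
    )
    return parser.parse_args(argv)
```

Under these assumptions, calling dump_prompts({"Housing costs": ["Rent is too high."]}, Path("prompts")) would write prompts/prompt_000.txt without any API traffic, which is the token-saving behaviour the flag is meant to provide.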