Description
🔴 High Priority
🟡 Medium Priority
🟢 Low Priority
General
- Tutorial: loading and using the Suicide Risk Lexicon with `l.load_lexicon(name)` (see the sketch after this list).
- Load/save all tokens vs. only the most prototypical.
- Tutorial: add an example of loading stored embeddings: `cts.measure(documents, stored_embeddings_path=PATH)`.
- Add docstrings for everything.
- Add an example of how much a result costs for a given definition (include the date and model).
- Use a generative AI model from the Hugging Face cache instead of the Hugging Face API.
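A rough sketch of what a tutorial cell covering the two loading items above might look like. The import path, the lexicon name string, the `Lexicon()` constructor, and the `prototypical_only` flag are assumptions; only `l.load_lexicon(name)` and `cts.measure(documents, stored_embeddings_path=PATH)` come from this issue.

```python
# Hypothetical tutorial cell; the import path, lexicon name string, Lexicon()
# constructor, and prototypical_only flag are placeholders. Only
# l.load_lexicon(name) and cts.measure(..., stored_embeddings_path=...) appear in this issue.
from package import lexicon, cts  # replace with the real package name

l = lexicon.Lexicon()                           # hypothetical constructor
l.load_lexicon("suicide_risk_lexicon")          # all tokens
l.load_lexicon("suicide_risk_lexicon",
               prototypical_only=True)          # hypothetical flag: most prototypical only

documents = ["I feel hopeless and alone.", "Had a great day at the park."]

# Reuse previously computed embeddings instead of re-encoding every document.
features = cts.measure(documents, stored_embeddings_path="embeddings.pkl")
```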
Tutorial
- Save the other outputs of lexicon counting and CTS.
- Put CTS first.
- Sort lexicon by similarity for validation (see code in lexicon repo)
- Show how to load an embeddings pickle to save time (see the caching sketch after this list).
- Add ipywidgets, tqdm, and jupyter to the TOML so the progress bar renders.
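For the embeddings-pickle item, a minimal standard-library caching helper (not the package's API); `encode_fn` stands in for whatever encoder is actually used:

```python
import pickle
from pathlib import Path

def get_embeddings(documents, encode_fn, path="doc_embeddings.pkl"):
    """Return cached embeddings if the pickle exists; otherwise encode and cache."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    embeddings = encode_fn(documents)  # e.g. a sentence-transformer .encode call
    with path.open("wb") as f:
        pickle.dump(embeddings, f)
    return embeddings
```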
Lexicon
- `lexicon.extract` should output columns called `document_n` and `document_str`.
- `l.extract()` as a method instead of `lexicon.extract(l.constructs)`.
- Obtain `lexicon_dict = l.to_dict()` --> `{construct: [tokens]}` from the lexicon object (see the sketch after this list).
- `lexicon.add`: clean up how metadata is stored automatically and manually. Maybe create a brief `input()` dialogue so it saves user, timestamp, source, etc. (also illustrated in the sketch below).
- Create a lexicon from SEANCE. Modify the SEANCE tutorial accordingly.
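A self-contained sketch of the `to_dict()` shape and the proposed `input()` metadata dialogue for `lexicon.add`; the class and attribute names are placeholders, not the real API:

```python
from datetime import datetime, timezone

class LexiconSketch:
    """Stand-in for the real Lexicon class, only to illustrate to_dict()
    and metadata capture on add(); not the package's actual API."""

    def __init__(self):
        self.constructs = {}  # {construct: {"tokens": [...], "metadata": {...}}}

    def add(self, construct, tokens, prompt_metadata=True):
        metadata = {"timestamp": datetime.now(timezone.utc).isoformat()}
        if prompt_metadata:
            # brief input() dialogue so provenance is captured
            metadata["user"] = input("Your name: ")
            metadata["source"] = input("Source of these tokens: ")
        self.constructs[construct] = {"tokens": list(tokens), "metadata": metadata}

    def to_dict(self):
        # {construct: [tokens]}, dropping the metadata
        return {c: v["tokens"] for c, v in self.constructs.items()}
```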
Outputs/visualization
- Clean up output for `matches_per_construct` and `matches_per_doc`?
- I created a highlight function, but I had other code for inspecting match context by printing it (in the lexicon repo). Add it to the tutorial and scripts (see the context-printing sketch after this list).
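A possible context-printing helper for the item above, standard library only and independent of the existing highlight function:

```python
import re

def print_match_context(document, token, window=40):
    """Print each occurrence of `token` with `window` characters of context
    on either side, marking the match with >> <<."""
    for m in re.finditer(re.escape(token), document, flags=re.IGNORECASE):
        left = document[max(0, m.start() - window):m.start()]
        right = document[m.end():m.end() + window]
        print(f"...{left}>>{m.group(0)}<<{right}...")

print_match_context("I feel so hopeless lately, everything feels hopeless.", "hopeless")
```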
CTS
- 🔴 Instead of saving `cosine_similarities` (2 GB for 5000 CTL chats, compressed), provide the tuple `(lexicon token, document token, similarity)` in the DF and just output a visualization of those as HTML (see the sketch after this list).
- Exact match within a string == 1.
- Threshold = 0.4 for the final score (depends on the embedding). Remember the CTS run for the bursting study, where values (without the model) were too high for some features.
- Add additional arguments for CTS
- TODO: double-check the temperature values from the paper.
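A sketch of the 🔴 item above: collapse the full `cosine_similarities` matrix to one `(lexicon token, document token, similarity)` row per lexicon token, force exact string matches to 1, and drop rows below the 0.4 threshold. The function name and the matrix orientation (lexicon tokens × document tokens) are assumptions:

```python
import numpy as np
import pandas as pd

def top_matches(lexicon_tokens, doc_tokens, cosine_similarities, threshold=0.4):
    """Keep one (lexicon token, document token, similarity) row per lexicon token
    instead of persisting the full similarity matrix.

    Assumes cosine_similarities has shape (len(lexicon_tokens), len(doc_tokens)).
    Exact string matches are forced to 1.0; rows below `threshold` are dropped."""
    rows = []
    for i, lex_tok in enumerate(lexicon_tokens):
        j = int(np.argmax(cosine_similarities[i]))
        sim = float(cosine_similarities[i][j])
        if lex_tok.lower() in doc_tokens[j].lower():  # exact match within a string == 1
            sim = 1.0
        if sim >= threshold:
            rows.append((lex_tok, doc_tokens[j], sim))
    return pd.DataFrame(rows, columns=["lexicon_token", "document_token", "similarity"])
```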
Outputs/visualization
- CTS: plot matches in text (see the HTML highlighting sketch after this list).
- CTS: show the top token and its cosine similarity in a column of the features DF.
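One possible way to plot matches in text as HTML using only the standard library; the real visualization may differ:

```python
import html
import re

def matches_to_html(document, matched_tokens, path="matches.html"):
    """Wrap matched tokens in <mark> tags and write a small HTML page.
    Assumes the document contains no raw HTML of its own."""
    if matched_tokens:
        pattern = "|".join(re.escape(t) for t in
                           sorted(set(matched_tokens), key=len, reverse=True))
        highlighted = re.sub(pattern, lambda m: f"<mark>{m.group(0)}</mark>",
                             document, flags=re.IGNORECASE)
    else:
        highlighted = html.escape(document)
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"<html><body><p>{highlighted}</p></body></html>")
    return path

matches_to_html("I feel so hopeless lately; everything feels pointless.",
                ["hopeless", "pointless"])
```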
Extensions
- Implement in R or have an R wrapper
- Create a website where a CSV file can be uploaded and features are returned (see the minimal Flask sketch below).
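A minimal Flask sketch of the CSV-upload website; `extract_features` is a toy placeholder for the package's real feature extraction:

```python
# Minimal Flask sketch (Flask >= 2.0): POST a CSV with a 'text' column to
# /features and get a features CSV back. extract_features() is a toy placeholder.
import io

import pandas as pd
from flask import Flask, request, send_file

app = Flask(__name__)

def extract_features(texts):
    """Placeholder: word counts per document; swap in the real extractor."""
    return pd.DataFrame({"n_words": [len(t.split()) for t in texts]})

@app.route("/features", methods=["POST"])
def features():
    df = pd.read_csv(request.files["file"])
    out = extract_features(df["text"].astype(str).tolist())
    buf = io.BytesIO()
    out.to_csv(buf, index=False)
    buf.seek(0)
    return send_file(buf, mimetype="text/csv", download_name="features.csv")

if __name__ == "__main__":
    app.run(debug=True)
```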