New features #2

Open
@danielmlow

Description

🔴 High Priority
🟡 Medium Priority
🟢 Low Priority

General

  • Tutorial: loading and using the Suicide Risk Lexicon with l.load_lexicon(name)
  • load/save all vs. most prototypical
  • Tutorial: add example loading embeddings. cts.measure(documents, stored_embeddings_path = PATH)
  • Add docstrings for everything.
  • Add example of how much a result costs for a definition (add date and model).
  • Use generative AI model from huggingface from cache, instead of huggingface API

Tutorial

  • save other outputs of lexicon count and cts.
  • Put CTS first.
  • Sort lexicon by similarity for validation (see code in lexicon repo)
  • Show how to load embeddings pickle to save time.
  • Add ipywidgets and tqdm and jupyter to toml so you can view progress bar.
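
For the "load embeddings pickle to save time" item, a minimal sketch of the caching idea. Everything here is hypothetical for illustration: load_or_compute_embeddings, cache_path and embed_fn are not names from the library, and the real stored_embeddings_path handling in cts.measure may work differently.

```python
import pickle
from pathlib import Path
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def load_or_compute_embeddings(
    tokens: Sequence[str],
    cache_path: Path,
    embed_fn: Callable[[Sequence[str]], List[Vector]],
) -> Dict[str, Vector]:
    """Return {token: vector}, embedding only tokens missing from the pickle.

    Hypothetical helper: embed_fn stands in for whatever embedding model is used.
    """
    cache: Dict[str, Vector] = {}
    if cache_path.exists():
        # Only unpickle files you created yourself; pickle can execute code.
        with cache_path.open("rb") as f:
            cache = pickle.load(f)

    # dict.fromkeys removes duplicates while keeping the original order.
    missing = [t for t in dict.fromkeys(tokens) if t not in cache]
    if missing:
        for token, vector in zip(missing, embed_fn(missing)):
            cache[token] = vector
        with cache_path.open("wb") as f:
            pickle.dump(cache, f)

    return {t: cache[t] for t in tokens}
```

On a second run with the same path, only tokens that were never embedded reach embed_fn, which is where the time saving would come from.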

Lexicon

  • lexicon.extract should output columns called document_n and document_str
  • l.extract() as a method instead of lexicon.extract(l.constructs)
  • obtain lexicon_dict = l.to_dict() --> {construct: [tokens]} from lexicon object.
  • lexicon.add clean up how I store metadata automatically and manually. Maybe create a brief input() dialogue so it saves user, timestamp, source, etc.
  • create lexicon from seance. Modify seance tutorial accordingly.
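
A rough sketch of what l.to_dict() and l.extract() as methods could look like. The Lexicon class below is a hypothetical stand-in, not the library's actual class: the internal layout of constructs, the add signature, and the whole-word counting rule are all assumptions made for illustration.

```python
import re
from typing import Dict, List, Sequence


class Lexicon:
    """Hypothetical stand-in for the lexicon object discussed above."""

    def __init__(self) -> None:
        # Assumed layout: {construct: {"tokens": [...]}}; metadata is omitted.
        self.constructs: Dict[str, Dict[str, list]] = {}

    def add(self, construct: str, tokens: Sequence[str]) -> None:
        entry = self.constructs.setdefault(construct, {"tokens": []})
        for token in tokens:
            if token not in entry["tokens"]:
                entry["tokens"].append(token)

    def to_dict(self) -> Dict[str, List[str]]:
        """Return {construct: [tokens]} as plain Python objects."""
        return {name: list(entry["tokens"]) for name, entry in self.constructs.items()}

    def extract(self, documents: Sequence[str]) -> List[dict]:
        """Count whole-word, case-insensitive token matches per construct.

        Each row carries document_n and document_str plus one count per construct.
        """
        rows = []
        for n, doc in enumerate(documents):
            row = {"document_n": n, "document_str": doc}
            for construct, tokens in self.to_dict().items():
                row[construct] = sum(
                    len(re.findall(r"\b" + re.escape(t) + r"\b", doc, flags=re.IGNORECASE))
                    for t in tokens
                )
            rows.append(row)
        return rows
```

The list of row dicts can be passed straight to pandas.DataFrame if a DataFrame output is preferred.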

Outputs/visualization

  • Clean up output for matches_per_construct, matches_counter_d and matches_per_doc?
  • I created a highlight function. But I had other code I used to look at context by printing (in lexicon repo). add to tutorial and scripts.

CTS:

  • 🔴 instead of saving cosine_similarities (2GB for 5000 CTL chats, compressed), you can provide the tuple (lexicon token, document token, similarity) in the DF. And just output a visualization for those as HTML.
  • exact match within a string == 1
  • threshold = 0.4 (depends on embedding) for final score. Remember CTS for bursting study, where values (without model) were too high for some features.
  • Add additional arguments for CTS
  • # TODO: double check values for temperature from paper
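
One way the first three items could fit together, as a sketch only. top_match is a hypothetical function, not part of the current CTS code, and two behaviors are assumptions: an exact match is detected as a case-insensitive substring of the document, and a best score below the threshold is reported as no match with a score of 0.0.

```python
import math
from typing import Dict, Optional, Sequence, Tuple

Vector = Sequence[float]
Match = Tuple[Optional[str], Optional[str], float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_match(
    lexicon_tokens: Dict[str, Vector],
    document_tokens: Dict[str, Vector],
    document_str: str,
    threshold: float = 0.4,
) -> Match:
    """Return (lexicon token, document token, similarity) for one construct.

    Only this triple is kept, not the full cosine similarity matrix.
    """
    # Assumed rule: a lexicon token found verbatim in the document scores 1.0.
    lowered = document_str.lower()
    for lex_token in lexicon_tokens:
        if lex_token.lower() in lowered:
            return (lex_token, lex_token, 1.0)

    best: Match = (None, None, 0.0)
    for lex_token, lex_vec in lexicon_tokens.items():
        for doc_token, doc_vec in document_tokens.items():
            sim = cosine(lex_vec, doc_vec)
            if sim > best[2]:
                best = (lex_token, doc_token, sim)

    # Assumed rule: scores under the threshold count as no match.
    if best[2] < threshold:
        return (None, None, 0.0)
    return best
```

Storing one triple per document and construct keeps the output small enough to sit in the features DF, and the same triples can feed an HTML visualization.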

Outputs/visualization

  • CTS: Plot matches in text
  • CTS: show top token and cosine similarity in column of features DF
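
For plotting matches in text, a minimal HTML sketch. highlight_matches is a hypothetical helper, separate from the existing highlight function mentioned earlier; it assumes matches arrive as (lexicon token, document token, similarity) triples and marks only the first occurrence of each document token.

```python
import html
from typing import Iterable, Tuple


def highlight_matches(document_str: str, matches: Iterable[Tuple[str, str, float]]) -> str:
    """Wrap matched document tokens in <mark>, with the lexicon token and
    similarity as a hover title. Overlapping matches keep the earliest one."""
    lowered = document_str.lower()
    spans = []
    for lex_token, doc_token, sim in matches:
        if not doc_token:
            continue
        start = lowered.find(doc_token.lower())
        if start != -1:
            spans.append((start, start + len(doc_token), lex_token, sim))
    spans.sort()

    parts = []
    cursor = 0
    for start, end, lex_token, sim in spans:
        if start < cursor:
            continue  # overlaps a span that was already marked
        parts.append(html.escape(document_str[cursor:start]))
        title = html.escape(f"{lex_token}: {sim:.2f}")
        marked = html.escape(document_str[start:end])
        parts.append(f'<mark title="{title}">{marked}</mark>')
        cursor = end
    parts.append(html.escape(document_str[cursor:]))
    return "".join(parts)
```

Spans are located in the raw text before escaping, so inserted markup is never matched by a later token. The offsets assume lowercasing does not change string length, which holds for typical English text.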

Extensions

  • Implement in R or have an R wrapper
  • Create a website where csv file can return features.
