Commit: save
brilee committed Jan 30, 2024
1 parent e95c628 commit 80892fb
Showing 5 changed files with 201 additions and 105 deletions.
57 changes: 35 additions & 22 deletions docs/getting_started/quickstart.md
@@ -6,8 +6,9 @@ In this quick start we're going to:

- Load [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca), a popular instruction dataset
for tuning LLMs.
- Find PII (emails, etc)
- Find profanity in the responses (using powerful text embeddings)
- Compute clusters.
- Delete specific clusters.
- Find profanity in the remaining rows (using powerful text embeddings)
- Download the enriched dataset as a json file so we can clean it in a Python notebook

## Start the web server
@@ -35,7 +36,13 @@ Click the `Add dataset` button on the Getting Started page and fill in:
Fill in HuggingFace-specific fields:

3. HuggingFace dataset name: `Open-Orca/OpenOrca`
4. Sample size: 10000 (it takes ~5mins to compute on-device embeddings for 10,000 items)
4. Sample size: 10000

```{note}
Lilac's sweet spot is 10,000-100,000 rows of data, although up to 10 million rows are possible.
This quickstart uses 10,000 rows so that clustering and embedding operations finish locally
in ~10 minutes even without a GPU.
```

Finally:

@@ -58,30 +65,35 @@ your media field contains markdown, you can enable markdown rendering.

<video loop muted autoplay controls src="../_static/getting_started/orca-settings.mp4"></video>

## Enrich
## Cluster

Lilac can enrich your media fields with additional metadata by:
Lilac can detect clusters in your dataset. Clusters are a powerful way to understand the types of
content present in your dataset, as well as to target subsets for removal from the dataset.

- Running a [signal](../signals/signals.md) (e.g. PII detection, language detection, text
statistics, etc.)
- Running a [concept](../concepts/concepts.md) (e.g. profanity, sentiment, etc. or a custom concept
that you create)
To cluster, open up the dataset schema tray to reveal the fields in your dataset. Here, you can
choose which field will get clustered.

### PII detection
** Add clustering video here **

Let's run the PII detection signal on both the `question` and the `response` field and see if there
is any PII like emails, secret tokens, etc.
The cluster visualizer shows two hierarchical levels of clusters by default. You can also group over
other fields in your dataset by changing the Explore and Group By selections.

<video loop muted autoplay controls src="../_static/getting_started/orca-pii-enrichment.mp4"></video>
## Tagging and Deleting rows

Once it's done, we can see that both the `question` and the `response` fields have emails present.
We can click on an email to apply a filter and see all the rows that contain that email.
Lilac can curate your dataset by tagging or deleting rows.

<video loop muted autoplay controls src="../_static/getting_started/orca-pii-filter.mp4"></video>
Deleting is not permanent - you can toggle visibility of deleted items - but it is a convenient way
to iterate on your dataset by removing undesired slices of data. Later on, when you export data from
Lilac, deleted rows will be excluded by default.
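The export behavior can be pictured as a soft-delete flag on each row (the exported data does carry a `__deleted__` column). The sketch below only models these semantics; the `export_rows` function is hypothetical, not Lilac's API:

```python
# Illustrative model of soft deletion: rows carry a deleted flag,
# and export skips flagged rows unless asked otherwise.
# (`export_rows` is a hypothetical helper, not part of Lilac.)
rows = [
    {"question": "What is 2+2?", "__deleted__": False},
    {"question": "Write a movie review.", "__deleted__": True},
]

def export_rows(rows, include_deleted=False):
    """Return rows for export; deleted rows are excluded by default."""
    return [r for r in rows if include_deleted or not r["__deleted__"]]

print(len(export_rows(rows)))                        # 1: the deleted row is skipped
print(len(export_rows(rows, include_deleted=True)))  # 2: visibility toggled back on
```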

We notice that the selected email in the `response` field was not hallucinated by the LLM because it
was also present in the `question` field. Later we can use the enriched metadata of both fields to
filter out only responses that have hallucinated emails.
## Enrich

Lilac can enrich your media fields with additional metadata by:

- Running a [signal](../signals/signals.md) (e.g. PII detection, language detection, text
statistics, etc.)
- Running a [concept](../concepts/concepts.md) (e.g. profanity, sentiment, etc. or a custom concept
that you create)

### Profanity detection

@@ -112,9 +124,10 @@ can open the statistics panel to see the distribution of concept scores.

## Download

Now that we've enriched the dataset, let's download it by clicking on the `Download data` button in
the top-right corner. This will download a json file with the same name as the dataset. Once we have
the data, we can continue working with it in a Python notebook, or any other language.
Now that we've clustered, curated, and enriched the dataset, let's download it by clicking on the
`Download data` button in the top-right corner. This will download a json file with the same name as
the dataset. Once we have the data, we can continue working with it in a Python notebook, or any
other language.

You can also get the dataset as a Pandas dataframe through the [Python API](quickstart_python.md).

204 changes: 131 additions & 73 deletions docs/getting_started/quickstart_python.md
@@ -1,8 +1,27 @@
# Python API

Lilac's UI is built atop a Python library, which you can access through the `lilac` module. If you'd
like to use Lilac's features alongside other popular Python libraries, or prefer a notebook
workflow, read on.
Lilac's UI is built atop a Python library, which you can access through the `lilac` module. The UI
generally defers all computation to Python, so if the feature is in the UI, you'll be able to do the
same from Python.

The UI excels at interactive exploration and tagging/deletion, while the Python API provides
powerful primitives, like `map`, which allows you to run arbitrary Python computations with
developer-friendly features like progress tracking and resumability.

To get the best of both worlds, you can run `ll.start_server()` in your Python notebook or
interpreter to start the Lilac backend as a background thread, and then continue with using the
Lilac API. (Running the Lilac server in the same Python process/kernel is recommended because Lilac
can then share the same database connections and in-memory caches, lowering memory usage and
ensuring data consistency between UI and API.)

In this quickstart, we're going to:

- Load [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca), a popular instruction dataset
for tuning LLMs.
- Compute clusters.
- Delete specific clusters.
- Find profanity in the remaining rows (using powerful text embeddings)
- Download the enriched dataset as a json file so we can clean it in a Python notebook

## Import lilac

@@ -21,11 +40,11 @@ ll.set_project_dir('~/my_project')

Let's load [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca), a popular instruction
dataset used for tuning LLM models. While the Lilac tool can scale to millions of rows on a single
machine, we are sampling to 100,000 so we can get started quickly.
machine, we are sampling to 10,000 so we can get started quickly.

```python
source = ll.HuggingFaceSource(dataset_name='Open-Orca/OpenOrca', sample_size=100_000)
config = ll.DatasetConfig(namespace='local', name='open-orca-100k', source=source)
source = ll.HuggingFaceSource(dataset_name='Open-Orca/OpenOrca', sample_size=10_000)
config = ll.DatasetConfig(namespace='local', name='open-orca-10k', source=source)
dataset = ll.create_dataset(config)
```

@@ -36,7 +55,6 @@ Downloading data files: 100%|█████████████████
Extracting data files: 100%|███████████████████████████████████████| 1/1 [00:00<00:00, 318.98it/s]
Setting num_proc from 8 to 2 for the train split as it only contains 2 shards.
Generating train split: 4233923 examples [00:06, 654274.93 examples/s]
Reading from source huggingface...: 100%|██████████████| 100000/100000 [00:03<00:00, 30124.10it/s]
Dataset "open-orca-10k" written to ./data/datasets/local/open-orca-10k
```

@@ -46,28 +64,47 @@ Alternatively, you can load a preexisting dataset:
dataset = ll.get_dataset('local', 'open-orca-10k')
```

## Run signals
## Compute clusters

Let's run the PII detection signal on both the `question` and the `response` field.
Let's compute clusters on the `question` field.

```python
dataset.compute_signal(ll.PIISignal(), 'question')
dataset.compute_signal(ll.PIISignal(), 'response')
dataset.cluster('question')
```

Output:

```sh
Computing pii on local/open-orca-100k:question: 100%|█████████████████████████████████████| 100000/100000 [03:36<00:00, 462.62it/s]
Computing signal "pii" on local/open-orca-100k:question took 216.246s.
Wrote signal output to ./data/datasets/local/open-orca-100k/question/pii
Computing pii on local/open-orca-100k:response: 100%|█████████████████████████████████████| 100000/100000 [02:21<00:00, 708.04it/s]
Computing signal "pii" on local/open-orca-100k:response took 141.312s.
Wrote signal output to ./data/datasets/local/open-orca-100k/response/pii
[local/open-orca-10k][1 shards] map "extract_text" to "('question__cluster',)": 100%|██████████| 10000/10000 [00:00<00:00, 59156.94it/s]
Wrote map output to question__cluster-00000-of-00001.parquet
[local/open-orca-10k][1 shards] map "cluster_documents" to "('question__cluster',)": 0%| | 0/10000 [00:00<?, ?it/s]
jinaai/jina-embeddings-v2-small-en using device: mps:0
Computing embeddings: 100%|██████████| 10000/10000 [18:30<00:00, 9.01it/s]
Computing embeddings took 1113.504s.
UMAP: Reducing dim from 512 to 5 of 10000 vectors took 21.791s.
HDBSCAN: Clustering took 0.175s.
4515 noise points (45.1%) will be assigned to nearest cluster.
[local/open-orca-10k][1 shards] map "cluster_documents" to "('question__cluster',)": 100%|██████████| 10000/10000 [19:13<00:00, 8.67it/s]
HDBSCAN: Computing membership for the noise points took 15.788s.
Wrote map output to question__cluster-00000-of-00001.parquet
[local/open-orca-10k][1 shards] map "title_clusters" to "('question__cluster',)": 100%|██████████| 10000/10000 [00:26<00:00, 374.38it/s]
Wrote map output to question__cluster-00000-of-00001.parquet
Computing embeddings: 10000it [01:19, 125.71it/s]
Computing embeddings took 79.760s.
UMAP: Reducing dim from 512 to 5 of 10000 vectors took 53.578s.
HDBSCAN: Clustering took 0.136s.
137 noise points (1.4%) will be assigned to nearest cluster.
[local/open-orca-10k][1 shards] map "cluster_titles" to "('question__cluster',)": 100%|██████████| 10000/10000 [02:14<00:00, 74.37it/s]
HDBSCAN: Computing membership for the noise points took 0.426s.
Wrote map output to question__cluster-00000-of-00001.parquet
[local/open-orca-10k][1 shards] map "title_categories" to "('question__cluster',)": 100%|██████████| 10000/10000 [00:25<00:00, 395.07it/s]
Wrote map output to question__cluster-00000-of-00001.parquet
[local/open-orca-10k][1 shards] map "drop_temp_text_column" to "('question__cluster',)": 100%|██████████| 10000/10000 [00:00<00:00, 71313.87it/s]
Wrote map output to question__cluster-00000-of-00001.parquet
```
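The log above traces the pipeline stages: compute embeddings, reduce their dimensionality with UMAP, cluster with HDBSCAN, then assign the leftover noise points to a cluster. As a rough illustration of that last step, here is a toy nearest-centroid assignment over 1-D points; HDBSCAN's actual soft-membership computation is more involved, so treat this only as a sketch of the idea:

```python
# Toy sketch of the "noise points will be assigned to nearest cluster"
# step from the log above, using 1-D points and nearest-centroid distance.
# (HDBSCAN's real membership computation is more sophisticated.)

def assign_noise(points, labels):
    """Reassign label -1 (noise) to the cluster with the nearest centroid."""
    centroids = {}
    for lbl in set(labels) - {-1}:
        members = [p for p, l in zip(points, labels) if l == lbl]
        centroids[lbl] = sum(members) / len(members)
    return [
        l if l != -1 else min(centroids, key=lambda c: abs(centroids[c] - p))
        for p, l in zip(points, labels)
    ]

points = [0.1, 0.2, 0.9, 1.0, 0.15, 0.95]
labels = [0, 0, 1, 1, -1, -1]  # two clusters, two noise points
print(assign_noise(points, labels))  # [0, 0, 1, 1, 0, 1]
```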

The dataset now has the extra fields `question.pii` and `response.pii`, which we can see by printing
the entire schema:
The dataset now has the extra field `question__cluster`, which we can see by printing the entire
schema:

```py
print(dataset.manifest().data_schema)
@@ -78,48 +115,61 @@ Output:
```sh
id: string
system_prompt: string
question:
pii:
emails: list( string_span)
ip_addresses: list( string_span)
secrets: list( string_span)
response:
pii:
emails: list( string_span)
ip_addresses: list( string_span)
secrets: list( string_span)
gte-small: list(
embedding: embedding)
question: string
response: string
__hfsplit__: string
__rowid__: string
question__cluster:
cluster_id: int32
cluster_membership_prob: float32
cluster_title: string
category_id: int32
category_membership_prob: float32
category_title: string
```

Note that `question.pii.emails` is a list of `string_span` values. These are objects with `start`
and `end` indices that point to the location of the email in the original `question` text.

## Select specific rows

Let's query 5 rows that have emails in the `response` field via [](#Dataset.select_rows), a python
API that is analogous to a `SQL Select` statement. We do this by adding an [`exists`](#Filter.op)
filter on `response.pii.emails` to make sure it's not empty:
Let's find all clusters that talk about movies via [](#Dataset.select_rows), which works very
similarly to a `SQL Select` statement. We do this by adding a [`regex_matches`](#Filter.op) filter
on `question__cluster.cluster_title`:

```py
df_movies = dataset.select_rows(
  ['id', 'response', 'response.pii.emails'],
  ['id', 'question', 'question__cluster.cluster_title', 'question__cluster.cluster_id'],
  limit=5,
  filters=[('response.pii.emails', 'exists')]).df()
  filters=[('question__cluster.cluster_title', 'regex_matches', '[Mm]ovie')]).df()
print(df_movies)
```

Output:

```
id response response.pii.emails
0 flan.2166478 Subject: Bruce Colbourne, to D.Colbourne@ nrc-... [{'__value__': {'start': 157, 'end': 183}}, {'...
1 flan.2168520 Once you have logged into your email account a... [{'__value__': {'start': 482, 'end': 501}}]
2 flan.294964 Université McGill, 3550 Rue University, Montré... [{'__value__': {'start': 174, 'end': 196}}]
3 flan.1805392 Step 1: Identify the words in the text.\n\nTo ... [{'__value__': {'start': 274, 'end': 291}}, {'...
4 niv.204253 In this task, you are asked to translate an En... [{'__value__': {'start': 322, 'end': 341}}, {'...
id question \
0 t0.1073241 Answer the following question: Write a multi-c...
1 flan.1059135 Choose the correct sentiment from candidates:\...
2 flan.1794922 The "math" aspect to this is merely a gimmick ...
3 t0.243847 Q:Read the following paragraph and extract the...
4 t0.265856 Please answer the following question: Generate...
question__cluster.cluster_title question__cluster.cluster_id
0 Answering Movie-Related Questions 320
1 Movie Review Sentiments 286
2 Extracting Answers from Vampire Movie Plots 325
3 Extracting Answers from Movie Plots 313
4 Movie Plot Questions 371
```
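Assuming the `regex_matches` operator behaves like Python's unanchored `re.search` (an assumption for illustration, not a statement about Lilac's internals), the filter selects titles the way this sketch does:

```python
import re

# Sketch of what the '[Mm]ovie' filter selects, assuming regex_matches
# acts like an unanchored re.search over each cluster title.
cluster_titles = [
    "Answering Movie-Related Questions",
    "Movie Review Sentiments",
    "Translating English Sentences",
    "Extracting Answers from Movie Plots",
]
pattern = re.compile(r"[Mm]ovie")
matches = [t for t in cluster_titles if pattern.search(t)]
print(matches)  # the three movie-related titles; the translation title is excluded
```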

After confirming the results of this query, let's delete these rows:

```py
dataset.delete_rows(filters=[('question__cluster.cluster_title', 'regex_matches', '[Mm]ovie')])
print(dataset.count(), 'rows remaining')
```

Output:

```
9174 rows remaining
```

For more information on querying, see [](#Dataset.select_rows).
@@ -131,21 +181,15 @@ content. To do that we need to _index_ the `response` field using a text embedding. We only need to
index once. For a fast on-device embedding, we recommend the
[GTE-Small embedding](https://huggingface.co/thenlper/gte-small).

Before we can index with GTE-small, we need to install optional dependencies for the gte embedding:

```sh
pip install lilac[gte]
```

```py
dataset.compute_embedding('gte-small', 'response')
```

Output:

```sh
Computing gte-small on local/open-orca-100k:('response',): 100%|█████████████████████████████████████| 100000/100000 [17:59<00:00, 92.67it/s]
Computing signal "gte-small" on local/open-orca-100k:('response',) took 1079.260s.
Compute embedding GTESmall({"embed_input_type":"document","signal_name":"gte-small"}) on open-orca-10k:response: 100%|██████████| 9174/9174 [04:47<00:00, 31.93it/s]

```

Now we can preview the top 5 responses based on their profanity concept score:
@@ -156,15 +200,25 @@ r = dataset.select_rows(['response'], searches=[search], limit=5)
print(r.df())
```

Output (the response text is removed due to sensitive content):
Output:

```
response ... lilac/profanity/gte-small(response)
0 ***************** ... [{'__value__': {'start': 0, 'end': 17}, 'score...
1 ***************** ... [{'__value__': {'start': 0, 'end': 6}, 'score'...
2 ***************** ... [{'__value__': {'start': 0, 'end': 143}, 'scor...
3 ***************** ... [{'__value__': {'start': 0, 'end': 79}, 'score...
4 ***************** ... [{'__value__': {'start': 0, 'end': 376}, 'scor...
Computing topk on local/open-orca-10k:('response',) with embedding "gte-small" and vector store "hnsw" took 0.062s.
Computing signal "concept_labels" on local/open-orca-10k:('response',) took 0.012s.
Computing signal "concept_score" on local/open-orca-10k:('response',) took 0.025s.
response \
0 Part #1: Understand the text from a social med...
1 - Years active: Early 2000s to present\n- Birt...
2 sex
3 Sure! In a simple way for you to understand, t...
4 The nursery rhyme "Ding, Dong, Bell," also kno...
response.lilac/profanity/gte-small/preview
0 [{'__span__': {'start': 0, 'end': 113}, 'score...
1 [{'__span__': {'start': 0, 'end': 103}, 'score...
2 [{'__span__': {'start': 0, 'end': 3}, 'score':...
3 [{'__span__': {'start': 0, 'end': 78}, 'score'...
4 [{'__span__': {'start': 0, 'end': 164}, 'score...
```
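Each score entry carries a `__span__` with `start`/`end` character offsets into the scored text, so you can recover the exact substring a score refers to by slicing. The text and offsets below are illustrative, mirroring row 2 of the output above:

```python
# Recover the text a span points at by slicing with its start/end offsets.
# The span dict mirrors the '__span__' entries in the output above;
# the example text and score are illustrative.
response = "sex"
span_entry = {"__span__": {"start": 0, "end": 3}, "score": 0.98}

span = span_entry["__span__"]
scored_text = response[span["start"]:span["end"]]
print(scored_text)  # 'sex'
```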

To compute the concept score over the entire dataset, we do:
@@ -176,8 +230,8 @@ dataset.compute_concept('lilac', 'profanity', embedding='gte-small', path='response')
Output:

```sh
Computing lilac/profanity/gte-small on local/open-orca-100k:('response',): 100%|█████████████████████████████████▉| 100000/100000 [00:10<00:00, 9658.80it/s]
Wrote signal output to ./data/datasets/local/open-orca-100k/response/lilac/profanity/gte-small/v34
Compute signal ConceptSignal({"embedding":"gte-small","namespace":"lilac","concept_name":"profanity","version":36,"draft":"main","signal_name":"concept_score"}) on open-orca-10k:response: 100%|██████████| 9174/9174 [00:01<00:00, 7322.02it/s]
Wrote signal output to data/datasets/local/open-orca-10k/response/lilac/profanity/gte-small
```

## Convert formats
Expand All @@ -195,14 +249,18 @@ df.info()
Output:

```
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 100000 non-null object
1 system_prompt 100000 non-null object
2 question 100000 non-null object
3 response 100000 non-null object
4 __hfsplit__ 100000 non-null object
5 response.pii 100000 non-null object
6 response.lilac/profanity/gte-small/v34 100000 non-null object
7 question.pii 100000 non-null object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9174 entries, 0 to 9173
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 9174 non-null object
1 system_prompt 9174 non-null object
2 question 9174 non-null object
3 response 9174 non-null object
4 __hfsplit__ 9174 non-null object
5 question__cluster 9174 non-null object
6 __deleted__ 0 non-null object
dtypes: object(7)
memory usage: 501.8+ KB
```
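The `question__cluster` column holds nested struct values. If you prefer flat columns downstream, one way to flatten them is sketched below; the row shape is illustrative and the dotted-name convention is a choice, not something Lilac mandates:

```python
# Sketch: flatten a nested cluster struct into dotted top-level keys,
# as you might after exporting rows to JSON. The row shape is illustrative.
rows = [
    {"id": "t0.1073241",
     "question": "Answer the following question: ...",
     "question__cluster": {"cluster_id": 320,
                           "cluster_title": "Answering Movie-Related Questions"}},
]

flat = [
    {**{k: v for k, v in r.items() if k != "question__cluster"},
     **{f"question__cluster.{k}": v for k, v in r["question__cluster"].items()}}
    for r in rows
]
print(flat[0]["question__cluster.cluster_id"])  # 320
```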
