
Commit 80892fb

save
1 parent e95c628 commit 80892fb

5 files changed: +201 -105 lines changed

docs/getting_started/quickstart.md

Lines changed: 35 additions & 22 deletions
@@ -6,8 +6,9 @@ In this quick start we're going to:
 
 - Load [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca), a popular instruction dataset
   for tuning LLMs.
-- Find PII (emails, etc)
-- Find profanity in the responses (using powerful text embeddings)
+- Compute clusters.
+- Delete specific clusters.
+- Find profanity in the remaining rows (using powerful text embeddings)
 - Download the enriched dataset as a json file so we can clean it in a Python notebook
 
 ## Start the web server
@@ -35,7 +36,13 @@ Click the `Add dataset` button on the Getting Started page and fill in:
 Fill in HuggingFace-specific fields:
 
 3. HuggingFace dataset name: `Open-Orca/OpenOrca`
-4. Sample size: 10000 (it takes ~5mins to compute on-device embeddings for 10,000 items)
+4. Sample size: 10000
+
+```{note}
+Lilac's sweet spot is 10,000-100,000 rows of data, although up to 10 million rows are possible.
+This quickstart uses 10,000 rows so that clustering and embedding operations finish locally
+in ~10 minutes even without a GPU.
+```
 
 Finally:
 
@@ -58,30 +65,35 @@ your media field contains markdown, you can enable markdown rendering.
 
 <video loop muted autoplay controls src="../_static/getting_started/orca-settings.mp4"></video>
 
-## Enrich
+## Cluster
 
-Lilac can enrich your media fields with additional metadata by:
+Lilac can detect clusters in your dataset. Clusters are a powerful way to understand the types of
+content present in your dataset, as well as to target subsets for removal from the dataset.
 
-- Running a [signal](../signals/signals.md) (e.g. PII detection, language detection, text
-  statistics, etc.)
-- Running a [concept](../concepts/concepts.md) (e.g. profanity, sentiment, etc. or a custom concept
-  that you create)
+To cluster, open up the dataset schema tray to reveal the fields in your dataset. Here, you can
+choose which field will get clustered.
 
-### PII detection
+** Add clustering video here **
 
-Let's run the PII detection signal on both the `question` and the `response` field and see if there
-is any PII like emails, secret tokens, etc.
+The cluster visualizer shows two hierarchical levels of clusters by default. You can also group over
+other fields in your dataset by changing the Explore and Group By selections.
 
-<video loop muted autoplay controls src="../_static/getting_started/orca-pii-enrichment.mp4"></video>
+## Tagging and deleting rows
 
-Once it's done, we can see that both the `question` and the `response` fields have emails present.
-We can click on an email to apply a filter and see all the rows that contain that email.
+Lilac can curate your dataset by tagging or deleting rows.
 
-<video loop muted autoplay controls src="../_static/getting_started/orca-pii-filter.mp4"></video>
+Deleting is not permanent - you can toggle visibility of deleted items - but it is a convenient way
+to iterate on your dataset by removing undesired slices of data. Later on, when you export data from
+Lilac, deleted rows will be excluded by default.
 
-We notice that the selected email in the `response` field was not hallucinated by the LLM because it
-was also present in the `question` field. Later we can use the enriched metadata of both fields to
-filter out only responses that have hallucinated emails.
+## Enrich
+
+Lilac can enrich your media fields with additional metadata by:
+
+- Running a [signal](../signals/signals.md) (e.g. PII detection, language detection, text
+  statistics, etc.)
+- Running a [concept](../concepts/concepts.md) (e.g. profanity, sentiment, etc. or a custom concept
+  that you create)
 
 ### Profanity detection
 
@@ -112,9 +124,10 @@ can open the statistics panel to see the distribution of concept scores.
 
 ## Download
 
-Now that we've enriched the dataset, let's download it by clicking on the `Download data` button in
-the top-right corner. This will download a json file with the same name as the dataset. Once we have
-the data, we can continue working with it in a Python notebook, or any other language.
+Now that we've clustered, curated, and enriched the dataset, let's download it by clicking on the
+`Download data` button in the top-right corner. This will download a json file with the same name as
+the dataset. Once we have the data, we can continue working with it in a Python notebook, or any
+other language.
 
 You can also get the dataset as a Pandas dataframe through the [Python API](quickstart_python.md).

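The export is a plain JSON file, so picking it up in a notebook is a one-liner. A minimal sketch with pandas (the filename is illustrative; whether `lines=True` is needed depends on the export format):

```python
import pandas as pd

# Load the file saved by the `Download data` button.
# Pass lines=True here if the export is JSON-lines.
df = pd.read_json('open-orca-10k.json')
print(df.columns)
```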
docs/getting_started/quickstart_python.md

Lines changed: 131 additions & 73 deletions
@@ -1,8 +1,27 @@
 # Python API
 
-Lilac's UI is built atop a Python library, which you can access through the `lilac` module. If you'd
-like to use Lilac's features alongside other popular Python libraries, or prefer a notebook
-workflow, read on.
+Lilac's UI is built atop a Python library, which you can access through the `lilac` module. The UI
+generally defers all computation to Python, so if a feature is in the UI, you'll be able to do the
+same from Python.
+
+The UI excels at interactive exploration and tagging/deletion, while the Python API provides
+powerful primitives, like `map`, which allows you to run arbitrary Python computations with
+developer-friendly features like progress tracking and resumability.
+
+To get the best of both worlds, you can run `ll.start_server()` in your Python notebook or
+interpreter to start the Lilac backend as a background thread, and then continue using the
+Lilac API. (Running the Lilac server in the same Python process/kernel is recommended because Lilac
+can then share the same database connections and in-memory caches, lowering memory usage and
+ensuring data consistency between the UI and the API.)
+
+In this quickstart, we're going to:
+
+- Load [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca), a popular instruction dataset
+  for tuning LLMs.
+- Compute clusters.
+- Delete specific clusters.
+- Find profanity in the remaining rows (using powerful text embeddings)
+- Download the enriched dataset as a json file so we can clean it in a Python notebook
 
 ## Import lilac
 
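A minimal sketch of that in-process setup (the `server` handle and argument-free call are assumptions; check `help(ll.start_server)` for the exact signature in your version):

```python
import lilac as ll

ll.set_project_dir('~/my_project')

# Start the Lilac backend as a background thread in this same process,
# sharing database connections and in-memory caches with the API calls below.
server = ll.start_server()
```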
@@ -21,11 +40,11 @@ ll.set_project_dir('~/my_project')
 
 Let's load [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca), a popular instruction
 dataset used for tuning LLM models. While the Lilac tool can scale to millions of rows on a single
-machine, we are sampling to 100,000 so we can get started quickly.
+machine, we are sampling to 10,000 so we can get started quickly.
 
 ```python
-source = ll.HuggingFaceSource(dataset_name='Open-Orca/OpenOrca', sample_size=100_000)
-config = ll.DatasetConfig(namespace='local', name='open-orca-100k', source=source)
+source = ll.HuggingFaceSource(dataset_name='Open-Orca/OpenOrca', sample_size=10_000)
+config = ll.DatasetConfig(namespace='local', name='open-orca-10k', source=source)
 dataset = ll.create_dataset(config)
 ```
 
@@ -36,7 +55,6 @@ Downloading data files: 100%|█████████████████
 Extracting data files: 100%|███████████████████████████████████████| 1/1 [00:00<00:00, 318.98it/s]
 Setting num_proc from 8 to 2 for the train split as it only contains 2 shards.
 Generating train split: 4233923 examples [00:06, 654274.93 examples/s]
-Reading from source huggingface...: 100%|██████████████| 100000/100000 [00:03<00:00, 30124.10it/s]
 Dataset "open-orca-100k" written to ./data/datasets/local/open-orca-100k
 ```
 
@@ -46,28 +64,47 @@ Alternately, you can load a preexisting dataset:
 dataset = ll.get_dataset('local', 'open-orca-100k')
 ```
 
-## Run signals
+## Compute clusters
 
-Let's run the PII detection signal on both the `question` and the `response` field.
+Let's compute clusters on the `question` field.
 
 ```python
-dataset.compute_signal(ll.PIISignal(), 'question')
-dataset.compute_signal(ll.PIISignal(), 'response')
+dataset.cluster('question')
 ```
 
 Output:
 
 ```sh
-Computing pii on local/open-orca-100k:question: 100%|█████████████████████████████████████| 100000/100000 [03:36<00:00, 462.62it/s]
-Computing signal "pii" on local/open-orca-100k:question took 216.246s.
-Wrote signal output to ./data/datasets/local/open-orca-100k/question/pii
-Computing pii on local/open-orca-100k:response: 100%|█████████████████████████████████████| 100000/100000 [02:21<00:00, 708.04it/s]
-Computing signal "pii" on local/open-orca-100k:response took 141.312s.
-Wrote signal output to ./data/datasets/local/open-orca-100k/response/pii
+[local/open-orca-10k][1 shards] map "extract_text" to "('question__cluster',)": 100%|██████████| 10000/10000 [00:00<00:00, 59156.94it/s]
+Wrote map output to question__cluster-00000-of-00001.parquet
+[local/open-orca-10k][1 shards] map "cluster_documents" to "('question__cluster',)": 0%| | 0/10000 [00:00<?, ?it/s]
+jinaai/jina-embeddings-v2-small-en using device: mps:0
+Computing embeddings: 100%|██████████| 10000/10000 [18:30<00:00, 9.01it/s]
+Computing embeddings took 1113.504s.
+UMAP: Reducing dim from 512 to 5 of 10000 vectors took 21.791s.
+HDBSCAN: Clustering took 0.175s.
+4515 noise points (45.1%) will be assigned to nearest cluster.
+[local/open-orca-10k][1 shards] map "cluster_documents" to "('question__cluster',)": 100%|██████████| 10000/10000 [19:13<00:00, 8.67it/s]
+HDBSCAN: Computing membership for the noise points took 15.788s.
+Wrote map output to question__cluster-00000-of-00001.parquet
+[local/open-orca-10k][1 shards] map "title_clusters" to "('question__cluster',)": 100%|██████████| 10000/10000 [00:26<00:00, 374.38it/s]
+Wrote map output to question__cluster-00000-of-00001.parquet
+Computing embeddings: 10000it [01:19, 125.71it/s]
+Computing embeddings took 79.760s.
+UMAP: Reducing dim from 512 to 5 of 10000 vectors took 53.578s.
+HDBSCAN: Clustering took 0.136s.
+137 noise points (1.4%) will be assigned to nearest cluster.
+[local/open-orca-10k][1 shards] map "cluster_titles" to "('question__cluster',)": 100%|██████████| 10000/10000 [02:14<00:00, 74.37it/s]
+HDBSCAN: Computing membership for the noise points took 0.426s.
+Wrote map output to question__cluster-00000-of-00001.parquet
+[local/open-orca-10k][1 shards] map "title_categories" to "('question__cluster',)": 100%|██████████| 10000/10000 [00:25<00:00, 395.07it/s]
+Wrote map output to question__cluster-00000-of-00001.parquet
+[local/open-orca-10k][1 shards] map "drop_temp_text_column" to "('question__cluster',)": 100%|██████████| 10000/10000 [00:00<00:00, 71313.87it/s]
+Wrote map output to question__cluster-00000-of-00001.parquet
 ```
 
-The dataset now has the extra fields `question.pii` and `response.pii`, which we can see by printing
-the entire schema:
+The dataset now has the extra field `question__cluster`, which we can see by printing the entire
+schema:
 
 ```py
 print(dataset.manifest().data_schema)
@@ -78,48 +115,61 @@ Output:
 ```sh
 id: string
 system_prompt: string
-question:
-  pii:
-    emails: list( string_span)
-    ip_addresses: list( string_span)
-    secrets: list( string_span)
-response:
-  pii:
-    emails: list( string_span)
-    ip_addresses: list( string_span)
-    secrets: list( string_span)
-  gte-small: list(
-    embedding: embedding)
+question: string
+response: string
 __hfsplit__: string
-__rowid__: string
+question__cluster:
+  cluster_id: int32
+  cluster_membership_prob: float32
+  cluster_title: string
+  category_id: int32
+  category_membership_prob: float32
+  category_title: string
 ```
 
-Note that `question.pii.emails` is a list of `string_span` values. These are objects with `start`
-and `end` indices that point to the location of the email in the original `question` text.
-
 ## Select specific rows
 
-Let's query 5 rows that have emails in the `response` field via [](#Dataset.select_rows), a python
-API that is analogous to a `SQL Select` statement. We do this by adding an [`exists`](#Filter.op)
-filter on `response.pii.emails` to make sure it's not empty:
+Let's find all clusters that talk about movies via [](#Dataset.select_rows), which works very
+similarly to a `SQL Select` statement. We do this by adding a [`regex_matches`](#Filter.op) filter
+on `question__cluster.cluster_title`:
 
 ```py
 df_with_emails = dataset.select_rows(
-  ['id', 'response', 'response.pii.emails'],
+  ['id', 'question', 'question__cluster.cluster_title', 'question__cluster.cluster_id'],
   limit=5,
-  filters=[('response.pii.emails', 'exists')]).df()
+  filters=[('question__cluster.cluster_title', 'regex_matches', '[Mm]ovie')]).df()
 print(df_with_emails)
 ```
 
 Output:
 
 ```
-             id                                           response                                response.pii.emails
-0  flan.2166478  Subject: Bruce Colbourne, to D.Colbourne@ nrc-...  [{'__value__': {'start': 157, 'end': 183}}, {'...
-1  flan.2168520  Once you have logged into your email account a...  [{'__value__': {'start': 482, 'end': 501}}]
-2   flan.294964  Université McGill, 3550 Rue University, Montré...  [{'__value__': {'start': 174, 'end': 196}}]
-3  flan.1805392  Step 1: Identify the words in the text.\n\nTo ...  [{'__value__': {'start': 274, 'end': 291}}, {'...
-4    niv.204253  In this task, you are asked to translate an En...  [{'__value__': {'start': 322, 'end': 341}}, {'...
+             id                                           question \
+0    t0.1073241  Answer the following question: Write a multi-c...
+1  flan.1059135  Choose the correct sentiment from candidates:\...
+2  flan.1794922  The "math" aspect to this is merely a gimmick ...
+3     t0.243847  Q:Read the following paragraph and extract the...
+4     t0.265856  Please answer the following question: Generate...
+
+                question__cluster.cluster_title  question__cluster.cluster_id
+0             Answering Movie-Related Questions                           320
+1                       Movie Review Sentiments                           286
+2  Extracting Answers from Vampire Movie Plots                            325
+3           Extracting Answers from Movie Plots                           313
+4                          Movie Plot Questions                           371
+```
+
+After confirming the results of this query, let's delete these rows:
+
+```py
+dataset.delete_rows(filters=[('question__cluster.cluster_title', 'regex_matches', '[Mm]ovie')])
+print(dataset.count(), 'rows remaining')
+```
+
+Output:
+
+```
+9174 rows remaining
 ```
 
 For more information on querying, see [](#Dataset.select_rows).
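If you want to size up a slice before deleting it, a minimal sketch using only `select_rows` plus ordinary pandas (`value_counts` is pandas, not a Lilac API):

```python
# Count rows per matching cluster title, to see how much data the
# regex filter matches before committing to the deletion.
movie_df = dataset.select_rows(
  ['id', 'question__cluster.cluster_title'],
  filters=[('question__cluster.cluster_title', 'regex_matches', '[Mm]ovie')]).df()
print(movie_df['question__cluster.cluster_title'].value_counts())
```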
@@ -131,21 +181,15 @@ content. To do that we need to _index_ the `response` field using a text embeddi
 index once. For a fast on-device embedding, we recommend the
 [GTE-Small embedding](https://huggingface.co/thenlper/gte-small).
 
-Before we can index with GTE-small, we need to install optional dependencies for the gte embedding:
-
-```sh
-pip install lilac[gte]
-```
-
 ```py
 dataset.compute_embedding('gte-small', 'response')
 ```
 
 Output:
 
 ```sh
-Computing gte-small on local/open-orca-100k:('response',): 100%|█████████████████████████████████████| 100000/100000 [17:59<00:00, 92.67it/s]
-Computing signal "gte-small" on local/open-orca-100k:('response',) took 1079.260s.
+Compute embedding GTESmall({"embed_input_type":"document","signal_name":"gte-small"}) on open-orca-10k:response: 100%|██████████| 9174/9174 [04:47<00:00, 31.93it/s]
+
 ```
 
 Now we can preview the top 5 responses based on their profanity concept score:
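The `search` object passed to `select_rows` in the next hunk is constructed from the profanity concept; a hypothetical sketch of building such a concept search (the `ll.ConceptSearch` keyword names are assumptions):

```python
# Hypothetical sketch of the `search` used below: a concept search over
# the `response` field using the lilac/profanity concept and gte-small.
search = ll.ConceptSearch(
  path='response',
  concept_namespace='lilac',
  concept_name='profanity',
  embedding='gte-small')
```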
@@ -156,15 +200,25 @@ r = dataset.select_rows(['response'], searches=[search], limit=5)
 print(r.df())
 ```
 
-Output (the response text is removed due to sensitive content):
+Output:
 
 ```
-            response ... lilac/profanity/gte-small(response)
-0  *****************  ... [{'__value__': {'start': 0, 'end': 17}, 'score...
-1  *****************  ... [{'__value__': {'start': 0, 'end': 6}, 'score'...
-2  *****************  ... [{'__value__': {'start': 0, 'end': 143}, 'scor...
-3  *****************  ... [{'__value__': {'start': 0, 'end': 79}, 'score...
-4  *****************  ... [{'__value__': {'start': 0, 'end': 376}, 'scor...
+Computing topk on local/open-orca-10k:('response',) with embedding "gte-small" and vector store "hnsw" took 0.062s.
+Computing signal "concept_labels" on local/open-orca-10k:('response',) took 0.012s.
+Computing signal "concept_score" on local/open-orca-10k:('response',) took 0.025s.
+                                            response \
+0  Part #1: Understand the text from a social med...
+1  - Years active: Early 2000s to present\n- Birt...
+2                                                sex
+3  Sure! In a simple way for you to understand, t...
+4  The nursery rhyme "Ding, Dong, Bell," also kno...
+
+          response.lilac/profanity/gte-small/preview
+0  [{'__span__': {'start': 0, 'end': 113}, 'score...
+1  [{'__span__': {'start': 0, 'end': 103}, 'score...
+2  [{'__span__': {'start': 0, 'end': 3}, 'score':...
+3  [{'__span__': {'start': 0, 'end': 78}, 'score'...
+4  [{'__span__': {'start': 0, 'end': 164}, 'score...
 ```
 
 To compute the concept score over the entire dataset, we do:
@@ -176,8 +230,8 @@ dataset.compute_concept('lilac', 'profanity', embedding='gte-small', path='respo
 Output:
 
 ```sh
-Computing lilac/profanity/gte-small on local/open-orca-100k:('response',): 100%|█████████████████████████████████▉| 100000/100000 [00:10<00:00, 9658.80it/s]
-Wrote signal output to ./data/datasets/local/open-orca-100k/response/lilac/profanity/gte-small/v34
+Compute signal ConceptSignal({"embedding":"gte-small","namespace":"lilac","concept_name":"profanity","version":36,"draft":"main","signal_name":"concept_score"}) on open-orca-10k:response: 100%|██████████| 9174/9174 [00:01<00:00, 7322.02it/s]
+Wrote signal output to data/datasets/local/open-orca-10k/response/lilac/profanity/gte-small
 ```
 
 ## Convert formats
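For context, the conversion that produces `df` in the next hunk is roughly this (a sketch assuming `Dataset.to_pandas`, the Pandas export the quickstart mentions):

```python
# Convert the Lilac dataset to a pandas DataFrame and inspect it.
df = dataset.to_pandas()
df.info()
```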
@@ -195,14 +249,18 @@ df.info()
 
 Output:
 
 ```
- #   Column                                  Non-Null Count   Dtype
----  ------                                  --------------   -----
- 0   id                                      100000 non-null  object
- 1   system_prompt                           100000 non-null  object
- 2   question                                100000 non-null  object
- 3   response                                100000 non-null  object
- 4   __hfsplit__                             100000 non-null  object
- 5   response.pii                            100000 non-null  object
- 6   response.lilac/profanity/gte-small/v34  100000 non-null  object
- 7   question.pii                            100000 non-null  object
+<class 'pandas.core.frame.DataFrame'>
+RangeIndex: 9174 entries, 0 to 9173
+Data columns (total 7 columns):
+ #   Column             Non-Null Count  Dtype
+---  ------             --------------  -----
+ 0   id                 9174 non-null   object
+ 1   system_prompt      9174 non-null   object
+ 2   question           9174 non-null   object
+ 3   response           9174 non-null   object
+ 4   __hfsplit__        9174 non-null   object
+ 5   question__cluster  9174 non-null   object
+ 6   __deleted__        0 non-null      object
+dtypes: object(7)
+memory usage: 501.8+ KB
 ```
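Finally, exporting from Python mirrors the UI's `Download data` button. A minimal sketch (assuming `Dataset.to_json` takes an output path; per the quickstart, deleted rows are excluded from exports by default):

```python
# Export the curated dataset to a JSON file.
# Deleted rows are excluded from exports by default.
dataset.to_json('open-orca-10k.json')
```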
