# Python API

- Lilac's UI is built atop a Python library, which you can access through the `lilac` module. If you'd
- like to use Lilac's features alongside other popular Python libraries, or prefer a notebook
- workflow, read on.
+ Lilac's UI is built atop a Python library, which you can access through the `lilac` module. The UI
+ generally defers all computation to Python, so if a feature is in the UI, you can do the same from
+ Python.
+
+ The UI excels at interactive exploration and tagging/deletion, while the Python API provides
+ powerful primitives, like `map`, which allows you to run arbitrary Python computations with
+ developer-friendly features like progress tracking and resumability.
+
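The resumability mentioned above can be sketched in plain Python. This is only an illustration of the idea behind a checkpointed map, not Lilac's implementation; `resumable_map` and its checkpoint format are invented for the example:

```python
import json
import os
import tempfile

def resumable_map(fn, items, checkpoint_path):
    """Apply fn to each item, persisting results so an interrupted run can resume."""
    done = {}
    if os.path.exists(checkpoint_path):
        # A previous run left a checkpoint; JSON keys come back as strings.
        with open(checkpoint_path) as f:
            done = {int(k): v for k, v in json.load(f).items()}
    for i, item in enumerate(items):
        if i in done:
            continue  # already computed on a previous run
        done[i] = fn(item)
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)  # checkpoint after every item
    return [done[i] for i in range(len(items))]

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
out = resumable_map(str.upper, ["a", "b", "c"], path)
print(out)  # ['A', 'B', 'C']
```

Re-running the same call finds the checkpoint and skips all the work, which is the behavior that makes long jobs over large datasets restartable.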
+ To get the best of both worlds, run `ll.start_server()` in your Python notebook or interpreter to
+ start the Lilac backend as a background thread, then continue using the Lilac API. (Running the
+ Lilac server in the same Python process/kernel is recommended because Lilac can then share the same
+ database connections and in-memory caches, lowering memory usage and ensuring data consistency
+ between the UI and the API.)
+
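The same-process pattern can be illustrated with the standard library alone. The sketch below is not Lilac's server; it only shows how a daemon thread can serve HTTP while the interpreter stays interactive, with both sides reading the same in-memory object:

```python
import http.server
import json
import threading
import urllib.request

# Shared in-process state: direct Python access and HTTP requests both read
# the same dict, so there is a single source of truth (the same-process idea).
state = {"rows": 3}

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps(state).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Keep notebook output quiet.
        pass

# Port 0 asks the OS for a free port; daemon=True means the thread will not
# block interpreter exit, similar in spirit to a background server.
server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

port = server.server_address[1]
with urllib.request.urlopen(f"http://127.0.0.1:{port}/") as resp:
    data = json.loads(resp.read())
server.shutdown()
print(data)  # {'rows': 3}
```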
+ In this quickstart, we're going to:
+
+ - Load [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca), a popular instruction dataset
+   for tuning LLMs.
+ - Compute clusters.
+ - Delete specific clusters.
+ - Find profanity in the remaining rows (using powerful text embeddings).
+ - Download the enriched dataset as a JSON file so we can clean it in a Python notebook.

## Import lilac

@@ -21,11 +40,11 @@ ll.set_project_dir('~/my_project')

Let's load [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca), a popular instruction
dataset used for tuning LLMs. While Lilac can scale to millions of rows on a single
- machine, we are sampling to 100,000 so we can get started quickly.
+ machine, we are sampling to 10,000 so we can get started quickly.

```python
- source = ll.HuggingFaceSource(dataset_name='Open-Orca/OpenOrca', sample_size=100_000)
- config = ll.DatasetConfig(namespace='local', name='open-orca-100k', source=source)
+ source = ll.HuggingFaceSource(dataset_name='Open-Orca/OpenOrca', sample_size=10_000)
+ config = ll.DatasetConfig(namespace='local', name='open-orca-10k', source=source)
dataset = ll.create_dataset(config)
```

@@ -36,7 +55,6 @@ Downloading data files: 100%|█████████████████
Extracting data files: 100%|███████████████████████████████████████| 1/1 [00:00<00:00, 318.98it/s]
Setting num_proc from 8 to 2 for the train split as it only contains 2 shards.
Generating train split: 4233923 examples [00:06, 654274.93 examples/s]
- Reading from source huggingface...: 100%|██████████████| 100000/100000 [00:03<00:00, 30124.10it/s]
- Dataset "open-orca-100k" written to ./data/datasets/local/open-orca-100k
+ Dataset "open-orca-10k" written to ./data/datasets/local/open-orca-10k
```

@@ -46,28 +64,47 @@ Alternatively, you can load a preexisting dataset:
- dataset = ll.get_dataset('local', 'open-orca-100k')
+ dataset = ll.get_dataset('local', 'open-orca-10k')
```

- ## Run signals
+ ## Compute clusters

- Let's run the PII detection signal on both the `question` and the `response` fields.
+ Let's compute clusters on the `question` field.

```python
- dataset.compute_signal(ll.PIISignal(), 'question')
- dataset.compute_signal(ll.PIISignal(), 'response')
+ dataset.cluster('question')
```

Output:

```sh
- Computing pii on local/open-orca-100k:question: 100%|█████████████████████████████████████| 100000/100000 [03:36<00:00, 462.62it/s]
- Computing signal "pii" on local/open-orca-100k:question took 216.246s.
- Wrote signal output to ./data/datasets/local/open-orca-100k/question/pii
- Computing pii on local/open-orca-100k:response: 100%|█████████████████████████████████████| 100000/100000 [02:21<00:00, 708.04it/s]
- Computing signal "pii" on local/open-orca-100k:response took 141.312s.
- Wrote signal output to ./data/datasets/local/open-orca-100k/response/pii
+ [local/open-orca-10k][1 shards] map "extract_text" to "('question__cluster',)": 100%|██████████| 10000/10000 [00:00<00:00, 59156.94it/s]
+ Wrote map output to question__cluster-00000-of-00001.parquet
+ [local/open-orca-10k][1 shards] map "cluster_documents" to "('question__cluster',)": 0%| | 0/10000 [00:00<?, ?it/s]
+ jinaai/jina-embeddings-v2-small-en using device: mps:0
+ Computing embeddings: 100%|██████████| 10000/10000 [18:30<00:00, 9.01it/s]
+ Computing embeddings took 1113.504s.
+ UMAP: Reducing dim from 512 to 5 of 10000 vectors took 21.791s.
+ HDBSCAN: Clustering took 0.175s.
+ 4515 noise points (45.1%) will be assigned to nearest cluster.
+ [local/open-orca-10k][1 shards] map "cluster_documents" to "('question__cluster',)": 100%|██████████| 10000/10000 [19:13<00:00, 8.67it/s]
+ HDBSCAN: Computing membership for the noise points took 15.788s.
+ Wrote map output to question__cluster-00000-of-00001.parquet
+ [local/open-orca-10k][1 shards] map "title_clusters" to "('question__cluster',)": 100%|██████████| 10000/10000 [00:26<00:00, 374.38it/s]
+ Wrote map output to question__cluster-00000-of-00001.parquet
+ Computing embeddings: 10000it [01:19, 125.71it/s]
+ Computing embeddings took 79.760s.
+ UMAP: Reducing dim from 512 to 5 of 10000 vectors took 53.578s.
+ HDBSCAN: Clustering took 0.136s.
+ 137 noise points (1.4%) will be assigned to nearest cluster.
+ [local/open-orca-10k][1 shards] map "cluster_titles" to "('question__cluster',)": 100%|██████████| 10000/10000 [02:14<00:00, 74.37it/s]
+ HDBSCAN: Computing membership for the noise points took 0.426s.
+ Wrote map output to question__cluster-00000-of-00001.parquet
+ [local/open-orca-10k][1 shards] map "title_categories" to "('question__cluster',)": 100%|██████████| 10000/10000 [00:25<00:00, 395.07it/s]
+ Wrote map output to question__cluster-00000-of-00001.parquet
+ [local/open-orca-10k][1 shards] map "drop_temp_text_column" to "('question__cluster',)": 100%|██████████| 10000/10000 [00:00<00:00, 71313.87it/s]
+ Wrote map output to question__cluster-00000-of-00001.parquet
```
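The log above shows the clustering recipe: embed the text, reduce dimensionality with UMAP, cluster with HDBSCAN, then assign the leftover noise points to their nearest cluster. That last step can be sketched in plain Python (a brute-force illustration, not Lilac's implementation):

```python
import math

def assign_noise(points, labels):
    """Give each noise point (label -1) the label of its nearest labeled point."""
    labeled = [j for j, lab in enumerate(labels) if lab != -1]
    out = list(labels)
    for i, lab in enumerate(labels):
        if lab != -1:
            continue
        # Brute-force nearest neighbor among the non-noise points.
        nearest = min(labeled, key=lambda j: math.dist(points[i], points[j]))
        out[i] = labels[nearest]
    return out

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.2, 0.1), (4.8, 5.2)]
labels = [0, 0, 1, 1, -1, -1]  # -1 marks HDBSCAN noise
print(assign_noise(points, labels))  # [0, 0, 1, 1, 0, 1]
```

Real implementations use approximate membership vectors rather than a pairwise scan, but the effect is the same: every row ends up with a `cluster_id`.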

- The dataset now has the extra fields `question.pii` and `response.pii`, which we can see by printing
- the entire schema:
+ The dataset now has the extra field `question__cluster`, which we can see by printing the entire
+ schema:

```py
print(dataset.manifest().data_schema)
@@ -78,48 +115,61 @@ Output:

```sh
id: string
system_prompt: string
- question:
-   pii:
-     emails: list(string_span)
-     ip_addresses: list(string_span)
-     secrets: list(string_span)
- response:
-   pii:
-     emails: list(string_span)
-     ip_addresses: list(string_span)
-     secrets: list(string_span)
-   gte-small: list(
-     embedding: embedding)
+ question: string
+ response: string
__hfsplit__: string
- __rowid__: string
+ question__cluster:
+   cluster_id: int32
+   cluster_membership_prob: float32
+   cluster_title: string
+   category_id: int32
+   category_membership_prob: float32
+   category_title: string
```

- Note that `question.pii.emails` is a list of `string_span` values. These are objects with `start`
- and `end` indices that point to the location of the email in the original `question` text.
-
## Select specific rows

- Let's query 5 rows that have emails in the `response` field via [](#Dataset.select_rows), a Python
- API that is analogous to a `SQL Select` statement. We do this by adding an [`exists`](#Filter.op)
- filter on `response.pii.emails` to make sure it's not empty:
+ Let's find all clusters that talk about movies via [](#Dataset.select_rows), which works very
+ similarly to a `SQL Select` statement. We do this by adding a [`regex_matches`](#Filter.op) filter
+ on `question__cluster.cluster_title`.

```py
df_with_emails = dataset.select_rows(
-   ['id', 'response', 'response.pii.emails'],
+   ['id', 'question', 'question__cluster.cluster_title', 'question__cluster.cluster_id'],
  limit=5,
-   filters=[('response.pii.emails', 'exists')]).df()
+   filters=[('question__cluster.cluster_title', 'regex_matches', '[Mm]ovie')]).df()
print(df_with_emails)
```

Output:

```
- id response response.pii.emails
- 0 flan.2166478 Subject: Bruce Colbourne, to D.Colbourne@nrc-... [{'__value__': {'start': 157, 'end': 183}}, {'...
- 1 flan.2168520 Once you have logged into your email account a... [{'__value__': {'start': 482, 'end': 501}}]
- 2 flan.294964 Université McGill, 3550 Rue University, Montré... [{'__value__': {'start': 174, 'end': 196}}]
- 3 flan.1805392 Step 1: Identify the words in the text.\n\nTo ... [{'__value__': {'start': 274, 'end': 291}}, {'...
- 4 niv.204253 In this task, you are asked to translate an En... [{'__value__': {'start': 322, 'end': 341}}, {'...
+ id question \
+ 0 t0.1073241 Answer the following question: Write a multi-c...
+ 1 flan.1059135 Choose the correct sentiment from candidates:\...
+ 2 flan.1794922 The "math" aspect to this is merely a gimmick ...
+ 3 t0.243847 Q:Read the following paragraph and extract the...
+ 4 t0.265856 Please answer the following question: Generate...
+
+ question__cluster.cluster_title question__cluster.cluster_id
+ 0 Answering Movie-Related Questions 320
+ 1 Movie Review Sentiments 286
+ 2 Extracting Answers from Vampire Movie Plots 325
+ 3 Extracting Answers from Movie Plots 313
+ 4 Movie Plot Questions 371
+ ```
+
+ After confirming the results of this query, let's delete these rows:
+
+ ```py
+ dataset.delete_rows(filters=[('question__cluster.cluster_title', 'regex_matches', '[Mm]ovie')])
+ print(dataset.count(), 'rows remaining')
+ ```
+
+ Output:
+
+ ```
+ 9174 rows remaining
```

For more information on querying, see [](#Dataset.select_rows).
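The `regex_matches` filter above can be read as an unanchored regular-expression search over the field value (an assumption for this sketch). In plain Python, with two cluster titles taken from the output above and two invented ones:

```python
import re

titles = [
    "Answering Movie-Related Questions",   # from the query output above
    "Translating English Sentences",       # invented for the example
    "Movie Review Sentiments",             # from the query output above
    "Math Word Problems",                  # invented for the example
]
# '[Mm]ovie' matches anywhere in the title, so re.search (not re.match) is the analogue.
pattern = re.compile(r"[Mm]ovie")
matches = [t for t in titles if pattern.search(t)]
print(matches)  # ['Answering Movie-Related Questions', 'Movie Review Sentiments']
```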
@@ -131,21 +181,15 @@ content. To do that we need to _index_ the `response` field using a text embeddi
index once. For a fast on-device embedding, we recommend the
[GTE-Small embedding](https://huggingface.co/thenlper/gte-small).

- Before we can index with GTE-small, we need to install optional dependencies for the gte embedding:
-
- ```sh
- pip install lilac[gte]
- ```
-
```py
dataset.compute_embedding('gte-small', 'response')
```

Output:

```sh
- Computing gte-small on local/open-orca-100k:('response',): 100%|█████████████████████████████████████| 100000/100000 [17:59<00:00, 92.67it/s]
- Computing signal "gte-small" on local/open-orca-100k:('response',) took 1079.260s.
+ Compute embedding GTESmall({"embed_input_type": "document", "signal_name": "gte-small"}) on open-orca-10k:response: 100%|██████████| 9174/9174 [04:47<00:00, 31.93it/s]
```

Now we can preview the top 5 responses based on their profanity concept score:
@@ -156,15 +200,25 @@ r = dataset.select_rows(['response'], searches=[search], limit=5)
print(r.df())
```

- Output (the response text is removed due to sensitive content):
+ Output:

```
- response ... lilac/profanity/gte-small(response)
- 0 ***************** ... [{'__value__': {'start': 0, 'end': 17}, 'score...
- 1 ***************** ... [{'__value__': {'start': 0, 'end': 6}, 'score'...
- 2 ***************** ... [{'__value__': {'start': 0, 'end': 143}, 'scor...
- 3 ***************** ... [{'__value__': {'start': 0, 'end': 79}, 'score...
- 4 ***************** ... [{'__value__': {'start': 0, 'end': 376}, 'scor...
+ Computing topk on local/open-orca-10k:('response',) with embedding "gte-small" and vector store "hnsw" took 0.062s.
+ Computing signal "concept_labels" on local/open-orca-10k:('response',) took 0.012s.
+ Computing signal "concept_score" on local/open-orca-10k:('response',) took 0.025s.
+ response \
+ 0 Part #1: Understand the text from a social med...
+ 1 - Years active: Early 2000s to present\n- Birt...
+ 2 sex
+ 3 Sure! In a simple way for you to understand, t...
+ 4 The nursery rhyme "Ding, Dong, Bell," also kno...
+
+ response.lilac/profanity/gte-small/preview
+ 0 [{'__span__': {'start': 0, 'end': 113}, 'score...
+ 1 [{'__span__': {'start': 0, 'end': 103}, 'score...
+ 2 [{'__span__': {'start': 0, 'end': 3}, 'score':...
+ 3 [{'__span__': {'start': 0, 'end': 78}, 'score'...
+ 4 [{'__span__': {'start': 0, 'end': 164}, 'score...
```
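Each entry in the preview column is a span: `start`/`end` character offsets pointing back into the original `response` text, plus a score. Slicing recovers the scored substring (the text and span values below are invented for illustration):

```python
# A span points back into the source string via start/end character offsets.
response = "The nursery rhyme 'Ding, Dong, Bell' is about a cat in a well."
span = {"__span__": {"start": 18, "end": 36}, "score": 0.42}  # illustrative values

start, end = span["__span__"]["start"], span["__span__"]["end"]
flagged = response[start:end]
print(flagged)  # 'Ding, Dong, Bell'
```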

To compute the concept score over the entire dataset, we do:

@@ -176,8 +230,8 @@ dataset.compute_concept('lilac', 'profanity', embedding='gte-small', path='respo
Output:

```sh
- Computing lilac/profanity/gte-small on local/open-orca-100k:('response',): 100%|█████████████████████████████████▉| 100000/100000 [00:10<00:00, 9658.80it/s]
- Wrote signal output to ./data/datasets/local/open-orca-100k/response/lilac/profanity/gte-small/v34
+ Compute signal ConceptSignal({"embedding": "gte-small", "namespace": "lilac", "concept_name": "profanity", "version": 36, "draft": "main", "signal_name": "concept_score"}) on open-orca-10k:response: 100%|██████████| 9174/9174 [00:01<00:00, 7322.02it/s]
+ Wrote signal output to data/datasets/local/open-orca-10k/response/lilac/profanity/gte-small
```

## Convert formats
@@ -195,14 +249,18 @@ df.info()
Output:

```
-  #   Column                                  Non-Null Count   Dtype
- ---  ------                                  --------------   -----
-  0   id                                      100000 non-null  object
-  1   system_prompt                           100000 non-null  object
-  2   question                                100000 non-null  object
-  3   response                                100000 non-null  object
-  4   __hfsplit__                             100000 non-null  object
-  5   response.pii                            100000 non-null  object
-  6   response.lilac/profanity/gte-small/v34  100000 non-null  object
-  7   question.pii                            100000 non-null  object
+ <class 'pandas.core.frame.DataFrame'>
+ RangeIndex: 9174 entries, 0 to 9173
+ Data columns (total 7 columns):
+  #   Column             Non-Null Count  Dtype
+ ---  ------             --------------  -----
+  0   id                 9174 non-null   object
+  1   system_prompt      9174 non-null   object
+  2   question           9174 non-null   object
+  3   response           9174 non-null   object
+  4   __hfsplit__        9174 non-null   object
+  5   question__cluster  9174 non-null   object
+  6   __deleted__        0 non-null      object
+ dtypes: object(7)
+ memory usage: 501.8+ KB
```
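Once exported, the rows are ordinary records that can be cleaned with plain Python. A sketch with invented rows and a hypothetical flat `profanity_score` field (in the real export the score lives under the signal's output path, e.g. `response.lilac/profanity/gte-small`):

```python
import json

# Hypothetical exported rows; field names are simplified for the example.
rows = [
    {"id": "t0.1", "response": "ok text", "profanity_score": 0.02},
    {"id": "t0.2", "response": "bad text", "profanity_score": 0.97},
    {"id": "t0.3", "response": "fine", "profanity_score": 0.10},
]
# Keep only rows below a chosen score threshold, then re-serialize.
clean = [r for r in rows if r["profanity_score"] < 0.5]
print(len(clean), "rows kept")  # 2 rows kept
print(json.dumps(clean))
```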