Commit: save
brilee committed Jan 30, 2024
1 parent e95c628 commit 80892fb
Showing 5 changed files with 201 additions and 105 deletions.
57 changes: 35 additions & 22 deletions docs/getting_started/quickstart.md
@@ -6,8 +6,9 @@ In this quick start we're going to:

- Load [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca), a popular instruction dataset
for tuning LLMs.
- Find PII (emails, etc)
- Find profanity in the responses (using powerful text embeddings)
- Compute clusters.
- Delete specific clusters.
- Find profanity in the remaining rows (using powerful text embeddings)
- Download the enriched dataset as a json file so we can clean it in a Python notebook

## Start the web server
@@ -35,7 +36,13 @@ Click the `Add dataset` button on the Getting Started page and fill in:
Fill in HuggingFace-specific fields:

3. HuggingFace dataset name: `Open-Orca/OpenOrca`
4. Sample size: 10000 (it takes ~5mins to compute on-device embeddings for 10,000 items)
4. Sample size: 10000

```{note}
Lilac's sweet spot is 10,000-100,000 rows of data, although up to 10 million rows are possible.
This quickstart uses 10,000 rows so that clustering and embedding operations finish locally
in ~10 minutes even without a GPU.
```

Finally:

@@ -58,30 +65,35 @@ your media field contains markdown, you can enable markdown rendering.

<video loop muted autoplay controls src="../_static/getting_started/orca-settings.mp4"></video>

## Enrich
## Cluster

Lilac can enrich your media fields with additional metadata by:
Lilac can detect clusters in your dataset. Clusters are a powerful way to understand the types of
content present in your dataset, as well as to target subsets for removal from the dataset.

- Running a [signal](../signals/signals.md) (e.g. PII detection, language detection, text
statistics, etc.)
- Running a [concept](../concepts/concepts.md) (e.g. profanity, sentiment, etc. or a custom concept
that you create)
To cluster, open up the dataset schema tray to reveal the fields in your dataset. Here, you can
choose which field will get clustered.

### PII detection
** Add clustering video here **

Let's run the PII detection signal on both the `question` and the `response` field and see if there
is any PII like emails, secret tokens, etc.
The cluster visualizer shows two hierarchical levels of clusters by default. You can also group over
other fields in your dataset by changing the Explore and Group By selections.

<video loop muted autoplay controls src="../_static/getting_started/orca-pii-enrichment.mp4"></video>
## Tagging and Deleting rows

Once it's done, we can see that both the `question` and the `response` fields have emails present.
We can click on an email to apply a filter and see all the rows that contain that email.
Lilac can curate your dataset by tagging or deleting rows.

<video loop muted autoplay controls src="../_static/getting_started/orca-pii-filter.mp4"></video>
Deleting is not permanent - you can toggle visibility of deleted items - but it is a convenient way
to iterate on your dataset by removing undesired slices of data. Later on, when you export data from
Lilac, deleted rows will be excluded by default.
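The export behavior can be pictured as a soft-delete flag on each row (the exported data does carry a `__deleted__` column). The sketch below only models these semantics; the `export_rows` function is hypothetical, not Lilac's API:

```python
# Illustrative model of soft deletion: rows carry a deleted flag,
# and export skips flagged rows unless asked otherwise.
# (`export_rows` is a hypothetical helper, not part of Lilac.)
rows = [
    {"question": "What is 2+2?", "__deleted__": False},
    {"question": "Write a movie review.", "__deleted__": True},
]

def export_rows(rows, include_deleted=False):
    """Return rows for export; deleted rows are excluded by default."""
    return [r for r in rows if include_deleted or not r["__deleted__"]]

print(len(export_rows(rows)))                        # 1: the deleted row is skipped
print(len(export_rows(rows, include_deleted=True)))  # 2: visibility toggled back on
```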

We notice that the selected email in the `response` field was not hallucinated by the LLM because it
was also present in the `question` field. Later we can use the enriched metadata of both fields to
filter out only responses that have hallucinated emails.
## Enrich

Lilac can enrich your media fields with additional metadata by:

- Running a [signal](../signals/signals.md) (e.g. PII detection, language detection, text
statistics, etc.)
- Running a [concept](../concepts/concepts.md) (e.g. profanity, sentiment, etc. or a custom concept
that you create)

### Profanity detection

@@ -112,9 +124,10 @@ can open the statistics panel to see the distribution of concept scores.

## Download

Now that we've enriched the dataset, let's download it by clicking on the `Download data` button in
the top-right corner. This will download a json file with the same name as the dataset. Once we have
the data, we can continue working with it in a Python notebook, or any other language.
Now that we've clustered, curated, and enriched the dataset, let's download it by clicking on the
`Download data` button in the top-right corner. This will download a json file with the same name as
the dataset. Once we have the data, we can continue working with it in a Python notebook, or any
other language.

You can also get the dataset as a Pandas dataframe through the [Python API](quickstart_python.md).

204 changes: 131 additions & 73 deletions docs/getting_started/quickstart_python.md
@@ -1,8 +1,27 @@
# Python API

Lilac's UI is built atop a Python library, which you can access through the `lilac` module. If you'd
like to use Lilac's features alongside other popular Python libraries, or prefer a notebook
workflow, read on.
Lilac's UI is built atop a Python library, which you can access through the `lilac` module. The UI
generally defers all computation to Python, so if the feature is in the UI, you'll be able to do the
same from Python.

The UI excels at interactive exploration and tagging/deletion, while the Python API provides
powerful primitives, like `map`, which allows you to run arbitrary Python computations with
developer-friendly features like progress tracking and resumability.

To get the best of both worlds, you can run `ll.start_server()` in your Python notebook or
interpreter to start the Lilac backend as a background thread, and then continue with using the
Lilac API. (Running the Lilac server in the same Python process/kernel is recommended because Lilac
can then share the same database connections and in-memory caches, lowering memory usage and
ensuring data consistency between UI and API.)

In this quickstart, we're going to:

- Load [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca), a popular instruction dataset
for tuning LLMs.
- Compute clusters.
- Delete specific clusters.
- Find profanity in the remaining rows (using powerful text embeddings)
- Download the enriched dataset as a json file so we can clean it in a Python notebook

## Import lilac

@@ -21,11 +40,11 @@ ll.set_project_dir('~/my_project')

Let's load [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca), a popular instruction
dataset used for tuning LLM models. While the Lilac tool can scale to millions of rows on a single
machine, we are sampling to 100,000 so we can get started quickly.
machine, we are sampling to 10,000 so we can get started quickly.

```python
source = ll.HuggingFaceSource(dataset_name='Open-Orca/OpenOrca', sample_size=100_000)
config = ll.DatasetConfig(namespace='local', name='open-orca-100k', source=source)
source = ll.HuggingFaceSource(dataset_name='Open-Orca/OpenOrca', sample_size=10_000)
config = ll.DatasetConfig(namespace='local', name='open-orca-10k', source=source)
dataset = ll.create_dataset(config)
```

@@ -36,7 +55,6 @@ Downloading data files: 100%|█████████████████
Extracting data files: 100%|███████████████████████████████████████| 1/1 [00:00<00:00, 318.98it/s]
Setting num_proc from 8 to 2 for the train split as it only contains 2 shards.
Generating train split: 4233923 examples [00:06, 654274.93 examples/s]
Reading from source huggingface...: 100%|██████████████| 100000/100000 [00:03<00:00, 30124.10it/s]
Dataset "open-orca-10k" written to ./data/datasets/local/open-orca-10k
```

@@ -46,28 +64,47 @@ Alternatively, you can load a preexisting dataset:
dataset = ll.get_dataset('local', 'open-orca-10k')
```

## Run signals
## Compute clusters

Let's run the PII detection signal on both the `question` and the `response` field.
Let's compute clusters on the `question` field.

```python
dataset.compute_signal(ll.PIISignal(), 'question')
dataset.compute_signal(ll.PIISignal(), 'response')
dataset.cluster('question')
```

Output:

```sh
Computing pii on local/open-orca-100k:question: 100%|█████████████████████████████████████| 100000/100000 [03:36<00:00, 462.62it/s]
Computing signal "pii" on local/open-orca-100k:question took 216.246s.
Wrote signal output to ./data/datasets/local/open-orca-100k/question/pii
Computing pii on local/open-orca-100k:response: 100%|█████████████████████████████████████| 100000/100000 [02:21<00:00, 708.04it/s]
Computing signal "pii" on local/open-orca-100k:response took 141.312s.
Wrote signal output to ./data/datasets/local/open-orca-100k/response/pii
[local/open-orca-10k][1 shards] map "extract_text" to "('question__cluster',)": 100%|██████████| 10000/10000 [00:00<00:00, 59156.94it/s]
Wrote map output to question__cluster-00000-of-00001.parquet
[local/open-orca-10k][1 shards] map "cluster_documents" to "('question__cluster',)": 0%| | 0/10000 [00:00<?, ?it/s]
jinaai/jina-embeddings-v2-small-en using device: mps:0
Computing embeddings: 100%|██████████| 10000/10000 [18:30<00:00, 9.01it/s]
Computing embeddings took 1113.504s.
UMAP: Reducing dim from 512 to 5 of 10000 vectors took 21.791s.
HDBSCAN: Clustering took 0.175s.
4515 noise points (45.1%) will be assigned to nearest cluster.
[local/open-orca-10k][1 shards] map "cluster_documents" to "('question__cluster',)": 100%|██████████| 10000/10000 [19:13<00:00, 8.67it/s]
HDBSCAN: Computing membership for the noise points took 15.788s.
Wrote map output to question__cluster-00000-of-00001.parquet
[local/open-orca-10k][1 shards] map "title_clusters" to "('question__cluster',)": 100%|██████████| 10000/10000 [00:26<00:00, 374.38it/s]
Wrote map output to question__cluster-00000-of-00001.parquet
Computing embeddings: 10000it [01:19, 125.71it/s]
Computing embeddings took 79.760s.
UMAP: Reducing dim from 512 to 5 of 10000 vectors took 53.578s.
HDBSCAN: Clustering took 0.136s.
137 noise points (1.4%) will be assigned to nearest cluster.
[local/open-orca-10k][1 shards] map "cluster_titles" to "('question__cluster',)": 100%|██████████| 10000/10000 [02:14<00:00, 74.37it/s]
HDBSCAN: Computing membership for the noise points took 0.426s.
Wrote map output to question__cluster-00000-of-00001.parquet
[local/open-orca-10k][1 shards] map "title_categories" to "('question__cluster',)": 100%|██████████| 10000/10000 [00:25<00:00, 395.07it/s]
Wrote map output to question__cluster-00000-of-00001.parquet
[local/open-orca-10k][1 shards] map "drop_temp_text_column" to "('question__cluster',)": 100%|██████████| 10000/10000 [00:00<00:00, 71313.87it/s]
Wrote map output to question__cluster-00000-of-00001.parquet
```
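The log above traces the pipeline stages: compute embeddings, reduce their dimensionality with UMAP, cluster with HDBSCAN, then assign the leftover noise points to a cluster. As a rough illustration of that last step, here is a toy nearest-centroid assignment over 1-D points; HDBSCAN's actual soft-membership computation is more involved, so treat this only as a sketch of the idea:

```python
# Toy sketch of the "noise points will be assigned to nearest cluster"
# step from the log above, using 1-D points and nearest-centroid distance.
# (HDBSCAN's real membership computation is more sophisticated.)

def assign_noise(points, labels):
    """Reassign label -1 (noise) to the cluster with the nearest centroid."""
    centroids = {}
    for lbl in set(labels) - {-1}:
        members = [p for p, l in zip(points, labels) if l == lbl]
        centroids[lbl] = sum(members) / len(members)
    return [
        l if l != -1 else min(centroids, key=lambda c: abs(centroids[c] - p))
        for p, l in zip(points, labels)
    ]

points = [0.1, 0.2, 0.9, 1.0, 0.15, 0.95]
labels = [0, 0, 1, 1, -1, -1]  # two clusters, two noise points
print(assign_noise(points, labels))  # [0, 0, 1, 1, 0, 1]
```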

The dataset now has the extra fields `question.pii` and `response.pii`, which we can see by printing
the entire schema:
The dataset now has the extra field `question__cluster`, which we can see by printing the entire
schema:

```py
print(dataset.manifest().data_schema)
@@ -78,48 +115,61 @@ Output:
```sh
id: string
system_prompt: string
question:
pii:
emails: list( string_span)
ip_addresses: list( string_span)
secrets: list( string_span)
response:
pii:
emails: list( string_span)
ip_addresses: list( string_span)
secrets: list( string_span)
gte-small: list(
embedding: embedding)
question: string
response: string
__hfsplit__: string
__rowid__: string
question__cluster:
cluster_id: int32
cluster_membership_prob: float32
cluster_title: string
category_id: int32
category_membership_prob: float32
category_title: string
```

Note that `question.pii.emails` is a list of `string_span` values. These are objects with `start`
and `end` indices that point to the location of the email in the original `question` text.

## Select specific rows

Let's query 5 rows that have emails in the `response` field via [](#Dataset.select_rows), a python
API that is analogous to a `SQL Select` statement. We do this by adding an [`exists`](#Filter.op)
filter on `response.pii.emails` to make sure it's not empty:
Let's find all clusters that talk about movies via [](#Dataset.select_rows), which works very
similarly to a `SQL Select` statement. We do this by adding a [`regex_matches`](#Filter.op) filter
on `question__cluster.cluster_title`:

```py
df_movies = dataset.select_rows(
  ['id', 'response', 'response.pii.emails'],
  ['id', 'question', 'question__cluster.cluster_title', 'question__cluster.cluster_id'],
  limit=5,
  filters=[('response.pii.emails', 'exists')]).df()
  filters=[('question__cluster.cluster_title', 'regex_matches', '[Mm]ovie')]).df()
print(df_movies)
```

Output:

```
id response response.pii.emails
0 flan.2166478 Subject: Bruce Colbourne, to D.Colbourne@ nrc-... [{'__value__': {'start': 157, 'end': 183}}, {'...
1 flan.2168520 Once you have logged into your email account a... [{'__value__': {'start': 482, 'end': 501}}]
2 flan.294964 Université McGill, 3550 Rue University, Montré... [{'__value__': {'start': 174, 'end': 196}}]
3 flan.1805392 Step 1: Identify the words in the text.\n\nTo ... [{'__value__': {'start': 274, 'end': 291}}, {'...
4 niv.204253 In this task, you are asked to translate an En... [{'__value__': {'start': 322, 'end': 341}}, {'...
id question \
0 t0.1073241 Answer the following question: Write a multi-c...
1 flan.1059135 Choose the correct sentiment from candidates:\...
2 flan.1794922 The "math" aspect to this is merely a gimmick ...
3 t0.243847 Q:Read the following paragraph and extract the...
4 t0.265856 Please answer the following question: Generate...
question__cluster.cluster_title question__cluster.cluster_id
0 Answering Movie-Related Questions 320
1 Movie Review Sentiments 286
2 Extracting Answers from Vampire Movie Plots 325
3 Extracting Answers from Movie Plots 313
4 Movie Plot Questions 371
```
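Assuming the `regex_matches` operator behaves like Python's unanchored `re.search` (an assumption for illustration, not a statement about Lilac's internals), the filter selects titles the way this sketch does:

```python
import re

# Sketch of what the '[Mm]ovie' filter selects, assuming regex_matches
# acts like an unanchored re.search over each cluster title.
cluster_titles = [
    "Answering Movie-Related Questions",
    "Movie Review Sentiments",
    "Translating English Sentences",
    "Extracting Answers from Movie Plots",
]
pattern = re.compile(r"[Mm]ovie")
matches = [t for t in cluster_titles if pattern.search(t)]
print(matches)  # the three movie-related titles; the translation title is excluded
```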

After confirming the results of this query, let's delete these rows:

```py
dataset.delete_rows(filters=[('question__cluster.cluster_title', 'regex_matches', '[Mm]ovie')])
print(dataset.count(), 'rows remaining')
```

Output:

```
9174 rows remaining
```

For more information on querying, see [](#Dataset.select_rows).
@@ -131,21 +181,15 @@ content. To do that we need to _index_ the `response` field using a text embedding. We only need to
index once. For a fast on-device embedding, we recommend the
[GTE-Small embedding](https://huggingface.co/thenlper/gte-small).

Before we can index with GTE-small, we need to install optional dependencies for the gte embedding:

```sh
pip install lilac[gte]
```

```py
dataset.compute_embedding('gte-small', 'response')
```

Output:

```sh
Computing gte-small on local/open-orca-100k:('response',): 100%|█████████████████████████████████████| 100000/100000 [17:59<00:00, 92.67it/s]
Computing signal "gte-small" on local/open-orca-100k:('response',) took 1079.260s.
Compute embedding GTESmall({"embed_input_type":"document","signal_name":"gte-small"}) on open-orca-10k:response: 100%|██████████| 9174/9174 [04:47<00:00, 31.93it/s]

```

Now we can preview the top 5 responses based on their profanity concept score:
@@ -156,15 +200,25 @@ r = dataset.select_rows(['response'], searches=[search], limit=5)
print(r.df())
```

Output (the response text is removed due to sensitive content):
Output:

```
response ... lilac/profanity/gte-small(response)
0 ***************** ... [{'__value__': {'start': 0, 'end': 17}, 'score...
1 ***************** ... [{'__value__': {'start': 0, 'end': 6}, 'score'...
2 ***************** ... [{'__value__': {'start': 0, 'end': 143}, 'scor...
3 ***************** ... [{'__value__': {'start': 0, 'end': 79}, 'score...
4 ***************** ... [{'__value__': {'start': 0, 'end': 376}, 'scor...
Computing topk on local/open-orca-10k:('response',) with embedding "gte-small" and vector store "hnsw" took 0.062s.
Computing signal "concept_labels" on local/open-orca-10k:('response',) took 0.012s.
Computing signal "concept_score" on local/open-orca-10k:('response',) took 0.025s.
response \
0 Part #1: Understand the text from a social med...
1 - Years active: Early 2000s to present\n- Birt...
2 sex
3 Sure! In a simple way for you to understand, t...
4 The nursery rhyme "Ding, Dong, Bell," also kno...
response.lilac/profanity/gte-small/preview
0 [{'__span__': {'start': 0, 'end': 113}, 'score...
1 [{'__span__': {'start': 0, 'end': 103}, 'score...
2 [{'__span__': {'start': 0, 'end': 3}, 'score':...
3 [{'__span__': {'start': 0, 'end': 78}, 'score'...
4 [{'__span__': {'start': 0, 'end': 164}, 'score...
```
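Each score entry carries a `__span__` with `start`/`end` character offsets into the scored text, so you can recover the exact substring a score refers to by slicing. The text and offsets below are illustrative, mirroring row 2 of the output above:

```python
# Recover the text a span points at by slicing with its start/end offsets.
# The span dict mirrors the '__span__' entries in the output above;
# the example text and score are illustrative.
response = "sex"
span_entry = {"__span__": {"start": 0, "end": 3}, "score": 0.98}

span = span_entry["__span__"]
scored_text = response[span["start"]:span["end"]]
print(scored_text)  # 'sex'
```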

To compute the concept score over the entire dataset, we do:
@@ -176,8 +230,8 @@ dataset.compute_concept('lilac', 'profanity', embedding='gte-small', path='response')
Output:

```sh
Computing lilac/profanity/gte-small on local/open-orca-100k:('response',): 100%|█████████████████████████████████▉| 100000/100000 [00:10<00:00, 9658.80it/s]
Wrote signal output to ./data/datasets/local/open-orca-100k/response/lilac/profanity/gte-small/v34
Compute signal ConceptSignal({"embedding":"gte-small","namespace":"lilac","concept_name":"profanity","version":36,"draft":"main","signal_name":"concept_score"}) on open-orca-10k:response: 100%|██████████| 9174/9174 [00:01<00:00, 7322.02it/s]
Wrote signal output to data/datasets/local/open-orca-10k/response/lilac/profanity/gte-small
```

## Convert formats
Expand All @@ -195,14 +249,18 @@ df.info()
Output:

```
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 100000 non-null object
1 system_prompt 100000 non-null object
2 question 100000 non-null object
3 response 100000 non-null object
4 __hfsplit__ 100000 non-null object
5 response.pii 100000 non-null object
6 response.lilac/profanity/gte-small/v34 100000 non-null object
7 question.pii 100000 non-null object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9174 entries, 0 to 9173
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 9174 non-null object
1 system_prompt 9174 non-null object
2 question 9174 non-null object
3 response 9174 non-null object
4 __hfsplit__ 9174 non-null object
5 question__cluster 9174 non-null object
6 __deleted__ 0 non-null object
dtypes: object(7)
memory usage: 501.8+ KB
```
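The `question__cluster` column holds nested struct values. If you prefer flat columns downstream, one way to flatten them is sketched below; the row shape is illustrative and the dotted-name convention is a choice, not something Lilac mandates:

```python
# Sketch: flatten a nested cluster struct into dotted top-level keys,
# as you might after exporting rows to JSON. The row shape is illustrative.
rows = [
    {"id": "t0.1073241",
     "question": "Answer the following question: ...",
     "question__cluster": {"cluster_id": 320,
                           "cluster_title": "Answering Movie-Related Questions"}},
]

flat = [
    {**{k: v for k, v in r.items() if k != "question__cluster"},
     **{f"question__cluster.{k}": v for k, v in r["question__cluster"].items()}}
    for r in rows
]
print(flat[0]["question__cluster.cluster_id"])  # 320
```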
