cookbook: pii redaction using lemur #221

Merged 3 commits on Apr 7, 2025
4 changes: 4 additions & 0 deletions fern/docs.yml
@@ -371,6 +371,10 @@ navigation:
          path: pages/05-guides/cookbooks/streaming-stt/real-time.mdx
          slug: real-time
          hidden: true
        - page: Redact PII from Text Using LeMUR
          path: pages/05-guides/cookbooks/lemur/lemur-pii-redaction.mdx
          slug: lemur-pii-redaction
          hidden: true
    - section: SDK References
      icon: duotone cubes
      contents:
4 changes: 4 additions & 0 deletions fern/pages/03-audio-intelligence/pii-redaction.mdx
@@ -560,6 +560,10 @@ of things. The season has been pretty dry already, and then the fact that we're
getting hit in the US. Is because there's a couple of weather systems that ...
```

<Tip title="PII Redaction Using LeMUR">
To use LeMUR for custom PII redaction, see the guide [Redact PII from Text Using LeMUR](/docs/lemur/lemur-pii-redaction).
</Tip>

## Create redacted audio files

In addition to redacting sensitive information from the transcription text, you can also generate a version of the original audio file with the PII "beeped" out.
@@ -25,7 +25,7 @@ const client = new AssemblyAI({
Next, create the transcript from your audio file, either a local file or a URL. AssemblyAI's servers must be able to access the URL, so make sure it links to a downloadable file.

```javascript
- const transcript = await client.transcripts.create({
+ const transcript = await client.transcripts.transcribe({
    audio_url: "./sample.mp4",
  });
```
175 changes: 175 additions & 0 deletions fern/pages/05-guides/cookbooks/lemur/lemur-pii-redaction.mdx
@@ -0,0 +1,175 @@
---
title: "Redact PII from Text Using LeMUR"
---

This guide will show you how to use AssemblyAI's LeMUR framework to redact personally identifiable information (PII) from text.

## Quickstart

```python
import assemblyai as aai
import json
import os

aai.settings.api_key = 'YOUR API KEY'

def generate_ner(transcript_text):
    prompt = '''
    You will be given a transcript of a conversation or text. Your task is to generate named entities from the given transcript text.

    Please identify and extract the following named entities from the transcript:

    1. Person names
    2. Organization names
    3. Email addresses
    4. Phone numbers
    5. Full addresses

    When extracting these entities, make sure to return the exact spelling and formatting as they appear in the transcript. Do not modify or standardize the entities in any way.

    Present your results in a JSON format with a single field named "named_entities". This field should contain an array of strings, where each string is a named entity you've identified. For example:
    {
        "named_entities": ["John Doe", "Acme Corp", "john.doe@example.com", "123-456-7890", "123 Main St, Anytown, USA 12345"]
    }

    Important: Do not include any other information, explanations, or text in your response. Your output should consist solely of the JSON object containing the named entities.

    If you do not find any named entities of a particular type, simply return an empty array for the "named_entities" field.
    '''

    response = aai.Lemur().task(
        prompt=prompt,
        input_text=transcript_text,
        max_output_size=4000,
        temperature=0.0,
        final_model=aai.LemurModel.claude3_5_sonnet
    ).response

    try:
        res_json = json.loads(response)
    except json.JSONDecodeError:  # response was not valid JSON
        res_json = {'named_entities': []}

    named_entities = res_json.get('named_entities', [])

    return named_entities

transcriber = aai.Transcriber(config=aai.TranscriptionConfig(language_code='en'))
transcript = transcriber.transcribe('YOUR_AUDIO_URL')

redacted_transcript = ''

for sentence in transcript.get_sentences():
    generated_entities = generate_ner(sentence.text)

    redacted_sentence = sentence.text

    for entity in generated_entities:
        redacted_sentence = redacted_sentence.replace(entity, '#' * len(entity))  # mask with '#' of equal length

    redacted_transcript += redacted_sentence + ' '
    print(redacted_sentence)

print('Full redacted transcript:')
print(redacted_transcript)
```

## Get Started

Check warning on line 77 in fern/pages/05-guides/cookbooks/lemur/lemur-pii-redaction.mdx (GitHub Actions / lint)

[vale] reported by reviewdog 🐶 [AssemblyAI.Headings] Use sentence-style capitalization for 'Get Started'. (line 77, column 4; severity: WARNING)

Before we begin, make sure you have an AssemblyAI account and an API key. You can [sign up](https://assemblyai.com/dashboard/signup) for an account and get your API key from your dashboard.

For information about LeMUR pricing, see our [pricing page](https://www.assemblyai.com/pricing).

## Step-by-Step Instructions

Check warning on line 83 in fern/pages/05-guides/cookbooks/lemur/lemur-pii-redaction.mdx (GitHub Actions / lint)

[vale] reported by reviewdog 🐶 [AssemblyAI.Headings] Use sentence-style capitalization for 'Step-by-Step Instructions'. (line 83, column 4; severity: WARNING)

Install the SDK.

```bash
pip install assemblyai
```

Import the `assemblyai` package and set your API key.

```python
import assemblyai as aai
import json
import os

aai.settings.api_key = 'YOUR API KEY'
```
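
Hard-coding the key works for a quick test. As a minimal sketch of an alternative, the key could be read from the environment instead. The variable name `ASSEMBLYAI_API_KEY` and the helper `load_api_key` are assumptions made for illustration; neither is part of the SDK.

```python
import os


def load_api_key(env=os.environ, var_name="ASSEMBLYAI_API_KEY"):
    """Return the API key stored in `var_name`, or raise if it is missing.

    Hypothetical helper: the variable name is an assumption, not an SDK default.
    """
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


# Usage, assuming `import assemblyai as aai` from the step above:
# aai.settings.api_key = load_api_key()
```

Passing the environment in as a parameter keeps the helper easy to exercise without touching the real process environment.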

Define a function `generate_ner` that uses LeMUR to identify named entities (person names, organizations, emails, phone numbers, addresses) in a given text.

```python
def generate_ner(transcript_text):
    prompt = '''
    You will be given a transcript of a conversation or text. Your task is to generate named entities from the given transcript text.

    Please identify and extract the following named entities from the transcript:

    1. Person names
    2. Organization names
    3. Email addresses
    4. Phone numbers
    5. Full addresses

    When extracting these entities, make sure to return the exact spelling and formatting as they appear in the transcript. Do not modify or standardize the entities in any way.

    Present your results in a JSON format with a single field named "named_entities". This field should contain an array of strings, where each string is a named entity you've identified. For example:
    {
        "named_entities": ["John Doe", "Acme Corp", "john.doe@example.com", "123-456-7890", "123 Main St, Anytown, USA 12345"]
    }

    Important: Do not include any other information, explanations, or text in your response. Your output should consist solely of the JSON object containing the named entities.

    If you do not find any named entities of a particular type, simply return an empty array for the "named_entities" field.
    '''

    response = aai.Lemur().task(
        prompt=prompt,
        input_text=transcript_text,
        max_output_size=4000,
        temperature=0.0,
        final_model=aai.LemurModel.claude3_5_sonnet
    ).response

    try:
        res_json = json.loads(response)
    except json.JSONDecodeError:  # response was not valid JSON
        res_json = {'named_entities': []}

    named_entities = res_json.get('named_entities', [])

    return named_entities
```
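
`generate_ner` returns an empty list whenever `json.loads` fails, which means a reply that is valid JSON wrapped in extra text would redact nothing. The following is a sketch of a more tolerant parser, assuming the model may occasionally wrap the object in a Markdown code fence or add stray text around it. `parse_named_entities` is a hypothetical helper, not part of the SDK or of this guide's code.

```python
import json
import re


def parse_named_entities(response_text):
    """Extract the "named_entities" list from a model response.

    Hypothetical helper: tolerates code fences or stray text around the JSON
    object and returns an empty list when nothing usable is found.
    """
    # Grab the outermost {...} span, ignoring anything before or after it.
    match = re.search(r"\{.*\}", response_text, flags=re.DOTALL)
    if match is None:
        return []
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    entities = data.get("named_entities", [])
    if not isinstance(entities, list):
        return []
    # Keep only non-empty strings so later str.replace calls are safe.
    return [e for e in entities if isinstance(e, str) and e]
```

With a helper like this, the `try`/`except` block in `generate_ner` could be reduced to a single call on the `response` string.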

Transcribe an audio file using the AssemblyAI Transcriber.

```python
transcriber = aai.Transcriber(config=aai.TranscriptionConfig(language_code='en'))
transcript = transcriber.transcribe('YOUR_AUDIO_URL')
```

Iterate through each sentence in the transcript, identify named entities using `generate_ner`, and replace each one with `#` characters of the same length.

```python
redacted_transcript = ''

for sentence in transcript.get_sentences():
    generated_entities = generate_ner(sentence.text)

    redacted_sentence = sentence.text

    for entity in generated_entities:
        redacted_sentence = redacted_sentence.replace(entity, '#' * len(entity))

    redacted_transcript += redacted_sentence + ' '
    print(redacted_sentence)
```
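
The loop above replaces entities in the order the model returns them. If a short entity such as "John" comes before a longer one such as "John Doe", the longer string no longer matches and "Doe" is left visible. Here is a sketch of one way to avoid that by masking the longest entities first. `redact_entities` is a hypothetical helper introduced for illustration only.

```python
def redact_entities(text, entities, mask_char="#"):
    """Replace every occurrence of each entity with same-length masking.

    Hypothetical helper: entities are processed longest first so a short
    entity cannot break up a longer one before the longer one is matched.
    """
    # Deduplicate, then sort by descending length (ties broken alphabetically).
    ordered = sorted(set(entities), key=lambda e: (-len(e), e))
    for entity in ordered:
        if entity:  # skip empty strings
            text = text.replace(entity, mask_char * len(entity))
    return text


sample = "Call John Doe at 123-456-7890."
print(redact_entities(sample, ["John", "John Doe", "123-456-7890"]))
```

Inside the sentence loop, the inner `for entity in generated_entities` block could then become `redacted_sentence = redact_entities(sentence.text, generated_entities)`.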

Print the redacted transcript.

```python
print('Full redacted transcript:')
print(redacted_transcript)
```
12 changes: 12 additions & 0 deletions fern/pages/05-guides/index.mdx
@@ -1016,6 +1016,18 @@ For examples using the API without SDKs see [API guides](#api-guides).
          />
        </a>
      </li>
      <li>
        <a
          href="guides/lemur-pii-redaction"
          className="link-cta rounded-lg flex items-center gap-2"
        >
          Redact PII from Text Using LeMUR{" "}
          <Icon
            icon="duotone arrow-right"
            color="rgba(var(--accent-aaa),var(--tw-text-opacity,1))"
          />
        </a>
      </li>
    </ul>
  </div>
