Update pipeline card (#1399)
albertoandreottiATgmail authored Aug 21, 2024
1 parent 3abbede commit ed4d2d3
Showing 1 changed file with 27 additions and 4 deletions.
@@ -18,7 +18,8 @@ use_language_switcher: "Python-Scala-Java"

## Description

This pipeline can be used to extract PHI information from PDF such as AGE, BIOID, CITY, COUNTRY, DATE, DEVICE, DOCTOR, EMAIL, FAX, HEALTHPLAN, HOSPITAL, IDNUM, LOCATION, MEDICALRECORD, ORGANIZATION, PATIENT, PHONE, PROFESSION, STATE, STREET, URL, USERNAME, ZIP, ACCOUNT, LICENSE, VIN, SSN, DLN, PLATE, IPADDR entities.
This pipeline masks protected health information (PHI) in PDF documents. Masked entities include AGE, BIOID, CITY, COUNTRY, DATE, DEVICE, DOCTOR, EMAIL, FAX, HEALTHPLAN, HOSPITAL, IDNUM, LOCATION, MEDICALRECORD, ORGANIZATION, PATIENT, PHONE, PROFESSION, STATE, STREET, URL, USERNAME, ZIP, ACCOUNT, LICENSE, VIN, SSN, DLN, PLATE, and IPADDR.
The output is a PDF document similar to the input, but with black boxes drawn over the targeted entities.

{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
@@ -28,8 +29,6 @@ This pipeline can be used to extract PHI information from PDF such as AGE, BIOID

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
@@ -82,4 +81,28 @@ deid_pipeline = PretrainedPipeline("pdf_deid_subentity_context_augmented_pipelin
```
- PositionFinder
- ImageDrawRegions
- ImageToPdf
- PdfAssembler
- PdfAssembler


## Speed Benchmarks

- **Dataset:** 1000 scanned PDF pages.
- **Instances:**
  - m5n.4xlarge (16 vCPUs, 64 GiB memory)
  - m5n.8xlarge (32 vCPUs, 128 GiB memory)
- **AMI:** ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20240411
- **Versions:**
  - **spark-nlp Version:** v5.4.0
  - **visual-nlp Version:** v5.3.2
  - **spark-nlp-jsl Version:** v5.3.2
  - **Spark Version:** v3.4.1
- **Visual NLP Pipeline:** 'pdf_deid_subentity_context_augmented_pipeline'


#### Benchmark Table

{:.table-model-big}
| Instance      | Memory  | Cores | Input pages | Partitions | Seconds per page | Total time |
| ------------- | ------- | ----- | ----------- | ---------- | ---------------- | ---------- |
| m5n.4xlarge   | 64 GiB  | 16    | 1000        | 10         | 0.24             | 4 mins     |
| m5n.8xlarge   | 128 GiB | 32    | 1000        | 32         | 0.15             | 2.5 mins   |
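
The total-time column follows directly from the per-page figure: pages multiplied by seconds per page, converted to minutes. The short sketch below redoes that arithmetic for the two rows of the table. The helper names `total_minutes` and `pages_per_minute` are illustrative only and are not part of any Spark NLP or Visual NLP API.

```python
# Rows copied from the benchmark table: (instance, pages, seconds per page).
BENCHMARKS = [
    ("m5n.4xlarge", 1000, 0.24),
    ("m5n.8xlarge", 1000, 0.15),
]


def total_minutes(pages: int, seconds_per_page: float) -> float:
    """Wall-clock minutes needed to process `pages` at the given per-page rate."""
    return pages * seconds_per_page / 60


def pages_per_minute(seconds_per_page: float) -> float:
    """Throughput implied by the per-page rate."""
    return 60 / seconds_per_page


if __name__ == "__main__":
    for instance, pages, rate in BENCHMARKS:
        print(
            f"{instance}: {total_minutes(pages, rate):.1f} min total, "
            f"{pages_per_minute(rate):.0f} pages/min"
        )
```

Run as a script, this reproduces the 4 and 2.5 minute totals in the table. The same helpers can be used to estimate run time for a different page count, under the assumption that the per-page rate stays roughly constant on the same instance type.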
