diff --git a/docs/assets/images/Chart_to_Text_1.svg b/docs/assets/images/Chart_to_Text_1.svg new file mode 100644 index 0000000000..8f03162479 --- /dev/null +++ b/docs/assets/images/Chart_to_Text_1.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/assets/images/Checkbox_Detection.svg b/docs/assets/images/Checkbox_Detection.svg new file mode 100644 index 0000000000..16c0de6faf --- /dev/null +++ b/docs/assets/images/Checkbox_Detection.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/assets/images/De-identify_PDF_documents_GDPR_Compliance.svg b/docs/assets/images/De-identify_PDF_documents_GDPR_Compliance.svg new file mode 100644 index 0000000000..7369f6a39c --- /dev/null +++ b/docs/assets/images/De-identify_PDF_documents_GDPR_Compliance.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/assets/images/De-identify_PDF_documents_HIPAA_Compliance.svg b/docs/assets/images/De-identify_PDF_documents_HIPAA_Compliance.svg new file mode 100644 index 0000000000..16ea228e51 --- /dev/null +++ b/docs/assets/images/De-identify_PDF_documents_HIPAA_Compliance.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/assets/images/Deidentify_DICOM_documents_1.svg b/docs/assets/images/Deidentify_DICOM_documents_1.svg new file mode 100644 index 0000000000..15c92c4451 --- /dev/null +++ b/docs/assets/images/Deidentify_DICOM_documents_1.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/assets/images/Deidentify_Images.svg b/docs/assets/images/Deidentify_Images.svg new file mode 100644 index 0000000000..f24e33cee5 --- /dev/null +++ b/docs/assets/images/Deidentify_Images.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/assets/images/Document_Layout_Analysis.svg b/docs/assets/images/Document_Layout_Analysis.svg new file mode 100644 index 0000000000..7fb49259d9 --- /dev/null +++ b/docs/assets/images/Document_Layout_Analysis.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git 
a/docs/assets/images/HOCR_Table_Structure_Recognition_in_Document_Images.svg b/docs/assets/images/HOCR_Table_Structure_Recognition_in_Document_Images.svg new file mode 100644 index 0000000000..efaa0ae626 --- /dev/null +++ b/docs/assets/images/HOCR_Table_Structure_Recognition_in_Document_Images.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/assets/images/Image_Classifier_in_Document_Images.svg b/docs/assets/images/Image_Classifier_in_Document_Images.svg new file mode 100644 index 0000000000..345d3cb2eb --- /dev/null +++ b/docs/assets/images/Image_Classifier_in_Document_Images.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/assets/images/Image_Cleaner_to_Improve_Quality_of_Document_Images.svg b/docs/assets/images/Image_Cleaner_to_Improve_Quality_of_Document_Images.svg new file mode 100644 index 0000000000..caecb8e590 --- /dev/null +++ b/docs/assets/images/Image_Cleaner_to_Improve_Quality_of_Document_Images.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/assets/images/Image_Processing_to_Improve_Quality_of_Document_Images.svg b/docs/assets/images/Image_Processing_to_Improve_Quality_of_Document_Images.svg new file mode 100644 index 0000000000..5aa98fa2d3 --- /dev/null +++ b/docs/assets/images/Image_Processing_to_Improve_Quality_of_Document_Images.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/assets/images/Pretrained_pipeline_for_reading_and_skewing_correction.svg b/docs/assets/images/Pretrained_pipeline_for_reading_and_skewing_correction.svg new file mode 100644 index 0000000000..83409558d4 --- /dev/null +++ b/docs/assets/images/Pretrained_pipeline_for_reading_and_skewing_correction.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/assets/images/Pretrained_pipeline_for_reading_on_handwritten_documents.svg b/docs/assets/images/Pretrained_pipeline_for_reading_on_handwritten_documents.svg new file mode 100644 index 0000000000..d49bf1f0b6 --- /dev/null +++ 
b/docs/assets/images/Pretrained_pipeline_for_reading_on_handwritten_documents.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/assets/images/Pretrained_pipeline_for_reading_on_mixed_scanned_and_digital_PDF_documents.svg b/docs/assets/images/Pretrained_pipeline_for_reading_on_mixed_scanned_and_digital_PDF_documents.svg new file mode 100644 index 0000000000..e5d73c6c9f --- /dev/null +++ b/docs/assets/images/Pretrained_pipeline_for_reading_on_mixed_scanned_and_digital_PDF_documents.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/assets/images/Pretrained_pipeline_for_reading_on_printed_documents.svg b/docs/assets/images/Pretrained_pipeline_for_reading_on_printed_documents.svg new file mode 100644 index 0000000000..66bb538f87 --- /dev/null +++ b/docs/assets/images/Pretrained_pipeline_for_reading_on_printed_documents.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/assets/images/Pretrained_pipeline_for_readingand_removing.svg b/docs/assets/images/Pretrained_pipeline_for_readingand_removing.svg new file mode 100644 index 0000000000..843014a2e4 --- /dev/null +++ b/docs/assets/images/Pretrained_pipeline_for_readingand_removing.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/assets/images/Pretrained_pipeline_forreading_on_printed_documents.svg b/docs/assets/images/Pretrained_pipeline_forreading_on_printed_documents.svg new file mode 100644 index 0000000000..50296533a0 --- /dev/null +++ b/docs/assets/images/Pretrained_pipeline_forreading_on_printed_documents.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/assets/images/readme/Detect_Text_in_Document_Images_1.svg b/docs/assets/images/readme/Detect_Text_in_Document_Images_1.svg new file mode 100644 index 0000000000..d4d3714153 --- /dev/null +++ b/docs/assets/images/readme/Detect_Text_in_Document_Images_1.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/demos/enhance_low_quality_images.md 
b/docs/demos/enhance_low_quality_images.md index c3cfb26270..36a059e45c 100644 --- a/docs/demos/enhance_low_quality_images.md +++ b/docs/demos/enhance_low_quality_images.md @@ -81,5 +81,52 @@ data: - text: Colab type: blue_btn url: https://colab.research.google.com/github/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/SparkOCRGPUOperations.ipynb - + - title: Image Cleaner to Improve Quality of Document Images + id: image_cleaner_improve_quality_document_images + image: + src: /assets/images/Image_Cleaner_to_Improve_Quality_of_Document_Images.svg + excerpt: This model improves the quality of document images using our pre-trained Spark OCR model. + actions: + - text: Live Demo + type: normal + url: https://demo.johnsnowlabs.com/ocr/IMAGE_CLEANER/ + - text: Colab + type: blue_btn + url: https://github.com/JohnSnowLabs/visual-nlp-workshop/blob/master/jupyter/Cards/SparkOcrImageCleaner.ipynb + - title: Image Processing to Improve Quality of Document Images + id: image_processing_improve_quality_document_images + image: + src: /assets/images/Image_Processing_to_Improve_Quality_of_Document_Images.svg + excerpt: This model improves the quality of documents using different image processing algorithms from our pre-trained Spark OCR model. + actions: + - text: Live Demo + type: normal + url: https://demo.johnsnowlabs.com/ocr/IMAGE_PROCESSING/ + - text: Colab + type: blue_btn + url: https://github.com/JohnSnowLabs/visual-nlp-workshop/blob/master/jupyter/SparkOcrImagePreprocessing.ipynb + - title: Pretrained pipeline for reading and removing noise on mixed scanned and digital PDF documents + id: Pretrained_pipeline_noise_mixed_scanned_digital_pdf_documents + image: + src: /assets/images/Pretrained_pipeline_for_reading_on_mixed_scanned_and_digital_PDF_documents.svg + excerpt: Pretrained pipeline based on our pre-trained Spark OCR models, pipeline for doing transformer based OCR on printed texts. 
It ensures precise and efficient text extraction from printed images of various origins and formats, improving the overall OCR accuracy. + actions: + - text: Live Demo + type: normal + url: https://demo.johnsnowlabs.com/ocr/PP_IMAGE_PRINTED_TRANSFORMER_EXTRACTION/ + - text: Colab + type: blue_btn + url: https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/Cards/SparkOcrPretrainedPipelinesMixedScannedDigitalPdfImageCleaner.ipynb + - title: Pretrained pipeline for reading and skewing correction on mixed scanned and digital documents + id: Pretrained_pipeline_for_reading_skewing_correction_mixed_scanned_digital_documents + image: + src: /assets/images/Pretrained_pipeline_for_reading_on_mixed_scanned_and_digital_PDF_documents.svg + excerpt: Pretrained pipeline based on our pre-trained Spark OCR models, for conducting Optical Character Recognition (OCR) on mixed scanned and digital PDF documents with page rotation correction. It ensures precise and efficient text extraction from PDFs of various origins and formats, improving the overall OCR accuracy. 
+ actions: + - text: Live Demo + type: normal + url: https://demo.johnsnowlabs.com/ocr/PP_MIXED_SCANNED_DIGITAL_PDF_SKEW_CORRECTION/ + - text: Colab + type: blue_btn + url: https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/Cards/SparkOcrPretrainedPipelinesMixedScannedDigitalPdfSkewCorrection.ipynb --- diff --git a/docs/demos/extract_handwritten_texts.md b/docs/demos/extract_handwritten_texts.md index 4e1bdc8639..161d3543c6 100644 --- a/docs/demos/extract_handwritten_texts.md +++ b/docs/demos/extract_handwritten_texts.md @@ -68,4 +68,28 @@ data: - text: Colab type: blue_btn url: https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/3.6.0/jupyter/SparkOcrImageHandwrittenDetection.ipynb + - title: Pretrained pipeline for reading on handwritten PDF documents + id: pretrained_pipeline_reading_handwritten_pdf_documents + image: + src: /assets/images/Pretrained_pipeline_for_reading_on_handwritten_documents.svg + excerpt: Pretrained pipeline based on our pre-trained Spark OCR models, pipeline for doing transformer based OCR on handwritten texts. It ensures precise and efficient text extraction from handwritten pdfs of various origins and formats, improving the overall OCR accuracy. + actions: + - text: Live Demo + type: normal + url: https://demo.johnsnowlabs.com/ocr/PP_PDF_HANDWRITTEN_TRANSFORMER_EXTRACTION/ + - text: Colab + type: blue_btn + url: https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/Cards/SparkOcrPretrainedPipelinesPdfHandwrittenTransformerExtraction.ipynb + - title: Pretrained pipeline for reading on handwritten documents + id: pretrained_pipeline_reading_handwritten_documents + image: + src: /assets/images/Pretrained_pipeline_for_reading_on_handwritten_documents.svg + excerpt: Pretrained pipeline based on our pre-trained Spark OCR models, pipeline for doing transformer based OCR on handwritten texts. 
It ensures precise and efficient text extraction from handwritten images of various origins and formats, improving the overall OCR accuracy. + actions: + - text: Live Demo + type: normal + url: https://demo.johnsnowlabs.com/ocr/PP_IMAGE_HANDWRITTEN_TRANSFORMER_EXTRACTION/ + - text: Colab + type: blue_btn + url: https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/Cards/SparkOcrPretrainedPipelinesImageHandwrittenTransformerExtraction.ipynb --- diff --git a/docs/demos/extract_text_from_documents.md b/docs/demos/extract_text_from_documents.md index 6dd6cee473..8c34c27533 100644 --- a/docs/demos/extract_text_from_documents.md +++ b/docs/demos/extract_text_from_documents.md @@ -105,4 +105,64 @@ data: - text: Colab type: blue_btn url: https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/Cards/SparkOcrImageToTextPrinted_V2_opt.ipynb + - title: Detect Text in Document Images + id: detect_text_document_images + image: + src: /assets/images/Detect_Text_in_Document_Images_1.svg + excerpt: This model detects text in documents using our pre-trained Spark OCR model. + actions: + - text: Live Demo + type: normal + url: https://demo.johnsnowlabs.com/ocr/TEXT_DETECTION/ + - text: Colab + type: blue_btn + url: https://github.com/JohnSnowLabs/visual-nlp-workshop/blob/master/jupyter/Cards/SparkOcrImageTextDetection.ipynb + - title: Pretrained pipeline for reading on printed documents + id: pretrained_pipeline_reading_printed_documents + image: + src: /assets/images/Pretrained_pipeline_for_reading_on_printed_documents.svg + excerpt: Pretrained pipeline based on our pre-trained Spark OCR models, pipeline for doing transformer based OCR on printed texts. It ensures precise and efficient text extraction from printed images of various origins and formats, improving the overall OCR accuracy. 
+ actions: + - text: Live Demo + type: normal + url: https://demo.johnsnowlabs.com/ocr/PP_IMAGE_PRINTED_TRANSFORMER_EXTRACTION/ + - text: Colab + type: blue_btn + url: https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/Cards/SparkOcrPretrainedPipelinesImagePrintedTransformerExtraction.ipynb + - title: Pretrained pipeline for reading and removing noise on mixed scanned and digital PDF documents + id: Pretrained_pipeline_noise_mixed_scanned_digital_documents + image: + src: /assets/images/Pretrained_pipeline_for_reading_on_mixed_scanned_and_digital_PDF_documents.svg + excerpt: Pretrained pipeline based on our pre-trained Spark OCR models, pipeline for doing transformer based OCR on printed texts. It ensures precise and efficient text extraction from printed images of various origins and formats, improving the overall OCR accuracy. + actions: + - text: Live Demo + type: normal + url: https://demo.johnsnowlabs.com/ocr/PP_IMAGE_PRINTED_TRANSFORMER_EXTRACTION/ + - text: Colab + type: blue_btn + url: https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/Cards/SparkOcrPretrainedPipelinesMixedScannedDigitalPdfImageCleaner.ipynb + - title: Pretrained pipeline for reading on printed PDF documents + id: pretrained_pipeline_reading_printed_documents + image: + src: /assets/images/Pretrained_pipeline_for_reading_on_printed_documents.svg + excerpt: Pretrained pipeline based on our pre-trained Spark OCR models, pipeline for doing transformer based OCR on printed texts. It ensures precise and efficient text extraction from printed pdfs of various origins and formats, improving the overall OCR accuracy. 
+ actions: + - text: Live Demo + type: normal + url: https://demo.johnsnowlabs.com/ocr/PP_PDF_PRINTED_TRANSFORMER_EXTRACTION/ + - text: Colab + type: blue_btn + url: https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/Cards/SparkOcrPretrainedPipelinesPdfPrintedTransformerExtraction.ipynb + - title: Pretrained pipeline for reading on mixed scanned and digital PDF documents + id: pretrained_pipeline_reading_mixed_scanned_digital_pdf_documents + image: + src: /assets/images/Pretrained_pipeline_for_reading_on_printed_documents.svg + excerpt: Pretrained pipeline based on our pre-trained Spark OCR models, for conducting Optical Character Recognition (OCR) on mixed scanned and digital PDF documents. It ensures precise and efficient text extraction from PDFs of various origins and formats, improving the overall OCR accuracy. + actions: + - text: Live Demo + type: normal + url: https://demo.johnsnowlabs.com/ocr/PP_MIXED_SCANNED_DIGITAL_PDF/ + - text: Colab + type: blue_btn + url: https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/Cards/SparkOcrPretrainedPipelinesMixedScannedDigitalPdf.ipynb --- \ No newline at end of file diff --git a/docs/demos/visual_document_understanding.md b/docs/demos/visual_document_understanding.md index 903647ff67..706c8fedd2 100644 --- a/docs/demos/visual_document_understanding.md +++ b/docs/demos/visual_document_understanding.md @@ -129,4 +129,112 @@ data: - text: Colab type: blue_btn url: https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/208cebd1353c5b194baadbcea6e32c292eb46a08/jupyter/Cards/SparkOCRInfographicsVisualQuestionAnswering.ipynb + - title: Checkbox Detection + id: checkbox_detection + image: + src: /assets/images/Checkbox_Detection.svg + excerpt: This model detects and classifies checkboxes in document images using our pre-trained Spark OCR model. 
+ actions: + - text: Live Demo + type: normal + url: https://demo.johnsnowlabs.com/ocr/CHECKBOX_DETECTION/ + - text: Colab + type: blue_btn + url: https://colab.research.google.com/github/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/SparkOcrCheckBoxDetection.ipynb + - title: Deidentify DICOM documents + id: deidentify_dicom_documents_1 + image: + src: /assets/images/Deidentify_DICOM_documents_1.svg + excerpt: Deidentify DICOM documents by masking PHI on the image and by either masking or obfuscating PHI from the metadata. + actions: + - text: Live Demo + type: normal + url: https://demo.johnsnowlabs.com/ocr/DEID_DICOM_IMAGE/ + - text: Colab + type: blue_btn + url: https://github.com/JohnSnowLabs/visual-nlp-workshop/blob/master/jupyter/SparkOcrImageDeIdentification.ipynb + - title: Deidentify Images + id: deidentify_images + image: + src: /assets/images/Deidentify_Images.svg + excerpt: Deidentify images by masking sensitive information on the image and by either masking or obfuscating. + actions: + - text: Live Demo + type: normal + url: https://demo.johnsnowlabs.com/ocr/DEID_IMAGE/ + - text: Colab + type: blue_btn + url: https://github.com/JohnSnowLabs/visual-nlp-workshop/blob/master/jupyter/SparkOcrImageDeIdentification.ipynb + - title: De-identify PDF documents - GDPR Compliance + id: deidentify_pdf_documents_gdpr_compliance + image: + src: /assets/images/De-identify_PDF_documents_GDPR_Compliance.svg + excerpt: Deidentify PDF documents using GDPR guidelines by anonymizing PHI using out-of-the-box Spark NLP models. 
+ actions: + - text: Live Demo + type: normal + url: https://demo.johnsnowlabs.com/ocr/DEID_PDF_GDPR/ + - text: Colab + type: blue_btn + url: https://github.com/JohnSnowLabs/visual-nlp-workshop/blob/master/jupyter/SparkOcrImageDeIdentification.ipynb + - title: De-identify PDF documents - HIPAA Compliance + id: deidentify_pdf_documents_hippa_compliance + image: + src: /assets/images/De-identify_PDF_documents_HIPAA_Compliance.svg + excerpt: Deidentify PDF documents using HIPAA guidelines by masking PHI using out-of-the-box Spark NLP models. + actions: + - text: Live Demo + type: normal + url: https://demo.johnsnowlabs.com/ocr/DEID_PDF_HIPAA/ + - text: Colab + type: blue_btn + url: https://github.com/JohnSnowLabs/visual-nlp-workshop/blob/master/jupyter/SparkOcrImageDeIdentification.ipynb + - title: Image Classifier in Document Images + id: image_classifier_document_images + image: + src: /assets/images/Image_Classifier_in_Document_Images.svg + excerpt: This model classifies document images using our pre-trained Spark OCR model. + actions: + - text: Live Demo + type: normal + url: https://demo.johnsnowlabs.com/ocr/IMAGE_CLASSIFIER/ + - text: Colab + type: blue_btn + url: https://github.com/JohnSnowLabs/visual-nlp-workshop/blob/master/jupyter/SparkOCRVisualDocumentClassifierv3.ipynb + - title: Document Layout Analysis + id: document_layout_analysis + image: + src: /assets/images/Document_Layout_Analysis.svg + excerpt: Identify and structure the visual elements in a document by using our pre-trained Spark OCR models. 
+ actions: + - text: Live Demo + type: normal + url: https://demo.johnsnowlabs.com/ocr/LAYOUT_ANALYSIS/ + - text: Colab + type: blue_btn + url: https://github.com/JohnSnowLabs/visual-nlp-workshop/blob/master/jupyter/Cards/SparkOCRDitLayoutAnalyze.ipynb + - title: Chart to Text + id: chart_text + image: + src: /assets/images/Chart_to_Text_1.svg + excerpt: Obtain a deeper interpretation of the charts in the PDF input document by using our Spark OCR model powered by an LLM. + actions: + - text: Live Demo + type: normal + url: https://demo.johnsnowlabs.com/ocr/PDF_CHART_TO_TEXT/ + - text: Colab + type: blue_btn + url: https://github.com/JohnSnowLabs/visual-nlp-workshop/blob/master/jupyter/SparkOcrChartToTextLLM.ipynb + - title: HOCR Table Structure Recognition in Document Images + id: hocr_table_structure_recognition_document_images + image: + src: /assets/images/HOCR_Table_Structure_Recognition_in_Document_Images.svg + excerpt: This model obtains the table structure of document images using our HOCR pre-trained Spark OCR model. + actions: + - text: Live Demo + type: normal + url: https://demo.johnsnowlabs.com/ocr/PDF_TABLE_RECOGNITION_HOCR/ + - text: Colab + type: blue_btn + url: https://github.com/JohnSnowLabs/visual-nlp-workshop/blob/master/jupyter/SparkOcrImageTableRecognitionWHOCR.ipynb ---