
feat: add nvidia ingest component #6333

Merged 30 commits on Feb 20, 2025.
Commits
- b25858f: initial (jordanrfrazier, Feb 13, 2025)
- 11f01c0: cleanup (jordanrfrazier, Feb 13, 2025)
- 7ab10e1: [autofix.ci] apply automated fixes (autofix-ci[bot], Feb 13, 2025)
- 893ef3e: ruff (jordanrfrazier, Feb 13, 2025)
- 01fdc7c: add else (jordanrfrazier, Feb 13, 2025)
- c494220: update deps (jordanrfrazier, Feb 13, 2025)
- 31aa1bc: uv lock (jordanrfrazier, Feb 14, 2025)
- de69da2: Make nv-ingest an optional dep (jordanrfrazier, Feb 14, 2025)
- 76c082e: revert change to validate (jordanrfrazier, Feb 14, 2025)
- d5078ae: rebase fixes (jordanrfrazier, Feb 14, 2025)
- 0fdcbb2: [autofix.ci] apply automated fixes (autofix-ci[bot], Feb 14, 2025)
- 941ddec: Update language (jordanrfrazier, Feb 14, 2025)
- 6848a23: add extra args to make target (jordanrfrazier, Feb 14, 2025)
- 733861d: [autofix.ci] apply automated fixes (autofix-ci[bot], Feb 14, 2025)
- 8d43e24: update error language (jordanrfrazier, Feb 18, 2025)
- f8a3c76: lockfile update (jordanrfrazier, Feb 18, 2025)
- 7b4c3ce: rebase lockfile (jordanrfrazier, Feb 18, 2025)
- 859efbf: [autofix.ci] apply automated fixes (autofix-ci[bot], Feb 18, 2025)
- 0ff64f8: Adds nv-ingest by default to -ep docker image (jordanrfrazier, Feb 19, 2025)
- c2dfa7c: caps fixes (jordanrfrazier, Feb 20, 2025)
- 05aedd2: update uv lock (jordanrfrazier, Feb 20, 2025)
- 9246315: revert ruff upgrade (jordanrfrazier, Feb 20, 2025)
- ee45664: ruff (jordanrfrazier, Feb 20, 2025)
- 2ad3e48: Fix lint (jordanrfrazier, Feb 20, 2025)
- 5c82ebe: [autofix.ci] apply automated fixes (autofix-ci[bot], Feb 20, 2025)
- aa9a2f5: Merge branch 'main' into nvidia-components-ingest (ogabrielluiz, Feb 20, 2025)
- ed4bb2f: No code changes made. (ogabrielluiz, Feb 20, 2025)
- 979aab5: fix: update ruff configuration to ignore additional linting rule and … (ogabrielluiz, Feb 20, 2025)
- 6528a4e: fix: update ruff command to ignore linting rule A005 during autofix (ogabrielluiz, Feb 20, 2025)
- 27e0298: fix: update Ruff check command to ignore linting rule A005 (ogabrielluiz, Feb 20, 2025)
4 changes: 2 additions & 2 deletions .github/workflows/py_autofix.yml
@@ -14,8 +14,8 @@ jobs:
   - uses: actions/checkout@v4
   - name: "Setup Environment"
     uses: ./.github/actions/setup-uv
-  - run: uv run ruff check --fix-only .
-  - run: uv run ruff format .
+  - run: uv run ruff check --fix-only . --ignore A005
+  - run: uv run ruff format . --config pyproject.toml
   - uses: autofix-ci/action@551dded8c6cc8a1054039c8bc0b8b48c51dfc6ef
   - name: Minimize uv cache
     run: uv cache prune --ci
2 changes: 1 addition & 1 deletion .github/workflows/style-check-py.yml
@@ -24,6 +24,6 @@ jobs:
   - name: Register problem matcher
     run: echo "::add-matcher::.github/workflows/matchers/ruff.json"
   - name: Run Ruff Check
-    run: uv run --only-dev ruff check --output-format=github .
+    run: uv run --only-dev ruff check --output-format=github . --ignore A005
   - name: Minimize uv cache
     run: uv cache prune --ci
4 changes: 2 additions & 2 deletions Makefile
@@ -65,7 +65,7 @@ reinstall_backend: ## forces reinstall all dependencies (no caching)

 install_backend: ## install the backend dependencies
 	@echo 'Installing backend dependencies'
-	@uv sync --frozen
+	@uv sync --frozen $(EXTRA_ARGS)

 install_frontend: ## install the frontend dependencies
 	@echo 'Installing frontend dependencies'
@@ -198,7 +198,7 @@ fix_codespell: ## run codespell to fix spelling errors
 	poetry run codespell --toml pyproject.toml --write

 format_backend: ## backend code formatters
-	@uv run ruff check . --fix --ignore EXE002
+	@uv run ruff check . --fix --ignore EXE002 --ignore A005
 	@uv run ruff format . --config pyproject.toml

 format_frontend: ## frontend code formatters
4 changes: 2 additions & 2 deletions docker/build_and_push_ep.Dockerfile
@@ -40,7 +40,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \
     --mount=type=bind,source=src/backend/base/README.md,target=src/backend/base/README.md \
     --mount=type=bind,source=src/backend/base/uv.lock,target=src/backend/base/uv.lock \
     --mount=type=bind,source=src/backend/base/pyproject.toml,target=src/backend/base/pyproject.toml \
-    uv sync --frozen --no-install-project --no-editable
+    uv sync --frozen --no-install-project --no-editable --extra nv-ingest

 COPY ./src /app/src

@@ -58,7 +58,7 @@ COPY ./uv.lock /app/uv.lock
 COPY ./README.md /app/README.md

 RUN --mount=type=cache,target=/root/.cache/uv \
-    uv sync --frozen --no-editable
+    uv sync --frozen --no-editable --extra nv-ingest

 ################################
 # RUNTIME
11 changes: 11 additions & 0 deletions pyproject.toml
@@ -149,6 +149,14 @@ local = [
 clickhouse-connect = [
     "clickhouse-connect==0.7.19"
 ]
+nv-ingest = [
+    # nv-ingest-client 2025.2.7.dev0 does not correctly install its
+    # dependencies, so we need to install some manually.
+    "nv-ingest-client==2025.2.7.dev0",
+    "python-pptx==0.6.23",
+    "pymilvus[bulk_writer,model]==2.5.0",
+    "llama-index-embeddings-nvidia==0.1.5",
+]

 [project.scripts]
 langflow = "langflow.__main__:main"
@@ -276,6 +284,9 @@ external = ["RUF027"]
     "SLF001",
 ]

+[tool.ruff.lint.flake8-builtins]
+builtins-allowed-modules = [ "io", "logging", "socket"]

 [tool.mypy]
 plugins = ["pydantic.mypy"]
 follow_imports = "skip"
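The A005 rule being ignored above exists because a first-party module that reuses a standard-library name can silently shadow the real module; the `builtins-allowed-modules` setting opts `io`, `logging`, and `socket` into that risk deliberately. A minimal sketch of the hazard (the throwaway directory and `SHADOWED` attribute are hypothetical, for illustration only):

```python
import importlib
import pathlib
import sys
import tempfile

# Create a throwaway directory containing a module named "socket",
# the same situation builtins-allowed-modules permits for this repo.
tmp = pathlib.Path(tempfile.mkdtemp())
(tmp / "socket.py").write_text("SHADOWED = True\n")

sys.path.insert(0, str(tmp))      # project dir sits ahead of the stdlib
sys.modules.pop("socket", None)   # drop any cached stdlib import
mod = importlib.import_module("socket")

# The local file wins: code expecting the stdlib socket module gets this
# one instead, which is exactly what Ruff's A005 warns about.
print(getattr(mod, "SHADOWED", False))
```

Allow-listing the modules keeps the lint quiet only for names the project knowingly reuses.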
3 changes: 2 additions & 1 deletion src/backend/base/langflow/components/nvidia/__init__.py
@@ -1,3 +1,4 @@
+from .nvidia_ingest import NvidiaIngestComponent
 from .nvidia_rerank import NvidiaRerankComponent

-__all__ = ["NvidiaRerankComponent"]
+__all__ = ["NvidiaIngestComponent", "NvidiaRerankComponent"]
235 changes: 235 additions & 0 deletions src/backend/base/langflow/components/nvidia/nvidia_ingest.py
@@ -0,0 +1,235 @@
from pathlib import Path
from urllib.parse import urlparse

from loguru import logger

from langflow.custom import Component
from langflow.io import BoolInput, DropdownInput, FileInput, IntInput, MessageTextInput, Output
from langflow.schema import Data


class NvidiaIngestComponent(Component):
display_name = "NVIDIA Ingest"
description = "Process, transform, and store data."
documentation: str = "https://github.com/NVIDIA/nv-ingest/tree/main/docs"
icon = "NVIDIA"
name = "NVIDIAIngest"
beta = True

try:
from nv_ingest_client.util.file_processing.extract import EXTENSION_TO_DOCUMENT_TYPE

file_types = list(EXTENSION_TO_DOCUMENT_TYPE.keys())
supported_file_types_info = f"Supported file types: {', '.join(file_types)}"
except ImportError:
msg = (
"NVIDIA Ingest dependencies missing. "
"Please install them using your package manager. (e.g. uv sync --extra nv-ingest)"
)
logger.warning(msg)
file_types = [msg]
supported_file_types_info = msg

inputs = [
MessageTextInput(
name="base_url",
display_name="NVIDIA Ingestion URL",
info="The URL of the NVIDIA Ingestion API.",
),
FileInput(
name="path",
display_name="Path",
file_types=file_types,
info=supported_file_types_info,
required=True,
),
BoolInput(
name="extract_text",
display_name="Extract Text",
info="Extract text from documents",
value=True,
),
BoolInput(
name="extract_charts",
display_name="Extract Charts",
info="Extract text from charts",
value=False,
),
BoolInput(
name="extract_tables",
display_name="Extract Tables",
info="Extract text from tables",
value=True,
),
DropdownInput(
name="text_depth",
display_name="Text Depth",
info=(
"Level at which text is extracted (applies before splitting). "
"Support for 'block', 'line', 'span' varies by document type."
),
options=["document", "page", "block", "line", "span"],
value="document", # Default value
advanced=True,
),
BoolInput(
name="split_text",
display_name="Split Text",
info="Split text into smaller chunks",
value=True,
),
DropdownInput(
name="split_by",
display_name="Split By",
info="How to split into chunks ('size' splits by number of characters)",
options=["page", "sentence", "word", "size"],
value="word", # Default value
advanced=True,
),
IntInput(
name="split_length",
display_name="Split Length",
info="The size of each chunk based on the 'split_by' method",
value=200,
advanced=True,
),
IntInput(
name="split_overlap",
display_name="Split Overlap",
info="Number of segments (as determined by the 'split_by' method) to overlap from previous chunk",
value=20,
advanced=True,
),
IntInput(
name="max_character_length",
display_name="Max Character Length",
info="Maximum number of characters in each chunk",
value=1000,
advanced=True,
),
IntInput(
name="sentence_window_size",
display_name="Sentence Window Size",
info="Number of sentences to include from previous and following chunk (when split_by='sentence')",
value=0,
advanced=True,
),
]

outputs = [
Output(display_name="Data", name="data", method="load_file"),
]

def load_file(self) -> list[Data]:
try:
from nv_ingest_client.client import Ingestor
except ImportError as e:
msg = (
"NVIDIA Ingest dependencies missing. "
"Please install them using your package manager. (e.g. uv sync --extra nv-ingest)"
)
raise ImportError(msg) from e

self.base_url: str | None = self.base_url.strip() if self.base_url else None

if not self.path:
err_msg = "Upload a file to use this component."
self.log(err_msg, name="NVIDIAIngestComponent")
raise ValueError(err_msg)

resolved_path = self.resolve_path(self.path)
extension = Path(resolved_path).suffix[1:].lower()
if extension not in self.file_types:
err_msg = f"Unsupported file type: {extension}"
self.log(err_msg, name="NVIDIAIngestComponent")
raise ValueError(err_msg)

try:
parsed_url = urlparse(self.base_url)
if not parsed_url.hostname or not parsed_url.port:
err_msg = "Invalid URL: Missing hostname or port."
self.log(err_msg, name="NVIDIAIngestComponent")
raise ValueError(err_msg)
except Exception as e:
self.log(f"Error parsing URL: {e}", name="NVIDIAIngestComponent")
raise

self.log(
f"Creating Ingestor for host: {parsed_url.hostname!r}, port: {parsed_url.port!r}",
name="NVIDIAIngestComponent",
)
try:
ingestor = (
Ingestor(message_client_hostname=parsed_url.hostname, message_client_port=parsed_url.port)
.files(resolved_path)
.extract(
extract_text=self.extract_text,
extract_tables=self.extract_tables,
extract_charts=self.extract_charts,
extract_images=False, # Currently not supported
text_depth=self.text_depth,
)
)
except Exception as e:
self.log(f"Error creating Ingestor: {e}", name="NVIDIAIngestComponent")
raise

if self.split_text:
ingestor = ingestor.split(
split_by=self.split_by,
split_length=self.split_length,
split_overlap=self.split_overlap,
max_character_length=self.max_character_length,
sentence_window_size=self.sentence_window_size,
)

try:
result = ingestor.ingest()
except Exception as e:
self.log(f"Error during ingestion: {e}", name="NVIDIAIngestComponent")
raise

self.log(f"Results: {result}", name="NVIDIAIngestComponent")

data = []
document_type_text = "text"
document_type_structured = "structured"

# Result is a list of segments as determined by the text_depth option (if "document" then only one segment)
# each segment is a list of elements (text, structured, image)
for segment in result:
for element in segment:
document_type = element.get("document_type")
metadata = element.get("metadata", {})
source_metadata = metadata.get("source_metadata", {})
content_metadata = metadata.get("content_metadata", {})

if document_type == document_type_text:
data.append(
Data(
text=metadata.get("content", ""),
file_path=source_metadata.get("source_name", ""),
document_type=document_type,
description=content_metadata.get("description", ""),
)
)
# Both charts and tables are returned as "structured" document type,
# with extracted text in "table_content"
elif document_type == document_type_structured:
table_metadata = metadata.get("table_metadata", {})
data.append(
Data(
text=table_metadata.get("table_content", ""),
file_path=source_metadata.get("source_name", ""),
document_type=document_type,
description=content_metadata.get("description", ""),
)
)
else:
# image is not yet supported; skip if encountered
self.log(f"Unsupported document type: {document_type}", name="NVIDIAIngestComponent")

self.status = data if data else "No data"
return data
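The segment/element flattening at the end of `load_file` can be sketched on hand-made data. The element dicts below are hypothetical sample values, but they follow the shapes the component reads (`document_type`, `metadata.content`, `table_metadata.table_content`, `source_metadata.source_name`):

```python
# Hypothetical nv-ingest result: one segment holding a text element,
# a structured (table/chart) element, and an unsupported image element.
sample_result = [
    [
        {
            "document_type": "text",
            "metadata": {
                "content": "Hello world",
                "source_metadata": {"source_name": "doc.pdf"},
                "content_metadata": {"description": "body text"},
            },
        },
        {
            "document_type": "structured",
            "metadata": {
                "table_metadata": {"table_content": "col_a | col_b"},
                "source_metadata": {"source_name": "doc.pdf"},
                "content_metadata": {"description": "table"},
            },
        },
        {"document_type": "image", "metadata": {}},
    ]
]

records = []
for segment in sample_result:          # one segment per text_depth unit
    for element in segment:
        doc_type = element.get("document_type")
        metadata = element.get("metadata", {})
        if doc_type == "text":
            text = metadata.get("content", "")
        elif doc_type == "structured":
            # charts and tables both arrive as "structured", with the
            # extracted text under table_content
            text = metadata.get("table_metadata", {}).get("table_content", "")
        else:
            continue  # images are skipped, as in the component
        records.append(
            {
                "text": text,
                "file_path": metadata.get("source_metadata", {}).get("source_name", ""),
                "document_type": doc_type,
            }
        )

print(records)
```

The component wraps each record in a `Data` object instead of a plain dict, but the traversal and the text/structured split are the same.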