[EPIC] Expand Knowledge Document Ingestion Pipeline #324

Open
4 tasks
aakankshaduggal opened this issue Oct 25, 2024 · 1 comment
Assignees
Labels: epic (Larger tracking issue encompassing multiple smaller issues), stale

Comments

@aakankshaduggal
Member

Add support for ingesting various document types (Markdown, PDF, DOCX, etc.) and converting them into formats compatible with synthetic data generation (SDG) workflows.

Key Features:

  • InstructLab Schema: Define an InstructLab schema that standardizes input formats for both SDG and RAG.
  • Docling Integration: Use Docling to convert source documents (PDF, DOCX, HTML) into JSON that conforms to the schema.
  • Document Chunking Command: Develop an ilab document format command that chunks and formats documents according to the SDG schema.
  • Simplified Git Workflows: Introduce a script that handles Git repo setup, directory structure, and file organization for knowledge documents.
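
To make the chunking and schema items above more concrete, here is a minimal sketch of what the chunking step could look like for Markdown input. Everything in it is an assumption for illustration: the `Chunk` record fields, the `chunk_markdown` helper, and the `max_words` limit are hypothetical and are not the InstructLab schema or an existing `ilab` command, neither of which is defined in this issue yet. It uses only the Python standard library and does not call Docling.

```python
import json
import re
from dataclasses import asdict, dataclass


@dataclass
class Chunk:
    # Hypothetical record layout; the real InstructLab schema is still to be defined.
    source: str   # name of the document the chunk came from
    heading: str  # nearest Markdown heading above the chunk ("" if none)
    index: int    # position of the chunk within the document
    text: str     # chunk body, whitespace-normalized


def chunk_markdown(source: str, markdown: str, max_words: int = 200) -> list[Chunk]:
    """Split Markdown on ATX headings, then cap each piece at max_words words."""
    chunks: list[Chunk] = []
    heading = ""
    body: list[str] = []

    def flush() -> None:
        # Emit the text collected under the current heading in max_words slices.
        words = " ".join(body).split()
        for start in range(0, len(words), max_words):
            text = " ".join(words[start:start + max_words])
            chunks.append(Chunk(source, heading, len(chunks), text))
        body.clear()

    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            flush()  # close out the previous section before switching headings
            heading = match.group(1).strip()
        else:
            body.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    sample = "# Intro\nSome background text.\n## Details\nMore specific content here."
    records = [asdict(c) for c in chunk_markdown("sample.md", sample, max_words=50)]
    print(json.dumps(records, indent=2))
```

Keeping the heading on every record is one possible design choice: it lets downstream SDG or RAG steps retain section context after the document has been split. A real implementation would likely chunk the structured output of the Docling conversion rather than raw Markdown lines.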

This issue has been automatically marked as stale because it has not had activity within 90 days. It will be automatically closed if no further activity occurs within 30 days.

@github-actions github-actions bot added the stale label Jan 31, 2025
Projects
None yet
Development

No branches or pull requests

3 participants