
github-actions[bot] edited this page Dec 12, 2024 · 50 revisions

# LLM - Crack, Chunk and Embed Data

`llm_rag_crack_and_chunk_and_embed`

## Overview

Creates chunks no larger than `chunk_size` from the files in `input_data`; extracted document titles are prepended to each chunk.

LLMs have token limits on the prompts passed to them. This is a limiting factor at embedding time, and even more so at prompt-completion time, since only so much context can be passed along with the instructions and the user query. Chunking splits source data of various formats into small but coherent snippets of information that can be 'packed' into LLM prompts when answering user queries related to the source documents.
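As a rough illustration of the idea (this is not the component's actual implementation, which counts model tokens; here whitespace-separated words stand in for tokens):

```python
# Sketch of fixed-size chunking with overlap. Illustrative only:
# the real component tokenizes per the embedding model, not by whitespace.

def chunk_tokens(tokens, chunk_size, chunk_overlap=0):
    """Yield lists of at most `chunk_size` tokens, each chunk sharing
    `chunk_overlap` tokens with the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    for start in range(0, len(tokens), step):
        yield tokens[start:start + chunk_size]
        if start + chunk_size >= len(tokens):
            break

text = "one two three four five six seven eight nine ten"
chunks = [" ".join(c) for c in chunk_tokens(text.split(), chunk_size=4, chunk_overlap=1)]
# Each chunk repeats the last word of the previous one, preserving context
# across chunk boundaries.
```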

Supported formats: md, txt, html/htm, pdf, ppt(x), doc(x), xls(x), py

Also generates embedding vectors for the data chunks, if configured.

If `embeddings_container` is supplied, input chunks are compared to the existing chunks in the Embeddings Container; only changed or new chunks are embedded, and existing chunks are reused.
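The reuse decision can be pictured as a partition over content identity. The sketch below is hypothetical (the component's real bookkeeping lives inside the embeddings container and is not documented here); it only illustrates the "embed changed/new, reuse unchanged" behavior using a content hash:

```python
# Hypothetical illustration of chunk reuse; function and variable names
# are invented and do NOT mirror the component's internals.
import hashlib

def split_reusable(new_chunks, existing_hashes):
    """Partition chunks into those whose content hash is already known
    (reusable, no re-embedding needed) and those that must be embedded."""
    to_embed, reused = [], []
    for chunk in new_chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        (reused if digest in existing_hashes else to_embed).append(chunk)
    return to_embed, reused

existing = {hashlib.sha256(b"unchanged chunk").hexdigest()}
to_embed, reused = split_reusable(["unchanged chunk", "brand new chunk"], existing)
```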

Version: 0.0.48

Tags: Preview

View in Studio: https://ml.azure.com/registries/azureml/components/llm_rag_crack_and_chunk_and_embed/version/0.0.48

## Inputs

### Input AzureML Data

| Name | Description | Type | Default | Optional | Enum |
| ---- | ----------- | ---- | ------- | -------- | ---- |
| input_data | Uri Folder containing files to be chunked. | uri_folder | | | |

### Files to handle from source

| Name | Description | Type | Default | Optional | Enum |
| ---- | ----------- | ---- | ------- | -------- | ---- |
| input_glob | Limit files opened from input_data; defaults to `**/*`. | string | | True | |

### Chunking options

| Name | Description | Type | Default | Optional | Enum |
| ---- | ----------- | ---- | ------- | -------- | ---- |
| chunk_size | Maximum number of tokens to put in each chunk. | integer | 768 | | |
| chunk_overlap | Number of tokens to overlap between chunks. | integer | 0 | | |
| doc_intel_connection_id | Connection id for the Document Intelligence service. If provided, it will be used to extract content from .pdf documents. | string | | True | |
| citation_url | Base URL to join with file paths to create the full source file URL for chunk metadata. | string | | True | |
| citation_replacement_regex | A JSON string with two fields, 'match_pattern' and 'replacement_pattern', to be used with re.sub on the source URL. E.g. `{"match_pattern": "(.*)/articles/(.*)(\.[^.]+)$", "replacement_pattern": "\1/\2"}` would remove '/articles' from the middle of the URL. | string | | True | |
| use_rcts | Whether to use RecursiveCharacterTextSplitter to split documents into chunks. | string | | True | ['True', 'False'] |
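A runnable sketch of how a `citation_replacement_regex` value is applied (the URL and pattern values here are illustrative; the mechanism is `re.sub`, as stated in the parameter description):

```python
# Demonstrates applying the 'match_pattern'/'replacement_pattern' pair
# to a chunk's source URL with re.sub. Example values only.
import json
import re

citation_replacement_regex = json.dumps({
    "match_pattern": r"(.*)/articles/(.*)(\.[^.]+)$",
    "replacement_pattern": r"\1/\2",
})

cfg = json.loads(citation_replacement_regex)
url = "https://example.com/docs/articles/intro.md"
# '\1' and '\2' are backreferences to the captured groups; the file
# extension (group 3) and the '/articles' segment are dropped.
citation = re.sub(cfg["match_pattern"], cfg["replacement_pattern"], url)
# citation == "https://example.com/docs/intro"
```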

### If adding to previously generated Embeddings

| Name | Description | Type | Default | Optional | Enum |
| ---- | ----------- | ---- | ------- | -------- | ---- |
| embeddings_container | Folder containing previously generated embeddings. Should be the parent folder of the 'embeddings' output path used for this component. Input data will be compared to the existing embeddings, and only changed/new data will be embedded, reusing existing chunks. | uri_folder | | True | |

### Embeddings settings

| Name | Description | Type | Default | Optional | Enum |
| ---- | ----------- | ---- | ------- | -------- | ---- |
| embeddings_model | The model to use to embed data. E.g. 'hugging_face://model/sentence-transformers/all-mpnet-base-v2' or 'azure_open_ai://deployment/{deployment_name}/model/{model_name}' | string | | True | |
| embeddings_connection_id | The connection id of the Embeddings Model provider to use. | string | | True | |
| batch_size | Batch size to use when embedding data. | integer | 100 | | |
| num_workers | Number of workers to use when embedding data. -1 defaults to CPUs / 2. | integer | -1 | | |
| verbosity | Verbosity level for the embedding process, specific to document processing information. 0: aggregate Source/Document info; 1: Source ids logged as processed; 2: Document ids logged as processed. | integer | 0 | | |
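The `embeddings_model` URI formats above can be unpacked as follows. This parser is illustrative only (the component's own parsing is not shown here); it assumes just the two documented forms, where a Hugging Face model id may itself contain `/`:

```python
# Illustrative parser for the two embeddings_model URI forms documented
# above. Not the component's own code.

def parse_embeddings_model(uri):
    """Split '<provider>://<key>/<value>/...' into a provider name and a
    dict of path segments."""
    provider, _, rest = uri.partition("://")
    segments = rest.split("/")
    if provider == "hugging_face":
        # e.g. hugging_face://model/sentence-transformers/all-mpnet-base-v2:
        # the model id contains '/', so keep everything after the first key.
        return provider, {segments[0]: "/".join(segments[1:])}
    # e.g. azure_open_ai://deployment/<name>/model/<name>: pair up segments.
    return provider, dict(zip(segments[::2], segments[1::2]))

provider, spec = parse_embeddings_model(
    "azure_open_ai://deployment/my-deploy/model/text-embedding-ada-002")
hf_provider, hf_spec = parse_embeddings_model(
    "hugging_face://model/sentence-transformers/all-mpnet-base-v2")
```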

## Outputs

| Name | Description | Type |
| ---- | ----------- | ---- |
| embeddings | Where to save data with embeddings. This should be a subfolder of the previous embeddings if supplied, typically named using '${name}', e.g. /my/prev/embeddings/${name}. | uri_folder |

## Environment

`azureml:llm-rag-embeddings@latest`
