trim input to TGI, moved clustering and summarization to dataprep and store in DB #893
Open: rbrugaro wants to merge 15 commits into opea-project:main from rbrugaro:GRAG_1.1
+238
−104
Changes from 13 commits
Commits
dd63c23 trim input to TGI, moved clustering and summarization to dataprep and… (rbrugaro)
c730089 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
5acc016 removed inspect_db causing error in precommit (rbrugaro)
63a50c0 removed inspect db causing issues in precommit (rbrugaro)
f3f2519 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
b7fe867 add HF token to dataprep container because tokenizer is used now (rbrugaro)
f9bacb2 updated READMEs to reflect latest changes (rbrugaro)
fd6e4f6 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
d66052b bug fix all files are ingested and graph extracted first followed by … (rbrugaro)
33551bf [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
60a53c7 update README based on fix for multifile (rbrugaro)
b140381 Changes to make graphrag ui work (ichbinblau)
2ce9658 Merge branch 'main' into GRAG_1.1 (rbrugaro)
4b7f903 fix bug build communities done once at end of ingestion (rbrugaro)
d84bab7 Merge branch 'main' into GRAG_1.1 (rbrugaro)
@@ -1,5 +1,14 @@
# Dataprep Microservice with Neo4J

This Dataprep microservice performs:

- Graph extraction (entities, relationships and descriptions) using an LLM
- hierarchical_leiden clustering to identify communities in the knowledge graph
- Generation of a community summary for each community
- Storage of all of the above in the Neo4j graph DB

This microservice follows the graphRAG approach defined in the Microsoft paper "From Local to Global: A Graph RAG Approach to Query-Focused Summarization", with some differences: 1) only level-zero cluster summaries are leveraged, and 2) the input context to the final answer generation is trimmed to fit the maximum context length.
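
The two differences from the paper can be illustrated with a minimal sketch. It is not the PR's implementation: `CommunitySummary`, `build_context`, and the whitespace token counter are hypothetical names introduced here, and dropping whole summaries once the budget is reached is an assumed trimming policy (the real service would presumably count tokens with the model's tokenizer).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class CommunitySummary:
    level: int  # hierarchy level of the cluster; 0 is the only level used
    text: str   # LLM-generated summary of the community


def whitespace_token_count(text: str) -> int:
    # Stand-in tokenizer (assumption): counts whitespace-separated words.
    return len(text.split())


def build_context(
    summaries: Iterable[CommunitySummary],
    max_tokens: int,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> str:
    """Keep level-0 summaries, in order, until the token budget is used up."""
    kept: List[str] = []
    used = 0
    for summary in summaries:
        if summary.level != 0:
            continue  # difference 1: only level-zero cluster summaries
        cost = count_tokens(summary.text)
        if used + cost > max_tokens:
            break  # difference 2: stop before exceeding the context length
        kept.append(summary.text)
        used += cost
    return "\n".join(kept)
```

Passing a different `count_tokens` callable is how a real tokenizer could be swapped in without changing the selection logic.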

This dataprep microservice ingests the input files and uses an LLM (TGI, or an OpenAI model when OPENAI_API_KEY is set) to extract entities, relationships, and their descriptions to build a graph-based text index.
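
The backend choice described above can be sketched as follows. This is an illustration only: `select_llm_backend` and the returned labels are hypothetical, and treating an empty `OPENAI_API_KEY` as "not set" is an assumption rather than documented behavior.

```python
import os
from typing import Mapping, Optional


def select_llm_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "openai" when OPENAI_API_KEY is set, otherwise "tgi"."""
    # Default to the process environment; a mapping can be injected for testing.
    env = os.environ if env is None else env
    # Assumption: an empty value counts as unset.
    return "openai" if env.get("OPENAI_API_KEY") else "tgi"
```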

## Setup Environment Variables

@@ -78,6 +87,11 @@ curl -X POST \
http://${host_ip}:6004/v1/dataprep

Please note that clustering of extracted entities and summarization happen in this data preparation step. The consequences of this are:
Review comment: We can also use "result" instead of "consequence", but I have no preference.

- Long processing time for large datasets. One LLM call is made to summarize each cluster, which may result in a large volume of LLM calls.
- The graph DB entity_info and Cluster must be cleaned if dataprep is run multiple times, since the resulting cluster numbering will differ between consecutive calls and will corrupt the results.
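
To make the first point concrete, here is a minimal sketch of per-cluster summarization. The function name, the `(entity, cluster_id)` input shape, and the `summarize` callable standing in for the LLM are all assumptions made for illustration; the point is only that the number of LLM calls grows with the number of clusters.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def summarize_clusters(
    assignments: Iterable[Tuple[str, int]],
    summarize: Callable[[List[str]], str],
) -> Dict[int, str]:
    """Group entities by cluster id and summarize each cluster once."""
    members: Dict[int, List[str]] = defaultdict(list)
    for entity, cluster_id in assignments:
        members[cluster_id].append(entity)
    # Exactly one summarizer (LLM) call per cluster.
    return {cid: summarize(sorted(names)) for cid, names in members.items()}
```

Because the cluster ids are the dictionary keys, summaries from two runs with different numberings cannot be mixed safely, which is consistent with the cleanup requirement in the second point.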

We support table extraction from PDF documents. You can specify process_table and table_strategy with the following commands. "table_strategy" selects the strategy used to understand tables for table retrieval. As the setting progresses from "fast" to "hq" to "llm", the focus shifts towards deeper table understanding at the expense of processing speed. The default strategy is "fast".

Note: If you specify "table_strategy=llm", the TGI service will be used.
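
As a rough sketch of the options described above, the helper below validates `table_strategy` and builds the two request fields. The helper names and the `"true"`/`"false"` string encoding of `process_table` are assumptions, not taken from the service's code; only the three strategy names, the default, and the TGI dependency of `llm` come from the README text.

```python
from typing import Dict

# Strategies named in the README, ordered from fastest to deepest understanding.
TABLE_STRATEGIES = ("fast", "hq", "llm")


def build_table_options(process_table: bool, table_strategy: str = "fast") -> Dict[str, str]:
    """Return form fields for table extraction, rejecting unknown strategies."""
    if table_strategy not in TABLE_STRATEGIES:
        raise ValueError(
            f"table_strategy must be one of {TABLE_STRATEGIES}, got {table_strategy!r}"
        )
    return {
        # Assumed encoding of the boolean as a lowercase string.
        "process_table": "true" if process_table else "false",
        "table_strategy": table_strategy,
    }


def needs_tgi(table_strategy: str) -> bool:
    # Per the README note, only the "llm" strategy uses the TGI service.
    return table_strategy == "llm"
```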
Review comment: Typo in "microservice": miroservice -> microservice.
I'd use a hyperlink as below and leave the owner's name, like:
This microservice follows the graphRAG approach defined by paper From Local to Global: A Graph RAG Approach to Query-Focused Summarization