Hello, I am trying to get my head around the process, and more specifically how deduplication works. How does it detect that two nodes are duplicates: by the `entity_name` alone? It is possible for the same `entity_name` to correspond to two different entities. For example, "potential" in electrical engineering is different from "potential" in medicine. Conversely, two `entity_name` values that are slightly (or very) different may mean the same thing. I presume vector embeddings as well as graph data (relationships) could be used to determine this stochastically.
I believe the answer can be found directly in the code. If you check `operate.py`, look at the functions `merge_nodes_and_edges` and `_merge_nodes_then_upsert`. The logic is as follows:

1. Collect all nodes by `entity_name`.
2. Merge and upsert logic: the descriptions are merged together with `already_description`, then, if `num_fragment >= force_llm_summary_on_merge`, a new summary is generated by the LLM.

In short: if you have multiple nodes with the same `entity_name`, their descriptions are merged. If there are more than 5 such nodes, the LLM generates a new, summarized description instead.
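The merge logic described in this answer can be sketched roughly as follows. This is a simplified illustration, not the actual code in `operate.py`: the function `merge_nodes`, the `summarize` stand-in for the LLM call, the `SEP` separator, the shape of the `existing` mapping, and the default threshold of 6 (picked so that "more than 5" fragments trigger a summary) are all assumptions made for the example.

```python
from collections import defaultdict

SEP = "<SEP>"  # hypothetical separator between description fragments


def summarize(fragments):
    # Stand-in for the LLM summary call (assumption): reports the fragment count.
    return f"summary of {len(fragments)} fragments"


def merge_nodes(nodes, existing=None, force_llm_summary_on_merge=6):
    """Group nodes by entity_name and merge their descriptions."""
    existing = existing or {}

    # Step 1: collect all nodes by entity_name (the name is the only key).
    grouped = defaultdict(list)
    for node in nodes:
        grouped[node["entity_name"]].append(node["description"])

    # Step 2: merge with descriptions already stored under that name.
    merged = {}
    for name, descriptions in grouped.items():
        already_description = existing.get(name, [])
        fragments = sorted(set(descriptions + already_description))
        num_fragment = len(fragments)
        if num_fragment >= force_llm_summary_on_merge:
            # Too many fragments: replace them with one summary.
            merged[name] = summarize(fragments)
        else:
            # Few fragments: keep them all, joined by the separator.
            merged[name] = SEP.join(fragments)
    return merged


nodes = [
    {"entity_name": "potential", "description": "electric potential in volts"},
    {"entity_name": "potential", "description": "action potential of a neuron"},
]
print(merge_nodes(nodes))
```

In this sketch the two "potential" entries from the question end up in a single node, because only the name is compared; the descriptions are kept side by side rather than being told apart.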