How does "Deduplication to Optimize Graph Operation D(.)" identify duplicate nodes? #1526

ntsarb · 2025-05-05T18:46:29Z


Hello, I am trying to understand the process, and more specifically how deduplication works. How does it detect that two nodes are duplicates? By the entity_name alone?

It is possible to have the same entity_name corresponding to two different entities. For example, the "potential" in electrical engineering is different from the "potential" in medicine.

Conversely, two entity_name values that differ slightly (or greatly) may mean the same thing. I presume vector embeddings, as well as graph data (relationships), could be used to determine this probabilistically.
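To make the embedding idea concrete, here is a minimal sketch of flagging likely duplicates by cosine similarity. It is only an illustration of the presumption above, not something the project is known to do: the vectors are made-up toy values, and `likely_duplicates` and `SIMILARITY_THRESHOLD` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors, invented for illustration only.
# Real embeddings would come from an embedding model.
toy_embeddings = {
    "electric potential": [0.9, 0.1, 0.0],
    "voltage": [0.8, 0.2, 0.1],
    "action potential": [0.1, 0.9, 0.2],
}

SIMILARITY_THRESHOLD = 0.95  # hypothetical cut-off, would need tuning


def likely_duplicates(embeddings, threshold):
    """Return name pairs whose embeddings are at least `threshold` similar."""
    names = sorted(embeddings)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if cosine_similarity(embeddings[first], embeddings[second]) >= threshold:
                pairs.append((first, second))
    return pairs


print(likely_duplicates(toy_embeddings, SIMILARITY_THRESHOLD))
# prints [('electric potential', 'voltage')]
```

With these toy values, the two differently named electrical terms are paired, while the medical "action potential" stays separate even though it shares a word with one of them.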

reqyou · 2025-06-04T15:17:21Z


I believe the answer can be found directly in the code: in operate.py, look at the functions merge_nodes_and_edges and _merge_nodes_then_upsert. The logic is as follows:

Collect all nodes by entity_name:
```
all_nodes[entity_name].extend(entities)
```
Merge and upsert logic:
- If the node already exists: its existing description is added to already_description, then either:
  - the deduplicated descriptions are joined into a single string:
```
GRAPH_FIELD_SEP.join(sorted(set([dp["description"] for dp in nodes_data] + already_description)))
```
  or
  - if num_fragment >= force_llm_summary_on_merge, a new summary is generated using:
```
summary = await use_llm_func_with_cache(...)
```
- If the node doesn't exist: it is created, and its descriptions are merged or summarized with the same logic as above.

In short: nodes are matched by entity_name only. If you have multiple nodes with the same entity_name (e.g., potential in both an electrical and a medical context), their descriptions are merged. A single node will appear in the graph, but its description will cover both topics.

If there are more than 5 such nodes (that is, once num_fragment reaches the force_llm_summary_on_merge threshold), the LLM generates a new, summarized description instead.
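To make the flow above easier to follow, here is a minimal, simplified sketch of merging by entity_name. It is not the actual code from operate.py: `merge_by_name` and `summarize_stub` are hypothetical names, the LLM call is replaced by a stub, and the values of `GRAPH_FIELD_SEP` and `FORCE_LLM_SUMMARY_ON_MERGE` are assumptions chosen for illustration.

```python
from collections import defaultdict

GRAPH_FIELD_SEP = "<SEP>"  # assumed separator value, for illustration only
FORCE_LLM_SUMMARY_ON_MERGE = 6  # assumed threshold, for illustration only


def summarize_stub(description):
    """Hypothetical stand-in for the LLM summary call."""
    return f"[summary of {description.count(GRAPH_FIELD_SEP) + 1} fragments]"


def merge_by_name(extracted, existing=None):
    """Group entities by entity_name and merge their descriptions.

    `existing` maps entity_name to the description already stored in the graph.
    """
    existing = existing or {}

    # Step 1: collect all extracted nodes under their entity_name.
    all_nodes = defaultdict(list)
    for entity in extracted:
        all_nodes[entity["entity_name"]].append(entity)

    merged = {}
    for name, nodes_data in all_nodes.items():
        # Step 2: if the node already exists, reuse its stored fragments.
        already_description = []
        if name in existing:
            already_description = existing[name].split(GRAPH_FIELD_SEP)

        # Step 3: deduplicate, sort, and join the description fragments.
        fragments = sorted(
            set([dp["description"] for dp in nodes_data] + already_description)
        )
        description = GRAPH_FIELD_SEP.join(fragments)

        # Step 4: with enough fragments, replace the join with a summary.
        if len(fragments) >= FORCE_LLM_SUMMARY_ON_MERGE:
            description = summarize_stub(description)
        merged[name] = description
    return merged


extracted = [
    {"entity_name": "potential", "description": "Voltage difference between two points."},
    {"entity_name": "potential", "description": "Electrical activity of a cell membrane."},
]
graph = merge_by_name(extracted)
print(graph["potential"])
# prints Electrical activity of a cell membrane.<SEP>Voltage difference between two points.
```

Note that nothing in this sketch looks at meaning: the electrical and the medical "potential" end up in one node simply because the dictionary key is the name.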

ntsarb Jun 4, 2025
Author

Thank you.

While the example I used earlier (electrical engineering vs. medicine) can be avoided by maintaining separate knowledge graphs for different subjects, it is very often the case that, even within a single scientific field such as computing, a word takes different meanings depending on the context, e.g. programming language, communications protocol, etc. I suspect that similar overlaps are found in all other scientific fields.

I will check whether something relevant already exists among the feature requests and recommend that this be considered as a new feature/development, i.e. that the algorithm allow duplicate entities where the meaning/context is different.
