Question on certain use case: Checklist Deduplication #20

BradKML · 2021-09-26T10:49:51Z

Given a list of sentences and words that I want to deduplicate, what is the best way to automatically eliminate duplicate items, i.e. different wordings of the same item?

BradKML · 2021-09-26T11:16:37Z

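import spacy_universal_sentence_encoder
from sklearn.cluster import AgglomerativeClustering

nlp = spacy_universal_sentence_encoder.load_model('en_use_lg')

# Read the checklist, one item per line, skipping blank lines
with open('file.txt') as f:
    lines = [line.strip() for line in f if line.strip()]

# Embed each item with the Universal Sentence Encoder
vectors = [nlp(line).vector for line in lines]

# Number of clusters; it cannot exceed the number of items
k = min(256, len(lines))

# Ward linkage always uses Euclidean distance, so no distance argument is needed
cluster = AgglomerativeClustering(n_clusters=k, linkage='ward')
labels = cluster.fit_predict(vectors)

# Group the items by cluster label
groups = [[line for line, label in zip(lines, labels) if label == i] for i in range(k)]

# Print each cluster, separated by a blank line
for group in groups:
    print(*group, sep='\n')
    print()

# Write the same grouping to a file
with open('myfile.txt', 'w') as out:
    for group in groups:
        out.writelines(item + '\n' for item in group)
        out.write('\n')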
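
The script above needs a fixed number of clusters, which is usually unknown when deduplicating. As a rough sketch of an alternative (not part of the original script), one could keep an item only if its embedding is not too similar to any item already kept. The names `cosine_similarity` and `deduplicate`, the 0.8 threshold, and the toy two-dimensional vectors are assumptions for illustration; in practice the vectors would presumably be the `nlp(line).vector` embeddings, and the threshold would need tuning on real data.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def deduplicate(items, vectors, threshold=0.8):
    """Greedy pass: keep an item only if it is below the similarity
    threshold against every item kept so far (hypothetical helper)."""
    kept_items, kept_vectors = [], []
    for item, vec in zip(items, vectors):
        if all(cosine_similarity(vec, kv) < threshold for kv in kept_vectors):
            kept_items.append(item)
            kept_vectors.append(vec)
    return kept_items


# Toy data: the first two vectors point in almost the same direction
items = ["buy milk", "purchase milk", "call the bank"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

unique_items = deduplicate(items, vectors, threshold=0.8)
print(unique_items)
```

The pass compares each item against every kept item, so it is quadratic in the worst case; for a checklist of a few thousand lines that is likely acceptable. The first wording encountered is the one that survives, so the order of the input matters.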
