Question on certain use case: Checklist Deduplication #20

Open
BradKML opened this issue Sep 26, 2021 · 1 comment

BradKML commented Sep 26, 2021

Given a list of sentences and words that I want to deduplicate, what is the best way to automatically eliminate duplicate items (that is, different wordings of the same item)?

BradKML commented Sep 26, 2021

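import numpy as np
import spacy_universal_sentence_encoder
from sklearn.cluster import AgglomerativeClustering

nlp = spacy_universal_sentence_encoder.load_model('en_use_lg')

# Read the checklist, dropping trailing newlines and blank lines.
with open('file.txt') as f:
    lines = [line.strip() for line in f if line.strip()]

# Embed each item as one sentence vector.
vectors = np.array([nlp(line).vector for line in lines])

# Number of clusters to form; it cannot exceed the number of items.
k = min(256, len(lines))

# Ward linkage only works with Euclidean distance, which is the default,
# so the `affinity` argument (deprecated in scikit-learn 1.2) is dropped.
cluster = AgglomerativeClustering(n_clusters=k, linkage='ward')
labels = cluster.fit_predict(vectors)

# Group the items by cluster label.
groups = [[] for _ in range(k)]
for line, label in zip(lines, labels):
    groups[label].append(line)

# Print each cluster as a block, separated by a blank line.
for group in groups:
    print(*group, sep='\n')
    print()

# Write the same grouping to a file.
with open('myfile.txt', 'w') as out:
    for group in groups:
        out.writelines(item + '\n' for item in group)
        out.write('\n')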
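
The clustering above groups similar items but still leaves two choices open: how many clusters to ask for, and which item of each cluster to keep. One possible alternative, sketched below under stated assumptions, is a greedy filter that keeps an item only if it is not too similar to anything already kept. The function names `cosine` and `dedupe` and the `threshold` value of 0.8 are hypothetical and would need tuning; the toy 2-D vectors stand in for the sentence vectors that `nlp(...).vector` produces in the script above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dedupe(items, vectors, threshold=0.8):
    """Keep the first item of every group of near-duplicates.

    An item is dropped when the cosine similarity between its vector and
    the vector of an already-kept item is at least `threshold`
    (a hypothetical default, not a recommended value).
    """
    kept_items, kept_vectors = [], []
    for item, vec in zip(items, vectors):
        if all(cosine(vec, kv) < threshold for kv in kept_vectors):
            kept_items.append(item)
            kept_vectors.append(vec)
    return kept_items


# Toy 2-D vectors standing in for real sentence embeddings.
items = ["buy milk", "purchase milk", "call mom"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(dedupe(items, vectors))  # → ['buy milk', 'call mom']
```

This compares every item against every kept item, so it is quadratic in the worst case; that is usually acceptable for a checklist-sized file. The result depends on input order, since the first wording seen is the one that survives.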
