Hi, I have online listings from multiple source sites, and I want to dedupe them. Is FAISS a good fit for this? My listing metadata is in Supabase (PostgreSQL).
Is your goal to remove semantic duplicates or exact ones? Do you anticipate needing to do this many times or just once? And what is the scale of your data set? If exact, you don't need Faiss: you can just hash or exact-match your data to dedupe. This is easiest, especially if you only need to do it once. If semantic duplicates, then Faiss could be useful: you can turn your online listing data into embeddings and find near neighbors. There are also projects like https://github.com/facebookresearch/SemDeDup which may be useful for you.
Btw, do you recommend any feature-engineering tool @mnorris11? I ended up creating my own 9-feature vector based on domain knowledge. Just curious if there's something out there that creates these better.