Hi, thanks for making Turso!
I'm really excited about this project, so I'd really love it if you nail the vector search functionality, because it would be perfect to have everything I need in one place.
But I have some concerns looking at what you went for with libsql, and now seemingly with Turso. I tried vector search with libsql, which uses a DiskANN index, and the experience was rather bad, to be honest. I only used 30,000 vectors with 1024 dimensions at single precision (32-bit), which is around 117 MiB of raw data, but the index took around 5 GiB of disk space. That's just too much in my opinion, especially when scaling up the dataset. Imagine having a million vectors of this size (certainly a realistic use case): that'd be around 166 GiB, which disqualifies such a vector search for many edge-computing scenarios. Of course, one can use quantization and tune parameters like the number of neighbors per node, but that only reduces the blowup so much. Also, as far as I can tell from my benchmarks, index updates on inserts and updates slowed everything down quite a lot.
So please, please, please: going forward with Turso, can you first focus on making brute-force search very efficient using SIMD? I think some people underestimate how far you can get with just that. You can stay under 100 ms per query even for millions of vectors, and you don't have any of the disadvantages of a search index: no extra memory or disk usage, no expensive index maintenance on inserts or updates. Just look at what the sqlite.ai folks did here: https://github.com/sqliteai/sqlite-vector. It works really well. No fancy index, just highly optimized brute-force search with dynamic runtime dispatch depending on what kind of CPU you're on (checking for SSE2, AVX2, ARM NEON, etc.). And no crazy memory or disk usage. Technically they also offer a kind of index, which simply stores quantized, tightly packed vector data, but it's really fast to update and its memory usage can be capped. Or look at how usearch uses SIMD-accelerated brute-force search: https://github.com/unum-cloud/usearch?tab=readme-ov-file#exact-vs-approximate-search
I'd argue that this is what 95% of your users will ever need, and it's far more robust and efficient than a search index like DiskANN or HNSW.
Don't get me wrong, feel free to implement those more complex indices for users who want to scale to huge numbers of vectors and run their database on beefy servers. But what portion of your user base will that be? Of course there's a limit to the feasibility of brute-force search, but it's way higher than some people might assume, so please give this some consideration. Thanks!