My usual line of feedback would be to start with a more aggressive benchmark.
Indexing 100K dense vectors (100ish MB here) is generally not a good idea.
Brute-force search at that scale is already trivial: at ~10 GB/s per core, a full scan of 100 MB takes on the order of 10 ms.
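For a rough sense of what that looks like, here's a minimal sketch (plain NumPy; the shapes and names are my own assumptions, not from the post):

```python
import numpy as np

# Hypothetical corpus: 100K vectors of dim 256 in fp32 (~100 MB total).
rng = np.random.default_rng(0)
corpus = rng.standard_normal((100_000, 256), dtype=np.float32)
query = rng.standard_normal(256, dtype=np.float32)

# Exact top-10 by dot product: a single streaming pass over the corpus,
# which one modern core chews through in the low tens of milliseconds.
scores = corpus @ query
top10 = np.argpartition(scores, -10)[-10:]      # 10 best, unordered
top10 = top10[np.argsort(scores[top10])[::-1]]  # sort best-first
```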
They say in the post that they're doing optimized brute-force search, which honestly makes a lot of sense for the local-scale context.
Vector databases are often over-optimized for getting results into papers (where recall@latency is the be-all-end-all benchmark). Indexed search is very rigid: filtered search is a pain, and you're stuck with the metric you chose at indexing time.
At smaller data scales you get a lot more flexibility, and mindlessly shoving things into an indexed search will lead to a lot of pain. Providing optimized, flexible search at smaller scales is quite valuable, imo.
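To make the flexibility point concrete, here's a toy sketch of my own (assuming a NumPy corpus like the one above): with brute force, a metadata filter is just a boolean mask applied before ranking, and swapping the similarity metric is a one-line change, neither of which a prebuilt index gives you for free.

```python
import numpy as np

def search(corpus, query, mask=None, metric="dot", k=10):
    """Exact top-k with an optional row filter and a pluggable metric."""
    if metric == "dot":
        scores = corpus @ query
    elif metric == "cosine":
        norms = np.linalg.norm(corpus, axis=1) * np.linalg.norm(query)
        scores = (corpus @ query) / (norms + 1e-9)
    elif metric == "l2":
        scores = -np.linalg.norm(corpus - query, axis=1)  # negate: larger = closer
    else:
        raise ValueError(f"unknown metric: {metric}")
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # pre-filter, not post-filter
    top = np.argpartition(scores, -k)[-k:]
    return top[np.argsort(scores[top])[::-1]]
```

With an HNSW-style index you'd instead be over-fetching and post-filtering candidates, and changing the metric means rebuilding the index.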
Ah, I see the article does mention "brute-force-like" — I must have skimmed past that. I'd be curious what exactly is meant by it in practice.
A small note: since the project seems to include @maratyszcza's fp16 library (MIT), it might be nice to add a line of attribution: https://github.com/maratyszcza/fp16
And if you ever feel like broadening the benchmarks, it could be interesting to compare with USearch. It has had an SQLite connector for a while and covers a wide set of hardware backends and similarity metrics, though it's not yet as straightforward to use on some OSes: https://unum-cloud.github.io/usearch/sqlite
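In case it helps anyone evaluating it, the core Python API looks roughly like this (writing from memory, so double-check against the docs; the SQLite extension itself is covered at the link above):

```python
import numpy as np
from usearch.index import Index

index = Index(ndim=256, metric="cos")    # other metrics: ip, l2sq, hamming, ...
keys = np.arange(1_000)
vectors = np.random.rand(1_000, 256).astype(np.float32)

index.add(keys, vectors)                 # batch insert under integer keys
matches = index.search(vectors[0], 10)   # top-10 approximate neighbors
print(matches.keys, matches.distances)
```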
To be clear, I'm not the author of the post. But I do maintain a library for folks working with large audio datasets, built on a combination of SQLite and usearch. :)
What library is that? My current project is working with voice recordings. My personal collection of voice recordings spans 20 years and measures in the high tens of GiB.
It's geared towards bioacoustics in particular. It's pretty easy to provide a wrapper for any given model though. Feel free to send a ping if you try it out for speech; will be happy to hear how it goes.
Interesting. Audio search isn't a problem I've thought about addressing, as I'll have transcriptions anyway. But knowing this exists might inspire some additional features or use cases that I haven't thought of yet. Thank you.
Yep, makes sense - converting to text and then aligning the text with the audio is a very reasonable way to handle large volumes of speech data. In bioacoustics we tend to have a huge amount of variation for which there is no real notation, often coming from regions where we haven't seen much training data, or from taxa that few scientists study (e.g., insects). So working with the raw audio embeddings tends to be best.