
Even better, _ALL USEFUL_ AI retrieval systems are insecure by design, because all those RAG vectors that sell vector databases? That's basically your documents lossily encoded.


>That's basically your documents lossily encoded.

Vector embeddings are lossy encodings of documents roughly in the same way a SHA256 hash is a lossy encoding. It's virtually impossible to reverse the embedding vector to recover the original document.

Note: when vectors are combined with other components for search and retrieval, it's trivial to end up with a horribly insecure system, but just vector embeddings are useful by themselves and you said "all useful AI retrieval systems are insecure by design", so I felt it necessary to disagree with that part.


> Vector embeddings are lossy encodings of documents roughly in the same way a SHA256 hash is a lossy encoding.

Incorrect. With a hash, I need to have the identical input to know whether it matches. If I'm one bit off, I get no information. Vector embeddings by design will react differently for similar inputs, so if you can reproduce the embedding algorithm then you can know how close you are to the input. It's like a combination lock that tells you how many numbers match so far (and for ones that don't, how close they are).
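A minimal sketch of the combination-lock point, under loud assumptions: `toy_embed` is a made-up stand-in (character-trigram counts) and not a real embedding model, and the three sample strings are invented for illustration. It only shows the qualitative difference: a near-miss input scores close to the original under the embedding, while the SHA-256 digests of the same two strings agree on roughly half their bits, which is what chance would give you.

```python
import hashlib
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: character-trigram counts."""
    padded = f"  {text}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hash_bit_agreement(x, y):
    """Fraction of the 256 digest bits on which SHA-256(x) and SHA-256(y) agree."""
    dx = hashlib.sha256(x.encode()).digest()
    dy = hashlib.sha256(y.encode()).digest()
    differing = sum(bin(p ^ q).count("1") for p, q in zip(dx, dy))
    return 1 - differing / 256


# Invented sample inputs: one near miss (a single character off), one unrelated.
secret = "the quarterly revenue was 4.2 million dollars"
near = "the quarterly revenue was 4.3 million dollars"
far = "my cat enjoys sleeping on the warm windowsill"

near_sim = cosine(toy_embed(secret), toy_embed(near))
far_sim = cosine(toy_embed(secret), toy_embed(far))
near_hash = hash_bit_agreement(secret, near)

print(f"embedding similarity, near miss: {near_sim:.3f}")
print(f"embedding similarity, unrelated: {far_sim:.3f}")
print(f"SHA-256 bit agreement, near miss: {near_hash:.3f}")
```

The embedding scores tell an attacker which guess is warmer; the hash agreement tells them nothing.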

> It's virtually impossible to reverse the embedding vector to recover the original document.

If you can reproduce the embedding process, it is very possible (with a hot/cold type of search: "you're getting warmer!"). But also, you no longer even need to recover the exact original. You can recover something close enough (and spend more time to make it incrementally closer).
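Here's a rough sketch of the hot/cold search, again with assumptions flagged: `toy_embed` is a hypothetical trigram-count stand-in for the real embedding model, the attacker is assumed to know the text length and alphabet, and `hot_cold_search` is just plain greedy hill climbing, not a technique from any particular paper. Attacks on real models are more involved, but the feedback loop is the same idea: change one thing, keep it if the similarity goes up.

```python
import math
import string
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: character-trigram counts."""
    padded = f"  {text}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hot_cold_search(target_vec, length, alphabet=string.ascii_lowercase + " ", passes=5):
    """Greedy search: change one character at a time, keep any change that gets warmer."""
    guess = ["a"] * length
    best = cosine(toy_embed("".join(guess)), target_vec)
    for _ in range(passes):
        improved = False
        for pos in range(length):
            for ch in alphabet:
                candidate = guess[:pos] + [ch] + guess[pos + 1:]
                score = cosine(toy_embed("".join(candidate)), target_vec)
                if score > best:  # strictly warmer, so the score never decreases
                    guess, best = candidate, score
                    improved = True
        if not improved:
            break
    return "".join(guess), best


# The attacker sees only the stored vector, never the text itself.
secret = "meet at noon"
leaked_vec = toy_embed(secret)

start_score = cosine(toy_embed("a" * len(secret)), leaked_vec)
recovered, score = hot_cold_search(leaked_vec, len(secret))
print(f"start similarity: {start_score:.3f}")
print(f"recovered guess:  {recovered!r} (similarity {score:.3f})")
```

Greedy search like this can stall short of the exact original, which fits the point above: you get something close, and more compute or a smarter search gets you closer.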


I wouldn't say those two are equivalent. A cryptographic hash requires the exact full document to be available to "recover it" from the hash. With a vector embedding you can extract information related to the document from the embedding alone, as long as you know (or can guess) what embedding model was used. You won't be able to reconstruct the document, but you will be able to infer some meaning from the vector alone.


Yes, there have been multiple papers showing information extraction from embedding vectors if you know the model used. SHA by design maps similar strings pseudo-randomly. Embeddings by design map similar strings similarly.



