Great project! Love the library+db approach. Some questions:
1. How much work is it to add bindings for new languages?
2. I know you provide conductor as a service. What are my options for workflow recovery if I don't have outbound network access?
3. Considering this came out of https://dbos-project.github.io/, do you guys have plans beyond durable workflows?
Shameless plug: for folks who don't want to take on model selection, on-demand scaling of model serving, and scaling the vector database to the search-set size and query throughput, we built a service that hides all of this behind a simple API [1]. The example in [1] is for images, but here is a quick-start for text [2].
The easiest and likely most effective method may be to compute vector embeddings using a sentence transformer model, then find nearest neighbors among these vectors for all articles in the set. The distance between the nearest vectors gives you a degree of similarity between the articles. You'll need to tune thresholds on these distances to distinguish near copies from different articles on the same story. There are efficient methods for finding approximate nearest neighbors in a large set of vectors, available as both OSS and SaaS — Faiss [1], ScaNN [2], and Pinecone [3] are some examples.
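A minimal sketch of the nearest-neighbor-plus-threshold step. Random unit vectors stand in for sentence-transformer output, and the brute-force cosine matrix stands in for Faiss/ScaNN, which you'd want at scale; the 0.9 threshold is an illustrative value you'd tune on labeled pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for sentence-transformer embeddings: 4 random "articles" plus a
# near-copy of the first one (small perturbation simulates a minor edit).
base = rng.normal(size=(4, 384))
near_copy = base[0] + 0.01 * rng.normal(size=384)
emb = np.vstack([base, near_copy]).astype("float32")
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize

sim = emb @ emb.T                # cosine similarity on unit vectors
np.fill_diagonal(sim, -1.0)      # ignore self-matches
nearest = sim.argmax(axis=1)     # each article's closest neighbor
nearest_sim = sim.max(axis=1)

DUP_THRESHOLD = 0.9              # tune on labeled near-copy / distinct pairs
dupes = {(min(i, int(j)), max(i, int(j)))
         for i, j in enumerate(nearest) if nearest_sim[i] > DUP_THRESHOLD}
# dupes -> {(0, 4)}: the near-copy pairs with its source article
```

The unrelated random vectors sit near cosine 0, so only the deliberate near-copy crosses the threshold.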
This is one of the methods mentioned in the article. I don't have implementation experience with the other string distance measures in the article (under "normalized string" in the table), except for Q-grams. Compared to the above method, Q-grams don't scale as well and are not as robust, because they don't encapsulate an understanding of the semantics of the text.
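For reference, a Q-gram comparison is simple to sketch — Jaccard similarity over character 3-gram sets. The padding and q=3 are arbitrary choices, and as noted above this only measures surface overlap, not meaning.

```python
def qgrams(s, q=3):
    """Set of character q-grams; padding lets edge characters contribute."""
    s = f"  {s.lower()} "
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_similarity(a, b, q=3):
    """Jaccard similarity of the two q-gram sets, in [0, 1]."""
    A, B = qgrams(a, q), qgrams(b, q)
    return len(A & B) / len(A | B)
```

Identical strings score 1.0 and strings with no shared trigrams score 0.0; semantically equivalent rewordings can score arbitrarily low, which is the robustness gap mentioned above.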
If you're looking for exact or near-exact duplicates, a transformer is probably overkill. It may be fine if you already have one available for inference in the database, but I suspect something as simple as fastText would do the job. A transformer would be more useful if you want to catch things like words replaced with synonyms from a thesaurus.
Wouldn't that find articles that are semantically similar, rather than structurally similar (which I interpreted GP as wanting)?
In the structural-comparison case, I imagine you might have better luck just doing cosine similarity across term-frequency vectors or some such, possibly applying a random projection first to reduce dimensionality.
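A rough sketch of that idea, using hashing-trick term-frequency vectors and a Gaussian random projection; the 4096/256 dimensions are arbitrary illustrative choices.

```python
import numpy as np

def tf_vector(text, dim=4096):
    """Hashing-trick term-frequency vector; purely lexical, no semantics."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

docs = ["the cat sat on the mat",
        "the cat sat on the mat today",
        "stocks rallied after the announcement"]
tf = np.array([tf_vector(d) for d in docs])

# Optional random projection (Johnson-Lindenstrauss style) to cut
# dimensionality before comparing; cosine similarity is roughly preserved.
rng = np.random.default_rng(0)
proj = rng.normal(size=(4096, 256)) / np.sqrt(256)
low = tf @ proj
```

Structurally near-identical articles (the first two docs) stay close in the projected space, while unrelated text does not — no model or training required.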
Ha, I agree that software engineering is too hard for ML engineers, and even for software engineers like myself who have been doing it for 20 years, like zcw100 said :).
Author of the blog post here. It was definitely written from my narrow viewpoint and experience. Our goal is to make more solutions accessible to software developers, and my instinct was the same as yours: a lot of ML can be within the realm of engineering (even for small / one-person teams), and there are accidental complexities standing in the way of wider use. Our solution (AutoML + SaaS) definitely doesn't work for every situation. I'm curious to hear more of your thoughts on how ML can be made more accessible to engineers (and vice versa).
Author of the blog post here - it's very cool to see this on HN!
I wrote this as someone who considers himself a half-decent software engineer trying to use ML for a side project and feeling frustrated by all the effort and "accidental complexity" involved. Why focus on software engineers and ML in this post/rant/company? Because "software is eating the world" and having ML be more accessible to software engineers will broaden the range of problems they can solve.
Thanks for all the comments - I acknowledge all/most of the criticisms as valid. A SaaS/AutoML solution won't work for everyone and definitely not for every problem, and it won't be the only answer to making ML more approachable.
There is a continuum of offerings in this space. Some give you lots of custom control over the training pipeline and deployment; on the other end, things like Roboflow try to make it easy and hide the complexity. We consider ourselves even further toward the "hide complexity" side, since we try several deep networks automatically rather than making you choose, re-train automatically, abstract away non-essential ML jargon, etc. In addition, we don't limit ourselves to vision — we'd like to be the one-stop shop for ML as a service. We also have developer-friendly pricing with quick and easy signup.
Thank you! We do think that "model export" is important, but we're still working out how to do it in the most seamless and non-ML-expert friendly way. Do you have a use-case and target hardware in mind?