Hacker News | saintarian's comments

Great project! Love the library+db approach. Some questions:

1. How much work is it to add bindings for new languages?

2. I know you provide Conductor as a service. What are my options for workflow recovery if I don't have outbound network access?

3. Considering this came out of https://dbos-project.github.io/, do you guys have plans beyond durable workflows?


1. We also have support for Python and TypeScript, with Java coming soon: https://github.com/dbos-inc

2. There are built-in APIs for managing workflow recovery, documented here: https://docs.dbos.dev/production/self-hosting/workflow-recov...

3. We'll see! :)


Elixir? Or does Oban hew close enough that it's not worth it?



Shameless plug: for folks who don't want to take on the work of model selection, on-demand scaling of model serving, and scaling the vector database with search-set size and query throughput, we built a service that hides all of this behind a simple API [1]. The example in [1] is for images, but [2] is a quick-start for text.

[1] https://www.nyckel.com/semantic-image-search

[2] https://www.nyckel.com/docs/text-search-quickstart


The easiest and likely most effective method may be to compute vector embeddings using a sentence-transformer model and find nearest neighbors among these vectors for all articles in the set. The distance between nearest vectors gives you a degree of similarity between the articles. You'll need to tune thresholds on these distances to distinguish near copies from different articles about the same story. There are efficient methods for finding approximate nearest neighbors among a large set of vectors, available as both OSS and SaaS. Faiss [1], ScaNN [2], and Pinecone [3] are some examples.

This is one of the methods mentioned in the article. I don't have implementation experience with the other string distance measures in the article (under "normalized string" in the table), except for Q-grams. Compared to the above method, Q-grams don't scale as well and are not as robust, because they don't capture the semantics of the text.

[1] github.com/facebookresearch/faiss

[2] github.com/google-research/google-research/tree/master/scann

[3] www.pinecone.io
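To make the thresholding step concrete, here is a minimal pure-Python sketch. It assumes the embeddings have already been computed (e.g. with a sentence-transformer model); the toy 3-dimensional vectors and the 0.95 threshold below are purely illustrative, since real embeddings have hundreds of dimensions and thresholds must be tuned on your own data.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def near_duplicates(embeddings, threshold=0.95):
    """Brute-force all-pairs scan; swap in Faiss/ScaNN/Pinecone for large sets."""
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            sim = cosine(embeddings[i], embeddings[j])
            if sim >= threshold:
                pairs.append((i, j, sim))
    return pairs

# Toy stand-ins for sentence-transformer embeddings.
embeddings = [
    [0.90, 0.10, 0.00],  # article 0
    [0.89, 0.12, 0.01],  # article 1: near copy of article 0
    [0.00, 0.20, 0.95],  # article 2: a different story
]
print(near_duplicates(embeddings))  # only the (0, 1) pair crosses the threshold
```

The O(n^2) loop is only there to show the idea; the ANN libraries above replace it with sub-linear index lookups.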


If you're looking for exact or near-exact duplicates, a transformer seems like overkill. Maybe it's not bad if you already have one you can use for inference in the database, but I suspect something as simple as fastText would do the job. A transformer would be more useful if you want to catch things like words replaced with synonyms out of a thesaurus.


Wouldn't that find articles that are semantically similar, rather than structurally similar (which I interpreted GP as wanting)?

In the structural-comparison case, I imagine you might have better luck with just doing cosine similarity across the term frequency vector or some such, possibly doing random projection first to reduce dimensionality.

Or really, an LSH would do the trick.
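A minimal sketch of the term-frequency route, using naive whitespace tokenization (a real pipeline would normalize punctuation, and would add random projection or an LSH once the set gets large; the example texts are made up):

```python
import math
from collections import Counter

def tf_vector(text):
    # Bag-of-words term frequencies; a real pipeline would strip punctuation.
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

original = "the quick brown fox jumps over the lazy dog"
near_copy = "the quick brown fox jumped over the lazy dog"
unrelated = "stock markets rallied sharply on friday"

print(cosine(tf_vector(original), tf_vector(near_copy)))  # high: near copy
print(cosine(tf_vector(original), tf_vector(unrelated)))  # 0.0: no shared terms
```

Unlike embeddings, this scores structural overlap: a lightly edited copy scores high, while a rewrite of the same story in different words scores low.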


Ha, I agree that software engineering is too hard for ML engineers, and even for software engineers like myself who, as zcw100 said, have been doing it for 20 years :).

Author of the blog post here. It was definitely written from my narrow viewpoint and experience. Our goal is to make more solutions accessible to software developers, and our instinct was the same as yours: a lot of ML can be within the realm of engineering (even for small / one-person teams), and there are accidental complexities standing in the way of wider use. Our solution (AutoML + SaaS) definitely doesn't work for every situation. I'm curious to hear more of your thoughts on how ML can be made more accessible to Eng (and vice versa).


Author of the blog post here - it's very cool to see this on HN!

I wrote this as someone who considers himself a half-decent software engineer trying to use ML for a side project and feeling frustrated by all the effort and "accidental complexity" involved. Why focus on software engineers and ML in this post/rant/company? Because "software is eating the world" and having ML be more accessible to software engineers will broaden the range of problems they can solve.

Thanks for all the comments - I acknowledge all/most of the criticisms as valid. A SaaS/AutoML solution won't work for everyone and definitely not for every problem, and it won't be the only answer to making ML more approachable.


Thanks for the input - that is useful to know.


Thank you! Everything just clicked when we saw that XKCD strip.

Yes, you are right: the "includes X invocations" quotas are per month.


There is a continuum of offerings in this space. Some give you lots of custom control over the training pipeline and deployment; on the other side are things like RoboFlow that try to make it easy and hide the complexity. We consider ourselves even further toward the "hide complexity" side, since we try several deep networks automatically rather than making you choose, re-train automatically, abstract away non-essential ML jargon, etc. In addition, we don't limit ourselves to vision: we'd like to be the one-stop shop for ML as a service. We also have developer-friendly pricing with quick and easy signup.

We benchmarked ourselves against Google AutoML and HuggingFace, looking at both user experience and model performance, and wrote it up in a blog post that may interest you: https://www.nyckel.com/blog/automl-benchmark-nyckel-google-h...


Thank you! We do think that "model export" is important, but we're still working out how to do it in the most seamless and non-ML-expert friendly way. Do you have a use-case and target hardware in mind?

