
We built something probably very similar to this product at my previous job. We had double-digit TB of daily ML traffic that didn't require realtime latency, so we moved all of that onto S3 and also saw roughly 90% cost savings. It was built on the JVM and still used a 6-broker Kafka cluster to keep metadata (vs. probably 300 brokers when it was originally all on Kafka). Kafka's coupled compute and storage model doesn't scale well for extreme use cases that can tolerate latency, and the Apache Pulsar model worked somewhat better for us (though at that time Pulsar wasn't stable enough to use in prod). One of the keys to cost efficiency was that the data volume was large enough that we didn't need to wait long before hitting an economic file size to upload. I'm trying to imagine how a pipeline with less than 10 MB/s would work with this efficiently.
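
Roughly, the write path looked something like the sketch below. This is illustrative, not our actual code: buffer records until a hypothetical "economic" object size, upload one file to S3, and push only the S3 key through a small Kafka metadata topic. The threshold, bucket, and topic names are all made up.

    // Minimal sketch of the batch-to-S3 write path (assumptions: AWS SDK v2,
    // a 64 MB threshold picked for illustration, made-up bucket/topic names).
    import java.io.ByteArrayOutputStream;
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;

    public class S3BatchWriter {
        // Hypothetical threshold: large enough that per-object S3 costs amortize.
        private static final long ECONOMIC_SIZE_BYTES = 64 * 1024 * 1024;

        private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        private final S3Client s3 = S3Client.create();
        private final KafkaProducer<String, String> metadataProducer;

        public S3BatchWriter(Properties kafkaProps) {
            this.metadataProducer = new KafkaProducer<>(kafkaProps);
        }

        public synchronized void append(byte[] record) {
            buffer.writeBytes(record);
            // Flush only once the accumulated batch is worth an S3 PUT.
            if (buffer.size() >= ECONOMIC_SIZE_BYTES) {
                flush();
            }
        }

        private void flush() {
            // Made-up key scheme; real systems would encode topic/partition/epoch.
            String key = "ingest/" + System.currentTimeMillis() + ".batch";
            s3.putObject(
                    PutObjectRequest.builder().bucket("my-ml-traffic").key(key).build(),
                    RequestBody.fromBytes(buffer.toByteArray()));
            // Only the pointer goes through Kafka, which is why the metadata
            // cluster can stay tiny relative to the old 300-broker setup.
            metadataProducer.send(new ProducerRecord<>("batch-pointers", key));
            buffer.reset();
        }
    }

At 10 MB/s a 64 MB threshold already means several seconds of buffering per stream, which is why low-throughput pipelines are harder to make efficient this way.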


(WarpStream co-founder)

Yeah we’ve run into a number of people who’ve rolled their own solution in this space. The “push pointers to S3 through traditional Kafka” approach is a very practical one.
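
The read path of that pattern is the mirror image: consumers pull keys from the metadata topic and fetch the payloads straight from object storage. A rough illustrative sketch, with the topic and bucket names made up to match the write-side example above:

    // Sketch of the "pointers through Kafka" read path (AWS SDK v2 assumed).
    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.GetObjectRequest;

    public class PointerConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "batch-readers");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            S3Client s3 = S3Client.create();
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("batch-pointers"));
                while (true) {
                    ConsumerRecords<String, String> pointers = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> pointer : pointers) {
                        // The record value is just an S3 key; the payload
                        // itself never touched Kafka.
                        byte[] batch = s3.getObjectAsBytes(
                                GetObjectRequest.builder()
                                        .bucket("my-ml-traffic")
                                        .key(pointer.value())
                                        .build())
                                .asByteArray();
                        // ... deserialize and process the batch ...
                    }
                }
            }
        }
    }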

Was this MemQ at Pinterest, or something else?


Yeah it is; I'm sort of glad it's actually known. I've since left the ingestion side of data infra, so I'm not very familiar with the landscape after a year away.

