
Who from HN would need this and why?

Serious question. I would like to know different real use cases from people on HN, given our backgrounds.



We actually use MemSQL for almost exactly this use case: realtime arbitrary grouping and filtering on multiple (30+) columns over a billion+ rows, with substantial concurrency (reporting tools and the APIs behind them). Previously, with pgsql, this required a lot of fine-tuning of indexes and lots of ETL scripting to break things into separate dimension tables, then large join queries to pull them back together. That was expensive at both the development and operational level, and data was batched in, which introduced delays. It was also extremely resource-intensive to query: moderate levels of concurrency required a master and several slaves, and response times for anything more than a week of data ran multiple seconds, with the worst cases approaching the minute mark.
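The kind of query this buys us, roughly (hypothetical column names, not our actual schema):

    -- Hypothetical wide events table; any subset of the 30+ columns
    -- can be filtered and grouped ad hoc, no dimension tables needed.
    SELECT country, device_type, campaign_id,
           COUNT(*) AS events,
           SUM(revenue) AS revenue
    FROM events
    WHERE event_time >= NOW() - INTERVAL 7 DAY
      AND platform = 'ios'
    GROUP BY country, device_type, campaign_id
    ORDER BY revenue DESC
    LIMIT 100;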

The hardware in this example is overkill and impractical for most use cases, to say the least. For our setup, MemSQL does this on a single machine with 256GB of RAM and 40 cores (running 1 aggregator and 4 leaf nodes), plus a modest enterprise NVMe SSD. The machine cost $4,500 over a year ago. Adding more machines is pretty trivial should we ever need to partition across them, though that hasn't been necessary.

There are some gotchas, and it should not be considered a drop-in replacement for MySQL.


So are you still shipping data between your primary datastore and MemSQL, or have you switched entirely to MemSQL?


Originally it was just a port of the data, but now the inserts go straight into MemSQL. This used to be a big no-no on MySQL (with InnoDB and MyISAM, anyway), since every insert would invalidate much of the query cache. Here you can refresh the query every second and see the counts go up.
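Roughly (again with made-up column names):

    -- Writes go straight into the rowstore...
    INSERT INTO events (event_time, platform, country, revenue)
    VALUES (NOW(), 'ios', 'US', 0.02);

    -- ...and are visible immediately; re-run this once a second
    -- and watch the count climb.
    SELECT COUNT(*) FROM events WHERE event_time >= CURDATE();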


Pretty much anyone working in ad tech.

Not necessarily trillions at a time, but even small ad tech firms deal with billions of new data points across many dimensions every day.


Ad Tech has no need for realtime data viewing or aggregation (in this manner), even for platform log data. Offline parallel processing is the standard. Redshift is particularly efficient, while others use Spark or other ad-hoc solutions.

For users, you always want mediation/adjustment steps that (can) modify realtime data to provide timesliced totals. For developers/administrators, you want to be able to persist data. Running totals in memory are too fragile to be reliable. There is an assumption of errors, misconfigurations, and bad actors at all times in AdTech.


Has no need? Did you just make this up?

We used MemSQL for real-time data for 2 years. All data is fully persistent, but rowstore tables are also held fully in memory, whereas columnstore tables live mainly on disk. There's nothing fragile about it. SQL Server's Hekaton, SAP's HANA, Oracle's TimesTen, and several other databases do the same.
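A rough sketch of the distinction in MemSQL DDL (hypothetical schema):

    -- Rowstore (the default): the table is held fully in memory,
    -- persisted to disk via transaction logs and snapshots.
    CREATE TABLE clicks_live (
      campaign_id BIGINT,
      event_time DATETIME,
      clicks BIGINT,
      PRIMARY KEY (campaign_id, event_time)
    );

    -- Columnstore: data lives mainly on disk in compressed
    -- column segments.
    CREATE TABLE clicks_history (
      campaign_id BIGINT,
      event_time DATETIME,
      clicks BIGINT,
      KEY (event_time) USING CLUSTERED COLUMNSTORE
    );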

Timesliced totals are just a SQL query, and whether to put mediation or some other buffer between live numbers and customers is up to each business to decide, not some default proclamation for an entire industry.
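For example, against the hypothetical table above:

    -- Hourly timeslices straight off the live table;
    -- nothing special about it.
    SELECT DATE_FORMAT(event_time, '%Y-%m-%d %H:00') AS hour_slice,
           SUM(clicks) AS clicks
    FROM clicks_live
    GROUP BY 1
    ORDER BY 1;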


Actually, RTB systems do need to do processing quickly: RTB stands for Real-Time Bidding, and bids are rejected after 250 ms.


Real-time analytics without the need for pre-aggregation or periodic ETL? To turn your question around: which system does not want that? MemSQL and similar offerings (preferably on more standard hardware) are definitely interesting.



