Breaking the trillion-rows-per-second barrier with MemSQL

JVerstry · on March 19, 2018

I have doubts about the "We’re actually spending some time on every row" claim. Saving data using columnstore often comes with meta-data saved at the page level, such as min, max and count values for the data in that page. These values are used to filter and optimize the flow of data processed for a query (I mean 'skipped' here). If you run a 'count' query 10 times, it's very unlikely the DB will count rows 10 times. It will rely on the page's existing meta-data when available (i.e., already computed). The tests described in the post are misleading IMHO.

EDIT: This comes on top of the fact that DBs can store queries results too. Moreover the post does not tell whether they have implemented clustered or filtered indexes on the considered columns. It does not explain how partition has been performed too. All this has a big impact on execution time.

nikita · on March 19, 2018

The query in the example is not “count” but a count + group by query. While it is possible to precompute group by results on every possible columns per page we don’t do that and this query does touch every row. The article touches on how this is possible: operations on encoded data, AVX2, not using a hash table for the intermediate result of the group by. And we certainly don’t fake it with storing query results.

We guarantee that the result is legit.

HenryR · on March 20, 2018

What does the columnar format look like? Particularly, is the group by column compressed with RLE? That’s kind of a pre-computed group-by + count that would make this kind of query very very fast :)

nikita · on March 22, 2018

This is an astute observation. Here is the accurate answer from the author of the feature: “We do have special handling for RLE for filters and group by column but in both cases there would be per row component (updating selection vector for filter on RLE, updating aggregates for group by on RLE) and so far I saw RLE encoding usually slower than dictionary encoding for both these cases”

nikita · on March 19, 2018

There is a clustered columnstore index on the table. MemSQL doesn’t support “filtered” indexes (indexes that let you specify a where clause that can be matched in a query with an equal or stricter where clause). Since the query needs a full scan those don’t help anyway.

We partition data on stock symbol but this doesn’t make a difference for this query either. It would if we had filters.

JVerstry · on March 19, 2018

I wouldn't be so sure. If data is partitioned on stock symbols, then there is already a grouping by subset happening at the partition level, which may considerably ease the job of aggregation (it depends on the number of symbols and it seems you have about 1500-1600). Considering there is a clustered index, it means values are saved "in order". Then, a page could easily have a min value = max value situation, and the saved count in meta-data would be valid (no need to scan all values in the page again). Should the index use a dictionary, then one only has to count the number of items in the corresponding entry without scanning the rows. Such information is stored in the index, not in the rows. I have no doubt about one's intention. It's just that I think we could do better to support the claim. If your where clause contained something like (shares mod 3) = 0, then I would be pretty sure all rows would be scanned, because such information is not aggregated at the page level. If possible, I would also check the execution plan for any incongruent values.

nikita · on March 19, 2018

That’s right: partitioning on stock symbol allows to push down LIMIT 10 to each partition, however in this case with 1500 stock symbols it doesn’t buy us much. It’s actually possible to compute a full group by (without a limit) by on every partition and merge them at the end. Merging 1500 groups is computationally trivial.

Yes, shared mod 3 or other predicate would make it impossilbe to run this query in O(metadata). It would of course burn more instructions so we would have to have a bigger cluster to hit a trillion a second as well as have a complex explanation in the blog post why this matters.

nmstoker · on March 19, 2018

Impressive although reading that they used 448 cores on super-expensive Intel lab machines takes the edge off a little

tgtweak · on March 19, 2018

We use it on inexpensive second hand R810s with 256G of ram and 4x 10-core xeons of yesteryear. These can be found reasonably cheap: https://m.ebay.ca/itm/Dell-Poweredge-R810-II-4x-Ten-Core-Xeo...

Edit: while these cpus don't have avx/avx2, they still give good results and don't cost $40,000 per server. The best value system with avx2 would be a dual e5-2650v3 such as this one: https://m.ebay.ca/itm/DELL-POWEREDGE-R430-SERVER-2-X-E5-2650...

lazyjones · on March 19, 2018

What's the licensing cost for MemSQL per server for this?

tgtweak · on March 19, 2018

If you go with Enterprise edition (we use community) it used to be in the 4-digits for a support contract and some operational niceties (streaming native backup and restore).

cbg0 · on March 19, 2018

They no longer seem to have a community offering, only paid. Also, having asked at one point, they say their pricing starts at $25K/year.

qeternity · on March 19, 2018

They renamed the CE to "Dev Edition" and I think now commercial use is prohibited.

AndyNemmity · on March 19, 2018

They used to try and confuse the issue to make it seem like they were open source, now they don't even have a free to use option?

"Start Using MemSQL Community Edition Now Unlimited scale and capacity. Free forever.".... Free forever right? Now it doesn't exist.

AmVess · on March 19, 2018

1,280,625,752,550 rows per second is pretty impressive even if it was done on a million dollar's worth of CPU.

mnw21cam · on March 19, 2018

So, for anyone who thought the headline was a little bit excessive, it looks like this article is using short-scale trillion and billion.

nayuki · on March 19, 2018

Who uses the long scale in English?

mnw21cam · on March 19, 2018

Um. England? (Except the BBC and parliament, and I have no idea why.)

StreamBright · on March 19, 2018

And do you have a business use case that requires that?

AndyNemmity · on March 19, 2018

I've worked on a ton of businesses cases that require it. This is what I worked on for years.

It's generally a large company, large issue sort of problem. You can throw millions in hardware/software into a problem that saves you 100's of millions.

nikita · on March 20, 2018

We can solve it quite efficiently. You can define an index on a columnstore table and we will maintain order on a per node basis

rplnt · on March 19, 2018

Analytics, BI, ...

More about rows/second it's about lowering minutes/operation.

manigandham · on March 19, 2018

Some people do... but this is about showing the potential.

ddorian43 · on March 19, 2018

So some questions::

1. Isn't (3)vectorization and (4)SIMD the same thing ?

2. I don't see the data-size before-after compression ?

3. How much RAM has each server ?

4. How do all cores work for all queries ? Is the data sharded by core on each machine or each core can work on whatever data ?

5. What's a comparison open-source tool to this ? Only I can think about is snappydata.

plamb · on March 19, 2018

SnappyData employee here. In general this is called the "HTAP" industry (Gartner's phrase: Hybrid Transactional/Analytical Processing).

SnappyData: https://www.snappydata.io, MemSQL: https://www.memsql.com/, Splice Machine: https://www.splicemachine.com/, SAP Hana: https://www.sap.com/products/hana.html, GridGain: https://www.gridgain.com/

are some of the technologies within it

ryanworl · on March 19, 2018

For 5, check out Clickhouse. It isn’t identical, but scanning a trillion rows a second is just a matter of sharding the data into enough nodes.

https://clickhouse.yandex

dataflow · on March 19, 2018

(1) I don't know if this is what they're thinking, but not all single instructions that operate on "multiple data" are vector operations... e.g. I wouldn't call summing all items in an __m128 to be a vectorized operation. Similarly I suppose not all vectorized operations are SIMD, since they might just use normal (SISD) instructions on entire arrays. But again, I'm not sure this is really what they mean; it wouldn't seem to be a very useful distinction to make.

danbruc · on March 19, 2018

Meh. They used 448 cores to count the frequency of bit patterns of some small length in a probably more or less continuous block of memory. They had 57,756,221,440 total rows, that are 128,920,138 rows per core. If the data set contained 256 or less different stock symbols, then the task boils down to finding the byte histogram of a 123 MiB block of memory. My several years old laptop does this with the most straight forward C# implementation in 170 ms. That is less than a factor of 4 away from their 45.1 ms and given that AVX-512 can probably process 64 bytes at a time, we should have quite a bit room to spare for all the other steps involved in processing the query.

Don't get me wrong, in some sense it is really impressive that we reached that level of processing power and that this database engine can optimize that query down to counting bytes and generating highly performant code to do so, but as an indicator that this database can process trillions of rows per second it is just a publicity stunt. Sure, it can do it with this setup and this query, but don't be to surprised if you don't get anywhere near that with other queries.

defen · on March 19, 2018

> My several years old laptop does this with the most straight forward C# implementation in 170 ms.

Sure, but writing a custom C# program for each query on data that has been pre-formatted in the most optimal manner for that program is not really comparable to writing a bog standard SQL query.

danbruc · on March 19, 2018

I am not sure I understand what you want to say but I assume you want to say that what I did was kind of cheating because I wrote code tailored to that problem and to work on data in a format that makes that problem rather easy? Correct me if I am wrong.

But they did exactly the same just that the query planer, optimizer, and compiler generated the code from the SQL query. They still picked a data set, a physical layout of that data set, and a query that would result in maximum throughput. That was not any random SQL query, it was a carefully picked one, chosen because it would be able to take advantage of all possible optimizations.

macspoofing · on March 19, 2018

Meh? Really?

jnordwick · on March 19, 2018

And of course, how does it compare to kdb? So it seems less expensive, but also lacks the advanced query language.

The last tests I saw for kdb was the 1.1 billion taxi ride.

http://tech.marksblogg.com/billion-nyc-taxi-kdb.html

Where it basically outperformed every other CPU based system with slightly more complex queries.

Any comparisons planned?

manigandham · on March 19, 2018

It'll be faster than Redshift and about the same as ClickHouse, +/- depending on hardware and setup.

It's a great system, we used it for 2 years and it's one of the most polished databases out there with a simple MySQL interface. It's more general purpose than kdb, with a nice rowstore + columnstore architecture. I believe they're adding full-text search indexes in the latest version too.

If you need the query language, the advanced/asof joins, or the tightly integrated query/process environment, then there's no match to kdb though.

thinkMOAR · on March 19, 2018

"When you deliver response time that drops down to about a quarter of a second, results seem to be instantaneous to users."

I don't think everybody agrees with this statement.

WorkLifeBalance · on March 19, 2018

The threshold for most people is slightly lower than a quarter second, but it's certainly the case that above a quarter-second users will notice the delay[0].

However, people may be primed for longer delays still seeming instant. e.g. Smartphones for many years had a built-in 300ms delay on any click event, and even without that event typically still have delays on many 'instant' actions.

So while the delay will be registered as being present, it may not be registered as "this site is slow" but "it's just a natural delay".

[0] https://psychology.stackexchange.com/questions/1664/what-is-...

spyder · on March 19, 2018

The threshold for noticeable touch "rubberbanding" is 1 ms:

https://www.youtube.com/watch?time_continue=52&v=vOvQCPLkPt4

mickronome · on March 19, 2018

I completely agree, a quarter of a second is absolutely not instantaneous.

Especially not to oddballs like me that has some kind of perceptive 'bug', it's like I lack a 'motion filter'. One effect being that movement in computer games doesn't feel completely fluid until the refresh rate is close to 200Hz.

Before internet and slow loading web pages might have lowered peoples expectations, I remember Human-computer interaction guidelines stating that < 0.1 was experienced as instantaneous in almost all cases. Above 1s without feedback started to cause measurable stress in test subjects.

Since this was before computers were ubiquitous, it's probably a good measure for how we react on a more basal, subconscious level. Anything measured today is likely to be include learned expectations, so a quarter of a second seems like reasonable learned expectation of the perception of instantaneous in that particular context.

manigandham · on March 19, 2018

This isn't about raw physics and biology but rather the use of business applications by users, specifically with regards to analytics and exploratory queries.

Anything less than a second to run a query and return results is considered "instant" (also sometimes "interactive") for analysis.

thinkMOAR · on March 19, 2018

Since the example data is stock data, i think it is safe to assume people don't want to be 249.9ms late with their purchase :)

manigandham · on March 19, 2018

What? No human trades that quickly, and this is not about automated HFT trading systems but rather running database queries and what typical business users expect as "instant" results.

rplnt · on March 19, 2018

There's context missing. But I don't think anyone would imagine an interactive application (say, a game) when the topic is some data manipulation.

n0tme · on March 19, 2018

If all this data just fits in memory then what is surprising about the speed?

bufferoverflow · on March 19, 2018

Well, you can try and run MySQL or Postgre off a RAM drive, but I suspect you won't get close to this result. And then your data is not on a reliable storage media.

n0tme · on March 19, 2018

Can't talk about MySQL or PostgreSQL, but with Oracle I am pretty sure you can get very much close to this result with cached data. And this is without using the in-memory option. However, talking about it withouth any experiments to prove a point is just speculation. Memory vs Disks changes a lot performance-wise.

exikyut · on March 19, 2018

It was expressly forbidden in the software contract to benchmark Oracle back in the 80s.

https://danluu.com/anon-benchmark/

captain_perl · on March 21, 2018

It's forbidden to publish the benchmarks, not to make them. And it's because most users are rubes who can't tune 2 databases at the same time.

olavgg · on March 19, 2018

With the ClickHouse OLAP db I'm limited by memory bandwidth.

neildahlke · on March 19, 2018

I believe you're conflating the name with the actual contents of the post. The demo uses the MemSQL columnstore, which is disk-based and leverages memory only for indexes, metadata, query processing etc.

paulsutter · on March 19, 2018

How fast is data import? Loading into RAM? (for example booting up a cluster for an existing imported database on AWS)

Working with some datasets with 100s of billions of short rows, curious to give it a try.

nikita · on March 19, 2018

We have seen 1Gb/s streaming ingest into MemSQL. Also feel free to join https://memsql-public.slack.com where people help with developer edition.

paulsutter · on March 19, 2018

With how many nodes? Does it scale linearly with nodes? (1Gb/s by itself doesn't help me estimate the scale of the project), Thanks!

nikita · on March 19, 2018

Data ingest scales linearly assuming the source is scalable (S3, HDFS, or Kafka). There are other things that matter: how wide is the table, what data types, etc. We achieve 1GB/s on a 16 nodes cluster for some combination of the above. What is your target?

tabtab · on March 19, 2018

The speed of light is a hard limit. I don't believe there is any free lunch[1], but trade-offs to manage. I'm skeptical of any claim that implies free or easy speed without potentially significant trade-offs.

If you can live with somewhat out-of-date and/or out-of-sync data, you can throw mass parallelism at big read-only queries to get speed. The trade-offs often are best tuned from a domain perspective such that it's not really a technology problem, although technology may make certain tunings/tradeoffs easier to manage.

[1] (Faster hardware may give us incremental improvements, but the speed of light probably prevents any tradeoff-free breakthroughs.)

heavenlyblue · on March 19, 2018

What are you on about? The author is counting rows in a database and checks how much time it takes the database to process the query. Counting rows is easily parallelisable.

So are many problems, and we still are at a point where we are creating languages and frameworks that allow us to work with the data at a higher level: e.g. theano, numpy or TensorFlow.

Speed of light is only an issue for latency-sensitive operations: e.g. if every sub-operation is non-commutative in a given transaction then it would be expected to be completed before any other sub-operation.

tabtab · on March 20, 2018

Re: Speed of light is only an issue for latency-sensitive operations

That was my point: if you don't care about data being "fresh" and consistent (per transaction coordination) you CAN throw mass parallelism at the problem. If you do care, then the speed of light is the bottleneck.

tabtab · on March 21, 2018

Clarification: "mass parallelism" may also involve distributed and duplicated data sets.

amelius · on March 19, 2018

How fast can it sort those rows?

nikita · on March 20, 2018

Can you share the whole use case as well as data size?

EGreg · on March 19, 2018

Who from HN would need this and why?

Serious question. I would like to know different real use cases from people on HN, given our backgrounds.

tgtweak · on March 19, 2018

We actually use memsql for almost exactly this use case. Realtime arbitrary grouping and filtering on multiple (30+) columns on a billion+ rows with substantial concurrency (reporting tools and api's of them). Previously with pgsql this required a lot of fine tuning of indexes and lots of etl scripting to break things into separate dimension tables, and then large join queries to pull them back. This is expensive at the development and operational level, and data was batched in which involves delays. It was also extremely resource intensive to query and moderate levels of concurrency required a master and several slaves, with response times for anything more than a week of data requiring multi-second waits, with the worse cases approaching the minute mark.

The hardware in this example is overkill and impractical for most use cases - to say the least. For our setup, Memsql does this for us on a single node with 256Gb of ram, 40 cores (1 aggregator and 4 leafs) and a modest enterprise nvme ssd. The machine cost $4,500 over a year ago. Adding more machines to mem is pretty trivial should we ever need to partition this across machines, despite this not being necessary.

There are some gotchas and it should not be consisted a drop-in replacement for MySQL.

Fiahil · on March 19, 2018

So, you're still shipping data between your primary datastore and memsql, or you've switched entirely on memsql ?

tgtweak · on March 19, 2018

Originally it was just a port, but now the inserts go straight into mem. This used to be a big no-no on mysql (with inno and myisam anyway) as it would invalidate a lot of query cache on every insert. Here you can refresh the query every second and see the counts go up.

mej10 · on March 19, 2018

Pretty much anyone working in ad tech.

Not necessarily trillions at a time, but even small ad tech firms deal with billions of new data points across many dimensions every day.

BeenAround · on March 19, 2018

Ad Tech has no need for realtime data viewing or aggregation (in this manner), even for platform log data. Offline parallel processing is the standard. Redshift is particularly efficient, while others use Spark or other ad-hoc solutions.

For users, you always want mediation/adjustment steps that (can) modify realtime data to provide timesliced totals. For developers/administrators, you want to be able to persist data. Running totals in memory are too fragile to be reliable. There is an assumption of errors, misconfigurations, and bad actors at all times in AdTech.

manigandham · on March 20, 2018

Has no need? Did you just make this up?

We used MemSQL for real-time data for 2 years. All data is fully persistent, but the rowstore tables are also fully held in memory compared to columnstores which are mainly on disk. There's nothing fragile about it. SQL Server's Hekaton, SAP's HANA, Oracle's Times Ten, and several other databases do the same.

Timesliced totals is just a SQL query, and mediation or some other buffer from live numbers for customers is up to every business to decide, not some default proclamation for an entire industry.

captain_perl · on March 21, 2018

Actually RTB's do need to do processing quickly - RTB stands for Real-Time Bidding, and bids are rejected after 250 ms.

stareatgoats · on March 19, 2018

Real time analytics without the need for aggregation or periodic ETL? To turn your question round: Which system does not want that? MemSQL or similar offerings (on preferably more standard hardware) is definitely interesting.