I have doubts about the "We’re actually spending some time on every row" claim. Storing data in a columnstore usually comes with metadata saved at the page level, such as min, max, and count values for the data in that page. These values are used to filter the data processed for a query, i.e. to skip whole pages.
If you run a 'count' query 10 times, it's very unlikely the DB will count rows 10 times. It will rely on the pages' existing metadata when it is available (i.e., already computed). The tests described in the post are misleading, IMHO.
EDIT: This comes on top of the fact that DBs can cache query results too. Moreover, the post does not say whether clustered or filtered indexes exist on the columns involved, nor does it explain how the data was partitioned. All of this has a big impact on execution time.
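To make the page-level metadata point concrete, here is a minimal C sketch of how a COUNT with a range filter can be answered mostly from per-page min/max/count metadata. The struct layout is hypothetical, not any particular database's actual format:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical per-page metadata a columnstore might keep. */
    typedef struct {
        int64_t min_val;
        int64_t max_val;
        int64_t row_count;    /* rows in this page */
        const int64_t *rows;  /* decoded values, only needed when we must scan */
    } page_meta;

    /* COUNT(*) WHERE col BETWEEN lo AND hi, using metadata where possible. */
    int64_t count_between(const page_meta *pages, size_t n_pages,
                          int64_t lo, int64_t hi) {
        int64_t total = 0;
        for (size_t p = 0; p < n_pages; p++) {
            const page_meta *pg = &pages[p];
            if (pg->max_val < lo || pg->min_val > hi)
                continue;                    /* page entirely outside: skip it */
            if (pg->min_val >= lo && pg->max_val <= hi) {
                total += pg->row_count;      /* page entirely inside: use stored count */
                continue;
            }
            for (int64_t i = 0; i < pg->row_count; i++)  /* otherwise scan the page */
                if (pg->rows[i] >= lo && pg->rows[i] <= hi)
                    total++;
        }
        return total;
    }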
The query in the example is not “count” but a count + group by query. While it is possible to precompute group by results for every possible column per page, we don’t do that, and this query does touch every row. The article touches on how this is possible: operations on encoded data, AVX2, not using a hash table for the intermediate result of the group by. And we certainly don’t fake it by storing query results.
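As an illustration of grouping on encoded data without a hash table, here is a minimal C sketch (my guess at the general shape, not MemSQL's actual code): with a dictionary-encoded column the group-by intermediate can be a flat array indexed by code, yet every row's code still gets read:

    #include <stddef.h>
    #include <stdint.h>

    /* Dictionary-encoded column: each row holds a small integer code instead
       of the symbol string. With at most dict_size distinct values, the
       GROUP BY intermediate is a flat array indexed by code, not a hash table. */
    void group_by_count(const uint16_t *codes, size_t n_rows,
                        size_t dict_size, int64_t *counts /* len dict_size */) {
        for (size_t g = 0; g < dict_size; g++)
            counts[g] = 0;
        for (size_t i = 0; i < n_rows; i++)
            counts[codes[i]]++;   /* still touches every row, but no hashing */
    }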
What does the columnar format look like? Particularly, is the group by column compressed with RLE? That’s kind of a pre-computed group-by + count that would make this kind of query very very fast :)
This is an astute observation. Here is the accurate answer from the author of the feature:
“We do have special handling for RLE for filters and group by column but in both cases there would be per row component (updating selection vector for filter on RLE, updating aggregates for group by on RLE) and so far I saw RLE encoding usually slower than dictionary encoding for both these cases”
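To illustrate the per-run component mentioned above, a minimal C sketch of GROUP BY ... COUNT over an RLE-encoded column (hypothetical layout): each run adds its length to one group in a single step, so the work is per run rather than per row:

    #include <stddef.h>
    #include <stdint.h>

    /* Run-length encoded column: (code, run_length) pairs. */
    typedef struct { uint16_t code; int64_t run_len; } rle_run;

    void rle_group_by_count(const rle_run *runs, size_t n_runs,
                            size_t dict_size, int64_t *counts /* len dict_size */) {
        for (size_t g = 0; g < dict_size; g++)
            counts[g] = 0;
        for (size_t r = 0; r < n_runs; r++)
            counts[runs[r].code] += runs[r].run_len;  /* one update per run */
    }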
There is a clustered columnstore index on the table. MemSQL doesn’t support “filtered” indexes (indexes that let you specify a where clause that can be matched in a query with an equal or stricter where clause). Since the query needs a full scan those don’t help anyway.
We partition data on stock symbol but this doesn’t make a difference for this query either. It would if we had filters.
I wouldn't be so sure. If data is partitioned on stock symbols, then there is already a grouping by subset happening at the partition level, which may considerably ease the job of aggregation (it depends on the number of symbols and it seems you have about 1500-1600).
Since there is a clustered index, the values are stored "in order". A page could then easily have min value = max value, in which case the count saved in the metadata is enough (no need to scan the values in the page again). Should the index use a dictionary, one would only have to count the number of items in the corresponding entry without scanning the rows. Such information is stored in the index, not in the rows.
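For illustration, a minimal C sketch of that idea (hypothetical page layout): when the table is clustered on the group-by column, pages with min = max contribute their stored row count to a single group without decoding any rows:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint16_t min_code, max_code;  /* min/max of the group-by column in the page */
        int64_t  row_count;
        const uint16_t *codes;        /* only decoded when min != max */
    } page;

    void clustered_group_by_count(const page *pages, size_t n_pages,
                                  size_t dict_size, int64_t *counts) {
        for (size_t g = 0; g < dict_size; g++)
            counts[g] = 0;
        for (size_t p = 0; p < n_pages; p++) {
            if (pages[p].min_code == pages[p].max_code) {
                counts[pages[p].min_code] += pages[p].row_count;  /* metadata only */
            } else {
                for (int64_t i = 0; i < pages[p].row_count; i++)
                    counts[pages[p].codes[i]]++;                  /* fall back to a scan */
            }
        }
    }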
I have no doubt about your intentions. It's just that I think more could be done to support the claim. If your where clause contained something like (shares mod 3) = 0, then I would be pretty sure all rows were scanned, because such information is not aggregated at the page level.
If possible, I would also check the execution plan for any incongruent values.
That’s right: partitioning on stock symbol allows us to push the LIMIT 10 down to each partition, but in this case, with ~1500 stock symbols, it doesn’t buy us much. It’s actually possible to compute a full group by (without a limit) on every partition and merge the results at the end. Merging 1500 groups is computationally trivial.
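For illustration, a minimal C sketch of that merge step (names are mine, not MemSQL's): each partition produces its own array of per-symbol counts, and the final result is just an element-wise sum:

    #include <stddef.h>
    #include <stdint.h>

    /* partials[p] points to partition p's per-group counts (n_groups entries). */
    void merge_partition_counts(int64_t *const *partials, size_t n_partitions,
                                size_t n_groups, int64_t *out /* len n_groups */) {
        for (size_t g = 0; g < n_groups; g++)
            out[g] = 0;
        for (size_t p = 0; p < n_partitions; p++)
            for (size_t g = 0; g < n_groups; g++)
                out[g] += partials[p][g];
        /* With ~1500 groups per partition, this merge is negligible next to the scan. */
    }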
Yes, shares mod 3 or some other such predicate would make it impossible to answer this query in O(metadata). It would of course burn more instructions, so we would need a bigger cluster to hit a trillion rows a second, as well as a more complex explanation in the blog post of why this matters.
If you go with the Enterprise edition (we use Community), it used to be in the four digits for a support contract and some operational niceties (streaming native backup and restore).
I've worked on a ton of business cases that require it. This is what I worked on for years.
It's generally a large-company, large-issue sort of problem. You can throw millions in hardware/software at a problem that saves you hundreds of millions.
(1) I don't know if this is what they're thinking, but not all single instructions that operate on "multiple data" are vector operations... e.g. I wouldn't call summing all the items in an __m128 a vectorized operation. Similarly, I suppose not all vectorized operations are SIMD, since they might just use normal (SISD) instructions on entire arrays. But again, I'm not sure this is really what they mean; it wouldn't seem to be a very useful distinction to make.
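For concreteness, a minimal C sketch of the __m128 example (compile with -msse3): it reduces the four lanes of a single register, which is the kind of horizontal operation I wouldn't really call "vectorized":

    #include <pmmintrin.h>  /* SSE3 intrinsics */

    /* Horizontal sum of the four floats packed in one __m128:
       a reduction within a single register, not the same operation
       applied independently across many lanes. */
    static inline float hsum_m128(__m128 v) {
        __m128 t = _mm_hadd_ps(v, v);  /* (a+b, c+d, a+b, c+d) */
        t = _mm_hadd_ps(t, t);         /* (a+b+c+d, ...)       */
        return _mm_cvtss_f32(t);
    }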
Meh. They used 448 cores to count the frequency of bit patterns of some small length in a probably more or less contiguous block of memory. They had 57,756,221,440 total rows; that is 128,920,138 rows per core. If the data set contained 256 or fewer different stock symbols, then the task boils down to computing the byte histogram of a 123 MiB block of memory. My several-years-old laptop does this with the most straightforward C# implementation in 170 ms. That is less than a factor of 4 away from their 45.1 ms, and given that AVX-512 can probably process 64 bytes at a time, we should have quite a bit of room to spare for all the other steps involved in processing the query.
Don't get me wrong, in some sense it is really impressive that we have reached that level of processing power and that this database engine can optimize the query down to counting bytes and generate highly performant code to do so, but as an indicator that this database can process trillions of rows per second it is just a publicity stunt. Sure, it can do it with this setup and this query, but don't be too surprised if you don't get anywhere near that with other queries.
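For reference, the straightforward byte-histogram approach I mean looks roughly like this (a minimal C sketch; my actual measurement was a C# version of the same loop):

    #include <stddef.h>
    #include <stdint.h>

    /* With <= 256 distinct symbols, the per-core work is essentially counting
       how often each byte value occurs in a block of memory. */
    void byte_histogram(const uint8_t *data, size_t n, int64_t counts[256]) {
        for (int b = 0; b < 256; b++)
            counts[b] = 0;
        for (size_t i = 0; i < n; i++)
            counts[data[i]]++;
    }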
> My several-years-old laptop does this with the most straightforward C# implementation in 170 ms.
Sure, but writing a custom C# program for each query on data that has been pre-formatted in the most optimal manner for that program is not really comparable to writing a bog standard SQL query.
I am not sure I understand what you mean, but I assume your point is that what I did was kind of cheating, because I wrote code tailored to that problem, working on data in a format that makes the problem rather easy? Correct me if I am wrong.
But they did exactly the same thing, except that the query planner, optimizer, and compiler generated the code from the SQL query. They still picked a data set, a physical layout of that data set, and a query that would result in maximum throughput. That was not just any random SQL query; it was a carefully picked one, chosen because it would be able to take advantage of all possible optimizations.
It'll be faster than Redshift and about the same as ClickHouse, +/- depending on hardware and setup.
It's a great system, we used it for 2 years and it's one of the most polished databases out there with a simple MySQL interface. It's more general purpose than kdb, with a nice rowstore + columnstore architecture. I believe they're adding full-text search indexes in the latest version too.
If you need the query language, the advanced/asof joins, or the tightly integrated query/process environment, then there's no match for kdb, though.
The threshold for most people is slightly lower than a quarter second, but it's certainly the case that above a quarter-second users will notice the delay[0].
However, people may be primed for longer delays while still perceiving them as instant. e.g. smartphones for many years had a built-in 300ms delay on any click event, and even without that, they typically still have delays on many 'instant' actions.
So while the delay will be registered as being present, it may not be registered as "this site is slow" but "it's just a natural delay".
I completely agree, a quarter of a second is absolutely not instantaneous.
Especially not to oddballs like me who have some kind of perceptual 'bug'; it's like I lack a 'motion filter'. One effect is that movement in computer games doesn't feel completely fluid until the refresh rate is close to 200Hz.
Before the internet and slow-loading web pages might have lowered people's expectations, I remember human-computer interaction guidelines stating that < 0.1 s was experienced as instantaneous in almost all cases, and that more than 1 s without feedback started to cause measurable stress in test subjects.
Since this was before computers were ubiquitous, it's probably a good measure of how we react on a more basal, subconscious level. Anything measured today is likely to include learned expectations, so a quarter of a second seems like a reasonable learned expectation of what "instantaneous" means in that particular context.
This isn't about raw physics and biology but rather the use of business applications by users, specifically with regards to analytics and exploratory queries.
Anything less than a second to run a query and return results is considered "instant" (also sometimes "interactive") for analysis.
What? No human trades that quickly, and this is not about automated HFT trading systems but rather running database queries and what typical business users expect as "instant" results.
Well, you can try running MySQL or Postgres off a RAM drive, but I suspect you won't get close to this result. And then your data is not on a reliable storage medium.
Can't talk about MySQL or PostgreSQL, but with Oracle I am pretty sure you can get very close to this result with cached data, and that is without using the in-memory option. However, talking about it without any experiments to prove the point is just speculation. Memory vs. disks changes a lot performance-wise.
I believe you're conflating the name with the actual contents of the post. The demo uses the MemSQL columnstore, which is disk-based and leverages memory only for indexes, metadata, query processing etc.
Data ingest scales linearly assuming the source is scalable (S3, HDFS, or Kafka). There are other things that matter: how wide the table is, what data types, etc. We achieve 1GB/s on a 16-node cluster for some combination of the above. What is your target?
The speed of light is a hard limit. I don't believe there is any free lunch[1], only trade-offs to manage. I'm skeptical of any claim that implies free or easy speed without potentially significant trade-offs.
If you can live with somewhat out-of-date and/or out-of-sync data, you can throw mass parallelism at big read-only queries to get speed. The trade-offs often are best tuned from a domain perspective such that it's not really a technology problem, although technology may make certain tunings/tradeoffs easier to manage.
[1] (Faster hardware may give us incremental improvements, but the speed of light probably prevents any tradeoff-free breakthroughs.)
What are you on about? The author is counting rows in a database and checking how much time it takes the database to process the query. Counting rows is easily parallelisable.
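As a trivial illustration (a minimal C sketch with OpenMP, compile with -fopenmp), a filtered row count parallelises with nothing more than a reduction over per-thread partial counts:

    #include <stddef.h>
    #include <stdint.h>

    /* Count rows matching a predicate; each thread scans a disjoint chunk
       and the partial counts are summed by the reduction clause. */
    int64_t parallel_count(const int32_t *col, size_t n, int32_t key) {
        int64_t count = 0;
        #pragma omp parallel for reduction(+:count)
        for (ptrdiff_t i = 0; i < (ptrdiff_t)n; i++)
            if (col[i] == key)
                count++;
        return count;
    }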
So are many problems, and we are still at a point where we are creating languages and frameworks that allow us to work with the data at a higher level: e.g. Theano, NumPy, or TensorFlow.
The speed of light is only an issue for latency-sensitive operations: e.g. if the sub-operations in a given transaction are non-commutative, each would be expected to complete before the next one starts.
Re: Speed of light is only an issue for latency-sensitive operations
That was my point: if you don't care about data being "fresh" and consistent (per transaction coordination) you CAN throw mass parallelism at the problem. If you do care, then the speed of light is the bottleneck.
We actually use MemSQL for almost exactly this use case: realtime arbitrary grouping and filtering on multiple (30+) columns over a billion+ rows with substantial concurrency (reporting tools and APIs on top of them). Previously, with PostgreSQL, this required a lot of fine-tuning of indexes and lots of ETL scripting to break things into separate dimension tables, and then large join queries to pull them back together. This is expensive at the development and operational level, and data was batched in, which involves delays. It was also extremely resource-intensive to query, and moderate levels of concurrency required a master and several slaves, with response times for anything more than a week of data requiring multi-second waits and the worst cases approaching the minute mark.
The hardware in this example is overkill and impractical for most use cases, to say the least. For our setup, MemSQL does this for us on a single node with 256GB of RAM, 40 cores (1 aggregator and 4 leaves), and a modest enterprise NVMe SSD. The machine cost $4,500 over a year ago. Adding more machines to MemSQL is pretty trivial should we ever need to partition this across machines, though that hasn't been necessary.
There are some gotchas and it should not be considered a drop-in replacement for MySQL.
Originally it was just a port, but now the inserts go straight into MemSQL. This used to be a big no-no on MySQL (with InnoDB and MyISAM anyway), as it would invalidate a lot of the query cache on every insert. Here you can refresh the query every second and see the counts go up.
Ad Tech has no need for realtime data viewing or aggregation (in this manner), even for platform log data. Offline parallel processing is the standard. Redshift is particularly efficient, while others use Spark or other ad-hoc solutions.
For users, you always want mediation/adjustment steps that (can) modify realtime data to provide timesliced totals. For developers/administrators, you want to be able to persist data. Running totals in memory are too fragile to be reliable. There is an assumption of errors, misconfigurations, and bad actors at all times in AdTech.
We used MemSQL for real-time data for 2 years. All data is fully persistent, but the rowstore tables are also held fully in memory, whereas columnstores are mainly on disk. There's nothing fragile about it. SQL Server's Hekaton, SAP's HANA, Oracle's TimesTen, and several other databases do the same.
Timesliced totals are just a SQL query, and mediation or some other buffer between live numbers and customers is up to each business to decide, not some default proclamation for an entire industry.
Real-time analytics without the need for pre-aggregation or periodic ETL? To turn your question around: which system would not want that? MemSQL and similar offerings (preferably on more standard hardware) are definitely interesting.