I have doubts about the "We’re actually spending some time on every row" claim. Storing data in a columnstore usually comes with metadata saved at the page level, such as min, max and count values for the data in that page. These values are used to filter the data processed for a query, i.e., to skip entire pages. If you run a 'count' query 10 times, it's very unlikely the DB will count rows 10 times; it will rely on the page's existing metadata whenever it is available (i.e., already computed). The tests described in the post are misleading IMHO.

EDIT: This comes on top of the fact that DBs can store query results too. Moreover, the post does not say whether clustered or filtered indexes were implemented on the columns in question, nor does it explain how partitioning was done. All of this has a big impact on execution time.
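To make the page-metadata point above concrete, here is a minimal sketch in Python, assuming an invented per-segment metadata layout (min, max and row count per page); a plain count can then be answered from the stored counts without reading a single row:

    # Hypothetical per-segment metadata of the kind described above.
    # Names and layout are invented for illustration; this is not any
    # specific engine's on-disk format.
    from dataclasses import dataclass

    @dataclass
    class SegmentMeta:
        min_value: int   # smallest column value within this segment
        max_value: int   # largest column value within this segment
        row_count: int   # number of rows stored in this segment

    segments = [
        SegmentMeta(min_value=10, max_value=55, row_count=1_048_576),
        SegmentMeta(min_value=56, max_value=99, row_count=1_048_576),
    ]

    # A bare COUNT(*) only needs the stored counts; no row is ever read.
    total_rows = sum(s.row_count for s in segments)
    print(total_rows)  # 2097152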



The query in the example is not a plain “count” but a count + group by query. While it is possible to precompute group by results for every possible column per page, we don’t do that, and this query does touch every row. The article touches on how this is possible: operations on encoded data, AVX2, and not using a hash table for the intermediate result of the group by. And we certainly don’t fake it by storing query results.

We guarantee that the result is legit.
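To illustrate the “no hash table for the intermediate result” point, here is a hedged scalar sketch with made-up data: when the group by column is dictionary encoded, the aggregation state can be a dense array indexed by dictionary code. (The real path runs the equivalent inner loop over encoded data with AVX2, which plain Python cannot show.)

    # Sketch: group-by count over a dictionary-encoded column without a hash table.
    # Illustrative only; data and names are made up.
    dictionary = ["AAPL", "GOOG", "MSFT"]   # distinct symbols in the segment
    codes = [0, 2, 1, 0, 0, 2, 1, 1, 0]     # one small integer code per row

    # Codes are dense small integers, so the intermediate result can be a
    # plain array indexed by code: no hashing, no collision handling.
    counts = [0] * len(dictionary)
    for code in codes:                      # still touches every row
        counts[code] += 1

    top = sorted(zip(dictionary, counts), key=lambda kv: kv[1], reverse=True)
    print(top[:10])   # [('AAPL', 4), ('MSFT', 3), ('GOOG', 2)]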


What does the columnar format look like? In particular, is the group by column compressed with RLE? That’s essentially a pre-computed group-by + count, which would make this kind of query very, very fast :)


This is an astute observation. Here is the answer from the author of the feature: “We do have special handling for RLE for both filters and the group by column, but in both cases there would be a per-row component (updating the selection vector for a filter on RLE, updating aggregates for a group by on RLE), and so far I have seen RLE encoding usually turn out slower than dictionary encoding for both of these cases.”
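For readers following along, a minimal sketch (made-up runs, not the engine's actual layout) of the per-run aggregation both comments are describing; note the author's point that in their measurements this path has still tended to be slower than the dictionary one:

    # Why RLE on the group-by column resembles a pre-computed group-by + count:
    # each run carries (value, length), so COUNT per group needs one addition
    # per run rather than one per row. Illustrative only.
    runs = [("AAPL", 4), ("MSFT", 2), ("AAPL", 3), ("GOOG", 5)]  # (symbol, run length)

    counts = {}
    for symbol, length in runs:
        counts[symbol] = counts.get(symbol, 0) + length

    print(counts)   # {'AAPL': 7, 'MSFT': 2, 'GOOG': 5}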


There is a clustered columnstore index on the table. MemSQL doesn’t support “filtered” indexes (indexes that let you specify a where clause that can be matched in a query with an equal or stricter where clause). Since the query needs a full scan those don’t help anyway.

We partition data on stock symbol but this doesn’t make a difference for this query either. It would if we had filters.


I wouldn't be so sure. If data is partitioned on stock symbol, then a grouping into subsets is already happening at the partition level, which may considerably ease the job of aggregation (it depends on the number of symbols, and it seems you have about 1500-1600). Since there is a clustered index, values are saved "in order". A page could then easily have a min value = max value situation, in which case the count saved in the metadata is valid as-is (no need to scan all the values in the page again). Should the index use a dictionary, one would only have to count the number of items in the corresponding entry without scanning the rows; such information is stored in the index, not in the rows.

I have no doubt about your intentions. It's just that I think we could do better to support the claim. If your where clause contained something like (shares mod 3) = 0, then I would be pretty sure all rows were scanned, because such information is not aggregated at the page level. If possible, I would also check the execution plan for any incongruent values.
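A small sketch of the two scenarios in this comment, with invented metadata fields: a min = max page whose stored count can stand in for a scan, versus a (shares mod 3) = 0 predicate that page metadata cannot answer:

    # Illustrative only; field names are invented.
    segment_meta = {"min": "AAPL", "max": "AAPL", "rows": 500_000}

    if segment_meta["min"] == segment_meta["max"]:
        # Every row in this segment holds the same symbol, so the stored row
        # count is already that symbol's count for the segment -- no rescan.
        count_for_aapl = segment_meta["rows"]

    # Nothing in (min, max, rows) says how many rows satisfy (shares % 3) == 0,
    # so that predicate would force the engine to decode and test every value.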


That’s right: partitioning on stock symbol makes it possible to push down LIMIT 10 to each partition; however, in this case, with 1500 stock symbols, it doesn’t buy us much. It’s actually possible to compute a full group by (without a limit) on every partition and merge the results at the end. Merging 1500 groups is computationally trivial.

Yes, shares mod 3 or another such predicate would make it impossible to run this query in O(metadata). It would of course burn more instructions, so we would need a bigger cluster to hit a trillion rows a second, as well as a more elaborate explanation in the blog post of why this matters.
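As a rough sketch of the per-partition plan described above (made-up data and names, not the actual execution operators): each partition computes its own full group by, and the small per-partition results are merged at the end, with the LIMIT applied after the merge.

    # Illustrative only.
    from collections import Counter

    # Each partition produces its own (symbol -> count) result in parallel.
    partition_results = [
        Counter({"AAPL": 120, "MSFT": 80}),
        Counter({"AAPL": 95, "GOOG": 60}),
    ]

    # Merging ~1500 groups per partition is trivial next to scanning the rows.
    merged = Counter()
    for partial in partition_results:
        merged.update(partial)

    print(merged.most_common(10))   # the LIMIT 10 applied after the merge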



