You're nitpicking the term "journalism" here, but even in casual conversation if a friend went on and on about how great a company was and then I later found out that they:
* Got paid a referral fee if I signed up
* Owned shares in the company
* Were even roommates with the founders/CEO but failed to mention it
I would trust that person less going forward.
If you want to be perceived as trustworthy, then you shouldn't say things that you have a hidden interest in saying.
This is correct, and I appreciate you putting it so coherently :). I think I didn't make it clear enough in the piece that I'm coming from a stance of fast access being table stakes, with the question being how that's accomplished.
"Caching" is the idea of storing the result of an expensive computation in storage that is faster to read from than redoing the original computation (in very generic computer terms; the "computation" can simply be a fetch from the network or from slower local storage).
What you describe as "caching algorithms" are not really caching algorithms, but cached-object lifetime-management algorithms (LRU, LFU, ...).
It sounds like you are trying to redefine both what "caching" means (tying it to implementations of particular algorithms) and what "abstraction" means.
We should be very deliberate with the language we use, and our main goal should be to make it simpler to understand, not harder — I believe you are doing the latter here.
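To make that distinction concrete, here's a toy sketch (my own illustration, not from the thread): the *caching* part is just fast get/put of precomputed results, while the *lifetime management* part is the separate policy deciding what to evict — here LRU.

```python
from collections import OrderedDict

class LruCache:
    """Toy cache. get/put is the caching; the eviction rule is the
    lifetime-management policy (LRU in this sketch)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, key):
        if key not in self._store:
            return None
        # Policy bookkeeping: a hit makes the entry most-recently-used.
        self._store.move_to_end(key)
        return self._store[key]

    def put(self, key, value):
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = value
        if len(self._store) > self.capacity:
            # Lifetime management, not caching per se:
            # drop the least-recently-used entry.
            self._store.popitem(last=False)

cache = LruCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" becomes most recently used
cache.put("c", 3)  # capacity exceeded: evicts "b", the LRU entry
```

Swapping the eviction rule (say, to LFU) changes only `put`'s last two lines; the caching interface itself is untouched, which is the point of keeping the two concepts separate.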
Thanks! That was very kind of you to say. Whenever I write stuff like that, I wonder, "Does anyone find this useful?" It helps to hear every once in a while that the answer is sometimes yes.
More deserves to be written on the ILP idea. I haven't actually tried to make it work, but it seems to me like the only real direction for optimizing ("optimize" used in the mathematical sense, rather than the programming sense) queries on which you can't just exploit the principle of optimality (see this earlier article I wrote [1] for a bit more exposition). I think some of the egg [2] people may have experimented with this.
Speaking from experience, there are lots of rewrites you'd want to do in a query optimizer that having access to efficient DAG-shaped query plans would make tenable; as a specific example, they are an important part of doing full subquery decorrelation [3].
Also, more generally, there's the idea of being at least worst-case optimal: for a fixed query, a multi-way join-project can run polynomially faster (with respect to the number of input rows) than binary join plans, even on pathologically connected/connecting input rows. The relevant polynomial exponents are given by concepts like the "fractional hypertree width" of the query.
the simplest example is a simple two-column table of a directed graph's edges, e(from, to). Then:
Triangle(a, b, c) := e(a, b), e(b, c), e(c, a).
#Triangle <= c * (#e)^1.5, for a small c. (I think c = 3 in this case, but I'm not sure.)
Using binary joins on pathological input will, however, take O((#e)^2). If you do row-granularity lookup using an auxiliary index (actually two, in the simplified version, but they can be merged by pushing the post-lookup choice down to index creation) along with indexing e twice (once for each of the two columns), you can do it in O((#e)^1.5):
Create auxiliary count indices eaux1 and eaux2:
eaux1(from, count), where count = #{to : e(from, to)} (the out-degree of `from`).
eaux2(to, count), where count = #{from : e(from, to)} (the in-degree of `to`).
Iterate over e(a, b):
Look up eaux1(b, ?countbc) and eaux2(a, ?countca).
If countbc < countca, then:
expand through point lookup e(b, ?c). Filter the resulting (a, b, ?c) candidates by checking e(c, a) against a point-membership-suitable index of e.
else:
expand through point lookup e(?c, a). Filter the resulting (a, b, ?c) candidates by checking e(b, c) against a point-membership-suitable index of e.
Done.
As can be seen, the trick is to choose, at each nesting level of the (generally nested-loop-join-structured) query execution, which of the current options for "next table to join the partial row against" to take, using premade indices that reveal the iteration count of the upcoming inner nested loop in O(1) time.
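The steps above can be sketched in Python (my own translation of the plan; `count_triangles` and the index names are illustrative, not from the post). It counts tuples (a, b, c) satisfying e(a, b), e(b, c), e(c, a), picking the cheaper expansion side per edge using the precomputed degree indices:

```python
from collections import defaultdict

def count_triangles(edges):
    """Count tuples (a, b, c) with e(a,b), e(b,c), e(c,a)."""
    edge_set = set(edges)             # point-membership index on e
    out_adj = defaultdict(list)       # e indexed by its first column
    in_adj = defaultdict(list)        # e indexed by its second column
    for u, v in edges:
        out_adj[u].append(v)
        in_adj[v].append(u)
    # eaux1 / eaux2: O(1) lookup of the upcoming inner-loop sizes.
    out_deg = {u: len(vs) for u, vs in out_adj.items()}
    in_deg = {v: len(us) for v, us in in_adj.items()}

    count = 0
    for a, b in edges:
        count_bc = out_deg.get(b, 0)  # candidates via e(b, ?c)
        count_ca = in_deg.get(a, 0)   # candidates via e(?c, a)
        if count_bc < count_ca:
            for c in out_adj.get(b, ()):   # expand e(b, ?c)
                if (c, a) in edge_set:     # filter with e(c, a)
                    count += 1
        else:
            for c in in_adj.get(a, ()):    # expand e(?c, a)
                if (b, c) in edge_set:     # filter with e(b, c)
                    count += 1
    return count

# A directed 3-cycle 1->2->3->1 yields three satisfying tuples,
# one per choice of starting edge.
print(count_triangles([(1, 2), (2, 3), (3, 1)]))  # 3
```

This is only a sketch of the asymptotic idea, of course; a real engine would fold the degree lookup into index creation as described above rather than keep two separate dicts.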
(I'm the author of the original post [1].) While I think the rules, in the end, make sense, I think it's not quite as clear-cut as you describe. Consider:
SELECT (SELECT sum(1) FROM xx LIMIT 1) FROM aa;
which returns
3
3
3
Personally, I think having the inner aggregation always attach to the nearest `SELECT` would have been an equally valid way of defining how this works, but it just so happens it is not defined that way.
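You can see where the aggregate attaches by running the query against an in-memory SQLite database (the table contents here are my own toy data, chosen so the two row counts differ):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE aa (a INTEGER);
    CREATE TABLE xx (x INTEGER);
    INSERT INTO aa VALUES (10), (20), (30);  -- 3 rows
    INSERT INTO xx VALUES (1), (2);          -- 2 rows
""")
rows = con.execute(
    "SELECT (SELECT sum(1) FROM xx LIMIT 1) FROM aa"
).fetchall()
# sum(1) aggregates at the inner SELECT: one output row per row of aa,
# each carrying xx's row count.
print(rows)  # [(2,), (2,), (2,)]
```

If the aggregate instead attached to the outer `SELECT`, you'd get a single row, so the repeated-scalar shape is the tell.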
I think this one is obvious, actually! There's no choice but to aggregate at the level of the `xx` `SELECT`; no other level has access to the `x` column.
I agree, I think the original sin here is the fact that whether a `SELECT` is an aggregation is determined by the contents of the scalar expressions at all. I think most of this weirdness comes directly out of wanting to be able to write both `SELECT sum(x) FROM xx` and `SELECT x FROM xx` and have them work.
Not that I have a better solution offhand; in SQL, grouping by a constant value is not actually the same as omitting `GROUP BY` entirely, since the behaviour on empty tables differs.
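A minimal SQLite demonstration of that empty-table difference (my own toy example; `'k'` is just an arbitrary constant grouping expression):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")  # deliberately left empty
# Plain aggregate: always exactly one row, even over an empty table.
no_group = con.execute("SELECT sum(x) FROM t").fetchall()
# Grouping by a constant: no input rows means no groups,
# so no output rows at all.
const_group = con.execute("SELECT sum(x) FROM t GROUP BY 'k'").fetchall()
print(no_group)     # [(None,)]
print(const_group)  # []
```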
What's an aggregation per se? A SQL query is best thought of as an arbitrary generator function. You can, using functions like UNNEST, end up emitting multiple row-tuples for each processed input row-tuple. An aggregation is just a generator function that happens to reduce over all the input rows and then emit one row-tuple when there's no more input. Query-planning engines do not special-case this. It's just a generator node like any other generator node.
Consider: using window functions, you can do partial aggregations over subsets of the input — without even necessarily partitioning the input (i.e. you can compute "running totals" and other wacky state-machine-like outputs.)
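A small SQLite sketch of that (my own toy example; needs window-function support, i.e. SQLite ≥ 3.25): the same `sum` aggregate, used as a window function, reduces over a frame yet still emits one output row per input row, i.e. a running total rather than a collapsed stream.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
# With ORDER BY, the default frame runs from the start of the
# partition up to the current row: a running total.
rows = con.execute(
    "SELECT x, sum(x) OVER (ORDER BY x) FROM t ORDER BY x"
).fetchall()
print(rows)  # [(1, 1), (2, 3), (3, 6)]
```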
Not writing `group by` is the same as writing `group by ()`.
And yeah, the difference between that and grouping by a constant value is one of those really surprising things in SQL that actually make sense and should be this way. Unfortunately, there are many of those.