I think the point is that now that we're in the era of distributed/cloud/partitioned databases, the query logic cannot necessarily be colocated with the storage, so you might as well get the advantages of decoupling.
The performance gains at the distributed level seem to rely on either parallelized compute (à la MapReduce) or over-indexing (à la Elasticsearch/GIN) rather than the sort of efficiency gains that came from optimizing across the CPU/cache/memory/disk hierarchy. I think the modern problems in distributed databases are about keeping time complexity from exploding as the size of the data grows.
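To make the "parallelized compute" side concrete, here's a toy sketch of my own (not taken from any specific system): an aggregate is split into independent partial results per partition and then combined, so per-node work stays roughly flat as the data and the number of partitions grow together.

    # Toy illustration of scatter/gather aggregation across partitions.
    from multiprocessing import Pool

    def partial_sum(partition):
        # Each worker scans only its own partition of rows.
        return sum(row["amount"] for row in partition)

    def parallel_total(partitions):
        # Fan out one worker per partition, then combine the partial sums.
        with Pool(processes=len(partitions)) as pool:
            return sum(pool.map(partial_sum, partitions))

    if __name__ == "__main__":
        partitions = [[{"amount": i} for i in range(1000)] for _ in range(4)]
        print(parallel_total(partitions))  # 1998000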
The query logic/execution is pushed down to the storage in most reasonable distributed database architectures, along with enough metadata that the individual storage engines are fully aware of (their piece of) the execution plan and orchestrate it directly with other nodes, e.g. for joins. The reason for local storage is that execution is almost completely bandwidth-bound. When you look at growing markets like sensor analytics for edge computing, this is the only model that works at scale. Most new parallel databases aren't using a classic MapReduce model or really anything that requires centralized orchestration.
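To sketch what "pushing the plan fragment down to storage" looks like (names like PlanFragment and StorageNode are illustrative, not any particular engine's API): each storage node applies the filter and projection against its local shard, so only the qualifying rows ever leave the node.

    # Hypothetical sketch: a plan fragment executed locally on each storage node.
    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class PlanFragment:
        predicate: Callable[[dict], bool]   # WHERE clause pushed to the node
        columns: List[str]                  # projection pushed to the node

    class StorageNode:
        def __init__(self, rows):
            self.rows = rows  # this node's local shard

        def execute(self, frag):
            # The scan runs against local storage bandwidth; only the
            # filtered, projected rows are shipped back over the network.
            return [{c: r[c] for c in frag.columns}
                    for r in self.rows if frag.predicate(r)]

    shards = [
        StorageNode([{"id": 1, "region": "eu", "amount": 10},
                     {"id": 2, "region": "us", "amount": 99}]),
        StorageNode([{"id": 3, "region": "eu", "amount": 7}]),
    ]
    frag = PlanFragment(predicate=lambda r: r["region"] == "eu",
                        columns=["id", "amount"])
    print([row for node in shards for row in node.execute(frag)])
    # [{'id': 1, 'amount': 10}, {'id': 3, 'amount': 7}]

The same idea extends to joins: nodes exchange only join keys or matching rows with each other rather than streaming whole tables through a coordinator.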
The distributed databases in open source are a bit of a biased sample, as they tend to copy architectures from each other and are a trailing indicator of the state of the art. For example, you virtually never see facilities for individual nodes to self-orchestrate parallel query execution (this is almost always centralized), which entirely changes how you would go about scaling out e.g. join operations or mixed workloads. Building this facility typically requires a tightly coupled storage and execution engine, but open source systems are strongly biased toward decoupling these things for expediency. Open source database architecture is trapped in a local minimum.
In other words, are you saying that distributed databases with local storage can greatly reduce network I/O with things like predicate pushdown, which is not possible when the database is built on top of a distributed file system?
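Something like this back-of-the-envelope illustration is what I have in mind (numbers made up): with a dumb distributed file system the compute node has to pull every block over the network and filter afterwards, whereas if the storage node can evaluate the predicate itself, only matching rows travel.

    # Illustrative comparison of bytes moved with and without predicate pushdown.
    ROW_BYTES = 100  # assumed average row size
    rows = [{"region": "eu" if i % 50 == 0 else "us"} for i in range(1_000_000)]

    # No pushdown: every row crosses the network before it can be filtered.
    bytes_no_pushdown = len(rows) * ROW_BYTES

    # Pushdown: the storage node filters first; only matches cross the network.
    matches = [r for r in rows if r["region"] == "eu"]
    bytes_pushdown = len(matches) * ROW_BYTES

    print(f"no pushdown: {bytes_no_pushdown / 1e6:.0f} MB over the network")
    print(f"pushdown:    {bytes_pushdown / 1e6:.0f} MB over the network")
    # no pushdown: 100 MB, pushdown: 2 MB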