
Separating the storage layer from the database and query layer seems to be the future of databases - there's no reason for every database to re-implement the storage and replication layer just to provide a different query language or data model. This is also what makes FoundationDB so attractive for me. Are there any other large projects doing this?


People often greatly underestimate the performance loss incurred by separating the storage layer from the execution layer. The difference is an integer factor. In a high-performance database design, the execution scheduling behavior must be tightly and intricately coupled to the I/O scheduling behavior, or you leave a lot of low-hanging throughput optimization on the table.

That said, this can be solved by changing the level of abstraction. If you tightly coupled the execution and storage engine as a single module, which is not intrinsically dependent on the type of data model, it would allow high-performance multi-use within reasonable constraints. Albeit with a very different API than a storage engine. But that is not a thing that exists in open source AFAIK.
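
For concreteness, a minimal sketch of what such an API might look like (every name here is invented; this illustrates the shape of the idea, not any existing engine): the engine owns both the storage layout and the execution of pushed-down work, so it can schedule I/O and execution together, while staying agnostic about the data model.

    package enginesketch

    import "context"

    // Expr is an opaque, data-model-agnostic unit of pushed-down work
    // (filter, projection, aggregate) compiled by the layer above and
    // evaluated inside the engine, next to the data.
    type Expr interface {
        Eval(record []byte) (keep bool, out []byte)
    }

    // Task is work the engine may reorder and batch against its own I/O
    // queue -- the execution/I-O coupling described above.
    type Task struct {
        Span [2][]byte              // key range to scan
        Plan Expr                   // pushed-down work to run next to the data
        Emit func(row []byte) error // where surviving rows go
    }

    // Engine owns both the storage format and the execution of Tasks.
    type Engine interface {
        // Submit hands over a batch; the engine decides I/O ordering,
        // prefetching and parallelism instead of the caller issuing
        // blind point reads.
        Submit(ctx context.Context, tasks []Task) error
    }

The point is that the caller describes work against records rather than reads against blocks, and never dictates I/O order.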

The loss from separating storage and execution may not matter for databases where operational efficiency is not paramount (e.g. because the scale is too small for it to matter), but many database engine designers are not going to take that use-case-limiting performance hit, because the CapEx/OpEx implications are large.


I think the point is that now that we're in the era of distributed/cloud/partitioned databases, the query logic cannot necessarily be colocated with the storage, so might as well get the advantages of decoupling.

The performance gains at the distributed level seem to rely on either parallelized compute (à la MapReduce) or over-indexing (à la Elasticsearch/GIN) rather than the sort of efficiency gains that were made at the level of optimizing between CPU/cache/memory/disk. I think the modern concerns for distributed databases are about keeping time complexity from exploding as the size of the data grows.


The query logic/execution is pushed down to the storage in most reasonable distributed database architectures, along with enough metadata that the individual storage engines are fully aware of (their piece of) the execution plan and orchestrate it directly with other nodes e.g. for joins. The reason for local storage is that this will be almost completely bandwidth bound. When you look at growing markets like sensor analytics for edge computing this is the only model that works at scale. Most new parallel databases aren't using a classic MapReduce model or really anything that requires centralized orchestration.
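
To make the pushdown concrete, here is a toy illustration (not any real system's API): the predicate is shipped to each storage node and evaluated next to the data, so the network carries results rather than whole partitions.

    package main

    import "fmt"

    type Row struct {
        SensorID string
        Value    float64
    }

    // Runs on the storage node: evaluates the pushed-down predicate
    // locally and returns only the matching rows.
    func scanLocal(partition []Row, pred func(Row) bool) []Row {
        var out []Row
        for _, r := range partition {
            if pred(r) {
                out = append(out, r)
            }
        }
        return out
    }

    func main() {
        // Two "storage nodes", each holding its own partition.
        partitions := [][]Row{
            {{"a", 1.2}, {"b", 9.7}},
            {{"c", 8.1}, {"d", 0.3}},
        }
        pred := func(r Row) bool { return r.Value > 5 } // pushed-down filter
        var results []Row
        for _, p := range partitions { // in reality these scans run remotely, in parallel
            results = append(results, scanLocal(p, pred)...)
        }
        fmt.Println(results) // only matching rows ever cross the network
    }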

The distributed databases in open source are a bit of a biased sample, as they tend to copy architectures from each other and are a trailing indicator of the state of the art. For example, you virtually never see facilities for individual nodes to self-orchestrate parallel query execution (this is almost always centralized), which entirely changes how you would go about scaling out e.g. join operations or mixed workloads. Building this facility typically requires a tightly coupled storage and execution engine, but open source systems are strongly biased toward decoupling these things for expediency. Open source database architecture is trapped in a local minimum.
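
A very loose sketch of what self-orchestration means in practice (nothing here is modeled on a particular system; goroutines stand in for nodes and channels for the network): every node gets the same plan and peer list and repartitions its own join input directly to the peer that owns each hash bucket, with no coordinator in the data path.

    package main

    import (
        "fmt"
        "hash/fnv"
        "sync"
    )

    // owner maps a join key to the node responsible for its hash bucket.
    func owner(key string, nodes int) int {
        h := fnv.New32a()
        h.Write([]byte(key))
        return int(h.Sum32()) % nodes
    }

    func main() {
        const nodes = 3
        inbox := make([]chan string, nodes) // stand-in for the network
        for i := range inbox {
            inbox[i] = make(chan string, 16)
        }
        localRows := [][]string{{"ada", "bob"}, {"cat", "ada"}, {"bob"}}

        var wg sync.WaitGroup
        for i := 0; i < nodes; i++ {
            wg.Add(1)
            go func(id int) {
                defer wg.Done()
                // Each node routes its own rows; no coordinator decides the shuffle.
                for _, k := range localRows[id] {
                    inbox[owner(k, nodes)] <- k
                }
            }(i)
        }
        wg.Wait()
        for i := range inbox {
            close(inbox[i])
            for k := range inbox[i] {
                fmt.Printf("node %d received %q for its local join\n", i, k)
            }
        }
    }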


In other words, are you saying that distributed databases with local storage can greatly reduce network I/O with things like predicate pushdown, which is not possible when the database is built on top of a distributed file system?


> If you tightly coupled the execution and storage engine as a single module, which is not intrinsically dependent on the type of data model, it would allow high-performance multi-use within reasonable constraints. Albeit with a very different API than a storage engine.

This is what we are doing with FaunaDB: Fauna's own native semantics are essentially a unified intermediate representation for a variety of forthcoming high-level query languages.


Indeed, but it's not a new idea. This is Bigtable's architecture after all, which is now 15 (?) years old. There are a bunch of advantages: DB developers can focus on the DB, not the storage layer (which is in many ways much harder to get right); you can scale storage and processing independently; and you can easily support different tiers of storage (SSD/HDD).

But by far the biggest advantage is operational. Your DB is now stateless! Nodes can die and be replaced in seconds, because they don't store any data; this makes designs like Bigtable's much more practical. You can quickly scale up the processing to deal with spikes without spending hours re-replicating. Operating the storage layer is still inherently challenging, but just has to be figured out once for all your DBs and other processing systems.

While performance can take a hit, this can be mitigated by local caching and good internal networking. It can't reach the lowest latency targets (sub-millisecond), but in practice it works great for web applications.
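
The caching mitigation amounts to a read-through block cache sitting in front of the remote store. A rough sketch (RemoteStore and blockCache are invented names, not any system's API):

    package main

    import (
        "fmt"
        "sync"
    )

    type RemoteStore func(blockID string) []byte // stands in for a network fetch

    type blockCache struct {
        mu    sync.Mutex
        data  map[string][]byte
        fetch RemoteStore
    }

    func (c *blockCache) Get(blockID string) []byte {
        c.mu.Lock()
        defer c.mu.Unlock()
        if b, ok := c.data[blockID]; ok {
            return b // hot blocks never touch the network
        }
        b := c.fetch(blockID) // cold read pays the remote-storage latency once
        c.data[blockID] = b
        return b
    }

    func main() {
        remote := func(id string) []byte { return []byte("block:" + id) }
        c := &blockCache{data: map[string][]byte{}, fetch: remote}
        fmt.Println(string(c.Get("sst-17/0"))) // miss: goes to remote storage
        fmt.Println(string(c.Get("sst-17/0"))) // hit: served locally
    }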


Bigtable was always colocated with the data servers. If a node failed, you had to have other Bigtable nodes serving the same ranges of data.

Though later it went stateless.


I'm not totally clear on the history, but from the Bigtable paper it appears that (at least circa 2006) Bigtable and GFS were not colocated.


CockroachDB is made up of several layers to implement SQL and transactions on top of a distributed key-value store: https://www.cockroachlabs.com/docs/stable/architecture/overv...

TiDB has a similar architecture: https://www.pingcap.com/docs/architecture/
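
The basic trick in both is mapping SQL rows onto an ordered key-value space. A loose illustration (not CockroachDB's or TiDB's actual encoding; the table and index IDs below are made up):

    package main

    import "fmt"

    // rowKey builds a key from table ID, index ID and primary key, so rows of
    // a table are contiguous in the ordered keyspace.
    func rowKey(tableID, indexID int, pk string) string {
        return fmt.Sprintf("/%d/%d/%s", tableID, indexID, pk)
    }

    func main() {
        kv := map[string]string{}
        // INSERT INTO users (id, name) VALUES ('42', 'ada')
        kv[rowKey(53, 1, "42")] = "ada"
        // A point lookup by primary key becomes a single KV get.
        fmt.Println(kv[rowKey(53, 1, "42")])
    }

Because the keyspace is ordered, a table or index scan becomes a contiguous range scan over the KV layer, which is the unit the distribution and replication machinery underneath actually understands.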


The TiDB approach is a lot more like Aurora's, though, as you have much finer-grained control over how many storage and query nodes you run.


Datomic does the same. Storage, query and writing are completely decoupled.

See more https://docs.datomic.com/on-prem/storage.html#storage-servic...


Microsoft’s new Azure SQL DB Hyperscale does something very similar:

https://www.brentozar.com/archive/2019/01/how-azure-sql-db-h...

I sketched out a lot of architecture diagrams in that post, showing how it differs from traditional RDBMSs, and I tried to keep it approachable for database folks from different platforms.


It's too bad it doesn't perform very well. Was looking forward to an Aurora-like SQL Server.

I hope they fix it.


In my eyes, Google pioneered the commercial application of this approach. With Bigtable and its underlying Colossus storage layer, they effectively drove the rise of HDFS and all the big-data tooling from a decade ago.


It depends on the use case: look at the performance of, say, ClickHouse vs. alternatives that separate the storage layer. The performance difference is fairly significant.


Cosmos DB has a similar approach, but it still has a long way to go.



