Separating the storage layer from the database and query layer seems to be the future of databases - there's no reason for every database to re-implement the storage and replication layer just to provide a different query language or data model. This is also what makes FoundationDB so attractive to me. Are there any other large projects doing this?
People often greatly underestimate the performance loss incurred by separating the storage layer from the execution layer. The difference is an integer factor. In a high-performance database design, the execution scheduling behavior must be tightly and intricately coupled to the I/O operation scheduling behavior or you lose much low-hanging throughput optimization.
That said, this can be solved by changing the level of abstraction. If you tightly coupled the execution and storage engine as a single module, which is not intrinsically dependent on the type of data model, it would allow high-performance multi-use within reasonable constraints. Albeit with a very different API than a storage engine. But that is not a thing that exists in open source AFAIK.
The loss from separating storage and execution may not matter for databases where operational efficiency is not paramount (e.g. because the scale is too small for it to matter), but many database engine designers are not going to take that use-case-limiting performance hit because the CapEx/OpEx implications are large.
I think the point is that now that we're in the era of distributed/cloud/partitioned databases, the query logic cannot necessarily be colocated with the storage, so might as well get the advantages of decoupling.
The performance gains at the distributed level seem to rely on either parallelized compute (à la MapReduce) or over-indexing (à la Elasticsearch/GIN) rather than the sort of efficiency gains that were made at the level of optimizing between CPU/cache/memory/disk. I think modern problems regarding distributed databases are concerned with keeping the time complexity from exploding as the size of the data does.
The query logic/execution is pushed down to the storage in most reasonable distributed database architectures, along with enough metadata that the individual storage engines are fully aware of (their piece of) the execution plan and orchestrate it directly with other nodes e.g. for joins. The reason for local storage is that this will be almost completely bandwidth bound. When you look at growing markets like sensor analytics for edge computing this is the only model that works at scale. Most new parallel databases aren't using a classic MapReduce model or really anything that requires centralized orchestration.
The distributed databases in open source are a bit of a biased sample, as they tend to copy architectures from each other and are a trailing indicator of the state of the art. For example, you virtually never see facilities for individual nodes to self-orchestrate parallel query execution (this is almost always centralized), which entirely changes how you would go about scaling out e.g. join operations or mixed workloads. Building this facility typically requires a tightly coupled storage and execution engine, but open source systems are strongly biased toward decoupling these things for expediency. Open source database architecture is trapped in a local minimum.
In other words, are you saying that distributed databases with local storage can greatly reduce network I/O with things like predicate pushdown, which is not possible when the database is built on top of a distributed file system?
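A toy sketch of the tradeoff in question, with made-up table and row shapes (nothing here is any real system's API): filtering at the storage node means only matching rows cross the network, instead of the whole table.

```python
# Toy model of predicate pushdown. The table, predicate, and function
# names are all illustrative, not any real database's interface.

TABLE = [{"id": i, "temp": i % 100} for i in range(10_000)]

def scan_without_pushdown(predicate):
    # Storage node ships every row; the query node filters afterward.
    shipped = list(TABLE)                         # 10,000 rows over the wire
    return [r for r in shipped if predicate(r)], len(shipped)

def scan_with_pushdown(predicate):
    # Storage node applies the predicate locally before shipping.
    shipped = [r for r in TABLE if predicate(r)]  # only matches over the wire
    return shipped, len(shipped)

hot = lambda r: r["temp"] > 95

rows_a, sent_a = scan_without_pushdown(hot)
rows_b, sent_b = scan_with_pushdown(hot)
assert rows_a == rows_b        # same query result either way...
print(sent_a, sent_b)          # ...but 10000 vs 400 rows shipped
```

When storage is a dumb distributed file system, you're stuck in the first mode: bytes have to come to the compute before any filtering can happen.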
> If you tightly coupled the execution and storage engine as a single module, which is not intrinsically dependent on the type of data model, it would allow high-performance multi-use within reasonable constraints. Albeit with a very different API than a storage engine.
This is what we are doing with FaunaDB; Fauna's own native semantics are essentially a unified intermediate representation of a variety of high level query languages, forthcoming.
Indeed, but it's not a new idea. This is Bigtable's architecture after all, which is now 15 (?) years old. There are a bunch of advantages: DB developers can focus on the DB, not the storage layer (which is in many ways much harder to get right), you can scale storage and processing independently, you can easily support different tiers of storage (SSD/HDD).
But by far the biggest advantage is operational. Your DB is now stateless! Nodes can die and be replaced in seconds, because they don't store any data; this makes designs like Bigtable's much more practical. You can quickly scale up the processing to deal with spikes without spending hours re-replicating. Operating the storage layer is still inherently challenging, but just has to be figured out once for all your DBs and other processing systems.
While performance can take a hit, this can be mitigated by local caching and good internal networking. It isn't able to meet the lowest levels of latency (sub ms), but in practice it works great for web applications.
I sketched out a lot of architecture diagrams in that, showing how it differs from traditional RDBMS’s, and I tried to keep the post approachable for database folks from different platforms.
In my eyes, Google pioneered the commercial application of this approach. With Bigtable and its underlying Colossus storage layer, they literally pushed the advent of HDFS and all the big-data tooling of a decade ago.
Depends on the use case. You can look at the performance of, say, ClickHouse vs alternatives that separate the storage layer; the performance difference is fairly significant.
> But here in the paper, it compares Aurora to a MySQL synchronous mirroring setup, which is really unfair, IMO. Why the active primary has to replicate everything over the wire to the active standby? Doesn't MySQL's replication support Statement-Based Replication?
The comparison in the paper is against the synchronous-mirroring pattern underlying RDS Multi-AZ [1], which is the existing fully-managed, high-availability MySQL solution Amazon offers, so the direct comparison is appropriate for existing RDS Multi-AZ users considering a migration to Aurora.
One of the neat things about Aurora is that there is very little replication lag between the primary and read replicas. With only a single thread doing replication, a multi-CPU master can easily get ahead of a slave that applies changes on one thread. Lazily materializing the pages gives Aurora 'parallel replication' for free. I think MySQL supports parallel replication now, but I think it is limited when transactions conflict [https://mariadb.com/kb/en/library/parallel-replication/].
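A toy illustration of why conflicts limit parallel replication, using a hypothetical greedy batching scheme (not MySQL's or MariaDB's actual algorithm): transactions with disjoint write sets can be applied concurrently on the replica, while any overlap forces a new serial batch.

```python
# Hypothetical batcher: group logged transactions so that each batch
# contains only non-conflicting write sets and can be applied in parallel.

def batch_for_parallel_apply(transactions):
    """Greedily split (txn_id, write_set) pairs into parallel-safe batches."""
    batches, current, touched = [], [], set()
    for txn_id, write_set in transactions:
        if touched & write_set:          # conflicts with the current batch
            batches.append(current)      # seal it; start a new batch
            current, touched = [], set()
        current.append(txn_id)
        touched |= write_set
    if current:
        batches.append(current)
    return batches

log = [
    ("t1", {"row:1"}),
    ("t2", {"row:2"}),   # disjoint from t1: same batch
    ("t3", {"row:1"}),   # conflicts with t1: forces a new batch
    ("t4", {"row:9"}),
]
print(batch_for_parallel_apply(log))   # [['t1', 't2'], ['t3', 't4']]
```

A workload where every transaction touches the same hot rows degenerates to one transaction per batch, i.e. fully serial apply, which is the limitation mentioned above.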
Another thing I noticed with Aurora is that the incremental cost of storage is extremely cheap compared to the cost of storage in EBS, or to instance storage in EC2. Aurora storage is $0.10/GB, but that data is replicated 6 times, which is more redundancy than EBS provides, and EBS costs the same $0.10/GB for a single copy. This also might be why no one is going to build a similar system to Aurora: it would be hard to sell something like it to cloud users, because the storage costs would be more than what EC2 charges.
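A back-of-the-envelope version of that cost argument, using only the prices quoted above (both assumed to be $/GB-month; the variable names are mine):

```python
# Prices as quoted in the thread, in $/GB-month.
aurora_price = 0.10    # billed on logical size; replication is included
aurora_copies = 6      # physical copies the storage service keeps
ebs_price = 0.10       # per volume, i.e. per copy you provision yourself

# Matching Aurora's 6-way redundancy yourself on EBS would cost:
diy_cost = ebs_price * aurora_copies
print(round(diy_cost / aurora_price, 2))   # 6.0x the Aurora storage price
```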
Funny how the CQRS circle closes: it is an architectural design that overhauls some concepts of database design (the redo log, views) into the application layer (for some, to the extreme of reducing the database to a simple log.) And now this same separation is appearing in a DBMS design :)
As per my understanding, CQRS fits here if you consider the database log as the "event source". But wouldn't it be terrible performance-wise to create the "view" from the event source rather than using the storage layer?
I'm wondering how aurora would behave with queries mixing read and write.
For example, with something like "INSERT INTO my_table(field1, field2, field3, ...) SELECT f1, f2, f3... from root_table WHERE ...", I'm wondering how it will spread the load across nodes.
> I don’t understand how aurora achieves the speed it does with a log based approach. Can someone please clarify?
Aurora splits out 'database' nodes (the server instances you provision and pay for) from 'storage' nodes (a 'multi-tenant scale-out storage service' that automatically performs massively-parallel disk I/O in the background). Instead of MySQL writing various data to tablespaces, redo log, double-write buffer, and binary log, Aurora sends only the redo-log over the network to the storage service (in parallel to 6 nodes/3 AZs for durability).
No need for extra tablespace, double-write buffer, binary-log writes, or extra storage-layer mirroring, since durability is guaranteed as soon as a quorum of storage nodes receives the redo-log. The reduced write amplification results in 7.7x fewer network IOs per transaction at the 'database' layer for Aurora (vs standard MySQL running on EBS networked storage, in the benchmark described in the paper), and 46x fewer disk IOs at the 'storage' layer [1].
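A minimal sketch of the quorum rule involved (the Aurora paper uses 6 copies across 3 AZs with a 4/6 write quorum and a 3/6 read quorum); the function below is illustrative, not Aurora's code:

```python
def is_durable(acks, copies=6, write_quorum=4):
    """A redo-log record counts as durable once a write quorum of storage
    nodes has acknowledged it; up to copies - write_quorum nodes can be
    slow or down without blocking the commit."""
    return acks >= write_quorum

# Two of the six storage nodes being unresponsive doesn't block commits:
print(is_durable(acks=4))   # True  (durable)
print(is_durable(acks=3))   # False (not yet durable)
# A 3/6 read quorum always overlaps any 4/6 write quorum, since 4 + 3 > 6,
# so reads are guaranteed to see at least one up-to-date copy.
```

This is why the database node never waits on the slowest replica: it only needs the fastest four acks out of six.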
Last time we benchmarked Aurora (~2 years ago), its write speed was pretty slow compared to RDS (Postgres RDS was able to achieve 3x the write throughput).
Anecdotally after migrating from RDS MySQL to serverless Aurora there was a noticeable slowdown of our dashboard and reporting tools. Our typical workloads (ecommerce transactions) are slightly slower on average, but the peaks seem down.
I never did a thorough benchmark but I was working with a poorly indexed DB (actually...no indexes) that had millions of records and despite the lack of indices the database still queried quickly.
Read speeds are pretty good but write speeds not so much. For our particular use case 99.99% of queries ran would have been inserts with reports only generated once per month.
I don't remember as this was 2 years ago. We were only concerned with write speeds as that was the majority of queries we would be performing. Read speeds were pretty good.
Be careful though - we use Aurora PG and it's great for what it does, but they do not support managed upgrades across major PG versions yet! We're stuck on 9.6.x because the time to dump and restore our large DB is a non-starter with the rest of the business.