Separating the storage layer from the database and query layer seems to be the future of databases - there's no reason for every database to re-implement the storage and replication layer just to provide a different query language or data model. This is also what makes FoundationDB so attractive to me. Are there any other large projects doing this?
People often greatly underestimate the performance loss incurred by separating the storage layer from the execution layer. The difference is an integer factor. In a high-performance database design, the execution scheduling behavior must be tightly and intricately coupled to the I/O operation scheduling behavior or you lose much low-hanging throughput optimization.
That said, this can be solved by changing the level of abstraction. If you tightly coupled the execution and storage engine as a single module, which is not intrinsically dependent on the type of data model, it would allow high-performance multi-use within reasonable constraints. Albeit with a very different API than a storage engine. But that is not a thing that exists in open source AFAIK.
The loss from separating storage and execution may not matter for databases where operational efficiency is not paramount (e.g. because the scale is too small for it to matter), but many database engine designers are not going to take that use-case-limiting performance hit because the CapEx/OpEx implications are large.
I think the point is that now that we're in the era of distributed/cloud/partitioned databases, the query logic cannot necessarily be colocated with the storage, so might as well get the advantages of decoupling.
The performance gains at the distributed level seem to rely on either parallelized compute (à la MapReduce) or over-indexing (à la Elasticsearch/GIN) rather than the sort of efficiency gains that were made at the level of optimizing between CPU/cache/memory/disk. I think modern problems regarding distributed databases are concerned with keeping the time complexity from exploding as the size of the data does.
The query logic/execution is pushed down to the storage in most reasonable distributed database architectures, along with enough metadata that the individual storage engines are fully aware of (their piece of) the execution plan and orchestrate it directly with other nodes e.g. for joins. The reason for local storage is that this will be almost completely bandwidth bound. When you look at growing markets like sensor analytics for edge computing this is the only model that works at scale. Most new parallel databases aren't using a classic MapReduce model or really anything that requires centralized orchestration.
The distributed databases in open source are a bit of a biased sample, as they tend to copy architectures from each other and are a trailing indicator of the state of the art. For example, you virtually never see facilities for individual nodes to self-orchestrate parallel query execution (this is almost always centralized), which entirely changes how you would go about scaling out e.g. join operations or mixed workloads. Building this facility typically requires a tightly coupled storage and execution engine, but open source systems are strongly biased toward decoupling these things for expediency. Open source database architecture is trapped in a local minimum.
In other words, are you saying that distributed databases with local storage can greatly reduce network I/O with things like predicate pushdown, which is not possible when the database is built on top of a distributed file system?
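A toy sketch of the tradeoff in question, with made-up table and row shapes (nothing here is any real system's API): filtering at the storage node means only matching rows cross the network, instead of the whole table.

```python
# Toy model of predicate pushdown. The table, predicate, and function
# names are all illustrative, not any real database's interface.

TABLE = [{"id": i, "temp": i % 100} for i in range(10_000)]

def scan_without_pushdown(predicate):
    # Storage node ships every row; the query node filters afterward.
    shipped = list(TABLE)                         # 10,000 rows over the wire
    return [r for r in shipped if predicate(r)], len(shipped)

def scan_with_pushdown(predicate):
    # Storage node applies the predicate locally before shipping.
    shipped = [r for r in TABLE if predicate(r)]  # only matches over the wire
    return shipped, len(shipped)

hot = lambda r: r["temp"] > 95

rows_a, sent_a = scan_without_pushdown(hot)
rows_b, sent_b = scan_with_pushdown(hot)
assert rows_a == rows_b        # same query result either way...
print(sent_a, sent_b)          # ...but 10000 vs 400 rows shipped
```

When storage is a dumb distributed file system, you're stuck in the first mode: bytes have to come to the compute before any filtering can happen.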
> If you tightly coupled the execution and storage engine as a single module, which is not intrinsically dependent on the type of data model, it would allow high-performance multi-use within reasonable constraints. Albeit with a very different API than a storage engine.
This is what we are doing with FaunaDB; Fauna's own native semantics are essentially a unified intermediate representation of a variety of high level query languages, forthcoming.
Indeed, but it's not a new idea. This is Bigtable's architecture after all, which is now 15 (?) years old. There are a bunch of advantages: DB developers can focus on the DB, not the storage layer (which is in many ways much harder to get right), you can scale storage and processing independently, you can easily support different tiers of storage (SSD/HDD).
But by far the biggest advantage is operational. Your DB is now stateless! Nodes can die and be replaced in seconds, because they don't store any data; this makes designs like Bigtable's much more practical. You can quickly scale up the processing to deal with spikes without spending hours re-replicating. Operating the storage layer is still inherently challenging, but just has to be figured out once for all your DBs and other processing systems.
While performance can take a hit, this can be mitigated by local caching and good internal networking. It isn't able to meet the lowest levels of latency (sub ms), but in practice it works great for web applications.
I sketched out a lot of architecture diagrams in that, showing how it differs from traditional RDBMS’s, and I tried to keep the post approachable for database folks from different platforms.
In my eyes, Google pioneered the commercial application of this approach. With Bigtable and its underlying Colossus storage layer, they literally pushed the advent of HDFS and all the big-data tooling of a decade ago.
Depends on the use case. You can look at the performance of, say, ClickHouse vs alternatives that separate the storage layer; the performance difference is fairly significant.
> But here in the paper, it compares Aurora to a MySQL synchronous mirroring setup, which is really unfair, IMO. Why the active primary has to replicate everything over the wire to the active standby? Doesn't MySQL's replication support Statement-Based Replication?
The comparison in the paper is against the synchronous-mirroring pattern underlying RDS Multi-AZ [1], which is the existing fully-managed, high-availability MySQL solution Amazon offers, so the direct comparison is appropriate for existing RDS Multi-AZ users considering a migration to Aurora.
One of the neat things about Aurora is that there is very little replication lag between the primary and read replicas. With only a single thread doing replication, a multi-CPU master can easily get ahead of a slave that applies changes on one thread. Lazily materializing the pages gives Aurora 'parallel replication' for free. I think MySQL supports parallel replication now, but I think it is limited when transactions conflict [https://mariadb.com/kb/en/library/parallel-replication/].
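A toy illustration of why conflicts limit parallel replication, using a hypothetical greedy batching scheme (not MySQL's or MariaDB's actual algorithm): transactions with disjoint write sets can be applied concurrently on the replica, while any overlap forces a new serial batch.

```python
# Hypothetical batcher: group logged transactions so that each batch
# contains only non-conflicting write sets and can be applied in parallel.

def batch_for_parallel_apply(transactions):
    """Greedily split (txn_id, write_set) pairs into parallel-safe batches."""
    batches, current, touched = [], [], set()
    for txn_id, write_set in transactions:
        if touched & write_set:          # conflicts with the current batch
            batches.append(current)      # seal it; start a new batch
            current, touched = [], set()
        current.append(txn_id)
        touched |= write_set
    if current:
        batches.append(current)
    return batches

log = [
    ("t1", {"row:1"}),
    ("t2", {"row:2"}),   # disjoint from t1: same batch
    ("t3", {"row:1"}),   # conflicts with t1: forces a new batch
    ("t4", {"row:9"}),
]
print(batch_for_parallel_apply(log))   # [['t1', 't2'], ['t3', 't4']]
```

A workload where every transaction touches the same hot rows degenerates to one transaction per batch, i.e. fully serial apply, which is the limitation mentioned above.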
Another thing I noticed with Aurora is that the incremental cost of storage is extremely cheap compared to the cost of storage in EBS, or to instance storage in EC2. Aurora storage is $0.10/GB, but that data is replicated 6 times, which is more redundancy than EBS provides, and EBS costs the same $0.10/GB for a single copy. This also might be why no one is going to build a similar system to Aurora: it would be hard to sell something like it to cloud users, because the storage costs would be more than what EC2 charges.
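A back-of-the-envelope version of that cost argument, using only the prices quoted above (both assumed to be $/GB-month; the variable names are mine):

```python
# Prices as quoted in the thread, in $/GB-month.
aurora_price = 0.10    # billed on logical size; replication is included
aurora_copies = 6      # physical copies the storage service keeps
ebs_price = 0.10       # per volume, i.e. per copy you provision yourself

# Matching Aurora's 6-way redundancy yourself on EBS would cost:
diy_cost = ebs_price * aurora_copies
print(round(diy_cost / aurora_price, 2))   # 6.0x the Aurora storage price
```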
Funny how the CQRS circle closes: it is an architectural design that overhauls some concepts of database design (the redo log, views) into the application layer (for some, to the extreme of reducing the database to a simple log.) And now this same separation is appearing in a DBMS design :)
As per my understanding, CQRS fits here if you consider the database log as the "event source". But wouldn't it be terrible performance-wise to create the "view" from the event source rather than using the storage layer?
I'm wondering how aurora would behave with queries mixing read and write.
For example, with something like "INSERT INTO my_table(field1, field2, field3, ...) SELECT f1, f2, f3... from root_table WHERE ...", I'm wondering how it will spread the load across nodes.
> I don’t understand how aurora achieves the speed it does with a log based approach. Can someone please clarify?
Aurora splits out 'database' nodes (the server instances you provision and pay for) from 'storage' nodes (a 'multi-tenant scale-out storage service' that automatically performs massively-parallel disk I/O in the background). Instead of MySQL writing various data to tablespaces, redo log, double-write buffer, and binary log, Aurora sends only the redo-log over the network to the storage service (in parallel to 6 nodes/3 AZs for durability).
No need for extra tablespace, double-write buffer, binary-log writes, or extra storage-layer mirroring, since durability is guaranteed as soon as a quorum of storage nodes receives the redo-log. The reduced write amplification results in 7.7x fewer network IOs per transaction at the 'database' layer for Aurora (vs standard MySQL running on EBS networked storage, in the benchmark described in the paper), and 46x fewer disk IOs at the 'storage' layer [1].
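A minimal sketch of the quorum rule involved (the Aurora paper uses 6 copies across 3 AZs with a 4/6 write quorum and a 3/6 read quorum); the function below is illustrative, not Aurora's code:

```python
def is_durable(acks, copies=6, write_quorum=4):
    """A redo-log record counts as durable once a write quorum of storage
    nodes has acknowledged it; up to copies - write_quorum nodes can be
    slow or down without blocking the commit."""
    return acks >= write_quorum

# Two of the six storage nodes being unresponsive doesn't block commits:
print(is_durable(acks=4))   # True  (durable)
print(is_durable(acks=3))   # False (not yet durable)
# A 3/6 read quorum always overlaps any 4/6 write quorum, since 4 + 3 > 6,
# so reads are guaranteed to see at least one up-to-date copy.
```

This is why the database node never waits on the slowest replica: it only needs the fastest four acks out of six.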
Last time we benchmarked Aurora (~2 years ago), its write speed was pretty slow compared to RDS (Postgres RDS was able to achieve 3x the write throughput).
Anecdotally after migrating from RDS MySQL to serverless Aurora there was a noticeable slowdown of our dashboard and reporting tools. Our typical workloads (ecommerce transactions) are slightly slower on average, but the peaks seem down.
I never did a thorough benchmark but I was working with a poorly indexed DB (actually...no indexes) that had millions of records and despite the lack of indices the database still queried quickly.
Read speeds are pretty good but write speeds not so much. For our particular use case 99.99% of queries ran would have been inserts with reports only generated once per month.
I don't remember as this was 2 years ago. We were only concerned with write speeds as that was the majority of queries we would be performing. Read speeds were pretty good.
Be careful though - we use Aurora PG and it's great for what it does, but they do not support managed upgrades across major PG versions yet! We're stuck on 9.6.x because the time to dump and restore our large DB is a non-starter with the rest of the business.