I never understood why systems like Spanner use real time rather than a Lamport clock like this ...
Is it simply because they don't want the clients to have to take/give a clock parameter with every request? Or is there some theoretical reason spanner couldn't work with logical clocking?
So the basic issue is touched on in the article in the Partial Order section.
Lamport Clocks cannot order unrelated events. But for the higher consistency levels databases aim for, ordering unrelated events is exactly what we need.
The classic solution is to extend Lamport clocks into Vector Clocks. As the name implies, each timestamp is a vector with one entry per server, each entry being that server's Lamport clock. The obvious downsides are that the vectors grow with the number of servers, making timestamps considerably more expensive, and that the clock format is tied to a fixed set of participants, which is inflexible.
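To make the difference concrete, here is a minimal sketch of the two update rules (illustrative only, not any particular system's implementation); note that only the vector version can tell you two events are concurrent:

```python
# Minimal sketch: Lamport clock vs. vector clock (illustrative only).

def lamport_on_receive(local: int, received: int) -> int:
    # Jump past both counters, then tick. Gives a total order with a
    # node-id tiebreak, but cannot detect concurrency.
    return max(local, received) + 1

def vector_on_receive(local: dict, received: dict, me: str) -> dict:
    # Element-wise max of the two vectors, then tick our own entry.
    merged = {k: max(local.get(k, 0), received.get(k, 0))
              for k in set(local) | set(received)}
    merged[me] = merged.get(me, 0) + 1
    return merged

def happened_before(a: dict, b: dict) -> bool:
    # a -> b iff a <= b element-wise and a != b. If neither
    # happened_before(a, b) nor happened_before(b, a), they are concurrent.
    keys = set(a) | set(b)
    return a != b and all(a.get(k, 0) <= b.get(k, 0) for k in keys)
```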
One of Spanner's key properties is External Consistency. This is as strong as it gets: Spanner correctly orders all events across the entire system in relation to idealized physical time. It does this through a combination of better-than-typical physical clock synchronization and waiting out any ambiguity window as needed. The latter has a performance cost, but the paper argues that this cost, combined with strong consistency, is preferable to the subtle bugs developers will inevitably introduce without it.
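The "waiting out the ambiguity window" part looks roughly like this. A toy sketch only: EPSILON and tt_now() are stand-ins I made up for the idea behind TrueTime, not its actual API, and the real protocol has more moving parts.

```python
import time

EPSILON = 0.007  # assumed clock uncertainty: true time is within +/- 7 ms

def tt_now():
    """Return an interval (earliest, latest) guaranteed to contain true time."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit(apply_writes):
    # Pick a commit timestamp at the top of the current uncertainty interval...
    commit_ts = tt_now()[1]
    apply_writes(commit_ts)
    # ...then "commit wait": don't acknowledge until commit_ts is definitely
    # in the past, so no later transaction can observe a smaller timestamp.
    while tt_now()[0] <= commit_ts:
        time.sleep(0.001)
    return commit_ts
```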
Another thing in the mix here is Hybrid Logical Clocks, an extension of Lamport Clocks that keeps the logical timestamp within some bounded offset of idealized physical time. This is actually what CockroachDB uses (or at least a variation on it).
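The update rules are small enough to sketch (roughly following the Kulkarni et al. HLC paper; this is a simplification, and the names here are mine, not CockroachDB's):

```python
import time

class HLC:
    """Sketch of a Hybrid Logical Clock: an (l, c) pair where l tracks the
    largest physical time seen and c breaks ties within the same l."""

    def __init__(self):
        self.l = 0
        self.c = 0

    def _pt(self) -> int:
        return int(time.time() * 1000)  # local physical time, in ms

    def tick(self):  # local or send event
        pt = self._pt()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def update(self, l_msg, c_msg):  # receive event
        pt = self._pt()
        new_l = max(self.l, l_msg, pt)
        if new_l == self.l == l_msg:
            self.c = max(self.c, c_msg) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == l_msg:
            self.c = c_msg + 1
        else:
            self.c = 0
        self.l = new_l
        return (self.l, self.c)
```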
To get this stuff to gel in your head, I would suggest reading the primary papers on Lamport clocks, Spanner, and Hybrid Logical Clocks.
There are also a couple of educators/experts with great presentations on this material, namely Kyle Kingsbury (aka Aphyr) and Lindsey Kuper. I apologize for not chasing down the exact talks, but they should be up on YouTube and the like.
>Spanner is capable of correctly ordering all events across the entire system in relation to idealized physical time
As "idealized physical time" itself does not have consistent ordering, surely this is impossible in real life? The notion of a single universal instant is rough approximation which breaks down when separate nodes are moving relative to one another, or are at different gravitational potentials. We're putting gigahertz CPUs on spacecraft now so this isn't just a theoretical problem! How is this resolved? Perhaps an enforced maximum temporal resolution?
> As "idealized physical time" itself does not have consistent ordering, surely this is impossible in real life?
It's possible when you have reliable clocks to provide time bounds, plus knowledge of the physical system to provide space bounds, which together permit clock synchronisation to some accuracy.
If two external agents make requests that conflict or interact (i.e. are non-commutative), and the requests are close enough in time and far enough apart in space, then the lack of true global time (i.e. relativity) means there is no physically unambiguous order to the requests.
But you can delay the responses to non-commutative requests for long enough to impose an unambiguous global order on any external logic that depends on information in the response.
In effect, the "global time" of external agents' events isn't a function of the instant when they make a request; it is an ordering function of both the request and response events, which makes each external request occupy a time range physically. The ranges are managed to ensure there is a global order to any possible information dependencies the external agents may be using. In a way, this is a neat optimisation of vector logical clocks that preserves the logic while reducing the dimensionality of the vector. This can be generalised in large distributed systems to partial dimension reduction.
When the system's clocks are synchronised internally to some accuracy, "close enough in time" can be resolved more precisely, and the delays required to disambiguate requests that are close together in time can be reduced.
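To put rough numbers on that (purely illustrative; these uncertainty figures are assumptions, not measurements): if the disambiguating delay has to cover roughly twice the clock uncertainty, tighter synchronisation buys lower latency directly.

```python
# Illustrative only: the wait needed to disambiguate is on the order of
# twice the clock uncertainty epsilon, so better sync means shorter waits.
for source, epsilon_ms in [("loose NTP", 100.0),
                           ("well-tuned NTP", 10.0),
                           ("GPS + atomic clocks", 5.0)]:
    print(f"{source:20s} epsilon ~{epsilon_ms:5.1f} ms -> wait ~{2 * epsilon_ms:5.1f} ms")
```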
It is possible Spanner might develop an inconsistency near the event horizon of a black hole. Something to watch out for.
'As "idealized physical time" itself does not have consistent ordering' - No. A consistent "causality" ordering exists relative to the big bang. A simultaneous "now" does not exist, but that does not prevent relative orderings from being discerned. An enforced max [absolute] time resolution is not necessary, but a consistent universal ordering requires relative speeds, which require a reference frame or gauge, and a sequencing clock for each observer. With regards to satellites, the frame is motion relative to the [center of the] Earth.
Is this distinction even necessary to make here? For all practical database usages, I doubt you need enough precision to have to account for special and general relativity.
There’s a difference between causality ordering and global total ordering.
Lamport timestamps and vectors help you get causality ordering. This is not enough for Spanner.
Spanner needs total global ordering. A perfect real-time clock would let you achieve total ordering. As you’ve pointed out, Spanner relies on a clock. However, no clock is perfect, so Spanner has to account for that margin of error. And because of its clock (TrueTime), that margin of error is small enough to keep latency low.
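A tiny, made-up illustration of the gap: Lamport timestamps (with a node-id tiebreak) always give you *a* total order, but nothing ties that order to what an outside observer actually saw.

```python
# Hypothetical counters on two nodes that never exchange messages
# about these transactions.
node_a_counter = 100   # node A has processed many prior events
node_b_counter = 1     # node B has been mostly idle

# Real-time order, as seen from outside:
#   txn_a commits on A first  -> Lamport timestamp (101, "A")
#   txn_b commits on B later  -> Lamport timestamp (2, "B")
txn_a = (node_a_counter + 1, "A")
txn_b = (node_b_counter + 1, "B")

# The Lamport total order puts txn_b before txn_a, contradicting the
# order the external observer saw -- which external consistency forbids.
assert txn_b < txn_a
```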
I don't think there's a catchy one-liner which can explain why Spanner "needs" real time instead of Lamport clock. Spanner is a system with some very concrete characteristics (Paxos consensus, strongly consistent pessimistic writes, non-blocking snapshot reads etc.). There are ways to use Lamport clock instead of TrueTime, but then you'll also have to lessen some of the correctness/performance guarantees [and thus it would no longer be "Spanner"]. To fully understand the trade-offs you kinda have to dive into it and understand the system as a whole.
Spanner only uses clocks, aka “TrueTime”, for transactions. It uses an estimate of the clock error to build waits into the system so transactions can be globally ordered.
The underlying system is all built on Paxos though, which uses logical clocks under the hood. That is, the fine-grained distributed log is all done with logical clocks and events.
Why doesn’t it use logical clocks for transactions? Well, I guess it is just a lot simpler and scales better while providing global ordering. The tradeoff is additional milliseconds to decide a transaction, which is probably fine for high-level transactions.
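As a concrete picture of "logical clocks under the hood" (schematic; (round, node_id) ballot numbers are a common Paxos convention rather than Spanner's exact scheme):

```python
# Schematic: a Paxos ballot/proposal number is essentially a Lamport-style
# logical clock -- a (round, node_id) pair bumped past any higher ballot
# the node has seen, with no wall-clock time involved.

def next_ballot(my_id: int, highest_seen: tuple) -> tuple:
    round_seen, _ = highest_seen
    return (round_seen + 1, my_id)

ballot = next_ballot(my_id=3, highest_seen=(7, 1))
assert ballot == (8, 3) and ballot > (7, 1)
```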
But the whole point is that "real time" doesn't exist in any comparable sense, due to (if nothing else) relativity, which can't be ignored at the speed-of-light level and is also relevant at the analogous, slower, speed-of-packets level. You need something like vector clocks, interval tree clocks, or (at least) Lamport clocks to understand causal impacts: if something supposedly happened somewhere, but you were too far away to have noticed its effects by the time you did something else, you simply can't claim that what you did happened "after" the other thing. The ordering of two causally disconnected events is fundamentally subjective, and you can't sweep that under the rug.
I’m not quite sure what point you’re making here. Spanner ensures causality by waiting until TrueTime tells you that the commit time is considered “in the past” for all replicas. There’s nothing swept under the rug here and this is a theoretically sound way of ensuring consistency (as long as you’re willing to sacrifice correctness if you’re not able to keep clocks in sync).
The point of a database is to have data you can compute on. Nobody cares about the nanosecond-precision time at which each event happened, or the exact real-world ordering between them. What they care about is that all replicas process events in the same order, so that they get deterministic results.
The strong consistency Spanner provides gives that guarantee, whereas causal consistency would not.
I feel like the "sweeping under the rug" is being done here by way of engineering compromises (assumptions & bounds).
The compromise in this case happens to be something along the lines of "if the GPS constellation fails, TrueTime fails (or gets really slow)". For better or worse, all of these TrueTime-style products (AWS has one now too) are effectively dependent on GPS to provide a global synchronous context. I don't think there is any feasible way to bootstrap or maintain a distributed clock with higher precision than its terrestrial network links would otherwise allow.
The way I look at this, the GPS constellation contributes to the core of these database engines. It provides the most essential identity for all transactions (precise & unique timestamps). The actual database servers are just storing the data and ensuring transactions are interleaved properly.
It is probably reasonable from a practical perspective to align your failure modes with those of other telecom and military organizations.
To deal with this threat model, Spanner has atomic clocks in each data center, called “Armageddon masters”, the most badass name for a component in any distributed system I know of.
Exactly, and Google claims to have made those bounds so small they can be assumed to be zero for all practical applications. They mention the bound size somewhere, and if you have enough leading zeroes for your use case you can just round it down.
Do you have a source for this? From my understanding these bounds are actively being used during the commit protocol: The leader will wait (i.e. just run sleep()) for a period of time. This period is determined from the bounds and ensures that the commit is definitely in the past [across all replicas].
Cockroach uses Hybrid Logical Clocks, which are substantially different from Lamport Clocks. HLCs do avoid the need for atomic clocks, but they are different enough from Lamport Clocks that I think they should be seen as their own thing.
If you want to write a consistent distributed system or replicated state machine, there are roughly two choices: two-phase commit (“2PC”) and Paxos (of which there are many variants). Lamport clocks are one of the stepping stones to Paxos (along with the FLP result). It’s a fun field of study.