I believe you could implement this paged attention kernel with a small change to the flash attention kernel, since both of them work on the key/value cache at the block level.
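To make the block-level point concrete, here is a minimal pure-Python sketch (not a GPU kernel, and not the code of either project) of single-query attention that walks the key/value cache one block at a time with a flash-attention-style online softmax. The only paged-specific part is the `block_table` lookup from logical to physical blocks. All names (`attend_paged`, `attend_dense`, `block_table`, `k_blocks`, `v_blocks`) are hypothetical, and it assumes `seq_len >= 1`.

```python
import math


def attend_paged(q, k_blocks, v_blocks, block_table, seq_len, block_size):
    """Single-query attention over a paged KV cache (illustrative only).

    k_blocks / v_blocks: physical blocks, each a list of key / value vectors.
    block_table: block_table[i] is the physical block holding logical block i.
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf               # max score seen so far
    denom = 0.0                           # softmax denominator, rescaled as we go
    acc = [0.0] * len(v_blocks[0][0])     # unnormalized weighted sum of values

    for logical_idx, physical_idx in enumerate(block_table):
        start = logical_idx * block_size
        n_valid = min(block_size, seq_len - start)  # last block may be partly filled
        if n_valid <= 0:
            break
        # The indirection below is the paged part; a contiguous layout
        # would simply slice keys[start:start + n_valid] instead.
        k_blk = k_blocks[physical_idx][:n_valid]
        v_blk = v_blocks[physical_idx][:n_valid]

        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)  # rescale earlier partial sums
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * vi for a, vi in zip(acc, v)]
        running_max = new_max

    return [a / denom for a in acc]


def attend_dense(q, keys, values):
    """Reference: ordinary softmax attention over contiguous keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]
```

If the blocks are laid out so that `block_table` is `[0, 1, 2, ...]`, the loop reduces to the contiguous, tile-by-tile case.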
We wanted to ensure that everyone on the team had explored the full design space, and that everyone agreed the chosen solution was the best we could think of, given the information and constraints at the time.
Once we have done that, everyone is aligned and it doesn't matter who implements it; it will be implemented in the same way.
Co-creator of Delos here. Let me give a concrete example for [18].
We have a "StorageEngine" interface with an implementation, "RocksDbStorageEngine", that wraps RocksDB to provide key-value interfaces; however, we were worried that someone would leak implementation details (RocksDB is so powerful) through the "StorageEngine" interface.
To solve the problem, we wrote another, fully in-memory storage engine implementation called "MemoryStorageEngine", which acts as a specification of the "StorageEngine" interface.
We run all storage engine tests against both implementations (RocksDB and in-memory), except for the durability part; we also configured the system to run over either storage engine.
By doing this:
- we can use the memory-based one to enforce the behavior of the RocksDB-based one, by running unit tests against both implementations.
- no one can easily leak implementation details of RocksDB through the StorageEngine interface; to do that, you first need to add those advanced features to the memory-based StorageEngine as well!
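A minimal sketch of the pattern, for illustration only. The method names (`put`, `get`, `delete`) are assumptions, since the actual "StorageEngine" interface isn't shown here, and `StandInDiskStorageEngine` is a hypothetical stand-in for "RocksDbStorageEngine" so that the example runs without RocksDB installed.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple


class StorageEngine(ABC):
    """Hypothetical minimal key-value interface (method names are assumed)."""

    @abstractmethod
    def put(self, key: bytes, value: bytes) -> None: ...

    @abstractmethod
    def get(self, key: bytes) -> Optional[bytes]: ...

    @abstractmethod
    def delete(self, key: bytes) -> None: ...


class MemoryStorageEngine(StorageEngine):
    """A plain dict: simple enough to read as the specification."""

    def __init__(self) -> None:
        self._data: Dict[bytes, bytes] = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        return self._data.get(key)

    def delete(self, key: bytes) -> None:
        self._data.pop(key, None)


class StandInDiskStorageEngine(StorageEngine):
    """Stand-in for a RocksDB-backed engine: an append-only write log
    where a value of None marks a deletion (a tombstone)."""

    def __init__(self) -> None:
        self._log: List[Tuple[bytes, Optional[bytes]]] = []

    def put(self, key: bytes, value: bytes) -> None:
        self._log.append((key, value))

    def get(self, key: bytes) -> Optional[bytes]:
        # The most recent record for the key wins.
        for logged_key, logged_value in reversed(self._log):
            if logged_key == key:
                return logged_value
        return None

    def delete(self, key: bytes) -> None:
        self._log.append((key, None))


def check_contract(engine: StorageEngine) -> None:
    """Behavior every implementation must share, whatever it is built on."""
    assert engine.get(b"k") is None
    engine.put(b"k", b"v1")
    assert engine.get(b"k") == b"v1"
    engine.put(b"k", b"v2")          # overwrite
    assert engine.get(b"k") == b"v2"
    engine.delete(b"k")
    assert engine.get(b"k") is None
    engine.delete(b"k")              # deleting a missing key is a no-op


def run_contract_tests() -> List[str]:
    """Run the same checks against every implementation."""
    passed = []
    for factory in (MemoryStorageEngine, StandInDiskStorageEngine):
        check_contract(factory())
        passed.append(factory.__name__)
    return passed
```

Because `check_contract` only sees the interface, any test that needs a backend-specific feature cannot be written until the in-memory engine supports that feature too.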
This is just one example, and we have a list of such abstractions, such as MetadataStore, Logs, etc. We created multiple implementations of each of these interfaces and ensured the system could run on any combination of those implementations.
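One way to exercise "any combination" is to take the cross product of the implementations and run the same smoke test on each pairing. This is only a sketch: every class below (`DictKV`, `ListKV`, `ListLog`, `DictLog`, `System`) is a hypothetical toy, not a Delos component.

```python
import itertools


class DictKV:
    """Toy storage engine backed by a dict."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)


class ListKV:
    """Toy storage engine backed by a list of (key, value) pairs."""
    def __init__(self):
        self._items = []
    def put(self, key, value):
        self._items = [(k, v) for k, v in self._items if k != key] + [(key, value)]
    def get(self, key):
        return next((v for k, v in self._items if k == key), None)


class ListLog:
    """Toy log backed by a list; append returns the entry's position."""
    def __init__(self):
        self._entries = []
    def append(self, entry):
        self._entries.append(entry)
        return len(self._entries) - 1
    def read(self, position):
        return self._entries[position]


class DictLog:
    """Toy log backed by a dict keyed by position."""
    def __init__(self):
        self._entries = {}
    def append(self, entry):
        position = len(self._entries)
        self._entries[position] = entry
        return position
    def read(self, position):
        return self._entries[position]


class System:
    """Depends only on the put/get and append/read methods, never on a concrete class."""
    def __init__(self, storage, log):
        self.storage = storage
        self.log = log
    def write(self, key, value):
        position = self.log.append((key, value))  # record the write first
        self.storage.put(key, value)              # then apply it
        return position
    def read(self, key):
        return self.storage.get(key)


def run_all_combinations():
    """Run one smoke test per (storage, log) pairing and collect the results."""
    results = {}
    for storage_cls, log_cls in itertools.product([DictKV, ListKV], [ListLog, DictLog]):
        system = System(storage_cls(), log_cls())
        system.write("a", 1)
        last = system.write("a", 2)
        assert system.log.read(last) == ("a", 2)
        results[(storage_cls.__name__, log_cls.__name__)] = system.read("a")
    return results
```

With two implementations per interface this covers four pairings; adding a third interface multiplies the count again, so in practice the per-combination test has to stay cheap.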
The first engineer on the Delos team here. This design decision gave us an edge: we delivered the first production cluster in 8 months by layering over existing systems, and we smoothly migrated all underlying implementations over the following 2 years without our customers being aware of it (0 downtime). This design decision is well captured by the OSDI '20 paper "Virtual Consensus in Delos".
Very interesting. How did you deal with the issue where the old API had some feature X that wasn't supported in the new API but some customers depended on feature X?
I guess that's a hard question for user-facing APIs: you need to support both the old and the new features during the migration.
In our case, this design principle mostly applied to our internal APIs. In particular, we designed our internal subcomponents (the consensus protocol especially) so that we could swap them without any downtime.
Right. So it seems you did a lift and shift over two years without changing the basic API surface. That’s OK, but stability of requirements throughout a rearchitecting project is a luxury, not the norm.