Hi there! There's work to do on both, and I encourage you to contribute to core if you like! We could really use performance improvements on the checkers--Knossos and set-full are great, but much more expensive than they need to be. It'd be nice to have a robust mechanism for adjusting CLOCK_MONOTONIC & friends. I don't have a good way to inject filesystem-level faults, like simulated node crashes where non-fsynced data is lost. jepsen.control is plagued by what I think are race conditions in JSch, which I've never had the chance to dig into and fix.
Database hacker here. I have dreamed of a filesystem fault injection system like this: a FUSE filesystem that just passes through to a real filesystem, but also writes a log of syscalls. Then a test mode that can simulate crashes at various points in the log history, with pages (of some size) flushed at random (or according to other settings) until an fsync is issued and completes. This could be used for recovery testing, to look for missing fsyncs and to check torn-page resilience.
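To make the "pages flushed at random until fsync completes" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the log format (tuples of `("write", offset, data)` and `("fsync",)`), the `crash_image` name, the 4096-byte default page size, and the 50% chance that an unsynced page survives. It models a single file and ignores metadata, renames, and directory fsyncs.

```python
import random

PAGE_SIZE = 4096  # assumed; the comment above leaves the page size open


def crash_image(log, crash_at, page_size=PAGE_SIZE, rng=None):
    """Return one possible on-disk image of a file if the machine died
    after the first `crash_at` entries of `log`.

    `log` is a list of ("write", offset, data) and ("fsync",) tuples
    (a hypothetical format, not something an existing tool emits).
    """
    rng = rng or random.Random()
    durable = bytearray()  # contents covered by a completed fsync
    pending = []           # (offset, chunk) pieces written since the last fsync

    def apply(image, offset, data):
        # Grow the image with zero bytes if the write lands past the end.
        end = offset + len(data)
        if len(image) < end:
            image.extend(b"\0" * (end - len(image)))
        image[offset:end] = data

    for entry in log[:crash_at]:
        if entry[0] == "write":
            _, offset, data = entry
            # Split at page boundaries so each page can be lost on its own.
            pos = 0
            while pos < len(data):
                room = page_size - ((offset + pos) % page_size)
                chunk = data[pos:pos + room]
                pending.append((offset + pos, chunk))
                pos += len(chunk)
        elif entry[0] == "fsync":
            # A completed fsync makes everything written so far durable.
            for off, chunk in pending:
                apply(durable, off, chunk)
            pending.clear()

    # At the crash, each unsynced page independently survives or is lost.
    image = bytearray(durable)
    for off, chunk in pending:
        if rng.random() < 0.5:
            apply(image, off, chunk)
    return bytes(image)
```

Calling `crash_image(log, n)` repeatedly with different seeds gives different plausible post-crash files for the same crash point; a recovery test would hand each one to the database and check that it starts up cleanly.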
I've used it so far to find and deterministically reproduce situations where the Haskell compiler and build tools wrote files in non-atomic/non-durable ways that would lead to failures when the machine was hard-rebooted at the wrong time.
Nice! Yeah, syscall interception may be a better way to do this than FUSE.
So, the thing I want is the ability to take a log (which probably also has checkpoints in it, I dunno) captured during normal operation of a database, and then run a super long, slow test that puts the (fake) filesystem into its state as of various points along that log, with some not-yet-fsync'd changes lost on a sector-by-sector basis (most systems rely on writes of at least a 512-byte sector being atomic, so you'd probably do it at that size). Then you'd test whether the database can recover successfully and pass some sanity test at all those points along the log.
People do this kind of testing by pulling the power on busy databases. This would simulate pulling the power at a huge number of different moments, on a maximally unforgiving filesystem/hardware (= 512-byte sector atomicity, flushed in random order).
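For a rough feel of how large that "huge number" gets, here is a small Python sketch that enumerates every crash state for a toy log: each crash point, combined with each subset of dirty 512-byte sectors that might have reached disk. The log format (`("write", offset, length)` and `("fsync",)`) and the function names are made up for this example, and the model is simplified: a sector is either kept or lost, with no notion of which of several overlapping writes it holds.

```python
from itertools import combinations

SECTOR = 512  # sector size assumed atomic, as in the comment above


def unsynced_sectors(log, crash_at):
    """Sector numbers touched by writes since the last fsync before crash_at."""
    dirty = set()
    for entry in log[:crash_at]:
        if entry[0] == "fsync":
            dirty.clear()
        else:
            _, offset, length = entry
            first = offset // SECTOR
            last = (offset + length - 1) // SECTOR
            dirty.update(range(first, last + 1))
    return dirty


def crash_states(log):
    """Yield (crash_at, surviving_sectors) for every crash point and
    every subset of the sectors that were dirty at that point."""
    for crash_at in range(len(log) + 1):
        dirty = sorted(unsynced_sectors(log, crash_at))
        for k in range(len(dirty) + 1):
            for kept in combinations(dirty, k):
                yield crash_at, frozenset(kept)


toy_log = [("write", 0, 1024), ("write", 4096, 512), ("fsync",), ("write", 0, 512)]
print(sum(1 for _ in crash_states(toy_log)))
```

Even this four-entry log yields 16 states, and the count doubles with every additional dirty sector between fsyncs, which is why a real run would likely sample states at random instead of enumerating them all.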
Thank you! Clojure is a natural fit for Jepsen: it has access to broad library support (key for testing lots of different databases), makes data manipulation easy (simplifies writing checkers), and has a good concurrency story (important for a concurrent testing system). It's got reasonable performance and a very expressive language core.
Writing tests is also great! Pick a system you like and dive in. There's a full tutorial (https://github.com/jepsen-io/jepsen/tree/master/doc/tutorial) that walks you through designing and running a test from scratch.