As I understand it, it unblocks a dataflow to provide answers with respect to a user-time processed as-of a system-time, which is certainly an improvement.
But you'd also have unbounded growth of tracked timestamps unless you're _also_ able to extract promises from the user that no further records at (or below) a user-time will be submitted... a promise they may or may not be able to make.
Do I have that right?
Yes, I think so. If you want to be able to handle out-of-order data, I don't think there is a way to garbage collect old time periods and still produce correct results unless you stop accepting new data for those time periods.
With the differential dataflow code I wrote for this post, that cutoff point for gc is also coupled to when you can first emit results, creating a sharp tradeoff between latency of downstream results and how out-of-order upstream data is allowed to be. With bitemporal timestamps or multiple watermarks those are decoupled, so you can emit provisional results and correct them later without giving up internal consistency, and also make a totally separate decision about when to garbage collect. That doesn't remove the problem entirely, but it means that you can make decisions about latency and decisions about gc separately instead of them being controlled by one variable.
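To make the decoupling concrete, here's a rough sketch in plain rust (this is not the actual differential dataflow code from the post - the names `BiTime`, `emit_frontier` and `gc_frontier` are just mine):

```rust
// A bitemporal timestamp: system time (when we processed a record)
// and event time (when the user says it happened).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct BiTime {
    system: u64,
    event: u64,
}

impl BiTime {
    // Only a partial order - two times can be incomparable, which is what
    // lets late event-time data coexist with newer system-time data.
    fn less_equal(&self, other: &BiTime) -> bool {
        self.system <= other.system && self.event <= other.event
    }
}

struct State {
    // Uncompacted updates, keyed by their bitemporal time.
    updates: Vec<(BiTime, i64)>,
    // We may emit (provisional) output for any system time <= emit_frontier...
    emit_frontier: u64,
    // ...and separately promise never to distinguish event times below
    // gc_frontier, which is what allows compaction.
    gc_frontier: u64,
}

impl State {
    // Emitting only needs the system-time frontier to advance. Results at
    // `system` are provisional and can be corrected at later system times.
    fn ready_to_emit(&self, system: u64) -> bool {
        system <= self.emit_frontier
    }

    // Compaction is a separate decision: fold everything below the event-time
    // gc frontier into one summary update (a crude stand-in for proper
    // lattice compaction).
    fn compact(&mut self) {
        let gc = self.gc_frontier;
        let mut summary = 0i64;
        self.updates.retain(|(t, diff)| {
            if t.event < gc {
                summary += *diff;
                false
            } else {
                true
            }
        });
        if summary != 0 {
            self.updates.push((BiTime { system: 0, event: gc }, summary));
        }
    }
}

fn main() {
    // Incomparable times: later in system time, earlier in event time.
    let a = BiTime { system: 3, event: 9 };
    let b = BiTime { system: 5, event: 1 };
    assert!(!a.less_equal(&b) && !b.less_equal(&a));

    let mut state = State {
        updates: vec![
            (BiTime { system: 3, event: 1 }, 1),
            (BiTime { system: 5, event: 9 }, 1),
        ],
        emit_frontier: 4,
        gc_frontier: 2,
    };
    // We can emit results as of system time 4 even though old event times are
    // still open, and compact event times below 2 regardless of either choice.
    assert!(state.ready_to_emit(4));
    state.compact();
    println!("{:?}", state.updates);
}
```

The point is just that `ready_to_emit` only looks at system time while `compact` only looks at event time, so tuning one doesn't force your hand on the other.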
With flink datastreams there are a lot of different levers to pull. The code that Vasia contributed recently waits for watermarks like differential dataflow. There is also a notion of triggers that emit early results from windowed operators, but these are not coordinated across different operators so they can result in internal inconsistency.
The flink table api, as far as I can tell, mostly just ignores event-time outside of windowed aggregates. So it doesn't have to confront these tradeoffs.
> the broader context is building a view over transaction decisions made by another system(s).
Most of the time this is fine. If upstream is:
* a transactional database, we get ordered inputs
* another system that has a notion of watermarks or liveness (eg spanner), we can use their guarantees to decide when to gc
* a bunch of phones or a sensor network (eg mobile analytics), then we pick a cutoff that balances resource usage with data quality (rough sketch below), and it's fine that we're not fully consistent with upstream because we're never compared against it
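The cutoff choice itself is simple - something like this (the function and parameter names are hypothetical, not from any particular system):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Hypothetical policy for choosing the event-time cutoff below which we gc.
// `upstream_watermark` is whatever liveness guarantee upstream gives us
// ("no more records at or below this event time"); `max_lateness_ms` is the
// budget we pick ourselves when there is no such guarantee.
fn gc_cutoff(upstream_watermark: Option<u64>, max_lateness_ms: u64) -> u64 {
    match upstream_watermark {
        // Ordered input or upstream watermarks: just reuse their guarantee.
        Some(watermark) => watermark,
        // Phones / sensor networks: trade resource usage against data quality
        // by dropping anything more than `max_lateness_ms` behind the wall clock.
        None => {
            let now_ms = SystemTime::now()
                .duration_since(UNIX_EPOCH)
                .expect("clock is before the unix epoch")
                .as_millis() as u64;
            now_ms.saturating_sub(max_lateness_ms)
        }
    }
}

fn main() {
    // With no upstream guarantee, accept data up to an hour late.
    println!("gc everything below {}", gc_cutoff(None, 60 * 60 * 1000));
}
```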
The hard case would be when your upstream system doesn't have any timing guarantees AND you need to be fully consistent with it AND you need to handle data in event-time order. I can't think of any examples like that off the top of my head. I think the answer would probably have to be either to buy a lot of ram or to soft-gc old state - write it out to cheap slow storage and hope that really old inputs don't come in often enough to hurt performance.
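For the soft-gc option, a minimal sketch of what I have in mind (the `cold` map here is just a stand-in for cheap slow storage, and all the names are made up):

```rust
use std::collections::HashMap;

// Hypothetical soft-gc: instead of discarding old event-time periods outright,
// spill them to cheap slow storage and only pay the cost of touching them
// when a genuinely late record shows up.
struct TieredState {
    hot: HashMap<u64, i64>,  // recent periods, kept in ram
    cold: HashMap<u64, i64>, // stand-in for disk / object storage
    hot_horizon: u64,        // periods below this get spilled
}

impl TieredState {
    fn apply(&mut self, period: u64, diff: i64) {
        if period >= self.hot_horizon {
            *self.hot.entry(period).or_insert(0) += diff;
        } else {
            // Rare, slow path: a record older than the horizon arrived.
            *self.cold.entry(period).or_insert(0) += diff;
        }
    }

    // Spill periods that fell behind the horizon out of ram.
    fn spill(&mut self) {
        let horizon = self.hot_horizon;
        let old: Vec<u64> = self.hot.keys().copied().filter(|p| *p < horizon).collect();
        for p in old {
            if let Some(diff) = self.hot.remove(&p) {
                *self.cold.entry(p).or_insert(0) += diff;
            }
        }
    }
}

fn main() {
    let mut state = TieredState {
        hot: HashMap::new(),
        cold: HashMap::new(),
        hot_horizon: 100,
    };
    state.apply(150, 1); // recent: stays in ram
    state.apply(7, 1);   // very late record: goes straight to slow storage
    state.spill();
    println!("hot = {:?}, cold = {:?}", state.hot, state.cold);
}
```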
---
This has been an interesting conversation and I'll probably try to write this up to clarify my thinking. I'm not on hn often though, so if you want to continue feel free to email jamie@scattered-thoughts.net