
Interesting article. However, I find some claims regarding Flink to be inaccurate:

> Flink is split into two apis - the datastream api is strongly focused on high-temporal-locality problems and so can't express our running example

While the streaming API doesn't have the relational SQL syntax, it should have all the necessary building blocks to do anything that can be done with the table API.

> The file source appears to load the current contents of the file in a single batch and then ignore any future appends, so it's not usable for testing streaming behavior

The file source can be used for testing streaming behavior, for example by watching a directory with the PROCESS_CONTINUOUSLY watch type. See: https://ci.apache.org/projects/flink/flink-docs-stable/dev/d...
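
For reference, a minimal sketch of that setup with the Java DataStream API (class name, directory and poll interval here are just placeholders):

    import org.apache.flink.api.java.io.TextInputFormat;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

    public class ContinuousFileSourceTest {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Re-scan the directory every second; files dropped into it later are
            // picked up and streamed as they appear.
            DataStream<String> lines = env.readFile(
                new TextInputFormat(new Path("/tmp/watched-dir")),
                "/tmp/watched-dir",
                FileProcessingMode.PROCESS_CONTINUOUSLY,
                1000L);

            lines.print();
            env.execute("continuous-file-source-test");
        }
    }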

> Flink also has a 5s trailing watermark, but it doesn't reject any inputs in non-windowed computations.

This is expected; that's how Flink deals with late data by default, so that it doesn't lose any. But dropping late data can be done with the Datastream API, afaik. In windowed computations, allowedLateness(...) can be used to control which late records are dropped: https://ci.apache.org/projects/flink/flink-docs-release-1.12.... If the computation graph has no windows, as in the article's example, the low-level ProcessFunction operation can be used to access the watermark via timerService.currentWatermark() and drop late events. Keep in mind, though, that this comes at a cost in parallelization, as the data stream has to be keyed in order to use timers (see for example: https://stackoverflow.com/a/47071833/3398493).
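
A rough sketch of that ProcessFunction pattern (the Event type and its accessors are made-up names, and `events` is assumed to be an existing stream with timestamps and watermarks already assigned):

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    // `events` is assumed to be a DataStream<Event>; Event, getKey() and
    // getTimestamp() are illustrative.
    DataStream<Event> onTime = events
        .keyBy(Event::getKey)  // keyed, so that event-time timers could also be used
        .process(new KeyedProcessFunction<String, Event, Event>() {
            @Override
            public void processElement(Event e, Context ctx, Collector<Event> out) {
                // Forward only events whose timestamp is not already behind the
                // current watermark; everything else is silently dropped.
                if (e.getTimestamp() >= ctx.timerService().currentWatermark()) {
                    out.collect(e);
                }
            }
        });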

Also, it seems very odd to me that a watermark of 5s is used while the expected out-of-orderness is 10s. The watermark is usually set in accordance with the expected out-of-orderness, precisely to avoid consistency issues caused by late data. Why did the author choose to do that?
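
For comparison, matching the watermark bound to the expected out-of-orderness with the WatermarkStrategy API would look roughly like this (Event is again just an illustrative type):

    import java.time.Duration;
    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.streaming.api.datastream.DataStream;

    // With events up to 10s out of order, bound the watermark by those same 10s so
    // that records within the bound are never treated as late. `events` is an
    // assumed DataStream<Event>; getTimestamp() is illustrative.
    DataStream<Event> withTimestampsAndWatermarks = events.assignTimestampsAndWatermarks(
        WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(10))
            .withTimestampAssigner((event, previousTimestamp) -> event.getTimestamp()));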



> it should have all the necessary building blocks to do anything that can be done with the table API

That's true, in that the table api itself is built on top of the datastream api. But to run the example you'd have to implement your own joins and retraction-aware aggregates, and then we'd be testing the consistency properties of that implementation, not of the datastream api itself, so we might as well test a mature implementation like the table api instead.

> The file source can be used for testing streaming behavior, for example by watching a directory with the PROCESS_CONTINUOUSLY watch type.

Ah, I was only looking at the table api version (https://ci.apache.org/projects/flink/flink-docs-release-1.12...) which doesn't mention that option. I guess I could make a stream and turn it into a table?
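
Something along these lines, presumably (untested; names, path and poll interval are placeholders):

    import org.apache.flink.api.java.io.TextInputFormat;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.source.FileProcessingMode;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

    public class StreamToTableTest {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

            // Read the watched directory as an unbounded stream of lines...
            DataStream<String> lines = env.readFile(
                new TextInputFormat(new Path("/tmp/watched-dir")),
                "/tmp/watched-dir",
                FileProcessingMode.PROCESS_CONTINUOUSLY,
                1000L);

            // ...and expose it to the table api as a (single-column) table, which
            // table/SQL operations can then run against.
            Table input = tableEnv.fromDataStream(lines);
            tableEnv.createTemporaryView("inputs", input);
        }
    }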

> This is expected, that's how Flink deals with late data...

I should clarify in the article that I was surprised not because it's wrong, but because it's very different from the behavior that I'm used to.

> Also, it seems very odd to me that a watermark of 5s is used while the expected out-of-orderness is 10s. Why did the author choose to do that?

To explore how late data is treated.

There are some plausible sources of inconsistency from late data handling that I didn't get to demonstrate in the article. For example, it sounds like even without allowed_lateness, if two different windowed operations had different window sizes they might drop different subsets of late data, which could cause inconsistency between their outputs.
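
A sketch of that scenario, just to make the setup concrete (counting an assumed `events` stream under two different tumbling window sizes):

    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    // Two counts over the same assumed DataStream<Event> `events`, differing only in
    // window size. A record that arrives behind the watermark can be too late for an
    // already-closed 1-minute window yet still land in a 5-minute window that is
    // still open, so the two outputs may disagree about which late records counted.
    DataStream<Tuple2<String, Long>> perMinute = events
        .map(e -> Tuple2.of(e.getKey(), 1L))
        .returns(Types.TUPLE(Types.STRING, Types.LONG))
        .keyBy(t -> t.f0)
        .window(TumblingEventTimeWindows.of(Time.minutes(1)))
        .sum(1);

    DataStream<Tuple2<String, Long>> perFiveMinutes = events
        .map(e -> Tuple2.of(e.getKey(), 1L))
        .returns(Types.TUPLE(Types.STRING, Types.LONG))
        .keyBy(t -> t.f0)
        .window(TumblingEventTimeWindows.of(Time.minutes(5)))
        .sum(1);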


Thank you for your response.

Regarding FileProcessingMode.PROCESS_CONTINUOUSLY, what I meant is that you can watch a directory, and write batches to new files instead of appending to the same file. That way, each file will be read only once.
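
On the test-harness side that could be as simple as something like this (directory and file naming are made up):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;

    public class BatchWriter {
        // Each batch of test records goes into a brand-new file in the watched
        // directory; nothing is ever appended to an existing file, so every file
        // (and therefore every record) should be picked up exactly once.
        public static void writeBatch(List<String> records, int batchNumber) throws Exception {
            Path dir = Paths.get("/tmp/watched-dir");
            Files.createDirectories(dir);
            Files.write(dir.resolve("batch-" + batchNumber + ".txt"), records);
        }
    }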


Ok, that makes sense.


Oh, wait, this doesn't sound promising.

> If the watchType is set to FileProcessingMode.PROCESS_CONTINUOUSLY, when a file is modified, its contents are re-processed entirely. This can break the “exactly-once” semantics, as appending data at the end of a file will lead to all its contents being re-processed.



