
An inherent problem with data engineering DAG frameworks is that they are not directly integrated with the nodes; they live above them. The nodes don't know about the DAG, and depending on the software each one runs, they have different semantics for checking whether something is stuck, for cancellation, and so on.
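To make the mismatch concrete, here's a minimal sketch (all class and method names are hypothetical, not from any real framework) of what the orchestration layer ends up doing today: wrapping each node type in an adapter because every runtime exposes liveness and cancellation differently.

```python
from abc import ABC, abstractmethod


class NodeAdapter(ABC):
    """The uniform interface the DAG layer wishes every node already had."""

    @abstractmethod
    def is_stuck(self) -> bool: ...

    @abstractmethod
    def cancel(self) -> None: ...


class QueryAdapter(NodeAdapter):
    """A SQL engine that only exposes elapsed time: 'stuck' means timed out."""

    def __init__(self, elapsed_s: float, timeout_s: float = 300.0):
        self.elapsed_s = elapsed_s
        self.timeout_s = timeout_s
        self.cancelled = False

    def is_stuck(self) -> bool:
        return self.elapsed_s > self.timeout_s

    def cancel(self) -> None:
        # In a real system this would issue something like KILL QUERY.
        self.cancelled = True


class StreamAdapter(NodeAdapter):
    """A streaming consumer: 'stuck' means no progress on its offset."""

    def __init__(self, last_offset: int, current_offset: int):
        self.last_offset = last_offset
        self.current_offset = current_offset
        self.cancelled = False

    def is_stuck(self) -> bool:
        return self.current_offset == self.last_offset

    def cancel(self) -> None:
        # In a real system this would stop the consumer and release partitions.
        self.cancelled = True
```

The point is that "stuck" and "cancel" mean completely different things per node type, and the framework has to paper over that from the outside instead of the nodes speaking a common protocol.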

I think there’s a lot of room for innovation here. Given one or more data streams or ingestion sources, a set of SQL scripts, a DAG to orchestrate those scripts, and one or more destinations, there are many execution models that could be used but aren’t. You’ve only specified the pipeline’s semantics, not its implementation, so smart tooling should be able to implement patterns like streaming or intermediate queueing automatically, without much further input. IMO that’s what DAG frameworks could be: “compilers” for your data pipeline rather than orchestrators. There’s progress in this area, but nothing that quite gets there yet, AFAIK.
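A rough sketch of what that separation could look like, with entirely hypothetical names: the pipeline is declared purely as semantics (sources, SQL steps, dependencies), and a "compiler" resolves the dependency order and picks an execution strategy, batch or streaming, from the same spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    sql: str
    inputs: tuple  # names of upstream steps or sources


@dataclass
class Pipeline:
    sources: list
    steps: list
    sinks: list


def topo_order(pipeline: Pipeline) -> list:
    """Resolve execution order from the declared dependencies alone."""
    done = set(pipeline.sources)
    order = []
    remaining = list(pipeline.steps)
    while remaining:
        ready = [s for s in remaining if all(i in done for i in s.inputs)]
        if not ready:
            raise ValueError("cycle or missing dependency")
        for s in ready:
            order.append(s.name)
            done.add(s.name)
            remaining.remove(s)
    return order


def compile_pipeline(pipeline: Pipeline, mode: str = "batch") -> list:
    """Emit an execution plan: same spec, different implementations."""
    order = topo_order(pipeline)
    if mode == "batch":
        # Materialize each step as a table, run steps in dependency order.
        return [("run_sql", name) for name in order]
    if mode == "streaming":
        # Same graph, but each step consumes an intermediate queue.
        return [("subscribe_and_run", name) for name in order]
    raise ValueError(mode)
```

Usage would look like declaring the graph once and letting the tool decide:

```python
p = Pipeline(
    sources=["events"],
    steps=[
        Step("clean", "SELECT ...", ("events",)),
        Step("agg", "SELECT ...", ("clean",)),
    ],
    sinks=["warehouse"],
)
compile_pipeline(p, "streaming")
```

Real tools obviously need much more (state, retries, backfills), but nothing in the spec above forces one execution model, which is the point.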


