I worked in data engineering for 6 years before recently deciding to take at least a siesta from it and write some "normal" software as a product engineer.
I really like data engineering in general - writing Spark jobs, manipulating huge amounts of data, orchestrating pipelines, writing SQL... I just like it all a lot. I like SQL - there, I said it!!
But damn do the rough edges just burn you out. Airflow is one of those edges. IMO part of the problem is that everyone wants to hold it in a slightly different way, so you end up with Airflow being this opinionless monster that lets you use it any way you want as long as it "works". Everything built on top of that opinionless mess is, of course, also kind of a mess. And on top of that, at a certain scale Airflow itself has to be maintained as a distributed system.
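To make the "everyone holds it differently" thing concrete, here's a rough sketch of the same trivial two-step pipeline written in two Airflow dialects you'll routinely find side by side in the same repo - classic operators vs. the TaskFlow API. The DAG and task names are made up for illustration, and this assumes a recent Airflow 2.x.

    # Hypothetical example: the same two-step pipeline in two Airflow dialects.
    from datetime import datetime

    from airflow import DAG
    from airflow.decorators import dag, task
    from airflow.operators.python import PythonOperator

    # Style 1: classic operators, dependencies wired up explicitly.
    with DAG("extract_and_load_v1", start_date=datetime(2024, 1, 1), schedule=None) as classic_dag:
        extract_op = PythonOperator(task_id="extract", python_callable=lambda: {"rows": 42})
        load_op = PythonOperator(task_id="load", python_callable=lambda: print("loading"))
        extract_op >> load_op

    # Style 2: TaskFlow API, dependencies inferred from the data flow.
    @dag(start_date=datetime(2024, 1, 1), schedule=None)
    def extract_and_load_v2():
        @task
        def extract():
            return {"rows": 42}

        @task
        def load(payload):
            print("loading", payload)

        load(extract())

    taskflow_dag = extract_and_load_v2()

Neither style is wrong, which is sort of the point: the tool doesn't push you toward either one, so every team ends up with its own dialect.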
Spark is another one of those things with lots of rough edges. Some of it might've been that the places I worked were holding it wrong, but Spark and all its adjacent context are so damn complicated that you end up in one of two places: either a small group of experts writes all the Spark code and you have to heavily prioritize what they work on, or a bunch of non-data-engineers write Spark jobs that sort of work but are about as inefficient as possible, and brittle, because they don't know all the ways they need to think about scaling.
Notably, I got really tired of people coming to our team and going "we wrote this big Spark job, it worked fine up until it didn't, can you please help us fix it, it's mission critical". And it's just an OOM because they're pulling the data into driver memory and have been bumping -Xmx every two weeks for six months. Or, perhaps even worse, they were doing some wonky thing that wouldn't scale, and now they're hitting weird errors because they used a feature so wrong (like a small dataframe with 1-2 partitions that gets read by a subsequent, HUGE stage, so all the executors absolutely swamp the poor node or two hosting those partitions).
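For anyone who hasn't been on the receiving end of those tickets, here's a minimal PySpark sketch of both failure modes. Table names and paths are invented, and the fix noted in each comment is just one common way out, not the only one.

    # Hypothetical sketch of both failure modes; table names and paths are invented.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("antipattern_sketch").getOrCreate()

    events = spark.read.parquet("s3://bucket/events")        # big, and growing every week
    countries = spark.read.parquet("s3://bucket/countries")  # tiny: a few hundred rows

    # Failure mode 1: pull the whole dataset onto the driver. Works on day one;
    # six months later it's an OOM, and bumping driver heap (-Xmx) just delays it.
    total = sum(row["amount"] for row in events.collect())

    # Keeping the aggregation distributed avoids the driver bottleneck entirely.
    total = events.agg(F.sum("amount")).first()[0]

    # Failure mode 2: a tiny dataframe squeezed into one partition and cached, then
    # read by a HUGE downstream stage - every executor fetches that one partition
    # from the node hosting it and swamps it.
    tiny = countries.coalesce(1).cache()
    bad_join = events.join(tiny, on="country_code", how="left")

    # One common fix: broadcast the small side so each executor gets its own copy.
    good_join = events.join(F.broadcast(countries), on="country_code", how="left")

The first version of each is the one that sails through code review, because it works fine on the data that exists at the time.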
Anyway, it's easy to write a Spark job that works at the time of writing, and really hard to write one that will still work six months from now.
Add to that the fact that Spark is a giant money firehose, so data engineering departments are constantly asked to sacrifice reliability to increase efficiency (just run everything hotter! who needs slack or buffer space?), and all of the above issues get even worse.