A lot of the hard part isn't the model, especially in a world where BERT, XGBoost, Optuna, PyTorch, etc. have solved much of the classic problem and forced 'real' DS to specialize on either the business consulting side (not math/engineering) or the theory side (barely implemented). The rebrand of 'data analyst' (SQL, Power BI, ...) to 'data scientist' by even top tech companies underscores this. It's not yet where web dev has gotten in terms of global $20/hr Fiverr contractors, but it's already at, say, $40/hr for someone who can build real production models for more boring scenarios.
The result is that the vast bulk of data scientists (PhD, self-trained, consulting, ...) we interview are weak engineers, so going from a make-believe notebook to a trickier production scenario requires the data engineer / MLOps / etc. to solve a lot that a typical DS doesn't really understand in practice: scale, latency, distributed systems, testing, etc. Likewise, the part the DS solves has little to do with the latest NeurIPS paper, and is more about lifecycle tasks like getting better data, which the other folks on the team will often be involved with as well.
So 2 natural high-paying paths here:
data engineer / MLOps -> MLEngineer -> DS
data engineer -> all-in-one data analyst/scientist -> ML/AI data scientist
I agree with this. In my experience, most of the data scientists I have worked with never left the world of Jupyter notebooks. For them, code management, CI/CD, dev/stage/prod separation, etc. are a world of their own that they are not very comfortable with. Heck, they even used SageMaker to create a git repo for their Jupyter notebooks.
It doesn't mean that there aren't data scientists who have some engineering experience as well, but this seems to be rare. For that reason, getting those ML models that they painstakingly build to where they'll generate some real value is super hard. They just don't know where to start.
Working across multiple teams and multiple functions is very challenging and it often creates friction. Therefore, creating tools and systems that will enable those data scientists to see the actual value of their labor is paramount.
That's why we're seeing a huge resurgence of so-called MLOps tools and platforms that aim to solve all or some of the problems of the entire stack. We are very, very early in this journey, but I believe the 2020s will be for ML and AI what the 2010s were for the cloud and data, i.e. new Snowflakes and Databricks, but for the actual ML apps. It's exciting.
It's useful to work backwards from the knowledge a DS needs to be worth their weight. Imagine a small team of $400K/yr DS + $400K/yr DE + ... and whatever hw/sw. So say a $2-3M/yr project driving $3M+ of new growing revenue or $6-12M of annual savings. At bigger companies, even bigger magnitudes & pressure :)
The DS will likely:
- be close to the business case & business stakeholders to ask questions a normal lead can't
- know the relevant math + ML algorithms, and build up specializations pairing DS niches ("time series forecasting") with industry niches ("supply chains in manufacturing")
- have enough engineering & performance understanding to work with a DE on going from small data sets to big ones
- have an intuitive feel for all of the above - how data/usecases/etc. go right/wrong
That's a lot!!
One path is jumping in as a low-paid intern or new grad and doing your time. But a pivot is different, esp. to get paid along the way. Most CS grads had little math ("intros to stats, combinatorics, & algs; dropped linear algebra"), weak ML ("did algs; intro to ML only covered kmeans & bayes; tried running a BERT model on some data"), and little intuition for how ML typically goes wrong ("what's class imbalance?"). So if they do get hired directly as a mid-level DS, it's probably on a team of the blind-leading-the-blind. Oops.
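For anyone who hasn't hit the "what's class imbalance?" question yet, here's a minimal sketch of why it bites. The labels and counts are made up for illustration (a hypothetical 1% fraud rate), and the "model" is just a stand-in that always predicts the majority class:

```python
# Hypothetical data: 1 = fraud (rare), 0 = legitimate. Counts are invented.
labels = [1] * 10 + [0] * 990

# Stand-in "model" that ignores its input and always predicts the majority class.
predictions = [0] * len(labels)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model caught."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


acc = accuracy(labels, predictions)
rec = recall(labels, predictions)
print(f"accuracy={acc:.3f} recall={rec:.3f}")
```

The accuracy looks great while the model catches none of the cases anyone cares about, which is roughly the trap a team without ML intuition tends to walk into.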
BUT SQL/Spark/K8S/pandas/regex are real skills. Doing the data engineering, ML operations, etc., around making an ML pipeline more than a fanciful notebook that wouldn't last a minute in production is real work. That stuff does pay well, and by working with the ML folks, you'd naturally get pulled into the ML tasks as well. DS write all sorts of bugs that surface as production evolves and that the full team works on together, and new features that need a team to make real. So taking a job that mixes engineering specialties with ML specialties is a smoother pivot path for the typical CS backgrounds I've seen. Over time, drift toward the more ML-y aspects of the projects until you can do the full hop. (Nit: That won't teach the math & deeper intuition, so I'd still do courses + projects on the side.)
I wish I had real numbers. So, going on instinct from what I've seen:
- a data analyst role rebranded as a DS role will be lower paid than a DE role, maybe 50% diff
- an actual DS role is probably higher paid than a DE role, but really depends on the job+co
- a great DS role and a great DE role are both super well compensated. Though maybe again DS is higher than DE in most cases, just b/c of the ability to more directly drive $. Unless it's something like an infra company, the DS will be inherently closer to the business & outcomes. ("I did this clever thing that netted a 2% revenue spike that adds up to $40M/yr in new revenue, what did you do?")