Work where others go on vacation, building a modern green-field data infrastructure with open-source technologies: Python, PostgreSQL, BigQuery as a data lake, Kinesis for CDC / event streaming.
(Original author here)
That setup is of course an option. The point of the article is to not have JavaScript tracking pixels on the website, and that server-side tracking is easy to set up. So anything that's not based on in-browser tracking is fine.
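To make that concrete, here is a minimal sketch of what server-side tracking can look like (not the article's actual stack): a Flask after_request hook that records page views on the server, with a hypothetical log_page_view() sink standing in for whatever database or queue you write to.

    # Minimal sketch of server-side tracking: no JavaScript pixel on the page,
    # the web server itself records each page view. log_page_view() is a
    # hypothetical sink; in practice write to Postgres, a file, or a queue.
    import datetime

    from flask import Flask, request

    app = Flask(__name__)

    def log_page_view(record: dict) -> None:
        print(record)  # placeholder sink

    @app.after_request
    def track(response):
        # only successful HTML page views are interesting for web analytics
        if response.status_code == 200 and 'text/html' in response.content_type:
            log_page_view({
                'ts': datetime.datetime.utcnow().isoformat(),
                'path': request.path,
                'referrer': request.referrer,
                'user_agent': request.user_agent.string,
            })
        return response

    @app.route('/')
    def index():
        return '<html><body>Hello</body></html>'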
Currently there is a hard dependency on Postgres for Mara's bookkeeping tables. I'm working on dockerizing the example project to make the setup easier.
For ETL, MySQL, Postgres & SQL Server are supported (and it's easy to add more).
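For illustration, a hedged sketch of what that configuration can look like; the aliases, module paths and class names are assumptions based on my reading of mara-db and may differ between versions.

    # Hedged sketch: register the Postgres bookkeeping database plus the
    # databases the ETL reads from / writes to. Aliases and class names are
    # assumptions and may differ between mara-db versions.
    import mara_db.config
    import mara_db.dbs

    mara_db.config.databases = lambda: {
        # 'mara' holds the bookkeeping tables (run times, node output, ...)
        'mara': mara_db.dbs.PostgreSQLDB(host='localhost', user='root',
                                         database='example_etl_mara'),
        # further aliases for the databases the pipelines work against
        'dwh': mara_db.dbs.PostgreSQLDB(database='example_etl_dwh'),
    }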
It intentionally doesn't have a scheduler; it only covers the definition and parallel execution of pipelines. For scheduling, use Jenkins, cron, or Airflow.
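For example, a pipeline definition plus a tiny entry point that a cron job or Jenkins build step can invoke; this is adapted from the project's README example, so the exact module paths may differ between versions.

    # Sketch adapted from the README: define a pipeline and run it from
    # whatever scheduler you already have (cron, Jenkins, ...). Assumes the
    # bookkeeping database is configured; module paths may differ by version.
    from mara_pipelines.commands.bash import RunBash
    from mara_pipelines.pipelines import Pipeline, Task
    from mara_pipelines.ui.cli import run_pipeline

    pipeline = Pipeline(id='demo', description='A small demo pipeline')

    pipeline.add(Task(id='ping_localhost', description='Pings localhost',
                      commands=[RunBash('ping -c 3 localhost')]))
    pipeline.add(Task(id='sleep_a_bit', description='Sleeps for 2 seconds',
                      commands=[RunBash('sleep 2')]),
                 upstreams=['ping_localhost'])

    if __name__ == '__main__':
        run_pipeline(pipeline)  # cron / Jenkins just invoke this script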
Currently you can get notifications for failed runs in Slack. Alerting itself is not really in the scope of this project, but it should be easy to implement on top of it in your own project.
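A project-level alert can be as small as posting to a Slack incoming webhook when a run fails; in this sketch the webhook URL and the arguments are placeholders.

    # Not mara-specific: push an alert to a Slack incoming webhook when a run
    # fails. The webhook URL and the failed_nodes argument are placeholders.
    import requests

    SLACK_WEBHOOK_URL = 'https://hooks.slack.com/services/T000/B000/XXXX'

    def notify_failed_run(pipeline_id: str, failed_nodes: list) -> None:
        text = (f':warning: pipeline `{pipeline_id}` failed, '
                f'nodes: {", ".join(failed_nodes)}')
        requests.post(SLACK_WEBHOOK_URL, json={'text': text}, timeout=10)

    if __name__ == '__main__':
        notify_failed_run('demo', ['ping_localhost'])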
That's absolutely correct. Mara uses Python's multiprocessing [1] to parallelize pipeline execution [2] on a single node, so it doesn't need a distributed task queue. Beyond that (and visualization) it can't do much. In fact it doesn't even have a scheduler (you can use Jenkins or cron for that).
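To illustrate the idea (this is not Mara's actual code): pipeline nodes that don't depend on each other can simply be fanned out to worker processes on the same machine.

    # Illustration only, not Mara's implementation: run independent pipeline
    # nodes in parallel on one machine with multiprocessing, no task queue.
    import multiprocessing
    import time

    def run_node(node_id: str) -> str:
        time.sleep(1)  # stand-in for the real work (a bash command, a query, ...)
        return f'{node_id} done'

    if __name__ == '__main__':
        # nodes without dependencies between them can run at the same time
        independent_nodes = ['load_orders', 'load_customers', 'load_products']
        with multiprocessing.Pool(processes=3) as pool:
            for result in pool.map(run_node, independent_nodes):
                print(result)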