Hacker News | martin_loetzsch's comments

I think this summarizes the topic quite well: https://pyfound.blogspot.com/2022/05/the-2022-python-languag...


Project A | (Senior) Data Engineer | Lake Bodensee, Austria or Stuttgart, Germany | ONSITE (REMOTE during covid), Full-time | https://www.project-a.com/careers/data-engineer-mfd-49896240...

Work where others go on vacation, building a modern green-field data infrastructure with open-source technologies: Python, PostgreSQL, BigQuery as a data lake, Kinesis for CDC / event streaming.


(Original author here) That setup is of course an option. The point of the article is to not have JavaScript tracking pixels on the website, and it's easy to do server-side tracking instead. So anything that's not based on in-browser tracking is fine.
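Server-side tracking can be as simple as logging each request at the web-server or app layer. A minimal WSGI-middleware sketch (the class and field names are illustrative, not from the article):

```python
import time

class TrackingMiddleware:
    """Logs one event per request on the server side -- no JavaScript pixel needed."""

    def __init__(self, app, log):
        self.app = app
        self.log = log  # any callable that accepts an event dict

    def __call__(self, environ, start_response):
        # record the request before passing it on to the wrapped app
        self.log({
            "ts": time.time(),
            "path": environ.get("PATH_INFO", ""),
            "referrer": environ.get("HTTP_REFERER", ""),
            "user_agent": environ.get("HTTP_USER_AGENT", ""),
        })
        return self.app(environ, start_response)
```

The `log` callable could append to a file or write to a queue; the downstream ETL then picks the events up from there.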


Also weird: it's the name of a giant ugly guinea pig: https://en.wikipedia.org/wiki/Mara_(mammal)


GNU Make is indeed the least verbose / boilerplate-heavy tool, and I use it for a lot of things.

The problem with Make is its lack of acceptance among younger programmers, who always want to work with the latest technologies.


(author here)

Currently there is a hard dependency on PostgreSQL for Mara's bookkeeping tables. I'm working on dockerizing the example project to make the setup easier.

For ETL, MySQL, PostgreSQL & SQL Server are supported (and it's easy to add more).


I'm a bit confused about this. What if the target is HDFS? Why this dependency on SQL databases for ETL?


(author here)

The Mara example project [1] does exactly that: it combines PyPI download stats with GitHub repo activity data.

[1] https://github.com/mara/mara-example-project


Thanks! Just took a look.

The file directory structure is a bit confusing -- could you point me to the file that performs this transformation?


For example, the PyPI download stats pipeline is here: https://github.com/mara/mara-example-project/tree/master/app...

The __init__.py contains the pipeline definition; the rest are the SQL files that do the transformations.


Thank you!


(author here)

It intentionally doesn't have a scheduler, just definition and parallel execution of pipelines. For scheduling, use Jenkins, cron or Airflow.

Currently you can get Slack notifications for failed runs. Alerting itself is not really in the scope of this project, but it should be easy to implement in your own project.
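A minimal version of such a failure notification can be sketched with a plain Slack incoming webhook (the message format and function names are placeholders, not Mara's actual implementation):

```python
import json
import urllib.request

def build_failure_message(pipeline_id, node_id, error):
    """Builds a Slack incoming-webhook payload for a failed pipeline node."""
    return {"text": f":x: Pipeline `{pipeline_id}` failed at node `{node_id}`: {error}"}

def notify_slack(webhook_url, payload):
    """POSTs the payload to a Slack incoming webhook; returns the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

You would call `notify_slack` from the except branch of whatever loop runs your pipeline nodes.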


(author here)

That's absolutely correct. Mara uses Python's multiprocessing [1] to parallelize pipeline execution [2] on a single node, so it doesn't need a distributed task queue. Beyond that (and visualization) it can't do much. In fact it doesn't even have a scheduler (you can use Jenkins or cron for that).

[1] https://docs.python.org/3.6/library/multiprocessing.html

[2] https://github.com/mara/data-integration/blob/master/data_in...
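The single-node parallelization described here can be sketched with multiprocessing alone (a simplified illustration of the idea, not Mara's actual execution code):

```python
import multiprocessing

def run_node(node_id):
    """Stand-in for executing one pipeline node (e.g. running its SQL)."""
    return (node_id, "succeeded")

def run_layer(node_ids, processes=4):
    """Runs all independent nodes of one dependency layer in parallel worker processes."""
    # 'fork' keeps the example self-contained on POSIX systems
    # (it avoids re-importing the calling module in the workers)
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(processes=processes) as pool:
        return dict(pool.map(run_node, node_ids))
```

Nodes with no dependencies on each other run in separate processes; each layer waits for the previous one to finish.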


(author here)

Connection information is configured in code through [1], see [2] for an example.

It's very easy to run other workloads, either by directly invoking Python functions from tasks or by writing your own commands (operators) [3].

There is a command-line interface; it's how pipelines are run from external schedulers (Jenkins, cron), see [4] & [5].

[1] https://github.com/mara/mara-db

[2] https://github.com/mara/mara-example-project/blob/master/app...

[3] https://github.com/mara/data-integration/blob/master/data_in...

[4] https://github.com/mara/data-integration/raw/master/docs/exa...

[5] https://github.com/mara/data-integration/raw/master/docs/exa...
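Such a scheduler-facing entry point can be sketched roughly like this (a hypothetical CLI for illustration, not Mara's actual commands):

```python
import argparse

def run_pipeline(pipeline_id, nodes=None):
    """Stand-in for loading and executing a pipeline; returns True on success."""
    print(f"running pipeline {pipeline_id}" + (f", nodes {nodes}" if nodes else ""))
    return True

def main(argv=None):
    parser = argparse.ArgumentParser(description="Run an ETL pipeline from cron or Jenkins.")
    parser.add_argument("pipeline_id", help="id of the pipeline to run")
    parser.add_argument("--nodes", nargs="*", help="optionally run only these nodes")
    args = parser.parse_args(argv)
    ok = run_pipeline(args.pipeline_id, args.nodes)
    # a non-zero exit code lets cron/Jenkins flag the run as failed
    return 0 if ok else 1
```

A crontab entry or Jenkins job then just calls the script and watches the exit code.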


Perhaps this is addressed elsewhere, but do you have any plans to support Common Workflow Language?

