Hacker News | martin_loetzsch's comments

I think this summarizes the topic quite well: https://pyfound.blogspot.com/2022/05/the-2022-python-languag...


Project A | (Senior) Data Engineer | Lake Bodensee, Austria or Stuttgart, Germany | ONSITE (REMOTE during covid), Full-time | https://www.project-a.com/careers/data-engineer-mfd-49896240...

Work where others go on vacation, building a modern green-field data infrastructure with open-source technologies: Python, PostgreSQL, BigQuery as a data lake, Kinesis for CDC / event streaming.


(Original author here) That setup is of course an option. The point of the article is to not have JavaScript tracking pixels on the website, and it's easy to do server-side tracking instead. So anything that's not based on in-browser tracking is fine.
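Server-side tracking can be as simple as logging each request at the web-server or app layer. A minimal WSGI-middleware sketch (the class and field names are illustrative, not from the article):

```python
import time

class TrackingMiddleware:
    """Logs one event per request on the server side -- no JavaScript pixel needed."""

    def __init__(self, app, log):
        self.app = app
        self.log = log  # any callable that accepts an event dict

    def __call__(self, environ, start_response):
        # record the request before passing it on to the wrapped app
        self.log({
            "ts": time.time(),
            "path": environ.get("PATH_INFO", ""),
            "referrer": environ.get("HTTP_REFERER", ""),
            "user_agent": environ.get("HTTP_USER_AGENT", ""),
        })
        return self.app(environ, start_response)
```

The `log` callable could append to a file or write to a queue; the downstream ETL then picks the events up from there.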


Also weird: it's the name of a giant ugly guinea pig: https://en.wikipedia.org/wiki/Mara_(mammal)


GNU Make is indeed the least verbose / boilerplate-heavy tool, and I use it for a lot of things.

The problem with Make is its lack of acceptance among younger programmers, who always want to work with the latest technologies.


(author here)

Currently there is a hard dependency on PostgreSQL for Mara's bookkeeping tables. I'm working on dockerizing the example project to make the setup easier.

For ETL, MySQL, PostgreSQL & SQL Server are supported (and it's easy to add more).


I'm a bit confused about this. What if the target is HDFS? Why this dependency on SQL databases for ETL?


(author here)

The Mara example project [1] does exactly that: it combines PyPI download stats with GitHub repo activity data.

[1] https://github.com/mara/mara-example-project


Thanks! Just took a look.

The file directory structure is a bit confusing -- could you point me to the file that performs this transformation?


For example, the PyPI download stats pipeline is here: https://github.com/mara/mara-example-project/tree/master/app...

The __init__.py contains the pipeline definition; the rest are the SQL files that do the transformations.


Thank you!


(author here)

It intentionally doesn't have a scheduler, just definition and parallel execution of pipelines. For scheduling, use Jenkins, cron or Airflow.

Currently you can get Slack notifications for failed runs. Alerting itself is not really in the scope of this project, but it should be easy to implement in your own project.
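A minimal version of such a failure notification can be sketched with a plain Slack incoming webhook (the message format and function names are placeholders, not Mara's actual implementation):

```python
import json
import urllib.request

def build_failure_message(pipeline_id, node_id, error):
    """Builds a Slack incoming-webhook payload for a failed pipeline node."""
    return {"text": f":x: Pipeline `{pipeline_id}` failed at node `{node_id}`: {error}"}

def notify_slack(webhook_url, payload):
    """POSTs the payload to a Slack incoming webhook; returns the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

You would call `notify_slack` from the except branch of whatever loop runs your pipeline nodes.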


(author here)

That's absolutely correct. Mara uses Python's multiprocessing [1] to parallelize pipeline execution [2] on a single node, so it doesn't need a distributed task queue. Beyond that (and visualization) it can't do much. In fact it doesn't even have a scheduler (you can use Jenkins or cron for that).

[1] https://docs.python.org/3.6/library/multiprocessing.html

[2] https://github.com/mara/data-integration/blob/master/data_in...
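The single-node parallelization described here can be sketched with multiprocessing alone (a simplified illustration of the idea, not Mara's actual execution code):

```python
import multiprocessing

def run_node(node_id):
    """Stand-in for executing one pipeline node (e.g. running its SQL)."""
    return (node_id, "succeeded")

def run_layer(node_ids, processes=4):
    """Runs all independent nodes of one dependency layer in parallel worker processes."""
    # 'fork' keeps the example self-contained on POSIX systems
    # (it avoids re-importing the calling module in the workers)
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(processes=processes) as pool:
        return dict(pool.map(run_node, node_ids))
```

Nodes with no dependencies on each other run in separate processes; each layer waits for the previous one to finish.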


(author here)

Connection information is configured in code through [1], see [2] for an example.

It's very easy to run other workloads, either by directly invoking Python functions from tasks or by writing your own commands (operators) [3].

There is a command-line interface; it's how pipelines are run from external schedulers (Jenkins, cron), see [4] & [5].

[1] https://github.com/mara/mara-db

[2] https://github.com/mara/mara-example-project/blob/master/app...

[3] https://github.com/mara/data-integration/blob/master/data_in...

[4] https://github.com/mara/data-integration/raw/master/docs/exa...

[5] https://github.com/mara/data-integration/raw/master/docs/exa...
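Such a scheduler-facing entry point can be sketched roughly like this (a hypothetical CLI for illustration, not Mara's actual commands):

```python
import argparse

def run_pipeline(pipeline_id, nodes=None):
    """Stand-in for loading and executing a pipeline; returns True on success."""
    print(f"running pipeline {pipeline_id}" + (f", nodes {nodes}" if nodes else ""))
    return True

def main(argv=None):
    parser = argparse.ArgumentParser(description="Run an ETL pipeline from cron or Jenkins.")
    parser.add_argument("pipeline_id", help="id of the pipeline to run")
    parser.add_argument("--nodes", nargs="*", help="optionally run only these nodes")
    args = parser.parse_args(argv)
    ok = run_pipeline(args.pipeline_id, args.nodes)
    # a non-zero exit code lets cron/Jenkins flag the run as failed
    return 0 if ok else 1
```

A crontab entry or Jenkins job then just calls the script and watches the exit code.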


Perhaps this is addressed elsewhere, but do you have any plans to support Common Workflow Language?

