
If anyone here is using DuckDB in production, I'd love to hear what your stack looks like over the entire lifecycle of extract->transform->load.


We orchestrate our ETL pipelines with Dagster. We only use DuckDB in a few of them, but we're slowly replacing our pandas ETLs with it. For some of our bigger jobs we use Spark instead.
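For a concrete picture of that swap, here's a minimal sketch of a pandas transform rewritten against DuckDB; the file and column names are hypothetical, but the pattern (query the DataFrame in place, get a DataFrame back) is the point:

    import duckdb
    import pandas as pd

    orders = pd.read_parquet("orders.parquet")  # hypothetical input

    # pandas version:
    # daily = orders.groupby("order_date", as_index=False)["amount"].sum()

    # DuckDB version: queries the DataFrame in place and returns a DataFrame.
    daily = duckdb.sql("""
        SELECT order_date, sum(amount) AS amount
        FROM orders
        GROUP BY order_date
    """).df()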

Essentially it's:

1. Data sources from places such as S3, SFTP, and RDS.
2. Use DuckDB to load most of these with only extensions (I don't believe there's one for SFTP, so we just have some Python code to pull those files out).
3. Transform the data however we'd like with DuckDB.
4. Convert the DuckDB table to PyArrow.
5. Save to S3 with delta-rs.

A rough sketch of steps 2-5 is below.
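This assumes DuckDB's httpfs extension for the S3 reads; the bucket names and the query are hypothetical, and credentials come from the environment:

    import duckdb
    from deltalake import write_deltalake

    con = duckdb.connect()

    # 2. Load: the httpfs extension lets DuckDB read straight from S3.
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    con.execute("SET s3_region = 'us-east-1'")

    # 3. Transform with plain SQL over the raw files.
    relation = con.sql("""
        SELECT customer_id, sum(amount) AS total_spend
        FROM read_parquet('s3://raw-bucket/orders/*.parquet')
        GROUP BY customer_id
    """)

    # 4. Hand the result over as a PyArrow table.
    table = relation.arrow()

    # 5. Load into the lake: delta-rs writes the Arrow table out as a Delta table on S3.
    write_deltalake("s3://lake-bucket/marts/customer_spend", table, mode="overwrite")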

FWIW, we also have all of this execute externally from our orchestration, on an EC2 instance. This allows us to scale vertically.


This is very cool!

Last time I checked, DuckDB didn't have the concept of a metastore, so do you have an internal convention for table locations and folder structure?

What do you use for reports/visualizations? Notebooks?


Yeah, Dagster has a concept of metadata and assets, so we have some code that maps Dagster's own logical representation to physical S3 locations.
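Something in the spirit of this sketch (not our actual code; the bucket and path convention are hypothetical, and it assumes assets are PyArrow tables written with delta-rs):

    from dagster import IOManager, InputContext, OutputContext
    from deltalake import DeltaTable, write_deltalake

    class S3DeltaIOManager(IOManager):
        """Map Dagster's logical asset key to a physical S3 location by convention."""

        def _uri(self, asset_key) -> str:
            # e.g. AssetKey(["marts", "customer_spend"]) -> s3://lake-bucket/marts/customer_spend
            return "s3://lake-bucket/" + "/".join(asset_key.path)

        def handle_output(self, context: OutputContext, obj) -> None:
            # obj is assumed to be a PyArrow table; credentials come from the environment.
            write_deltalake(self._uri(context.asset_key), obj, mode="overwrite")

        def load_input(self, context: InputContext):
            return DeltaTable(self._uri(context.asset_key)).to_pyarrow_table()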

Reports and viz vary a lot: the finance department uses Tableau, whereas for more 'data sciencey' stuff we normally just use notebooks.


Hadn't heard of that Dagster feature, that's really cool.


It's great as:

1. An ephemeral processing engine. For example, I have a machine learning pipeline where I load data into a DataFrame, and then I can use DuckDB to execute SQL on my DataFrame (I prefer both the syntax and performance to pandas).
2. A data lake processing engine. DuckDB makes it very easy to interact with partitioned files.
3. A lightweight datastore. I have one ETL pipeline where I need to cache the data if an API is unavailable. I just write the DataFrame to a DuckDB database that is on a mounted network filesystem, and read it back when I need it.

Rough sketches of 2 and 3 follow.
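The paths, partition layout, and mount point here are hypothetical placeholders:

    import duckdb
    import pandas as pd

    # 2. Data lake: hive-partitioned Parquet is one SQL call away.
    events = duckdb.sql("""
        SELECT year, month, count(*) AS n
        FROM read_parquet('lake/events/year=*/month=*/*.parquet', hive_partitioning = true)
        GROUP BY year, month
    """).df()

    # 3. Lightweight datastore: cache a DataFrame in a .duckdb file on a mounted share.
    def cache_df(df: pd.DataFrame, path: str = "/mnt/share/api_cache.duckdb") -> None:
        con = duckdb.connect(path)
        con.register("df_view", df)
        con.execute("CREATE OR REPLACE TABLE cached AS SELECT * FROM df_view")
        con.close()

    def load_cached(path: str = "/mnt/share/api_cache.duckdb") -> pd.DataFrame:
        con = duckdb.connect(path)
        return con.execute("SELECT * FROM cached").df()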


On a similar point, are people using the actual DuckDB database format or sticking with Parquet? I love everything about DuckDB, but I feel more comfortable keeping things in an existing format.

My only work with it to date has been load, analyze, and usually delete to refresh, so I don't require any DB mutability. Outside of mutability, I'm not sure if there are any obvious wins with the format.


It's a bit faster and easier than Parquet, but right now the format is unstable, which is a huge downside and makes it unsuitable for medium/long-term storage. After DuckDB v1, they'll keep the format stable, and then I think its popularity will increase dramatically.
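For anyone weighing the two, the choice boils down to something like this sketch (table and file names are arbitrary):

    import duckdb

    con = duckdb.connect()
    con.execute("CREATE TABLE t AS SELECT range AS id, random() AS x FROM range(1000000)")

    # Option A: Parquet. Stable, widely readable, safe for long-term storage.
    con.execute("COPY t TO 'data.parquet' (FORMAT PARQUET)")

    # Option B: native DuckDB file. Convenient and fast, but (pre-1.0) the on-disk
    # format can change between versions, so treat it as short-lived storage.
    file_con = duckdb.connect("data.duckdb")
    file_con.execute("CREATE TABLE t AS SELECT * FROM 'data.parquet'")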


I'm using it for site-search analytics. Originally I was using MariaDB, but because we do search-as-you-type there's a complex aggregation query to identify the "real" searches, and with 1000 entries that was taking tens of seconds to run. Materialized views would've fixed that but...

My stack is now JSON files containing the search data. At present I'm loading these into DuckDB and pre-aggregating due to a bug I found in DuckDB, not because it was needed for performance.
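The load/pre-aggregate step has roughly this shape; the file layout, column names, and the aggregation itself are simplified placeholders, not the real search-as-you-type dedup logic:

    import duckdb

    con = duckdb.connect("search_analytics.duckdb")

    # Load the raw search-event JSON files; read_json_auto infers the schema.
    con.execute("""
        CREATE OR REPLACE TABLE raw_searches AS
        SELECT * FROM read_json_auto('searches/*.json')
    """)

    # Pre-aggregate into daily counts per query term.
    con.execute("""
        CREATE OR REPLACE TABLE daily_search_counts AS
        SELECT date_trunc('day', ts) AS day, query, count(*) AS n
        FROM raw_searches
        GROUP BY day, query
    """)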


I've been using it to take the output from our data validation steps and bundle it up, together with the data that was validated, into one neat artifact that we can download and explore manually if there's an issue.
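A minimal sketch of that bundling pattern, assuming pandas DataFrames for both the data and the validation results (all names are hypothetical):

    import duckdb
    import pandas as pd

    def bundle_artifact(data: pd.DataFrame, report: pd.DataFrame, path: str) -> None:
        """Write the validated data and its validation report into one .duckdb file."""
        con = duckdb.connect(path)
        con.register("data_view", data)
        con.register("report_view", report)
        con.execute("CREATE OR REPLACE TABLE payload AS SELECT * FROM data_view")
        con.execute("CREATE OR REPLACE TABLE validation_report AS SELECT * FROM report_view")
        con.close()

    # Later, anyone can download the file and poke at it interactively:
    #   con = duckdb.connect("artifact.duckdb")
    #   con.sql("SELECT * FROM validation_report")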


DuckDB is one of the built-ins for count.co, which I've come to love.


We use it to sort Parquet files out of core, then use the Arrow interface to read the results into Python and export them as GeoParquet.
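Presumably something along these lines; this is a sketch that assumes the geometry column is stored as WKB, and the paths, memory limit, and column names are made up:

    import duckdb
    import geopandas as gpd

    con = duckdb.connect()
    # Let the sort spill to disk instead of running out of memory.
    con.execute("SET memory_limit = '8GB'")
    con.execute("SET temp_directory = '/tmp/duckdb_spill'")

    # Out-of-core sort across many Parquet files, written back out as Parquet.
    con.execute("""
        COPY (SELECT * FROM read_parquet('tiles/*.parquet') ORDER BY tile_id)
        TO 'sorted.parquet' (FORMAT PARQUET)
    """)

    # Pull the sorted result through the Arrow interface and write GeoParquet.
    table = con.sql("SELECT * FROM 'sorted.parquet'").arrow()
    df = table.to_pandas()
    gdf = gpd.GeoDataFrame(df.drop(columns=["geometry_wkb"]),
                           geometry=gpd.GeoSeries.from_wkb(df["geometry_wkb"]))
    gdf.to_parquet("sorted.geoparquet")  # geopandas writes GeoParquet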



