chrisjc's comments

Any thought about a pass-through server for various non-local databases?

It should be extremely simple for databases that support ADBC (for example Snowflake, PostgreSQL).

For others it might just be a matter of mapping DDL, DML, DQL, etc. to a supported database protocol driver (JDBC, ODBC, etc.). Of course, this is where things may get challenging, as it would become the responsibility of your server to convert results to Arrow (tables/streams/etc.). But that could potentially be delegated to "worker" Flight servers (not Flight SQL servers), and the server could then return/forward their Arrow results (Flight results).
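
A rough sketch of what such a worker might look like (Python and pyarrow.flight purely for illustration, since the actual server isn't Python; `fetch_via_jdbc` is a hypothetical stand-in for whatever JDBC/ODBC bridge the worker wraps):

    import pyarrow as pa
    import pyarrow.flight as flight

    def fetch_via_jdbc(sql: str) -> pa.Table:
        # Hypothetical placeholder for a JDBC/ODBC bridge; returns a dummy table here.
        return pa.table({"echoed_sql": [sql]})

    class WorkerFlightServer(flight.FlightServerBase):
        # Run the ticket's query against the legacy driver and stream Arrow back.
        def do_get(self, context, ticket):
            table = fetch_via_jdbc(ticket.ticket.decode("utf-8"))
            return flight.RecordBatchStream(table)

    if __name__ == "__main__":
        WorkerFlightServer(location="grpc://0.0.0.0:8815").serve()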

Of course, some of this is already possible to a degree through DuckDB's MySQL/Postgres extensions.

I imagine this could also be useful for developing/testing locally?

It might also provide a way to swap databases, potentially easing vendor-to-vendor migrations when ADBC isn't supported by the vendor.

Another potential value-add could be SQL dialect management via Substrait conversions (or sqlglot, though it looks like the server is Java, so unsure if that's possible; maybe Graal?).
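
Something along these lines with sqlglot, for example (just a sketch; the query and target dialect are made up, and how/whether this could be embedded in the server is the open question above):

    import sqlglot

    # Rewrite a DuckDB-flavoured query into another vendor's dialect before forwarding it.
    print(sqlglot.transpile(
        "SELECT STRFTIME(ts, '%Y') AS yr FROM events",
        read="duckdb",
        write="postgres",
    )[0])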


Hi chrisjc, these are interesting thoughts. I was mainly interested in providing a server interface for DuckDB because of the performance I observed when running large OLAP workloads, and especially the cost savings vs cloud SaaS offerings. I think the big benefit here is that you can run the same workload at roughly equivalent performance, but for about 80% less, based on my tests...

I like your ideas, though. The server is written in C++.

I hope this helps...

Thanks!


duckdb -cmd 'CREATE TABLE my_data AS FROM READ_PARQUET($$data.parquet$$)' -ui

`duckdb -ui sqlitedb.db` should work because DuckDB can read SQLite files. If it doesn't autoload the extension, you can add the INSTALL/LOAD to your `~/.duckdbrc`.


    duckdb -ui pre_seeded_db.db

    duckdb -ui -init startup.sql
where startup.sql contains various statements/commands like `.open pre_seeded_db.db`

Alternatively, place statements/commands in `~/.duckdbrc` and just run `duckdb -ui`.
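
For example, a `~/.duckdbrc` along these lines (assuming the sqlite extension isn't autoloaded):

    INSTALL sqlite;
    LOAD sqlite;
    .open pre_seeded_db.db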


> In the worst-case scenario, they would be 2-3 years behind the cutting edge, which is not mission-critical.

It's also worth considering how such an event might affect the US and its allies. Would it slow down, perhaps even halt, certain operations/efforts for the US? For instance, all those chips the US needs to build supercomputers for "weather research". ;)


What allies will the US have in a few years?


Haven't tried it out yet, but the release notes look very promising, especially relating to Arrow interactions:

- Push dynamically generated join filters through UNION, UNNEST and AGGREGATE

- Fix arrow table filters

- [Arrow] Fix scan of an object providing the PyCapsuleInterface when projection pushdown is possible.

- DuckDB Arrow Non Canonical Extensions to use arrow.opaque

- Arrow Extension Type to be registered in DuckDB Extensions

- [Python] Use an ArrowQueryResult in FetchArrowTable when possible.

https://github.com/duckdb/duckdb/releases/tag/v1.2.0


Perhaps I don't fully understand what you're saying

    CREATE TABLE person (name VARCHAR, age BIGINT);
    INSERT INTO person VALUES ('Alice', 37), ('Ana', 35), ('Bob', 41), ('Bea', 25);
    CREATE OR REPLACE MACRO my_query_table(t) AS TABLE (SELECT * FROM query_table(t));
    SELECT * FROM my_query_table('person');
Or do you mean that you can't use `query_table` with a macro identifier in place of the table identifier as input (after all, where would you put the macro args)? Perhaps using `query(...)` instead would work?

    CREATE OR REPLACE MACRO query_macro(m, i) AS TABLE (SELECT * FROM query(CONCAT('SELECT * FROM ', m::STRING, $$($$, i::STRING, $$)$$)));
    SELECT * FROM query_macro('my_query_table', 'person');
Or maybe through some other kind of gymnastics https://duckdb.org/2024/03/01/sql-gymnastics.html


Here is a contrived example:

    CREATE MACRO foo(bar, baz) AS TABLE
    WITH
        base AS (
            FROM query_table(bar) t
        ),
        refined AS (
            FROM base
            JOIN query_table(baz) u ON base.id = u.id
        )
    SELECT * FROM refined;
I want to test many user-defined combinations of (bar, baz). How can I pass a macro instance to query_table like

    FROM (
        (FROM foo(mybar1(some_arg1, some_arg2), mybaz101(500)))
        UNION ALL
        (FROM foo(mybar2(some_arg3, some_arg4), mybaz201(300)))
        UNION ALL
        ...
    )
    SELECT *;
Your second example using 'query_macro' looks like something I was looking for, but it doesn't seem to generalize to an arbitrary number of arguments.


    FROM (
        (with
          baz as (from baz101(500)),
          bar as (from bar1(arg1, arg2))
          from foo
        )
        UNION ALL
        (with
          baz as (from baz201(300)),
          bar as (from bar2(arg3, arg4))
          from foo
        )
        UNION ALL
        ...
    )
    SELECT *;
When you pass table arguments to a macro, don't pass them as parameters to the macro; instead, make a subquery (CTE) and give it the name that `foo` expects.


Not as terse as I imagined it could be, but I really didn't expect that it could be done this way. Thanks!


I've been eagerly awaiting this for a couple of months now, and I've long wondered why there hasn't been such an implementation/extension for Flight, especially as there are extensions for mysql, postgres, sqlite...

It only seems natural that an extension could be developed for ATTACHing to a Flight server.

Looking forward to watching your talk.


For read-oriented interactions...

    ADBC --sql--> Flight SQL Server --flight--> Flight Servers (plural) --> datasources
or

    SELECT * FROM <relation> WHERE ...
where the "relation" is a collection of Arrow stream from a cluster of Flight Servers.

Remember that Flight/Flight SQL is not a query engine replacement, but rather a way to efficiently communicate and exchange data between components with different responsibilities.
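
A minimal sketch of that flow from the consumer side (assuming pyarrow and duckdb; the endpoint, command, and column names are made up):

    import duckdb
    import pyarrow as pa
    import pyarrow.flight as flight

    # Ask a (hypothetical) Flight server for the endpoints backing a dataset,
    # then read each endpoint's Arrow stream.
    client = flight.connect("grpc://localhost:8815")
    info = client.get_flight_info(flight.FlightDescriptor.for_command(b"my_dataset"))
    relation = pa.concat_tables(
        [client.do_get(ep.ticket).read_all() for ep in info.endpoints]
    )

    # DuckDB, not Flight, is the query engine: it scans the Arrow data directly.
    print(duckdb.sql("SELECT count(*) FROM relation WHERE id > 100").fetchall())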


Thank you for all the work you guys do. The Arrow ecosystem is just absolutely incredible.

My few gripes about interop with duckdb relate to Arrow scanning/pushdowns. And this extends to interop with other projects like pyiceberg too.

Registering an Arrow Dataset (or pyiceberg scan) as a "duckdb relation" (virtual view) is still a little problematic. Querying these "relations" does not always result in an optimal outcome.

For Arrow datasets, you can intercept the duckdb pushdown, but duckdb will have already "optimized" the plan to its liking, and any scanning restrictions that may have been more advantageous based on the nuances of the dataset might have been lost. E.g.:

    WHERE A IN (3, 5, 7)
is presented to the Arrow scanner (pushdown) as "A is between 3 and 7 inclusive" (https://duckdb.org/docs/guides/performance/indexing.html#zon...).
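
To make the gripe concrete (a sketch assuming pyarrow.dataset and the Python client; the path and column name are made up):

    import duckdb
    import pyarrow.dataset as ds

    dataset = ds.dataset("data/", format="parquet")
    con = duckdb.connect()
    con.register("my_data", dataset)

    # The IN-list reaches the Arrow scanner only as a min/max range (A between 3 and 7),
    # so fragments containing only 4 or 6 may still be read.
    con.sql("SELECT * FROM my_data WHERE A IN (3, 5, 7)").show()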

Perhaps in a similar way, turning a pyiceberg scan into a relation for duckdb effectively takes the entire scan and creates an Arrow Table, rather than some kind of pushdown/"scan plan" for duckdb to potentially make more efficient with its READ_PARQUET() functionality.
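
For example (a sketch assuming pyiceberg; the catalog and table names are placeholders):

    import duckdb
    from pyiceberg.catalog import load_catalog

    tbl = load_catalog("default").load_table("analytics.events")

    # The scan is materialized into an Arrow table up front, rather than being handed
    # to DuckDB as a pushdown-friendly plan over the underlying Parquet files.
    arrow_tbl = tbl.scan(row_filter="event_date >= '2024-01-01'").to_arrow()
    print(duckdb.sql("SELECT count(*) FROM arrow_tbl").fetchall())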

Most of this is probably dependent on duckdb development, but all of the incredible interop work done across communities/ecosystems so far gives me a lot of confidence that these will soon be matters of the past.


IN list filter predicate pushdown is much improved in DuckDB 1.2, coming in about a week! I am not sure if it applies to Arrow yet or not. Disclaimer: I work at MotherDuck and DuckDB Labs


@1egg0myegg0 that's great to hear. I'll check to see if it applies to Arrow.

Another performance issue with DuckDB/Arrow integration that we've been working to solve is that Arrow lacked a canonical way to pass statistics along with a stream of data. So for example if you're reading Parquet files and passing them to DuckDB, you would lose the ability to pass the Parquet column statistics to DuckDB for things like join order optimization. We recently added an API to Arrow to enable passing statistics, and the DuckDB devs are working to implement this. Discussion at https://github.com/apache/arrow/issues/38837.


And all the Arrow parts work together quite nicely.

    ADBC client --> Flight SQL (duckdb/whatever) --> Flight --> ?
The result highlights your exact point: why take your data and transpose it twice?

It's quite an exciting space, with lots of projects popping up around Arrow Flight and duckdb.

