
What's the advantage over using Polars for the same task? It seems to me the natural competitor here, and I vastly prefer the Polars syntax over SQL any day. So I was curious whether I should try DuckDB or stick with Polars.

Polars would be better in some ways for sure. It was in one of my early prototypes. What put me off was that I was essentially designing my own database which I didn't trust as much as something like DuckDB.

Polars would let me have a lot of luxuries that are lost at the boundaries between my application and DuckDB, but those are weighed in the tradeoffs I was talking about. I do a lot of parsing at the boundaries to ensure data structures are sound, and otherwise DuckDB is enforcing strict schemas at runtime which provides as much safety as a dataset's schema requires. I do a lot of testing to ensure that I can trust how schemas are built and enforced as well.

Things like foreign keys, expressions that span multiple tables effortlessly, normalization, check constraints, unique constraints, and primary keys work perfectly right off the shelf. It's kind of perfect because the spec I'm supporting is fundamentally about normalized relational data.
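
For anyone curious what that looks like in practice, here's a rough sketch using the DuckDB CLI; the table and column names are made up, not from the actual project:

  echo "
    CREATE TABLE sites (
        site_id   INTEGER PRIMARY KEY,
        site_code VARCHAR NOT NULL UNIQUE);
    CREATE TABLE samples (
        sample_id INTEGER PRIMARY KEY,
        site_id   INTEGER NOT NULL REFERENCES sites (site_id), -- foreign key
        weight_kg DOUBLE CHECK (weight_kg > 0));               -- check constraint
    -- This insert is rejected at runtime because site 99 does not exist:
    INSERT INTO samples VALUES (1, 99, 2.5);
  " | duckdb example.duckdb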

Another consideration was that while Polars is a bit faster, we don't encounter datasets that require more speed. The largest dataset I've processed, including extensive transformations and complex validations (about as complex as they get in this spec), takes ~3 seconds for around 580k rows. That's on an M1 Max with 16GB of RAM, for what it's worth.

Our teams have written countless R scripts to do the same work with less assurance that the outputs are correct, having to relearn the spec each time, and with much worse performance (these people are not developers). So we're very happy with DuckDB's performance, even though Polars would probably let us do it faster.

Having said that, if someone built the same project and chose Polars I wouldn't think they were wrong to do so. It's a great choice too, which is why your question is a good one.


Familiarity with SQL is a plus in my opinion. Also, DuckDB has SDKs in more languages compared to Polars.

I wasn't all that excited about SQL at first, but I've come around to it. Initially I really wanted to keep all of my data and operations in the application layer, and I'd gone to great lengths to model that to make it possible. I had this vision of all types of operations, queries, and so on being totally type safe and kept in a code-based registry such that I could do things like provide a GUI on top of data and functions I knew were 100% valid at compile time. The only major drawback was that some kinds of changes to the application would require updating the repository.

I still love that idea but SQL turns out to be so battle-proven, reliable, flexible, capable, and well-documented that it's really hard to beat. After giving it a shot for a couple of weeks it became clear that it would yield a way more flexible and capable application. I'm confident enough that I can overcome the rough edges with the right abstractions and some polish over time.


Polars has all of the benefits of DuckDB (to some degree), but also allows for larger-than-memory datasets.


Interesting, I wasn't aware; thanks for that. I will say, Polars' implementation is much more centered on out-of-core processing, and bypasses some of DuckDB's limitations ("DuckDB cannot yet offload some complex intermediate aggregate states to disk"). Both incredible pieces of software.

To expand on this, Polars' `LazyFrame` implementation allows for simple addition of new backends like GPU, streaming, and now distributed computing (though it's currently locked to a vendor). The DuckDB codebase just doesn't have this flexibility, though there are ways to get it to run on GPU using external software.


Thanks for that insight as well! My needs don't tend to be so demanding, so I've gotten away without knowing these details, but I suspect in the not-so-distant future this could be useful to know.

Being able to use distributed backends to process frames sounds kind of incredible, but I can't imagine my little projects ever making use of it. Still, very cool stuff.


Have you seen Ibis[1]? It's a dataframe API that translates calls to it into various backends, including Polars and DuckDB. I've messed around with it a little for cases where data engineering transforms had to use pyspark but I wanted to do exploratory analysis in an environment that didn't have pyspark.

[1] https://ibis-project.org/


Wait, this looks interesting. I am a biologist, so I might get the terminology wrong. Would this allow me to run an IPv4-to-IPv6 (and back) translation service?

I have some services with only IPv6 addresses and want clients with only IPv4 (sadly, they still exist) to at least be able to reach them. So could I dedicate a machine to translating for them using this tool?


Yes, translating packets between IPv6 and IPv4 is precisely what Jool does.

From what you're describing, I think you have two options: if you have enough IPv4 addresses at your disposal to cover your IPv6-only machines, you can use the so-called "SIIT-DC" mode [1].

Otherwise, if you have fewer IPv4 addresses, say just one on your router, and multiple IPv6 machines, you can set up a stateful NAT64 [2] with some static BIB entries [3]. NAT64 is basically the familiar NAT, just with IPv6 in the LAN instead of private IPv4 addresses (say 192.168.1.0/24), and static BIB entries are the equivalent of port forwarding. In this case you would run Jool on your router; a rough sketch follows the links below.

[1]: https://www.jool.mx/en/siit-dc.html

[2]: https://www.jool.mx/en/run-nat64.html

[3]: https://www.jool.mx/en/usr-flags-bib.html
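
To make the NAT64 option a bit more concrete, here is a very rough sketch of what you'd run on the router; the addresses are placeholders and the exact flags are paraphrased from memory, so verify them against the pages above:

  # Load the stateful NAT64 module and create an instance
  # (64:ff9b::/96 is the well-known NAT64 prefix).
  modprobe jool
  jool instance add "example" --netfilter --pool6 64:ff9b::/96

  # Reserve the router's public IPv4 address and port for translation.
  jool -i "example" pool4 add 203.0.113.1 80 --tcp

  # Static BIB entry, the rough equivalent of a port forward:
  # IPv4 port 80 maps to the IPv6-only web server.
  jool -i "example" bib add 203.0.113.1#80 2001:db8::10#80 --tcp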


I appreciate your reply. Thank you.

I am using socat right now to achieve this translation, but it is rather slow. So I hope a proper solution using Jool might be more powerful. But it seems it requires at least a bit more networking insight than I have at the moment. It's an opportunity to learn something new for me.

Right now I simply rent a Hetzner machine with an IPv4 address and use it to route the traffic to my IPv6 services.


I think this would allow that, yes.

However, I personally would just do it in userspace, especially for that simple of a use. I'm doing the opposite; I have a webapp that somehow doesn't handle IPv6, so to access it over a pure-v6 network I just run this on the same host:

  socat TCP6-LISTEN:8002,fork TCP4:127.0.0.1:8000
I believe you could trivially reverse this;

  socat TCP4-LISTEN:8002,fork TCP6:[::1]:8000
should serve [::1]:8000 as 0.0.0.0:8002 (I don't remember if changing ports was strictly required; that may be a quirk of my exact setup).

I did that with forwarding to another host, but it's super slow (10 Mbit) on a cheap Hetzner box. So I am looking for this functionality, but faster.

The point of Jool and similar tools (there is also one called Tayga that runs in userspace, if you want) is to translate network traffic between multiple hosts, where some only have IPv6 and others only IPv4 addresses.

If your machine has both IPv6 and IPv4 addresses you don't need any translation.


Sure, but if your goal is

> I have some services with only IPv6 addresses and want clients with only IPv4 (sadly, they still exist) to at least be able to reach them.

then that seems like overkill. Although it depends on your network, of course.


I interpreted "services with only IPv6 addresses" as IPv6-only servers, in which case some sort of translation is needed, but if these are just processes in a dual stack server, then yes.

You could also try using 6to4 or some such, but this is new to me as well. Interested!

6to4 solves a different problem: it's a way to provide IPv6 internet access to some host with only IPv4 internet access. It's basically a VPN you need to configure on the client.

NAT64 and SIIT (what Jool and af-to implement) are instead a way to let (potentially) any IPv4-only client connect to an IPv6-only machine you control. The client doesn't need to be aware it's actually talking to an IPv6 machine, because there is a translator (typically a router between them) that transparently translates the packets so they understand each other.


Thanks!

I am speculating here, but as it's genomics data I assume it's information such as gene counts and epigenetic information (methylation, histones, etc.). Once you multiply 20k by a few post-translational modifications, you get to quite a few columns quickly.

Usually this would be stored in a sparse long form though. So I might be wrong.


If you want to do that, why not just use an EAV (entity-attribute-value) pattern or something else that can translate rows to columns?
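
A rough sketch of what I mean, using DuckDB just as an example engine (the names are made up):

  echo "
    -- Sparse long form (EAV-style): one row per (sample, attribute, value).
    CREATE TABLE measurements (sample_id INTEGER, attribute VARCHAR, value DOUBLE);
    INSERT INTO measurements VALUES
        (1, 'gene_count', 19234), (1, 'methylation', 0.42),
        (2, 'gene_count', 20110);
    -- Translate rows back into columns on demand.
    PIVOT measurements ON attribute USING first(value) GROUP BY sample_id;
  " | duckdb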

I agree. I have used omz for a while now, but I have since also realised that the features I use are so basic that it really does not warrant a whole software project as a dependency.

So I went and had Gemini make me a zsh config with the features I actually use. It took 15 minutes to get all the autocomplete, aliases, and search functionality done.
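
For what it's worth, a minimal sketch of what such a config can look like; this isn't my exact file, just the stock zsh builtins that cover completion, history search, and aliases:

  # ~/.zshrc -- minimal, framework-free
  autoload -Uz compinit && compinit        # tab completion
  zstyle ':completion:*' menu select       # arrow-key completion menu

  HISTFILE=~/.zsh_history                  # shared, de-duplicated history
  HISTSIZE=50000
  SAVEHIST=50000
  setopt SHARE_HISTORY HIST_IGNORE_DUPS

  bindkey '^R' history-incremental-search-backward   # Ctrl-R search
  bindkey '^[[A' history-beginning-search-backward   # Up: prefix search
  bindkey '^[[B' history-beginning-search-forward    # Down

  alias ll='ls -lah'
  alias gs='git status'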


Oh I hate it, it's so brain-rotty. Well done. Well done indeed.

I don't quite understand: instead of using the phone's GPS to let me simply chat with people around me, which would be great while traveling or commuting, I need to choose the place I chat at?

This seems super counterproductive in my opinion. It creates way more friction than I want.

Maybe I want to save a location I have been to as a chatroom, sure, but my primary interest would be to have my location determine the chat. So if I enter a university building: boom, university chat. I enter CERN: boom, CERN chat.

The hard part would be to not just use rectangles but actually make the shapes meaningful. I don't want to walk past a high school or live next to one and then be included in that chat. So yeah. Tricky


One obvious feature would be to provide a geo-fenced Wikipedia or news feed.

Like: what is the highest-rated or longest Wikipedia article in the area?

Or maybe what the top 3 radio stations are, with links to them.

There is plenty of local content that Google does not surface.


Geo-fenced Wikipedia already exists; a number of apps offer regional maps with localized Wikipedia additions, see for example https://wiki.openstreetmap.org/wiki/OsmAnd


I could do that. Have additional information about the area based on the perimeter.


These are great ideas!


100% agree. I am still shocked that the models are not open-sourced. It's the community's data, and I feel it goes very much against the spirit of the community to keep the machine-learning part, which is very central to the app, so secret.


As far as I understand, they do try to keep the heat around for the next decompression, as of course they need it. But I could not find what type of heat storage they use. Ultimately they "only" seem to need to store it for 12h, right?


Yes, in figure 2 it's 3 mice; in figure 3 they also have 5 (panel e).

