
Pandas is generally awful unless you're just living in a notebook (and even then it's probably my least favorite implementation of the 'data frame' concept).

Since Pandas lacks Polars' concept of an Expression, it's actually quite challenging to programmatically interact with non-trivial Pandas queries. In Polars the query logic can be entirely independent of the data frame while still referencing specific columns of the data frame. This makes Polars data frames work much more naturally with typical programming abstractions.
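To make that concrete, here's a minimal sketch (column names are made up) of how a Polars expression can live on its own and be reused against any frame that has the right columns:

    import polars as pl

    # An expression is just a description of a computation over columns;
    # it isn't bound to any particular DataFrame.
    high_value = (pl.col("price") * pl.col("quantity")) > 1_000

    def flag_high_value(df: pl.DataFrame) -> pl.DataFrame:
        # Reuse the same expression against any frame that has
        # "price" and "quantity" columns.
        return df.with_columns(high_value.alias("is_high_value"))

    orders = pl.DataFrame({"price": [5.0, 250.0], "quantity": [10, 8]})
    print(flag_high_value(orders))

The closest Pandas equivalent usually ends up as a function that takes a DataFrame and indexes into it by column name, which is much harder to compose.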

Pandas' multi-index is a bad idea in nearly all contexts other than its original use case: financial time series (and I'll admit, if you're working with purely financial time series, then Pandas feels much better). Sufficiently large Pandas code bases are littered with seemingly arbitrary uses of 'reset_index', there are many situations where a multi-index will create bugs, and, most importantly, I've never seen a non-financial scenario where anyone used a multi-index to their advantage.
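A toy example of the pattern I mean (made-up data):

    import pandas as pd

    sales = pd.DataFrame({
        "region": ["east", "east", "west"],
        "product": ["a", "b", "a"],
        "units": [3, 5, 7],
    })

    # Grouping on two keys quietly returns a MultiIndex...
    totals = sales.groupby(["region", "product"]).sum()

    # ...so downstream code that expects plain columns sprinkles in the
    # near-ritual reset_index() before merging, plotting, etc.
    totals = totals.reset_index()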

Finally, Pandas is slow, which is honestly my lowest-priority concern, but using Polars is so refreshing.

What other data frames have you used? Having used R's native data frames extensively (the way they make use of indexing is so much nicer) in addition to Polars, I find both drastically preferable to Pandas. My experience is that most people use Pandas because it has long been the default data frame implementation in Python. But personally I'd rather just not use data frames at all if I'm forced to use Pandas. Could you expand on what you like about Pandas over the other data frame models you've worked with?



I initially considered using Pandas to work with community collections of Elite: Dangerous game data, specifically those published first by EDDB (RIP) and now by Spansh. However, I quickly hit the process memory limit because my naïve attempts at manipulating even the smallest of those collections resulted in Pandas loading GB-scale JSON data files entirely into RAM. I'm intrigued by Polars' stated support for data streaming. More professionally, I support the work of bioinformaticians, statisticians, and data scientists, so I like to stay informed.
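For what it's worth, here's a rough sketch of what I'd like to try with Polars, assuming the dump is (or has been converted to) newline-delimited JSON; the file and column names here are hypothetical, and the exact streaming API varies a bit between Polars versions:

    import polars as pl

    # Lazily scan the dump instead of loading the whole file into RAM;
    # only the columns and rows the query needs get materialized.
    lazy = pl.scan_ndjson("systems.jsonl")

    populated = (
        lazy
        .filter(pl.col("population") > 0)
        .select(["name", "population"])
        .collect(streaming=True)  # ask the engine to process in batches
    )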

I like how in Pandas (and in R) I can quickly load data sets in a way that lets me do relational queries using familiar syntax. For my Elite: Dangerous project, because I couldn't get Pandas to work for me (which the reader should chalk up to my ignorance and not any deficiency of Pandas itself), I ended up using the SQLAlchemy ORM with Marshmallow to load the data into SQLite or PostgreSQL. Looking back at the work, I probably ought to have thrown it into a JSON-aware data warehouse somehow, which I think is how the guy behind Spansh does it, but I'm not a big data guy (yet) and have a lot to learn about what's possible.
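By "relational queries using familiar syntax" I mean the usual merge/query idiom, roughly like this (file and column names are hypothetical):

    import pandas as pd

    stations = pd.read_json("stations.json")
    systems = pd.read_json("systems.json")

    # A join plus a filter, expressed in the ordinary DataFrame idiom.
    large_pads = (
        stations.merge(systems, left_on="system_id", right_on="id")
                .query("max_landing_pad_size == 'L'")
    )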



