RipTable – multi-threaded Python data analytics tools for numpy arrays/datasets (github.com/rtosholdings)
79 points by aldanor on Nov 28, 2020 | 14 comments


Does anyone know how this compares to something like Dask?


Riptable can only parallelize the low-level primitives it sees, so it must split tasks and collect results at the level of single function calls. Dask builds up graphs of operations at a higher level. There are advantages and disadvantages to both approaches, depending on the actual task.
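A rough sketch of the difference (FastArray and dask.array are the real array types from each library; the toy workload is made up, and riptable's internal threading details may differ):

    import numpy as np
    import riptable as rt
    import dask.array as da

    data = np.arange(10_000_000, dtype=np.float64)

    # Riptable: each call executes eagerly, and each primitive
    # (multiply, then sum) is internally split across threads.
    fa = rt.FastArray(data)
    result_rt = (fa * 2.0).sum()

    # Dask: calls only build a lazy task graph; nothing runs until
    # .compute(), so the scheduler sees the whole graph at once.
    dx = da.from_array(data, chunks=1_000_000)
    result_da = (dx * 2.0).sum().compute()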


It's pretty crazy what people are willing to do to make Python faster: Pyston, Numba, Cython, RipTable. Instead of just biting the bullet and abandoning Python, they spend thousands of man-hours trying to hack a fundamentally slow and broken language into something acceptable.
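To make concrete the kind of effort being described, here is what Numba does (real Numba API; the toy function is just for illustration):

    import numpy as np
    from numba import njit

    @njit
    def sumsq(a):
        # JIT-compiled to machine code on first call; the plain Python
        # loop would be orders of magnitude slower in CPython.
        total = 0.0
        for x in a:
            total += x * x
        return total

    print(sumsq(np.arange(1_000_000, dtype=np.float64)))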

I see these hacks as a cognitive bias and rarely think they are the rational and correct decision.

Just use Julia and call it a day. It really is that easy.


So, perhaps it makes sense in your industry to go all-in on Julia.

Personally, I'm still burned by the time they announced 1.0 mere days after announcing 0.7, which meant that literally nothing in the ecosystem worked on 1.0.

If that transition had been managed better, I would be much more comfortable evangelising Julia now; but it wasn't, and the decisions behind it make me incredibly uncomfortable adopting it anywhere I'm on the hook for failures.

Additionally, Python has libraries for everything and is known (at least badly) by almost everyone in data science/programming. When your modelling is part of a larger project, this matters a lot (and is why I no longer write my production code in R).

That being said, I really like Julia, and look forward to checking it out again, hopefully with less breakage in the future.


It's not Python itself but the ecosystem and tooling around it, especially in the machine learning / data science area. I find it strange that many people don't understand this and compare the languages in isolation instead. "Just use Julia" is not really an option in the enterprise world either, when you already have hundreds of kloc, hundreds or thousands of packages, and hundreds of people (not all of them professional developers, e.g. various analysts and quants) working on and using all of that on a daily basis in production. Plus, all that code almost certainly depends on external packages (e.g. machine learning libraries) that are simply not available in Julia.

For personal projects - sure, go use Julia or ditch it and go straight for Rust.


They don't need to be available in Julia; calling Python libraries from Julia isn't painful (PyCall makes it straightforward). I'm not saying you should migrate your whole data analysis pipeline over, I'm saying you should write new services/analyses in Julia.

There are production trading systems that are written in Python. And every day, the quants and devs fruitlessly try to squeeze more performance out of the system instead of just accepting that Python is inappropriate.


Well, this repo is definitely about data analysis pipelines and not about production trading systems; no one is arguing that writing live prod trading systems in Python is appropriate.

And in order to write "new analysis" in another language, you would need access to the ecosystem that's already been developed in Python (both internally and externally), even just to get as far as loading all the data from all the data sources you need.

Also worth mentioning: "Python" often really means "C++ wrapped in pybind or the like"; again, that's pretty hard to match if there's a developed ecosystem of C/C++ libraries and extensions.


> RipTable has been in development for 3 years and tested by dozens of quants at a large financial firm.

I guess it's Susquehanna.


That seems almost certainly true, as at least four of the contributors seem to work there (Amarildo Zeneli, Thomas McClintock, Thomas Dimitri, Jack Papas).

I wonder why they don't just release it as a Susquehanna (rather than "RTOS Holdings") project, as their involvement isn't exactly secret.


Why would this be preferred over using a GPU for calculations?


Your GPU has a few hundred GB of RAM?


Well, multiple GPUs can easily have that, yes. Surely the mining rigs used for crypto could also be used in data science. Plus the new Nvidia GPU has 80 GB.


It's mostly not about "we need to mine these 100 GB of data", but rather about the pandas-like comfort of messing around with a humongous amount of data stored in RAM: applying common aggregations, filters, searches, etc., and possibly calling some C functions if needed. This just isn't the use case for GPUs.
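For example, the sort of in-RAM workflow being described might look like this (a sketch based on riptable's Dataset/groupby API as shown in its README; column names and data are made up, and exact method names may differ):

    import numpy as np
    import riptable as rt

    n = 10_000_000
    ds = rt.Dataset({
        'symbol': np.random.choice(['AAPL', 'MSFT', 'GOOG'], n),
        'price': np.random.uniform(10.0, 500.0, n),
        'qty': np.random.randint(1, 1000, n),
    })

    # Boolean-mask filter, then a grouped aggregation -- each step runs
    # multi-threaded over arrays held entirely in RAM.
    big = ds[ds.qty > 500, :]
    per_symbol = big.gb('symbol').mean()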


Looks interesting; a benchmark page would help in promoting it.



