RipTable – multi-threaded Python data analytics tools for numpy arrays/datasets (github.com/rtosholdings)
79 points by aldanor on Nov 28, 2020 | 14 comments


Does anyone know how this compares to something like Dask?


Riptable can only parallelize the low-level primitives it sees, so it must split tasks and collect results at the level of single function calls. Dask builds up graphs of operations at a higher level. There are advantages and disadvantages to both approaches, depending on the actual task.
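A rough sketch of the difference (FastArray and dask.array are the real array types from each library; the toy workload is made up, and riptable's internal threading details may differ):

    import numpy as np
    import riptable as rt
    import dask.array as da

    data = np.arange(10_000_000, dtype=np.float64)

    # Riptable: each call executes eagerly, and each primitive
    # (multiply, then sum) is internally split across threads.
    fa = rt.FastArray(data)
    result_rt = (fa * 2.0).sum()

    # Dask: calls only build a lazy task graph; nothing runs until
    # .compute(), so the scheduler sees the whole graph at once.
    dx = da.from_array(data, chunks=1_000_000)
    result_da = (dx * 2.0).sum().compute()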


It's pretty crazy what people are willing to do to make Python faster: Pyston, Numba, Cython, RipTable. Instead of just biting the bullet and abandoning Python, they spend thousands of man-hours trying to hack a fundamentally slow and broken language into something acceptable.
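To make concrete the kind of effort being described, here is what Numba does (real Numba API; the toy function is just for illustration):

    import numpy as np
    from numba import njit

    @njit
    def sumsq(a):
        # JIT-compiled to machine code on first call; the plain Python
        # loop would be orders of magnitude slower in CPython.
        total = 0.0
        for x in a:
            total += x * x
        return total

    print(sumsq(np.arange(1_000_000, dtype=np.float64)))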

I see these hacks as a cognitive bias and rarely think they are the rational and correct decision.

Just use Julia and call it a day. It really is that easy.


So, perhaps it makes sense in your industry to go all-in on Julia.

Personally, I'm still burned by the time they announced 1.0 mere days after announcing 0.7, which meant that literally nothing in the ecosystem worked on 1.0.

If that transition had been managed better, I would be much more comfortable evangelising Julia now; but it wasn't, and the decisions behind it make me incredibly uncomfortable adopting it anywhere I'm on the hook for failures.

Additionally, Python has libraries for everything and is known (at least badly) by almost everyone in data science/programming. When your modelling is part of a larger project, this matters a lot (and is why I no longer write my production code in R).

That being said, I really like Julia, and look forward to checking it out again, hopefully with less breakage in the future.


It's not Python itself but the ecosystem and tooling around it, especially in the machine learning / data science area. I find it strange that many people don't understand this and compare the languages in isolation instead. "Just use Julia" is not really an option in the enterprise world either, when you already have hundreds of kloc, hundreds or thousands of packages, and hundreds of people (not all of them professional developers, e.g. various analysts and quants) working on and using all of that on a daily basis in production. Plus, all that code almost certainly depends on external packages (e.g. machine learning libraries) that are simply not available in Julia.

For personal projects - sure, go use Julia or ditch it and go straight for Rust.


They don't need to be available in Julia; calling Python libraries from Julia isn't painful (PyCall makes it straightforward). I'm not saying you should migrate your whole data analysis pipeline over, I'm saying you should write new services/analyses in Julia.

There are production trading systems that are written in Python. And every day, the quants and devs fruitlessly try to squeeze more performance out of the system instead of just accepting that Python is inappropriate.


Well, this repo is definitely about data analysis pipelines and not about production trading systems; no one is arguing that writing live prod trading systems in Python is appropriate.

And in order to write "new analysis" in another language, you would need access to the ecosystem that's already been developed in Python (both internally and externally), even just to get as far as loading all the data from all the data sources you need.

Also worth mentioning: "Python" often really means "C++ wrapped in pybind or the like"; again, that's pretty hard to match if there's a developed ecosystem of C/C++ libraries and extensions.


> RipTable has been in development for 3 years and tested by dozens of quants at a large financial firm.

I guess it's Susquehanna.


That seems almost certainly true, as at least four of the contributors seem to work there (Amarildo Zeneli, Thomas McClintock, Thomas Dimitri, Jack Papas).

I wonder why they don't just release it as a Susquehanna (rather than "RTOS Holdings") project, as their involvement isn't exactly secret.


Why would this be preferred over using a GPU for calculations?


Your GPU has a few hundred GB of RAM?


Well, multiple GPUs can easily have that, yes. Surely the mining rigs used for crypto could also be used in data science. Plus the new Nvidia GPU has 80 GB.


It's mostly not about "we need to mine these 100 GB of data", but rather about the pandas-like comfort of messing around with a humongous amount of data stored in RAM: applying common aggregations, filters, searches, etc., and possibly calling some C functions if needed. This just isn't the use case for GPUs.
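For example, the sort of in-RAM workflow being described might look like this (a sketch based on riptable's Dataset/groupby API as shown in its README; column names and data are made up, and exact method names may differ):

    import numpy as np
    import riptable as rt

    n = 10_000_000
    ds = rt.Dataset({
        'symbol': np.random.choice(['AAPL', 'MSFT', 'GOOG'], n),
        'price': np.random.uniform(10.0, 500.0, n),
        'qty': np.random.randint(1, 1000, n),
    })

    # Boolean-mask filter, then a grouped aggregation -- each step runs
    # multi-threaded over arrays held entirely in RAM.
    big = ds[ds.qty > 500, :]
    per_symbol = big.gb('symbol').mean()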


Looks interesting; a benchmark page would help in promoting it.



