
> But for your typical spark application, you have one main writer (the spark driver) appending or merging a large number of records...

You can't call the multi-writer architecture proven scalable just because a single writer doesn't cause it to fall over.

I have caused issues by running 500 concurrent writers on embarrassingly parallel workloads. I have watched people choose sharding schemes to accommodate Iceberg's metadata throughput, NOT the natural/logical sharding of the underlying data.
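To make the failure mode concrete, here's a rough sketch with pyiceberg (catalog and table names are made up; this is illustrative, not a benchmark). Every append is an optimistic commit against a single table-metadata pointer, so logically independent writers end up serialized through conflict-and-retry:

    # Sketch of the contention, not a benchmark. Assumes a pyiceberg
    # catalog named "default" and a table "db.events" already exist
    # (both names hypothetical).
    from concurrent.futures import ProcessPoolExecutor

    import pyarrow as pa
    from pyiceberg.catalog import load_catalog

    def write_shard(shard_id: int) -> None:
        catalog = load_catalog("default")       # hypothetical catalog
        tbl = catalog.load_table("db.events")   # hypothetical table
        batch = pa.table({"shard": [shard_id], "value": [42]})
        # Each append is an optimistic commit: write the data files,
        # then compare-and-swap the table's metadata pointer. With N
        # concurrent writers, N-1 lose each race, re-read metadata,
        # and retry -- the writers are independent but the commits
        # are not.
        tbl.append(batch)

    if __name__ == "__main__":
        # 500 truly independent writers, as in the workload above.
        with ProcessPoolExecutor(max_workers=500) as pool:
            list(pool.map(write_shard, range(500)))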

Last I half-knew (so check me), Spark may have done some funky stuff to work around the Iceberg shortcomings. That's useless if you're not using Spark. If scalability of the architecture requires a funky client in one language and a cooperative backend, we might as well be sticking HDF5 on Lustre. HDF5 on Lustre never fell over for me in the 1000+ embarrassingly parallel concurrent-writer use case (massive HPC turbulence restart files with 32K concurrent writers, per https://ieeexplore.ieee.org/abstract/document/6799149).
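For contrast, the parallel-HDF5 pattern I mean looks roughly like this (file and dataset names invented; assumes h5py built with MPI support, launched as e.g. `mpiexec -n 1024 python write_restart.py`). Each rank writes its own disjoint slab of one shared file, so there's no per-commit metadata race to lose:

    # Minimal parallel-HDF5 sketch. The restart-chunk size and names
    # are hypothetical; the point is slab-per-rank writes into one
    # shared file, with Lustre striping handling the actual I/O.
    import h5py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, nranks = comm.Get_rank(), comm.Get_size()
    N = 1_000_000  # elements per rank (hypothetical)

    with h5py.File("restart.h5", "w", driver="mpio", comm=comm) as f:
        # Dataset creation is collective; every rank calls it.
        dset = f.create_dataset("u", shape=(nranks * N,), dtype="f8")
        # Each rank writes a disjoint slab -- no coordination beyond
        # the filesystem, no optimistic commit, no retries.
        dset[rank * N:(rank + 1) * N] = \
            np.random.default_rng(rank).random(N)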


