> But for your typical spark application, you have one main writer (the spark driver) appending or merging a large number of records...
The multi-writer architecture can't be proven scalable just because a single writer doesn't cause it to fall over.
I have caused issues by using 500 concurrent writers on embarrassingly parallel workloads. I have watched people choose sharding schemes to accommodate Iceberg's metadata throughput, NOT the natural/logical sharding of the underlying data.
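To make the contention concrete, here's a rough sketch (not Iceberg code, just my mental model of its optimistic-concurrency commit path): each writer reads the current table metadata, writes a new snapshot, then tries an atomic swap of the metadata pointer; a lost race means a full retry. With N writers racing, total commit attempts grow roughly quadratically:

```python
# Hedged sketch: why optimistic metadata commits degrade with many
# concurrent writers. One shared version pointer, one winner per round,
# everyone else retries from scratch. Not actual Iceberg internals.
import random

def simulate_commits(n_writers: int, seed: int = 0) -> int:
    """Return total commit attempts for n_writers racing to append."""
    rng = random.Random(seed)
    pending = list(range(n_writers))  # writers still trying to commit
    attempts = 0
    while pending:
        # Every pending writer reads the same version and attempts the
        # swap; exactly one compare-and-swap wins, the rest must retry.
        attempts += len(pending)
        rng.shuffle(pending)
        pending.pop()  # the winner commits and leaves the race
    return attempts

# Serialized commits would cost N attempts; optimistic retries cost
# roughly N*(N+1)/2, which is why 500 writers hurts far more than 10x
# what 50 writers does.
print(simulate_commits(10))   # 55
print(simulate_commits(500))  # 125250
```

That quadratic blow-up is the optimistic worst case; real deployments mitigate it with commit coordination or retry backoff, which is exactly the kind of client-side cooperation I'm complaining about below.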
Last I half-knew (so check me), Spark may have done some funky stuff to work around the Iceberg shortcomings. That is useless if you're not using Spark. If scalability of the architecture requires a funky client in one language and a cooperative backend, we might as well be sticking HDF5 on Lustre. HDF5 on Lustre never fell over for me in the 1000+ embarrassingly parallel concurrent writer use case (massive HPC turbulence restart files with 32K concurrent writers, per https://ieeexplore.ieee.org/abstract/document/6799149).