This is very silly. You're not doing the challenge if you do the work up front. The idea is that you start with a file and the goal is to get the result as fast as possible.
How long did it take to distribute and import the data to all workers? What is the total time from file to result?
I can do this a million times faster on one machine; it just depends on what work I do up front.
Nobody cares if I can do it a million times faster, because everyone can. It's cheating.
The whole reason you have to account for the time you spend setting it up is so that all work spent processing the data is timed. Otherwise we can just precompute the answer and print it on demand, which is very fast and easy.
Just getting it into memory is a large bottleneck in the actual challenge.
If I first put it into a DB with statistics that track the needed min/max/mean, then it's basically instant to retrieve, but it's also slower to set up, because that work still has to be done somewhere. That's why the challenge is time from file to result.
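To make "time from file to result" concrete, here is a rough Python sketch of what I mean. The timer starts before the file is opened, so reading and parsing count toward the total. The `name;value` line format and the function names `aggregate` and `timed_file_to_result` are my own assumptions for illustration, not anything defined by the challenge.

```python
import time


def aggregate(path):
    """Single pass over the file, tracking min, max, sum and count per name.

    Assumes one record per line in the form "name;value" (hypothetical format).
    """
    stats = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            name, _, value = line.rstrip("\n").partition(";")
            if not value:
                continue  # skip blank or malformed lines
            v = float(value)
            s = stats.get(name)
            if s is None:
                stats[name] = [v, v, v, 1]  # min, max, sum, count
            else:
                if v < s[0]:
                    s[0] = v
                if v > s[1]:
                    s[1] = v
                s[2] += v
                s[3] += 1
    # Turn the running sum and count into a mean at the very end.
    return {k: (mn, mx, total / n) for k, (mn, mx, total, n) in stats.items()}


def timed_file_to_result(path):
    """Time everything from opening the raw file to having the final answer."""
    start = time.perf_counter()
    result = aggregate(path)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

If you load the data into a DB or ship it to workers first, that step belongs inside the timed region too. Leaving it out is exactly the up-front work I'm complaining about.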