What's the "conveyor belt" between vertex and fragment shader? Doesn't communication between these two shaders happen via GPU memory in discrete GPUs as well as in integrated/mobile ones?
Or is that something that only happened after the switch to the unified shader model?
It depends on the GPU; in some cases there is a direct pipeline from vertex shading to fragment shading. But typically (even on mobile GPUs) there is some sort of cache to exploit the fact that the same vertex often gets shaded multiple times. In tile-based rendering, the shaded vertex data is typically partitioned into "bins" that are used for things like depth sorting, hidden surface removal, etc.
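To make the caching idea concrete, here's a rough sketch of a post-transform vertex cache: a vertex shared by several triangles is only re-shaded once it has fallen out of a small FIFO. The types, sizes, and names are made up for illustration - real hardware does this very differently:

    // Conceptual sketch only: a tiny FIFO cache of shaded vertices keyed by
    // vertex index, so a vertex shared by several triangles is shaded once
    // while it stays resident. Layout and eviction policy are assumptions.
    #include <cstddef>
    #include <cstdint>
    #include <deque>
    #include <unordered_map>

    struct ShadedVertex {
        float position[4];   // clip-space position produced by the vertex shader
        float varyings[8];   // interpolants handed on to the fragment shader
    };

    // Dummy stand-in for actually running the vertex shader on one vertex.
    ShadedVertex ShadeVertex(std::uint32_t vertexIndex) {
        ShadedVertex v{};
        v.position[0] = static_cast<float>(vertexIndex);
        v.position[3] = 1.0f;
        return v;
    }

    class PostTransformCache {
    public:
        explicit PostTransformCache(std::size_t capacity) : capacity_(capacity) {}

        // Returns shaded data for a vertex, re-running the shader only on a miss.
        ShadedVertex Lookup(std::uint32_t vertexIndex) {
            auto it = cache_.find(vertexIndex);
            if (it != cache_.end()) {
                return it->second;                 // hit: vertex reused by another triangle
            }
            if (fifoOrder_.size() == capacity_) {  // evict the oldest entry (simple FIFO)
                cache_.erase(fifoOrder_.front());
                fifoOrder_.pop_front();
            }
            ShadedVertex shaded = ShadeVertex(vertexIndex);
            cache_.emplace(vertexIndex, shaded);
            fifoOrder_.push_back(vertexIndex);
            return shaded;
        }

    private:
        std::size_t capacity_;
        std::unordered_map<std::uint32_t, ShadedVertex> cache_;
        std::deque<std::uint32_t> fifoOrder_;
    };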
There's often a memory hierarchy as well - for example the "tile" memory might be special dedicated high-speed memory, as it was on the Xbox 360. So you could end up with one type of memory being used for the input vertex/index data and a different kind used for the "bins" containing shaded vertex data.
(Despite being a high-power desktop-grade part, the Xbox 360's GPU had a "predicated tiling" mode that would partition the screen into big tiles, much like what's described in ARM's PDF. This was done to support multisampling at high resolutions, since the GPU's small pool of dedicated eDRAM couldn't hold a full multisampled framebuffer at once.)
Also, unless things have changed, modern NVIDIA GPUs are sort of tile-based too, albeit not in the same way as an ARM mobile GPU. See https://www.youtube.com/watch?v=Nc6R1hwXhL8 for a demo of this. This type of rasterization implies buffering the shaded vertex data in order to be able to rasterize multiple primitives within a single "tile" at once.
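And to illustrate what "binning" shaded vertex data into tiles might look like, here's a conceptual sketch that records which triangles touch which screen tile using their screen-space bounding boxes. The 32-pixel tile size and the data layout are assumptions for illustration, not how any particular GPU actually implements it:

    // Conceptual sketch only: partition post-transform triangles into per-tile
    // "bins" so a later pass can rasterize everything touching one tile at once.
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    constexpr int kTileSize = 32;  // assumed tile edge length in pixels

    struct ScreenTriangle {
        float x[3], y[3];  // viewport-mapped vertex positions of one triangle
    };

    // bins[tileY * tilesX + tileX] holds indices of triangles touching that tile.
    std::vector<std::vector<std::uint32_t>> BinTriangles(
            const std::vector<ScreenTriangle>& tris, int width, int height) {
        const int tilesX = (width + kTileSize - 1) / kTileSize;
        const int tilesY = (height + kTileSize - 1) / kTileSize;
        std::vector<std::vector<std::uint32_t>> bins(tilesX * tilesY);

        for (std::uint32_t i = 0; i < tris.size(); ++i) {
            const ScreenTriangle& t = tris[i];
            // Conservative screen-space bounding box, clamped to the viewport.
            const float minX = std::min({t.x[0], t.x[1], t.x[2]});
            const float maxX = std::max({t.x[0], t.x[1], t.x[2]});
            const float minY = std::min({t.y[0], t.y[1], t.y[2]});
            const float maxY = std::max({t.y[0], t.y[1], t.y[2]});
            const int tx0 = std::clamp(static_cast<int>(minX) / kTileSize, 0, tilesX - 1);
            const int tx1 = std::clamp(static_cast<int>(maxX) / kTileSize, 0, tilesX - 1);
            const int ty0 = std::clamp(static_cast<int>(minY) / kTileSize, 0, tilesY - 1);
            const int ty1 = std::clamp(static_cast<int>(maxY) / kTileSize, 0, tilesY - 1);

            // Record the triangle in every bin its bounding box overlaps.
            for (int ty = ty0; ty <= ty1; ++ty)
                for (int tx = tx0; tx <= tx1; ++tx)
                    bins[ty * tilesX + tx].push_back(i);
        }
        return bins;
    }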